# Tutorial 2: Analysis of Non-Text Data Handling with PySpark
This project demonstrates how to handle non-text (numeric and nested) data types in PySpark for geospatial and sensor data analysis. It shows how choosing appropriate data types (`DoubleType`, `FloatType`, `StructType`, `MapType`) avoids repeated string parsing and casting, and enables complex analytical workflows.
## Overview
This notebook demonstrates binary-optimized type handling for:
- **Geospatial Data**: Using `DoubleType` for lat/lon coordinates
- **Sensor Metrics**: Using `StructType` and `MapType` for numeric sensor readings
- **Predictive Maintenance**: Distance calculations and anomaly detection
## Key Features
1. **Binary-Optimized Types**
   - **Geospatial Data**: `DoubleType` for latitude/longitude
     - No casting required for distance calculations
     - Direct mathematical operations (Haversine formula)
     - Improved query performance compared with string-based coordinates
   - **Sensor Metrics**: `FloatType` in structured formats
     - Compact 32-bit storage (about 7 significant digits) for temperature, voltage, pressure
     - Efficient aggregations and statistical analysis
     - Optimized for predictive maintenance models
2. **Data Structure Patterns**
   - **StructType**: Schema-enforced nested structures for critical metrics
   - **MapType**: Flexible key-value storage for dynamic attributes
   - Comparison and use cases for both approaches
3. **Analysis Capabilities**
   - Geospatial distance calculations (Haversine formula)
   - Anomaly detection for predictive maintenance
   - Regional clustering and analysis
   - Distance-based priority routing
   - Performance benchmarking
## 1. Setup and Imports
This cell imports essential PySpark modules, data types, and functions required for schema definition, data generation, and analysis. It also imports Python standard libraries for date/time manipulation and random number generation.


In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, 
    IntegerType, TimestampType, MapType, FloatType, ArrayType
)
from pyspark.sql.functions import (
    col, udf, lit, sqrt, pow, sin, cos, radians, acos, asin,
    when, avg, stddev, abs as abs_func, explode, map_keys, map_values,
    count, sum as sum_func, max as max_func, min as min_func
)
from datetime import datetime, timedelta
import random

## 2. Define Optimized Schema with Binary Types
This code block defines a PySpark schema (`sensor_schema`) optimized for non-text sensor data. It uses binary-efficient types such as `DoubleType` for geospatial coordinates, `StructType` for structured sensor metrics, and `MapType` for flexible additional metrics. The schema supports high-precision analysis and efficient storage for predictive maintenance and anomaly detection tasks.

In [0]:
# Schema with binary-optimized types for non-text data
sensor_schema = StructType([
    StructField("sensor_id", StringType(), False),
    StructField("device_type", StringType(), False),
    StructField("timestamp", TimestampType(), False),
    
    # Geospatial: DoubleType for precise coordinates (no casting needed)
    StructField("latitude", DoubleType(), False),
    StructField("longitude", DoubleType(), False),
    
    # Sensor Metrics as StructType (binary-optimized)
    StructField("metrics", StructType([
        StructField("temperature_celsius", FloatType(), True),
        StructField("voltage_volts", FloatType(), True),
        StructField("pressure_psi", FloatType(), True),
        StructField("vibration_hz", FloatType(), True),
        StructField("rpm", IntegerType(), True)
    ]), True),
    
    # Alternative: Sensor Metrics as MapType (flexible schema)
    StructField("additional_metrics", MapType(StringType(), FloatType()), True),
    
    # Status and operational data
    StructField("operational_hours", DoubleType(), True),
    StructField("maintenance_due", StringType(), True)
])
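
The choice between `FloatType` and `DoubleType` in this schema is a trade-off between size and precision. As a rough illustration that uses only the Python standard library (not Spark itself), the sketch below round-trips a longitude from the sample data through a 4-byte IEEE 754 encoding, the width Spark uses for `FloatType`, and compares it with the 8-byte width used for `DoubleType`. The helper name `to_float32` is hypothetical.

```python
import struct

def to_float32(value: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

longitude = 103.95924434436904   # a coordinate taken from the sample data
as_float32 = to_float32(longitude)

# Size of each encoding in bytes
size_float32 = struct.calcsize("<f")
size_float64 = struct.calcsize("<d")

# Error introduced by squeezing the coordinate into 32 bits
rounding_error = abs(longitude - as_float32)

print(f"float32 value:  {as_float32!r}")
print(f"rounding error: {rounding_error:.3e} degrees")
print(f"bytes: float32={size_float32}, float64={size_float64}")
```

For values near 104, a 32-bit float can be off by up to about 4e-6 degrees, which is under half a metre at the equator. That is one reason to keep coordinates in `DoubleType`, while 32 bits is usually enough for readings such as temperature or voltage.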

## 3. Generate Sample Sensor Data
This section generates a sample dataset of industrial sensor readings using a custom Python function. The generator simulates an Industrial Internet of Things (IIoT) dataset that tracks the physical health and operating environment of industrial machinery. The breakdown below describes what each sensor collects and how to interpret the data:

**1. The Core Telemetry (Physical Health)**
The data stored in the main tuple represents the "vital signs" of the machinery:
- Temperature (°C): Monitors for overheating. The logic suggests a normal range of 15–45°C, while anything 60–85°C is flagged as a potential failure.
- Voltage (V): Measures electrical input. Normal operation is 220–240V. Significant drops (180–200V) are flagged as anomalies, suggesting voltage sags or electrical faults.
- Pressure (PSI): Critical for the Compressors and Pumps in the list. Readings of 45–55 are flagged as spikes against a baseline of 25–35.
- Vibration: Perhaps the most important metric for predictive maintenance. High vibration (5.0–8.0) usually indicates mechanical wear, misalignment, or a loose component.
- RPM (Revolutions Per Minute): Measures the speed of the Turbines or Motors. An "anomaly" here is defined by high-speed over-rotation (2500–3000 RPM).

**2. Environmental & Auxiliary Data**
The dictionary nested in the record provides context about the surroundings:
- Humidity (%): High humidity in Singapore can lead to corrosion or electrical shorts in industrial equipment.
- Noise (dB): Monitors the acoustic environment; sudden increases in decibels can indicate mechanical grinding or failure.
- Power (kW): The actual energy consumption of the device. 

The generated data includes:
- **Sensor ID and Device Type**: Unique identifiers and equipment categories.
- **Timestamp**: Simulated time series for each record.
- **Geospatial Coordinates**: Latitude and longitude (DoubleType) for precise location.
- **Sensor Metrics**: Structured readings (temperature, voltage, pressure, vibration, rpm) with realistic values and occasional anomalies.
- **Additional Metrics**: Flexible MapType for extra measurements (humidity, noise, power).
- **Operational Data**: Total hours and maintenance status.

The data is created as a list of tuples, then loaded into a PySpark DataFrame using the previously defined binary-optimized schema. This enables efficient storage and high-precision analysis for downstream predictive maintenance and anomaly detection tasks.

In [0]:
import random
from datetime import datetime, timedelta

def generate_sensor_data(num_records=1000):
    """Generate realistic sensor data with geospatial and metric information"""
    
    # Define sensor locations (major industrial areas in Singapore)
    locations = [
        ("Jurong Island", 1.2658, 103.6961),
        ("Tuas", 1.2673, 103.6421),
        ("Changi Business Park", 1.3332, 103.9599),
        ("Woodlands Industrial Estate", 1.4304, 103.7854),
        ("Seletar Aerospace Park", 1.4098, 103.8772)
    ]
    
    device_types = ["Turbine", "Compressor", "Pump", "Generator", "Motor"]
    
    data = []
    base_time = datetime(2024, 1, 1)
    
    for i in range(num_records):
        location = random.choice(locations)
        device = random.choice(device_types)
        
        # Add slight variation to coordinates
        # (+/-0.01 degrees is roughly 1.1 km, so points stay close to each site)
        lat_offset = random.uniform(-0.01, 0.01) 
        lon_offset = random.uniform(-0.01, 0.01)
        
        # Generate realistic sensor metrics with occasional anomalies
        is_anomaly = random.random() < 0.05  # 5% anomaly rate
        
        temp = random.uniform(15, 45) if not is_anomaly else random.uniform(60, 85)
        voltage = random.uniform(220, 240) if not is_anomaly else random.uniform(180, 200)
        pressure = random.uniform(25, 35) if not is_anomaly else random.uniform(45, 55)
        vibration = random.uniform(0.5, 2.0) if not is_anomaly else random.uniform(5.0, 8.0)
        rpm = random.randint(1200, 1800) if not is_anomaly else random.randint(2500, 3000)
        
        record = (
            f"SENSOR_{i+1:04d}",
            device,
            base_time + timedelta(hours=i),
            location[1] + lat_offset,  # latitude as Double
            location[2] + lon_offset,  # longitude as Double
            (
                float(temp),
                float(voltage),
                float(pressure),
                float(vibration),
                rpm
            ),
            {
                "humidity": float(random.uniform(30, 70)),
                "noise_db": float(random.uniform(50, 80)),
                "power_kw": float(random.uniform(10, 50))
            },
            float(random.uniform(100, 10000)),
            "yes" if is_anomaly else "no"
        )
        data.append(record)
    
    return data

# Generate data
sensor_data = generate_sensor_data(1000)

# Create DataFrame with optimized schema
# (Assumes spark and sensor_schema are already defined in your environment)
df_sensors = spark.createDataFrame(sensor_data, schema=sensor_schema)

print(f"Generated {df_sensors.count()} sensor records")
df_sensors.printSchema()

Generated 1000 sensor records
root
 |-- sensor_id: string (nullable = false)
 |-- device_type: string (nullable = false)
 |-- timestamp: timestamp (nullable = false)
 |-- latitude: double (nullable = false)
 |-- longitude: double (nullable = false)
 |-- metrics: struct (nullable = true)
 |    |-- temperature_celsius: float (nullable = true)
 |    |-- voltage_volts: float (nullable = true)
 |    |-- pressure_psi: float (nullable = true)
 |    |-- vibration_hz: float (nullable = true)
 |    |-- rpm: integer (nullable = true)
 |-- additional_metrics: map (nullable = true)
 |    |-- key: string
 |    |-- value: float (valueContainsNull = true)
 |-- operational_hours: double (nullable = true)
 |-- maintenance_due: string (nullable = true)
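
`generate_sensor_data` draws from Python's module-level random generator without a seed, so each run produces different records and a different number of anomalies. If reproducible output is wanted, one option is a dedicated, seeded `random.Random` instance. The sketch below shows only the anomaly-flag draw at the 5% rate used above; `draw_anomaly_flags` and the seed value are hypothetical and not part of the original notebook.

```python
import random

def draw_anomaly_flags(num_records, anomaly_rate=0.05, seed=42):
    """Return a list of booleans, True where a record is marked anomalous."""
    # A private RNG keeps the draw reproducible without touching
    # the global random state used elsewhere.
    rng = random.Random(seed)
    return [rng.random() < anomaly_rate for _ in range(num_records)]

flags_a = draw_anomaly_flags(1000)
flags_b = draw_anomaly_flags(1000)

print(f"anomalies: {sum(flags_a)} of {len(flags_a)}")
print(f"reproducible: {flags_a == flags_b}")
```

The same idea would apply to the full generator by replacing each `random.` call with a call on the seeded instance.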



## 4. Display Sample Data


In [0]:

display(df_sensors.limit(10))


sensor_id,device_type,timestamp,latitude,longitude,metrics,additional_metrics,operational_hours,maintenance_due
SENSOR_0001,Generator,2024-01-01T00:00:00.000Z,1.3404496371038976,103.95924434436904,"List(26.902512, 228.20522, 28.198591, 1.6310289, 1429)","Map(humidity -> 62.065247, noise_db -> 51.67869, power_kw -> 21.097378)",4514.409016850724,no
SENSOR_0002,Compressor,2024-01-01T01:00:00.000Z,1.261365025201776,103.6467586381995,"List(40.64004, 238.25906, 31.617815, 1.9927775, 1221)","Map(humidity -> 35.392204, noise_db -> 65.68636, power_kw -> 49.350536)",2608.21286474121,no
SENSOR_0003,Compressor,2024-01-01T02:00:00.000Z,1.272390773847493,103.70420270601473,"List(23.335516, 221.9154, 32.66729, 1.431783, 1764)","Map(humidity -> 65.263756, noise_db -> 62.728394, power_kw -> 36.258533)",2422.1784686027213,no
SENSOR_0004,Pump,2024-01-01T03:00:00.000Z,1.2727806458938589,103.69299290609112,"List(22.823877, 231.58864, 26.181293, 1.3588053, 1763)","Map(humidity -> 59.70751, noise_db -> 53.850395, power_kw -> 10.829429)",1225.1011948100102,no
SENSOR_0005,Turbine,2024-01-01T04:00:00.000Z,1.2636843180461677,103.6918622801885,"List(19.727316, 234.83705, 34.234333, 0.7372836, 1216)","Map(humidity -> 34.41885, noise_db -> 79.61616, power_kw -> 13.656902)",9515.352639831422,no
SENSOR_0006,Motor,2024-01-01T05:00:00.000Z,1.2598869840275084,103.68753395336932,"List(25.42748, 239.37228, 28.451494, 1.2980691, 1738)","Map(humidity -> 68.41259, noise_db -> 58.841522, power_kw -> 41.209885)",6737.724286855522,no
SENSOR_0007,Pump,2024-01-01T06:00:00.000Z,1.3381053115953248,103.95094068906688,"List(34.899498, 226.14616, 28.212383, 0.7819135, 1502)","Map(humidity -> 56.625618, noise_db -> 58.301243, power_kw -> 12.0885725)",782.5211668267249,no
SENSOR_0008,Compressor,2024-01-01T07:00:00.000Z,1.2661375184413035,103.69244513041782,"List(20.783693, 234.80145, 32.969788, 1.5309527, 1633)","Map(humidity -> 40.715492, noise_db -> 55.61878, power_kw -> 34.152706)",4009.4506998600577,no
SENSOR_0009,Compressor,2024-01-01T08:00:00.000Z,1.3301726753437553,103.96719104502333,"List(37.419617, 221.03995, 34.471157, 1.9047004, 1529)","Map(humidity -> 40.230625, noise_db -> 79.28151, power_kw -> 43.210903)",7292.944798773835,no
SENSOR_0010,Compressor,2024-01-01T09:00:00.000Z,1.27302080758999,103.68957317053494,"List(16.029634, 234.85802, 28.25782, 0.74129665, 1451)","Map(humidity -> 30.587555, noise_db -> 78.0067, power_kw -> 33.19153)",9138.78763966981,no


## 5. Geospatial Analysis - Haversine Distance Calculation
Because the coordinates are stored as `DoubleType`, they can be used directly in distance calculations without type casting.

In [0]:
def haversine_distance(lat1, lon1, lat2, lon2):
    """
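    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees) using native
    Spark column functions. Arguments are expected to be Column
    expressions (wrap constants in lit()).
    Returns distance in kilometers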
    """
    # Convert decimal degrees to radians
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    
    # Haversine formula
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    
    # Radius of earth in kilometers
    r = 6371
    return c * r

# Register UDF for distance calculation
@udf(returnType=DoubleType())
def calculate_distance(lat1, lon1, lat2, lon2):
    """UDF wrapper for Haversine distance"""
    import math
    
    # Convert to radians
    lat1_rad = math.radians(lat1)
    lon1_rad = math.radians(lon1)
    lat2_rad = math.radians(lat2)
    lon2_rad = math.radians(lon2)
    
    # Haversine formula
    dlat = lat2_rad - lat1_rad
    dlon = lon2_rad - lon1_rad
    a = math.sin(dlat/2)**2 + math.cos(lat1_rad) * math.cos(lat2_rad) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a))
    
    return c * 6371  # km

# Reference point (Houston headquarters)
ref_lat = 29.7604
ref_lon = -95.3698

# Calculate distance from reference point for all sensors
df_with_distance = df_sensors.withColumn(
    "distance_from_hq_km",
    calculate_distance(col("latitude"), col("longitude"), lit(ref_lat), lit(ref_lon))
)

print("Distance calculations completed (no type casting required due to DoubleType)")
display(df_with_distance.select("sensor_id", "latitude", "longitude", "distance_from_hq_km").limit(10))


Distance calculations completed (no type casting required due to DoubleType)


sensor_id,latitude,longitude,distance_from_hq_km
SENSOR_0001,1.3404496371038976,103.95924434436904,15993.93212682039
SENSOR_0002,1.261365025201776,103.6467586381995,16018.453258510504
SENSOR_0003,1.272390773847493,103.70420270601473,16014.298828908337
SENSOR_0004,1.2727806458938589,103.69299290609112,16014.86246871218
SENSOR_0005,1.2636843180461677,103.6918622801885,16015.808968742658
SENSOR_0006,1.2598869840275084,103.68753395336932,16016.411027573227
SENSOR_0007,1.3381053115953248,103.95094068906688,15994.609243089191
SENSOR_0008,1.2661375184413035,103.69244513041782,16015.538794621934
SENSOR_0009,1.3301726753437553,103.96719104502333,15994.499887620268
SENSOR_0010,1.27302080758999,103.68957317053494,16015.022556130843
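
The distances above can be sanity-checked outside Spark. The following is a minimal standard-library sketch of the same Haversine formula, using the same 6371 km mean Earth radius as the UDF; the function name `haversine_km` and the check values are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371  # same mean radius used in the notebook

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Identical points should be zero kilometres apart
same_point = haversine_km(1.3332, 103.9599, 1.3332, 103.9599)

# One degree of latitude along a meridian equals radius * pi / 180
one_degree_lat = haversine_km(0.0, 0.0, 1.0, 0.0)

# Changi Business Park to the Houston reference point
changi_to_houston = haversine_km(1.3332, 103.9599, 29.7604, -95.3698)

print(f"same point:        {same_point:.3f} km")
print(f"1 degree latitude: {one_degree_lat:.3f} km")
print(f"Changi to Houston: {changi_to_houston:.1f} km")
```

The Changi result should land close to the roughly 15,994 km reported for the Changi-area sensors in the table above.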


## 6. Sensor Metrics Analysis - StructType Approach



In [0]:
# Extract and analyze structured metrics
df_metrics_struct = df_with_distance.select(
    "sensor_id",
    "device_type",
    "timestamp",
    "latitude",              # Keep geospatial columns
    "longitude",             # Keep geospatial columns
    "distance_from_hq_km",
    col("metrics.temperature_celsius").alias("temperature"),
    col("metrics.voltage_volts").alias("voltage"),
    col("metrics.pressure_psi").alias("pressure"),
    col("metrics.vibration_hz").alias("vibration"),
    col("metrics.rpm").alias("rpm"),
    "operational_hours"
)

# Calculate statistics on the FloatType fields (avg and stddev return doubles)
metrics_stats = df_metrics_struct.select(
    avg("temperature").alias("avg_temperature"),
    stddev("temperature").alias("std_temperature"),
    avg("voltage").alias("avg_voltage"),
    stddev("voltage").alias("std_voltage"),
    avg("pressure").alias("avg_pressure"),
    stddev("pressure").alias("std_pressure"),
    avg("vibration").alias("avg_vibration"),
    stddev("vibration").alias("std_vibration")
)

print("Sensor Metrics Statistics (High-Precision Float Analysis):")
display(metrics_stats)

Sensor Metrics Statistics (High-Precision Float Analysis):


avg_temperature,std_temperature,avg_voltage,std_voltage,avg_pressure,std_pressure,avg_vibration,std_vibration
31.229822244644165,12.34591948413334,228.5242730102539,10.189733550758945,30.81783812713623,4.856247536817991,1.4448029735088348,1.1317675204847717


## 7. Sensor Metrics Analysis - MapType Approach

In [0]:
# Explode MapType metrics for flexible analysis
df_map_metrics = df_with_distance.select(
    "sensor_id",
    "device_type",
    explode("additional_metrics").alias("metric_name", "metric_value")
)

# Aggregate by metric type
map_metrics_summary = df_map_metrics.groupBy("metric_name").agg(
    avg("metric_value").alias("avg_value"),
    stddev("metric_value").alias("std_value")
)

print("Additional Metrics Analysis (MapType flexibility):")
display(map_metrics_summary)

Additional Metrics Analysis (MapType flexibility):


metric_name,avg_value,std_value
power_kw,30.177333779335022,11.450220489043888
noise_db,64.68204336547852,8.776723828989672
humidity,50.0050239868164,11.46160093327735
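
To make the `explode` step concrete, here is a plain-Python sketch of the same reshaping: each map entry becomes its own `(sensor_id, metric_name, metric_value)` row, and the rows are then averaged per metric name. It uses the first two records from the sample output and only the standard library, so it illustrates the data flow rather than Spark's implementation.

```python
from collections import defaultdict
from statistics import mean

# (sensor_id, additional_metrics) pairs copied from the sample output
records = [
    ("SENSOR_0001", {"humidity": 62.065247, "noise_db": 51.67869, "power_kw": 21.097378}),
    ("SENSOR_0002", {"humidity": 35.392204, "noise_db": 65.68636, "power_kw": 49.350536}),
]

# "Explode": one row per map entry
rows = [
    (sensor_id, name, value)
    for sensor_id, metrics in records
    for name, value in metrics.items()
]

# Group by metric name, then average each group
grouped = defaultdict(list)
for _, name, value in rows:
    grouped[name].append(value)
averages = {name: mean(values) for name, values in grouped.items()}

for name, value in sorted(averages.items()):
    print(f"{name}: {value:.3f}")
```

Because the keys live in the data rather than in the schema, a new metric would appear here as an extra group without any schema change, which is the flexibility `MapType` offers.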


## 8. Predictive Maintenance - Anomaly Detection

In [0]:
# Define normal operating ranges (using high-precision thresholds)
normal_ranges = {
    "temperature": (15.0, 45.0),
    "voltage": (220.0, 240.0),
    "pressure": (25.0, 35.0),
    "vibration": (0.5, 2.0),
    "rpm": (1200, 1800)
}

# Detect anomalies based on multiple sensor readings
df_anomaly_detection = df_metrics_struct.withColumn(
    "temp_anomaly",
    when(
        (col("temperature") < normal_ranges["temperature"][0]) | 
        (col("temperature") > normal_ranges["temperature"][1]),
        1
    ).otherwise(0)
).withColumn(
    "voltage_anomaly",
    when(
        (col("voltage") < normal_ranges["voltage"][0]) | 
        (col("voltage") > normal_ranges["voltage"][1]),
        1
    ).otherwise(0)
).withColumn(
    "pressure_anomaly",
    when(
        (col("pressure") < normal_ranges["pressure"][0]) | 
        (col("pressure") > normal_ranges["pressure"][1]),
        1
    ).otherwise(0)
).withColumn(
    "vibration_anomaly",
    when(
        (col("vibration") < normal_ranges["vibration"][0]) | 
        (col("vibration") > normal_ranges["vibration"][1]),
        1
    ).otherwise(0)
).withColumn(
    "rpm_anomaly",
    when(
        (col("rpm") < normal_ranges["rpm"][0]) | 
        (col("rpm") > normal_ranges["rpm"][1]),
        1
    ).otherwise(0)
)

# Calculate total anomaly score
df_anomaly_detection = df_anomaly_detection.withColumn(
    "anomaly_score",
    col("temp_anomaly") + col("voltage_anomaly") + 
    col("pressure_anomaly") + col("vibration_anomaly") + col("rpm_anomaly")
)

# Classify maintenance priority
df_maintenance = df_anomaly_detection.withColumn(
    "maintenance_priority",
    when(col("anomaly_score") >= 3, "HIGH")
    .when(col("anomaly_score") >= 2, "MEDIUM")
    .when(col("anomaly_score") >= 1, "LOW")
    .otherwise("NORMAL")
)

print("Anomaly Detection Results:")
display(df_maintenance.filter(col("anomaly_score") > 0).orderBy(col("anomaly_score").desc()).limit(20))


Anomaly Detection Results:


sensor_id,device_type,timestamp,latitude,longitude,distance_from_hq_km,temperature,voltage,pressure,vibration,rpm,operational_hours,temp_anomaly,voltage_anomaly,pressure_anomaly,vibration_anomaly,rpm_anomaly,anomaly_score,maintenance_priority
SENSOR_0574,Motor,2024-01-24T21:00:00.000Z,1.438512851457136,103.78037485624776,15994.034146114893,76.41545,183.83025,51.523422,6.7871113,2975,6842.534291486947,1,1,1,1,1,5,HIGH
SENSOR_0263,Generator,2024-01-11T22:00:00.000Z,1.2738782375707756,103.63282971944987,16017.979829394351,84.60635,184.34383,53.68689,7.160364,2968,6033.669692911504,1,1,1,1,1,5,HIGH
SENSOR_0018,Pump,2024-01-01T17:00:00.000Z,1.426148391448319,103.792833039644,15994.569560315074,77.70602,196.91405,45.31422,5.9170365,2628,5167.510150533783,1,1,1,1,1,5,HIGH
SENSOR_0270,Pump,2024-01-12T05:00:00.000Z,1.4043778562515496,103.8677466518737,15992.659497459275,68.15532,181.84782,50.6027,5.7334485,2835,8689.090747011334,1,1,1,1,1,5,HIGH
SENSOR_0450,Turbine,2024-01-19T17:00:00.000Z,1.269719231370781,103.69730000511028,16014.929467583564,71.28087,194.95865,46.852787,5.412242,2801,8826.960511256344,1,1,1,1,1,5,HIGH
SENSOR_0311,Pump,2024-01-13T22:00:00.000Z,1.2564337988668477,103.70582453709896,16015.765426117188,65.37905,182.32286,48.785927,6.725228,2660,4980.492115568107,1,1,1,1,1,5,HIGH
SENSOR_0039,Pump,2024-01-02T14:00:00.000Z,1.3240587223490414,103.95893199463868,15995.540865244424,66.851295,181.6522,53.58431,7.0132914,2507,8800.673981437969,1,1,1,1,1,5,HIGH
SENSOR_0322,Generator,2024-01-14T09:00:00.000Z,1.4184708084732065,103.87079177961846,15991.124665674964,63.928905,186.20209,53.540592,7.2248306,2936,3246.516156455815,1,1,1,1,1,5,HIGH
SENSOR_0738,Pump,2024-01-31T17:00:00.000Z,1.4265435698280025,103.78585566258344,15994.905628449513,84.01,183.25172,48.12637,5.8647428,2574,8000.351280963247,1,1,1,1,1,5,HIGH
SENSOR_0335,Pump,2024-01-14T22:00:00.000Z,1.2733507609750203,103.64443074426543,16017.410097360773,81.80053,181.72179,49.926647,6.0079556,2843,5119.970878684026,1,1,1,1,1,5,HIGH
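
The scoring rule is easy to verify on a single reading. The sketch below restates it in plain Python with the same ranges and the same HIGH/MEDIUM/LOW/NORMAL cut-offs; the function names are hypothetical, and the third reading is invented purely to show the LOW case.

```python
NORMAL_RANGES = {
    "temperature": (15.0, 45.0),
    "voltage": (220.0, 240.0),
    "pressure": (25.0, 35.0),
    "vibration": (0.5, 2.0),
    "rpm": (1200, 1800),
}

def anomaly_score(reading):
    """Count how many metrics fall outside their normal range."""
    return sum(
        1
        for name, (low, high) in NORMAL_RANGES.items()
        if not low <= reading[name] <= high
    )

def maintenance_priority(score):
    """Map an anomaly score to the priority labels used above."""
    if score >= 3:
        return "HIGH"
    if score >= 2:
        return "MEDIUM"
    if score >= 1:
        return "LOW"
    return "NORMAL"

# SENSOR_0574 and SENSOR_0001 from the outputs above
faulty = {"temperature": 76.41545, "voltage": 183.83025, "pressure": 51.523422,
          "vibration": 6.7871113, "rpm": 2975}
healthy = {"temperature": 26.902512, "voltage": 228.20522, "pressure": 28.198591,
           "vibration": 1.6310289, "rpm": 1429}
# Invented reading: only the temperature is out of range
hot_only = dict(healthy, temperature=50.0)

for label, reading in [("faulty", faulty), ("healthy", healthy), ("hot_only", hot_only)]:
    score = anomaly_score(reading)
    print(label, score, maintenance_priority(score))
```

In the generated dataset every metric of a record is drawn from the anomalous range at once, so scores are either 0 or 5. MEDIUM and LOW can only arise with data where metrics fail independently.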


## 9. Geospatial Clustering and Regional Analysis

In [0]:
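# Perform regional analysis by grouping sensors into coarse regions
# Using precise DoubleType coordinates for clustering

from pyspark.sql.window import Window

# Assign regions from latitude/longitude thresholds.
# Note: these thresholds are centered on the continental US, so every
# Singapore sensor in this dataset falls into "Southeast".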
df_regional = df_maintenance.withColumn(
    "region",
    when((col("latitude") >= 40) & (col("longitude") >= -90), "Northeast")
    .when((col("latitude") >= 40) & (col("longitude") < -90), "Northwest")
    .when((col("latitude") < 40) & (col("longitude") >= -100), "Southeast")
    .otherwise("Southwest")
)

# Regional statistics
regional_analysis = df_regional.groupBy("region", "device_type").agg(
    avg("temperature").alias("avg_temp"),
    avg("voltage").alias("avg_voltage"),
    avg("distance_from_hq_km").alias("avg_distance_from_hq"),
    avg("anomaly_score").alias("avg_anomaly_score"),
    count("*").alias("sensor_count")
).orderBy("region", "device_type")

print("Regional Sensor Analysis:")
display(regional_analysis)

Regional Sensor Analysis:


region,device_type,avg_temp,avg_voltage,avg_distance_from_hq,avg_anomaly_score,sensor_count
Southeast,Compressor,32.42432556503929,228.27104798769625,16004.31263253518,0.2073732718894009,217
Southeast,Generator,30.37650180480641,228.27272753147264,16004.01616340637,0.155440414507772,193
Southeast,Motor,31.14369222852919,228.32480448110093,16003.611405288364,0.217391304347826,207
Southeast,Pump,31.54024659181253,228.6305924635667,16003.40356464102,0.2307692307692307,195
Southeast,Turbine,30.499927510606483,229.1841442838628,16003.05244056075,0.2127659574468085,188
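
Only one region appears in the output because of how the thresholds interact with the Singapore coordinates. The plain-Python sketch below restates the same rule so it can be tried on individual points; `classify_region` is a hypothetical helper, and the US city coordinates are approximate values added for contrast.

```python
def classify_region(latitude, longitude):
    """Apply the same threshold rule as the `region` column above."""
    if latitude >= 40 and longitude >= -90:
        return "Northeast"
    if latitude >= 40 and longitude < -90:
        return "Northwest"
    if latitude < 40 and longitude >= -100:
        return "Southeast"
    return "Southwest"

points = {
    "Jurong Island": (1.2658, 103.6961),            # from the sensor locations
    "Seletar Aerospace Park": (1.4098, 103.8772),   # from the sensor locations
    "Houston (reference point)": (29.7604, -95.3698),
    "New York (approx.)": (40.7, -74.0),
    "Seattle (approx.)": (47.6, -122.3),
    "Los Angeles (approx.)": (34.05, -118.24),
}

regions = {name: classify_region(lat, lon) for name, (lat, lon) in points.items()}
for name, region in regions.items():
    print(f"{name}: {region}")
```

Every Singapore site has a latitude below 40 and a longitude above -100, so all of them are labelled "Southeast". Splitting the Singapore sites apart would need thresholds chosen around their own coordinates.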


## 10. Distance-Based Priority Routing

In [0]:
# Prioritize maintenance based on anomaly score AND distance
# Closer high-priority sensors should be serviced first

df_service_routing = df_maintenance.filter(
    col("maintenance_priority").isin(["HIGH", "MEDIUM"])
).withColumn(
    "service_priority_score",
    (col("anomaly_score") * 100) / (col("distance_from_hq_km") + 1)  # +1 to avoid division by zero
).orderBy(col("service_priority_score").desc())

print("Optimized Service Routing (High-precision distance and anomaly calculations):")
display(df_service_routing.select(
    "sensor_id", "device_type", "latitude", "longitude",
    "distance_from_hq_km", "anomaly_score", "maintenance_priority",
    "service_priority_score"
).limit(20))

Optimized Service Routing (High-precision distance and anomaly calculations):


sensor_id,device_type,latitude,longitude,distance_from_hq_km,anomaly_score,maintenance_priority,service_priority_score
SENSOR_0167,Pump,1.4189074191330369,103.88482646467946,15990.32611680121,5,HIGH,0.0312669503672167
SENSOR_0883,Generator,1.41523315678983,103.88179129761689,15990.847015498122,5,HIGH,0.0312659319161468
SENSOR_0113,Turbine,1.4163636223646177,103.8781852333886,15990.931362318648,5,HIGH,0.0312657670091141
SENSOR_0322,Generator,1.4184708084732065,103.87079177961846,15991.124665674964,5,HIGH,0.0312653890869914
SENSOR_0432,Compressor,1.408959032493945,103.87329643808312,15991.914936535532,5,HIGH,0.0312638441449944
SENSOR_0743,Compressor,1.4001802463247792,103.88680882193265,15992.040470480197,5,HIGH,0.031263598746148
SENSOR_0411,Generator,1.4098943445967993,103.86910339859918,15992.049838944771,5,HIGH,0.031263580432448
SENSOR_0336,Compressor,1.4011392827620262,103.8800715951088,15992.310376312693,5,HIGH,0.0312630711363257
SENSOR_0270,Pump,1.4043778562515496,103.8677466518737,15992.659497459275,5,HIGH,0.0312623887034377
SENSOR_0434,Motor,1.4017793989170992,103.86857953035224,15992.86736996092,5,HIGH,0.0312619823857662
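
The routing score is a simple ratio, so it can be reproduced by hand. This sketch recomputes the top row of the table and contrasts it with an invented nearby sensor; the function name and the 10 km example are assumptions for illustration.

```python
def service_priority_score(anomaly_score, distance_km):
    """Higher anomaly scores and shorter distances both raise the score."""
    # +1 avoids division by zero for a sensor located at the reference point
    return (anomaly_score * 100) / (distance_km + 1)

# SENSOR_0167 from the table above
top_row = service_priority_score(5, 15990.32611680121)

# Invented example: a mildly anomalous sensor only 10 km away
nearby = service_priority_score(2, 10.0)

print(f"SENSOR_0167: {top_row:.16f}")
print(f"nearby:      {nearby:.4f}")
```

Since every sensor here is about 16,000 km from the reference point, the scores differ only in the sixth decimal place, and the ordering among sensors with the same anomaly score comes down to a few kilometres of distance. With a reference point near the sensors, distance would carry far more weight.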


## 11. Performance Comparison: Binary Types vs String Types


In [0]:
from pyspark.sql.functions import current_timestamp
import time

# Create comparison DataFrame with string-based coordinates (inefficient)
df_string_coords = df_sensors.withColumn(
    "lat_string", col("latitude").cast(StringType())
).withColumn(
    "lon_string", col("longitude").cast(StringType())
)

print("Performance Comparison: DoubleType vs StringType for Distance Calculations")
print("=" * 70)

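# Test 1: Distance calculation with DoubleType (optimized)
# Note: a single timed run on 1,000 rows is only a rough indication. Spark's
# optimizer can prune columns that count() does not need, and timings at
# this scale are dominated by job-scheduling overhead and noise.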
start_time = time.time()
df_double_distance = df_sensors.withColumn(
    "distance",
    calculate_distance(col("latitude"), col("longitude"), lit(ref_lat), lit(ref_lon))
).count()
double_time = time.time() - start_time

print(f"DoubleType Performance: {double_time:.4f} seconds")

# Test 2: Distance calculation with StringType (requires casting)
start_time = time.time()
df_string_distance = df_string_coords.withColumn(
    "distance",
    calculate_distance(
        col("lat_string").cast(DoubleType()), 
        col("lon_string").cast(DoubleType()), 
        lit(ref_lat), 
        lit(ref_lon)
    )
).count()
string_time = time.time() - start_time

print(f"StringType Performance: {string_time:.4f} seconds (with casting)")
print(f"Performance Improvement: {((string_time - double_time) / string_time * 100):.2f}%")


Performance Comparison: DoubleType vs StringType for Distance Calculations
DoubleType Performance: 0.1997 seconds
StringType Performance: 0.2637 seconds (with casting)
Performance Improvement: 24.25%
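
A single timed run is sensitive to noise, so the 24% figure is best treated as indicative. One way to make such comparisons steadier is to repeat each measurement and report the median. The helper below is a standard-library sketch under that assumption; `time_runs` is a hypothetical name, and the workload timed here is a stand-in for a Spark action.

```python
import statistics
import time

def time_runs(action, repeats=5, warmup=1):
    """Call `action` several times and summarize the wall-clock durations."""
    for _ in range(warmup):
        action()  # warm-up calls are executed but not timed
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return {
        "runs": len(durations),
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }

# Stand-in workload; in the notebook this would be a callable wrapping a Spark action
result = time_runs(lambda: sum(range(100_000)))
print(result)
```

In the notebook, the callable could wrap an action that actually consumes the computed column, for example `lambda: df_with_distance.agg(avg("distance_from_hq_km")).collect()`, so that the distance calculation cannot be skipped.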


## 12. Save Processed Data (Optimized Parquet Format)


In [0]:


# Save with partitioning for efficient queries
output_path = "/Volumes/workspace/rebu/sensor-analysis"
# Use df_regional which has the 'region' column
df_regional.write.mode("overwrite") \
    .partitionBy("region", "maintenance_priority") \
    .parquet(f"{output_path}/maintenance_data")

print(f"Data saved to {output_path}/maintenance_data")
print("Partitioned by region and maintenance_priority for optimized querying")


Data saved to /Volumes/workspace/rebu/sensor-analysis/maintenance_data
Partitioned by region and maintenance_priority for optimized querying


## 13. Summary Statistics

In [0]:
from pyspark.sql.functions import count, sum as sum_func

# Use df_regional which has all columns including region
summary_stats = df_regional.groupBy("maintenance_priority").agg(
    count("*").alias("total_sensors"),
    avg("anomaly_score").alias("avg_anomaly_score"),
    avg("distance_from_hq_km").alias("avg_distance"),
    avg("temperature").alias("avg_temperature"),
    avg("voltage").alias("avg_voltage")
).orderBy(
    when(col("maintenance_priority") == "HIGH", 1)
    .when(col("maintenance_priority") == "MEDIUM", 2)
    .when(col("maintenance_priority") == "LOW", 3)
    .otherwise(4)
)

print("Final Summary - Maintenance Priority Distribution:")
display(summary_stats)

Final Summary - Maintenance Priority Distribution:


maintenance_priority,total_sensors,avg_anomaly_score,avg_distance,avg_temperature,avg_voltage
HIGH,41,5.0,16001.907410212769,73.87686334005217,188.0996975782441
NORMAL,959,0.0,16003.772546198074,29.406538944423385,230.2525395302877


## Key Takeaways

### Binary-Optimized Type Benefits:

1. **DoubleType for Geospatial**:
   - No casting required for distance calculations
   - Direct use in mathematical operations (Haversine formula)
   - Faster than string-based coordinates in the benchmark above (about 24% in a single run on 1,000 rows)

2. **FloatType for Sensor Metrics**:
   - Compact 32-bit storage (about 7 significant digits) for sensor measurements
   - Efficient aggregations and statistical calculations
   - Reduced memory footprint vs string storage

3. **StructType vs MapType**:
   - **StructType**: Typically better performance, plus schema enforcement and type safety
   - **MapType**: Flexibility for dynamic/variable metrics

4. **Performance Improvements**:
   - Faster distance calculations (no type conversion overhead)
   - Efficient anomaly detection with native numeric comparisons
   - Optimized aggregations for predictive maintenance models