# Generate Sensor Data

This notebook generates a large CSV dataset of simulated IoT sensor readings.

The schema is designed so that Parquet's columnar encoding and compression will
produce far smaller files than CSV: most columns are low-cardinality strings,
small-range integers, or sequential timestamps, all of which compress extremely well.

**Run this before the Week 2 lab.**

In [None]:
dbutils.widgets.text("num_rows", "10000000", "Number of rows to generate")
num_rows = int(dbutils.widgets.get("num_rows"))
print(f"Generating {num_rows:,} rows")

## Schema

| Column | Type | Values | Why it compresses well |
|--------|------|--------|------------------------|
| sensor_id | STRING | sensor-0001 to sensor-0500 | 500 distinct → dictionary encoding |
| sensor_type | STRING | temperature, humidity, pressure, light, motion | 5 distinct → tiny dictionary |
| location | STRING | building-01-floor-1 through building-10-floor-5 | 50 distinct → dictionary |
| reading_timestamp | TIMESTAMP | Sequential from 2024-01-01 | Sequential → delta encoding |
| reading_value | DOUBLE | Random 0–100 | Random — does not compress well |
| unit | STRING | celsius, percent, hpa, lux, count | 5 distinct → tiny dictionary |
| battery_pct | INT | 0–100 | Small range → bit packing |
| signal_strength | INT | -100 to -30 | Small range → bit packing |
| status | STRING | normal, warning, critical | 3 distinct → tiny dictionary |
| firmware_version | STRING | v1.0, v1.1, v2.0 | 3 distinct → tiny dictionary |
| deployed_date | DATE | 365 dates in 2023 | Dictionary encoding |
| maintenance_flag | BOOLEAN | true/false | Run-length encoding |
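
The encodings named in the table can be illustrated with a few lines of plain Python. This is only a simplified sketch of the ideas, not Parquet's actual on-disk format; the helper names (`dictionary_encode`, `delta_encode`, `run_length_encode`) and the tiny sample columns are made up for illustration.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer index into a list of distinct values."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def delta_encode(values):
    """Keep the first value, then store only the differences between neighbours."""
    return values[0], [b - a for a, b in zip(values, values[1:])]


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


# A status-like column: few distinct strings, so each row becomes a tiny integer.
statuses = ["normal", "normal", "warning", "normal", "critical", "normal"]
status_dict, status_codes = dictionary_encode(statuses)

# A reading_timestamp-like column: one reading per second (epoch seconds, UTC).
timestamps = [1704067200 + i for i in range(6)]
first_ts, ts_deltas = delta_encode(timestamps)

# A maintenance_flag-like column: mostly false, so runs are long.
flags = [False, False, False, True, False, False]
flag_runs = run_length_encode(flags)

print(status_dict, status_codes)
print(first_ts, ts_deltas)
print(flag_runs)
```

In each case the encoded form carries the same information in far fewer distinct or far smaller values, which is what a general-purpose compressor then exploits. A column of random doubles such as `reading_value` offers none of these patterns.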

## Generate

In [None]:
csv_path = "/FileStore/hwe-data/week2/sensor_readings"

df = (
    spark.range(num_rows)
    .selectExpr(
        "CONCAT('sensor-', LPAD(CAST((id % 500 + 1) AS STRING), 4, '0')) AS sensor_id",
        "ARRAY('temperature', 'humidity', 'pressure', 'light', 'motion')[CAST(id % 5 AS INT)] AS sensor_type",
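        # Floor uses id DIV 10 so it varies independently of the building (10 x 5 = 50 locations);
        # deriving both from id % 10 and id % 5 would tie the floor to the building and yield only 10.
        "CONCAT('building-', LPAD(CAST((id % 10 + 1) AS STRING), 2, '0'), '-floor-', CAST(((id DIV 10) % 5 + 1) AS STRING)) AS location",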
        "CAST('2024-01-01' AS TIMESTAMP) + MAKE_INTERVAL(0, 0, 0, 0, 0, 0, CAST(id AS INT)) AS reading_timestamp",
        "ROUND(RAND(42) * 100, 2) AS reading_value",
        "ARRAY('celsius', 'percent', 'hpa', 'lux', 'count')[CAST(id % 5 AS INT)] AS unit",
        # PMOD always returns a non-negative remainder, so it avoids the ABS overflow
        # that occurs if HASH ever returns the minimum INT value.
        "CAST(PMOD(HASH(id, 1), 101) AS INT) AS battery_pct",
        "CAST(-(PMOD(HASH(id, 2), 71) + 30) AS INT) AS signal_strength",
        "ARRAY('normal', 'normal', 'normal', 'normal', 'warning', 'critical')[CAST(PMOD(HASH(id, 3), 6) AS INT)] AS status",
        "ARRAY('v1.0', 'v1.1', 'v2.0')[CAST(PMOD(HASH(id, 4), 3) AS INT)] AS firmware_version",
        "DATE_ADD(CAST('2023-01-01' AS DATE), CAST(PMOD(HASH(id, 5), 365) AS INT)) AS deployed_date",
        "CAST(PMOD(HASH(id, 6), 10) = 0 AS BOOLEAN) AS maintenance_flag"
    )
)

df.printSchema()
df.show(5, truncate=False)
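
As a quick sanity check, the purely `id`-based expressions can be mirrored in plain Python without a cluster. This is a minimal sketch that covers only `sensor_id`, `sensor_type`, `unit`, and `reading_timestamp`; the hashed and random columns are left out because their values depend on Spark's `HASH` and `RAND`. The function name `deterministic_columns` is hypothetical, and the final row count assumes the widget default of 10,000,000.

```python
from datetime import datetime, timedelta

SENSOR_TYPES = ["temperature", "humidity", "pressure", "light", "motion"]
UNITS = ["celsius", "percent", "hpa", "lux", "count"]


def deterministic_columns(row_id):
    """Mirror the id-based (non-hashed, non-random) expressions for one row."""
    return {
        "sensor_id": f"sensor-{row_id % 500 + 1:04d}",
        "sensor_type": SENSOR_TYPES[row_id % 5],
        "unit": UNITS[row_id % 5],
        "reading_timestamp": datetime(2024, 1, 1) + timedelta(seconds=row_id),
    }


sample = [deterministic_columns(i) for i in range(1000)]

# 500 distinct sensors, and each sensor type is always paired with the same unit.
distinct_sensors = {r["sensor_id"] for r in sample}
type_unit_pairs = {(r["sensor_type"], r["unit"]) for r in sample}

# Timestamp of the last row, assuming the default of 10,000,000 rows.
last_timestamp = deterministic_columns(10_000_000 - 1)["reading_timestamp"]

print(len(distinct_sensors), sorted(type_unit_pairs))
print(last_timestamp)
```

Because `sensor_type` and `unit` share the same `id % 5` index, every temperature reading is in celsius, every humidity reading in percent, and so on. With one reading per second, the default run spans roughly 116 days from 2024-01-01.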

## Write to CSV

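We coalesce to 1 partition so that Spark writes a single CSV part file. Note that `csv_path` is a directory: the data lands in one `part-*.csv` file inside it.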

In [None]:
(
    df.coalesce(1)
    .write
    .mode("overwrite")
    .option("header", "true")
    .csv(csv_path)
)

# Show file size: sum only the .csv part file(s), skipping Spark's marker files such as _SUCCESS
csv_files = dbutils.fs.ls(csv_path)
csv_bytes = sum(f.size for f in csv_files if f.name.endswith(".csv"))
print(f"CSV written to: {csv_path}")
print(f"CSV size: {csv_bytes / 1024 / 1024:.1f} MB")
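
To judge whether the reported size is plausible, a rough back-of-the-envelope estimate can be made from the length of one row. The sketch below uses a made-up sample row in the schema's column order; the timestamp rendering is an assumption, since the exact text depends on Spark's CSV `timestampFormat` and the session time zone, and `assumed_rows` simply copies the widget default.

```python
# One illustrative data row in the schema's column order (values are made up).
sample_row = ",".join([
    "sensor-0001", "temperature", "building-01-floor-1",
    "2024-01-01T00:00:00.000Z",  # assumed timestamp rendering
    "42.17", "celsius", "55", "-67", "normal", "v1.0", "2023-05-17", "false",
])

bytes_per_row = len(sample_row.encode("utf-8")) + 1  # +1 for the newline
assumed_rows = 10_000_000  # default value of the num_rows widget
estimated_mb = bytes_per_row * assumed_rows / 1024 / 1024

print(f"~{bytes_per_row} bytes/row, ~{estimated_mb:,.0f} MB for {assumed_rows:,} rows")
```

If the size printed by the cell above is far from this estimate (on the order of 1 GB for the default row count), the write may have produced more or fewer rows than intended.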