# Lab: Curating the STEDI Step Test Dataset
In this lab, I am combining raw sensor data with step test records. My goal is to label each sensor reading as either a "step" (if it happened during a test) or "no_step" (if it happened outside a test window).

In [0]:
%python
# Load the raw data from the Bronze layer
df_messages = spark.read.table("workspace.bronze.device_message_raw")
df_tests = spark.read.table("workspace.bronze.rapid_step_test_raw")

# Show what we loaded
print("Device Messages Count:", df_messages.count())
print("Step Tests Count:", df_tests.count())

## Part 1 – Understand bronze tables

First, I preview the two raw STEDI tables to understand their columns and data:
- `workspace.bronze.device_message_raw`
- `workspace.bronze.rapid_step_test_raw`


In [0]:
%sql
SELECT * 
FROM workspace.bronze.device_message_raw
LIMIT 20;


In [0]:
%sql
SELECT * 
FROM workspace.bronze.rapid_step_test_raw
LIMIT 20;


In [0]:
%sql
DESCRIBE TABLE workspace.bronze.device_message_raw;

In [0]:
%sql
DESCRIBE TABLE workspace.bronze.rapid_step_test_raw;


## Load tables as Spark DataFrames


In [0]:
device_df = spark.table("workspace.bronze.device_message_raw")
rapid_df = spark.table("workspace.bronze.rapid_step_test_raw")

display(device_df.limit(5))
display(rapid_df.limit(5))


## Prepare device_message_raw

Convert distance to a numeric column and make sure timestamps and device IDs are clean.
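
As a rough illustration of the distance conversion outside Spark, the sketch below strips a trailing `cm` suffix and converts the remainder to a float, returning `None` when the value is not numeric. The helper name `parse_distance_cm` and the sample values are hypothetical; only the `"1cm"` format comes from this lab. Note that it removes the suffix only, whereas `regexp_replace` removes every occurrence of `cm` in the string.

```python
from typing import Optional


def parse_distance_cm(raw: Optional[str]) -> Optional[float]:
    """Hypothetical helper: turn a string such as "1cm" into 1.0."""
    if raw is None:
        return None
    text = raw.strip()
    # Drop the unit suffix if present.
    if text.endswith("cm"):
        text = text[:-2]
    try:
        return float(text)
    except ValueError:
        # Non-numeric input becomes None instead of raising.
        return None


# Made-up sample values, including two that cannot be parsed.
samples = ["1cm", "12.5cm", "n/a", None]
parsed = [parse_distance_cm(s) for s in samples]
print(parsed)  # [1.0, 12.5, None, None]
```

Counting how many values come back as `None` is a quick way to see whether the raw column contains anything other than the expected `<number>cm` pattern.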


In [0]:
device_df.printSchema()


In [0]:
from pyspark.sql import functions as F

# Convert "1cm" -> 1.0 and convert long date/timestamp to proper timestamp
device_clean = (
    device_df
    .withColumn("distance_cm", F.regexp_replace("distance", "cm", "").cast("double"))
    .withColumn("timestamp_ts", F.col("timestamp").cast("timestamp"))
    .withColumn("device_id_clean", F.col("device_id"))
)

display(device_clean.limit(5))
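
One thing worth checking at this step: when Spark casts a numeric value to a timestamp, it interprets the number as seconds since the Unix epoch. If the raw `timestamp` column actually holds milliseconds (an assumption on my part; the `printSchema()` output and a few sample values would confirm it), the value would need dividing by 1000 before the cast. The plain-Python sketch below uses one made-up value to show how far the two interpretations diverge.

```python
from datetime import datetime, timezone

raw_value = 1_700_000_000_000  # made-up epoch value, assumed to be milliseconds

# Correct reading of a millisecond value: scale to seconds first.
as_millis = datetime.fromtimestamp(raw_value / 1000, tz=timezone.utc)
print(as_millis.isoformat())  # 2023-11-14T22:13:20+00:00

# The same number read as seconds lands tens of thousands of years away.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
years_if_read_as_seconds = raw_value / SECONDS_PER_YEAR
print(f"read as seconds: about {years_if_read_as_seconds:,.0f} years after 1970")
```

If the previewed `timestamp_ts` values show implausible years, the unit of the raw column is the first thing I would check.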


In [0]:
%python
rapid_df.printSchema()

In [0]:
from pyspark.sql import functions as F

rapid_clean = (
    rapid_df
    # Cast start and stop times (they are long) to timestamps
    .withColumn("start_ts", F.col("start_time").cast("timestamp"))
    .withColumn("stop_ts", F.col("stop_time").cast("timestamp"))
    # Standardize device id name to match device_clean
    .withColumn("device_id_clean", F.col("device_id"))
)

display(rapid_clean.limit(5))


In [0]:
rapid_clean = (
    rapid_df
    .withColumn("start_ts", F.col("start_time").cast("timestamp"))
    .withColumn("stop_ts", F.col("stop_time").cast("timestamp"))
    # Keep device_id as-is: device_clean already carries device_id_clean, and
    # adding it here too would make that column name ambiguous after the join
)
display(rapid_clean.limit(5))


In [0]:
joined = (
    device_clean.alias("d")
    .join(
        rapid_clean.alias("r"),
        (F.col("d.device_id_clean") == F.col("r.device_id")) &
        (F.col("d.timestamp_ts") >= F.col("r.start_ts")) &
        (F.col("d.timestamp_ts") <= F.col("r.stop_ts")),
        how="left"
    )
)

display(joined.limit(20))


In [0]:
curated_df = joined.withColumn(
    "step_label",
    F.when(F.col("r.start_ts").isNotNull(), F.lit("step")).otherwise(F.lit("no_step"))
)

curated_selected = curated_df.select(
    F.col("device_id_clean").alias("device_id"),
    F.col("timestamp_ts"),
    F.col("distance_cm"),
    F.col("sensor_type"),
    F.col("step_label")
)

display(curated_selected.limit(20))
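
The join-then-label rule can be sketched in plain Python to make it explicit: a reading is a `step` when its device has at least one test window containing the reading's timestamp (both ends inclusive), and `no_step` otherwise. The sample rows and the `label_reading` helper are hypothetical. One difference worth noting is that this sketch emits exactly one label per reading, whereas a left join emits one row per matching window, so overlapping tests for the same device would repeat readings in `joined`.

```python
from collections import defaultdict

# Hypothetical test windows: (device_id, start, stop) in epoch seconds.
tests = [("dev-1", 100, 200), ("dev-1", 150, 250), ("dev-2", 300, 400)]
# Hypothetical sensor readings: (device_id, timestamp) in epoch seconds.
readings = [("dev-1", 99), ("dev-1", 100), ("dev-1", 175), ("dev-2", 401), ("dev-3", 350)]

# Group the windows by device so each reading is compared only to its own device.
windows = defaultdict(list)
for device_id, start, stop in tests:
    windows[device_id].append((start, stop))


def label_reading(device_id, ts):
    """Return "step" if ts falls inside any window for the device (inclusive)."""
    in_window = any(start <= ts <= stop for start, stop in windows.get(device_id, []))
    return "step" if in_window else "no_step"


labels = [label_reading(d, ts) for d, ts in readings]
print(labels)  # ['no_step', 'step', 'step', 'no_step', 'no_step']

# Number of windows matching each reading; a value above 1 means a left join
# would output that reading more than once.
matches = [sum(start <= ts <= stop for start, stop in windows.get(d, []))
           for d, ts in readings]
print(matches)  # [0, 1, 2, 0, 0]
```

In this made-up data the reading at timestamp 175 matches two windows, so a left join would return six rows for five readings. Comparing the row count of `joined` with that of `device_clean` would reveal whether the real tables have the same issue.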


## Save to Silver Layer

Save the curated dataset as a Delta table in the silver layer for later ML work.


In [0]:
target_table = "workspace.silver.stedi_curated_steps"

(curated_selected
    .write
    .mode("overwrite")  # overwrite if table exists
    .format("delta")
    .saveAsTable(target_table)
)

print(f"✅ Saved {target_table}")


## Verification

Confirm the table was created and both labels exist.


In [0]:
%sql
SELECT * 
FROM workspace.silver.stedi_curated_steps 
LIMIT 20;


In [0]:
%sql
SELECT 
    step_label, 
    COUNT(*) as row_count,
    ROUND(AVG(distance_cm), 2) as avg_distance_cm
FROM workspace.silver.stedi_curated_steps 
GROUP BY step_label;


## Ethics Check

**Are we labeling data fairly?**  
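We label a reading `step` only when it falls inside a documented test window for the same device. If device and test timestamps are slightly out of sync, real steps near the window edges may be mislabeled as `no_step`, and if tests for one device overlap, the left join repeats a reading once per matching window. Either effect could bias a step detection model trained on this table.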

**Are we protecting identity?**  
The curated table includes device IDs but no personal names. Still, if a device ID can be mapped to an individual, we should anonymize the IDs before sharing the data publicly, or else apply strict access controls.
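
One possible way to pseudonymize device IDs before sharing is a keyed hash, sketched below with Python's standard `hmac` and `hashlib` modules. The function name `pseudonymize_device_id`, the sample IDs, and the placeholder key are hypothetical, and choosing a keyed hash is my assumption, not something this lab prescribes. The key would have to stay secret, since anyone holding it can recompute the mapping from known device IDs.

```python
import hashlib
import hmac


def pseudonymize_device_id(device_id: str, secret_key: bytes) -> str:
    """Hypothetical helper: map a device ID to a stable, keyed SHA-256 token."""
    digest = hmac.new(secret_key, device_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


key = b"replace-with-a-real-secret"  # placeholder; never hard-code a real key

token_a = pseudonymize_device_id("device-001", key)
token_b = pseudonymize_device_id("device-002", key)

# The same ID always maps to the same token, so per-device grouping still works.
print(token_a == pseudonymize_device_id("device-001", key))
# Different IDs map to different tokens.
print(token_a != token_b)
```

Because the mapping is deterministic, readings from one device stay grouped under one token, which keeps the table usable for modeling while hiding the original ID.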

**Are we avoiding medical claims?**  
This dataset contains only sensor distance readings and the `step` / `no_step` labels derived from test windows. It should **not** be used for medical diagnosis, fitness scoring, or health recommendations without clinical validation.
