In [0]:
from pyspark.sql import functions as F

# 1. Load
dm = spark.read.table("workspace.bronze.device_message_raw").withColumn("source_label", F.lit("device"))
rt = spark.read.table("workspace.bronze.Rapid_step_test_raw").withColumn("source_label", F.lit("step"))

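# 2. Clean
# Remove every character except digits and dots before casting distance to a
# double, and cast the timestamp to bigint so it can be range-compared in the join.
dm_clean = (
    dm.withColumn(
        "distance_cm",
        F.regexp_replace(F.col("distance").cast("string"), "[^0-9.]", "").cast("double"),
    )
    .withColumn("ts_ms", F.col("timestamp").cast("bigint"))
)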

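# 3. Join & Label
# The left join keeps every device message. A message matches a step test when the
# device IDs agree and ts_ms lies between start_time and stop_time (inclusive).
joined = dm_clean.alias("m").join(
    rt.alias("t"),
    (F.col("m.device_id") == F.col("t.device_id"))
    & F.col("m.ts_ms").between(F.col("t.start_time"), F.col("t.stop_time")),
    how="left",
)

# Label while the right-side columns are still in scope: unmatched rows have a
# NULL start_time and become "no_step". Then drop BOTH duplicate columns from rt.
final_df = (
    joined.withColumn(
        "step_label",
        F.when(F.col("t.start_time").isNotNull(), "step").otherwise("no_step"),
    )
    .drop(rt.device_id)
    .drop(rt.source_label)
)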

# 4. Save to the final table name required by the validation query
final_df.write.mode("overwrite").saveAsTable("labeled_step_test")

In [0]:
%sql
-- Steps vs. No-Steps
SELECT step_label, COUNT(*)
FROM labeled_step_test
GROUP BY step_label;

-- Invalid or missing labels
SELECT *
FROM labeled_step_test
WHERE step_label NOT IN ('step', 'no_step')
   OR step_label IS NULL
LIMIT 50;

-- Source label counts
SELECT source_label, COUNT(*)
FROM labeled_step_test
GROUP BY source_label;

-- Invalid source labels
SELECT *
FROM labeled_step_test
WHERE source_label NOT IN ('device', 'step')
   OR source_label IS NULL
LIMIT 50;

## The Ethics Reflection

Automating a health data pipeline raises the bar for accuracy, because errors in the ETL process can propagate silently when no human reviews the output. Engineers must build validation checks into the pipeline so that automated labels such as 'step' and 'no_step' stay accurate over time; mislabeled rows produce biased datasets, which in turn can lead to incorrect health insights. We also have a duty to protect individual privacy by ensuring that automated workflows do not inadvertently expose PII or support diagnostic claims beyond what the data was collected for. Ultimately, automation should protect the integrity of a person's health narrative, not just the efficiency of the data flow.
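
One way to make the validation described above automatic is to turn the label-count query into a check that fails the run instead of only displaying rows. The following is a minimal sketch, not part of the original pipeline: the function name `check_label_counts` and the `max_share` threshold are hypothetical, and the counts are passed in as a plain dictionary so the logic can be tested without a Spark session.

```python
from typing import Dict, List, Optional, Set

# Allowed values, taken from the validation queries in this notebook.
ALLOWED_STEP_LABELS = {"step", "no_step"}


def check_label_counts(
    counts: Dict[Optional[str], int],
    allowed: Set[str],
    max_share: float = 0.99,  # hypothetical threshold; tune it for the dataset
) -> List[str]:
    """Return a list of problems found; an empty list means the check passed."""
    total = sum(counts.values())
    if total == 0:
        return ["table is empty"]

    problems = []
    for label, n in counts.items():
        if label is None:
            if n > 0:
                problems.append(f"{n} rows have a NULL label")
        elif label not in allowed:
            problems.append(f"{n} rows have unexpected label {label!r}")

    # A label that covers almost every row often signals a broken join
    # rather than real data, so flag it for human review.
    for label in sorted(allowed):
        share = counts.get(label, 0) / total
        if share > max_share:
            problems.append(f"label {label!r} covers {share:.1%} of rows")
    return problems


# In the notebook, the counts could be built from the saved table
# (assumes an active `spark` session, so it is left as a comment here):
# counts = {r["step_label"]: r["count"]
#           for r in spark.table("labeled_step_test").groupBy("step_label").count().collect()}
problems = check_label_counts({"step": 120, "no_step": 880}, ALLOWED_STEP_LABELS)
if problems:
    raise ValueError("label validation failed: " + "; ".join(problems))
```

Raising an exception makes a scheduled job fail visibly, which addresses the concern that errors propagate silently. The same function could be reused for `source_label` by passing `{"device", "step"}` as the allowed set.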