
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>




# Streaming ETL Lab
## Process Workouts

In this lab, you will configure a query to consume and parse raw data from a single topic as it lands in a multiplex bronze table. You'll also validate and quarantine these records before loading them into a silver table.

## Learning Objectives
By the end of this lab, you should be able to:
- Describe how filters are applied to streaming jobs
- Use built-in functions to flatten nested JSON data
- Parse and save binary-encoded strings to native types
- Describe and implement a quarantine table

## Hint

Each step of this lab has you define one additional table in the pipeline. Work through the lab one step at a time. Once you have implemented a step, run the pipeline and check that the table contains the correct results. If it does not, update the code in this notebook. Then, in the pipeline interface, select only the table you're working on and perform a **full refresh** on it.

In [0]:
from pyspark import pipelines as dp
import pyspark.sql.functions as F




## Step 1: Stream Workouts from Multiplex Bronze

Stream records for the **`workout`** topic from the multiplex bronze table to create a **`workouts_bronze`** table.
1. Start a stream against the **`bronze`** table
1. Filter all records by **`topic = 'workout'`**
1. Parse and flatten JSON fields
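
As a rough plain-Python analogy for the three steps above (not pipeline code), the sketch below filters multiplexed rows by topic, decodes the binary **`value`** payload, and flattens the JSON into top-level fields. The sample rows, the `bpm` topic payload, and the `parse_workouts` helper are invented for illustration; only the field names are taken from **`workouts_schema`**.

```python
import json

# Hypothetical multiplexed bronze rows: a topic plus a binary-encoded JSON payload.
bronze_rows = [
    {"topic": "workout", "value": b'{"user_id": 1, "workout_id": 10, "timestamp": 1575158400.0, "action": "start", "session_id": 100}'},
    {"topic": "bpm", "value": b'{"device_id": 7, "heartrate": 62.5}'},
    {"topic": "workout", "value": b'{"user_id": 2, "workout_id": 11, "timestamp": 1575158460.0, "action": "stop", "session_id": 101}'},
]

# Field names taken from workouts_schema.
FIELDS = ["user_id", "workout_id", "timestamp", "action", "session_id"]


def parse_workouts(rows):
    """Keep only 'workout' rows, decode each payload, and flatten it into one dict per record."""
    parsed = []
    for row in rows:
        # Filter: skip every topic except 'workout'.
        if row["topic"] != "workout":
            continue
        # Parse: decode the bytes to a string, then load the JSON document.
        payload = json.loads(row["value"].decode("utf-8"))
        # Flatten: expose each schema field as a top-level key (missing fields become None).
        parsed.append({field: payload.get(field) for field in FIELDS})
    return parsed


workouts = parse_workouts(bronze_rows)
for record in workouts:
    print(record)
```

In the pipeline itself, the same idea is expressed with a streaming read, a filter on **`topic`**, and built-in JSON functions rather than a Python loop.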

In [0]:
workouts_schema = "user_id INT, workout_id INT, timestamp FLOAT, action STRING, session_id INT"
<FILL_IN>

In [0]:
%skip
workouts_schema = "user_id INT, workout_id INT, timestamp FLOAT, action STRING, session_id INT"

@dp.table(
    table_properties={"quality": "bronze"}
)
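def workouts_bronze():
    return (
        # Stream from the multiplex bronze table and keep only the workout topic
        dp.read_stream("bronze")
          .filter("topic = 'workout'")
          # Cast the binary payload to a string and parse it with the declared schema
          .select(F.from_json(F.col("value").cast("string"), workouts_schema).alias("v"))
          # Flatten the parsed struct into top-level columns
          .select("v.*")
    )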


## Step 2: Promote Workouts to Silver

Process workouts bronze data into a **`workouts_silver`** table with the following schema.

| field | type |
| --- | --- |
| user_id | INT | 
| workout_id | INT | 
| time | TIMESTAMP |
| action | STRING |
| session_id | INT | 

Validate and promote records from **`workouts_bronze`** to silver by implementing a **`workouts_silver`** table.
1. Check that **`user_id`** and **`workout_id`** fields are not null
1. Cast **`timestamp`** to a timestamp field named **`time`**
1. Deduplicate on **`user_id`** and **`time`**
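
The logic of these three steps can be sketched in plain Python, purely as an illustration and not as pipeline code. The sample records and the `promote` helper are hypothetical, and the sketch assumes **`timestamp`** holds epoch seconds and is never null.

```python
from datetime import datetime, timezone

# Hypothetical parsed bronze records: one exact duplicate and one with a null user_id.
records = [
    {"user_id": 1, "workout_id": 10, "timestamp": 1575158400.0, "action": "start", "session_id": 100},
    {"user_id": 1, "workout_id": 10, "timestamp": 1575158400.0, "action": "start", "session_id": 100},
    {"user_id": None, "workout_id": 11, "timestamp": 1575158460.0, "action": "stop", "session_id": 101},
]


def promote(rows):
    """Validate, cast, and deduplicate records (assumes a non-null epoch-seconds timestamp)."""
    seen = set()
    silver = []
    for r in rows:
        # 1. Drop records whose user_id or workout_id is null.
        if r["user_id"] is None or r["workout_id"] is None:
            continue
        # 2. Convert epoch seconds to a timezone-aware timestamp named `time`.
        time = datetime.fromtimestamp(r["timestamp"], tz=timezone.utc)
        # 3. Keep only the first record for each (user_id, time) pair.
        key = (r["user_id"], time)
        if key in seen:
            continue
        seen.add(key)
        silver.append({
            "user_id": r["user_id"],
            "workout_id": r["workout_id"],
            "time": time,
            "action": r["action"],
            "session_id": r["session_id"],
        })
    return silver


silver = promote(records)
for row in silver:
    print(row)
```

Unlike this batch loop, a streaming job cannot remember every key it has seen indefinitely, so streaming deduplication is typically paired with a watermark that limits how long keys are retained.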

In [0]:
rules = <FILL_IN>

In [0]:
%skip
rules = {
  "valid_user_id": "user_id IS NOT NULL",
  "valid_workout_id": "workout_id IS NOT NULL"
}

@dp.table(
    table_properties={"quality": "silver"}
)
@dp.expect_all_or_drop(rules)
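def workouts_silver():
    return (
        dp.read_stream("workouts_bronze")
          # Cast the numeric timestamp (epoch seconds) to a TIMESTAMP column named time
          .select("user_id", "workout_id", F.col("timestamp").cast("timestamp").alias("time"), "action", "session_id")
          # The watermark bounds how long deduplication state is kept
          .withWatermark("time", "30 seconds")
          .dropDuplicates(["user_id", "time"])
    )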


## Step 3: Quarantine Invalid Records

Implement a **`workouts_quarantine`** table that captures the records from **`workouts_bronze`** that fail the validation rules from Step 2.
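
One common way to express a quarantine condition is to negate the conjunction of the validation rules, so that every record satisfies exactly one of the two conditions. The sketch below assumes the two null-check rules from Step 2. The `is_valid` and `is_quarantined` helpers and the sample records are hypothetical plain-Python stand-ins for the SQL constraints.

```python
# Validation rules assumed from Step 2 (rule name -> SQL constraint).
rules = {
    "valid_user_id": "user_id IS NOT NULL",
    "valid_workout_id": "workout_id IS NOT NULL",
}

# A record is invalid when it fails at least one rule: NOT(rule_1 AND rule_2 ...).
quarantine_rules = {"invalid_record": f"NOT({' AND '.join(rules.values())})"}
print(quarantine_rules["invalid_record"])
# NOT(user_id IS NOT NULL AND workout_id IS NOT NULL)


def is_valid(record):
    """Plain-Python stand-in for the conjunction of the validation rules."""
    return record["user_id"] is not None and record["workout_id"] is not None


def is_quarantined(record):
    """Plain-Python stand-in for the negated rule."""
    return not is_valid(record)


# Hypothetical sample records covering every null combination.
sample = [
    {"user_id": 1, "workout_id": 10},
    {"user_id": None, "workout_id": 11},
    {"user_id": 2, "workout_id": None},
    {"user_id": None, "workout_id": None},
]

valid = [r for r in sample if is_valid(r)]
quarantined = [r for r in sample if is_quarantined(r)]
print(len(valid), len(quarantined))
# 1 3
```

Because the quarantine rule is derived from `rules`, adding a validation rule later updates both conditions together.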

In [0]:
quarantine_rules = <FILL_IN>

In [0]:
%skip
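# A record is quarantined when it fails at least one validation rule from Step 2
quarantine_rules = {"invalid_record": f"NOT({' AND '.join(rules.values())})"}

@dp.table
@dp.expect_all_or_drop(quarantine_rules)
def workouts_quarantine():
    # expect_all_or_drop keeps only rows that satisfy the rule, i.e. the invalid records
    return (
        dp.read_stream("workouts_bronze")
    )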

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>