
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Streaming from Multiplex Bronze

In this notebook, you will configure a query to consume and parse raw data from a single topic as it lands in the multiplex bronze table configured in the last lesson. We'll continue refining this query in the following notebooks.

## Learning Objectives
By the end of this lesson, you should be able to:
- Describe how filters are applied to streaming jobs
- Use built-in functions to flatten nested JSON data
- Parse and save binary-encoded strings to native types

Declare database and set all path variables.

In [0]:
%run ../Includes/Classroom-Setup-3.2

## Define a Batch Read

Before building our streams, we'll start with a static view of our data. Working with static data can be easier during interactive development as no streams will be triggered. 

Because we're working with Delta Lake as our source, we'll still get the most up-to-date version of our table each time we execute a query.

If you're working with SQL, you can just directly query the **`bronze`** table registered in the previous lesson. 

Python and Scala users can easily create a DataFrame from a registered table.

In [0]:
batch_df = spark.table("bronze")
display(batch_df)

key,value,topic,partition,offset,timestamp,date,week_part
MTE5NzE1,eyJkZXZpY2VfaWQiOiAxMTk3MTUsICJ0aW1lIjogMTU3NTE1ODQwMCwgImhlYXJ0cmF0ZSI6IDY2LjYyMjQ0OTczMTU3OTc2fQ==,bpm,0,65087128,1575158410818,2019-12-01,2019-48
MTA3OTUy,eyJkZXZpY2VfaWQiOiAxMDc5NTIsICJ0aW1lIjogMTU3NTE1ODQwMSwgImhlYXJ0cmF0ZSI6IDExNi4wMzEyNTg5ODAxMjA4N30=,bpm,2,65102257,1575158411394,2019-12-01,2019-48
MTYxOTEw,eyJkZXZpY2VfaWQiOiAxNjE5MTAsICJ0aW1lIjogMTU3NTE1ODQwMywgImhlYXJ0cmF0ZSI6IDgyLjMzMzg1OTEyMDg3MzE3fQ==,bpm,0,65087131,1575158417678,2019-12-01,2019-48
MTE2NzA4,eyJkZXZpY2VfaWQiOiAxMTY3MDgsICJ0aW1lIjogMTU3NTE1ODQxMiwgImhlYXJ0cmF0ZSI6IDgzLjEzNDUyNzA2MDg3NzA5fQ==,bpm,3,64899138,1575158421868,2019-12-01,2019-48
MTYxMzQz,eyJkZXZpY2VfaWQiOiAxNjEzNDMsICJ0aW1lIjogMTU3NTE1ODQyMCwgImhlYXJ0cmF0ZSI6IDU5LjcxODMyNzY1MTYyODA0fQ==,bpm,3,64899148,1575158430841,2019-12-01,2019-48
MTEzMDI0,eyJkZXZpY2VfaWQiOiAxMTMwMjQsICJ0aW1lIjogMTU3NTE1ODQyMywgImhlYXJ0cmF0ZSI6IDk0LjcyNTkyNjkxMDg1MjAxfQ==,bpm,4,65002565,1575158432217,2019-12-01,2019-48
MTQxNjg3,eyJkZXZpY2VfaWQiOiAxNDE2ODcsICJ0aW1lIjogMTU3NTE1ODQzMCwgImhlYXJ0cmF0ZSI6IDg0LjM4NzM2MzY4NjEzMDU0fQ==,bpm,2,65102289,1575158437927,2019-12-01,2019-48
MTQyODM4,eyJkZXZpY2VfaWQiOiAxNDI4MzgsICJ0aW1lIjogMTU3NTE1ODQzNCwgImhlYXJ0cmF0ZSI6IDg1Ljc2MzkzNjgwMDQ1MTAzfQ==,bpm,3,64899162,1575158445876,2019-12-01,2019-48
MTk3OTMw,eyJkZXZpY2VfaWQiOiAxOTc5MzAsICJ0aW1lIjogMTU3NTE1ODQ1MCwgImhlYXJ0cmF0ZSI6IDg0LjYyMDk2MzIyMTczNTA3fQ==,bpm,0,65087190,1575158460427,2019-12-01,2019-48
MTEwMzY3,eyJkZXZpY2VfaWQiOiAxMTAzNjcsICJ0aW1lIjogMTU3NTE1ODQ3MSwgImhlYXJ0cmF0ZSI6IDc2LjUwOTk5NTg3NjA2Njc0fQ==,bpm,2,65102354,1575158482354,2019-12-01,2019-48


Delta Lake stores our schema information. 

Let's print it out, just to recall its structure:

In [0]:
%sql
DESCRIBE bronze

col_name,data_type,comment
key,binary,
value,binary,
topic,string,
partition,bigint,
offset,bigint,
timestamp,bigint,
date,date,
week_part,string,
,,
# Partitioning,,


Preview your data.

In [0]:
%sql
SELECT *
FROM bronze
LIMIT 20

key,value,topic,partition,offset,timestamp,date,week_part
MTE5NzE1,eyJkZXZpY2VfaWQiOiAxMTk3MTUsICJ0aW1lIjogMTU3NTE1ODQwMCwgImhlYXJ0cmF0ZSI6IDY2LjYyMjQ0OTczMTU3OTc2fQ==,bpm,0,65087128,1575158410818,2019-12-01,2019-48
MTA3OTUy,eyJkZXZpY2VfaWQiOiAxMDc5NTIsICJ0aW1lIjogMTU3NTE1ODQwMSwgImhlYXJ0cmF0ZSI6IDExNi4wMzEyNTg5ODAxMjA4N30=,bpm,2,65102257,1575158411394,2019-12-01,2019-48
MTYxOTEw,eyJkZXZpY2VfaWQiOiAxNjE5MTAsICJ0aW1lIjogMTU3NTE1ODQwMywgImhlYXJ0cmF0ZSI6IDgyLjMzMzg1OTEyMDg3MzE3fQ==,bpm,0,65087131,1575158417678,2019-12-01,2019-48
MTE2NzA4,eyJkZXZpY2VfaWQiOiAxMTY3MDgsICJ0aW1lIjogMTU3NTE1ODQxMiwgImhlYXJ0cmF0ZSI6IDgzLjEzNDUyNzA2MDg3NzA5fQ==,bpm,3,64899138,1575158421868,2019-12-01,2019-48
MTYxMzQz,eyJkZXZpY2VfaWQiOiAxNjEzNDMsICJ0aW1lIjogMTU3NTE1ODQyMCwgImhlYXJ0cmF0ZSI6IDU5LjcxODMyNzY1MTYyODA0fQ==,bpm,3,64899148,1575158430841,2019-12-01,2019-48
MTEzMDI0,eyJkZXZpY2VfaWQiOiAxMTMwMjQsICJ0aW1lIjogMTU3NTE1ODQyMywgImhlYXJ0cmF0ZSI6IDk0LjcyNTkyNjkxMDg1MjAxfQ==,bpm,4,65002565,1575158432217,2019-12-01,2019-48
MTQxNjg3,eyJkZXZpY2VfaWQiOiAxNDE2ODcsICJ0aW1lIjogMTU3NTE1ODQzMCwgImhlYXJ0cmF0ZSI6IDg0LjM4NzM2MzY4NjEzMDU0fQ==,bpm,2,65102289,1575158437927,2019-12-01,2019-48
MTQyODM4,eyJkZXZpY2VfaWQiOiAxNDI4MzgsICJ0aW1lIjogMTU3NTE1ODQzNCwgImhlYXJ0cmF0ZSI6IDg1Ljc2MzkzNjgwMDQ1MTAzfQ==,bpm,3,64899162,1575158445876,2019-12-01,2019-48
MTk3OTMw,eyJkZXZpY2VfaWQiOiAxOTc5MzAsICJ0aW1lIjogMTU3NTE1ODQ1MCwgImhlYXJ0cmF0ZSI6IDg0LjYyMDk2MzIyMTczNTA3fQ==,bpm,0,65087190,1575158460427,2019-12-01,2019-48
MTEwMzY3,eyJkZXZpY2VfaWQiOiAxMTAzNjcsICJ0aW1lIjogMTU3NTE1ODQ3MSwgImhlYXJ0cmF0ZSI6IDc2LjUwOTk5NTg3NjA2Njc0fQ==,bpm,2,65102354,1575158482354,2019-12-01,2019-48


Multiple topics are being ingested, so we'll need to define logic for each topic separately.

In [0]:
%sql
SELECT DISTINCT(topic)
FROM bronze

topic
workout
user_info
bpm


We'll cast our binary fields as strings, as this will allow us to manually review their contents.

In [0]:
%sql
SELECT cast(key AS STRING), cast(value AS STRING)
FROM bronze
LIMIT 20

key,value
119715,"{""device_id"": 119715, ""time"": 1575158400, ""heartrate"": 66.62244973157976}"
107952,"{""device_id"": 107952, ""time"": 1575158401, ""heartrate"": 116.03125898012087}"
161910,"{""device_id"": 161910, ""time"": 1575158403, ""heartrate"": 82.33385912087317}"
116708,"{""device_id"": 116708, ""time"": 1575158412, ""heartrate"": 83.13452706087709}"
161343,"{""device_id"": 161343, ""time"": 1575158420, ""heartrate"": 59.71832765162804}"
113024,"{""device_id"": 113024, ""time"": 1575158423, ""heartrate"": 94.72592691085201}"
141687,"{""device_id"": 141687, ""time"": 1575158430, ""heartrate"": 84.38736368613054}"
142838,"{""device_id"": 142838, ""time"": 1575158434, ""heartrate"": 85.76393680045103}"
197930,"{""device_id"": 197930, ""time"": 1575158450, ""heartrate"": 84.62096322173507}"
110367,"{""device_id"": 110367, ""time"": 1575158471, ""heartrate"": 76.50999587606674}"
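To see why the earlier previews looked unreadable, here is a minimal sketch in plain Python (outside Spark) that decodes the first previewed row by hand. It assumes the notebook renders binary columns as standard Base64 and that the underlying bytes are UTF-8 text; the helper name `decode_binary` is made up for illustration.

```python
import base64

# First row of the bronze preview, copied from the displayed output.
# Assumption: the display renders binary columns as standard Base64.
key_b64 = "MTE5NzE1"
value_b64 = (
    "eyJkZXZpY2VfaWQiOiAxMTk3MTUsICJ0aW1lIjogMTU3NTE1ODQwMCwg"
    "ImhlYXJ0cmF0ZSI6IDY2LjYyMjQ0OTczMTU3OTc2fQ=="
)

def decode_binary(b64_text: str) -> str:
    """Decode Base64 text to bytes, then read those bytes as UTF-8."""
    return base64.b64decode(b64_text).decode("utf-8")

key_str = decode_binary(key_b64)
value_str = decode_binary(value_b64)
print(key_str)
print(value_str)
```

The decoded strings should line up with the first row returned by the `cast(... AS STRING)` query above.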


## Parse Heart Rate Recordings

Let's start by defining logic to parse our heart rate recordings. We'll write this logic against our static data. Note that there are some <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations" target="_blank">unsupported operations</a> in Structured Streaming, so we may need to refactor some of our logic if we don't build our current queries with these limitations in mind.

Together, we'll iteratively develop a single query that parses our **`bpm`** topic to the following schema.

| field | type |
| --- | --- |
| device_id | LONG | 
| time | TIMESTAMP | 
| heartrate | DOUBLE |

We'll be creating the table labeled **`heartrate_silver`** in our architectural diagram; the code later in this notebook registers it as **`heart_rate_silver`**.

<img src="https://files.training.databricks.com/images/ade/ADE_arch_heartrate_silver.png" width="60%" />

In [0]:
%sql
SELECT v.*
FROM (
  SELECT from_json(cast(value AS STRING), "device_id LONG, time TIMESTAMP, heartrate DOUBLE") v
  FROM bronze
  WHERE topic = "bpm")

device_id,time,heartrate
116527,2019-12-01T20:20:22.000+0000,77.08200373807904
180027,2019-12-01T20:20:31.000+0000,76.69983509282119
115643,2019-12-01T20:20:34.000+0000,119.46346372281636
161910,2019-12-01T20:20:44.000+0000,77.25766807945315
182045,2019-12-01T20:21:04.000+0000,89.42140280992295
116708,2019-12-01T20:21:03.000+0000,78.60701992694962
145943,2019-12-01T20:21:01.000+0000,83.92139704034682
145943,2019-12-01T20:21:11.000+0000,81.31487773309162
109290,2019-12-01T20:21:13.000+0000,65.91387045512424
119715,2019-12-01T20:21:14.000+0000,63.24737129781295


## Convert Logic for Streaming Read

We can define a streaming read directly against our Delta table. Note that most configuration for streaming queries is done on write rather than read, so here we see little change to our above logic.

The cell below shows how to convert a static table into a streaming temp view (if you wish to write streaming queries with Spark SQL).

In [0]:
(spark.readStream
      .table("bronze")
      .createOrReplaceTempView("TEMP_bronze"))

Updating our above query to refer to this temp view gives us a streaming result.

In [0]:
%sql
SELECT v.*
FROM (
  SELECT from_json(cast(value AS STRING), "device_id LONG, time TIMESTAMP, heartrate DOUBLE") v
  FROM TEMP_bronze
  WHERE topic = "bpm")

device_id,time,heartrate
116527,2019-12-01T20:20:22.000+0000,77.08200373807904
180027,2019-12-01T20:20:31.000+0000,76.69983509282119
115643,2019-12-01T20:20:34.000+0000,119.46346372281636
161910,2019-12-01T20:20:44.000+0000,77.25766807945315
182045,2019-12-01T20:21:04.000+0000,89.42140280992295
116708,2019-12-01T20:21:03.000+0000,78.60701992694962
145943,2019-12-01T20:21:01.000+0000,83.92139704034682
145943,2019-12-01T20:21:11.000+0000,81.31487773309162
109290,2019-12-01T20:21:13.000+0000,65.91387045512424
119715,2019-12-01T20:21:14.000+0000,63.24737129781295


Note that any time a streaming read is displayed in a notebook, a streaming job begins. If allowed to run indefinitely, it will prevent the cluster from auto-terminating. You can stop the stream by clicking the "Cancel" link in the cell above, clicking "Stop Execution" at the top of the notebook, or running the code below.

Stop the streaming display above before continuing.

In [0]:
for stream in spark.streams.active:
    stream.stop()
    stream.awaitTermination()

To persist results to disk, a streaming write will need to be performed using Python.  We can switch from SQL to Python by using a temporary view as an intermediary to capture the query we want to apply.

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW TEMP_SILVER AS
  SELECT v.*
  FROM (
    SELECT from_json(cast(value AS STRING), "device_id LONG, time TIMESTAMP, heartrate DOUBLE") v
    FROM TEMP_bronze
    WHERE topic = "bpm")

Read from the streaming **`TEMP_SILVER`** temporary view and write to the **`heart_rate_silver`** Delta table.

Using the **`trigger(availableNow=True)`** option will process all records (in multiple batches if needed) until no more data is available and then stop the stream.

In [0]:
query = (spark.table("TEMP_SILVER").writeStream
               .option("checkpointLocation", f"{DA.paths.checkpoints}/heart_rate")
               .option("path", f"{DA.paths.user_db}/heart_rate_silver.delta")
               .trigger(availableNow=True)
               .table("heart_rate_silver"))

query.awaitTermination()
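The behavior described for **`trigger(availableNow=True)`** can be pictured with a toy model in plain Python. This is only a sketch of the idea, not Spark's implementation: the function `run_available_now`, the `max_batch_size` parameter, and the dictionary standing in for a checkpoint are all hypothetical.

```python
def run_available_now(source, checkpoint, sink, max_batch_size=3):
    """Process what is currently in `source` in bounded batches, then stop."""
    end = len(source)  # snapshot of what is available when the run starts
    batches = 0
    while checkpoint["offset"] < end:
        start = checkpoint["offset"]
        stop = min(start + max_batch_size, end)
        sink.extend(source[start:stop])  # "write" one batch
        checkpoint["offset"] = stop      # record progress after each batch
        batches += 1
    return batches

source = list(range(7))
checkpoint = {"offset": 0}
sink = []

first_run = run_available_now(source, checkpoint, sink)   # batches of 3, 3, 1
source.extend([7, 8])                                      # new data arrives later
second_run = run_available_now(source, checkpoint, sink)  # only the new records
print(first_run, second_run, sink)
```

In this model, the stored offset is why a second run picks up only records that arrived after the first run, which is the role the checkpoint location plays for the streaming write above.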

Alternatively, instead of using SQL, the entire job can be expressed using the Python DataFrame API. The cell below shows the same logic refactored to Python.

In [0]:
from pyspark.sql import functions as F

json_schema = "device_id LONG, time TIMESTAMP, heartrate DOUBLE"

# Capture the StreamingQuery handle so awaitTermination() waits on this write
query = (spark
   .readStream.table("bronze")
   .filter("topic = 'bpm'")                   # keep only heart rate records
   .select(F.from_json(F.col("value").cast("string"), json_schema).alias("v"))
   .select("v.*")                             # flatten the parsed struct
   .writeStream
       .option("checkpointLocation", f"{DA.paths.checkpoints}/heart_rate")
       .option("path", f"{DA.paths.user_db}/heart_rate_silver.delta")
       .trigger(availableNow=True)
       .table("heart_rate_silver"))

query.awaitTermination()

<img src="https://files.training.databricks.com/images/icon_warn_32.png"> Before continuing, make sure you cancel any streams. The **`Run All`** button at the top of the screen will say **`Stop Execution`** if you have a stream still running. Alternatively, run the code below to stop all streams that are currently active in this notebook's Spark session.

In [0]:
for stream in spark.streams.active:
    stream.stop()
    stream.awaitTermination()

## Silver Table Motivations

In addition to parsing records, flattening nested fields, and changing our schema, we should also check the quality of our data before writing to our silver tables.

In the following notebooks, we'll review various quality checks.

Run the following cell to delete the tables and files associated with this lesson.

In [0]:
DA.cleanup()

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>