
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Databricks Delta Streaming
Databricks&reg; Delta allows you to work with streaming data.

## In this lesson you:
* Read and write streaming data into a data lake

## Audience
* Primary Audience: Data Engineers 
* Secondary Audience: Data Analysts and Data Scientists

## Prerequisites
* Web browser: **Chrome**
* A cluster configured with **8 cores** and **DBR 6.2**
* Suggested Courses from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>:
  - ETL Part 1
  - Spark-SQL
  - Structured Streaming

## Datasets Used
Data from 
`/mnt/training/definitive-guide/data/activity-data`
contains smartphone accelerometer samples from all devices and users. 

The file consists of the following columns:

| Field           | Description                  |
|-----------------|------------------------------|
| `Arrival_Time`  | when record came in          |
| `Creation_Time` | when record was created      |
| `Index`         | unique identifier of event   |
| `x`             | acceleration in x-dir        |
| `y`             | acceleration in y-dir        |
| `z`             | acceleration in z-dir        |
| `User`          | unique user identifier       |
| `Model`         | e.g. Nexus4                  |
| `Device`        | type of Model                |
| `gt`            | ground truth                 |

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The last column, `gt`, stands for "ground truth":
* What the person was actually doing when the measurement was taken, for example walking or going up stairs.
* Wikipedia: <a href="https://en.wikipedia.org/wiki/Ground_truth" target="_blank">Ground Truth</a>

## CAUTION
* If you are using Community Edition, streaming jobs will quickly exhaust your disk quota. If you find yourself running out of disk space, 
  you can [mount your own S3 bucket](https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-an-s3-bucket) and change
  your streaming paths accordingly.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the<br/>
start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [4]:
%run "./Includes/Classroom-Setup"

<iframe  
src="//fast.wistia.net/embed/iframe/cpilwjdr38?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/cpilwjdr38?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

Set up relevant paths.

In [7]:
dataPath = "dbfs:/mnt/training/definitive-guide/data/activity-data"

In [8]:

print(dataPath)


## Streaming Concepts

<b>Stream processing</b> is where you continuously incorporate new data into a data lake and compute results.

The data is coming in faster than it can be consumed.

<div><img src="https://files.training.databricks.com/images/eLearning/Delta/firehose.jpeg" style="height: 200px"/></div><br/>

Treat a <b>stream</b> of data as a table to which data is continuously appended. 

In this course we are assuming Databricks Structured Streaming, which uses the DataFrame API. 

There are other kinds of streaming systems.

<div><img src="https://files.training.databricks.com/images/eLearning/Delta/stream2rows.png" style="height: 300px"/></div><br/>

Examples are bank card transactions, Internet of Things (IoT) device data, and video game play events. 

Data coming from a stream is typically not ordered in any way.

A streaming system consists of:
* <b>Input sources</b> such as Kafka, Azure Event Hubs, files on a distributed file system or TCP/IP sockets
* <b>Sinks</b> such as Kafka, Azure Event Hubs, various file formats, `foreach` sinks, console sinks or memory sinks

### Streaming and Databricks Delta

In streaming, the problems of traditional data pipelines are exacerbated. 

Specifically, frequent metadata refreshes, table repairs and the accumulation of small files can happen every second or every minute.

Many small files result because data may be streamed in at low volumes with short triggers.

Databricks Delta is uniquely designed to address these needs.
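
To get a feel for how quickly small files pile up, here is a back-of-the-envelope sketch in plain Python. It is not a Spark or Delta API. It assumes every trigger receives some data and writes a fixed number of files; the function name `files_per_day` is made up for this illustration.

```python
# Toy arithmetic only: an upper bound on files written per day,
# assuming every trigger has data and writes `files_per_trigger` files.
SECONDS_PER_DAY = 24 * 60 * 60

def files_per_day(trigger_interval_s: int, files_per_trigger: int = 1) -> int:
    """Return how many files accumulate in one day at the given trigger interval."""
    triggers_per_day = SECONDS_PER_DAY // trigger_interval_s
    return triggers_per_day * files_per_trigger

for interval in (1, 60, 3600):
    print("trigger every {:>4}s -> {:>6} files/day".format(interval, files_per_day(interval)))
```

Under these assumptions, shortening the trigger interval from one hour to one second multiplies the daily file count by 3600.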

### READ Stream using Databricks Delta

The `readStream` method is similar to a transformation: it returns a DataFrame with the schema specified by `.schema()`. 

Each line of the streaming data becomes a row in the DataFrame once `writeStream` is invoked.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> In this lesson, we limit the flow of the stream to one file per trigger with `option("maxFilesPerTrigger", 1)` so that you do not exceed file quotas you may have on your end. Without this option, the JSON file source used here places no limit on the number of files per trigger; the default value of 1000 applies when the source is a Databricks Delta table.
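
As a rough mental model of what a per-trigger file limit does, the plain-Python sketch below splits a backlog of files into micro-batches of bounded size. This is not how Spark is implemented; the helper `micro_batches` and the file names are hypothetical and exist only for illustration.

```python
from typing import Iterator, List

def micro_batches(files: List[str], max_files_per_trigger: int) -> Iterator[List[str]]:
    """Yield successive groups of at most `max_files_per_trigger` files."""
    for start in range(0, len(files), max_files_per_trigger):
        yield files[start:start + max_files_per_trigger]

# Hypothetical file names, for illustration only.
pending = ["part-{:05d}.json".format(i) for i in range(5)]

batches_one = list(micro_batches(pending, 1))    # one file per trigger
batches_three = list(micro_batches(pending, 3))  # up to three files per trigger

print(len(batches_one), "triggers with a limit of 1")
print(len(batches_three), "triggers with a limit of 3")
```

A smaller limit means more, smaller micro-batches for the same backlog, which keeps each trigger cheap at the cost of taking longer to catch up.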

Notice that nothing happens until you engage an action, i.e. a `writeStream` operation a few cells down.

Do some data normalization as well:
* Convert `Arrival_Time` to a `timestamp` column named `event_time`.
* Rename `Index` to `User_ID`.

In [12]:

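# Infer the schema once from a static (batch) read, then reuse it for the stream.
static = spark.read.json(dataPath)
dataSchema = static.schema

deltaStreamWithTimestampDF = (spark
  .readStream
  .option("maxFilesPerTrigger", 1)   # process at most one file per trigger
  .schema(dataSchema)
  .json(dataPath)
  .withColumnRenamed('Index', 'User_ID')
  # Dividing by 1000 before the cast treats Arrival_Time as milliseconds since the epoch.
  .selectExpr("*","cast(cast(Arrival_Time as double)/1000 as timestamp) as event_time")
)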
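
The `selectExpr` above divides `Arrival_Time` by 1000 before casting to `timestamp`, which suggests the column holds milliseconds since the Unix epoch. The plain-Python sketch below shows the same arithmetic outside Spark. The sample value is hypothetical, chosen only to illustrate the conversion, and `arrival_time_to_event_time` is a name invented for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def arrival_time_to_event_time(arrival_time_ms: int) -> datetime:
    """Interpret an integer as milliseconds since the Unix epoch (UTC)."""
    return EPOCH + timedelta(milliseconds=arrival_time_ms)

# Hypothetical sample value, not taken from the dataset.
sample_ms = 1424686735175
event_time = arrival_time_to_event_time(sample_ms)
print(event_time.isoformat())
```

If the division by 1000 were omitted, the cast would read the same number as seconds and produce a date tens of thousands of years in the future.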


### WRITE Stream using Databricks Delta

#### General Notation
Use this format to write a streaming job to a Databricks Delta table.

> `(myDF` <br>
  `.writeStream` <br>
  `.format("delta")` <br>
  `.option("checkpointLocation", somePath)` <br>
  `.outputMode("append")` <br>
  `.table("my_table")` or `.start(path)` <br>
`)`

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> If you use the `.table()` notation, it will write output to a default location. 
* This would be in parquet files under `/user/hive/warehouse/default.db/my_table`

In this course, we want everyone to write data to their own directory; so, instead, we use the `.start()` notation.

#### Output Modes
Besides the parameters shown above, you also specify `outputMode`, which takes one of these values:
* `append`: add only new records to the output sink
* `complete`: rewrite the full output; applicable to aggregation operations
* `update`: update changed records in place
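
The difference between the three modes is easiest to see on a tiny example. The sketch below is a toy model in plain Python, not Spark code: it feeds two micro-batches of labels through a running count and records what each mode would hand to the sink after every batch. The function `emit` is hypothetical, and real Structured Streaming adds constraints this model ignores (for instance, which modes a given query or sink accepts).

```python
from collections import Counter
from typing import List

def emit(batches: List[List[str]], output_mode: str) -> list:
    """Return what would be sent to the sink after each micro-batch."""
    totals = Counter()
    emitted = []
    for batch in batches:
        totals.update(batch)
        if output_mode == "append":
            # Only the rows that arrived in this batch (no aggregation).
            emitted.append(list(batch))
        elif output_mode == "complete":
            # The whole running result table, rewritten every time.
            emitted.append(dict(totals))
        elif output_mode == "update":
            # Only the result rows whose value changed in this batch.
            emitted.append({key: totals[key] for key in set(batch)})
        else:
            raise ValueError("unknown output mode: " + output_mode)
    return emitted

batches = [["walk", "sit", "walk"], ["sit", "bike"]]
for mode in ("append", "complete", "update"):
    print(mode, emit(batches, mode))
```

In the second batch, `complete` re-emits the unchanged `walk` count while `update` leaves it out, which is the practical difference between the two for aggregations.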

#### Checkpointing

When defining a Delta streaming query, one of the options that you need to specify is the location of a checkpoint directory.

`.writeStream.format("delta").option("checkpointLocation", <path-to-checkpoint-directory>) ...`

This is actually a Structured Streaming feature. It stores the current state of your streaming job.

Should your streaming job stop for some reason and you restart it, it will continue from where it left off.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> If you do not have a checkpoint directory, when the streaming job stops, you lose all state around your streaming job and upon restart, you start from scratch.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Also note that every streaming job should have its own checkpoint directory: no sharing.
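
The idea behind checkpointing can be sketched without Spark. The toy example below persists a single offset to a JSON file after each record, so that a restarted run resumes where the previous one stopped. It is a deliberately simplified model: the file layout, the `process` function and the `limit` parameter are all invented for this illustration, and a real Structured Streaming checkpoint stores considerably more than one offset.

```python
import json
import os
import tempfile

def process(records, checkpoint_file, limit=None):
    """Process records after the saved offset, saving progress after each one."""
    offset = 0
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file) as f:
            offset = json.load(f)["offset"]
    done = []
    for record in records[offset:]:
        if limit is not None and len(done) == limit:
            break  # simulate the job stopping early
        done.append(record)
        with open(checkpoint_file, "w") as f:
            json.dump({"offset": offset + len(done)}, f)
    return done

records = ["r0", "r1", "r2", "r3", "r4"]
with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "offset.json")
    first_run = process(records, ckpt, limit=2)  # stops after two records
    second_run = process(records, ckpt)          # resumes from the checkpoint
    os.remove(ckpt)                              # lose the checkpoint...
    third_run = process(records, ckpt)           # ...and start from scratch

print(first_run, second_run, third_run)
```

The third run reprocesses every record, which mirrors the caution above about restarting without a checkpoint directory.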

### Let's Do Some Streaming

In the cell below, we write a streaming query to a Databricks Delta table. 

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Notice how we do not need to specify a schema: it is inferred from the data!

And to help us manage our streams better, we will make use of **`untilStreamIsReady()`**, **`stopAllStreams()`** and define the following, **`myStreamName`**:

In [16]:
myStreamName = "lesson05_ps"

In [17]:
writePath =      workingDir + "/output.delta"
checkpointPath = workingDir + "/output.checkpoint"

deltaStreamingQuery = (deltaStreamWithTimestampDF
  .writeStream
  .format("delta")
  .option("checkpointLocation", checkpointPath)
  .outputMode("append")
  .queryName(myStreamName)
  .start(writePath)
)

See list of active streams.

In [19]:
for s in spark.streams.active:
  print("{}: {}".format(s.name, s.id))

Wait until stream is done initializing...

In [21]:
untilStreamIsReady(myStreamName)

In [22]:
stopAllStreams()

# LAB

## Step 1: Table-to-Table Stream

Here we read a stream of data from `writePath` and write another stream to `activityPath`.

The data consists of a grouped count of `gt` events.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Make sure the stream using `deltaStreamingQuery` is still running!

To perform an aggregate operation, what kind of `outputMode` should you use?

In [25]:
# ANSWER
activityPath   = workingDir + "/activityCount.delta"
checkpointPath = workingDir + "/activityCount.checkpoint"

activityCountsQuery = (spark.readStream
  .format("delta")
  .load(str(writePath))   
  .groupBy("gt")
  .count()
  .writeStream
  .format("delta")
  .option("checkpointLocation", checkpointPath)
  .outputMode("complete")
  .queryName(myStreamName)
  .start(activityPath)
)


Wait until stream is done initializing...

In [27]:
untilStreamIsReady(myStreamName)

In [28]:
# TEST - Run this cell to test your solution.
activityQueryTruth = spark.streams.get(activityCountsQuery.id).isActive

dbTest("Delta-05-activityCountsQuery", True, activityQueryTruth)

print("Tests passed!")

In [29]:
stopAllStreams()

## Step 2

Plot the occurrence of all events grouped by `gt`.

Under <b>Plot Options</b>, use the following:
* <b>Series Groupings:</b> `gt`
* <b>Values:</b> `count`

In <b>Display type</b>, use <b>Bar Chart</b> and click <b>Apply</b>.


<div><img src="https://files.training.databricks.com/images/eLearning/Delta/ch5-plot-options.png" style="height: 300px;"/></div><br/>

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> In the cell below, we use the `withWatermark` and `window` methods, which aren't covered in this course. 

To learn more about watermarking, please see <a href="https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html" target="_blank">Event-time Aggregation and Watermarking in Apache Spark’s Structured Streaming</a>.
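
For a first intuition, the plain-Python sketch below mimics a 60-minute tumbling window and a 180-minute watermark, the values used in this lesson's query. It is a simplification and not Spark's implementation: Spark evaluates the watermark per micro-batch and per window, whereas this toy compares a single event time against the newest event time seen. The helper names `window_for` and `is_late` are invented for this illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
WINDOW = timedelta(minutes=60)            # tumbling window length
WATERMARK_DELAY = timedelta(minutes=180)  # how long to wait for late events

def window_for(event_time):
    """Return the [start, end) window, aligned to the epoch, containing event_time."""
    offset = (event_time - EPOCH) % WINDOW
    start = event_time - offset
    return start, start + WINDOW

def is_late(event_time, max_event_time_seen):
    """True if the event is older than the watermark (simplified rule)."""
    watermark = max_event_time_seen - WATERMARK_DELAY
    return event_time < watermark

event = datetime(2015, 2, 23, 10, 18, 55, tzinfo=timezone.utc)
start, end = window_for(event)
print(start.isoformat(), end.isoformat())

newest = datetime(2015, 2, 23, 14, 30, tzinfo=timezone.utc)
print(is_late(event, newest))
```

Every event is assigned to exactly one window, and the watermark bounds how long the engine has to keep old windows open for stragglers.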

In [31]:
# ANSWER
from pyspark.sql.functions import hour, window, col

countsDF = (deltaStreamWithTimestampDF      
  .withWatermark("event_time", "180 minutes")
  .groupBy(window("event_time", "60 minute"), "gt")
  .count()
)

display(countsDF.withColumn('hour',hour(col('window.start'))), streamName = myStreamName)

| window | gt | count | hour |
|--------|----|-------|------|
| List(2015-02-23T10:00:00.000+0000, 2015-02-23T11:00:00.000+0000) | walk | 9513 | 10 |
| List(2015-02-23T10:00:00.000+0000, 2015-02-23T11:00:00.000+0000) | stairsup | 5790 | 10 |
| List(2015-02-24T13:00:00.000+0000, 2015-02-24T14:00:00.000+0000) | sit | 6552 | 13 |
| List(2015-02-24T13:00:00.000+0000, 2015-02-24T14:00:00.000+0000) |  | 7272 | 13 |
| List(2015-02-23T14:00:00.000+0000, 2015-02-23T15:00:00.000+0000) | stairsdown | 6384 | 14 |
| List(2015-02-24T13:00:00.000+0000, 2015-02-24T14:00:00.000+0000) | walk | 9726 | 13 |
| List(2015-02-23T10:00:00.000+0000, 2015-02-23T11:00:00.000+0000) | bike | 6672 | 10 |
| List(2015-02-24T14:00:00.000+0000, 2015-02-24T15:00:00.000+0000) | bike | 18281 | 14 |
| List(2015-02-24T14:00:00.000+0000, 2015-02-24T15:00:00.000+0000) | walk | 9480 | 14 |
| List(2015-02-23T13:00:00.000+0000, 2015-02-23T14:00:00.000+0000) | stairsdown | 6271 | 13 |


In [32]:
untilStreamIsReady(myStreamName)

In [33]:
# TEST - Run this cell to test your solution.
schemaStr = str(countsDF.schema)

dbTest("Assertion #1", 3, len(countsDF.columns))
dbTest("Assertion #2", True, "(gt,StringType,true)" in schemaStr) 
dbTest("Assertion #3", True, "(count,LongType,false)" in schemaStr) 

dbTest("Assertion #5", True, "window,StructType" in schemaStr)
dbTest("Assertion #6", True, "(start,TimestampType,true)" in schemaStr) 
dbTest("Assertion #7", True, "(end,TimestampType,true)" in schemaStr) 

print("Tests passed!")

In [34]:
stopAllStreams()

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [36]:
%run "./Includes/Classroom-Cleanup"

## Summary

In this lesson, we:
* Treated a stream of data as a table to which data is continuously appended. 
* Learned we could write Databricks Delta output to a directory or directly to a table.
* Used the `complete` output mode with aggregation queries written to Databricks Delta.
* Saw that `display` will produce LIVE graphs if we feed it a streaming DataFrame.

## Review Questions
**Q:** Why is Databricks Delta so important for a data lake that incorporates streaming data?<br>
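**A:** Streaming exacerbates the problems of traditional data pipelines: frequent metadata refreshes, table repairs and the accumulation of small files every second or every minute. Databricks Delta is designed to address these needs.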

**Q:** What happens if you shut off your stream before it has fully initialized and started and you try to `CREATE TABLE .. USING DELTA` ? <br>
**A:** You will get this: `Error in SQL statement: AnalysisException: The user specified schema is empty;`.

**Q:** In a write stream command, what does the option `outputMode("append")` do?<br>
**A:** It adds only new records to the output sink. The `outputMode` option takes the following values:
* <b>append</b>: add only new records to the output sink
* <b>complete</b>: rewrite the full output; applicable to aggregation operations
* <b>update</b>: update changed records in place

**Q:** What happens if you do not specify `option("checkpointLocation", pointer-to-checkpoint directory)`?<br>
**A:** When the streaming job stops, you lose all state around your streaming job and upon restart, you start from scratch.

**Q:** How do you view the list of active streams?<br>
**A:** Invoke `spark.streams.active`.

**Q:** How do you verify whether `streamingQuery` is running (boolean output)?<br>
**A:** Invoke `spark.streams.get(streamingQuery.id).isActive`.

## Next Steps

Start the next lesson, [Optimization]($./Delta 06 - Optimization).

## Additional Topics & Resources

* <a href="https://docs.databricks.com/delta/delta-streaming.html#as-a-sink" target="_blank">Delta Streaming Write Notation</a>
* <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#" target="_blank">Structured Streaming Programming Guide</a>
* <a href="https://www.youtube.com/watch?v=rl8dIzTpxrI" target="_blank">A Deep Dive into Structured Streaming</a> by Tathagata Das. This is an excellent video describing how Structured Streaming works.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>