# Databricks Delta Streaming
Databricks&reg; Delta allows you to work with streaming data.

In this lesson, you will use Databricks Delta to read and write streaming data into a data lake.

### CAUTION
* Do not use <b>RunAll</b> mode (next to <b>Permissions</b>, in the menu at the top).

### Getting Started

Run the following cell to configure your module.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Remember to attach your notebook to a cluster before running any cells. In the notebook's toolbar, select the drop-down arrow next to <b>Detached</b>, then select your cluster under <b>Attach to</b>.

![Attached to cluster](https://databricksdemostore.blob.core.windows.net/images/03/03/databricks-cluster-attach.png)

In [3]:
%run "./includes/Module-Setup"

Set up relevant paths.

In [5]:
dataPath = "/mnt/training/definitive-guide/data/activity-data"
outputPath = userhome + "/gaming"
basePath = userhome + "/advanced-streaming"
checkpointPath = basePath + "/checkpoints"
activityPath = basePath + "/activityCount"

## Streaming Concepts

<b>Stream processing</b> is where you continuously incorporate new data into a data lake and compute results.

New data keeps arriving, possibly faster than a periodic batch job could consume it.

Treat a <b>stream</b> of data as a table to which data is continuously appended.

In this course we assume Spark Structured Streaming, which uses the DataFrame API.

There are other kinds of streaming systems.

<div><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/Delta/stream2rows.png" style="height: 300px"/></div><br/>
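The "table that keeps growing" idea can be sketched in plain Python, without Spark. Each micro-batch of rows is appended to a table, and a result (here, a count per `gt` value) is updated incrementally. The `UnboundedTable` class and the sample rows are hypothetical; this illustrates the mental model only, not how Structured Streaming is implemented.

```python
from collections import Counter


class UnboundedTable:
    """Toy model of a stream: a table that only ever grows by appends."""

    def __init__(self):
        self.rows = []           # every row received so far
        self.counts = Counter()  # running result, updated incrementally

    def append_batch(self, batch):
        """Append one micro-batch and update the running count per 'gt'."""
        self.rows.extend(batch)
        self.counts.update(row["gt"] for row in batch)


# Hypothetical rows for illustration only.
table = UnboundedTable()
table.append_batch([{"gt": "walk"}, {"gt": "sit"}])
table.append_batch([{"gt": "walk"}])
print(len(table.rows), dict(table.counts))
```

The result is never recomputed from scratch: each batch only adjusts the running counts, which is the behavior a streaming aggregation aims for.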

Examples are bank card transactions, Internet of Things (IoT) device data, and video game play events. 

Data coming from a stream is typically not ordered in any way.

A streaming system consists of:
* <b>Input sources</b> such as Kafka, Azure Event Hubs, files on a distributed system, or TCP/IP sockets
* <b>Sinks</b> such as Kafka, Azure Event Hubs, various file formats, `foreach` sinks, console sinks, or memory sinks

### Streaming and Databricks Delta

In streaming, the problems of traditional data pipelines are exacerbated. 

Specifically, streaming brings frequent metadata refreshes, table repairs, and an accumulation of small files every second or minute.

Many small files result because data may be streamed in at low volumes with short triggers.

Databricks Delta is uniquely designed to address these needs.
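To get a feel for how quickly small files accumulate, here is a back-of-the-envelope sketch. The function `files_written` and the numbers are hypothetical, and it assumes every trigger fires and writes a fixed number of files, which real workloads will not match exactly.

```python
def files_written(trigger_seconds, files_per_trigger, hours):
    """Estimate how many files a stream writes over a period.

    Assumes every trigger fires and writes `files_per_trigger` files.
    """
    triggers = (hours * 3600) // trigger_seconds
    return triggers * files_per_trigger


# One file every 10 seconds for a day (illustrative numbers).
print(files_written(trigger_seconds=10, files_per_trigger=1, hours=24))
```

Under these assumptions, a single stream with a 10-second trigger produces 8,640 files per day.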

### READ Stream using Databricks Delta

The `readStream` method is a <b>transformation</b> that outputs a DataFrame with the schema specified by `.schema()`.

Each line of the streaming data becomes a row in the DataFrame once an <b>action</b> such as `writeStream` is invoked.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> In this lesson, we limit the flow of the stream to one file per trigger with `option("maxFilesPerTrigger", 1)` so that you do not exceed any file quotas you may have on your end. For Delta sources, the default value is 1000.

Notice that nothing happens until you engage an action, i.e. a `writeStream` operation a few cells down.

Do some data normalization as well:
* Convert `Arrival_Time` to `timestamp` format.
* Rename `Index` to `User_ID`.

In [8]:
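# Read the data once as a static DataFrame to obtain its schema.
static = spark.read.json(dataPath)
dataSchema = static.schema

# Stream the raw JSON files. .json() sets the source format itself,
# so a separate .format() call would be overridden.
deltaStreamWithTimestampDF = (spark
  .readStream
  .option("maxFilesPerTrigger", 1)   # read one file per trigger
  .schema(dataSchema)                # streaming file sources need a schema by default
  .json(dataPath)
  .withColumnRenamed('Index', 'User_ID')
  .selectExpr("*","cast(cast(Arrival_Time as double)/1000000000 as timestamp) as event_time")
)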
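The two normalization steps can be illustrated on a single record in plain Python. This is a sketch, not part of the lesson's pipeline: `normalize` and the sample record are hypothetical, and it assumes `Arrival_Time` holds epoch nanoseconds, as the division by 1,000,000,000 in the cell above implies.

```python
from datetime import datetime, timezone


def normalize(record):
    """Rename 'Index' to 'User_ID' and add 'event_time' from 'Arrival_Time'.

    Assumes 'Arrival_Time' holds epoch nanoseconds.
    """
    out = {key: value for key, value in record.items() if key != "Index"}
    out["User_ID"] = record["Index"]
    seconds = record["Arrival_Time"] / 1_000_000_000
    out["event_time"] = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return out


# Hypothetical record for illustration only.
row = normalize({"Index": 7, "Arrival_Time": 1_000_000_000_000_000_000, "gt": "walk"})
print(row["User_ID"], row["event_time"].isoformat())
```

As with `selectExpr("*", ...)`, the original `Arrival_Time` value is kept and `event_time` is added alongside it.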

### WRITE Stream using Databricks Delta

#### General Notation
Use this format to write a streaming job to a Databricks Delta table.

> `(myDF` <br>
  `.writeStream` <br>
  `.format("delta")` <br>
  `.option("checkpointLocation", checkpointPath)` <br>
  `.outputMode("append")` <br>
  `.table("my_table")` or `.start(path)` <br>
`)`

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> If you use the `.table()` notation, it will write output to a default location. 

In this course, we want everyone to write data to their own directory, so we use the `.start()` notation instead.

#### Output Modes
Besides the "obvious" parameters, you also specify `outputMode`, which can take on these values:
* `append`: add only new records to the output sink
* `complete`: rewrite the full output; applicable to aggregation operations
* `update`: update changed records in place
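The difference between the three modes can be sketched with a small plain-Python simulation of what a sink receives after each micro-batch. The function `simulate_sink` and the sample batches are hypothetical. The model is simplified: `complete` and `update` are shown for a running count per key, while `append` is shown for un-aggregated rows, because in Spark `append` on an aggregation additionally requires a watermark, which is not modeled here.

```python
from collections import Counter


def simulate_sink(batches, output_mode):
    """Return what the sink receives after each micro-batch."""
    totals = Counter()  # running count per key across all batches
    emitted = []
    for batch in batches:
        totals.update(batch)
        if output_mode == "append":
            emitted.append(list(batch))                    # only the new rows
        elif output_mode == "complete":
            emitted.append(dict(totals))                   # the full result table
        elif output_mode == "update":
            emitted.append({k: totals[k] for k in batch})  # only changed keys
        else:
            raise ValueError(f"unknown output mode: {output_mode}")
    return emitted


# Hypothetical micro-batches of 'gt' values.
batches = [["walk", "sit"], ["walk"]]
for mode in ("append", "complete", "update"):
    print(mode, simulate_sink(batches, mode))
```

After the second batch, `complete` re-emits the count for `sit` even though it did not change, while `update` emits only the new count for `walk`.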

#### Checkpointing

When defining a Delta streaming query, one of the options that you need to specify is the location of a checkpoint directory.

`.writeStream.format("delta").option("checkpointLocation", <path-to-checkpoint-directory>) ...`

This is actually a Structured Streaming feature rather than something specific to Delta. It stores the current state of your streaming job.

Should your streaming job stop for some reason and you restart it, it will continue from where it left off.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> If you do not have a checkpoint directory, when the streaming job stops, you lose all state around your streaming job and upon restart, you start from scratch.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Also note that every streaming job should have its own checkpoint directory: no sharing.
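The restart behavior can be illustrated with a toy job that records its progress in a file. This is a plain-Python sketch of the idea only: `run_job`, the `offsets.json` file name, and the event list are hypothetical, and a real Structured Streaming checkpoint stores considerably more than a single offset.

```python
import json
import os
import tempfile


def run_job(events, checkpoint_file, max_events=None):
    """Process events in order, persisting the next offset after each one."""
    offset = 0
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file) as f:
            offset = json.load(f)["offset"]  # resume where the last run stopped

    processed = []
    for event in events[offset:]:
        if max_events is not None and len(processed) == max_events:
            break  # simulate the job stopping part-way through
        processed.append(event)
        offset += 1
        with open(checkpoint_file, "w") as f:
            json.dump({"offset": offset}, f)
    return processed


events = ["e0", "e1", "e2", "e3", "e4"]
with tempfile.TemporaryDirectory() as checkpoint_dir:
    checkpoint_file = os.path.join(checkpoint_dir, "offsets.json")
    first_run = run_job(events, checkpoint_file, max_events=2)
    second_run = run_job(events, checkpoint_file)
print(first_run, second_run)
```

The second run picks up at the third event because the offset survived the stop. Without the file, it would start again from the first event, and two jobs writing to the same file would overwrite each other's progress.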

### Let's Do Some Streaming

In the cell below, we write a streaming query to a Databricks Delta table.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Notice how we do not need to specify a schema: it is inferred from the data!

In [11]:
deltaStreamingQuery = (deltaStreamWithTimestampDF
  .writeStream
  .format("delta")
  .option("checkpointLocation", checkpointPath)
  .outputMode("append")
  .start(basePath)
)

View the list of active streams.

In [13]:
spark.streams.active
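Each entry in that list is a streaming query object exposing, among other things, `id`, `name`, and `isActive`. The sketch below formats those attributes into one line per query. The helper `describe_streams` is hypothetical, and stand-in objects are used so that it runs without a cluster.

```python
from types import SimpleNamespace


def describe_streams(queries):
    """Return one summary line per streaming query."""
    return [
        f"id={query.id} name={query.name} active={query.isActive}"
        for query in queries
    ]


# Stand-in objects so the sketch runs without a cluster; in a notebook,
# pass spark.streams.active instead.
fake_queries = [SimpleNamespace(id="1234", name=None, isActive=True)]
for line in describe_streams(fake_queries):
    print(line)
```

A query started without an explicit name reports `None` for `name`, as the stand-in does here.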

### Table-to-Table Stream

Here we read a stream of data from `basePath` and write another stream to `activityPath`.

The data consists of a count of events grouped by `gt`.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Make sure the stream using `deltaStreamingQuery` is still running!

To perform an aggregate operation, what kind of `outputMode` should you use?

In [15]:
activityCountsQuery = (spark.readStream
  .format("delta")
  .load(str(basePath))   
  .groupBy("gt")
  .count()
  .writeStream
  .format("delta")
  .option("checkpointLocation", checkpointPath + "/activityCount")
  .outputMode("complete")
  .start(activityPath)
)

Once the stream has started, run the cell below to test your solution.

In [17]:
# Run this cell to test that your stream has started.
activityQueryTruth = spark.streams.get(activityCountsQuery.id).isActive

dbTest("Delta-05-activityCountsQuery", True, activityQueryTruth)

print("Tests passed!")

Plot the occurrence of all events grouped by `gt`.

In order to create a LIVE bar chart of the data, you'll need to fill out the <b>Plot Options</b> as shown:

<div><img src="https://s3-us-west-2.amazonaws.com/files.training.databricks.com/images/eLearning/Delta/ch5-plot-options.png"/></div><br/>

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> In the cell below, we use the `withWatermark` and `window` methods, which aren't covered in this course. 

To learn more about watermarking, please see <a href="https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html" target="_blank">Event-time Aggregation and Watermarking in Apache Spark’s Structured Streaming</a>.

In [19]:
from pyspark.sql.functions import hour, window, col

countsDF = (deltaStreamWithTimestampDF
  .withWatermark("event_time", "180 minutes")       # tolerate events up to 180 minutes late
  .groupBy(window("event_time", "60 minute"),"gt")  # 60-minute tumbling windows per gt value
  .count()
  .withColumn('hour',hour(col('window.start')))
)

# countsDF already contains the 'hour' column, so display it directly.
display(countsDF)
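For intuition about what `window("event_time", "60 minute")` groups together, here is a plain-Python sketch of tumbling-window assignment: each event time is floored to the start of its 60-minute window, and events are counted per window and `gt` value. The helpers `window_start` and `count_by_window` and the sample events are hypothetical, and watermarking is not modeled.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(event_time, width=timedelta(minutes=60)):
    """Return the start of the tumbling window containing event_time."""
    return EPOCH + ((event_time - EPOCH) // width) * width


def count_by_window(events):
    """Count events per (window start, gt) pair."""
    return Counter((window_start(ts), gt) for ts, gt in events)


# Hypothetical events for illustration only.
events = [
    (datetime(2015, 2, 23, 10, 18, tzinfo=timezone.utc), "walk"),
    (datetime(2015, 2, 23, 10, 55, tzinfo=timezone.utc), "walk"),
    (datetime(2015, 2, 23, 11, 5, tzinfo=timezone.utc), "sit"),
]
counts = count_by_window(events)
for (start, gt), n in sorted(counts.items()):
    print(start.hour, gt, n)
```

The two `walk` events fall into the window starting at 10:00 and are counted together, while the `sit` event lands in the 11:00 window.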

In [20]:
# compare schemas of countsDF
from pyspark.sql.types import StructField, StructType, StringType, LongType, IntegerType, TimestampType

expectedSchema = StructType([
   StructField("gt",StringType(), True),
   StructField("count",LongType(), False),
   StructField("hour",IntegerType(), True),
   StructField("window", StructType([
       StructField("start",TimestampType(), True),
       StructField("end",TimestampType(), True)
   ]), False)
])

dbTest("Delta-05-schemas", set(expectedSchema), set(countsDF.schema))

print("Tests passed!")

### Shut Down

Stop all the running streams.

In [22]:
for streamingQuery in spark.streams.active:
  streamingQuery.stop()

In [23]:
# TEST - Run this cell to test that all the streams were stopped.
numActiveStreams = len(spark.streams.active)
dbTest("Delta-05-numActiveStreams", 0, numActiveStreams)

print("Tests passed!")

## Summary

Databricks Delta is ideally suited for use in streaming data lake contexts.

## Review Questions
**Q:** Why is Databricks Delta so important for a data lake that incorporates streaming data?<br>
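**A:** Streaming exacerbates the problems of traditional data pipelines: frequent metadata refreshes, table repairs, and an accumulation of small files every second or minute. Databricks Delta is designed to address these.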

**Q:** What happens if you shut off your stream before it has fully initialized and started and you try to `CREATE TABLE .. USING DELTA` ? <br>
**A:** You will get this: `Error in SQL statement: AnalysisException: The user specified schema is empty;`.

**Q:** When you run a write stream command, what does the `outputMode("append")` option do?<br>
**A:** It adds only new records to the output sink. The `outputMode` option takes the following values:
* <b>append</b>: add only new records to the output sink
* <b>complete</b>: rewrite the full output; applicable to aggregation operations
* <b>update</b>: update changed records in place

**Q:** What happens if you do not specify `option("checkpointLocation", pointer-to-checkpoint directory)`?<br>
**A:** When the streaming job stops, you lose all state around your streaming job and upon restart, you start from scratch.

**Q:** How do you view the list of active streams?<br>
**A:** Invoke `spark.streams.active`.

**Q:** How do you verify whether `streamingQuery` is running (boolean output)?<br>
**A:** Invoke `spark.streams.get(streamingQuery.id).isActive`.

### Congratulations! 

You have completed the Streaming module.

#### Additional Topics & Resources

* <a href="https://docs.azuredatabricks.net/delta/delta-streaming.html#as-a-sink" target="_blank">Delta Streaming Write Notation</a>
* <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#" target="_blank">Structured Streaming Programming Guide</a>
* <a href="https://www.youtube.com/watch?v=rl8dIzTpxrI" target="_blank">A Deep Dive into Structured Streaming</a> by Tathagata Das. This is an excellent video describing how Structured Streaming works.