
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

<img src="https://files.training.databricks.com/images/Apache-Spark-Logo_TM_200px.png" style="float: left; margin: 20px"/>

# Structured Streaming

<p>
Structured Streaming is an efficient way to ingest large quantities of data from a variety of sources.<br>
This course teaches you how to use Structured Streaming to ingest data from files and<br>
publish-subscribe systems. Starting with the fundamentals of streaming systems, we introduce concepts<br>
such as reading streaming data, writing streaming data out to directories, displaying streaming data, and<br>
triggers. We discuss the problems associated with aggregating streaming data and then teach<br>
how to solve them using structures called windows and how to expire old data using watermarking.<br>
Finally, we examine how to connect Structured Streaming with popular publish-subscribe systems to<br>
stream data from Wikipedia.
</p>

## Upon completion of this course, students should be able to

* Read, write, and display streaming data.
* Apply time windows and watermarking to aggregate streaming data.
* Use a publish-subscribe system to stream Wikipedia data and visualize meaningful analytics.

## Prerequisites
* Web browser: **Chrome**
* A cluster configured with **8 cores** and **DBR 6.3**
* Suggested Courses from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>:
  - ETL Part 1
  - Spark-SQL


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Before You Start</h2>

Before starting this course, you will need to create a cluster and attach it to this notebook.

Please configure your cluster to use Databricks Runtime version **6.3** which includes:
- Python Version 3.x
- Scala Version 2.11
- Apache Spark 2.4.4

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Do not use an ML or GPU-accelerated runtime.

Step-by-step instructions for creating a cluster are included here:
- <a href="https://www.databricks.training/step-by-step/creating-clusters-on-azure" target="_blank">Azure Databricks</a>
- <a href="https://www.databricks.training/step-by-step/creating-clusters-on-aws" target="_blank">Databricks on AWS</a>
- <a href="https://www.databricks.training/step-by-step/creating-clusters-on-ce" target="_blank">Databricks Community Edition (CE)</a>

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This courseware has been tested against the specific DBR listed above. Using an untested DBR may yield unexpected results and/or various errors. If the required DBR has been deprecated, please <a href="https://academy.databricks.com/" target="_blank">download an updated version of this course</a>.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup & Classroom-Cleanup<br>
In general, all courses are designed to run on one of the following Databricks platforms:
* Databricks Community Edition (CE)
* Databricks (an AWS-hosted service)
* Azure Databricks (an Azure-hosted service)

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Some features are not available on the Community Edition, which limits the ability of some courses to be executed in that environment. Please see the course's prerequisites for specific information on this topic.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Additionally, private installations of Databricks (e.g., accounts provided by your employer) may impose other limitations, such as restrictive permissions or language restrictions (for example, prohibiting the use of Scala), which will further prevent some courses from being executed in those environments.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** All courses provided by Databricks Academy rely on custom variables, functions, and settings to provide you with the best experience possible.

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

%run "./Includes/Classroom-Setup"

<iframe  
src="//fast.wistia.net/embed/iframe/mkslslo1zl?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/mkslslo1zl?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> The Problem</h2>

We have a stream of data coming in from a TCP/IP socket, Kafka, Kinesis, or another source.

The data is coming in faster than it can be consumed.

How do we solve this problem?

<img src="https://files.training.databricks.com/images/drinking-from-the-fire-hose.png"  style="height: 300px;"/>


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> The Micro-Batch Model</h2>

Many APIs solve this problem by employing a Micro-Batch model.

In this model, we take our firehose of data and collect data for a set interval of time (the **Trigger Interval**).

In our example, the **Trigger Interval** is two seconds.

<img src="https://files.training.databricks.com/images/streaming-timeline.png" style="height: 150px;"/>
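
The collection step can be illustrated with a minimal plain-Python sketch. It is only a conceptual model, not how Spark is implemented: the function name `to_micro_batches` and the sample events are hypothetical, and each event is assumed to carry an arrival time in seconds.

```python
from collections import defaultdict

TRIGGER_INTERVAL = 2.0  # seconds, matching the example above


def to_micro_batches(events, trigger_interval=TRIGGER_INTERVAL):
    """Group (arrival_time_seconds, payload) events by trigger interval.

    Returns a list of (batch_start_time, [payloads]) sorted by start time.
    """
    batches = defaultdict(list)
    for arrival_time, payload in events:
        # Floor division finds the interval this event falls into.
        batch_index = int(arrival_time // trigger_interval)
        batches[batch_index].append(payload)
    return [(index * trigger_interval, batches[index]) for index in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.1, "a"), (0.9, "b"), (1.5, "c"), (2.2, "d"), (3.9, "e"), (4.0, "f")]

micro_batches = to_micro_batches(events)
for batch_start, payloads in micro_batches:
    batch_end = batch_start + TRIGGER_INTERVAL
    print(f"[{batch_start:.0f}s, {batch_end:.0f}s): {payloads}")
```

Every event that arrives within the same two-second window lands in the same micro-batch, so the unit of work becomes the batch rather than the individual record.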


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Processing the Micro-Batch</h2>

For each interval, our job is to process the data from the previous [two-second] interval.

As we are processing data, the next batch of data is being collected for us.

In our example, we are processing two seconds' worth of data in about one second.

<img src="https://files.training.databricks.com/images/streaming-timeline-1-sec.png" style="height: 150px;">

### What happens if we don't process the data fast enough when reading from a TCP/IP Stream?

<html>
  <head>
    <script src="https://files.training.databricks.com/static/assets/spark-ilt/labs.js"></script>
    <link rel="stylesheet" type="text/css" href="https://files.training.databricks.com/static/assets/spark-ilt/labs.css">
  </head>
  <body>
    <div id="the-button"><button style="width:15em" onclick="block('the-answer', 'the-button')">Continue Reading</button></div>

    <div id="the-answer" style="display:none">
      <p>In the case of a TCP/IP stream, we will most likely drop packets.</p>
      <p>In other words, we would be losing data.</p>
      <p>If this is an IoT device measuring the outside temperature every 15 seconds, this might be OK.</p>
      <p>If this is a critical shift in stock prices, you could be out thousands of dollars.</p>
    </div>
  </body>
</html>

### What happens if we don't process the data fast enough when reading from a pubsub system like Kafka?

<html>
  <head>
    <script src="https://files.training.databricks.com/static/assets/spark-ilt/labs.js"></script>
    <link rel="stylesheet" type="text/css" href="https://files.training.databricks.com/static/assets/spark-ilt/labs.css">
  </head>
  <body>
    <div id="the-button"><button style="width:15em" onclick="block('the-answer', 'the-button')">Continue Reading</button></div>

    <div id="the-answer" style="display:none">
      <p>In the case of a pubsub system, it simply means we fall further behind.</p>
      <p>Eventually, the pubsub system would reach resource limits inducing other problems.</p>
      <p>However, we can always re-launch the cluster with enough cores to catch up and stay current.</p>
    </div>
  </body>
</html>

Our goal is simply to process the data for the previous interval before data from the next interval arrives.
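
To make this goal concrete, here is a small plain-Python sketch of what happens to the backlog when processing capacity is above or below the arrival rate. The function `simulate_backlog` and the numbers are hypothetical, chosen only for illustration. With a pubsub system the backlog is retained by that system; with a TCP/IP stream the same backlog would typically represent lost data.

```python
def simulate_backlog(arrivals_per_interval, capacity_per_interval, intervals):
    """Return the number of unprocessed records after each trigger interval."""
    backlog = 0
    history = []
    for _ in range(intervals):
        # New records collected during this interval.
        backlog += arrivals_per_interval
        # We can process at most `capacity_per_interval` records per interval.
        backlog -= min(backlog, capacity_per_interval)
        history.append(backlog)
    return history


# Capacity exceeds the arrival rate: each interval is fully processed in time.
keeping_up = simulate_backlog(arrivals_per_interval=100, capacity_per_interval=200, intervals=5)

# Capacity is below the arrival rate: we fall further behind every interval.
falling_behind = simulate_backlog(arrivals_per_interval=100, capacity_per_interval=80, intervals=5)

print(keeping_up)      # [0, 0, 0, 0, 0]
print(falling_behind)  # [20, 40, 60, 80, 100]
```

Raising `capacity_per_interval` above `arrivals_per_interval` is the sketch's equivalent of re-launching the cluster with enough cores to catch up.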


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> From Micro-Batch to Table</h2>

In Apache Spark, we treat such a stream of **micro-batches** as continuous updates to a table.

The developer then defines a query on this **input table**, as if it were a static table.

The computation on the input table is then pushed to a **results table**.

And finally, the results table is written to an output **sink**. 

<img src="https://files.training.databricks.com/images/eLearning/Delta/stream2rows.png" style="height: 300px"/>
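
The flow from micro-batch to input table, results table, and sink can be sketched in plain Python. This is a conceptual model only: the sample rows, the `query` function, and the list standing in for a sink are all hypothetical, and the sketch recomputes the query over the whole input table for clarity, whereas Spark processes new data incrementally.

```python
from collections import Counter

input_table = []  # grows as each micro-batch arrives
sink = []         # stands in for an output sink such as files or the console


def query(table):
    """The query, written as if the input table were static: count rows per action."""
    return dict(Counter(row["action"] for row in table))


# Hypothetical micro-batches of event rows.
micro_batches = [
    [{"user": "u1", "action": "open"}, {"user": "u2", "action": "open"}],
    [{"user": "u1", "action": "close"}],
    [{"user": "u3", "action": "open"}, {"user": "u2", "action": "close"}],
]

for batch_id, batch in enumerate(micro_batches):
    input_table.extend(batch)              # new rows are appended to the input table
    results_table = query(input_table)     # the query produces the results table
    sink.append((batch_id, results_table)) # the results table is written to the sink
    print(batch_id, results_table)
```

The query itself never mentions batches or streaming; only the surrounding loop does, which is the point of treating the stream as a table.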

In general, Spark Structured Streams consist of two parts:
* The **input source**, such as
  * Kafka
  * Azure Event Hubs
  * Files on a distributed system
  * TCP/IP sockets
* And the **sinks**, such as
  * Kafka
  * Azure Event Hubs
  * Various file formats
  * The system console
  * Apache Spark tables (memory sinks)
  * The completely custom `foreach()` iterator

### Update Triggers
Developers define **triggers** to control how frequently the **input table** is updated. 

Each time a trigger fires, Spark checks for new data (new rows for the input table), and updates the result.

From the docs for `DataStreamWriter.trigger(Trigger)`:
> The default value is ProcessingTime(0) and it will run the query as fast as possible.

And the process repeats in perpetuity.
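
The effect of the trigger interval on scheduling can be sketched with a simplified plain-Python model. It is not Spark's scheduler: the function `trigger_times` is hypothetical, and the rule it encodes (wait for the next interval boundary, or start immediately if a boundary was missed) is an assumption based on how the Structured Streaming documentation describes fixed-interval micro-batches.

```python
def trigger_times(interval, batch_durations):
    """Return the start time of each micro-batch under a simplified trigger model.

    interval == 0 models "as fast as possible": the next batch starts the
    moment the previous one finishes. Otherwise the next batch waits for the
    next interval boundary, or starts immediately if a boundary was missed.
    """
    now = 0.0
    starts = []
    for duration in batch_durations:
        starts.append(now)
        finished = now + duration
        if interval == 0:
            now = finished
        else:
            # First interval boundary strictly after this batch's start time.
            next_boundary = (now // interval + 1) * interval
            now = max(finished, next_boundary)
    return starts


# Hypothetical processing times (seconds) for four consecutive micro-batches.
durations = [1, 1, 3, 1]

as_fast_as_possible = trigger_times(0, durations)
every_two_seconds = trigger_times(2, durations)

print(as_fast_as_possible)  # [0.0, 1.0, 2.0, 5.0]
print(every_two_seconds)    # [0.0, 2.0, 4.0, 7.0]
```

In PySpark, a two-second interval is requested by calling `.trigger(processingTime="2 seconds")` on a `DataStreamWriter`. Omitting the call keeps the default described in the quote above.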

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Summary</h2>

Use cases for streaming include bank card transactions, log files, Internet of Things (IoT) device data, video game play events and countless others.

Some key properties of streaming data include:
* Data coming from a stream is typically not ordered in any way
* The data is streamed into a **data lake**
* The data is coming in faster than it can be consumed
* Streams are often chained together to form a data pipeline
* Streams don't have to run 24/7:
  * Consider the new log files that are processed once an hour
  * Or the financial statement that is processed once a month

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Review Questions</h2>

**Question:** What is Structured Streaming?<br>
**Answer:** A stream is a sequence of data that is made available over time.<br>
In Structured Streaming, we treat a <b>stream</b> of data as a table to which data is continuously appended.<br>
The developer then defines a query on this input table, as if it were a static table, to compute a final result table that will be written to an output <b>sink</b>.

**Question:** What purpose do triggers serve?<br>
**Answer:** Developers define triggers to control how frequently the input table is updated.

**Question:** How does the micro-batch model work?<br>
**Answer:** We take our firehose of data and collect data for a set interval of time (the Trigger Interval).<br>
For each interval, our job is to process the data from the previous time interval.<br>
As we are processing data, the next batch of data is being collected for us.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

During the course of this lesson, files, tables, and other artifacts may have been created.

These resources create clutter, consume resources (generally in the form of storage), and may potentially incur some [minor] long-term expense.

You can remove these artifacts by running the **`Classroom-Cleanup`** cell below.

%run ./Includes/Classroom-Cleanup

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Next Steps</h2>

Start the next lesson, [Streaming Concepts]($./SS 02 - Streaming Concepts).

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>