
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

<img src="https://files.training.databricks.com/images/Apache-Spark-Logo_TM_200px.png" style="float: left; margin: 20px"/>

# Streaming ETL

Apache Spark&trade; and Databricks&reg; make it easy to build scalable and fault-tolerant streaming ETL applications.

## In this lesson you:
* Define logic to read from a stream of data
* Perform a basic ETL job on a streaming data source
* Join a stream to historical data
* Write to an always up-to-date Databricks Delta table

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Chrome
* A cluster configured with **8 cores** and **DBR 6.2**
* Course: ETL Part 1 from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>
* Course: ETL Part 2 from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Deprecation Warning

This lesson has been replaced by the Structured Streaming course.

If you have not completed an Instructor-Led or Self-Paced course on Structured Streaming, we encourage you to do so now at <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>.

In addition to the Structured Streaming API and concepts, the prerequisite course also introduces a number of utility methods for controlling streams, which are employed in this lesson:
* **`untilStreamIsReady(name)`**
* **`stopAllStreams()`**

Both of these methods are defined by the call to **`Classroom-Setup`** below and can be found in the notebook **`./Includes/Common-Notebooks/Utility-Methods`**.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the<br/>
start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [5]:
%run "./Includes/Classroom-Setup"

<iframe  
src="//fast.wistia.net/embed/iframe/y36vjvhx5e?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/y36vjvhx5e?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### ETL on Streaming Data

Spark Streaming enables scalable and fault-tolerant ETL operations that continuously clean and aggregate data before pushing it to data stores.  Streaming applications can also incorporate machine learning and other Spark features to trigger actions in real time, such as flagging potentially fraudulent user activity.  This lesson is meant as an introduction to streaming applications as they pertain to production ETL jobs.  

Streaming poses a number of specific obstacles. These obstacles include:<br><br>

* *End-to-end reliability and correctness:* Applications must be resilient to failures of any element of the pipeline caused by network issues, traffic spikes, and/or hardware malfunctions
* *Handle complex transformations:* Applications receive many data formats that often involve complex business logic
* *Late and out-of-order data:* Network issues can result in data that arrives late and out of its intended order
* *Integrate with other systems:* Applications must integrate with the rest of a data infrastructure

Streaming data sources in Spark offer the same DataFrames API for interacting with your data.  The crucial difference is that in Structured Streaming, the DataFrame is unbounded.  In other words, data arrives in an input stream and new records are continuously appended to the input DataFrame.
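
To make the unbounded-table idea concrete, here is a minimal pure-Python sketch (it does not use Spark). Each micro-batch of new records is appended to an input table, and the same query logic is applied to everything seen so far. The names `UnboundedTable`, `append_batch`, and `count_by_country` are hypothetical and exist only for this illustration; the real engine processes new data incrementally rather than rescanning the whole table.

```python
from collections import Counter


class UnboundedTable:
    """Toy model of a streaming input table: rows are only ever appended."""

    def __init__(self):
        self.rows = []

    def append_batch(self, batch):
        # New records arriving on the stream are appended to the table.
        self.rows.extend(batch)


def count_by_country(table):
    # The "query" is ordinary batch logic applied to the rows seen so far.
    return Counter(row["country"] for row in table.rows)


table = UnboundedTable()

# First micro-batch arrives.
table.append_batch([{"country": "US"}, {"country": "AU"}])
first_result = count_by_country(table)

# Second micro-batch arrives; the result reflects all data so far.
table.append_batch([{"country": "US"}])
second_result = count_by_country(table)
```

The point of the sketch is that the query itself never changes between batches; only the table it runs against grows.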

<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-3/structured-streamining-model.png" style="height: 400px; margin: 20px"/></div>

### Connecting to the Stream

As data technology matures, the industry has been converging on a set of technologies.  Apache Kafka and the Azure managed alternative Event Hubs have become the ingestion engines at the heart of many pipelines.  

This technology brokers messages between producers, such as an IoT device writing data, and consumers, such as a Spark cluster reading data to perform real-time analytics. There can be a many-to-many relationship between producers and consumers, and the broker itself is scalable and fault-tolerant.

Connect to a Kafka topic that is streaming Wikipedia event data.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/>  There are a number of ways to stream data.  One other common design pattern is to stream from an Azure Blob Container where any new files that appear will be read by the stream.  In this example, we'll be streaming directly from Kafka.

Start by defining the schema of the data in the stream.

In [10]:
from pyspark.sql.types import BooleanType, IntegerType, StructType, StringType, StructField, TimestampType

schema = (StructType()
  .add("timestamp", TimestampType())
  .add("url", StringType())
  .add("userURL", StringType())
  .add("pageURL", StringType())
  .add("isNewPage", BooleanType())
  .add("geocoding", StructType()
    .add("countryCode2", StringType())
    .add("city", StringType())
    .add("latitude", StringType())
    .add("country", StringType())
    .add("longitude", StringType())
    .add("stateProvince", StringType())
    .add("countryCode3", StringType())
    .add("user", StringType())
    .add("namespace", StringType()))
)

To read from a stream, use `spark.readStream`, which returns a `DataStreamReader`.  Then, configure the stream by adding the following options:<br><br>

* The server endpoint: `server1.databricks.training:9092`
* The topic to subscribe to: `en`
* A location to log checkpoint metadata (this is set when the stream is written; more on this below)
* The format: `kafka` 

Finally, use the load method, which loads the data stream from the Kafka source and returns it as an unbounded `DataFrame`.

In [12]:
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

kafkaDF = (spark
  .readStream
  .option("kafka.bootstrap.servers", "server1.databricks.training:9092")
  .option("subscribe", "en")
  .format("kafka")
  .load()
)

Kafka transmits each record as a key, a value, and metadata such as topic and partition.  The information we're interested in is the `value` column.  Since this is a binary value, we must first cast it to a `StringType`.  We then parse the resulting JSON string with `from_json()` using the schema defined above.  Finally, we can expand the full structure of the JSON.
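
As a rough illustration of what happens to a single record, the pure-Python sketch below decodes a binary value to a string, parses it as JSON, and reads a nested field. It does not use Spark. The sample record is made up (only the field names come from the schema above), `parse_value` is a hypothetical helper, and returning `None` for malformed input is a choice made for this sketch.

```python
import json


def parse_value(value_bytes):
    """Decode a Kafka-style binary value and parse it as JSON.

    Returns None when the payload cannot be decoded or parsed
    (a simplification chosen for this sketch).
    """
    try:
        return json.loads(value_bytes.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None


# A made-up record shaped like a Kafka message: key, value, and metadata.
sample_record = {
    "key": None,
    "value": b'{"pageURL": "http://en.wikipedia.org/wiki/Ellison_Barber", '
             b'"isNewPage": false, '
             b'"geocoding": {"countryCode2": "US", "city": "New York"}}',
    "topic": "en",
    "partition": 0,
}

# Step 1 and 2: cast the binary value to a string and parse the JSON.
message = parse_value(sample_record["value"])

# Step 3: the nested structure is now addressable field by field.
country_code = message["geocoding"]["countryCode2"]
```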

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Wait until the stream finishes initializing.  
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Often streams are started with the `.start()` method.  In this example, `display()`, provided in Databricks environments, is running that command for you.

In [14]:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType

kafkaCleanDF = (kafkaDF
  .select(from_json(col("value").cast(StringType()), schema).alias("message"))
  .select("message.*")
)

In [15]:
myStreamName = "lesson02_ps"
display(kafkaCleanDF, streamName = myStreamName)

timestamp,url,userURL,pageURL,isNewPage,geocoding


In [16]:
# Wait until the stream is ready...
untilStreamIsReady(myStreamName)

We now have a live stream of Wikipedia data.  Stop the stream (and any other active streams) to reduce resource consumption.

In [18]:
for s in spark.streams.active:
  s.stop()

### Transform the Stream

We can now start to apply transformation logic to our data in real time.  Parse out only the non-null country data.

In [20]:
goeocodingDF = (kafkaCleanDF
  .filter(col("geocoding.country").isNotNull())
  .select("timestamp", "pageURL", "geocoding.countryCode2", "geocoding.city")
)

display(goeocodingDF, streamName = myStreamName)

timestamp,pageURL,countryCode2,city


In [21]:
# Wait until the stream is ready...
untilStreamIsReady(myStreamName)

Stop the stream.

In [23]:
for s in spark.streams.active:
  s.stop()

### Combine with Historical Data

Joins between historical data and streams operate in much the same way that other joins work.  Join the stream of Wikipedia data to the country metadata.
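
Conceptually, a join between a stream and historical data matches each incoming micro-batch against the same static table. The sketch below models that in plain Python without Spark. The lookup values are taken from this lesson's output, while `join_batch`, the batch contents, and the unmatched code `ZZ` are hypothetical.

```python
# Static lookup keyed by two-letter country code.
countries = {
    "US": {"EnglishShortName": "United States of America", "alpha3Code": "USA"},
    "AU": {"EnglishShortName": "Australia", "alpha3Code": "AUS"},
}


def join_batch(batch, lookup):
    """Inner-join one micro-batch of stream rows against the static lookup."""
    joined = []
    for row in batch:
        match = lookup.get(row["countryCode2"])
        if match is not None:  # inner join: rows without a match are dropped
            joined.append({**row, **match})
    return joined


batch_1 = [
    {"pageURL": "http://en.wikipedia.org/wiki/Ellison_Barber", "countryCode2": "US"},
]
batch_2 = [
    {"pageURL": "http://en.wikipedia.org/wiki/ARIA_Music_Awards_of_2014", "countryCode2": "AU"},
    {"pageURL": "http://example.invalid/unknown", "countryCode2": "ZZ"},  # no match
]

# The same static table is reused for every batch that arrives.
result_1 = join_batch(batch_1, countries)
result_2 = join_batch(batch_2, countries)
```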

Import a lookup table of country data.

In [26]:
countries = spark.read.parquet("/mnt/training/countries/ISOCountryCodes/ISOCountryLookup.parquet")

display(countries)

EnglishShortName,alpha2Code,alpha3Code,numericCode,ISO31662SubdivisionCode,independentTerritory
Afghanistan,AF,AFG,4,ISO 3166-2:AF,Yes
Åland Islands,AX,ALA,248,ISO 3166-2:AX,No
Albania,AL,ALB,8,ISO 3166-2:AL,Yes
Algeria,DZ,DZA,12,ISO 3166-2:DZ,Yes
American Samoa,AS,ASM,16,ISO 3166-2:AS,No
Andorra,AD,AND,20,ISO 3166-2:AD,Yes
Angola,AO,AGO,24,ISO 3166-2:AO,Yes
Anguilla,AI,AIA,660,ISO 3166-2:AI,No
Antarctica,AQ,ATA,10,ISO 3166-2:AQ,No
Antigua and Barbuda,AG,ATG,28,ISO 3166-2:AG,Yes


Join the two `DataFrame`s on their two-letter country code.

In [28]:
joinedDF = goeocodingDF.join(countries, goeocodingDF.countryCode2 == countries.alpha2Code)

display(joinedDF, streamName = myStreamName)

timestamp,pageURL,countryCode2,city,EnglishShortName,alpha2Code,alpha3Code,numericCode,ISO31662SubdivisionCode,independentTerritory
2020-04-18T00:03:26.318+0000,http://en.wikipedia.org/wiki/Ellison_Barber,US,New York,United States of America,US,USA,840,ISO 3166-2:US,Yes
2020-04-18T00:03:28.643+0000,http://en.wikipedia.org/wiki/ARIA_Music_Awards_of_2014,AU,,Australia,AU,AUS,36,ISO 3166-2:AU,Yes
2020-04-18T00:03:29.224+0000,http://en.wikipedia.org/wiki/Special:Log/abusefilter,US,Staten Island,United States of America,US,USA,840,ISO 3166-2:US,Yes


In [29]:
# Wait until the stream is ready...
untilStreamIsReady(myStreamName)

Stop the stream.

In [31]:
for s in spark.streams.active:
  s.stop()

### Batch and Streaming Writes

The logic that applies to batch processing can be ported over to streaming often with only minor modifications.  One best practice is to run batch operations as streaming jobs using a single trigger.  By using a checkpoint location, the metadata on which data has already been processed will be maintained so the cluster can be shut down without a loss of information.  This works best on streaming from a directory, where new files that appear are added to the stream.
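
The single-trigger pattern over a directory can be pictured with the pure-Python sketch below, which does not use Spark. Each run reads a checkpoint of the files already processed, handles only the new ones, updates the checkpoint, and exits. The function `run_once`, the file names, and the JSON checkpoint format are all hypothetical; Spark's own checkpoint layout is different.

```python
import json
import os
import tempfile


def run_once(input_dir, checkpoint_file):
    """Process files not yet recorded in the checkpoint, then stop."""
    seen = set()
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file) as f:
            seen = set(json.load(f))

    # Only files that appeared since the last run are picked up.
    new_files = sorted(set(os.listdir(input_dir)) - seen)

    # (Real transformation and write logic would run on new_files here.)

    # Record progress so the next run, even on a new cluster, skips these files.
    with open(checkpoint_file, "w") as f:
        json.dump(sorted(seen | set(new_files)), f)
    return new_files


workdir = tempfile.mkdtemp()
input_dir = os.path.join(workdir, "incoming")
os.makedirs(input_dir)
checkpoint_file = os.path.join(workdir, "checkpoint.json")

open(os.path.join(input_dir, "a.json"), "w").close()
first_run = run_once(input_dir, checkpoint_file)   # processes a.json

open(os.path.join(input_dir, "b.json"), "w").close()
second_run = run_once(input_dir, checkpoint_file)  # processes only b.json

third_run = run_once(input_dir, checkpoint_file)   # nothing new to process
```

Because progress lives in the checkpoint rather than in memory, nothing needs to stay running between runs.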

Writes can be done against always up-to-date Parquet files or a Databricks Delta table, which offers ACID-compliant transactions on top of Parquet.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> For more information on Databricks Delta, see the <a href="https://academy.databricks.com/collections/frontpage/products/using-databricks-delta-1-user-1-year" target="_blank">Databricks Delta course from Databricks Academy</a><br>
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Read more about triggers in the <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers" target="_blank">Structured Streaming Programming Guide</a>

Make sure that Delta is enabled on your cluster.

In [34]:
spark.sql("set spark.databricks.delta.preview.enabled=true")

Write to a Databricks Delta table and partition by country.
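
Partitioning by a column generally means that rows are laid out in one directory per distinct value, named in the form `column=value`. The pure-Python sketch below only computes such a layout and writes nothing; `partition_rows` and the base path `/tmp/etl3p/joined` are hypothetical, and the rows are abbreviated from the join output above.

```python
from collections import defaultdict


def partition_rows(rows, partition_col, base_path):
    """Group rows under one directory per distinct partition value."""
    layout = defaultdict(list)
    for row in rows:
        directory = "{}/{}={}".format(base_path, partition_col, row[partition_col])
        layout[directory].append(row)
    return dict(layout)


rows = [
    {"countryCode2": "US", "city": "New York"},
    {"countryCode2": "AU", "city": None},
    {"countryCode2": "US", "city": "Staten Island"},
]

# Two distinct country codes produce two partition directories.
layout = partition_rows(rows, "countryCode2", "/tmp/etl3p/joined")
```

A query that filters on the partition column can then skip every directory whose value does not match.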

In [36]:
basePath = "{}/etl3p".format(getUserhome())
checkpointPath = "{}/joined.checkpoint".format(basePath)
joinedPath = "{}/joined".format(basePath)

(joinedDF
  .writeStream                                   # Write the stream
  .format("delta")                               # Use the delta format
  .partitionBy("countryCode2")                   # Specify a feature to partition on
  .option("checkpointLocation", checkpointPath)  # Specify where to log metadata
  .option("path", joinedPath)                    # Specify the output path
  .outputMode("append")                          # Append new records to the output path
  .queryName(myStreamName)                       # The name of the stream
  .start()                                       # Start the operation
)

In [37]:
# Wait until the stream is ready...
untilStreamIsReady(myStreamName)

Check to see that the data is there.

In [39]:
from pyspark.sql.functions import count

countsDF = (spark.read.format("delta").load(joinedPath)
  .select(count("*").alias("count"))
)

display(countsDF)

count
7


Now stop all streams.

In [41]:
for s in spark.streams.active:
  s.stop()

### Failure Recovery

To recover from cluster failure, Spark uses checkpointing for maintaining state.  In our initial command, we accomplished this using `.option("checkpointLocation", checkpointPath)`.  The checkpoint directory is per query and while that query is active, Spark continuously writes metadata of the processed data to this checkpoint directory.  Even if the entire cluster fails, the query can be restarted on a new cluster using the same directory and Spark can use this to start a new query where the failed one left off.  

**This is how Spark ensures end-to-end, exactly-once guarantees and how Spark maintains metadata on what data it has already seen between streaming jobs.**
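
The recovery behavior can be approximated with the simplified pure-Python model below, which does not use Spark. The offset range for each batch is logged before the batch is processed, and a commit is recorded afterwards, loosely mirroring the `offsets/` and `commits/` directories in a checkpoint. All names here (`run_batch`, `next_batch_id`, the in-memory `checkpoint` and `sink`) are hypothetical, and the sketch assumes a replayable source and a sink that overwrites by batch id.

```python
source = ["e0", "e1", "e2", "e3", "e4"]         # replayable source, like a Kafka partition
checkpoint = {"offsets": {}, "commits": set()}  # stand-ins for offsets/ and commits/
sink = {}                                       # idempotent sink: batch id -> records


def run_batch(batch_id, batch_size, fail_before_commit=False):
    # Plan the batch, reusing a logged but uncommitted offset range if one exists.
    if batch_id not in checkpoint["offsets"]:
        start = checkpoint["offsets"][batch_id - 1][1] if batch_id > 0 else 0
        end = min(start + batch_size, len(source))
        checkpoint["offsets"][batch_id] = (start, end)  # log the intent first
    start, end = checkpoint["offsets"][batch_id]

    # Writing by batch id means a replayed batch overwrites instead of duplicating.
    sink[batch_id] = source[start:end]

    if fail_before_commit:
        raise RuntimeError("simulated cluster failure")
    checkpoint["commits"].add(batch_id)  # mark the batch as complete


def next_batch_id():
    # Resume at the first batch that was never committed.
    batch_id = 0
    while batch_id in checkpoint["commits"]:
        batch_id += 1
    return batch_id


run_batch(next_batch_id(), batch_size=2)  # batch 0: e0, e1
try:
    run_batch(next_batch_id(), batch_size=2, fail_before_commit=True)  # batch 1 fails
except RuntimeError:
    pass

# "Restart" with the same checkpoint: batch 1 is replayed with the same offsets.
run_batch(next_batch_id(), batch_size=2)
run_batch(next_batch_id(), batch_size=2)  # batch 2: e4

delivered = [record for b in sorted(sink) for record in sink[b]]
```

Even though batch 1 ran twice, every record ends up in the sink exactly once, because the replay uses the same logged offset range and the sink overwrites by batch id.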

In [43]:
display(dbutils.fs.ls(checkpointPath))

path,name,size
dbfs:/user/tbresee@umich.edu/etl3p/joined.checkpoint/commits/,commits/,0
dbfs:/user/tbresee@umich.edu/etl3p/joined.checkpoint/metadata,metadata,45
dbfs:/user/tbresee@umich.edu/etl3p/joined.checkpoint/offsets/,offsets/,0
dbfs:/user/tbresee@umich.edu/etl3p/joined.checkpoint/sources/,sources/,0


## Exercise 1: Streaming ETL

This exercise entails reading from a Kafka stream of product orders and writing the results to a Delta table.

### Step 1: Create a Streaming DataFrame

Create the streaming DataFrame `productsDF` using the following specifics:<br><br>

* Server endpoint: `server1.databricks.training:9092`
* Topic name: `product-orders`
* Format: `kafka`

In [46]:
# ANSWER
productsDF = (spark
  .readStream
  .option("kafka.bootstrap.servers", "server1.databricks.training:9092")
  .option("subscribe", "product-orders")
  .format("kafka")
  .load()
)

In [47]:
# TEST - Run this cell to test your solution
from pyspark.sql.types import *

dbTest("ET3-P-02-01-01", True, productsDF.schema['key'].dataType in [BinaryType(), StringType()])
dbTest("ET3-P-02-01-02", True, productsDF.schema['value'].dataType in [BinaryType(), StringType()])
dbTest("ET3-P-02-01-03", True, productsDF.schema['topic'].dataType == StringType())
dbTest("ET3-P-02-01-04", True, productsDF.schema['partition'].dataType == IntegerType())
dbTest("ET3-P-02-01-05", True, productsDF.schema['offset'].dataType == LongType())
dbTest("ET3-P-02-01-06", True, productsDF.schema['timestamp'].dataType in [TimestampType(), StringType()])
dbTest("ET3-P-02-01-07", True, productsDF.schema['timestampType'].dataType == IntegerType())

print("Tests passed!")

### Step 2: Defining a Schema

Define a schema that consists of the following values.  Save it as `productSchema`.<br>

| Field            | Type             |
|:-----------------|:-----------------|
| `orderID`        | `IntegerType()`  |
| `productID`      | `IntegerType()`  |
| `orderTimestamp` | `TimestampType()`|
| `orderQty`       | `IntegerType()`  |

In [49]:
# ANSWER
from pyspark.sql.types import IntegerType, StructField, StructType, TimestampType

productSchema = StructType([
  StructField("orderID", IntegerType(), nullable=True),
  StructField("productID", IntegerType(), nullable=True),
  StructField("orderTimestamp", TimestampType(), nullable=True),
  StructField("orderQty", IntegerType(), nullable=True)
])

In [50]:
# TEST - Run this cell to test your solution.
from pyspark.sql.types import IntegerType, StringType, TimestampType

dbTest("ET3-P-02-02-01", True, productSchema['orderID'].dataType == IntegerType())
dbTest("ET3-P-02-02-02", True, productSchema['productID'].dataType == IntegerType())
dbTest("ET3-P-02-02-03", True, productSchema['orderTimestamp'].dataType in [StringType(), TimestampType()])
dbTest("ET3-P-02-02-04", True, productSchema['orderQty'].dataType == IntegerType())

print("Tests passed!")

### Step 3: Parsing the Kafka Data

Parse the `value` column of the Kafka data.  Remember to use the `from_json()` function and apply your schema.  Save the results to `ordersDF`.

In [52]:
# ANSWER
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StringType

ordersDF = productsDF.select(from_json(col("value").cast(StringType()), productSchema).alias("order"))

In [53]:
# TEST - Run this cell to test your solution.
from pyspark.sql.types import IntegerType, TimestampType

_df = ordersDF.select(col("order.orderID").alias("orderID"),
                       col("order.productID").alias("productID"),
                       col("order.orderTimestamp").alias("orderTimestamp"),
                       col("order.orderQty").alias("orderQty"))

expected = {"orderID": IntegerType(), "orderQty": IntegerType(), "orderTimestamp": TimestampType(), "productID": IntegerType()}
sch = {c.name: c.dataType for c in _df.schema}

dbTest("ET3-P-02-03-01", True, sch == expected)

print("Tests passed!")

### Step 4: Write to a Delta Table

Write the results to an always up-to-date Databricks Delta table at the provided path.  The table should have the following columns:<br><br>

1. `orderID`
1. `productID`
1. `orderTimestamp`
1. `orderQty`

In [55]:
# ANSWER

ordersPath = "{}/orders".format(basePath)
checkpointPath = "{}/orders.checkpoint".format(basePath)

(ordersDF
  .select("order.orderID", "order.orderQty", "order.orderTimestamp", "order.productID")
  .writeStream
  .format("delta")
  .partitionBy("productID")
  .option("checkpointLocation", checkpointPath)
  .option("path", ordersPath)
  .outputMode("append")
  .queryName(myStreamName)
  .start()
)

In [56]:
# Wait until the stream is ready...
untilStreamIsReady(myStreamName)

In [57]:
# TEST - Run this cell to test your solution.
from pyspark.sql.types import *

ordersPath = "{}/orders".format(basePath)
_df = spark.read.format("delta").load(ordersPath)

expected = {"orderID": IntegerType(), "orderQty": IntegerType(), "orderTimestamp": TimestampType(), "productID": IntegerType()}
sch = {c.name: c.dataType for c in _df.schema}

dbTest("ET3-P-02-04-01", True, sch == expected)

print("Tests passed!")

### Step 5: Now Stop the Stream and Delete Files

Stop the stream.

In [59]:
for s in spark.streams.active:
  s.stop()

Recursively delete the files you created (this is a permanent operation).

In [61]:
dbutils.fs.rm(basePath, True)

## Review
**Question:** What is the best practice for different versions of ETL jobs?  
**Answer:** Generally speaking, an ETL solution should have three versions:
1. *Batch* or the ability to run a periodic workload
1. *Streaming* or the ability to process incoming data in real time
1. *Incremental* or the ability to process a specific set of data, especially in the case of job failure.<br>
In practice, batch and streaming jobs are often combined, where a batch job is a streaming workload using a single trigger.

**Question:** What sources are commonly treated as data streams?  
**Answer:** Apache Kafka and the Azure managed alternative Event Hubs are common data streams.  Additionally, it's common to monitor a directory for incoming files.  When a new file appears, it is brought into the stream for processing.

**Question:** How does Spark ensure exactly-once data delivery and maintain metadata on a stream?  
**Answer:** Checkpoints give Spark this fault tolerance by maintaining state and progress metadata outside of the cluster, so a restarted query can pick up where the failed one left off.

**Question:** How does the Spark approach to streaming integrate with other Spark features?  
**Answer:** Spark Streaming uses the same DataFrame API, allowing easy integration with other Spark functionality.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [64]:
%run "./Includes/Classroom-Cleanup"

## Next Steps

Start the next lesson, [Runnable Notebooks]($./ETL3 03 - Runnable Notebooks ).

## Additional Topics & Resources

**Q:** Where can I find out more information on streaming ETL jobs?  
**A:** Check out the Databricks blog post <a href="https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html" target="_blank">Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1</a>

**Q:** Where can I get more information on integrating Streaming and Kafka?  
**A:** Check out the <a href="https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html" target="_blank">Structured Streaming + Kafka Integration Guide</a>

**Q:** Where can I see a case study on an IoT pipeline using Spark Streaming?  
**A:** Check out the Databricks blog post <a href="https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html" target="_blank">Processing Data in Apache Kafka with Structured Streaming in Apache Spark 2.2</a>

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>