
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

<img src="https://files.training.databricks.com/images/Apache-Spark-Logo_TM_200px.png" style="float: left; margin: 20px"/>

# Working with Time Windows Lab

## Instructions
* Insert solutions wherever it says `FILL_IN`
* Feel free to copy/paste code from the previous notebook, where applicable
* Run test cells to verify that your solution is correct

## Prerequisites
* Web browser: **Chrome**
* A cluster configured with **8 cores** and **DBR 6.3**
* Suggested Courses from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>:
  - ETL Part 1
  - Spark-SQL

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup & Classroom-Cleanup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [4]:
%run "../Includes/Classroom-Setup"

Define the name of the stream we are to use later in this lesson:

In [6]:
myStreamName = "lab03_ps"

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise 1: Read data into a stream</h2>

The dataset used in this exercise contains information about flights to and from various airports in 2007.

You have already seen this dataset in Exercise 1 of Notebook 02.

To refresh your memory, take a look at the first few lines of the dataset.

In [8]:
path = "dbfs:/mnt/training/asa/flights/2007-01-stream.parquet/part-00000-tid-9167815511861375854-22d81a30-d5b4-43d0-9216-0c20d14c3f54-178-c000.snappy.parquet"
df = spark.read.parquet(path)
display(df)

For this exercise you will need to complete the following tasks:
0. Start a stream that reads parquet files dumped to the directory `dataPath`
0. Control the size of each partition by forcing Spark to process only 1 file per trigger.

Other notes:
0. The source data has already been defined as `dataPath`
0. The schema has already been defined as `parquetSchema`
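
To picture what processing only 1 file per trigger means, here is a minimal plain-Python sketch that splits a list of files into per-trigger groups. It does not use Spark and is not how Spark schedules work internally; `micro_batches` and the file names are hypothetical and exist only for illustration.

```python
def micro_batches(files, max_files_per_trigger=1):
    """Split an ordered list of file names into groups of at most N files."""
    groups = []
    for i in range(0, len(files), max_files_per_trigger):
        groups.append(files[i:i + max_files_per_trigger])
    return groups


# Hypothetical file names standing in for the parquet parts under dataPath
files = ["part-00000.parquet", "part-00001.parquet", "part-00002.parquet"]

one_per_trigger = micro_batches(files, 1)
two_per_trigger = micro_batches(files, 2)

for number, group in enumerate(one_per_trigger):
    print("trigger", number, group)
```

With a limit of 1, each trigger picks up a single file, so the number of triggers equals the number of files. A larger limit gives fewer, bigger micro-batches.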

In [10]:
# ANSWER
dataPath = "/mnt/training/asa/flights/2007-01-stream.parquet/"

parquetSchema = "DepartureAt timestamp, FlightDate string, DepTime string, CRSDepTime string, ArrTime string, CRSArrTime string, UniqueCarrier string, FlightNum integer, TailNum string, ActualElapsedTime string, CRSElapsedTime string, AirTime string, ArrDelay string, DepDelay string, Origin string, Dest string, Distance string, TaxiIn string, TaxiOut string, Cancelled integer, CancellationCode string, Diverted integer, CarrierDelay string, WeatherDelay string, NASDelay string, SecurityDelay string, LateAircraftDelay string"
  
# Configure the shuffle partitions to match the number of cores  
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

streamDF = (spark                   # Start with the SparkSession
  .readStream                       # Get the DataStreamReader
  .format("parquet")                # Configure the stream's source for the appropriate file type
  .schema(parquetSchema)            # Specify the parquet files' schema
  .option("maxFilesPerTrigger", 1)  # Restrict Spark to processing only 1 file per trigger
  .load(dataPath)                   # Load the DataFrame specifying its location as dataPath
)

In [11]:
# TEST - Run this cell to test your solution.
schemaStr = str(streamDF.schema)

dbTest("SS-03-shuffles",  sc.defaultParallelism, spark.conf.get("spark.sql.shuffle.partitions"))

dbTest("SS-03-schema-1",  True, "(DepartureAt,TimestampType,true)" in schemaStr)
dbTest("SS-03-schema-2",  True, "(FlightDate,StringType,true)" in schemaStr)
dbTest("SS-03-schema-3",  True, "(DepTime,StringType,true)" in schemaStr)
dbTest("SS-03-schema-4",  True, "(CRSDepTime,StringType,true)" in schemaStr)
dbTest("SS-03-schema-5",  True, "(ArrTime,StringType,true)" in schemaStr)
dbTest("SS-03-schema-6",  True, "(CRSArrTime,StringType,true)" in schemaStr)
dbTest("SS-03-schema-7",  True, "(UniqueCarrier,StringType,true)" in schemaStr)
dbTest("SS-03-schema-8",  True, "(FlightNum,IntegerType,true)" in schemaStr)
dbTest("SS-03-schema-9",  True, "(TailNum,StringType,true)" in schemaStr)
dbTest("SS-03-schema-10",  True, "(ActualElapsedTime,StringType,true)" in schemaStr)
dbTest("SS-03-schema-11",  True, "(CRSElapsedTime,StringType,true)" in schemaStr)
dbTest("SS-03-schema-12",  True, "(AirTime,StringType,true)" in schemaStr)
dbTest("SS-03-schema-13",  True, "(ArrDelay,StringType,true)" in schemaStr)
dbTest("SS-03-schema-14",  True, "(DepDelay,StringType,true)" in schemaStr)
dbTest("SS-03-schema-15",  True, "(Origin,StringType,true)" in schemaStr)
dbTest("SS-03-schema-16",  True, "(Dest,StringType,true)" in schemaStr)
dbTest("SS-03-schema-17",  True, "(Distance,StringType,true)" in schemaStr)
dbTest("SS-03-schema-18",  True, "(TaxiIn,StringType,true)" in schemaStr)
dbTest("SS-03-schema-19",  True, "(TaxiOut,StringType,true)" in schemaStr)
dbTest("SS-03-schema-20",  True, "(Cancelled,IntegerType,true)" in schemaStr)
dbTest("SS-03-schema-21",  True, "(CancellationCode,StringType,true)" in schemaStr)
dbTest("SS-03-schema-22",  True, "(Diverted,IntegerType,true)" in schemaStr)
dbTest("SS-03-schema-23",  True, "(CarrierDelay,StringType,true)" in schemaStr)
dbTest("SS-03-schema-24",  True, "(WeatherDelay,StringType,true)" in schemaStr)
dbTest("SS-03-schema-25",  True, "(NASDelay,StringType,true)" in schemaStr)
dbTest("SS-03-schema-26",  True, "(SecurityDelay,StringType,true)" in schemaStr)
dbTest("SS-03-schema-27",  True, "(LateAircraftDelay,StringType,true)" in schemaStr)

print("Tests passed!")

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise 2: Plot grouped events</h2>

Plot the count of all flights aggregated by a 30 minute window and `UniqueCarrier`. 

Ignore any events delayed by 300 minutes or more.
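
As a rough mental model of the 300 minute rule, the plain-Python sketch below tracks the latest event time seen so far and drops events that fall more than the allowed delay behind it. It is a simplification under stated assumptions, not Spark's implementation: `filter_late`, `DELAY`, and the sample timestamps are hypothetical, and Spark's own bookkeeping for late data is more involved.

```python
from datetime import datetime, timedelta

DELAY = timedelta(minutes=300)  # same threshold as the exercise


def filter_late(batches, delay=DELAY):
    """Return (watermark, kept_events) per batch under a simplified watermark model."""
    results = []
    max_event_time = None
    for batch in batches:
        # The watermark used for a batch comes from events seen in earlier batches.
        watermark = None if max_event_time is None else max_event_time - delay
        kept = [ts for ts in batch if watermark is None or ts >= watermark]
        results.append((watermark, kept))
        # Advance the latest event time using everything in this batch.
        for ts in batch:
            if max_event_time is None or ts > max_event_time:
                max_event_time = ts
    return results


# Hypothetical event times, grouped into three micro-batches
batches = [
    [datetime(2007, 1, 1, 12, 0), datetime(2007, 1, 1, 12, 10)],
    [datetime(2007, 1, 1, 18, 0)],
    [datetime(2007, 1, 1, 12, 30), datetime(2007, 1, 1, 17, 0)],
]

results = filter_late(batches)
for watermark, kept in results:
    print(watermark, [ts.strftime("%H:%M") for ts in kept])
```

In the third batch the watermark sits at 13:00 (18:00 minus 300 minutes), so the 12:30 event is discarded while the 17:00 event is kept.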

You will need to:
0. Use a watermark to discard events not received within 300 minutes
0. Configure the stream for a 30 minute window (when no slide duration is given, `window` produces non-overlapping tumbling windows)
0. Aggregate by the 30 minute window and the column `UniqueCarrier`
0. Add the column `start` by extracting it from `window.start`
0. Sort the stream by `start`
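
The grouping steps above can be pictured with a small plain-Python sketch. It assumes fixed, non-overlapping 30 minute windows aligned to the Unix epoch, which is what `window` produces when no slide duration or start offset is given. The helper `window_start` and the sample events are hypothetical, and the sketch only mirrors the arithmetic, not Spark's streaming execution.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)
EPOCH = datetime(1970, 1, 1)


def window_start(event_time, width=WINDOW):
    """Return the start of the fixed-width window that contains event_time."""
    elapsed = event_time - EPOCH
    return EPOCH + (elapsed // width) * width


# Hypothetical sample events: (DepartureAt, UniqueCarrier)
events = [
    (datetime(2007, 1, 1, 6, 5), "AA"),
    (datetime(2007, 1, 1, 6, 29), "AA"),
    (datetime(2007, 1, 1, 6, 30), "AA"),
    (datetime(2007, 1, 1, 6, 45), "WN"),
]

# Count events per (window start, carrier), then list them sorted by start
counts = Counter((window_start(ts), carrier) for ts, carrier in events)

for (start, carrier), n in sorted(counts.items()):
    print(start, carrier, n)
```

Each event lands in exactly one window: 06:05 and 06:29 share the window starting at 06:00, while 06:30 opens the next one.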

In order to create a LIVE bar chart of the data, you'll need to specify the following <b>Plot Options</b>:
* **Keys** is set to `start`
* **Series groupings** is set to `UniqueCarrier`
* **Values** is set to `count`

In [13]:
# ANSWER
from pyspark.sql.functions import window, col

countsDF = (streamDF                                             # Start with the DataFrame
  .withWatermark("DepartureAt", "300 minutes")                   # Specify the watermark
  .groupBy(window("DepartureAt", "30 minute"), "UniqueCarrier" ) # Aggregate the data
  .count()                                                       # Produce a count for each aggregate
  .withColumn("start", col("window.start"))                      # Add the column "start", extracting it from "window.start"
  .orderBy("start")                                              # Sort the stream by "start" 
)
display(countsDF,  streamName = myStreamName)

Wait until the stream is done initializing...

In [15]:
untilStreamIsReady(myStreamName)

In [16]:
# TEST - Run this cell to test your solution.
schemaStr = str(countsDF.schema)

dbTest("SS-03-schema-1",  True, "(UniqueCarrier,StringType,true)" in schemaStr)
dbTest("SS-03-schema-2",  True, "(count,LongType,false)" in schemaStr)
dbTest("SS-03-schema-5",  True, "(start,TimestampType,true)" in schemaStr)

print("Tests passed!")

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise 3: Stop streaming jobs</h2>

Before we can conclude, we need to shut down all active streams.

In [18]:
# ANSWER

for s in spark.streams.active:   # Iterate over all the active streams
  try:
    s.stop()                     # Stop the stream
    s.awaitTermination()         # Wait for it to stop
    
  except Exception as e:
    # In extreme cases, this function may throw an ignorable error.
    print("An [ignorable] error has occurred while stopping the stream.\n" + str(e))

In [19]:
# TEST - Run this cell to test your solution.
dbTest("SS-03-numActiveStreams", 0, len(spark.streams.active))

print("Tests passed!")

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [21]:
%run "../Includes/Classroom-Cleanup"

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Next Steps</h2>

Start the next lesson, [Using Kafka]($../SS 04a - Using Kafka).

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>