# Spark Structured Streaming

In this lesson, you will use Spark Structured Streaming to process messages in near real time.

### Getting Started

Run the following cell to configure your module.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Remember to attach your notebook to a cluster before running any cells. In the notebook's toolbar, select the drop-down arrow next to Detached, then select your cluster under Attach to.

![Attached to cluster](https://databricksdemostore.blob.core.windows.net/images/03/03/databricks-cluster-attach.png)

In [3]:
%run "./includes/Module-Setup"

## Prepare sample data

We will use a data generator function to create a sample of flight delay data that serves as the streaming source. The following command runs for three (3) minutes to generate a small sample of data you can use for this module.

In [5]:
%scala
DummyDataGenerator.start(3)

The sample flight delay data is written to a folder under your user directory. To view the files created there, run the following cell.

In [7]:
flightsPath = "dbfs:/tmp/%(username)s/new-flights/" %locals()

dbutils.fs.ls(flightsPath)

## Batch / Interactive processing

Now that we have some source data in place, let's explore the data, first using interactive processing, then using Spark Structured Streaming.

The usual first step in attempting to process data is to interactively query the data. 

To accomplish this, import the required libraries into your environment, define a schema for the data being imported, and then create a static DataFrame.

In [9]:
import json
from pyspark.sql.types import StructType, StructField, IntegerType, DecimalType, StringType, TimestampType
from pyspark.sql.functions import *

# Create a schema for reading the data.
schema = StructType([
        StructField('FlightNumber',IntegerType()), 
        StructField('DepartureTime', TimestampType()), 
        StructField('Delay', DecimalType(10, 1), True),
        StructField('Reason', StringType()),
        StructField('Airline', StringType())])

staticFlights = spark.read.schema(schema).csv(flightsPath)

Quickly look at the data contained in the DataFrame, so you have a sense of the structure and data contained within it.

In [11]:
display(staticFlights)

The flight delay data provides information about flights, departure delays, and the reason for those delays, along with a timestamp of when the flight departure occurred.

Now, let's look at the reasons for flight delays, grouped into one-hour windows. To do this, we will group by the `Airline` and `Reason` columns and by one-hour windows over the `DepartureTime` column.

In [13]:
staticDelays = (
  staticFlights
    .groupBy(
      staticFlights.Airline,
      staticFlights.Reason,
      window(staticFlights.DepartureTime, "1 hour")
    )
    .count()
)
staticDelays.cache()

# Register the DataFrame as the temporary view 'static_delays'
staticDelays.createOrReplaceTempView("static_delays")

In [14]:
display(staticDelays)
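
To build intuition for what the one-hour grouping does, the following plain-Python sketch buckets a few rows into hourly windows and counts them per airline and reason. It is only an illustration, not how Spark implements `window`. The sample rows and the `hour_window` helper are hypothetical, and the sketch assumes windows that start at the top of each hour and exclude their end time.

```python
from collections import Counter
from datetime import datetime, timedelta

def hour_window(ts):
    """Return the [start, end) one-hour window that contains ts."""
    start = ts.replace(minute=0, second=0, microsecond=0)
    return start, start + timedelta(hours=1)

# Hypothetical rows: (DepartureTime, Reason, Airline)
sample_rows = [
    (datetime(2019, 1, 1, 8, 5), "Weather", "Airline A"),
    (datetime(2019, 1, 1, 8, 40), "Weather", "Airline A"),
    (datetime(2019, 1, 1, 9, 0), "Weather", "Airline A"),
    (datetime(2019, 1, 1, 9, 15), "Crew", "Airline B"),
]

# Count rows per (Airline, Reason, window), mirroring groupBy(...).count()
window_counts = Counter(
    (airline, reason, hour_window(ts)) for ts, reason, airline in sample_rows
)

for (airline, reason, (start, end)), n in sorted(window_counts.items()):
    print(airline, reason, start, end, n)
```

A departure at exactly 09:00 falls into the 09:00 to 10:00 window, because each window includes its start and excludes its end.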

You can now use Databricks visualizations to view the delay reasons by hour graphically. For this, you will use the `%sql` magic command to execute a SQL query against the sample data.

After running the cell below, select the chart icon, select bar chart, and select plot options. Configure the chart as shown in the following image.

 ![Delays by hour](https://databricksdemostore.blob.core.windows.net/images/08/02/static-actions-windowed-bar-chart.png 'Delays by hour')

In [16]:
%sql SELECT Reason, Airline, date_format(window.end, "MMM-dd HH:mm") AS DepartureDate, Count FROM static_delays ORDER BY DepartureDate

## Stream Processing

Now that you have analyzed the data interactively, let's convert this to a streaming query that continuously updates as data arrives. Since we just have a static set of data, we are going to emulate a stream from the files by reading them one at a time, in the chronological order they were created.

The query we have to write is similar to the interactive query above, except we replace `read` with `readStream`.

In [18]:
streamingFlights = (spark
              .readStream
              .schema(schema)
              .option("maxFilesPerTrigger", 1) #Treat a sequence of files as a stream by selecting one file at a time
              .csv(flightsPath)
            )

streamingDelays = (
  streamingFlights
    .groupBy(
      streamingFlights.Airline,
      streamingFlights.Reason,
      window(streamingFlights.DepartureTime, "1 hour")
    )
    .count()
)
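
Conceptually, reading one file per trigger turns the folder into a sequence of small batches, and the aggregation keeps running counts that are updated as each batch arrives. The plain-Python sketch below imitates that idea with hypothetical data. It is a simplified mental model, not Spark's execution engine, and `run_micro_batches` is an illustrative name.

```python
from collections import Counter

def run_micro_batches(files):
    """Yield the full set of running counts after each simulated trigger."""
    totals = Counter()
    for rows in files:        # one "file" is processed per trigger
        totals.update(rows)   # merge this batch into the running state
        yield dict(totals)    # emit every count accumulated so far

# Hypothetical files, each holding (Airline, Reason) keys
files = [
    [("Airline A", "Weather"), ("Airline B", "Crew")],
    [("Airline A", "Weather")],
]

snapshots = list(run_micro_batches(files))
for i, snapshot in enumerate(snapshots):
    print("after trigger", i, snapshot)
```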

Now, let's take a quick look at the `isStreaming` property, which tells you whether a DataFrame is streaming. For the `streamingDelays` DataFrame, you should see `True` returned.

In [20]:
streamingDelays.isStreaming

For comparison, run the same command against the `staticDelays` DataFrame you created above from the static CSV files. You should see `False` returned.

In [22]:
staticDelays.isStreaming

### Create an in-memory query for accessing the streaming data
In this scenario, you will store the query output as an in-memory table. From there, you can query the data using SQL. The name of the table is specified by the `queryName` option.

> NOTE: In-memory output should only be used on small datasets, as the entire output is collected and stored in memory. In a production environment, you would want to write the query to a file or other sink using code like the following:
  ```scala
  writeStream
      .format("parquet")        // can be "orc", "json", "csv", etc.
      .option("path", "path/to/destination/dir")
      .option("checkpointLocation", "path/to/checkpoint/dir")  // file sinks need a checkpoint location
      .start()
  ```
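
Since the rest of this notebook is written in Python, here is a rough Python counterpart to the snippet above. It is a sketch under assumptions: `start_parquet_sink` is a hypothetical helper, both paths are placeholders, and the stream passed in is assumed to be un-aggregated (for example `streamingFlights`). File sinks only support append output mode, and an aggregated stream such as `streamingDelays` would also need a watermark before it could be written this way.

```python
def start_parquet_sink(streaming_df, output_path, checkpoint_path):
    """Start an append-mode Parquet sink for a streaming DataFrame.

    Hypothetical helper: streaming_df is expected to be an un-aggregated
    streaming DataFrame, and both paths are placeholders you supply.
    """
    return (
        streaming_df.writeStream
        .format("parquet")                              # could also be "orc", "json", "csv"
        .option("path", output_path)                    # where output files are written
        .option("checkpointLocation", checkpoint_path)  # needed for file sinks
        .outputMode("append")                           # the mode file sinks support
        .start()
    )

# Example call (placeholder paths):
# fileQuery = start_parquet_sink(streamingFlights,
#                                "path/to/destination/dir",
#                                "path/to/checkpoint/dir")
```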

In [24]:
spark.conf.set("spark.sql.shuffle.partitions", "2")  # keep the size of shuffles small

query = (
  streamingDelays
    .writeStream
    .format("memory")        # memory = store in-memory table (for testing only)
    .queryName("delays")     # delays = name of the in-memory table
    .outputMode("complete")  # complete = all the counts should be in the table
    .start()
)

`query` is the handle to the streaming query named `delays`, which runs in the background. The query continuously picks up files and updates the windowed counts.

Note the status of `delays` in the above cell. By expanding the `delays` query above, you can view the stream and graphs associated with the ingestion of files by the streaming DataFrame.

 ![Query Dashboard](https://databricksdemostore.blob.core.windows.net/images/08/02/streaming-actions-dashboard.png 'Query Dashboard')

With the in-memory query, you can now access the data via Spark SQL, using the queryName, `delays`, as the table name. Let's view the windowed delay counts graphically, as you did previously.

In [27]:
%sql SELECT Reason, Airline, date_format(window.end, "MMM-dd HH:mm") AS DepartureDate, Count FROM delays ORDER BY DepartureDate

Re-run the above cell a few times, waiting 15 or 20 seconds between runs. Note that rows are continually added to the `delays` table, and that the date range along the bottom of the chart continuously shifts to the right as later days are streamed in.
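
Besides re-running the SQL cell, you can inspect the streaming query handle itself to see whether new data is still arriving. The helper below is a small sketch. `summarize_query` is a hypothetical name, and it assumes the PySpark behavior where `status` and `lastProgress` are dictionary-like and `lastProgress` is `None` until the first batch completes.

```python
def summarize_query(query):
    """Return a small dict describing a streaming query's current state."""
    progress = query.lastProgress  # None until the first batch has completed
    return {
        "name": query.name,
        "active": query.isActive,
        "status": query.status["message"],
        "last_batch_rows": progress["numInputRows"] if progress else None,
    }

# Example call in the notebook, using the handle created earlier:
# summarize_query(query)
```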

## Clean up

Stop the streaming query.

In [30]:
query.stop()
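
If you started more than one streaming query while experimenting, a loop over the session's active queries can stop them all. This is a minimal sketch: `stop_all_streams` is a hypothetical helper, and it assumes `spark` is the notebook's SparkSession, whose `streams.active` property lists the running queries.

```python
def stop_all_streams(spark_session):
    """Stop every active streaming query and return the names that were stopped."""
    stopped = []
    for active_query in spark_session.streams.active:
        active_query.stop()
        stopped.append(active_query.name)
    return stopped

# Example call in the notebook:
# stop_all_streams(spark)
```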

## Next Steps

Now that you have a better understanding of Spark Structured Streaming, you can move on to the next lesson, which combines Structured Streaming with Event Hubs for streaming data.


Start the next lesson, [Streaming with Event Hubs]($./03-Event-Hubs)