# **Spark Structured Streaming: DataFrames and Queries**




In [None]:
# Clone the GitHub repository
!git clone https://github.com/ssalloum/SDSC-Spark5.git

In [None]:
!ls /content/SDSC-Spark5/data

# Install Spark and Create SparkSession


In [None]:
!pip install pyspark

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master("local[*]")\
        .appName("Streaming Aggregations")\
        .getOrCreate()

spark

# Structured Streaming

* Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
* You can express your streaming computation the same way you would express a batch computation on static data.
* The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive.
* [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
* [Structured Streaming API Reference](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html): pyspark.sql.streaming
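
The incremental model described in the bullets above can be pictured without Spark at all. The following is a conceptual sketch in plain Python, not the Spark API: the `update_counts` helper, the sample batches, and the activity labels are hypothetical and only illustrative. Each arriving micro-batch is folded into a running result table instead of recomputing the result from scratch.

```python
from collections import Counter

def update_counts(result_table, micro_batch):
    """Fold one micro-batch of activity labels into the running counts."""
    result_table.update(micro_batch)  # Counter.update adds occurrences
    return result_table

# Hypothetical micro-batches, standing in for newly arrived sensor rows.
batches = [
    ["walk", "sit", "walk"],
    ["stand", "walk"],
    ["sit"],
]

result_table = Counter()
for batch_id, batch in enumerate(batches):
    update_counts(result_table, batch)
    # After every batch the full, up-to-date result is available.
    print(batch_id, dict(result_table))
```

In Structured Streaming the engine keeps this running state for you; the streaming query is written as if the data were a static table.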

## Core Classes
* [pyspark.sql.streaming.DataStreamReader](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.html)
* [pyspark.sql.streaming.DataStreamWriter](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.html)
* [pyspark.sql.streaming.StreamingQuery](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQuery.html)
* [pyspark.sql.streaming.StreamingQueryManager](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryManager.html)
* [pyspark.sql.streaming.StreamingQueryListener](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryListener.html)

# Data: Heterogeneity Human Activity Recognition Dataset
* In this notebook, we will work with the [Heterogeneity Human Activity Recognition Dataset](https://archive.ics.uci.edu/dataset/344/heterogeneity+activity+recognition).
* This data is also available in the [Spark: The Definitive Guide](https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data) GitHub repository, in the activity data folder.
* The data consists of smartphone and smartwatch sensor readings from a variety of devices.
* Readings from these sensors were recorded while users performed activities like biking, sitting, standing, walking, and so on.
* The recordings come from several different smartphone and smartwatch models and from nine users.


## Create a Static DataFrame

In [None]:
sensorDataPath = "/content/SDSC-Spark5/data/activity-data"

In [None]:
sensorStatic = spark.read.json(sensorDataPath)

In [None]:
sensorStatic.printSchema()

In [None]:
sensorStatic.show()

In [None]:
sensorStatic.groupBy("gt").count().show()

# Streaming Queries
* The easiest way to get started with Structured Streaming is to use files as the streaming source.


## Define Input Source
* [pyspark.sql.streaming.DataStreamReader](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.html)
* [pyspark.sql.SparkSession.readStream](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.readStream.html)
* `maxFilesPerTrigger`: caps the file processing rate by setting the maximum number of new files read in each micro-batch (trigger).
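
To make the effect of a per-trigger file cap concrete, here is a toy model in plain Python. It is only a sketch: `plan_micro_batches` and the file names are hypothetical, and Spark's own file discovery and ordering logic is not modeled. It simply shows how a backlog of files is spread over several micro-batches when at most a fixed number of files may be read per trigger.

```python
def plan_micro_batches(files, max_files_per_trigger):
    """Split a backlog of files into micro-batches of bounded size."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    return [
        files[i:i + max_files_per_trigger]
        for i in range(0, len(files), max_files_per_trigger)
    ]

# Hypothetical backlog of 12 files with a cap of 5 files per trigger.
backlog = [f"part-{n:05d}.json" for n in range(12)]
plan = plan_micro_batches(backlog, 5)
print([len(batch) for batch in plan])
```

A small cap is convenient in a tutorial because the result table then changes visibly over several triggers instead of being complete after the first one.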

In [None]:
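# Create a streaming DataFrame from the JSON files; file sources need a schema
# up front by default, so reuse the one inferred for the static DataFrame.
sensorStreaming = spark.readStream.schema(sensorStatic.schema)\
                       .option("maxFilesPerTrigger", 5)\
                       .json(sensorDataPath)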

In [None]:
sensorStreaming

In [None]:
sensorStreaming.printSchema()

In [None]:
# Expected to raise AnalysisException: a streaming DataFrame must be run with writeStream.start()
sensorStreaming.count()

* [pyspark.sql.DataFrame.isStreaming](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.isStreaming.html): check whether a DataFrame is a streaming DataFrame or not.

In [None]:
sensorStreaming.isStreaming

## Define Transformations

In [None]:
# count the number of rows for each activity type
activityCounts = sensorStreaming.groupBy("gt").count()

In [None]:
activityCounts

In [None]:
# Expected to raise AnalysisException: show() cannot run directly on a streaming DataFrame
activityCounts.show()

In [None]:
activityCounts.isStreaming

## Specify Output Sink, Output Mode and Processing Details
* [pyspark.sql.streaming.DataStreamWriter](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.html)
* [pyspark.sql.DataFrame.writeStream](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.writeStream.html)
* In this example, we use the memory sink, which stores the output as an in-memory table. Use it only for debugging on low data volumes: the entire output is collected in the driver’s memory, and the sink is not fault-tolerant because it does not persist the output.
* The query name will be used as the table name.


In [None]:
activityWriter = activityCounts.writeStream.queryName("activityCounts")\
  .format("memory")\
  .outputMode("complete")\
  .trigger(processingTime="1 second")

In [None]:
activityWriter

## Start the Streaming Query
* [pyspark.sql.streaming.DataStreamWriter.start](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.start.html)


In [None]:
activityQuery = activityWriter.start()
activityQuery

* [pyspark.sql.streaming.StreamingQuery](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQuery.html)
* `activityQuery` is a handle to the streaming query named `activityCounts`, which runs in the background. It can be used to manage and monitor the query, as we will see later.
* In this example, this query continuously picks up files and updates the result counts.
* `start()` is a nonblocking method, so it will return as soon as the query has started in the background.
* If you want the main thread to block until the streaming query has terminated, use [StreamingQuery.awaitTermination()](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQuery.awaitTermination.html). A standalone application typically needs this call; otherwise the driver process may exit while the query is still running.
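
In a notebook, blocking forever is rarely what you want. `awaitTermination` also accepts a timeout in seconds and then returns whether the query has terminated. The helper below is a minimal sketch built on that behavior; `run_for` is a hypothetical name, and it works with any object that exposes `awaitTermination(timeout)`, `isActive`, and `stop()`.

```python
def run_for(query, seconds):
    """Let a streaming query run for at most `seconds`, then stop it.

    `query` is expected to be a StreamingQuery (or any object exposing
    awaitTermination(timeout), isActive, and stop()).
    Returns True if the query ended on its own within the timeout.
    """
    # With a timeout, awaitTermination returns whether the query has
    # terminated instead of blocking indefinitely.
    finished = query.awaitTermination(seconds)
    if not finished and query.isActive:
        query.stop()
    return finished
```

For example, `run_for(activityQuery, 10)` would let the query in this notebook run for up to ten seconds and then stop it, so only call it once you are done inspecting the results.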

## Query the Result Table
While the streaming query is running, we can query the result table (it has the same name as the query).

In [None]:
spark.sql("SELECT * FROM activityCounts").show()

We can also use `spark.table()` to get the result table as a DataFrame.

In [None]:
spark.table("activityCounts").show()

In [None]:
# Print the result table at a fixed interval (here once per second, five times) to watch the counts change
from time import sleep
for x in range(5):
    spark.table("activityCounts").show()
    sleep(1)

## Managing and Monitoring Streaming Queries
* [Streaming Query Management](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/query_management.html)
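
The `lastProgress` property used in this section returns a fairly large structure describing the most recent micro-batch, and it is `None` until the first batch has completed. The sketch below condenses it into one line. It assumes the dict-like form that PySpark returns and reads the `batchId`, `numInputRows`, and `processedRowsPerSecond` keys; `summarize_progress` is a hypothetical helper, and the sample values are made up for illustration.

```python
def summarize_progress(progress):
    """Return a one-line summary of a StreamingQuery.lastProgress value.

    `progress` is treated as a mapping; it is None until the first
    micro-batch has completed.
    """
    if progress is None:
        return "no micro-batch has completed yet"
    return "batch {}: {} input rows, {:.1f} rows/s processed".format(
        progress["batchId"],
        progress["numInputRows"],
        progress.get("processedRowsPerSecond", 0.0),
    )

# Hypothetical progress values, for illustration only.
sample = {"batchId": 3, "numInputRows": 1200, "processedRowsPerSecond": 850.0}
print(summarize_progress(sample))
print(summarize_progress(None))
```

With a running query, `summarize_progress(activityQuery.lastProgress)` gives a quick health check without printing the whole progress report.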

In [None]:
activityQuery.isActive

In [None]:
activityQuery.status

In [None]:
activityQuery.lastProgress

### Streaming Query Manager

* [pyspark.sql.streaming.StreamingQueryManager](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryManager.html)
* [pyspark.sql.SparkSession.streams](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.streams.html): returns the `StreamingQueryManager`, which can be used to manage the currently active queries.
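
One common use of the manager is cleaning up at the end of a session. The following is a small sketch, assuming only that `spark.streams.active` yields query handles with a `name` attribute and a `stop()` method; `stop_all_queries` is a hypothetical helper, not part of PySpark.

```python
def stop_all_queries(spark):
    """Stop every active streaming query; return the names that were stopped."""
    stopped = []
    # Copy to a list first, since stopping changes the active set.
    for query in list(spark.streams.active):
        query.stop()
        stopped.append(query.name)  # name is None for unnamed queries
    return stopped
```

Calling `stop_all_queries(spark)` in this notebook would stop `activityCounts` along with any other query that is still running.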

In [None]:
spark.streams

In [None]:
# List the active streaming queries in this SparkSession
spark.streams.active

### Streaming Query Listener
* [pyspark.sql.streaming.StreamingQueryListener](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryListener.html)

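* [pyspark.sql.DataFrame.observe](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.observe.html): define named metrics on a DataFrame; for a streaming query, the observed values can be picked up through a `StreamingQueryListener`.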

## Stop the Streaming Query

In [None]:
activityQuery.stop()

In [None]:
spark.streams.active