<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 400px">
</div>

# Streaming Query
1. Build streaming DataFrames
1. Display streaming query results
1. Write streaming query results
1. Monitor streaming query

##### Classes
- DataStreamReader (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=datastreamreader#pyspark.sql.streaming.DataStreamReader" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/DataStreamReader.html" target="_blank">Scala</a>)
- DataStreamWriter (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=datastreamwriter#pyspark.sql.streaming.DataStreamWriter" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/DataStreamWriter.html" target="_blank">Scala</a>)
- StreamingQuery (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=streamingquery#pyspark.sql.streaming.StreamingQuery" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/StreamingQuery.html" target="_blank">Scala</a>)

In [0]:
%run ./Includes/Classroom-Setup

In [0]:
schema = "device STRING, ecommerce STRUCT<purchase_revenue_in_usd: DOUBLE, total_item_quantity: BIGINT, unique_items: BIGINT>, event_name STRING, event_previous_timestamp BIGINT, event_timestamp BIGINT, geo STRUCT<city: STRING, state: STRING>, items ARRAY<STRUCT<coupon: STRING, item_id: STRING, item_name: STRING, item_revenue_in_usd: DOUBLE, price_in_usd: DOUBLE, quantity: BIGINT>>, traffic_source STRING, user_first_touch_timestamp BIGINT, user_id STRING"

df = (spark.readStream
  .schema(schema)
  .option("maxFilesPerTrigger", 1)
  .parquet(eventsPath)
)

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Build streaming DataFrames

In [0]:
df.isStreaming

In [0]:
# Only col is used in the cells shown here
from pyspark.sql.functions import col

emailTrafficDF = (df.filter(col("traffic_source") == "email")
  .withColumn("mobile", col("device").isin(["iOS", "Android"]))
  .select("user_id", "event_timestamp", "mobile")
)

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Write streaming query results

In [0]:
checkpointPath = userhome + "/email_traffic/checkpoint"
outputPath = userhome + "/email_traffic/output"

devicesQuery = (emailTrafficDF.writeStream
  .outputMode("append")
  .format("parquet")
  .queryName("email_traffic_p")
  .trigger(processingTime="1 second")
  .option("checkpointLocation", checkpointPath)
  .start(outputPath)
)
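The writer configuration above can be wrapped in a small helper so that each query gets its own checkpoint and output directory. This is only a sketch: the function name `start_parquet_stream`, its parameters, and the `/checkpoint` and `/output` suffixes are assumptions for illustration, not part of the course code. It uses only the `writeStream` calls already shown in the cell above.

```python
def start_parquet_stream(df, name, base_path, interval="1 second"):
    """Start an append-mode parquet streaming query (hypothetical helper).

    df        -- a streaming DataFrame
    name      -- query name, shown in the Spark UI and in progress reports
    base_path -- directory that will hold both checkpoint and output data
    interval  -- processing-time trigger interval
    """
    return (df.writeStream
        .outputMode("append")                  # only newly arrived rows are written
        .format("parquet")
        .queryName(name)
        .trigger(processingTime=interval)
        .option("checkpointLocation", base_path + "/checkpoint")
        .start(base_path + "/output")          # returns a StreamingQuery
    )
```

With the variables from this notebook, a call might look like `start_parquet_stream(emailTrafficDF, "email_traffic_p", userhome + "/email_traffic")`.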

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Monitor streaming query

In [0]:
devicesQuery.id

In [0]:
devicesQuery.status
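
As a sketch of how these monitoring values might be read together: in PySpark, `status` is a dict with the keys `message`, `isDataAvailable`, and `isTriggerActive`, and `lastProgress` is dict-like, or `None` if no progress has been reported yet. The helper name `summarize_query` and the sample values below are hypothetical and are not real query output.

```python
def summarize_query(status, progress):
    """Build a one-line summary from StreamingQuery.status and .lastProgress."""
    state = "trigger active" if status.get("isTriggerActive") else "idle"
    data = "data available" if status.get("isDataAvailable") else "no new data"
    if progress is None:
        # lastProgress is None until the first progress update arrives
        batch = "no batches reported yet"
    else:
        batch = "batch {} read {} rows".format(
            progress.get("batchId"), progress.get("numInputRows"))
    return " | ".join([str(status.get("message")), state, data, batch])

# Hypothetical sample values shaped like the real dicts
sample_status = {"message": "Waiting for data to arrive",
                 "isDataAvailable": False,
                 "isTriggerActive": False}
sample_progress = {"name": "email_traffic_p", "batchId": 3, "numInputRows": 120}

print(summarize_query(sample_status, sample_progress))
```

On a running query this could be called as `summarize_query(devicesQuery.status, devicesQuery.lastProgress)`.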

In [0]:
devicesQuery.awaitTermination(5)

In [0]:
devicesQuery.stop()

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Coupon Sales Lab
Process and append streaming data on transactions using coupons.
1. Read data stream
2. Filter for transactions with coupon codes
3. Write streaming query results to parquet
4. Monitor streaming query
5. Stop streaming query

##### Classes
- [DataStreamReader](http://spark.apache.org/docs/3.0.0/api/scala/org/apache/spark/sql/streaming/DataStreamReader.html)
- [DataStreamWriter](http://spark.apache.org/docs/3.0.0/api/scala/org/apache/spark/sql/streaming/DataStreamWriter.html)
- [StreamingQuery](http://spark.apache.org/docs/3.0.0/api/scala/org/apache/spark/sql/streaming/StreamingQuery.html)

In [0]:
schema = "order_id BIGINT, email STRING, transaction_timestamp BIGINT, total_item_quantity BIGINT, purchase_revenue_in_usd DOUBLE, unique_items BIGINT, items ARRAY<STRUCT<coupon: STRING, item_id: STRING, item_name: STRING, item_revenue_in_usd: DOUBLE, price_in_usd: DOUBLE, quantity: BIGINT>>"

### 1. Read data stream
- Use schema stored in **`schema`**
- Set to process 1 file per trigger
- Read from parquet with filepath stored in **`salesPath`**

Assign the resulting DataFrame to **`df`**

In [0]:
# TODO
df = (spark.FILL_IN
)

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
assert df.isStreaming
assert df.columns == ['order_id', 'email', 'transaction_timestamp', 'total_item_quantity', 'purchase_revenue_in_usd', 'unique_items', 'items']

### 2. Filter for transactions with coupon codes
- Explode **`items`** field in **`df`**
- Filter for records where **`items.coupon`** is not null

Assign the resulting DataFrame to **`couponSalesDF`**.
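
To make the expected shape of the result concrete, here is a plain-Python analogue of the two steps above. It is not Spark code and not the lab solution: the function names and the sample orders (including the coupon code) are made up for illustration. Like Spark's `explode`, it emits one row per array element and drops rows whose array is empty or null.

```python
def explode_items(rows):
    """Return one output row per element of each row's "items" list."""
    out = []
    for row in rows:
        for item in row.get("items") or []:   # empty or missing lists yield no rows
            exploded = dict(row)              # copy the other columns unchanged
            exploded["items"] = item          # "items" now holds a single struct
            out.append(exploded)
    return out

def with_coupon(rows):
    """Keep only exploded rows whose item carries a non-null coupon."""
    return [r for r in rows if r["items"].get("coupon") is not None]

# Hypothetical sample data
orders = [
    {"order_id": 1, "items": [{"item_id": "A", "coupon": "NEWBED10"},
                              {"item_id": "B", "coupon": None}]},
    {"order_id": 2, "items": [{"item_id": "C", "coupon": None}]},
    {"order_id": 3, "items": []},
]

exploded = explode_items(orders)
coupon_rows = with_coupon(exploded)
print(len(exploded), len(coupon_rows))  # prints: 3 1
```

After exploding, `items` is a single struct rather than an array, which is why the filter can address `items.coupon` directly.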

In [0]:
# TODO
couponSalesDF = (df.FILL_IN
)

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
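from pyspark.sql.types import StructType

# Inspect the schema object itself; its string form differs across Spark versions
itemsType = couponSalesDF.schema["items"].dataType
assert isinstance(itemsType, StructType) and "coupon" in itemsType.fieldNames(), "items column was not exploded"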

### 3. Write streaming query results to parquet
- Configure streaming query to write out to parquet in "append" mode
- Set query name to "coupon_sales"
- Set a trigger interval of 1 second
- Set checkpoint location to **`couponsCheckpointPath`**
- Set output filepath to **`couponsOutputPath`**

Assign the resulting streaming query to **`couponSalesQuery`**.

In [0]:
# TODO
couponsCheckpointPath = workingDir + "/coupon-sales/checkpoint"
couponsOutputPath = workingDir + "/coupon-sales/output"

couponSalesQuery = (couponSalesDF.FILL_IN
)

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
untilStreamIsReady("coupon_sales")
assert couponSalesQuery.isActive
assert len(dbutils.fs.ls(couponsOutputPath)) > 0
assert len(dbutils.fs.ls(couponsCheckpointPath)) > 0
assert "coupon_sales" in couponSalesQuery.lastProgress["name"]

### 4. Monitor streaming query
- Get ID of streaming query
- Get status of streaming query

In [0]:
# TODO
queryID = couponSalesQuery.FILL_IN

In [0]:
# TODO
queryStatus = couponSalesQuery.FILL_IN

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
assert isinstance(queryID, str)
# Compare as a set so the check does not depend on key order
assert set(queryStatus.keys()) == {'message', 'isDataAvailable', 'isTriggerActive'}

### 5. Stop streaming query
- Stop the streaming query

In [0]:
# TODO
couponSalesQuery.FILL_IN

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
assert not couponSalesQuery.isActive

### Classroom Cleanup
Run the cell below to clean up resources.

In [0]:
%run ./Includes/Classroom-Cleanup
