# Spark Structured Streaming - Demo
## Pizza Oven


### Authors

```
Marco Balduini - marco.balduini@quantiaconsulting.com
Emanuele Della Valle - emanuele.dellavalle@polimi.it
```
```
Translation to SSS: Massimo Pavan - massimo1.pavan@mail.polimi.it
```

### Use Case Description - Linear Pizza Oven
We have a linear oven to continuously cook pizza.

The cooking operation has two main steps, each carried out in its own area of the oven:

* cooking the pizza base, and
* melting the mozzarella.

There are two sensors:

* S1 measures the temperature and the relative humidity of the pizza base cooking area.
* S2 measures the temperature and the relative humidity of the mozzarella melting area. 

Both sensors send a measurement (temperature and relative humidity) every minute, but they are not synchronised.

In [None]:
import os
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
import io
from pyspark.sql.functions import *
import time
import json
import struct
import requests 

# Kafka connector for Structured Streaming: the Scala (2.12) and Spark (3.0.1) versions must match the local installation
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.kafka:kafka-clients:2.6.0 pyspark-shell'
                                    
spark = (SparkSession.builder 
    .master("local[*]")
    .appName("test")
    .getOrCreate()
        )

spark

Set up the Kafka topic name and the bootstrap servers

In [None]:
temperature_humidity_topic = 'Temperature_Humidity_Sensor_Event'
servers = "kafka:9092"

## Understanding spark-kafka integration
Let's first treat Kafka as a batch source

In [None]:
temperature_humidity_df = (spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("subscribe", temperature_humidity_topic)
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load())

In [None]:
temperature_humidity_df.printSchema()

In [None]:
temperature_humidity_df.show(5)

In [None]:
stringified_temperature_humidity_df = temperature_humidity_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
stringified_temperature_humidity_df.show(5,False)

In [None]:
from pyspark.sql.types import *

temperature_humidity_schema = StructType([
    StructField("sensor", StringType(), True),
    StructField("temperature", IntegerType(), True),
    StructField("humidity", IntegerType(), True),
    StructField("ts", TimestampType(), True)])

In [None]:
decoded_temperature_humidity_df = stringified_temperature_humidity_df.select(col("key").cast("string"),from_json(col("value"), temperature_humidity_schema).alias("value"))

In [None]:
decoded_temperature_humidity_df.printSchema()

In [None]:
decoded_temperature_humidity_df.select("value.*").show(35)
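
To make the decoding step more concrete, here is a minimal plain-Python sketch of what `CAST(value AS STRING)` followed by `from_json` does to one Kafka message. The payload is hypothetical: the field names follow `temperature_humidity_schema`, while the values and the timestamp format are assumptions made for illustration.

```python
import json
from datetime import datetime

# Hypothetical payload: field names follow temperature_humidity_schema,
# but the values and the timestamp format are assumptions.
raw_value = b'{"sensor": "S1", "temperature": 290, "humidity": 35, "ts": "2020-07-21T12:00:00"}'

def decode(value: bytes) -> dict:
    """Mimic CAST(value AS STRING) followed by from_json with the schema."""
    record = json.loads(value.decode("utf-8"))
    return {
        "sensor": str(record["sensor"]),
        "temperature": int(record["temperature"]),
        "humidity": int(record["humidity"]),
        "ts": datetime.fromisoformat(record["ts"]),
    }

decoded = decode(raw_value)
print(decoded)
```

Unlike this sketch, `from_json` does not raise on malformed input: it yields nulls, which is why every field in the schema is declared nullable.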

## DEMO
Please refer to [insert_link_here_if_available]() for the EPL version of the following queries.

Link to the docs: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

In [None]:
streaming_temperature_humidity_df = (spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("startingOffsets", "earliest")
  .option("subscribe", temperature_humidity_topic)
  .load())

In [None]:
decoded_streaming_temperature_humidity_df=(streaming_temperature_humidity_df
                      .select(from_json(col("value").cast("string"), temperature_humidity_schema).alias("value"))
                      .select("value.*"))

In [None]:
decoded_streaming_temperature_humidity_df.printSchema()

In [None]:
temperature_humidity_query = (decoded_streaming_temperature_humidity_df
    .writeStream
    .format("memory")
    .queryName("temperature_humiditySensorEvent")
    .start())

In [None]:
spark.sql("SELECT * FROM temperature_humiditySensorEvent ORDER BY TS ASC").show(10)

## Q1 - Filter

In [None]:
spark.sql("SELECT * FROM temperature_humiditySensorEvent WHERE temperature < 100 AND sensor = 'S2' ").show(5)

## Q2 - Filter

Extract all the measurements in a given range
### Absolute range

In [None]:
spark.sql("SELECT * FROM temperature_humiditySensorEvent WHERE ts >= '2020-07-21 12:00:00' AND ts <= '2020-07-21 12:05:00'").show(5)

### Relative range (start: -36h)

In [None]:
from datetime import datetime

now = 1595333140
thirtysixhoursago = datetime.fromtimestamp(now - 60*60*36).strftime("%Y-%m-%d %H:%M:%S") #60*60*36 = seconds*minutes*hours
query = "SELECT * FROM temperature_humiditySensorEvent WHERE ts >= '{}'".format(thirtysixhoursago)
spark.sql(query).show(5)
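
Note that `datetime.fromtimestamp` without a `tz` argument uses the local timezone of the machine, so the cutoff string can differ from one environment to another. A minimal sketch of the same arithmetic, assuming the timestamps are meant to be read as UTC (an assumption, not something the notebook states):

```python
from datetime import datetime, timedelta, timezone

now = 1595333140  # same reference epoch (seconds) as in the cell above

# 36 hours = 60 * 60 * 36 = 129600 seconds before `now`, computed in UTC
cutoff = datetime.fromtimestamp(now, tz=timezone.utc) - timedelta(hours=36)
cutoff_str = cutoff.strftime("%Y-%m-%d %H:%M:%S")
print(cutoff_str)
```

The resulting string can be placed in the `WHERE ts >= '...'` clause exactly as before.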

## Q3 - Filter by tag

Extract the temperature data from the cooking base area (sensor S1)

In [None]:
spark.sql("SELECT * FROM temperature_humiditySensorEvent WHERE sensor = 'S1'").show(5)

## Q4 - Filter By Value 

Extract the measurements from the cooking base area (sensor S1) with a temperature below 300°C

In [None]:
spark.sql("SELECT * FROM temperature_humiditySensorEvent WHERE sensor = 'S1' AND temperature < 300 ").show(5)

## Q5 - Grouping + Aggregator (mean)

#### Extract the average temperature and the average humidity along the different stages of the linear pizza oven

In [None]:
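# A watermark tells Spark how late data is allowed to arrive (here, up to 1 minute behind the latest event time seen).
# Note: in complete output mode Spark keeps all the aggregate state, so the watermark does not drop any data here.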
All_time_averages_query = (decoded_streaming_temperature_humidity_df
                         .withWatermark("ts", "1 minutes")
                         .groupBy(col("sensor"))
                         .avg("humidity", "temperature")
                     .writeStream
                     .outputMode("complete")
                     .format("memory")
                     .queryName("All_time_averages_query")
                     .start())

In [None]:
# This query may take some time to produce results: if the DataFrame looks empty, re-run the cell after a while
spark.sql("SELECT * FROM All_time_averages_query").show()

In [None]:
All_time_averages_query.stop()
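
The following plain-Python sketch illustrates what the `complete` output mode implies for this aggregation: the engine keeps a small state per sensor and, after every micro-batch, emits the whole result table again. The batches and their values are hypothetical, and the state layout is an illustration, not Spark's internal representation.

```python
from collections import defaultdict

# Hypothetical micro-batches of (sensor, temperature, humidity) tuples.
batches = [
    [("S1", 290, 30), ("S2", 100, 50)],
    [("S1", 310, 40)],
]

# Per-sensor running state: [count, temperature_sum, humidity_sum].
state = defaultdict(lambda: [0, 0, 0])

def process_batch(batch):
    """Update the state and return the full result table (complete mode)."""
    for sensor, temperature, humidity in batch:
        entry = state[sensor]
        entry[0] += 1
        entry[1] += temperature
        entry[2] += humidity
    # One row per sensor: (avg humidity, avg temperature)
    return {k: (v[2] / v[0], v[1] / v[0]) for k, v in state.items()}

for batch in batches:
    result = process_batch(batch)
    print(result)
```

Every printed table contains all sensors seen so far, including those with no new data in the latest batch.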

#### Extract the last humidity and temperature measurements from the cooking base area

In [None]:
# The outer "AND sensor = 'S1'" is necessary because sensor S2 may also have a record with the same ts
spark.sql("""SELECT * FROM temperature_humiditySensorEvent WHERE ts = (SELECT MAX(ts) FROM temperature_humiditySensorEvent 
            WHERE sensor = 'S1') AND sensor = 'S1'""").show()

## Q6 - Aggregate Window

#### Extract the moving average temperature observed in the cooking base area over a window of 2 minutes (DEMO)

In [None]:
#note: this corresponds to a logical tumbling window
LTW_temperature_query = (decoded_streaming_temperature_humidity_df
                         .withWatermark("ts", "1 minutes")
                         .groupBy(window("ts", "2 minutes"),"sensor")
                         .avg("temperature")
                     .writeStream
                     .format("memory")
                     .queryName("LTW_temperature_query_results")
                     .start())

In [None]:
# Note: it may take a while (minutes) to see the most recent window, because in append mode
# a window is emitted only after the watermark has passed its end
spark.sql("SELECT * FROM LTW_temperature_query_results WHERE sensor = 'S1' ").show(25,False)

In [None]:
LTW_temperature_query.stop()
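
As a rough illustration of how a tumbling window assigns events to buckets, here is a plain-Python sketch. It assumes windows aligned to the Unix epoch, which is how `window` behaves when no `startTime` is given; the readings are hypothetical.

```python
def tumbling_window(ts: int, size: int) -> tuple:
    """Return the [start, end) window, in epoch seconds, that contains ts."""
    start = ts - ts % size
    return (start, start + size)

# Hypothetical S1 readings as (epoch seconds, temperature) pairs.
readings = [(1595332800, 290), (1595332860, 300), (1595332920, 310)]

# Accumulate (sum, count) per 2-minute window, then derive the averages.
sums = {}
for ts, temperature in readings:
    key = tumbling_window(ts, 120)
    total, count = sums.get(key, (0, 0))
    sums[key] = (total + temperature, count + 1)

averages = {w: total / count for w, (total, count) in sums.items()}
print(averages)
```

Each reading falls into exactly one window, which is what distinguishes a tumbling window from a hopping one.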

#### Extract the moving average temperature observed by S2 over a window of 3 minutes (hands-on)

In [None]:
LTW_temperature_query2 = (decoded_streaming_temperature_humidity_df
                         .withWatermark("ts", "1 minutes")
                         .groupBy(window("ts", "3 minutes"),"sensor")
                         .avg("temperature")
                     .writeStream
                     .format("memory")
                     .queryName("LTW_temperature_query2_results")
                     .start())

In [None]:
spark.sql("SELECT * FROM LTW_temperature_query2_results WHERE sensor = 'S2'").show(5,False)

In [None]:
LTW_temperature_query2.stop()

## Q7 - Map and custom function

#### Correct the temperature observations of the cooking base area by subtracting a delta of 5°C from each value

In [None]:
#if you want to keep all the records, also the one from the other sensor, a solution could be:

new_column = when(
        (col("sensor") == "S1"), col("temperature") - 5
    ).otherwise(col("temperature"))

map_temperature_query = (decoded_streaming_temperature_humidity_df
                         .withColumn("temperature", new_column)
                     .writeStream
                     .format("memory")
                     .queryName("map_temperature_query_results")
                     .start())

In [None]:
spark.sql("SELECT * FROM map_temperature_query_results").show(5,False)

In [None]:
map_temperature_query.stop()

In [None]:
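#alternatively, if you'd like to keep only the values from sensor S1, a solution could be:
def sub5(x):
    # Subtract the 5°C delta; pass nulls through unchanged
    return None if x is None else x - 5

df = decoded_streaming_temperature_humidity_df.select("*").where("sensor = 'S1'")
# Declare the return type explicitly: udf() defaults to StringType otherwise
fun = udf(sub5, IntegerType())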

map_temperature_query_alt = (df
                         .withColumn("temperature", fun(df["temperature"]))
                     .writeStream
                     .format("memory")
                     .queryName("map_temperature_query_alt_results")
                     .start())

In [None]:
spark.sql("SELECT * FROM map_temperature_query_alt_results").show(5,False)

In [None]:
map_temperature_query_alt.stop()

## Q8 - Stream-to-Stream Join

#### Extract the difference between the temperature of the base cooking area and the mozzarella melting area

### Join assuming synchronous time-series

Apply watermarks on event-time columns and other filters

In [None]:
only_S1_events = (decoded_streaming_temperature_humidity_df
                .withWatermark("ts", "1 minute")
                .filter(col("sensor") == "S1")
               )

only_S2_events = (decoded_streaming_temperature_humidity_df
                .withWatermark("ts", "1 minute")
                .filter(col("sensor") == "S2")
               )
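
The idea behind a watermark can be sketched in plain Python as follows. This is a deliberate simplification: Spark advances the watermark once per micro-batch rather than per event, and it only guarantees that data within the delay is kept, not that everything older is dropped. The event times are hypothetical.

```python
DELAY = 60  # seconds, mirroring withWatermark("ts", "1 minute")

def split_late(event_times, delay=DELAY):
    """Split event times into (accepted, dropped) using a running watermark."""
    max_seen = None
    accepted, dropped = [], []
    for ts in event_times:
        # Watermark = latest event time seen so far minus the allowed delay
        watermark = None if max_seen is None else max_seen - delay
        if watermark is not None and ts < watermark:
            dropped.append(ts)
        else:
            accepted.append(ts)
            max_seen = ts if max_seen is None else max(max_seen, ts)
    return accepted, dropped

# Hypothetical event times in seconds, arriving out of order.
accepted, dropped = split_late([100, 160, 130, 250, 170])
print(accepted, dropped)
```

The event at 130 is late but still within the one-minute delay, so it is kept; the event at 170 arrives after the watermark has moved to 190 and is treated as too late.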

Join with event-time constraints

In [None]:
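# Alias both sides: they derive from the same DataFrame, so column references would otherwise be ambiguous
s1 = only_S1_events.alias("s1")
s2 = only_S2_events.alias("s2")

join_df = (s1.join(s2, col("s1.ts") == col("s2.ts"))
           .select(col("s1.temperature").alias("S1_temperature"),
                   col("s2.temperature").alias("S2_temperature"),
                   col("s1.humidity").alias("S1_humidity"),
                   col("s2.humidity").alias("S2_humidity"),
                   col("s1.ts").alias("ts")
                  ))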

In [None]:
s_to_s_join_query = (join_df
                     .writeStream
                     .format("memory")
                     .queryName("s_to_s_join_query_results")
                     .start())

In [None]:
spark.sql("SELECT * FROM s_to_s_join_query_results ORDER BY ts DESC").show(5,False)

**IMPORTANT:** If we simply join on `ts`, the resulting DataFrame will always be empty, since the records of the two sensors are not synchronized!

In [None]:
s_to_s_join_query.stop()

### Join assuming a fixed delta

In [None]:
only_S1_events = (decoded_streaming_temperature_humidity_df
                .filter(col("sensor") == "S1")
                .select(col("ts").alias("S1_ts"), 
                        col("temperature").alias("S1_temperature"), col("humidity").alias("S1_humidity"))
                .withWatermark("S1_ts", "1 minutes")
               )

only_S2_events = (decoded_streaming_temperature_humidity_df
                .filter(col("sensor") == "S2")
                .select(col("ts").alias("S2_ts"), 
                        col("temperature").alias("S2_temperature"), col("humidity").alias("S2_humidity"))
                .withWatermark("S2_ts", "1 minutes")
               )

In [None]:
only_S1_query = (only_S1_events
                     .writeStream
                     .format("memory")
                     .queryName("results1")
                     .start())

only_S2_query = (only_S2_events
                     .writeStream
                     .format("memory")
                     .queryName("results2")
                     .start())

In [None]:
#join 
df = spark.sql("SELECT * FROM results1 join results2 ON S1_ts <= (S2_ts + INTERVAL 4 seconds) AND S1_ts >= S2_ts")
df.show(25)
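
The join condition above can be read as "an S1 reading matches an S2 reading if it was taken at most 4 seconds after it". A minimal plain-Python sketch of the same predicate, on hypothetical readings:

```python
# Hypothetical (epoch seconds, temperature) readings for each sensor.
s1_events = [(1000, 290), (1060, 295)]
s2_events = [(998, 100), (1040, 105), (1059, 110)]

def interval_join(s1, s2, delta=4):
    """Pair readings where S2_ts <= S1_ts <= S2_ts + delta (seconds)."""
    return [
        (s1_ts, s1_temp, s2_ts, s2_temp)
        for s1_ts, s1_temp in s1
        for s2_ts, s2_temp in s2
        if s2_ts <= s1_ts <= s2_ts + delta
    ]

pairs = interval_join(s1_events, s2_events)
differences = [s1_temp - s2_temp for _, s1_temp, _, s2_temp in pairs]
print(pairs, differences)
```

The S2 reading at 1040 finds no partner, because no S1 reading falls within 4 seconds after it.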

In [None]:
from pyspark.sql.functions import udf
#Alternative way for using user defined functions

@udf("int")
def diff(x, y):
    return x - y

#Calculating difference
df.withColumn("difference", diff(df["S1_temperature"], df["S2_temperature"])).show(25)

In [None]:
only_S2_query.stop()
only_S1_query.stop()

### Join exploiting time-windows 

In [None]:
#note: to demonstrate the use of a different kind of time window, a logical hopping window is used for this query
only_S1_wind_events = (decoded_streaming_temperature_humidity_df
                .filter(col("sensor") == "S1")
                .select(col("ts").alias("S1_ts"), 
                        col("temperature").alias("S1_temperature"), col("humidity").alias("S1_humidity"))
                .withWatermark("S1_ts", "1 minutes")
                       .groupBy(window("S1_ts", "1 minutes", "30 seconds"))
                       .avg("S1_humidity")
               )

only_S2_wind_events = (decoded_streaming_temperature_humidity_df
                .filter(col("sensor") == "S2")
                .select(col("ts").alias("S2_ts"), 
                        col("temperature").alias("S2_temperature"), col("humidity").alias("S2_humidity"))
                .withWatermark("S2_ts", "1 minutes")
                       .groupBy(window("S2_ts", "1 minutes", "30 seconds"))
                       .avg("S2_humidity")
               )

In [None]:
only_S1_wind_query = (only_S1_wind_events
                     .writeStream
                     .format("memory")
                     .queryName("results1")
                     .start())

only_S2_wind_query = (only_S2_wind_events
                     .writeStream
                     .format("memory")
                     .queryName("results2")
                     .start())
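
With a hopping window, one event can belong to several overlapping windows. The plain-Python sketch below shows the assignment for a window of 60 seconds sliding every 30 seconds, assuming windows aligned to the Unix epoch (the `window` default when no `startTime` is given); the timestamp is hypothetical.

```python
def hopping_windows(ts: int, size: int, slide: int) -> list:
    """Return every [start, end) window, in epoch seconds, that contains ts."""
    last_start = ts - ts % slide  # latest window start that is not after ts
    # Walk back one slide at a time while the window still covers ts
    starts = range(last_start, ts - size, -slide)
    return sorted((start, start + size) for start in starts)

print(hopping_windows(95, size=60, slide=30))
```

Each event lands in `size / slide` windows (two here), which is why the joined result contains more rows than a tumbling window of the same size would produce.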

In [None]:
#join 
df = spark.sql("SELECT * FROM results1 join results2 ON results1.window = results2.window")
df.show(45, truncate = False)


In [None]:
#Calculating the humidity difference with plain column arithmetic (no UDF is needed for a subtraction)
df = df.withColumn("difference", df["avg(S2_humidity)"] - df["avg(S1_humidity)"])
df.show(25)

#### Extract the difference between the humidity levels of the base cooking area and the mozzarella melting area. Find if the differences are between 20 and 30

In [None]:
df.filter(df["difference"] > 20).filter(df["difference"] < 30).show(5)

In [None]:
only_S1_wind_query.stop()
only_S2_wind_query.stop()

## Q9 - Static-Streaming Join

Consider the following data, stored in a DB:

```
CREATE DATABASE "pizza-erp";

CREATE TABLE public.oven
(
    pid bigint NOT NULL,
    kind character varying COLLATE pg_catalog."default" NOT NULL,
    enteringtime bigint NOT NULL,
    exitingtime bigint,
    sensor character varying COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT hoven_pkey PRIMARY KEY (pid,enteringtime,sensor)
);

INSERT INTO oven (pid,kind,enteringtime,exitingtime,sensor) VALUES(2,'napoli',1602504000000000000,1602504150000000000,'S1');
INSERT INTO oven (pid,kind,enteringtime,exitingtime,sensor) VALUES(1,'margherita',1602504010000000000,1602504080000000000,'S2');
INSERT INTO oven (pid,kind,enteringtime,exitingtime,sensor) VALUES(3,'pepperoni',1602504170000000000,1602504250000000000,'S1');
INSERT INTO oven (pid,kind,enteringtime,exitingtime,sensor) VALUES(2,'napoli',1602504130000000000,1602504284000000000,'S2');
```

Enrich the time series with the data in the DB

In [None]:
from pyspark import SparkConf
from pyspark import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
#create the static-df
pizza_df = sc.parallelize([
    [2,'napoli', 1595332800,1595332960,'S1'],
    [1,'margherita',1595332810,1595332935,'S2'],
    [3,'pepperoni',1595332980,1595333060,'S1'],
    [2,'napoli',1595332960,1595333095,'S2']]
).toDF(["pid","kind","enteringtime","exitingtime","sensor"])

In [None]:
#convert the times from Unix epoch seconds to timestamps

pizza_df = pizza_df.withColumn("enteringtime", to_timestamp(pizza_df["enteringtime"]))
pizza_df = pizza_df.withColumn("exitingtime", to_timestamp(pizza_df["exitingtime"]))
pizza_df.show()

In [None]:
join_df = decoded_streaming_temperature_humidity_df.join(pizza_df, (pizza_df.sensor == decoded_streaming_temperature_humidity_df.sensor) & 
                                                         (pizza_df.enteringtime <= decoded_streaming_temperature_humidity_df.ts) & 
                                                         (pizza_df.exitingtime >= decoded_streaming_temperature_humidity_df.ts))
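
The join condition pairs each measurement with the pizza that was in the same sensor's area at that moment. A plain-Python sketch of the same range predicate, using the rows of `pizza_df` in epoch seconds and hypothetical measurement timestamps:

```python
# Static side: same rows as pizza_df (times in epoch seconds).
pizzas = [
    (2, "napoli", 1595332800, 1595332960, "S1"),
    (1, "margherita", 1595332810, 1595332935, "S2"),
    (3, "pepperoni", 1595332980, 1595333060, "S1"),
    (2, "napoli", 1595332960, 1595333095, "S2"),
]

def enrich(sensor, ts):
    """Return the kinds of pizza that were in the sensor's area at time ts."""
    return [kind for _, kind, entering, exiting, s in pizzas
            if s == sensor and entering <= ts <= exiting]

# Hypothetical measurement timestamps.
print(enrich("S1", 1595332900))
print(enrich("S2", 1595332900))
print(enrich("S1", 1595332970))
```

Because this is an inner join, a measurement taken while no pizza is in the area (the third call) yields no output row at all.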

In [None]:
join_query = (join_df
    .writeStream
    .format("memory")
    .queryName("join_Event")
    .start())

In [None]:
df = spark.sql("SELECT * FROM join_Event")
df.show(25)

In [None]:
join_query.stop()

## Clean up

In [None]:
temperature_humidity_query.stop()