# Spark Structured Streaming - Demo
## Pizza Oven


### Authors

```
Marco Balduini - marco.balduini@quantiaconsulting.com
Emanuele Della Valle - emanuele.dellavalle@polimi.it
```
```
Translation to SSS: Massimo Pavan - massimo1.pavan@mail.polimi.it
```

### Use Case Description - Linear Pizza Oven
We have a linear oven that cooks pizza continuously.

The cooking operation has two main steps:

* cooking the pizza base, and
* melting the mozzarella.

There are two sensors:

* S1 measures the temperature and the relative humidity of the pizza base cooking area.
* S2 measures the temperature and the relative humidity of the mozzarella melting area. 

Both sensors send a temperature measurement every minute, but are not synchronised.

Most of the functions used in this demo

In [1]:
import os
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
import io
from pyspark.sql.functions import *
import time
import json
import struct
import requests 

# The Kafka source for Structured Streaming only needs spark-sql-kafka; its Scala suffix (2.12) and version must match the installed Spark.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.kafka:kafka-clients:2.6.0 pyspark-shell'
                                    
spark = (SparkSession.builder 
    .master("local[*]")
    .appName("test")
    .getOrCreate()
        )

spark

Set the Kafka topic name and the bootstrap server address.

In [2]:
temperature_humidity_topic = 'TemperatureHumiditySensorEvent'
servers = "kafka:9092"

## Understanding spark-kafka integration
Let's first treat Kafka as a batch source.

In [3]:
temperature_humidity_df = (spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("subscribe", temperature_humidity_topic)
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load())

In [4]:
temperature_humidity_df.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



In [5]:
temperature_humidity_df.show(5)

+-------+--------------------+--------------------+---------+------+--------------------+-------------+
|    key|               value|               topic|partition|offset|           timestamp|timestampType|
+-------+--------------------+--------------------+---------+------+--------------------+-------------+
|[53 31]|[7B 22 73 65 6E 7...|TemperatureHumidi...|        0|     0|2021-02-26 15:33:...|            0|
|[53 32]|[7B 22 73 65 6E 7...|TemperatureHumidi...|        0|     1|2021-02-26 15:33:...|            0|
|[53 32]|[7B 22 73 65 6E 7...|TemperatureHumidi...|        0|     2|2021-02-26 15:33:...|            0|
|[53 31]|[7B 22 73 65 6E 7...|TemperatureHumidi...|        0|     3|2021-02-26 15:33:...|            0|
|[53 32]|[7B 22 73 65 6E 7...|TemperatureHumidi...|        0|     4|2021-02-26 15:33:...|            0|
+-------+--------------------+--------------------+---------+------+--------------------+-------------+
only showing top 5 rows



In [6]:
stringified_temperature_humidity_df = temperature_humidity_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
stringified_temperature_humidity_df.show(5,False)

+---+----------------------------------------------------------------------+
|key|value                                                                 |
+---+----------------------------------------------------------------------+
|S1 |{"sensor": "S1", "temperature": 290, "humidity": 30, "ts": 1614353606}|
|S2 |{"sensor": "S2", "temperature": 105, "humidity": 55, "ts": 1614353610}|
|S2 |{"sensor": "S2", "temperature": 110, "humidity": 60, "ts": 1614353613}|
|S1 |{"sensor": "S1", "temperature": 305, "humidity": 38, "ts": 1614353616}|
|S2 |{"sensor": "S2", "temperature": 120, "humidity": 65, "ts": 1614353619}|
+---+----------------------------------------------------------------------+
only showing top 5 rows
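
The `CAST(... AS STRING)` step above amounts to a UTF-8 decode of the raw Kafka bytes. As a minimal plain-Python sketch (no Spark involved; `raw_key`, `raw_value`, and `decode_record` are illustrative names, and the value bytes are re-encoded from the first row of the table above), the same decoding looks roughly like this:

```python
import json

# Bytes as Kafka delivers them; the hex dump [53 31] shown earlier is "S1".
raw_key = bytes.fromhex("5331")
raw_value = b'{"sensor": "S1", "temperature": 290, "humidity": 30, "ts": 1614353606}'


def decode_record(key: bytes, value: bytes):
    """Decode a key/value pair: UTF-8 decode, then parse the value as JSON."""
    return key.decode("utf-8"), json.loads(value.decode("utf-8"))


key, event = decode_record(raw_key, raw_value)
print(key, event["temperature"], event["humidity"])
```

The JSON parsing step is what the notebook delegates to `from_json` in the next cells.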



In [7]:
from pyspark.sql.types import *

temperature_humidity_schema = StructType([
    StructField("sensor", StringType(), True),
    StructField("temperature", IntegerType(), True),
    StructField("humidity", IntegerType(), True),
    StructField("ts", TimestampType(), True)])

In [8]:
decoded_temperature_humidity_df = stringified_temperature_humidity_df.select(col("key").cast("string"),from_json(col("value"), temperature_humidity_schema).alias("value"))

In [10]:
decoded_temperature_humidity_df.printSchema()

root
 |-- key: string (nullable = true)
 |-- value: struct (nullable = true)
 |    |-- sensor: string (nullable = true)
 |    |-- temperature: integer (nullable = true)
 |    |-- humidity: integer (nullable = true)
 |    |-- ts: timestamp (nullable = true)



In [11]:
decoded_temperature_humidity_df.select("value.*").show(5)

+------+-----------+--------+-------------------+
|sensor|temperature|humidity|                 ts|
+------+-----------+--------+-------------------+
|    S1|        290|      30|2021-02-26 15:33:26|
|    S2|        105|      55|2021-02-26 15:33:30|
|    S2|        110|      60|2021-02-26 15:33:33|
|    S1|        305|      38|2021-02-26 15:33:36|
|    S2|        120|      65|2021-02-26 15:33:39|
+------+-----------+--------+-------------------+
only showing top 5 rows
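
To make the effect of the schema more concrete, here is a plain-Python sketch of the same idea: parse the JSON payload, check each field against an expected type, and read `ts` as epoch seconds. `SCHEMA` and `parse_event` are hypothetical names, the UTC time zone is an assumption that happens to match the table above, and returning `None` for bad payloads is a simplification rather than Spark's exact behaviour.

```python
import json
from datetime import datetime, timezone

# Hypothetical plain-Python mirror of temperature_humidity_schema.
SCHEMA = {"sensor": str, "temperature": int, "humidity": int, "ts": int}


def parse_event(payload: str):
    """Return a typed dict, or None when the payload does not fit the schema."""
    try:
        raw = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(raw, dict):
        return None
    event = {}
    for name, expected in SCHEMA.items():
        value = raw.get(name)
        if value is not None and not isinstance(value, expected):
            return None
        event[name] = value
    # Assumption: ts holds epoch seconds and is rendered in UTC.
    if event["ts"] is not None:
        event["ts"] = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return event


first = parse_event('{"sensor": "S1", "temperature": 290, "humidity": 30, "ts": 1614353606}')
print(first["ts"].strftime("%Y-%m-%d %H:%M:%S"))
```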



## DEMO
Please refer to [insert_link_here_if_available]() for the EPL version of the following queries.

Documentation: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

In [12]:
streaming_temperature_humidity_df = (spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("startingOffsets", "earliest")
  .option("subscribe", temperature_humidity_topic)
  .load())

In [13]:
decoded_streaming_temperature_humidity_df=(streaming_temperature_humidity_df
                      .select(from_json(col("value").cast("string"), temperature_humidity_schema).alias("value"))
                      .select("value.*"))

In [14]:
decoded_streaming_temperature_humidity_df.printSchema()

root
 |-- sensor: string (nullable = true)
 |-- temperature: integer (nullable = true)
 |-- humidity: integer (nullable = true)
 |-- ts: timestamp (nullable = true)



In [15]:
temperature_humidity_query = (decoded_streaming_temperature_humidity_df
    .writeStream
    .format("memory")
    .queryName("temperature_humiditySensorEvent")
    .start())

In [16]:
spark.sql("SELECT * FROM temperature_humiditySensorEvent ORDER BY TS ASC").show(10)

+------+-----------+--------+-------------------+
|sensor|temperature|humidity|                 ts|
+------+-----------+--------+-------------------+
|    S1|        290|      30|2020-07-21 12:00:00|
|    S1|        290|      30|2020-07-21 12:00:00|
|    S2|        105|      55|2020-07-21 12:00:15|
|    S2|        105|      55|2020-07-21 12:00:15|
|    S2|        110|      60|2020-07-21 12:00:45|
|    S2|        110|      60|2020-07-21 12:00:45|
|    S1|        305|      38|2020-07-21 12:01:00|
|    S1|        305|      38|2020-07-21 12:01:00|
|    S2|        120|      65|2020-07-21 12:01:15|
|    S2|        120|      65|2020-07-21 12:01:15|
+------+-----------+--------+-------------------+
only showing top 10 rows



## Q1 - Filter

In [17]:
spark.sql("SELECT * FROM temperature_humiditySensorEvent WHERE temperature < 100 AND sensor = 'S2' ").show(5)

+------+-----------+--------+-------------------+
|sensor|temperature|humidity|                 ts|
+------+-----------+--------+-------------------+
|    S2|         95|      65|2021-02-26 15:33:57|
|    S2|         90|      60|2021-02-26 15:34:00|
|    S2|         95|      65|2021-02-27 09:20:58|
|    S2|         90|      60|2021-02-27 09:21:01|
|    S2|         95|      65|2021-03-01 11:37:13|
+------+-----------+--------+-------------------+
only showing top 5 rows



## Q2 - Filter

Extract all the measurements in a given time range
### Absolute range

In [18]:
spark.sql("SELECT * FROM temperature_humiditySensorEvent WHERE ts >= '2020-07-21 12:00:00' AND ts <= '2020-07-21 12:05:00'").show(5)

+------+-----------+--------+-------------------+
|sensor|temperature|humidity|                 ts|
+------+-----------+--------+-------------------+
|    S1|        290|      30|2020-07-21 12:00:00|
|    S2|        105|      55|2020-07-21 12:00:15|
|    S2|        110|      60|2020-07-21 12:00:45|
|    S1|        305|      38|2020-07-21 12:01:00|
|    S2|        120|      65|2020-07-21 12:01:15|
+------+-----------+--------+-------------------+
only showing top 5 rows



### Relative range (start: -36h)

In [19]:
from datetime import datetime

now = int(time.time())
thirtysixhoursago = datetime.fromtimestamp(now - 60*60*36).strftime("%Y-%m-%d %H:%M:%S") #60*60*36 = seconds*minutes*hours
query = "SELECT * FROM temperature_humiditySensorEvent WHERE ts >= '{}'".format(thirtysixhoursago)
spark.sql(query).show(5)

+------+-----------+--------+-------------------+
|sensor|temperature|humidity|                 ts|
+------+-----------+--------+-------------------+
|    S1|        290|      30|2021-03-01 11:36:42|
|    S2|        105|      55|2021-03-01 11:36:46|
|    S2|        110|      60|2021-03-01 11:36:49|
|    S1|        305|      38|2021-03-01 11:36:52|
|    S2|        120|      65|2021-03-01 11:36:55|
+------+-----------+--------+-------------------+
only showing top 5 rows
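
The cutoff for a relative range can also be computed with `timedelta`, which avoids the manual seconds arithmetic. The helper below is a sketch, not part of the original notebook: `relative_range_query` is a hypothetical name, and `now` is passed in explicitly (a fixed value here) so that the result is reproducible.

```python
from datetime import datetime, timedelta


def relative_range_query(table: str, hours: int, now: datetime) -> str:
    """Build a query selecting the rows whose ts lies within the last `hours` hours."""
    cutoff = now - timedelta(hours=hours)
    return "SELECT * FROM {} WHERE ts >= '{}'".format(
        table, cutoff.strftime("%Y-%m-%d %H:%M:%S")
    )


# A fixed "now" keeps the example deterministic; a notebook would pass datetime.now().
query = relative_range_query(
    "temperature_humiditySensorEvent", 36, datetime(2021, 3, 2, 12, 0, 0)
)
print(query)
```

The resulting string can be handed to `spark.sql(...)` exactly as in the cell above.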



## Q3 - Filter by tag

Extract the temperature data from the cooking base area (sensor S1)

In [21]:
spark.sql("SELECT * FROM temperature_humiditySensorEvent WHERE sensor = 'S1'").show(5)

+------+-----------+--------+-------------------+
|sensor|temperature|humidity|                 ts|
+------+-----------+--------+-------------------+
|    S1|        290|      30|2021-02-26 15:33:26|
|    S1|        305|      38|2021-02-26 15:33:36|
|    S1|        280|      45|2021-02-26 15:33:45|
|    S1|        280|      22|2021-02-26 15:33:54|
|    S1|        285|      32|2021-02-26 15:34:03|
+------+-----------+--------+-------------------+
only showing top 5 rows



## Q4 - Filter By Value 

Extract the measurements from the cooking base area (sensor S1) with a temperature below 300°C

In [22]:
spark.sql("SELECT * FROM temperature_humiditySensorEvent WHERE sensor = 'S1' AND temperature < 300 ").show(5)

+------+-----------+--------+-------------------+
|sensor|temperature|humidity|                 ts|
+------+-----------+--------+-------------------+
|    S1|        290|      30|2021-02-26 15:33:26|
|    S1|        280|      45|2021-02-26 15:33:45|
|    S1|        280|      22|2021-02-26 15:33:54|
|    S1|        285|      32|2021-02-26 15:34:03|
|    S1|        290|      30|2021-02-27 09:20:27|
+------+-----------+--------+-------------------+
only showing top 5 rows



## Q5 - Grouping + Aggregator (mean)

#### Extract the average temperature and the average humidity along the different stages of the linear pizza oven

In [23]:
# A watermark bounds how late data may arrive. In complete output mode Spark keeps all aggregation state, so the watermark below does not drop any state.
All_time_averages_query = (decoded_streaming_temperature_humidity_df
                         .withWatermark("ts", "1 minutes")
                         .groupBy(col("sensor"))
                         .avg("humidity", "temperature")
                     .writeStream
                     .outputMode("complete")
                     .format("memory")
                     .queryName("group_query3")
                     .start())

In [26]:
# This query may take a while to produce results: if the DataFrame looks empty, re-run the cell after a few seconds
spark.sql("SELECT * FROM group_query3").show()

+------+-------------+----------------+
|sensor|avg(humidity)|avg(temperature)|
+------+-------------+----------------+
|    S2|         61.9|           106.5|
|    S1|         33.4|           288.0|
+------+-------------+----------------+



In [27]:
All_time_averages_query.stop()
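
Conceptually, a streaming average in complete mode keeps a small piece of state per key and re-emits the whole result table after every micro-batch. The sketch below illustrates that idea in plain Python; it is not Spark's implementation. `RunningAverages` is a hypothetical class, and the two batches are an arbitrary split of the first five temperature readings shown earlier.

```python
from collections import defaultdict


class RunningAverages:
    """Keep a (count, sum) pair per key and expose the current averages."""

    def __init__(self):
        self._state = defaultdict(lambda: [0, 0.0])  # key -> [count, sum]

    def update(self, batch):
        """Fold one micro-batch of (key, value) pairs into the state."""
        for key, value in batch:
            entry = self._state[key]
            entry[0] += 1
            entry[1] += value

    def result(self):
        """Return the full result table, as complete mode would emit it."""
        return {key: total / count for key, (count, total) in self._state.items()}


averages = RunningAverages()
averages.update([("S1", 290), ("S2", 105), ("S2", 110)])  # first micro-batch
after_first = averages.result()
averages.update([("S1", 305), ("S2", 120)])               # second micro-batch
after_second = averages.result()
print(after_first, after_second)
```

Because only the count and the sum are kept per key, the state stays small no matter how many events arrive.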

#### Extract the last humidity and temperature measurements from the cooking base area

In [66]:
# The outer "AND sensor = 'S1'" is needed because sensor S2 may have a record with the same ts
spark.sql("""SELECT * FROM temperature_humiditySensorEvent WHERE ts = (SELECT MAX(ts) FROM temperature_humiditySensorEvent 
            WHERE sensor = 'S1') AND sensor = 'S1'""").show()

+------+-----------+--------+--------------------+
|sensor|temperature|humidity|                  ts|
+------+-----------+--------+--------------------+
|    S1|        285|      32|+105664-03-19 10:...|
+------+-----------+--------+--------------------+



## Q6 - Aggregate Window

#### Extract the moving average temperature observed in the cooking base area over a window of 2 minutes (DEMO)

In [29]:
#note: this corresponds to a logical tumbling window
LTW_temperature_query = (decoded_streaming_temperature_humidity_df
                         .withWatermark("TS", "1 minutes")
                         .groupBy(window("TS", "2 minutes"),"sensor")
                         .avg("temperature")
                     .writeStream
                     .format("memory")
                     .queryName("results")
                     .start())

In [31]:
spark.sql("SELECT * FROM results WHERE sensor = 'S1' ORDER BY window ASC").show(5,False)

+------------------------------------------+------+----------------+
|window                                    |sensor|avg(temperature)|
+------------------------------------------+------+----------------+
|[2020-07-21 12:00:00, 2020-07-21 12:02:00]|S1    |297.5           |
|[2020-07-21 12:02:00, 2020-07-21 12:04:00]|S1    |280.0           |
|[2020-07-21 12:04:00, 2020-07-21 12:06:00]|S1    |285.0           |
|[2021-02-26 15:32:00, 2021-02-26 15:34:00]|S1    |288.75          |
|[2021-02-26 15:34:00, 2021-02-26 15:36:00]|S1    |285.0           |
+------------------------------------------+------+----------------+
only showing top 5 rows



In [32]:
LTW_temperature_query.stop()
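
A tumbling window assigns each event to exactly one fixed-size, epoch-aligned bucket. The plain-Python sketch below reproduces that assignment for the S1 readings shown in the Q3 output; `tumbling_window_start` is an illustrative helper, not a Spark function.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumbling_window_start(ts: datetime, size: timedelta) -> datetime:
    """Return the start of the epoch-aligned window of length `size` containing ts."""
    return ts - (ts - EPOCH) % size


# S1 readings copied from the Q3 output above.
s1_readings = [
    (datetime(2021, 2, 26, 15, 33, 26), 290),
    (datetime(2021, 2, 26, 15, 33, 36), 305),
    (datetime(2021, 2, 26, 15, 33, 45), 280),
    (datetime(2021, 2, 26, 15, 33, 54), 280),
    (datetime(2021, 2, 26, 15, 34, 3), 285),
]

size = timedelta(minutes=2)
buckets = defaultdict(list)
for ts, temperature in s1_readings:
    buckets[tumbling_window_start(ts, size)].append(temperature)

# Average temperature per window, keyed by window start.
window_averages = {
    start: sum(values) / len(values) for start, values in sorted(buckets.items())
}
print(window_averages)
```

The two averages agree with the 2021-02-26 rows of the result table above.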

#### Extract the moving average temperature observed by S2 over a window of 3 minutes (hands-on)

In [33]:
LTW_temperature_query2 = (decoded_streaming_temperature_humidity_df
                         .withWatermark("TS", "1 minutes")
                         .groupBy(window("TS", "3 minutes"),"sensor")
                         .avg("temperature")
                     .writeStream
                     .format("memory")
                     .queryName("results2")
                     .start())

In [35]:
spark.sql("SELECT * FROM results2 WHERE sensor = 'S2' ORDER BY window ASC").show(5,False)

+------------------------------------------+------+-----------------+
|window                                    |sensor|avg(temperature) |
+------------------------------------------+------+-----------------+
|[2020-07-21 12:00:00, 2020-07-21 12:03:00]|S2    |112.5            |
|[2020-07-21 12:03:00, 2020-07-21 12:06:00]|S2    |97.5             |
|[2021-02-26 15:33:00, 2021-02-26 15:36:00]|S2    |106.5            |
|[2021-02-27 09:18:00, 2021-02-27 09:21:00]|S2    |110.0            |
|[2021-02-27 09:21:00, 2021-02-27 09:24:00]|S2    |98.33333333333333|
+------------------------------------------+------+-----------------+
only showing top 5 rows



In [36]:
LTW_temperature_query2.stop()
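
The watermark used in the two queries above decides which late events still count. The following sketch is a simplified model under stated assumptions: the watermark is the maximum event time seen so far minus the allowed delay, an event is dropped once the end of its window is at or before the watermark, and the watermark advances after every event (Spark advances it between micro-batches). `watermark_trace` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def watermark_trace(event_times, window, delay):
    """Return (event_time, watermark, accepted) for events in arrival order."""
    max_seen = None
    trace = []
    for ts in event_times:
        watermark = max_seen - delay if max_seen is not None else None
        window_end = ts - (ts - EPOCH) % window + window
        # An event is kept as long as its window has not been closed yet.
        accepted = watermark is None or window_end > watermark
        trace.append((ts, watermark, accepted))
        if max_seen is None or ts > max_seen:
            max_seen = ts
    return trace


# Synthetic arrival order: the third event arrives too late for its window.
arrivals = [
    datetime(2020, 7, 21, 12, 0, 15),
    datetime(2020, 7, 21, 12, 4, 30),
    datetime(2020, 7, 21, 12, 1, 45),
    datetime(2020, 7, 21, 12, 3, 50),
]
trace = watermark_trace(arrivals, timedelta(minutes=3), timedelta(minutes=1))
accepted = [flag for _, _, flag in trace]
print(accepted)
```

In this model the third event is dropped because its window ended at 12:03:00, before the watermark of 12:03:30, while the fourth event is late but still falls into an open window.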

## Q7 - Map and custom function

#### Correct the temperature observations of the cooking base area by subtracting a delta of 5°C from each value

In [37]:
# To keep all the records, including those from the other sensor, one option is:

new_column = when(
        (col("sensor") == "S1"), col("temperature") - 5
    ).otherwise(col("temperature"))

map_temperature_query = (decoded_streaming_temperature_humidity_df
                         .withColumn("temperature", new_column)
                     .writeStream
                     .format("memory")
                     .queryName("results")
                     .start())

In [38]:
spark.sql("SELECT * FROM results").show(5,False)

+------+-----------+--------+-------------------+
|sensor|temperature|humidity|ts                 |
+------+-----------+--------+-------------------+
|S1    |285        |30      |2021-02-26 15:33:26|
|S2    |105        |55      |2021-02-26 15:33:30|
|S2    |110        |60      |2021-02-26 15:33:33|
|S1    |300        |38      |2021-02-26 15:33:36|
|S2    |120        |65      |2021-02-26 15:33:39|
+------+-----------+--------+-------------------+
only showing top 5 rows



In [39]:
map_temperature_query.stop()

In [40]:
#alternatively, if you'd like to keep only the values from sensor S1 a solution could be:
def sub5(x):
    x = x-5
    return x

df = decoded_streaming_temperature_humidity_df.select("*").where("sensor = 'S1'")
fun = udf(sub5, IntegerType())  # declare the return type; udf() defaults to StringType

map_temperature_query = (df
                         .withColumn("temperature", fun(df["temperature"]))
                     .writeStream
                     .format("memory")
                     .queryName("results")
                     .start())

In [41]:
spark.sql("SELECT * FROM results").show(5,False)

+------+-----------+--------+-------------------+
|sensor|temperature|humidity|ts                 |
+------+-----------+--------+-------------------+
|S1    |285        |30      |2021-02-26 15:33:26|
|S1    |300        |38      |2021-02-26 15:33:36|
|S1    |275        |45      |2021-02-26 15:33:45|
|S1    |275        |22      |2021-02-26 15:33:54|
|S1    |280        |32      |2021-02-26 15:34:03|
+------+-----------+--------+-------------------+
only showing top 5 rows



In [42]:
map_temperature_query.stop()
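
A Python UDF is called with `None` when the column value is NULL, so a custom function is safer when it handles that case explicitly. The plain-Python sketch below shows such a correction function; `corrected_temperature` is a hypothetical name, and the row with a missing temperature is made up for illustration. In the notebook it could be wrapped with `udf(..., IntegerType())`.

```python
def corrected_temperature(sensor, temperature, delta=5):
    """Subtract `delta` from S1 readings; leave other sensors and missing values untouched."""
    if temperature is None:
        return None
    return temperature - delta if sensor == "S1" else temperature


rows = [("S1", 290), ("S2", 105), ("S1", None)]
corrected = [(sensor, corrected_temperature(sensor, t)) for sensor, t in rows]
print(corrected)
```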

## Q8 - Stream-to-Stream Join

#### Extract the difference between the temperature of the base cooking area and the mozzarella melting area

### Join assuming synchronous time-series

Apply watermarks on event-time columns and other filters

In [43]:
only_S1_events = (decoded_streaming_temperature_humidity_df
                .withWatermark("ts", "1 minute")
                .filter(col("sensor") == "S1")
               )

only_S2_events = (decoded_streaming_temperature_humidity_df
                .withWatermark("ts", "1 minute")
                .filter(col("sensor") == "S2")
               )

Join with event-time constraints

In [44]:
join_df = (only_S1_events.join(
  only_S2_events,
    (only_S1_events.ts == only_S2_events.ts)) 
           .select(only_S1_events.temperature,
                   only_S2_events.temperature,
                   only_S1_events.humidity,
                   only_S2_events.humidity,
                   only_S1_events.ts
                  ))

In [45]:
s_to_s_join_query = (join_df
                     .writeStream
                     .format("memory")
                     .queryName("results")
                     .start())

In [46]:
spark.sql("SELECT * FROM results ORDER BY ts DESC").show(5,False)

+-----------+-----------+--------+--------+---+
|temperature|temperature|humidity|humidity|ts |
+-----------+-----------+--------+--------+---+
+-----------+-----------+--------+--------+---+



**IMPORTANT:** If we simply join on `ts`, the resulting DataFrame will always be empty, since the records are not synchronized!

In [47]:
s_to_s_join_query.stop()

### Join assuming a fixed delta

In [48]:
only_S1_events = (decoded_streaming_temperature_humidity_df
                .filter(col("sensor") == "S1")
                .select(col("ts").alias("S1_ts"), 
                        col("temperature").alias("S1_temperature"), col("humidity").alias("S1_humidity"))
                .withWatermark("S1_ts", "2 hours")
               )

only_S2_events = (decoded_streaming_temperature_humidity_df
                .filter(col("sensor") == "S2")
                .select(col("ts").alias("S2_ts"), 
                        col("temperature").alias("S2_temperature"), col("humidity").alias("S2_humidity"))
                .withWatermark("S2_ts", "2 hours")
               )

In [49]:
only_S1_query = (only_S1_events
                     .writeStream
                     .format("memory")
                     .queryName("results1")
                     .start())

only_S2_query = (only_S2_events
                     .writeStream
                     .format("memory")
                     .queryName("results2")
                     .start())

In [50]:
#join 
df = spark.sql("SELECT * FROM results1 join results2 ON S1_ts <= (S2_ts + INTERVAL 4 seconds) AND S1_ts >= S2_ts")
df.show(25)

+-------------------+--------------+-----------+-------------------+--------------+-----------+
|              S1_ts|S1_temperature|S1_humidity|              S2_ts|S2_temperature|S2_humidity|
+-------------------+--------------+-----------+-------------------+--------------+-----------+
|2021-02-26 15:33:36|           305|         38|2021-02-26 15:33:33|           110|         60|
|2021-02-26 15:33:45|           280|         45|2021-02-26 15:33:42|           115|         60|
|2021-02-26 15:33:54|           280|         22|2021-02-26 15:33:51|           115|         72|
|2021-02-26 15:34:03|           285|         32|2021-02-26 15:34:00|            90|         60|
|2021-02-27 09:20:37|           305|         38|2021-02-27 09:20:34|           110|         60|
|2021-02-27 09:20:46|           280|         45|2021-02-27 09:20:43|           115|         60|
|2021-02-27 09:20:55|           280|         22|2021-02-27 09:20:52|           115|         72|
|2021-02-27 09:21:04|           285|    

In [51]:
def diff(x, y):
    return x - y

fun = udf(diff)

#Calculating difference
df.withColumn("difference", fun(df["S1_temperature"], df["S2_temperature"])).show(25)

+-------------------+--------------+-----------+-------------------+--------------+-----------+----------+
|              S1_ts|S1_temperature|S1_humidity|              S2_ts|S2_temperature|S2_humidity|difference|
+-------------------+--------------+-----------+-------------------+--------------+-----------+----------+
|2021-02-26 15:33:36|           305|         38|2021-02-26 15:33:33|           110|         60|       195|
|2021-02-26 15:33:45|           280|         45|2021-02-26 15:33:42|           115|         60|       165|
|2021-02-26 15:33:54|           280|         22|2021-02-26 15:33:51|           115|         72|       165|
|2021-02-26 15:34:03|           285|         32|2021-02-26 15:34:00|            90|         60|       195|
|2021-02-27 09:20:37|           305|         38|2021-02-27 09:20:34|           110|         60|       195|
|2021-02-27 09:20:46|           280|         45|2021-02-27 09:20:43|           115|         60|       165|
|2021-02-27 09:20:55|           280| 

In [53]:
only_S2_query.stop()
only_S1_query.stop()

### Join exploiting time-windows 

In [54]:
only_S1_wind_events = (decoded_streaming_temperature_humidity_df
                .filter(col("sensor") == "S1")
                .select(col("ts").alias("S1_ts"), 
                        col("temperature").alias("S1_temperature"), col("humidity").alias("S1_humidity"))
                .withWatermark("S1_ts", "2 hours")
                       .groupBy(window("S1_ts", "10 seconds"))
                       .avg("S1_humidity")
               )

only_S2_wind_events = (decoded_streaming_temperature_humidity_df
                .filter(col("sensor") == "S2")
                .select(col("ts").alias("S2_ts"), 
                        col("temperature").alias("S2_temperature"), col("humidity").alias("S2_humidity"))
                .withWatermark("S2_ts", "2 hours")
                       .groupBy(window("S2_ts", "10 seconds"))
                       .avg("S2_humidity")
               )

In [71]:
only_S1_wind_query = (only_S1_wind_events
                     .writeStream
                     .format("memory")
                     .queryName("results1")
                     .start())

only_S2_wind_query = (only_S2_wind_events
                     .writeStream
                     .format("memory")
                     .queryName("results2")
                     .start())

In [79]:
#join 
df = spark.sql("SELECT * FROM results1 join results2 ON results1.window = results2.window")
df.show(25)

+--------------------+----------------+--------------------+------------------+
|              window|avg(S1_humidity)|              window|  avg(S2_humidity)|
+--------------------+----------------+--------------------+------------------+
|[2021-02-27 09:21...|            32.0|[2021-02-27 09:21...|              57.5|
|[2021-02-26 15:34...|            32.0|[2021-02-26 15:34...|58.333333333333336|
|[2021-03-01 20:05...|            22.0|[2021-03-01 20:05...|              68.5|
|[2021-02-26 15:33...|            38.0|[2021-02-26 15:33...|              60.0|
|[2021-02-27 09:20...|            38.0|[2021-02-27 09:20...|              57.5|
|[2021-03-01 20:06...|            32.0|[2021-03-01 20:06...|58.333333333333336|
|[2021-03-01 11:36...|            30.0|[2021-03-01 11:36...|              57.5|
|[2021-02-26 15:33...|            22.0|[2021-02-26 15:33...|              68.5|
|[2021-03-01 20:05...|            45.0|[2021-03-01 20:05...|              63.5|
|[2021-02-27 09:20...|            22.0|[

In [88]:
#Calculating difference
df = df.withColumn("difference", fun(df["avg(S2_humidity)"], df["avg(S1_humidity)"]))
df.show(25)

+--------------------+----------------+--------------------+------------------+------------------+
|              window|avg(S1_humidity)|              window|  avg(S2_humidity)|        difference|
+--------------------+----------------+--------------------+------------------+------------------+
|[2021-02-27 09:21...|            32.0|[2021-02-27 09:21...|              57.5|              25.5|
|[2021-02-26 15:34...|            32.0|[2021-02-26 15:34...|58.333333333333336|26.333333333333336|
|[2021-03-01 20:05...|            22.0|[2021-03-01 20:05...|              68.5|              46.5|
|[2021-02-26 15:33...|            38.0|[2021-02-26 15:33...|              60.0|              22.0|
|[2021-02-27 09:20...|            38.0|[2021-02-27 09:20...|              57.5|              19.5|
|[2021-03-01 20:06...|            32.0|[2021-03-01 20:06...|58.333333333333336|26.333333333333336|
|[2021-03-01 11:36...|            30.0|[2021-03-01 11:36...|              57.5|              27.5|
|[2021-02-

#### Extract the difference between the humidity levels of the base cooking area and the mozzarella melting area. Find if the differences are between 20 and 30

In [89]:
df.filter(df["difference"] > 20).filter(df["difference"] < 30).show(5)

+--------------------+----------------+--------------------+------------------+------------------+
|              window|avg(S1_humidity)|              window|  avg(S2_humidity)|        difference|
+--------------------+----------------+--------------------+------------------+------------------+
|[2021-02-27 09:21...|            32.0|[2021-02-27 09:21...|              57.5|              25.5|
|[2021-02-26 15:34...|            32.0|[2021-02-26 15:34...|58.333333333333336|26.333333333333336|
|[2021-02-26 15:33...|            38.0|[2021-02-26 15:33...|              60.0|              22.0|
|[2021-03-01 20:06...|            32.0|[2021-03-01 20:06...|58.333333333333336|26.333333333333336|
|[2021-03-01 11:36...|            30.0|[2021-03-01 11:36...|              57.5|              27.5|
+--------------------+----------------+--------------------+------------------+------------------+
only showing top 5 rows



In [90]:
only_S1_wind_query.stop()
only_S2_wind_query.stop()
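
Joining on time windows sidesteps the synchronisation problem: each stream is first averaged per 10-second bucket, and the buckets are then matched. A plain-Python sketch on the first five events (epoch seconds and humidity values taken from the decoded JSON shown earlier; `window_averages` is an illustrative helper):

```python
from collections import defaultdict

WINDOW_SECONDS = 10

# (epoch seconds, humidity) per sensor, from the first five decoded events.
s1_humidity = [(1614353606, 30), (1614353616, 38)]
s2_humidity = [(1614353610, 55), (1614353613, 60), (1614353619, 65)]


def window_averages(events, size):
    """Average the values per epoch-aligned window, keyed by window start."""
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[ts // size * size].append(value)
    return {start: sum(values) / len(values) for start, values in buckets.items()}


s1_avg = window_averages(s1_humidity, WINDOW_SECONDS)
s2_avg = window_averages(s2_humidity, WINDOW_SECONDS)

# Inner join on the window start, then the S2 - S1 difference per window.
joined = {
    start: (s1_avg[start], s2_avg[start], s2_avg[start] - s1_avg[start])
    for start in s1_avg.keys() & s2_avg.keys()
}
print(joined)
```

Only the window starting at 15:33:30 contains readings from both sensors, so it is the only one that survives the inner join.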

## Q9 - Stream-Static Join

Consider the following data, stored in a database:

```
CREATE DATABASE "pizza-erp";

CREATE TABLE public.oven
(
    pid bigint NOT NULL,
    kind character varying COLLATE pg_catalog."default" NOT NULL,
    enteringtime bigint NOT NULL,
    exitingtime bigint,
    sensor character varying COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT hoven_pkey PRIMARY KEY (pid,enteringtime,sensor)
);

INSERT INTO oven (pid,kind,enteringtime,exitingtime,sensor) VALUES(2,'napoli',1602504000000000000,1602504150000000000,'S1');
INSERT INTO oven (pid,kind,enteringtime,exitingtime,sensor) VALUES(1,'margherita',1602504010000000000,1602504080000000000,'S2');
INSERT INTO oven (pid,kind,enteringtime,exitingtime,sensor) VALUES(3,'pepperoni',1602504170000000000,1602504250000000000,'S1');
INSERT INTO oven (pid,kind,enteringtime,exitingtime,sensor) VALUES(2,'napoli',1602504130000000000,1602504284000000000,'S2');
```

Enrich the time series with the data in the DB.

In [59]:
from pyspark import SparkConf
from pyspark import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
#create the static-df
pizza_df = sc.parallelize([
    [2,'napoli', 1595332780,1595332820,'S1'],
    [1,'margherita',1595332795,1595332835,'S2'],
    [3,'pepperoni',1595332840,1595332880,'S1'],
    [2,'napoli',1595332825,1595332865,'S2']]
).toDF(["pid","kind","enteringtime","exitingtime","sensor"])

In [60]:
# Convert the Unix epoch seconds to timestamps

pizza_df = pizza_df.withColumn("enteringtime", to_timestamp(pizza_df["enteringtime"]))
pizza_df = pizza_df.withColumn("exitingtime", to_timestamp(pizza_df["exitingtime"]))
pizza_df.show()

+---+----------+-------------------+-------------------+------+
|pid|      kind|       enteringtime|        exitingtime|sensor|
+---+----------+-------------------+-------------------+------+
|  2|    napoli|2020-07-21 11:59:40|2020-07-21 12:00:20|    S1|
|  1|margherita|2020-07-21 11:59:55|2020-07-21 12:00:35|    S2|
|  3| pepperoni|2020-07-21 12:00:40|2020-07-21 12:01:20|    S1|
|  2|    napoli|2020-07-21 12:00:25|2020-07-21 12:01:05|    S2|
+---+----------+-------------------+-------------------+------+



In [61]:
join_df = decoded_streaming_temperature_humidity_df.join(pizza_df, (pizza_df.sensor == decoded_streaming_temperature_humidity_df.sensor) & 
                                                         (pizza_df.enteringtime <= decoded_streaming_temperature_humidity_df.ts) & 
                                                         (pizza_df.exitingtime >= decoded_streaming_temperature_humidity_df.ts))

In [62]:
join_query = (join_df
    .writeStream
    .format("memory")
    .queryName("join_Event")
    .start())

In [63]:
df = spark.sql("SELECT * FROM join_Event")
df.show(25)

+------+-----------+--------+-------------------+---+----------+-------------------+-------------------+------+
|sensor|temperature|humidity|                 ts|pid|      kind|       enteringtime|        exitingtime|sensor|
+------+-----------+--------+-------------------+---+----------+-------------------+-------------------+------+
|    S2|        105|      55|2020-07-21 12:00:15|  1|margherita|2020-07-21 11:59:55|2020-07-21 12:00:35|    S2|
|    S2|        110|      60|2020-07-21 12:00:45|  2|    napoli|2020-07-21 12:00:25|2020-07-21 12:01:05|    S2|
|    S2|        105|      55|2020-07-21 12:00:15|  1|margherita|2020-07-21 11:59:55|2020-07-21 12:00:35|    S2|
|    S2|        110|      60|2020-07-21 12:00:45|  2|    napoli|2020-07-21 12:00:25|2020-07-21 12:01:05|    S2|
|    S1|        290|      30|2020-07-21 12:00:00|  2|    napoli|2020-07-21 11:59:40|2020-07-21 12:00:20|    S1|
|    S1|        305|      38|2020-07-21 12:01:00|  3| pepperoni|2020-07-21 12:00:40|2020-07-21 12:01:20|

In [64]:
join_query.stop()
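
The join condition used above (same sensor, and `enteringtime <= ts <= exitingtime`) can be traced by hand. The sketch below does so in plain Python on the five 2020-07-21 events, using epoch seconds; `enrich` is an illustrative name, and the pizza rows are the ones passed to `sc.parallelize` above.

```python
# (pid, kind, enteringtime, exitingtime, sensor), as in the static DataFrame above.
pizzas = [
    (2, "napoli", 1595332780, 1595332820, "S1"),
    (1, "margherita", 1595332795, 1595332835, "S2"),
    (3, "pepperoni", 1595332840, 1595332880, "S1"),
    (2, "napoli", 1595332825, 1595332865, "S2"),
]

# (sensor, temperature, ts) for 12:00:00, 12:00:15, 12:00:45, 12:01:00, 12:01:15 UTC.
events = [
    ("S1", 290, 1595332800),
    ("S2", 105, 1595332815),
    ("S2", 110, 1595332845),
    ("S1", 305, 1595332860),
    ("S2", 120, 1595332875),
]


def enrich(events, pizzas):
    """Inner join: attach the pizza that was in the sensor's area at time ts."""
    return [
        (sensor, temperature, kind)
        for sensor, temperature, ts in events
        for _pid, kind, entering, exiting, pizza_sensor in pizzas
        if pizza_sensor == sensor and entering <= ts <= exiting
    ]


enriched = enrich(events, pizzas)
print(enriched)
```

The last S2 reading (12:01:15) falls outside every S2 interval, so the inner join drops it.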

## Clean up

In [65]:
temperature_humidity_query.stop()