# Structured Streaming Exercise

To run the exercises in this notebook, you first need to generate several CSV files in the data folder (`data/streaming`). Open a terminal from the Jupyter console, change the working directory to `ex3-structured-streaming`, and run `python generate_data.py`. The script creates the files one after another, so new input keeps arriving in the folder while it runs.
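The contents of `generate_data.py` are not shown in this notebook. As a rough idea of what such a generator could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the word list, the file names, the function names `write_batch` and `generate`, and the ISO timestamp format. The only things taken from the notebook are the two columns (`word`, `timestamp`) and the absence of a header row. The demo writes to a temporary folder rather than to `data/streaming`.

```python
import csv
import random
import tempfile
import time
from datetime import datetime
from pathlib import Path

# Hypothetical vocabulary; the real script may use different words.
WORDS = ["spark", "stream", "window", "batch"]


def write_batch(folder, index, rows=5):
    """Write one header-less CSV file of (word, timestamp) rows and return its path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"batch_{index:04d}.csv"
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        for _ in range(rows):
            # Assumed timestamp format, e.g. 2024-01-01T12:00:05
            stamp = datetime.now().isoformat(timespec="seconds")
            writer.writerow([random.choice(WORDS), stamp])
    return path


def generate(folder, n_files=3, pause_seconds=0.0):
    """Create n_files CSV files one after another, pausing between them."""
    paths = []
    for index in range(n_files):
        paths.append(write_batch(folder, index))
        time.sleep(pause_seconds)
    return paths


if __name__ == "__main__":
    # Demo only: write into a throwaway folder, not into data/streaming.
    out_dir = Path(tempfile.mkdtemp()) / "streaming"
    for created in generate(out_dir, n_files=3, pause_seconds=0.1):
        print(created.name)
```

Writing the files with a pause between them is what lets a streaming reader see the folder grow over time instead of finding all the data at once.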

In [4]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

## 1. Quick Example

In [5]:
# Create SparkSession

spark = SparkSession.builder.appName("structured-streaming")\
.master("local[*]").getOrCreate()

In [11]:
# Create Schema

schema = T.StructType([T.StructField("word", T.StringType(), True),
                      T.StructField("timestamp", T.TimestampType(), True)])

In [12]:
# Read data from folder

csvDF = spark.readStream \
    .option("sep", ",") \
    .schema(schema) \
    .csv("../data/streaming")

In [13]:
# Count words per sliding window (3-minute windows that slide every 10 seconds)

wordCounts = csvDF.groupBy(F.window(F.col("timestamp"), "3 minutes", "10 seconds"),
                           "word").count()

In [None]:
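# Start the query and print the complete result table to the console on every trigger

query = wordCounts.writeStream.outputMode("complete")\
    .format("console").start()

# Block this cell until the query is stopped or fails
query.awaitTermination()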
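Each row printed by the query is a count for one (window, word) pair. Because the aggregation uses `F.window(F.col("timestamp"), "3 minutes", "10 seconds")`, the windows overlap, and a single event is counted in every window that contains its timestamp. The sketch below reproduces that assignment in plain Python to make the overlap visible. It is an illustration, not Spark's implementation: the helper name `sliding_windows` is hypothetical, and it assumes windows are half-open intervals `[start, end)` aligned to the Unix epoch with no start offset.

```python
from datetime import datetime, timedelta, timezone


def sliding_windows(event_time, window, slide):
    """Return the [start, end) windows, aligned to the Unix epoch, that contain event_time."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Distance from the most recent slide boundary at or before the event
    offset = (event_time - epoch) % slide
    start = event_time - offset
    windows = []
    # Walk backwards one slide at a time while the window still covers the event
    while start > event_time - window:
        windows.append((start, start + window))
        start -= slide
    return list(reversed(windows))


event = datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc)
windows = sliding_windows(event, timedelta(minutes=3), timedelta(seconds=10))

print(len(windows))  # 18
for start, end in (windows[0], windows[-1]):
    print(start.isoformat(), "->", end.isoformat())
```

With a 3-minute window and a 10-second slide, every event lands in 180 / 10 = 18 windows, so a single word occurrence contributes to 18 rows of the result table.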