# EX9-STREAM: Spark Structured Streaming + Kafka

Your assignment: complete the `TODO`s and include the **output of each cell**.

#### You may need to read the [Structured Streaming API Documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html) to complete this lab.

### Before starting this exercise: (1) start the `kafka` stack; (2) start the `kafkafakestream` stack.

### Step 1: **[PLAN A]** Start Spark Session

In [None]:
from pyspark.sql import SparkSession

# Stop any session left over from a previous run of this notebook
try:
    spark.stop()
except NameError:
    print("No existing SparkSession to stop")

# cluster mode (and minio for distributed file system)
spark = SparkSession.builder \
    .appName("Spark SQL basic example") \
    .master("spark://spark:7077") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.4") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "pdm_minio") \
    .config("spark.hadoop.fs.s3a.secret.key", "pdm_minio") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
    .getOrCreate()

### Step 2: **[PLAN A]** Create stream of pizza orders from Kafka

In [None]:
# Read from Kafka
df_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "pizza") \
    .option("startingOffsets", "earliest") \
    .load()

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (
    ArrayType, IntegerType, LongType, StringType, StructField, StructType,
)

schema = StructType([
    StructField("id", IntegerType()),
    StructField("shop", StringType()),
    StructField("name", StringType()),
    StructField("phoneNumber", StringType()),
    StructField("address", StringType()),
    StructField("pizzas", ArrayType(
        StructType([
            StructField("pizzaName", StringType()),
            StructField("additionalToppings", ArrayType(StringType()))
        ])
    )),
    StructField("timestamp", LongType())  # This is in epoch millis
])


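# Kafka delivers the payload as bytes: cast it to a string, then parse the JSON with the schema
df_stream = df_stream.selectExpr("CAST(value AS STRING) as json_str")
df_stream = df_stream.select(from_json(col("json_str"), schema).alias("data")).select("data.*")

# Print each micro-batch to the console, triggering once per second
df_stream_writer = df_stream.writeStream.format("console").outputMode("append")
df_stream_writer = df_stream_writer.trigger(processingTime="1 second")
df_stream_query = df_stream_writer.start()
# Block for at most 10 seconds; the query keeps running afterwards until stop() is called
df_stream_query.awaitTermination(10)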

### Step 3: Explore the example above, change parameters, see the results

This is an open exercise: show your work and explain the output. The fake Kafka stream has other options, such as subject, number of messages, and waiting time.
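
When explaining the output, it can help to check by hand what a single message looks like once it is parsed against the Step 2 schema. The sketch below uses only the Python standard library, so it runs without Spark or Kafka. The sample record and the helper `parse_order` are hypothetical: the field names follow the schema above, but the values are made up and are not actual output from the fake stream.

```python
import json
from datetime import datetime, timezone

# Hypothetical sample message value: field names follow the Step 2 schema,
# the values are invented for illustration only.
sample_value = b'''{
  "id": 1, "shop": "Example Pizza", "name": "Jane Doe",
  "phoneNumber": "000-000-0000", "address": "1 Example Street",
  "pizzas": [{"pizzaName": "Margherita", "additionalToppings": ["olives", "basil"]}],
  "timestamp": 1700000000000
}'''


def parse_order(raw: bytes) -> dict:
    """Decode a message value and add the timestamp as a UTC datetime."""
    order = json.loads(raw.decode("utf-8"))
    # The schema stores the timestamp as epoch milliseconds, so divide by 1000
    order["event_time"] = datetime.fromtimestamp(order["timestamp"] / 1000, tz=timezone.utc)
    return order


order = parse_order(sample_value)
# Flatten the toppings of every pizza in the order into one list
toppings = [t for p in order["pizzas"] for t in p["additionalToppings"]]
print(order["shop"], order["event_time"].isoformat(), toppings)
```

The same two ideas carry over to the streaming DataFrame: `pizzas` is an array of structs, and `timestamp` needs a division by 1000 before it can be treated as a time in seconds.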