1. Run the Docker image and check whether the Kafka server has any topics defined:
    - open a terminal and, in the jupyterlab directory, run `docker compose up`
    - in a second terminal window, enter the command:

```bash
docker exec broker kafka-topics --list --bootstrap-server broker:9092
```

2. Add the topic `streaming`:
```bash
docker exec broker kafka-topics --bootstrap-server broker:9092 --create --topic streaming
```

3. Check the list of topics again to make sure the topic `streaming` exists.

4. Open a new terminal on your computer and start a console producer, then type a few messages to send to the new topic:

```bash
docker exec --interactive --tty broker kafka-console-producer --bootstrap-server broker:9092 --topic streaming
```

5. To check that sending messages works, open another terminal window and start the console consumer with the following command:

```bash
docker exec --interactive --tty broker kafka-console-consumer --bootstrap-server broker:9092 --topic streaming --from-beginning
```

Complete the script so that it generates the data:

1. Create a `message` variable: a dictionary holding the information for a single event (key: value):
    - "time": current time + `timedelta(seconds=random.randint(-15, 0))`
    - "id": randomly selected from the list ["a", "b", "c", "d", "e"]
    - "value": random value between 0 and 100

```python
%%file stream.py

import json
import random
from datetime import datetime, timedelta
from time import sleep

from kafka import KafkaProducer

if __name__ == "__main__":
    SERVER = "broker:9092"

    # Serialize every message dictionary to UTF-8 encoded JSON.
    producer = KafkaProducer(
        bootstrap_servers=[SERVER],
        value_serializer=lambda x: json.dumps(x).encode("utf-8"),
        api_version=(2, 7, 0),
    )

    try:
        while True:
            # Event time lags the current time by 0 to 15 seconds.
            t = datetime.now() + timedelta(seconds=random.randint(-15, 0))

            message = {
                "time": str(t),
                "id": random.choice(["a", "b", "c", "d", "e"]),
                "values": random.randint(0, 100),
            }

            producer.send("streaming", value=message)
            sleep(1)
    except KeyboardInterrupt:
        # Stop on Ctrl+C and release the connection.
        producer.close()
```
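
To see what actually travels through the topic, the `value_serializer` step from `stream.py` can be tried on its own, without a broker. The sketch below builds one message in the same shape and round-trips it through JSON. The names `make_message`, `serialize`, and `deserialize` are illustrative helpers, not part of `kafka-python`.

```python
import json
import random
from datetime import datetime, timedelta


def make_message(now=None):
    """Build one event dictionary in the same shape as stream.py."""
    now = now or datetime.now()
    # Event time lags the reference time by 0 to 15 seconds.
    t = now + timedelta(seconds=random.randint(-15, 0))
    return {
        "time": str(t),
        "id": random.choice(["a", "b", "c", "d", "e"]),
        "values": random.randint(0, 100),
    }


def serialize(message):
    """Same transformation as the value_serializer lambda: dict -> UTF-8 JSON bytes."""
    return json.dumps(message).encode("utf-8")


def deserialize(payload):
    """Inverse step a consumer could apply: UTF-8 JSON bytes -> dict."""
    return json.loads(payload.decode("utf-8"))


message = make_message()
payload = serialize(message)
print(payload)
print(deserialize(payload) == message)
```

The console consumer prints the same JSON text that `serialize` produces, one line per message.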


2. In a JupyterLab terminal, run the `stream.py` file:
```bash
python stream.py
```
Check in the consumer window whether the sent messages arrive in Kafka.

The `kafka` import is provided by the `kafka-python` library,
which you can install with `pip install kafka-python`.
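
Before running `stream.py`, it can help to confirm that the package is visible to the interpreter you are using. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name. Note that the package is installed as `kafka-python` but imported as `kafka`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# The import name differs from the name used with pip.
if is_installed("kafka"):
    print("kafka-python is available")
else:
    print("kafka-python is missing; run: pip install kafka-python")
```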

```python
%%file raw_app.py

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType, StringType, IntegerType

import pyspark.sql.functions as f


SERVER = "broker:9092"

if __name__ == "__main__":
    ## create spark variable
    #YOUR CODE HERE
    spark = SparkSession.builder.getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    # Schema of the JSON payload sent by stream.py.
    json_schema = StructType(
        [
            StructField("time", TimestampType()),
            StructField("id", StringType()),
            StructField("values", IntegerType()),
        ]
    )

    # Read the `streaming` topic as an unbounded DataFrame.
    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", SERVER)
        .option("subscribe", "streaming")
        .load()
    )

    # Decode the Kafka value bytes as JSON and flatten the fields.
    parsed = raw.select(
        "timestamp", f.from_json(raw.value.cast("string"), json_schema).alias("json")
    ).select(
        f.col("timestamp").alias("proc_time"),
        f.col("json").getField("time").alias("event_time"),
        f.col("json").getField("id").alias("id"),
        f.col("json").getField("values").alias("value"),
    )

    # Running count of events per id.
    info = parsed.groupBy("id").count()

    query = (
        info.writeStream
        .outputMode("complete")
        .format("console")
        .start()
    )

    query.awaitTermination()
    query.stop()
```


Run the streaming analysis:
```bash
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 raw_app.py
```
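
With `outputMode("complete")`, the console sink reprints the whole table of running counts after every micro-batch, not only the rows that changed. The pure-Python sketch below imitates that behaviour on hand-written batches to make the console output easier to interpret. It does not use Spark; the batches and the function name `complete_mode_counts` are made up for illustration.

```python
from collections import Counter


def complete_mode_counts(batches):
    """Yield the full running count per id after each micro-batch."""
    totals = Counter()
    for batch in batches:
        totals.update(event["id"] for event in batch)
        # Complete mode emits every group each time, not only the changed ones.
        yield dict(sorted(totals.items()))


batches = [
    [{"id": "a"}, {"id": "b"}],
    [{"id": "a"}],
    [{"id": "c"}, {"id": "a"}],
]

for number, table in enumerate(complete_mode_counts(batches)):
    print(f"Batch: {number}", table)
```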

Modify the program `raw_app.py` by changing the schema:
```python
    json_schema = StructType(
        [
            StructField("time", TimestampType()),
            StructField("id", StringType()),
            StructField("value", IntegerType()),
        ]
    )
    
```
and the parsed stream:

```python
    parsed = raw.select(
        "timestamp", f.from_json(raw.value.cast("string"), json_schema).alias("json")
    ).select(
        f.col("timestamp").alias("proc_time"),
        f.col("json").getField("time").alias("event_time"),
        f.col("json").getField("id").alias("id"),
        f.col("json").getField("value").alias("value"),
    )

```
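
Spark's `from_json` returns null for any schema field that is missing from the JSON payload, so a field name in `json_schema` has to match the key written by the producer. The sketch below imitates that lookup in plain Python; `parse_with_schema` is a hypothetical helper, not a Spark function, and the payload is invented for the example.

```python
import json


def parse_with_schema(payload, field_names):
    """Pick the named fields from a JSON string; missing fields become None."""
    record = json.loads(payload)
    return {name: record.get(name) for name in field_names}


payload = json.dumps({"time": "2024-01-01 12:00:00", "id": "a", "value": 42})

# Field names match the payload: every column is filled.
print(parse_with_schema(payload, ["time", "id", "value"]))

# "values" is not a key in this payload, so that column is None.
print(parse_with_schema(payload, ["time", "id", "values"]))
```

If the `value` column shows only nulls in the console output, compare the key names in `stream.py` with the field names in the schema.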

## Count the number of events by group ID