# Processamento de Streams 2024
## TP1 - Energy Meter Monitoring


The sensor data consists of periodic readings from 11 residential energy meters. The data covers the month of February 2024 and is streamed from Kafka.

Each data sample has the following schema:

timestamp | sensor_id | energy
----------|-------------|-----------
timestamp | string  | float

Each energy value (kWh) is the accumulated value of the meter at the time of measurement. As such,
each meter is expected to produce a monotonically increasing series of (timestamp, energy) pairs, where energy is the total consumed up to that moment.

The meters do not start at zero or at the same value.
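Because the readings are cumulative and each meter starts from its own offset, the energy consumed over an interval is the difference between the last and the first reading in that interval, not the sum of the readings. A minimal plain-Python sketch of that idea follows; the function name `consumption_per_sensor`, the sensor ids and the sample values are hypothetical and are not taken from the dataset.

```python
def consumption_per_sensor(readings):
    """Return energy consumed per sensor as last minus first cumulative reading.

    readings: iterable of (timestamp, sensor_id, cumulative_kwh) tuples.
    """
    first, last = {}, {}
    # Sorting the tuples orders them by timestamp first.
    for _ts, sensor_id, energy in sorted(readings):
        first.setdefault(sensor_id, energy)  # keep the earliest reading
        last[sensor_id] = energy             # overwrite with the latest reading
    return {sensor: last[sensor] - first[sensor] for sensor in first}


# Hypothetical sample readings (timestamp, sensor_id, cumulative kWh).
readings = [
    ("2024-02-01 00:00:00", "s1", 1500.0),
    ("2024-02-01 00:05:00", "s1", 1500.5),
    ("2024-02-01 00:00:00", "s2", 320.0),
    ("2024-02-01 00:05:00", "s2", 320.25),
    ("2024-02-01 00:10:00", "s1", 1501.0),
]
totals = consumption_per_sensor(readings)
print(totals)
```

Since each series is expected to be monotonically increasing, the same result could also be obtained as the maximum minus the minimum reading per sensor, which is usually easier to express as a streaming aggregation.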

The contracted energy provider is [SU Eletricidade](https://sueletricidade.pt/en/home).

The cost of energy (in € per kWh) varies depending on the time of day, according to the table below:

vazio (off-peak) | super-vazio (super off-peak) | cheias (standard) | ponta (peak) |
------|-------------|--------|-------|
0.1072€| 0.1072€ | 0.1741€ | 0.2400€|

The plan corresponds to the [daily schedule tariff](https://sueletricidade.pt/en/schedules/546/daily-and-weekly-timetable), so the schedule is the same
for all days of the week.
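One way to apply the tariff is a lookup from time of day to period name, and from period name to price. The sketch below uses the prices from the table above, but the period boundaries in `SCHEDULE` are placeholders chosen only for illustration; the actual boundaries must be taken from the linked daily schedule. The names `SCHEDULE`, `tariff_period` and `cost` are hypothetical.

```python
from datetime import time

# Prices in EUR per kWh, copied from the tariff table.
PRICE_EUR_PER_KWH = {
    "vazio": 0.1072,
    "super-vazio": 0.1072,
    "cheias": 0.1741,
    "ponta": 0.2400,
}

# ASSUMPTION: illustrative period start times only, sorted by time of day.
# Replace them with the boundaries published in the provider's daily schedule.
SCHEDULE = [
    (time(0, 0), "vazio"),
    (time(2, 0), "super-vazio"),
    (time(6, 0), "vazio"),
    (time(8, 0), "cheias"),
    (time(9, 0), "ponta"),
    (time(10, 30), "cheias"),
    (time(18, 0), "ponta"),
    (time(20, 30), "cheias"),
    (time(22, 0), "vazio"),
]


def tariff_period(t):
    """Return the period whose start time is the latest one not after t."""
    name = SCHEDULE[0][1]
    for start, period in SCHEDULE:
        if t >= start:
            name = period
    return name


def cost(kwh, t):
    """Cost in EUR of consuming `kwh` at time of day `t`."""
    return kwh * PRICE_EUR_PER_KWH[tariff_period(t)]


print(tariff_period(time(19, 0)), cost(2.0, time(19, 0)))
```

Because the plan uses the same schedule every day, the lookup depends only on the time of day and not on the date.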

## Questions

For each sensor, separately:

1. Compute the running total energy consumed so far, for the month. The value should be updated every 5 minutes. (Sorted in descending order by value and sensor.)

2. Compute the running total energy consumed so far, for the day. The value should be updated every 5 minutes. (Sorted in descending order by value and sensor.)

3. For the current day, compute the total energy used in each half hour period. The value should be updated every 5 minutes. (Sorted by period; a column for each sensor)

4. Compute the running total expense for the day. The value should be updated every minute. (Sorted in descending order by value and sensor.)
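For question 3, each reading has to be assigned to a half-hour period and the increments within each period summed. The plain-Python sketch below shows that bucketing logic for a single sensor, outside of Spark; the helper names and the sample readings are hypothetical, and attributing each increment to the period of the later reading is one possible convention, not a requirement of the assignment.

```python
from datetime import datetime


def half_hour_period(ts):
    """Floor a timestamp to the start of its 30-minute period."""
    return ts.replace(minute=(ts.minute // 30) * 30, second=0, microsecond=0)


def energy_per_period(readings):
    """Sum energy increments per half-hour period for ONE sensor.

    readings: time-ordered (datetime, cumulative_kwh) pairs.
    Each increment is attributed to the period containing the later reading.
    """
    totals = {}
    prev = None
    for ts, energy in readings:
        if prev is not None:
            period = half_hour_period(ts)
            totals[period] = totals.get(period, 0.0) + (energy - prev)
        prev = energy
    return totals


# Hypothetical cumulative readings for one sensor.
readings = [
    (datetime(2024, 2, 1, 0, 0), 100.0),
    (datetime(2024, 2, 1, 0, 10), 100.5),
    (datetime(2024, 2, 1, 0, 25), 101.0),
    (datetime(2024, 2, 1, 0, 40), 101.25),
    (datetime(2024, 2, 1, 0, 55), 102.0),
]
per_period = energy_per_period(readings)
for period, kwh in sorted(per_period.items()):
    print(period, kwh)
```

In a Structured Streaming solution, the same flooring is typically expressed with a 30-minute event-time window rather than a hand-written helper.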



## Requirements

Solve each question using Spark Structured Streaming.

## Other Grading Criteria

+ Grading will also take into account the overall clarity of the code and of the presentation report (notebook).




### Deadline

26th April + 1/2 day - ***no penalty***

Each day late incurs a ***0.5 / day penalty***. The penalty accumulates until the grade of the assignment reaches 8.0.

---
### Colab Setup


In [None]:
#@title Mount Google Drive (Optional)
from google.colab import drive
drive.mount('/content/drive')

In [1]:
#@title Install PySpark
!pip install pyspark findspark --quiet
import findspark
findspark.init()
findspark.find()

'/usr/local/lib/python3.10/dist-packages/pyspark'

In [None]:
#@title Install & Launch Kafka
%%bash
KAFKA_VERSION=3.7.0
KAFKA=kafka_2.12-$KAFKA_VERSION
wget -q -O /tmp/$KAFKA.tgz https://dlcdn.apache.org/kafka/$KAFKA_VERSION/$KAFKA.tgz
tar xfz /tmp/$KAFKA.tgz
wget -q -O $KAFKA/config/server1.properties https://github.com/smduarte/ps2024/raw/main/colab/server1.properties

UUID=`$KAFKA/bin/kafka-storage.sh random-uuid`
$KAFKA/bin/kafka-storage.sh format -t $UUID -c $KAFKA/config/server1.properties
$KAFKA/bin/kafka-server-start.sh -daemon $KAFKA/config/server1.properties

### Energy sensor data publisher
This is a small Python Kafka client that publishes a continuous stream of text lines, obtained from the periodic output of the sensors.

* The Kafka server is accessible at `localhost:9092`
* The events are published to the `energy` topic
* Events are published 60x faster than real time, relative to their timestamps
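The 60x speedup means that event time and wall-clock time differ by a constant factor, which matters when choosing trigger intervals and timeouts. The arithmetic can be sketched as follows; the helper name `wall_clock_seconds` is hypothetical, and whether the update intervals in the questions refer to event time or to wall-clock time is left to the reader's interpretation of the assignment.

```python
SPEEDUP = 60  # matches the publisher's --speedup 60 setting


def wall_clock_seconds(event_time_seconds, speedup=SPEEDUP):
    """Wall-clock seconds needed to replay a span of event time."""
    return event_time_seconds / speedup


five_minutes = wall_clock_seconds(5 * 60)      # 5 min of data
one_day = wall_clock_seconds(24 * 3600)        # 1 day of data
february = wall_clock_seconds(29 * 24 * 3600)  # February 2024 has 29 days

print(five_minutes, one_day / 60, february / 3600)
```

Under this factor, five minutes of sensor data arrive in five seconds, one day arrives in 24 minutes, and the full month takes about 11.6 hours to replay.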


In [None]:
#@title Start Kafka Publisher
%%bash
pip install kafka-python dataclasses --quiet
wget -q -O - https://github.com/smduarte/ps2024/raw/main/colab/kafka-tp1-logsender.tgz | tar xfz - 2> /dev/null
wget -q -O data-sorted.csv https://github.com/smduarte/ps2024/raw/main/tp1/data-sorted.csv

nohup python kafka-tp1-logsender/publisher.py --filename data-sorted.csv --topic energy  --speedup 60 2> kafka-publisher-error.log > kafka-publisher-out.log &

In [None]:
#@title Python Kafka client (For Debugging)
!pip -q install confluent-kafka
from confluent_kafka import Consumer

conf = {'bootstrap.servers': 'localhost:9092',
        'group.id': '*',
        'enable.auto.commit': False,
        'auto.offset.reset': 'earliest'}

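# Create the consumer outside the try block so that `consumer` is always
# defined when the finally clause runs.
consumer = Consumer(conf)
try:
  consumer.subscribe(['energy'])

  # Poll until the cell is interrupted; skip empty polls and report errors.
  while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None: continue
    if msg.error():
      print(msg.error())
      continue
    print(msg.value())
finally:
  consumer.close()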

The Python code below shows the basics needed to process JSON data from a Kafka source using PySpark.

The PySpark streaming documentation is available [here](https://spark.apache.org/docs/latest/api/python/reference/pyspark.streaming.html).

---
#### PySpark Kafka Stream Example


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

def dumpBatchDF(df, epoch_id):
    # Print up to 20 rows of each micro-batch without truncating column values.
    df.show(20, False)

spark = SparkSession \
    .builder \
    .appName('Kafka Spark Structured Streaming Example') \
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1') \
    .getOrCreate()

lines = spark \
  .readStream \
  .format('kafka') \
  .option('kafka.bootstrap.servers', 'localhost:9092') \
  .option('subscribe', 'energy') \
  .option('startingOffsets', 'earliest') \
  .load() \
  .selectExpr('CAST(value AS STRING)')


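# Schema of the JSON payload carried in each Kafka message value.
schema = StructType([StructField('timestamp', TimestampType(), True),
                     StructField('sensor_id', StringType(), True),
                     StructField('energy', FloatType(), True)])

# Parse the JSON string and flatten its fields into top-level columns.
lines = lines.select( from_json(col('value'), schema).alias('data')).select('data.*')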

query = lines \
    .writeStream \
    .outputMode('append') \
    .foreachBatch(dumpBatchDF) \
    .start()

query.awaitTermination(600)
query.stop()
spark.stop()