<a href="https://colab.research.google.com/github/smduarte/spbd-2425/blob/main/docs/labs/projs/spbd2425_tp2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sistemas para Processamento de Big Data
## TP2 - Energy Meter Live Monitoring


The sensor data corresponds to regular readings from 11 residential energy meters. The data covers the month of February 2024.

Each data sample has the following schema:

timestamp | sensor_id | energy
----------|-------------|-----------
timestamp | string  | float

Each energy value (kWh) is the accumulated reading of the meter at the time of measurement. As such,
each meter is expected to produce a monotonically increasing series of pairs of timestamp and energy consumed up to that moment.

The meters do not start at zero or at the same value.
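
Because the readings are cumulative and every meter starts from its own offset, the energy consumed over a period is the difference between a meter's readings, not the raw value. A minimal sketch in plain Python, assuming readings arrive as `(timestamp, sensor_id, energy)` tuples; the function name `consumption_per_sensor` and the sample values are illustrative, not part of the assignment:

```python
def consumption_per_sensor(readings):
    """Return {sensor_id: energy consumed} from cumulative meter readings.

    `readings` is an iterable of (timestamp, sensor_id, energy) tuples.
    Consumption is the spread between the largest and smallest cumulative
    value seen for each sensor, which assumes each series is monotonic.
    """
    lowest = {}
    highest = {}
    for _timestamp, sensor_id, energy in readings:
        lowest[sensor_id] = min(energy, lowest.get(sensor_id, energy))
        highest[sensor_id] = max(energy, highest.get(sensor_id, energy))
    return {s: round(highest[s] - lowest[s], 3) for s in lowest}


# Hypothetical readings: the two meters start at different offsets.
sample = [
    ("2024-02-01 00:00:00", "D", 2615.0),
    ("2024-02-01 00:00:18", "C", 1098.8),
    ("2024-02-01 00:10:00", "D", 2615.5),
    ("2024-02-01 00:10:18", "C", 1099.0),
]
print(consumption_per_sensor(sample))
```

Subtracting the first reading of the period removes each meter's starting offset, so sensors with very different absolute values become comparable.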


## Questions

For all the sensors combined:

1. For the current month and current day, compute the running total energy consumed so far. The values should be updated every 5 minutes.

2. For the current month and current day, compute the running total energy consumed so far, **as a percentage**, **compared to the same periods in February 2024**. The values should be updated every 5 minutes.
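
The logic behind questions 1 and 2 can be sketched in plain Python, independent of Spark. The sketch assumes time-ordered `(datetime, sensor_id, cumulative_energy)` tuples and assumes the February 2024 reference value for the same point in the period has already been computed; `floor_to_bucket`, `running_totals`, and `percent_of_reference` are hypothetical helper names. In Structured Streaming the same bucketing would usually be expressed with a time window rather than by hand.

```python
from datetime import datetime, timedelta


def floor_to_bucket(ts, minutes=5):
    """Floor a datetime to the start of its `minutes`-wide bucket."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)


def running_totals(readings, minutes=5):
    """Map each bucket start to the total consumed by all sensors so far.

    Each sensor contributes its latest reading minus its first reading,
    so the meters' different starting offsets cancel out.
    """
    first, latest, totals = {}, {}, {}
    for ts, sensor, energy in readings:
        first.setdefault(sensor, energy)
        latest[sensor] = energy
        total = sum(latest[s] - first[s] for s in latest)
        # Later readings in the same bucket overwrite earlier ones.
        totals[floor_to_bucket(ts, minutes)] = round(total, 3)
    return totals


def percent_of_reference(current, reference):
    """Express `current` as a percentage of `reference` (None if undefined)."""
    if not reference:
        return None
    return 100.0 * current / reference


# Hypothetical readings for two sensors.
readings = [
    (datetime(2024, 10, 1, 0, 1), "A", 100.0),
    (datetime(2024, 10, 1, 0, 2), "B", 50.0),
    (datetime(2024, 10, 1, 0, 6), "A", 101.0),
    (datetime(2024, 10, 1, 0, 7), "B", 50.5),
    (datetime(2024, 10, 1, 0, 12), "A", 102.0),
]
for bucket, total in running_totals(readings).items():
    print(bucket, total)
```

Buckets without any reading are simply absent from the result; a real solution would have to decide whether to carry the previous total forward.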

For each sensor, separately:

3. For the current month and current day, compute the running total energy consumed so far, as a percentage, **comparing the value of each individual sensor, relative to the same results for all the sensors together (as in #1)**. The values should be updated every 5 minutes. (Sorted in descending order by value and sensor.)
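
The per-sensor percentage in question 3 can be illustrated with a short plain-Python sketch. It assumes the energy consumed so far by each sensor is already known, and it reads "descending order by value and sensor" as descending on both keys, which is an interpretation rather than something the statement spells out; `sensor_shares` is a hypothetical helper name.

```python
def sensor_shares(consumed):
    """Return [(sensor, percent_of_total)], sorted descending by value, then sensor.

    `consumed` maps sensor_id -> energy consumed so far in the period.
    """
    total = sum(consumed.values())
    if total == 0:
        return []  # no consumption yet, so shares are undefined
    shares = [(s, round(100.0 * v / total, 2)) for s, v in consumed.items()]
    return sorted(shares, key=lambda pair: (pair[1], pair[0]), reverse=True)


# Hypothetical consumption per sensor (kWh).
consumed = {"A": 2.0, "B": 1.0, "C": 1.0}
for sensor, share in sensor_shares(consumed):
    print(sensor, share)
```

Sorting on the tuple `(value, sensor)` breaks ties between sensors with equal shares deterministically.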



## Requirements

Solve each question using Spark Structured Streaming.

## Other Grading Criteria

+ Grading will also take into account the general clarity of the programming and of the presentation report (notebook).




### Deadline

December 6.

Penalty of 0.25 grade points per day late.

Penalty accumulates until the grade of the assignment reaches 8.0.

---
### Colab Setup


In [1]:
#@title Install PySpark
!pip install pyspark --quiet



The Python code below shows the basics needed to process JSON data from a socket source using PySpark.

The Spark Streaming Python documentation is available [here](https://spark.apache.org/docs/latest/api/python/reference/pyspark.streaming.html).
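
Each message on the socket is a single JSON object with `date`, `sensor`, and `energy` fields, the same shape as the `sample_json` string used in the PySpark example. For checking values outside Spark, one line can be parsed with the standard library; this is a minimal sketch, and `parse_reading` is a hypothetical helper name:

```python
import json
from datetime import datetime


def parse_reading(line):
    """Parse one JSON line into a (datetime, sensor, energy) tuple."""
    record = json.loads(line)
    timestamp = datetime.strptime(record["date"], "%Y-%m-%d %H:%M:%S")
    return timestamp, record["sensor"], float(record["energy"])


line = '{"date": "2024-02-01 00:00:00", "sensor": "D", "energy": 2615.0}'
print(parse_reading(line))
```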

---
#### PySpark Socket Stream Example


In [8]:
#@title Download Archived February Energy Readings
!wget -q -O /tmp/readings.csv https://raw.githubusercontent.com/smduarte/spbd-2425/refs/heads/main/docs/labs/projs/energy-readings.csv
!grep "2024-02" /tmp/readings.csv > february-energy-readings.csv
!head -2 february-energy-readings.csv


2024-02-01 00:00:00;D;2615.0
2024-02-01 00:00:18;C;1098.8
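
The archived file has no header row and separates its three fields with semicolons, as the two lines above show. A minimal sketch of reading that layout with the standard `csv` module, assuming every valid line has exactly three fields; `read_readings` is a hypothetical helper name:

```python
import csv
import io


def read_readings(text):
    """Yield (timestamp, sensor_id, energy) from semicolon-separated text."""
    for row in csv.reader(io.StringIO(text), delimiter=";"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        timestamp, sensor_id, energy = row
        yield timestamp, sensor_id, float(energy)


# The two sample lines shown in the cell output.
sample_text = "2024-02-01 00:00:00;D;2615.0\n2024-02-01 00:00:18;C;1098.8\n"
rows = list(read_readings(sample_text))
print(rows)
```

To read the downloaded file instead of a string, the same `csv.reader` call can be given a file object opened with `open("february-energy-readings.csv", newline="")`.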


In [2]:
#@title Start the Structured Source

!wget -q -O - https://github.com/smduarte/spbd-2425/raw/main/scripts/json_energy_sender.tgz  | tar xfz - 2> /dev/null

!nohup python json_energy_sender/server.py --filename json_energy_sender/energy-readings.csv --speedup 60 > /dev/null 2> /dev/null &

In [3]:
from pyspark.sql import *
from pyspark.sql.functions import *

spark = SparkSession \
    .builder \
    .appName("StructuredWebLogExample") \
    .getOrCreate()


# Extract a sample JSON string to infer schema
sample_json = '{"date": "2024-02-01 00:00:00", "sensor": "D", "energy": 2615.0}'
inferred_schema = schema_of_json(sample_json)


# Create a DataFrame representing the stream of input
# lines from the connection to the JSON sender on localhost:7777
query = None  # defined up front so the except block can test it safely
try:
  json_lines = spark.readStream.format("socket") \
      .option("host", "localhost") \
      .option("port", 7777) \
      .load()

  # Parse the JSON using the inferred schema
  json_lines = json_lines.withColumn("json_data", from_json(col("value"), inferred_schema)) \
    .select("json_data.*")  # Expand the JSON fields into columns

  # Print up to 10 rows of each micro-batch, once per second
  query = json_lines \
    .writeStream \
    .outputMode("append") \
    .trigger(processingTime='1 seconds') \
    .foreachBatch(lambda df, epoch: df.show(10, False)) \
    .start()

  # Wait at most 60 seconds for the query to terminate
  query.awaitTermination(60)
except Exception as err:
  print(err)
  # Stop the query only if it was started before the error occurred
  if query is not None:
    query.stop()

+----+------+------+
|date|energy|sensor|
+----+------+------+
+----+------+------+

+-------------------+-------+------+
|date               |energy |sensor|
+-------------------+-------+------+
|2024-10-01 00:04:21|2790.18|C     |
|2024-10-01 00:04:27|5949.0 |D     |
|2024-10-01 00:04:36|2162.37|J     |
|2024-10-01 00:04:52|2682.69|I     |
|2024-10-01 00:04:24|3993.9 |H     |
|2024-10-01 00:04:33|3481.07|E     |
|2024-10-01 00:04:43|1597.49|F     |
+-------------------+-------+------+

+-------------------+-------+------+
|date               |energy |sensor|
+-------------------+-------+------+
|2024-10-01 00:14:30|2790.19|C     |
|2024-10-01 00:14:42|3481.08|E     |
|2024-10-01 00:14:48|1668.96|B     |
|2024-10-01 00:14:54|1649.25|A     |
|2024-10-01 00:14:36|5949.1 |D     |
|2024-10-01 00:14:45|2162.5 |J     |
|2024-10-01 00:14:51|1597.5 |F     |
|2024-10-01 00:14:57|2080.99|G     |
+-------------------+-------+------+

+-------------------+-------+------+
|date               |ener