<a href="https://colab.research.google.com/github/tiagopecurto/tiagopecurto/blob/main/docs/labs/projs/spbp2425_tp1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sistemas para Processamento de Big Data
## TP1 - Energy Meter Monitoring




The sensor data corresponds to regular readings from 11 residential energy meters. The data covers the month of February 2024.

Each data sample has the following schema:

timestamp | sensor_id | energy
----------|-------------|-----------
timestamp | string  | float

Each energy value (kWh) is the accumulated value of the meter at the time of measurement. As such,
each meter is expected to produce a monotonically increasing series of pairs of timestamp and energy consumed up to that moment.

The meters do not start at zero or at the same value.
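
Because the readings are cumulative, the energy consumed over an interval is the difference between the last and the first reading in it, not the sum of the readings. The sketch below illustrates this arithmetic in plain Python, independent of Spark; the readings and the helper name `consumption` are made up for illustration and do not come from the dataset.

```python
from datetime import datetime

# Hypothetical cumulative readings (kWh) for a single meter; values are made up.
readings = [
    (datetime(2024, 2, 1, 0, 0), 100.0),
    (datetime(2024, 2, 1, 12, 0), 105.5),
    (datetime(2024, 2, 2, 0, 0), 112.0),
]

def consumption(samples):
    """Energy consumed between the earliest and latest reading of a cumulative meter."""
    ordered = sorted(samples)  # tuples sort by timestamp first
    return ordered[-1][1] - ordered[0][1]

total = consumption(readings)
print(total)  # 12.0
```

Since each meter starts at a different value, this difference has to be taken per sensor before anything is summed across sensors.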


## Questions

The following questions should be answered for the month of February and only for this month.

### For the group of sensors:

1. Compute the total energy consumed.

2. Compute the running total energy consumed so far for each day, inclusive.

Note: You can approximate the result by using the last reading of each day from each sensor.
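
One possible reading of the running total in question 2 is sketched below in plain Python, only to make the arithmetic concrete (the assignment itself asks for Structured Spark). It assumes every sensor reports at least once per day, and all rows are made-up values. For each day, the total so far is the sum over sensors of that day's last reading minus the sensor's first reading of the month.

```python
from collections import defaultdict

# Hypothetical rows: (day, sensor, cumulative kWh); values are made up.
rows = [
    ("2024-02-01", "A", 100.0), ("2024-02-01", "A", 104.0),
    ("2024-02-02", "A", 109.0),
    ("2024-02-01", "B", 50.0), ("2024-02-01", "B", 52.0),
    ("2024-02-02", "B", 53.0),
]

first = {}                        # sensor -> first (smallest) reading of the month
last_of_day = defaultdict(dict)   # day -> sensor -> last (largest) reading that day
for day, sensor, energy in rows:
    # Meters are monotonically increasing, so min/max give first/last readings.
    first[sensor] = min(first.get(sensor, energy), energy)
    last_of_day[day][sensor] = max(last_of_day[day].get(sensor, energy), energy)

# Running total per day: consumption of every sensor from the start of the
# month up to and including that day.
running_total = {
    day: sum(value - first[sensor] for sensor, value in per_sensor.items())
    for day, per_sensor in sorted(last_of_day.items())
}
print(running_total)  # {'2024-02-01': 6.0, '2024-02-02': 12.0}
```

Measuring against the first reading of the month also counts energy consumed between the last reading of one day and the first reading of the next, which a per-day `max - min` would miss.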

### For each sensor, separately:

3. Compute the total energy consumed and the average energy consumption per day.

4. Compute the day of the month with minimum and maximum energy consumption.

Note: You can approximate the result by using the last reading of each day from each sensor.
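
For questions 3 and 4, the same approximation yields a per-day consumption as the difference between consecutive end-of-day readings. The following plain-Python sketch uses made-up values for one hypothetical sensor (`baseline` stands for its first reading of the month) and is meant only to illustrate the computation, not to prescribe the Spark solution.

```python
# Hypothetical last reading of each day for one sensor (cumulative kWh); made up.
last_reading = {"2024-02-01": 104.0, "2024-02-02": 109.0, "2024-02-03": 111.0}
baseline = 100.0  # hypothetical first reading of the month for this sensor

# Daily consumption: end-of-day reading minus the previous end-of-day reading.
daily = {}
previous = baseline
for day in sorted(last_reading):
    daily[day] = last_reading[day] - previous
    previous = last_reading[day]

total = sum(daily.values())               # total consumed over the period
average_per_day = total / len(daily)      # average over the days with readings
min_day = min(daily, key=daily.get)       # day with the lowest consumption
max_day = max(daily, key=daily.get)       # day with the highest consumption
print(total, min_day, max_day)  # 11.0 2024-02-03 2024-02-02
```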

### For each sensor, separately, with estimations:

**Assumptions:**

+ Readings may be missing for extended periods due to communication problems with the sensors.

+ Readings do not fall precisely "on the hour". They are collected and recorded at any time.

+ For more precise results, estimate the value of the meter at a precise timestamp, using linear interpolation from the nearest readings.

5. Compute the **estimated** value of each sensor meter for every hour and day of the month (in ascending order).

6. Compute the **estimated** running total of the energy consumed so far. The value should be updated every hour.
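
The linear interpolation mentioned in the assumptions estimates the meter value at time $t$ from the nearest readings $(t_0, e_0)$ before and $(t_1, e_1)$ after it as $e(t) = e_0 + (e_1 - e_0)\,\frac{t - t_0}{t_1 - t_0}$. A minimal plain-Python sketch of that formula follows; the function name `interpolate` and the sample readings are illustrative assumptions, not part of the dataset or of any required API.

```python
from datetime import datetime

def interpolate(t, t0, e0, t1, e1):
    """Linearly interpolate the meter value at time t between (t0, e0) and (t1, e1)."""
    if t1 == t0:
        return e0  # both readings share a timestamp; nothing to interpolate
    fraction = (t - t0) / (t1 - t0)  # timedelta / timedelta yields a float
    return e0 + (e1 - e0) * fraction

# Hypothetical readings on either side of 10:00; values are made up.
t0, e0 = datetime(2024, 2, 1, 9, 45), 200.0
t1, e1 = datetime(2024, 2, 1, 10, 15), 203.0
estimate = interpolate(datetime(2024, 2, 1, 10, 0), t0, e0, t1, e1)
print(estimate)  # 201.5
```

The same formula still applies when readings are missing for an extended period: the two nearest readings are simply further apart.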

## Requirements

Solve each question using Structured Spark, either Dataframes or SQL or both.

## Other Grading Criteria

+ Grading will also take into account the general clarity of the programming and of the presentation report (notebook).




### Deadline
+ November 10, 23h59

For each day late, ***0.5 / day penalty***. Penalty accumulates until the grade of the assignment reaches 8.0.

---
### Colab Setup


In [None]:
#@title Install PySpark
!pip install pyspark findspark --quiet


In [2]:
#@title Download the dataset

!wget -q -O energy-readings.csv https://raw.githubusercontent.com/smduarte/spbd-2425/refs/heads/main/docs/labs/projs/energy-readings.csv
!head -2 energy-readings.csv

date;sensor;energy
2024-02-01 00:00:00;D;2615.0


In [15]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]') \
    .appName('energy').getOrCreate()

sc = spark.sparkContext
try:
    readings = spark.read.csv('energy-readings.csv',
                              sep=';', header=True, inferSchema=True)

    readings.printSchema()

    # 0. Keep only the readings from February 2024.
    readings_february = readings.filter((month("date") == 2) & (year("date") == 2024))
    # readings_february.show()

    # 1. Total energy consumed in February.
    # Meters are cumulative, so each sensor's consumption is its last (max) reading
    # minus its first (min) reading; the per-sensor values are then summed.
    # Note: max, min and sum are the Spark functions from the wildcard import,
    # not the Python built-ins.
    total_energy_consumed = readings_february.groupBy("sensor") \
      .agg((max("energy") - min("energy")).alias("energy_consumed_per_sensor")) \
      .agg(sum("energy_consumed_per_sensor").alias("total_energy_consumed"))
    # total_energy_consumed.show()

    # 2. Total energy consumed per day in February.
    # 2.1: Compute the daily consumption of each sensor.
    daily_consumption_per_sensor = readings_february.withColumn("day", to_date("date")) \
      .groupBy("day", "sensor") \
      .agg((max("energy") - min("energy")).alias("daily_consumption")) \
      .orderBy("day", "sensor")
    daily_consumption_per_sensor.show()

    # 2.2: Sum the daily consumption of all sensors for each day.
    total_daily_consumption = daily_consumption_per_sensor.groupBy("day") \
      .agg(sum("daily_consumption").alias("total_daily_consumption")) \
      .orderBy("day")
    total_daily_consumption.show()




    #readings.show(5)
except Exception as err:
    print(err)

root
 |-- date: timestamp (nullable = true)
 |-- sensor: string (nullable = true)
 |-- energy: double (nullable = true)

+----------+------+------------------+
|       day|sensor| daily_consumption|
+----------+------+------------------+
|2024-02-01|     A| 8.100000000000023|
|2024-02-01|     B|4.2000000000000455|
|2024-02-01|     C| 9.700000000000045|
|2024-02-01|     D|12.900000000000091|
|2024-02-01|     E| 16.09999999999991|
|2024-02-01|     F|              11.0|
|2024-02-01|     G|3.2999999999999545|
|2024-02-01|     H|              23.5|
|2024-02-01|     I|19.899999999999977|
|2024-02-01|     J| 2.599999999999909|
|2024-02-01|     K| 8.399999999999977|
|2024-02-02|     A|4.7999999999999545|
|2024-02-02|     B| 4.599999999999909|
|2024-02-02|     C| 1.599999999999909|
|2024-02-02|     D| 5.900000000000091|
|2024-02-02|     E|10.099999999999909|
|2024-02-02|     F|               6.5|
|2024-02-02|     G| 4.100000000000023|
|2024-02-02|     H| 6.199999999999818|
|2024-02-02|     I|  