<h1>MLTS Exercise 09 - LSTM Training</h1>

Your task is to train an LSTM network on the electric power consumption dataset. The goal is to forecast the value for the next day based on the measurement values from a fixed window of previous days.

The dataset can be downloaded from [Individual Household Electric Power Consumption](https://archive.ics.uci.edu/dataset/235/individual+household+electric+power+consumption)

It contains "Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available."

**Dataset Reference**  
Hebrail, G. & Berard, A. (2006). Individual Household Electric Power Consumption [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C58K54.

In [None]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

In [3]:
# Set random seed for reproducibility
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

### Load the dataset

In [4]:
# Importing dataset
path = 'data/household_power_consumption.txt'

Household_consumption = pd.read_csv(path, sep=';', low_memory=False)
# Household_consumption

## Preprocess Data for training

For our model, we use only `Global_active_power` as the time series we want to forecast. Therefore, we drop the other columns beforehand.

Tasks:
* Drop all columns except `Date`, `Time` and `Global_active_power`
* Convert the separate date and time columns into a single datetime column
* Convert numeric columns to correct type
* Find and replace missing values

In [5]:
# Drop unneeded columns
Household_consumption.drop(columns=[
    'Global_reactive_power', 'Voltage',
    'Global_intensity', 'Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3'], inplace=True)

In [6]:
# Parsing date and time into a single datetime column
Household_consumption['Datetime'] = pd.to_datetime(
    Household_consumption['Date'] + ' ' + Household_consumption['Time'], 
    format='%d/%m/%Y %H:%M:%S',
    errors='coerce'
)
# Drop date and time column
Household_consumption.drop(columns=['Date', 'Time'], inplace=True)
# Drop rows with missing datetime
Household_consumption.dropna(subset=['Datetime'], inplace=True)

In [7]:
# Convert numerical columns to numeric type
numeric_column = 'Global_active_power'
Household_consumption[numeric_column] = pd.to_numeric(Household_consumption[numeric_column], errors='coerce')

# Household_consumption.head(3)

In [None]:
# Find missing values
missing_values = Household_consumption.isnull().sum()
print("Missing values per column:\n", missing_values)

# Fill missing values with median for simplicity
Household_consumption[numeric_column] = Household_consumption[numeric_column].fillna(
    Household_consumption[numeric_column].median()
)

## Prepare the datasets for training

We want to forecast the mean value of the next day given the mean values of the previous days. Therefore, we need to resample our data so that it contains only one value per day.

This makes training fast, so it should also work on your laptop's CPU. Alternatively, you can train the model on Google Colab or in the CIP pool.
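As an illustration of the resampling step, here is a minimal sketch on a small synthetic minute-level series. The frame `demo` and its values are made up for this example; the exercise data would presumably need its `Datetime` column set as the index in the same way.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: two days of minute-level readings (hypothetical values)
idx = pd.date_range("2007-01-01", periods=2 * 24 * 60, freq="min")
demo = pd.DataFrame({
    "Datetime": idx,
    "Global_active_power": np.r_[np.full(1440, 1.0), np.full(1440, 3.0)],
})

# resample() needs a DatetimeIndex; "D" groups the rows by calendar day
demo_daily = demo.set_index("Datetime").resample("D").mean()
print(demo_daily)
```

Each calendar day collapses to one row holding the mean of its 1440 minute values.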

In [None]:
# Resample data to daily intervals, calculating the mean for each day

Household_consumption_daily = None  # TODO

Household_consumption_daily.head(3)

Plot the data

In [None]:
daily_data_week = Household_consumption_daily['Global_active_power']

# Plot the daily mean values
plt.figure(figsize=(12, 6))
plt.plot(daily_data_week.index, daily_data_week, label="Daily Global Active Power", color='blue', alpha=0.6)
plt.title("Daily Global Active Power Consumption", fontsize=16)
plt.xlabel("Date", fontsize=12)
plt.ylabel("Global Active Power (kW)", fontsize=12)
plt.legend()
plt.grid(alpha=0.3)
plt.show()

#### Scale the data between 0 and 1

Scaling the data is an important step before training, because large raw values can cause exploding gradients.
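One possible way to scale to the range 0 to 1 is the `MinMaxScaler` imported above. The sketch below uses a made-up array `values` in place of the daily means. Note that fitting the scaler on the whole series lets the test period influence the minimum and maximum; a stricter variant would fit on the training part only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical daily means in kW; MinMaxScaler expects a 2D array (n_samples, n_features)
values = np.array([0.5, 1.0, 2.5, 1.5], dtype=np.float32).reshape(-1, 1)

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)         # smallest value -> 0, largest -> 1
restored = scaler.inverse_transform(scaled)   # back to the original units
```

Keeping the fitted `scaler` around is useful later, since predictions can be mapped back to kW with `inverse_transform`.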

In [None]:
# TODO

#### Convert Time Series to Supervised Format

Time series data is sequential, but supervised training requires input-output pairs. Converting the series into a supervised format pairs each window of past observations with the value that follows it, so the model can learn to predict the future from the past.
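To make the idea concrete, here is a minimal sliding-window sketch on a toy series. The helper name `make_windows` is hypothetical and only one of several ways to build the pairs.

```python
import numpy as np

def make_windows(series: np.ndarray, lag: int) -> tuple[np.ndarray, np.ndarray]:
    """Return (X, y): each row of X holds `lag` consecutive values, y the value that follows."""
    X = np.stack([series[i:i + lag] for i in range(len(series) - lag)])
    y = series[lag:]
    return X, y

series = np.arange(6, dtype=np.float32)   # toy series 0, 1, ..., 5
X_demo, y_demo = make_windows(series, lag=3)
print(X_demo)  # three windows: values 0-2, 1-3 and 2-4
print(y_demo)  # the value following each window: 3, 4, 5
```

A series of length `n` yields `n - lag` pairs, so the first `lag` values never appear as targets.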

In [85]:
# Convert Time Series to Supervised Learning Format
def create_supervised_data(data: np.ndarray, lag: int) -> tuple[np.ndarray, np.ndarray]:
    """
    Convert time series data into supervised learning format.
    `lag` determines how many previous time steps are used.
    """
    pass  # TODO

In [None]:
# Target column (e.g., 'Global_active_power')
target_column = None  # TODO

# Use a lag of 30 (e.g., previous 30 time steps to predict the next step)
lag = 30
X, y = create_supervised_data(target_column, lag)

X.shape, y.shape

#### Divide data into train and test set

We do not split the train and test sets randomly but by year. This gives us a better estimate of how the model performs over a longer period of time that was not seen in the training set.

For the test set, we will split away the last year of data, between '2010-01-01' and '2010-11-26'.  
Therefore, the train set will be all the data from '2006-12-16' up to, but not including, '2010-01-01'.
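A date-based split can be sketched with a boolean mask over the dates of the targets. The example below uses a short made-up daily index and assumes that window `i` predicts the value at position `i + lag`, so the target dates are the index shifted by `lag`.

```python
import numpy as np
import pandas as pd

lag = 3
dates = pd.date_range("2009-12-25", periods=14, freq="D")   # hypothetical daily index
series = np.arange(len(dates), dtype=np.float32)

X_all = np.stack([series[i:i + lag] for i in range(len(series) - lag)])
y_all = series[lag:]
target_dates = dates[lag:]            # date of the value each window predicts

# Everything whose target falls on or after the cutoff goes to the test set
test_mask = target_dates >= pd.Timestamp("2010-01-01")
X_tr, y_tr = X_all[~test_mask], y_all[~test_mask]
X_te, y_te = X_all[test_mask], y_all[test_mask]
print(X_tr.shape, X_te.shape)
```

With this choice the first test windows still contain inputs from the end of the training period, while no test target is ever used for training.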

In [None]:
# Define the cutoff for the test set (e.g., one year)
test_start_date = '2010-01-01'  # Start of the test year
test_end_date = '2010-11-26'    # End of the test year

# TODO

# Print shapes for confirmation
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

## Setup LSTM Model

Create your LSTM model here. You can either code it completely from scratch or use the modules already implemented in PyTorch.

Either way, think about what your inputs and outputs will look like and how the data flows from the dataset through the model. How do you need to implement the LSTM model?
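As one possible starting point, the sketch below wraps `torch.nn.LSTM` with a linear read-out on the last time step. The class name `TinyLSTM` and the hidden size are assumptions for illustration, not the expected solution.

```python
import torch

class TinyLSTM(torch.nn.Module):
    """Hypothetical example: one LSTM layer followed by a linear read-out."""

    def __init__(self, input_size: int = 1, hidden_size: int = 32, num_layers: int = 1):
        super().__init__()
        self.lstm = torch.nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.head = torch.nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, lag, input_size)
        out, _ = self.lstm(x)              # out: (batch, lag, hidden_size)
        return self.head(out[:, -1, :])    # last time step -> (batch, 1)

demo_model = TinyLSTM()
dummy = torch.randn(4, 30, 1)   # 4 windows of 30 days, one feature each
print(demo_model(dummy).shape)  # torch.Size([4, 1])
```

With `batch_first=True` the input is laid out as (batch, sequence length, features), which matches windows of `lag` daily values with a single feature.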

In [112]:
class LSTMModel(torch.nn.Module):

    pass  # TODO

Initialize the LSTM model, the optimizer and the loss function.

Loss function: MSELoss  
Optimizer: Adam

In [None]:
# TODO

In [169]:
# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move model to device
model = model.to(device)

#### Setup the dataloaders for the model

Convert the train and test sets to torch tensors, then create a `TensorDataset` and a `DataLoader` for each.
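A minimal sketch of this conversion, using random arrays `X_np` and `y_np` as hypothetical stand-ins for the windows and targets; the batch size is an arbitrary choice.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

X_np = np.random.rand(10, 30).astype(np.float32)   # hypothetical windows (samples, lag)
y_np = np.random.rand(10).astype(np.float32)       # hypothetical targets

# An LSTM with batch_first=True expects (batch, seq_len, features), so add a feature axis
X_t = torch.from_numpy(X_np).unsqueeze(-1)   # (10, 30, 1)
y_t = torch.from_numpy(y_np).unsqueeze(-1)   # (10, 1)

loader = DataLoader(TensorDataset(X_t, y_t), batch_size=4, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)
```

Shuffling is reasonable for the training loader because every window already carries its own history; for the test loader, `shuffle=False` keeps predictions in chronological order.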

In [None]:
# TODO

## Training/Testing Loop

Set up the training and test loop. Run N epochs; in each epoch, iterate over the batches of your training dataloader, pass the data to the model, calculate the loss and optimize the network. After each epoch, test the model on the test set by passing the data through the model and computing the loss. Save both the train and the test loss for later inspection.
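The loop described above could look roughly like the following sketch. It runs on synthetic sine-wave windows, and the model `TinyLSTM`, the helper `run_epoch`, the learning rate and the epoch count are all assumptions chosen for illustration. Moving batches to `device` is left out for brevity.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

class TinyLSTM(torch.nn.Module):  # hypothetical stand-in model
    def __init__(self, hidden_size: int = 16):
        super().__init__()
        self.lstm = torch.nn.LSTM(1, hidden_size, batch_first=True)
        self.head = torch.nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])

def run_epoch(model, loader, criterion, optimizer=None):
    """One pass over loader; trains if an optimizer is given, otherwise only evaluates."""
    training = optimizer is not None
    model.train(training)
    total = 0.0
    with torch.set_grad_enabled(training):
        for xb, yb in loader:
            loss = criterion(model(xb), yb)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * xb.size(0)   # weight by batch size
    return total / len(loader.dataset)          # mean loss per sample

# Synthetic windows of a sine wave, only to exercise the loop
wave = torch.sin(torch.linspace(0, 20, 200))
X = torch.stack([wave[i:i + 10] for i in range(190)]).unsqueeze(-1)
y = wave[10:].unsqueeze(-1)
train_loader = DataLoader(TensorDataset(X[:150], y[:150]), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(X[150:], y[150:]), batch_size=32)

model = TinyLSTM()
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

train_losses, test_losses = [], []
for epoch in range(5):
    train_losses.append(run_epoch(model, train_loader, criterion, optimizer))
    test_losses.append(run_epoch(model, test_loader, criterion))
    print(f"epoch {epoch + 1}: train {train_losses[-1]:.4f}, test {test_losses[-1]:.4f}")
```

Sharing one function between training and evaluation keeps the two loss values comparable, since both are averaged per sample in the same way.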

In [171]:
# Training parameters
num_epochs = None  # TODO

In [None]:
# TODO

Display loss curves

In [None]:
# TODO

## Evaluation

Evaluate your trained model by comparing the ground truth data against the predicted time series values.
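Before comparing, it can help to map the scaled values back to kW so that errors are interpretable. The sketch below assumes the data was scaled with a `MinMaxScaler`; the numbers, including the pretend predictions, are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

raw = np.array([[1.0], [2.0], [3.0], [5.0]])        # hypothetical daily means in kW
scaler = MinMaxScaler().fit(raw)

y_true_scaled = scaler.transform(np.array([[2.0], [3.0]]))
y_pred_scaled = y_true_scaled + 0.05                 # pretend model output, slightly off

# Undo the scaling, then measure the error in the original units
y_true = scaler.inverse_transform(y_true_scaled)
y_pred = scaler.inverse_transform(y_pred_scaled)
rmse = float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
print(f"RMSE: {rmse:.3f} kW")
```

Because the toy data spans 4 kW, an offset of 0.05 in scaled units corresponds to 0.2 kW after the inverse transform.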

#### Ground Truth vs. Predicted Time Series

In [None]:
# TODO

#### Long-term forecasting

Additionally, we want to test whether our model can do long-term forecasting, that is, predict correct values based on its own previous predictions. Of course, our model is not designed or trained to do that specifically, but it is a good test of how errors propagate over time, and it may give us more ideas on how to improve the model.
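Feeding predictions back as inputs can be sketched as a rolling window. The function `forecast_recursive` and the stand-in `MeanModel` below are hypothetical; any model that maps a (batch, lag, 1) tensor to a (batch, 1) tensor could be plugged in instead.

```python
import torch

def forecast_recursive(model: torch.nn.Module, window: torch.Tensor, steps: int) -> torch.Tensor:
    """Roll the model forward `steps` times, feeding each prediction back as input.

    window: tensor of shape (lag, 1) holding the last observed (scaled) values.
    """
    model.eval()
    window = window.clone()
    preds = []
    with torch.no_grad():
        for _ in range(steps):
            next_val = model(window.unsqueeze(0))          # (1, 1)
            preds.append(next_val.squeeze())
            # Drop the oldest value and append the new prediction
            window = torch.cat([window[1:], next_val.reshape(1, 1)], dim=0)
    return torch.stack(preds)                              # (steps,)

class MeanModel(torch.nn.Module):  # hypothetical stand-in: predicts the window mean
    def forward(self, x):
        return x.mean(dim=1)       # (batch, lag, 1) -> (batch, 1)

start = torch.tensor([[1.0], [2.0], [3.0]])
future = forecast_recursive(MeanModel(), start, steps=4)
print(future)
```

After `lag` steps the window consists entirely of predictions, which is where accumulated errors typically become visible.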

In [175]:
# TODO

In [None]:
# TODO

#### Questions:

* How well can the time series be predicted?
* How could the training be improved / changed?
* Can the data be modified to achieve better results?

Feel free to test out more ideas as you like!