1. Load the dataset provided. **(3 pts)**
   - Ensure all columns have the correct data type, e.g., `float` or `int` for numeric quantities, `object` for categorical variables, `datetime` for dates, etc.
   - Remove columns where all values are missing or equal to zero.
   - Describe the basic statistics (mean, minimum, maximum, standard deviation, etc.) of the numerical variables.

2. Perform all data pre-processing steps needed for the model (i.e., integration, cleaning, reduction, and transformation). **(7 pts)**  
   For each pre-processing step that you apply, dedicate at least one markdown cell next to the code to explain:
   - the step you are applying
   - why you need to do it
   - what the result is

   **Note:** In some scenarios, it is acceptable to repeat steps; however, make sure your code is logical and efficient.

3. Organize your time-dependent data. In a separate markdown cell, summarize how you organize your time-dependent data. Motivate your strategy. You can include a figure if needed. **(4 pts)**
   - **Hint:** you can include a figure in markdown syntax as follows: `![caption](path/to/figure.png)` (the path here is a placeholder).

4. Prepare your data for model selection and evaluation. **(4 pts)**  
   What is the experimental setup that you use to build the model? Please provide a figure that summarizes your setup where you clearly mark:
   - how you split the available data into training, validation, and test sets
   - what type of validation strategy you use (e.g., hold-out, cross-validation)

5. Choose a recurrent neural network architecture to build the prediction model (e.g., LSTM, GRU, etc.) and implement your solution. Motivate which hyperparameters you select or optimize. **(5 pts)**

6. Create plots of training and validation losses. Describe the trends you notice and whether they indicate underfitting or overfitting behaviors. **(2 pts)**

7. Evaluate the performance of the model. Describe which measures you use and discuss the results of the evaluation. **(3 pts)**

8. Create a plot contrasting the predictions of the model and the ground truth values from the test set. **(2 pts)**
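
One common way to approach items 3 and 4 is to cut the series into fixed-length sliding windows and to split it chronologically, so that validation data always comes after training data. The sketch below illustrates that idea on a toy array only. The helper names `make_windows` and `chronological_split` and the parameters `lookback`, `horizon`, and `val_fraction` are hypothetical and are not part of the assignment or of any library.

```python
import numpy as np

def make_windows(values, lookback, horizon=1):
    """Turn a (T, F) array into inputs of shape (samples, lookback, F)
    and targets of shape (samples, F), `horizon` steps after each window."""
    values = np.asarray(values, dtype=float)
    if values.ndim == 1:
        values = values[:, None]  # treat a 1-D series as a single feature
    n_samples = len(values) - lookback - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for this lookback/horizon")
    X = np.stack([values[i:i + lookback] for i in range(n_samples)])
    first_target = lookback + horizon - 1
    y = values[first_target:first_target + n_samples]
    return X, y

def chronological_split(n_rows, val_fraction=0.2):
    """Return slices for an ordered hold-out split (no shuffling)."""
    split = int(n_rows * (1 - val_fraction))
    return slice(0, split), slice(split, n_rows)

# Toy data: 20 time steps, 2 features (row i is [2i, 2i + 1])
series = np.arange(40, dtype=float).reshape(20, 2)

train_idx, val_idx = chronological_split(len(series), val_fraction=0.25)

# Window each part separately so no window straddles the split boundary
X_train, y_train = make_windows(series[train_idx], lookback=3)
X_val, y_val = make_windows(series[val_idx], lookback=3)

print(X_train.shape, y_train.shape)  # (12, 3, 2) (12, 2)
print(X_val.shape, y_val.shape)      # (2, 3, 2) (2, 2)
```

Windowing each part separately costs a few samples at the boundary, but it keeps every validation target strictly later than anything the model saw during training.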

In [1]:
import pandas as pd
import numpy as np

# Load datasets
train = pd.read_csv('energy_generation_train.csv')
test = pd.read_csv('energy_generation_test.csv')

def process(df, name):
    print(f"\n{'='*50}\nProcessing {name}\n{'='*50}")
    
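    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Convert DateTime column (if present) to datetime
    if 'DateTime' in df.columns:
        df['DateTime'] = pd.to_datetime(df['DateTime'])
    
    # Identify numeric columns by dtype rather than assuming that
    # every non-DateTime column is numeric
    numeric_cols = df.select_dtypes(include='number').columns.tolist()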
    
    # Remove columns where all values are either missing OR zero
    # (i.e., after filling NaN with 0, all values become 0)
    cols_to_drop = []
    for col in numeric_cols:
        if df[col].fillna(0).eq(0).all():
            cols_to_drop.append(col)
    
    if cols_to_drop:
        df = df.drop(columns=cols_to_drop)
        print(f"Dropped columns (all missing/zero): {cols_to_drop}")
    
    # Remaining numeric columns
    remaining_numeric = [c for c in df.columns if c != 'DateTime' and c in numeric_cols]
    
    # Basic statistics for numerical variables
    if remaining_numeric:
        stats = df[remaining_numeric].describe().T[['mean','min','max','std']]
        stats['missing'] = df[remaining_numeric].isnull().sum()
        print("\nNumerical statistics:")
        print(stats.round(2))
    else:
        print("No numerical columns remaining.")
    
    print(f"\nFinal columns: {list(df.columns)}")
    return df

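# Process both sets. Columns are dropped independently per dataset,
# so check afterwards that train and test end up with the same columns.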
train_clean = process(train, 'TRAIN Dataset')
test_clean = process(test, 'TEST Dataset')



Processing TRAIN Dataset
Dropped columns (all missing/zero): ['Biomass', 'Fossil Brown coal/Lignite', 'Fossil Coal-derived gas', 'Fossil Oil', 'Fossil Oil shale', 'Fossil Peat', 'Geothermal', 'Hydro Pumped Storage', 'Hydro Pumped Storage.1', 'Hydro Run-of-river and poundage', 'Hydro Water Reservoir', 'Marine', 'Other renewable']

Numerical statistics:
                     mean  min      max      std  missing
Fossil Gas        3512.53  0.0   9666.0  2256.92       27
Fossil Hard coal  1362.93  0.0   3912.0   999.45       27
Nuclear            433.14  0.0    490.0   135.73       26
Other              351.87  0.0  11638.0   292.84       26
Solar               33.59  0.0    252.0    56.98       26
Waste               62.45  0.0     89.0    17.87       27
Wind Offshore     1296.40  0.0   3339.0   966.51       26
Wind Onshore       431.28  0.0   1226.0   347.33       26

Final columns: ['DateTime', 'Fossil Gas', 'Fossil Hard coal', 'Nuclear', 'Other', 'Solar', 'Waste', 'Wind Offshore', 'Wind