# Your mission

You have just started working on the Ecowatt project at RTE. To avoid possible shortages, peaks in national electricity demand must be planned for. Your manager Mark is going on holiday for a week. While he is absent, you will be solely responsible for forecasting the weekly demand.

To prevent electricity shortages, you must accurately forecast the demand 7 days ahead, on an hourly basis.

Your mission is to train an accurate predictive model with the lowest root mean squared error (RMSE). Mark is very technical: he likes to understand every detail and would like you to compare the performance of classical models with that of neural-network-based models.
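
Since every model in this notebook is ranked by RMSE, it helps to have the metric written out. The sketch below is a minimal NumPy version; the function name `rmse` is an illustrative choice and is not taken from `utils.ipynb`.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally sized sequences."""
    # Flatten both inputs so that column vectors and 1-D arrays are handled alike
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# One perfect prediction and one that is off by 4 units
print(rmse([3.0, 5.0], [3.0, 1.0]))
```

The same value can be obtained with `np.sqrt(mean_squared_error(y_true, y_pred))`, using the scikit-learn function imported below. RMSE is expressed in the unit of the target, so large values are expected for a consumption measured in watts.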


Your **target variable** is `consommation_totale`.

**Data source** : https://data.enedis.fr/pages/accueil/

# Import

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive

In [2]:
drive.mount('/content/gdrive')
os.chdir("/content/gdrive/MyDrive/Colab Notebooks/EI_ST4_G1/EI_TS_CS-20230526T084435Z-001/EI_TS_CS")

Mounted at /content/gdrive


In [3]:
%run ./utils.ipynb

In [4]:
FILE_PATH = "data/bilan.csv"
TARGET = "consommation_totale"
EXOGENEOUS = "Température normale lissée (°C)"  # exogenous variable: smoothed normal temperature

## Prepare the data

Define here the range of your train/test split

In [5]:
def read_data(data_path : str = "data/bilan.csv") -> pd.DataFrame:
    """Load the data, index it by timestamp and keep only May and June."""
    df = pd.read_csv(data_path)
    df['horodate'] = pd.to_datetime(df['horodate']) # Convert 'horodate' column to datetime
    df = df.set_index('horodate') # Set 'horodate' as the index
    # Keep only the rows where "Mois" (month) is 5 or 6, i.e. May and June
    df = df[df['Mois'].isin([5,6])]

    return df

df = read_data(FILE_PATH)

# Hold out the last 100 rows for testing and train on the 900 rows before them
X_train = df[-1000:-100]
X_test = df[-100:]
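
The split above is defined by row counts, whereas the mission is stated in days. A possible alternative, shown here only as a sketch, is to cut on the timestamp index so that the test set covers exactly the last 7 days. The helper `split_last_days` and the toy hourly series are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def split_last_days(frame: pd.DataFrame, days: int = 7):
    """Split a time-indexed frame so that the last `days` days form the test set."""
    frame = frame.sort_index()                      # make sure rows are in chronological order
    cutoff = frame.index.max() - pd.Timedelta(days=days)
    train = frame[frame.index <= cutoff]            # everything up to the cutoff
    test = frame[frame.index > cutoff]              # the final `days` days
    return train, test

# Toy example: 14 days of hourly data (assumed values, for illustration only)
toy_index = pd.date_range("2022-06-01", periods=24 * 14,
                          freq=pd.Timedelta(hours=1), tz="Europe/Paris")
toy = pd.DataFrame({"consommation_totale": range(len(toy_index))}, index=toy_index)
toy_train, toy_test = split_last_days(toy, days=7)
print(len(toy_train), len(toy_test))
```

Cutting on time rather than on row position keeps the test horizon stable even when some timestamps are missing from the data.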

In [6]:
df

horodate,Unnamed: 0,Mois,Injection RTE (W),Refoulement RTE (W),Pertes modélisées (W),consommation_totale,Consommation totale télérelevée (W),Consommation HTA télérelevée (W),Consommation totale profilée (W),Consommation HTA profilée (W),...,Production décentralisée profilée (W),Production photovoltaïque profilée (W),Production autre profilée (W),Température réalisée lissée (°C),Température normale lissée (°C),Production éolienne totale (W),Production photovoltaïque totale (W),Pseudo rayonnement,Consommation HTA totale (W),Soutirage net vers autres GRD (W)
2018-05-13 00:30:00+02:00,0.0,5.0,3.033121e+10,1.540171e+09,1.345570e+09,3.118363e+10,7.476941e+09,7.473625e+09,2.370669e+10,704894715.0,...,1.711777e+07,0.000000e+00,276802.0,14.0,15.2,2.546684e+09,2.833300e+04,29.0,8.178520e+09,281335021.0
2018-05-13 01:30:00+02:00,0.0,5.0,2.845965e+10,1.913083e+09,1.244898e+09,2.953303e+10,7.370166e+09,7.366722e+09,2.216286e+10,655407562.0,...,1.711776e+07,0.000000e+00,276802.0,14.0,15.1,2.997843e+09,2.866700e+04,30.0,8.022130e+09,248347310.0
2018-05-13 09:30:00+02:00,0.0,5.0,2.810209e+10,1.202395e+09,1.225843e+09,2.988485e+10,7.655606e+09,7.651590e+09,2.222924e+10,635190971.0,...,6.466203e+08,6.294351e+08,276802.0,13.1,14.7,1.530496e+09,1.430109e+09,20.0,8.286781e+09,254290386.0
2018-05-13 23:30:00+02:00,0.0,5.0,3.475971e+10,1.977665e+09,1.589989e+09,3.604854e+10,7.806472e+09,7.803119e+09,2.824207e+10,717113409.0,...,1.705033e+07,0.000000e+00,276802.0,12.1,15.3,3.643059e+09,2.600000e+04,45.0,8.520232e+09,314277157.0
2018-05-14 00:00:00+02:00,0.0,5.0,3.345301e+10,2.128149e+09,1.319859e+09,3.511246e+10,7.846507e+09,7.843158e+09,2.726596e+10,792112359.0,...,1.690620e+07,0.000000e+00,276477.0,12.0,15.3,3.882783e+09,2.733300e+04,44.0,8.635271e+09,306619073.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-06-30 03:30:00+02:00,0.0,6.0,2.557632e+10,5.225270e+08,1.461077e+09,2.555978e+10,1.240539e+10,1.089969e+10,1.315439e+10,233375292.0,...,4.724614e+06,4.056200e+05,1780534.0,19.2,19.7,1.004989e+09,3.466747e+06,42.0,1.113307e+10,209326374.0
2022-06-30 04:30:00+02:00,0.0,6.0,2.526779e+10,4.224380e+08,1.409177e+09,2.524470e+10,1.270799e+10,1.118602e+10,1.253671e+10,230758732.0,...,4.683996e+06,4.501540e+05,1780534.0,19.1,19.6,8.694093e+08,3.495865e+06,39.0,1.141678e+10,204683737.0
2022-06-30 05:30:00+02:00,0.0,6.0,2.643994e+10,4.058390e+08,1.398221e+09,2.639976e+10,1.385684e+10,1.217465e+10,1.254293e+10,234341147.0,...,5.364963e+06,1.126140e+06,1780534.0,19.0,19.5,8.351213e+08,4.755411e+06,35.0,1.240899e+10,213128249.0
2022-06-30 13:00:00+02:00,0.0,6.0,3.568284e+10,1.378085e+09,2.270972e+09,3.938927e+10,1.827745e+10,1.527415e+10,2.111183e+10,321429290.0,...,1.664627e+09,1.660242e+09,1780534.0,19.1,21.1,1.463046e+09,4.827318e+09,22.0,1.559558e+10,273966985.0


# Modeling with Random Forest



## Modeling
The following code fits a Random Forest model that uses the temperature as an exogenous variable.

In [7]:
X_test[TARGET]

horodate
2022-06-16 06:30:00+02:00    3.051276e+10
2022-06-16 09:00:00+02:00    3.767542e+10
2022-06-16 09:30:00+02:00    3.843907e+10
2022-06-16 11:30:00+02:00    4.091108e+10
2022-06-16 18:30:00+02:00    3.876166e+10
                                 ...     
2022-06-30 03:30:00+02:00    2.555978e+10
2022-06-30 04:30:00+02:00    2.524470e+10
2022-06-30 05:30:00+02:00    2.639976e+10
2022-06-30 13:00:00+02:00    3.938927e+10
2022-06-30 15:00:00+02:00    3.744468e+10
Name: consommation_totale, Length: 100, dtype: float64

In [8]:
errors, predictions = evaluate_random_forest_model(
    np.array(X_train[TARGET]).reshape(-1, 1),
    np.array(X_test[TARGET]).reshape(-1,1),
    np.array(X_train[EXOGENEOUS]).reshape(-1,1),
    np.array(X_test[EXOGENEOUS]).reshape(-1,1),
    )
errors

  rf.fit(np.array(exog_history), history)

5040355761.395226
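
The helper `evaluate_random_forest_model` comes from `utils.ipynb`, which is not shown here. Judging from the `rf.fit(np.array(exog_history), history)` call echoed in the output above, it probably refits the forest on a growing history before each prediction (walk-forward validation). The sketch below illustrates that idea under this assumption; `walk_forward_rf` is a hypothetical name and this is not the helper's actual code. It passes a 1-D target to `fit`, which is the shape scikit-learn expects and which avoids the column-vector warning.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def walk_forward_rf(y_train, y_test, exog_train, exog_test, **rf_params):
    """Refit on all past data at each step, then predict the next point."""
    y_test = np.ravel(y_test).astype(float)
    history_y = np.ravel(y_train).astype(float).tolist()
    history_x = np.asarray(exog_train, dtype=float).reshape(len(history_y), -1).tolist()
    exog_test = np.asarray(exog_test, dtype=float).reshape(len(y_test), -1)
    params = {"random_state": 0, **rf_params}       # fixed seed unless the caller overrides it

    predictions = []
    for step, actual in enumerate(y_test):
        rf = RandomForestRegressor(**params)
        rf.fit(np.array(history_x), np.array(history_y))
        predictions.append(float(rf.predict(exog_test[step:step + 1])[0]))
        # Once the true value is observed, it joins the training history
        history_x.append(exog_test[step].tolist())
        history_y.append(float(actual))

    predictions = np.array(predictions)
    rmse = float(np.sqrt(np.mean((y_test - predictions) ** 2)))
    return rmse, predictions

# Toy data (assumed): load grows roughly linearly with temperature
rng = np.random.default_rng(0)
temp = rng.uniform(10, 25, size=60)
load = 2.0 * temp + rng.normal(0, 0.5, size=60)
toy_rmse, toy_pred = walk_forward_rf(load[:50], load[50:], temp[:50], temp[50:],
                                     n_estimators=10)
print(toy_rmse)
```

Refitting at every step is costly, which is why the toy call uses only 10 trees.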

## Search for the best Random Forest model
We use a grid search to find the Random Forest hyperparameters (`max_depth` and `n_estimators`) that give the lowest RMSE.

In [9]:
import numpy as np

# Make sure random_forest_grid_search is imported from the appropriate file or module

params_grid = {
    'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'n_estimators': [100]
}

best_cfg, best_score = random_forest_grid_search(
    np.array(X_train[TARGET]).reshape(-1, 1),
    np.array(X_test[TARGET]).reshape(-1, 1),
    np.array(X_train[EXOGENEOUS]).reshape(-1, 1),
    np.array(X_test[EXOGENEOUS]).reshape(-1, 1),
    params_grid
)

RF dict_keys(['max_depth', 'n_estimators'])=(1, 100) RMSE=5006761681.434715
RF dict_keys(['max_depth', 'n_estimators'])=(2, 100) RMSE=4877635679.78251
RF dict_keys(['max_depth', 'n_estimators'])=(3, 100) RMSE=5104598845.863625
RF dict_keys(['max_depth', 'n_estimators'])=(4, 100) RMSE=5046400603.311011
RF dict_keys(['max_depth', 'n_estimators'])=(5, 100) RMSE=5046867912.751725
RF dict_keys(['max_depth', 'n_estimators'])=(6, 100) RMSE=5029541074.246608
RF dict_keys(['max_depth', 'n_estimators'])=(7, 100) RMSE=5076980018.044216
RF dict_keys(['max_depth', 'n_estimators'])=(8, 100) RMSE=5102807181.934081
RF dict_keys(['max_depth', 'n_estimators'])=(9, 100) RMSE=5073495466.622395
RF dict_keys(['max_depth', 'n_estimators'])=(10, 100) RMSE=5077792937.140115
Best RF parameters: {'max_depth': 2, 'n_estimators': 100} RMSE=4877635679.78251


In [10]:
print(best_cfg, best_score)

{'max_depth': 2, 'n_estimators': 100} 4877635679.78251
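
The grid search in this section tries every combination of the values in `params_grid` and keeps the one with the lowest score. The sketch below shows that loop in isolation. The function `grid_search` and the toy scoring function are assumptions made for illustration; the real `random_forest_grid_search` lives in `utils.ipynb` and may differ.

```python
from itertools import product

def grid_search(params_grid, evaluate):
    """Return (best_cfg, best_score) minimising evaluate(**cfg) over the grid."""
    names = list(params_grid)
    best_cfg, best_score = None, float("inf")
    # Enumerate the Cartesian product of all candidate values
    for values in product(*(params_grid[name] for name in names)):
        cfg = dict(zip(names, values))
        score = evaluate(**cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy score standing in for the RMSE of a fitted model (assumed, not real results)
def toy_score(max_depth, n_estimators):
    return (max_depth - 2) ** 2 + n_estimators / 100

toy_grid = {"max_depth": [1, 2, 3, 4], "n_estimators": [100]}
toy_best_cfg, toy_best_score = grid_search(toy_grid, toy_score)
print(toy_best_cfg, toy_best_score)
```

In practice `evaluate` would fit a model with the given configuration and return its RMSE on held-out data.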


## Visualization
To better see the difference between the true and predicted values, we plot both signals on the same figure.

In [13]:
# prepare the dataset for plotting
df_reset = df.reset_index()
predict_date = df_reset["horodate"]
df_predict = pd.DataFrame(zip(predict_date[-100:],
                              predictions, X_test[TARGET].values),
                          columns=["date", "predict", "true"])

In [15]:
df_predict

Unnamed: 0,date,predict,true
0,2022-06-16 06:30:00+02:00,3.091277e+10,3.051276e+10
1,2022-06-16 09:00:00+02:00,3.145260e+10,3.767542e+10
2,2022-06-16 09:30:00+02:00,3.215069e+10,3.843907e+10
3,2022-06-16 11:30:00+02:00,3.615721e+10,4.091108e+10
4,2022-06-16 18:30:00+02:00,3.211221e+10,3.876166e+10
...,...,...,...
94,2022-06-29 15:00:00+02:00,3.303414e+10,3.761818e+10
95,2022-06-30 03:30:00+02:00,3.142064e+10,2.555978e+10
96,2022-06-30 04:30:00+02:00,3.145508e+10,2.524470e+10
97,2022-06-30 05:30:00+02:00,3.498049e+10,2.639976e+10
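
Alongside the plot, the gap between the two columns can be quantified row by row. The sketch below adds a signed error and an absolute percentage error to a frame shaped like `df_predict`. The helper `add_error_columns` is hypothetical, and the sample reuses the first two rows of the table above.

```python
import pandas as pd

def add_error_columns(df_predict: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the signed error and the absolute percentage error."""
    out = df_predict.copy()
    out["error"] = out["predict"] - out["true"]                  # positive means over-prediction
    out["abs_pct_error"] = out["error"].abs() / out["true"] * 100
    return out

# First two rows of df_predict, copied from the output above
sample = pd.DataFrame({
    "date": ["2022-06-16 06:30:00+02:00", "2022-06-16 09:00:00+02:00"],
    "predict": [3.091277e10, 3.145260e10],
    "true": [3.051276e10, 3.767542e10],
})
sample_errors = add_error_columns(sample)
print(sample_errors)
```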


In [14]:
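# go.Figure requires plotly; imported here in case utils.ipynb does not already provide it
import plotly.graph_objects as go

fig = go.Figure()

# One line per signal, sharing the same time axis
fig.add_trace(go.Scatter(x=df_predict["date"], y=df_predict["predict"], name="predict"))
fig.add_trace(go.Scatter(x=df_predict["date"], y=df_predict["true"], name="true"))

fig.update_layout(title="Predictions vs true values")
fig.show()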

# Modeling with other models

Try other models: random forest, xgboost ...