# [Challenge Data - Electricity price forecasting by Elmy](https://challengedata.ens.fr/challenges/140)

## Decision Tree Classifier

This notebook studies the Decision Tree Learning method (setup, tuning and scoring) applied to our problem of predicting the _spot_id_delta_ parameter. To do so, we split the training targets into two groups by sign:
* _spot_id_delta_ > 0
* _spot_id_delta_ < 0
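
The labeling step can be sketched on a toy array. This is only an illustration, assuming `np.sign` is used as in the notebook's first code cell; the names `deltas` and `labels` and the sample values are hypothetical.

```python
import numpy as np

# Hypothetical spot_id_delta values, for illustration only
deltas = np.array([-14.6, 16.1, -33.5, 0.0, 2.4])

# np.sign maps negatives to -1.0, positives to 1.0 and exact zeros to 0.0
labels = np.sign(deltas)
print(labels)
```

Note that an exact zero would form a third class; with continuous price deltas this should be rare, but it is worth checking on the real data.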

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats

from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

X = pd.read_csv('../data/X_train.csv').set_index('DELIVERY_START')
y = pd.read_csv('../data/y_train.csv') #.set_index('DELIVERY_START')
X_rendu = pd.read_csv('../data/X_test.csv').set_index('DELIVERY_START')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Turn the regression target into class labels: -1, 0 or +1 depending on the sign
y_train_classified = y_train.copy()
y_train_classified['spot_id_delta'] = np.sign(y_train_classified['spot_id_delta'])
y_test_classified = y_test.copy()
y_test_classified['spot_id_delta'] = np.sign(y_test_classified['spot_id_delta'])

In [4]:
y_train.head()

| Unnamed: 0.1 | Unnamed: 0 | DELIVERY_START | spot_id_delta |
|---|---|---|---|
| 553 | 9849 | 2023-02-26 11:00:00+01:00 | -14.591457 |
| 2374 | 5670 | 2022-09-01 08:00:00+02:00 | -14.896794 |
| 1061 | 4433 | 2022-07-11 19:00:00+02:00 | 16.143834 |
| 6370 | 8239 | 2022-12-20 09:00:00+01:00 | -33.480208 |
| 2674 | 2177 | 2022-04-05 19:00:00+02:00 | -15.206277 |


In [None]:
def weighted_accuracy(y_true: pd.DataFrame, y_pred: np.ndarray) -> float:
    # Align by position: y_true keeps the shuffled index from train_test_split
    df = pd.DataFrame({'y_true': y_true['spot_id_delta'].to_numpy(),
                       'y_pred': np.asarray(y_pred).ravel()})
    # 1 when the true and predicted signs agree, 0 otherwise
    same_sign = np.floor(np.abs((np.sign(df.y_true) + np.sign(df.y_pred)) / 2))
    # Relative closeness of the prediction to the true value
    closeness = 1 - np.abs((df.y_true - df.y_pred) / df.y_true)
    return (same_sign * closeness).mean()
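
To make the per-row score concrete, here is a small worked example. It assumes the intent of the formula is: 1 when the true and predicted signs agree (0 otherwise), multiplied by one minus the relative error. The helper name `row_score` and the sample values are hypothetical.

```python
import numpy as np

def row_score(y_true: float, y_pred: float) -> float:
    """Sign-agreement indicator times (1 - relative error); illustrative only."""
    # 1 when both signs agree, 0 when they differ
    same_sign = np.floor(abs((np.sign(y_true) + np.sign(y_pred)) / 2))
    # 1 for a perfect prediction, decreasing as the relative error grows
    closeness = 1 - abs((y_true - y_pred) / y_true)
    return float(same_sign * closeness)

print(row_score(10.0, 8.0))   # signs agree, 20% relative error
print(row_score(-5.0, 3.0))   # signs differ, so the row contributes nothing
print(row_score(10.0, 1.0))   # signs agree, but a class label of 1 is far from 10
```

The third call suggests that feeding ±1 class labels into this formula gives low scores whenever the true delta is large in magnitude, even if the sign is right.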

Now that the targets have been turned into class labels, we can train a Decision Tree model:

In [None]:
import matplotlib.pyplot as plt
import math
from sklearn.tree import DecisionTreeClassifier

# Default settings (unbounded depth); fit on the sign labels only,
# since y_train_classified also carries non-target columns
clf = DecisionTreeClassifier()
clf.fit(X_train.fillna(0), y_train_classified['spot_id_delta'])
y_pred = pd.DataFrame({'y_pred': clf.predict(X_train.fillna(0))})

In [None]:
sns.heatmap(y_pred.isna(),cbar=False)

In [None]:
sns.heatmap(y_train.isna(),cbar=False)

In [None]:
print("weighted accuracy on y_train: ", weighted_accuracy(y_train, y_pred['y_pred'].to_numpy()))
print("weighted accuracy on y_test: ", weighted_accuracy(y_test, clf.predict(X_test.fillna(0))))
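
The plot in the next cell relies on `depths`, `train_scores` and `test_scores`, which are not defined in the notebook as shown. A minimal sketch of how such a sweep could be computed follows. It assumes `balanced_accuracy_score` (imported at the top of the notebook but otherwise unused) as the score; the helper `depth_sweep`, the synthetic data and the depth range are illustrative assumptions, not the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def depth_sweep(X_tr, y_tr, X_te, y_te, depths, random_state=0):
    """Fit one tree per candidate depth and score it on both splits."""
    train_scores, test_scores = [], []
    for depth in depths:
        model = DecisionTreeClassifier(max_depth=depth, random_state=random_state)
        model.fit(X_tr, y_tr)
        train_scores.append(balanced_accuracy_score(y_tr, model.predict(X_tr)))
        test_scores.append(balanced_accuracy_score(y_te, model.predict(X_te)))
    return train_scores, test_scores

# Synthetic stand-in for the challenge data, for illustration only
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

demo_depths = list(range(1, 11))
demo_train_scores, demo_test_scores = depth_sweep(X_tr, y_tr, X_te, y_te, demo_depths)
print("best demo depth:", demo_depths[int(np.argmax(demo_test_scores))])
```

On the notebook's data, the call would presumably take `X_train.fillna(0)`, `y_train_classified['spot_id_delta']`, `X_test.fillna(0)` and `y_test_classified['spot_id_delta']`, together with a list `depths` of candidate depths, and assign the result to `train_scores, test_scores`.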

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(depths, train_scores, color="b", linestyle=':', label="train")
plt.plot(depths, test_scores, color="r", linestyle='-', label="test")
plt.xlabel("Tree depth")
plt.ylabel("Score")
plt.grid()
plt.legend(loc="best")

From this we read off the best depth for the Decision Tree Classifier:

In [None]:
best_depth = depths[np.argmax(test_scores)]
print(f"best_depth = {best_depth} for a score of {max(test_scores)}")

Since the score is above 0.5, this model predicts **with better than one chance in two** whether the price will be higher on the SPOT market or on the Intraday market.

We now want to generate predictions from the X_rendu data set:

In [None]:
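clf = DecisionTreeClassifier(max_depth=24)
# Fit on the sign labels only, with the same NaN imputation as the submission set
clf.fit(X_train.fillna(0), y_train_classified['spot_id_delta'])

# DELIVERY_START is the index of X_rendu, so bring it back as a column
Y_test_submission = X_rendu.reset_index()[['DELIVERY_START']].copy()
Y_test_submission['spot_id_delta'] = clf.predict(X_rendu.fillna(0))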


In [None]:
Y_test_submission.head()

In [None]:
Y_test_submission.to_csv('../data/y_submission.csv', index=False)

In [None]:
y_train = pd.read_csv('../data/y_train.csv').fillna(0)
X_train = pd.read_csv('../data/X_train.csv').fillna(0)

threshold = 600

# Drop delivery slots whose |spot_id_delta| reaches the threshold (outliers)
eliminated = y_train[y_train['spot_id_delta'].abs() >= threshold].DELIVERY_START
print(eliminated)
y_train = y_train[~y_train['DELIVERY_START'].isin(eliminated)]
X_train = X_train[~X_train['DELIVERY_START'].isin(eliminated)]

y = y_train['spot_id_delta']
x = X_train["load_forecast"]
plt.title("spot_id_delta versus the forecast of total electricity consumption in France")
plt.xlabel("forecast")
plt.ylabel("delta")
plt.scatter(x, y)

In [None]:
x.size

In [None]:
y.size