<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Récupération-des-données" data-toc-modified-id="Récupération-des-données-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data loading</a></span></li><li><span><a href="#Séparation-des-données" data-toc-modified-id="Séparation-des-données-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Splitting the data</a></span></li><li><span><a href="#Paramètres-globaux-pour-générer-les-baselines" data-toc-modified-id="Paramètres-globaux-pour-générer-les-baselines-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Global parameters for generating the baselines</a></span><ul class="toc-item"><li><span><a href="#nb_predictions" data-toc-modified-id="nb_predictions-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>nb_predictions</a></span></li><li><span><a href="#nb_jours" data-toc-modified-id="nb_jours-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>nb_days</a></span></li><li><span><a href="#cols_to_keep" data-toc-modified-id="cols_to_keep-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>cols_to_keep</a></span></li><li><span><a href="#time_index" data-toc-modified-id="time_index-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>time_index</a></span></li></ul></li><li><span><a href="#Naive-last" data-toc-modified-id="Naive-last-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Naive last</a></span></li><li><span><a href="#Naive-seasonal" data-toc-modified-id="Naive-seasonal-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Naive seasonal</a></span><ul class="toc-item"><li><span><a href="#Exponential-smoothing" data-toc-modified-id="Exponential-smoothing-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Exponential smoothing</a></span></li></ul></li></ul></div>

In [103]:
import pandas as pd

from statsmodels.tsa.holtwinters import SimpleExpSmoothing

from sklearn.metrics import mean_squared_error

# Data loading

In [58]:
df_train = pd.read_pickle('data/df_train_baseline.pkl')
df_test = pd.read_pickle('data/df_test_baseline.pkl')

Let's check that the test set covers the expected period.

In [3]:
df_test.head(1)

Unnamed: 0,Libelle,Date et heure de comptage,Débit horaire,Taux d'occupation,Etat trafic,Etat arc,filename,Date,Jour de la semaine_0,Jour de la semaine_1,...,mintempC,avgtempC,totalSnow_cm,sunHour,daily_uvIndex,sunrise,sunset,moon_phase,moon_illumination,Journée
9259,AV_Champs_Elysees,2020-11-23,91.0,1.38722,1,Invalide,champs-elysees.csv,2020-11-23,1,0,...,8,9,0.0,3.3,2,2020-11-23 08:12:00,2020-11-23 17:02:00,First Quarter,54,False


In [4]:
df_test.tail(1)

Unnamed: 0,Libelle,Date et heure de comptage,Débit horaire,Taux d'occupation,Etat trafic,Etat arc,filename,Date,Jour de la semaine_0,Jour de la semaine_1,...,mintempC,avgtempC,totalSnow_cm,sunHour,daily_uvIndex,sunrise,sunset,moon_phase,moon_illumination,Journée
28383,Sts_Peres,2020-11-27 23:00:00,109.0,1.24945,1,Invalide,sts.csv,2020-11-27,0,0,...,8,10,0.0,8.4,3,2020-11-27 08:18:00,2020-11-27 16:59:00,Waxing Gibbous,85,False


Define the names of the target variables:

In [5]:
cols_to_explain = ["Débit horaire", "Taux d'occupation"]

# Try it on the last row of df_train:
df_train.iloc[-1][cols_to_explain]

Débit horaire             90
Taux d'occupation    1.32778
Name: 28263, dtype: object

# Splitting the data

In [6]:
list_filenames = df_train['filename'].unique().tolist()
list_filenames

['champs-elysees.csv', 'convention.csv', 'sts.csv']

In [7]:
dic_df_train = {filename[:-4]: df_train[df_train['filename']==filename] for filename in list_filenames}
dic_df_test = {filename[:-4]: df_test[df_test['filename']==filename] for filename in list_filenames}

dic_df_train.keys()

dict_keys(['champs-elysees', 'convention', 'sts'])
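
The same per-file split can also be sketched with `groupby`, which scans the frame once instead of once per file. The toy frame and the `split_by_file` helper are hypothetical and not part of the notebook; `Path(...).stem` stands in for the `[:-4]` slice so the extension length is not hard-coded.

```python
from pathlib import Path

import pandas as pd


def split_by_file(df, column="filename"):
    """Return {file stem: sub-DataFrame}, one entry per distinct file name."""
    # sort=False keeps the files in order of first appearance.
    return {Path(name).stem: group for name, group in df.groupby(column, sort=False)}


# Hypothetical toy data standing in for df_train.
toy = pd.DataFrame({
    "filename": ["champs-elysees.csv", "convention.csv", "sts.csv", "sts.csv"],
    "Débit horaire": [91.0, 40.0, 109.0, 90.0],
})

dic_toy = split_by_file(toy)
print(list(dic_toy))
```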

In [8]:
predictions = pd.DataFrame([df_train.iloc[-1][cols_to_explain] for _ in range(7)])
predictions

Unnamed: 0,Débit horaire,Taux d'occupation
28263,90.0,1.32778
28263,90.0,1.32778
28263,90.0,1.32778
28263,90.0,1.32778
28263,90.0,1.32778
28263,90.0,1.32778
28263,90.0,1.32778


# Global parameters for generating the baselines

## nb_predictions

`nb_predictions` is the number of rows to predict, i.e. the length of the full test set (360 rows here, all three streets combined).

In [9]:
nb_predictions = len(df_test)
nb_predictions

360

## nb_days

In [10]:
nb_days = 5

## cols_to_keep

In [11]:
cols_to_keep = ['Date et heure de comptage'] + cols_to_explain
cols_to_keep

['Date et heure de comptage', 'Débit horaire', "Taux d'occupation"]

## time_index

In [49]:
time_index = dic_df_test['sts']['Date et heure de comptage'].reset_index(drop=True)
time_index

0     2020-11-23 00:00:00
1     2020-11-23 01:00:00
2     2020-11-23 02:00:00
3     2020-11-23 03:00:00
4     2020-11-23 04:00:00
              ...        
115   2020-11-27 19:00:00
116   2020-11-27 20:00:00
117   2020-11-27 21:00:00
118   2020-11-27 22:00:00
119   2020-11-27 23:00:00
Name: Date et heure de comptage, Length: 120, dtype: datetime64[ns]
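
As a sanity check on these sizes, the expected hourly index can be rebuilt with `pd.date_range`. This is a sketch under the assumption that every street has one observation per hour over the test period; `expected_index` and `nb_streets` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

nb_days = 5      # value set in the notebook
nb_streets = 3   # champs-elysees, convention, sts

# One timestamp per hour from the first to the last hour of the test period.
expected_index = pd.date_range(
    "2020-11-23 00:00", "2020-11-27 23:00", freq=pd.Timedelta(hours=1)
)

print(len(expected_index))               # timestamps per street
print(len(expected_index) * nb_streets)  # rows in the full test set
```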

# Naive last

In [54]:
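class Naive_last():
    """Baseline that repeats the last observed day for every forecast day."""

    def __init__(self, time_index):
        self.last_day = None
        self.time_index = time_index  # timestamps of the forecast horizon

    @staticmethod
    def get_last_day(df_train):
        # Rows whose 'Date' equals the most recent calendar day in the training set.
        last_day = df_train['Date et heure de comptage'].max().date()
        return df_train[df_train['Date']==last_day]

    def fit(self, df_train):
        # Relies on the global `cols_to_explain` defined earlier.
        self.last_day = self.get_last_day(df_train)[cols_to_explain]

    def predict(self, nb_days):
        # Stack the last day nb_days times, then attach the forecast timestamps.
        df_temp = pd.concat([self.last_day for _ in range(nb_days)], axis=0).reset_index(drop=True)

        return pd.concat([self.time_index, df_temp], axis=1)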
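
To make the mechanics concrete, here is a minimal standalone sketch of the same idea on synthetic data. The `timestamp` and `flow` columns and the toy values are hypothetical; the snippet only illustrates the "repeat the last day" logic and is not a substitute for the class above.

```python
import pandas as pd

# Hypothetical toy series: 2 days of hourly counts (not the notebook's data).
hours = pd.date_range("2020-11-21", periods=48, freq=pd.Timedelta(hours=1))
toy_train = pd.DataFrame({"timestamp": hours, "flow": range(48)})

# Keep only the rows of the most recent calendar day.
last_date = toy_train["timestamp"].max().date()
last_day = toy_train[toy_train["timestamp"].dt.date == last_date]

# Repeat that day's 24 values once per forecast day.
nb_days = 3
forecast = pd.concat([last_day["flow"]] * nb_days, ignore_index=True)

print(len(forecast))
```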

In [56]:
dic_results_naive_last = dict()

for street in dic_df_train.keys():
    
    dic_results_naive_last[street] = dict()
    
    df_train = dic_df_train[street]
    df_test = dic_df_test[street]
    
    # Create the model:
    naive_last = Naive_last(time_index)
    naive_last.fit(df_train)
    
    predictions = naive_last.predict(nb_days)
    
    for col in cols_to_explain:
        dic_results_naive_last[street][col] = mean_squared_error(df_test[col], predictions[col]) ** 0.5  # RMSE
    
dic_results_naive_last    

{'champs-elysees': {'Débit horaire': 523.4125125876657,
  "Taux d'occupation": 5.930661152210112},
 'convention': {'Débit horaire': 280.83731945736844,
  "Taux d'occupation": 2.637713405966425},
 'sts': {'Débit horaire': 256.02517779833033,
  "Taux d'occupation": 3.4003668198224095}}

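The scores above are root mean squared errors (RMSE), expressed in the same unit as each target. As a minimal sketch of what is being computed, here is a dependency-free version; the `rmse` helper and the sample values are hypothetical and only meant to illustrate the formula.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: errors are 0, -4 and 3, so the mean squared error is 25 / 3.
print(rmse([90.0, 100.0, 110.0], [90.0, 104.0, 107.0]))
```

Taking the square root of the mean squared error by hand avoids depending on a particular scikit-learn keyword argument.
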
# Naive seasonal

**Note:** Despite its name, this model does not actually predict over a full week. We nevertheless refer to these 5 days as the `week`.

In [97]:
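class Naive_seasonal():
    """Baseline that replays the last 5 observed days as the forecast."""

    def __init__(self, time_index):
        self.last_week = None
        self.time_index = time_index  # timestamps of the forecast horizon

    @staticmethod
    def get_last_week(df_train):
        # Dates of the 5 most recent calendar days in the training set.
        last_day = df_train['Date et heure de comptage'].max()
        list_last_week = [(last_day - pd.DateOffset(i)).date() for i in range(5)]
        return df_train[df_train['Date'].isin(list_last_week)]

    def fit(self, df_train):
        # Relies on the global `cols_to_explain`; the index is reset so that
        # it lines up with time_index in predict().
        self.last_week = self.get_last_week(df_train)[cols_to_explain].reset_index(drop=True)

    def predict(self):
        return pd.concat([self.time_index, self.last_week], axis=1)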
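
The same logic can be sketched on synthetic data. The `timestamp` and `flow` columns, the 10-day toy history and the `season_days` name are assumptions made for illustration; only the "replay the last 5 days" idea comes from the class above.

```python
import pandas as pd

season_days = 5  # length of the replayed block, as in the class above

# Hypothetical toy series: 10 days of hourly counts ending on 2020-11-22 23:00.
hours = pd.date_range("2020-11-13", periods=10 * 24, freq=pd.Timedelta(hours=1))
toy_train = pd.DataFrame({"timestamp": hours, "flow": range(len(hours))})

# Select the rows belonging to the last `season_days` calendar days.
last_ts = toy_train["timestamp"].max()
last_dates = [(last_ts - pd.Timedelta(days=i)).date() for i in range(season_days)]
last_season = toy_train[toy_train["timestamp"].dt.date.isin(last_dates)]

# Replay those values, hour for hour, on the timestamps that follow the training set.
future = pd.date_range(
    last_ts + pd.Timedelta(hours=1), periods=len(last_season), freq=pd.Timedelta(hours=1)
)
forecast = pd.DataFrame({"timestamp": future, "flow": last_season["flow"].to_numpy()})

print(forecast.head())
```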

In [98]:
# `street` still holds the last key of the previous loop ('sts').
df_train = dic_df_train[street]
df_test = dic_df_test[street]

# Create the model:
naive_seasonal = Naive_seasonal(time_index)
naive_seasonal.fit(df_train)

In [102]:
dic_results_naive_seasonal = dict()

for street in dic_df_train.keys():
    
    dic_results_naive_seasonal[street] = dict()
    
    df_train = dic_df_train[street]
    df_test = dic_df_test[street]
    
    # Create the model:
    naive_seasonal = Naive_seasonal(time_index)
    naive_seasonal.fit(df_train)
    
    predictions = naive_seasonal.predict()
    
    for col in cols_to_explain:
        dic_results_naive_seasonal[street][col] = mean_squared_error(df_test[col], predictions[col]) ** 0.5  # RMSE
    
dic_results_naive_seasonal    

{'champs-elysees': {'Débit horaire': 296.29010277091606,
  "Taux d'occupation": 3.5735788455522193},
 'convention': {'Débit horaire': 159.13280617144915,
  "Taux d'occupation": 1.919154093148854},
 'sts': {'Débit horaire': 144.68549224668888,
  "Taux d'occupation": 1.9229872899031775}}

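On every street and for both targets, the seasonal baseline has a lower RMSE than the naive-last baseline. A short sketch of how the relative reduction could be computed is given below; the values are copied, rounded, from the two outputs above, and the `rmse_last`, `rmse_seasonal` and `reductions` names are hypothetical.

```python
# RMSE values copied (rounded) from the two result dictionaries above.
rmse_last = {
    "champs-elysees": {"Débit horaire": 523.41, "Taux d'occupation": 5.9307},
    "convention": {"Débit horaire": 280.84, "Taux d'occupation": 2.6377},
    "sts": {"Débit horaire": 256.03, "Taux d'occupation": 3.4004},
}
rmse_seasonal = {
    "champs-elysees": {"Débit horaire": 296.29, "Taux d'occupation": 3.5736},
    "convention": {"Débit horaire": 159.13, "Taux d'occupation": 1.9192},
    "sts": {"Débit horaire": 144.69, "Taux d'occupation": 1.9230},
}

# Relative RMSE reduction of the seasonal baseline with respect to naive last.
reductions = {
    street: {
        col: 1 - rmse_seasonal[street][col] / rmse_last[street][col]
        for col in cols
    }
    for street, cols in rmse_last.items()
}

for street, cols in reductions.items():
    for col, value in cols.items():
        print(f"{street:15s} {col:18s} {value:.1%}")
```
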
## Exponential smoothing
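
The imports at the top of the notebook pull in `SimpleExpSmoothing` from statsmodels. As a rough illustration of the underlying recurrence only, simple exponential smoothing can be written by hand as below. This is a sketch, not the statsmodels API and not the notebook's implementation: the `ses_level` and `ses_forecast` helpers, the value of `alpha` and the initialisation on the first observation are all assumptions.

```python
def ses_level(values, alpha):
    """Final smoothed level: level_t = alpha * x_t + (1 - alpha) * level_{t-1}."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not values:
        raise ValueError("values must not be empty")
    level = values[0]  # assumption: initialise the level with the first observation
    for x in values[1:]:
        level = alpha * x + (1 - alpha) * level
    return level


def ses_forecast(values, alpha, horizon):
    """Simple exponential smoothing yields a flat forecast equal to the last level."""
    return [ses_level(values, alpha)] * horizon


# Hypothetical observations: the level goes 10 -> 15 -> 22.5 with alpha = 0.5.
print(ses_forecast([10.0, 20.0, 30.0], alpha=0.5, horizon=3))
```

With `alpha=1` the forecast reduces to repeating the last observed value, whereas smaller values of `alpha` give more weight to older observations.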