<h1 style="text-align:center">Sangam 2019 - ML Hackathon by IITMAA</h1>
<p>
    <strong>Approach used: </strong>SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors)<br><br>
    <strong>Reason: </strong>The data is a seasonal time series, and several exogenous variables influence the target. SARIMAX models seasonality and exogenous regressors together, which makes it a natural statistical model for this task.
    <br><br>
    <strong>Main Modules Used: </strong>
    <ul>
        <li><code>statsmodels</code> package in Python</li>
    </ul>
</p>

<h2>Import Required Modules</h2>

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import norm
import statsmodels.api as sm
from tqdm import tqdm

<h2>Read the Train Data</h2>

In [44]:
data = pd.read_excel('data1.xlsx')
data.index = data.mois
data = data.drop(['mois'],axis=1)
data.head()

mois,Mesure
2018-01-01,1141
2018-01-01,1157
2018-01-01,2246
2018-01-01,3177
2018-01-01,276


<h2>Data Preprocessing</h2>
<p>To handle the categorical variables <code>is_holiday</code>, <code>weather_type</code>, and <code>weather_description</code>, we apply <strong>one-hot encoding</strong>.</p>
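
<p>As a minimal sketch of what one-hot encoding does, the toy frame below (made-up values, with <code>weather_type</code> borrowed from the list above and a hypothetical <code>temp</code> column) is expanded into one indicator column per category:</p>

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame(
    {
        "weather_type": ["Clear", "Rain", "Clear", "Snow"],
        "temp": [280.1, 275.3, 282.0, 268.4],
    }
)

# One indicator column per category, stored as 0/1 integers.
dummies = pd.get_dummies(toy["weather_type"], prefix="weather_type", dtype=int)

# Replace the original categorical column with its indicator columns.
encoded = pd.concat([toy.drop(columns=["weather_type"]), dummies], axis=1)
print(encoded)
```

<p>Each row has exactly one indicator set to 1, so the categorical information is preserved without imposing an artificial ordering on the categories.</p>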

In [45]:
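def pre_process(data):
    # Work on a copy so the caller's DataFrame is not modified.
    data = data.copy()
    # Keep the raw values: the column is reset to 0 before the flags are written.
    raw = data['Mesure'].to_numpy()
    col = data.columns.get_loc('Mesure')
    data['Mesure'] = 0
    for i in tqdm(range(len(data))):
        if raw[i] != "None":
            # .iloc[row, col] writes in place; data.iloc[i]['Mesure'] = 1 writes to a copy.
            data.iloc[i, col] = 1
    # One-hot encode the 0/1 flag into Mesure_<value> columns.
    Mesure_ = pd.get_dummies(data['Mesure'], prefix="Mesure", dtype=int)
    data = data.drop(['Mesure'], axis=1)
    data = pd.concat([data, Mesure_], axis=1)
    return data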

In [42]:
data = pre_process(data)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.iloc[i]['Mesure'] = 1
100%|██████████| 2652/2652 [00:00<00:00, 6014.94it/s]


<h2>Train Data Assignment</h2>
<h4>(All data is used for training here; for validation, use the commented-out split instead)</h4>

In [46]:
# train = data.iloc[:int(0.9*len(data))]
# test = data.iloc[int(0.9*len(data)):]

train = data
train.head()

mois,Mesure
2018-01-01,1141
2018-01-01,1157
2018-01-01,2246
2018-01-01,3177
2018-01-01,276
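
<p>The commented-out validation split can be sketched as a small helper. The function name <code>chronological_split</code> and the toy frame are assumptions for illustration; the idea is to cut the frame by position so that the validation rows come after the training rows, without shuffling.</p>

```python
import pandas as pd


def chronological_split(frame, train_fraction=0.9):
    """Split a time-ordered frame by position, without shuffling."""
    cut = int(train_fraction * len(frame))
    return frame.iloc[:cut], frame.iloc[cut:]


# Toy frame for illustration only.
toy = pd.DataFrame(
    {"Mesure": range(10)},
    index=pd.date_range("2018-01-01", periods=10, freq="MS"),
)
train_part, valid_part = chronological_split(toy)
print(len(train_part), len(valid_part))
```

<p>A positional cut only respects time order if the rows are already sorted by date, so sorting the index first is a reasonable precaution.</p>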


<h2>Specify endogenous and exogenous variables in the data</h2>

In [49]:
# Variables
exog_data = train.drop(['Mesure'],axis=1)
exog = sm.add_constant(exog_data)
endog = train[[u'Mesure']]

print(endog)
print(exog)
# nobs = endog.shape[0]

            Mesure
mois              
2018-01-01    1141
2018-01-01    1157
2018-01-01    2246
2018-01-01    3177
2018-01-01     276
...            ...
2021-04-01     962
2021-04-01   11445
2021-05-01   20998
2021-05-01  439194
2021-05-01  101445

[2652 rows x 1 columns]
            const
mois             
2018-01-01    1.0
2018-01-01    1.0
2018-01-01    1.0
2018-01-01    1.0
2018-01-01    1.0
...           ...
2021-04-01    1.0
2021-04-01    1.0
2021-05-01    1.0
2021-05-01    1.0
2021-05-01    1.0

[2652 rows x 1 columns]


<h2>Train the Model (Slow Cell)</h2>

In [50]:
# Fit the model. No seasonal_order is passed, so this is an ARMA(1, 1) regression on exog.
mod = sm.tsa.statespace.SARIMAX(endog, exog=exog, order=(1,0,1))
fit_res = mod.fit(disp=False)
print(fit_res.summary())



                               SARIMAX Results                                
Dep. Variable:                 Mesure   No. Observations:                 2652
Model:               SARIMAX(1, 0, 1)   Log Likelihood              -28570.360
Date:                Thu, 03 Mar 2022   AIC                          57148.720
Time:                        11:31:02   BIC                          57172.252
Sample:                             0   HQIC                         57157.238
                               - 2652                                         
Covariance Type:                  opg                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const       6365.7885   1467.539      4.338      0.000    3489.464    9242.113
ar.L1          0.9660      0.013     74.095      0.000       0.940       0.992
ma.L1         -0.8607      0.015    -56.528      0.0

<h2>Read in Test Set</h2>

In [51]:
test_set = pd.read_excel("data.xlsx")
test_set.index = test_set.mois
test_set = test_set.drop(['mois'],axis=1)
test_set.head()

mois,Mesure
2018-01-01,1141
2018-01-01,1157
2018-01-01,2246
2018-01-01,3177
2018-01-01,276


In [52]:
test_set = pre_process(test_set)

100%|██████████| 2652/2652 [00:00<00:00, 8429.43it/s]


In [53]:
test_set.head()

mois,Mesure_1
2018-01-01,1
2018-01-01,1
2018-01-01,1
2018-01-01,1
2018-01-01,1


<h2>Handling columns that aren't present in the test set, but are in the train set</h2>

In [54]:
for i in train.columns:
    if i not in test_set.columns:
        test_set[i] = 0
test_set.tail()

mois,Mesure_1,Mesure
2021-04-01,1,0
2021-04-01,1,0
2021-05-01,1,0
2021-05-01,1,0
2021-05-01,1,0
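
<p>An alternative to the loop above is <code>DataFrame.reindex</code>, sketched here on two made-up frames. Unlike the loop, it also drops columns that exist only in the test frame and reorders the rest to match the training frame, which may or may not be wanted.</p>

```python
import pandas as pd

# Toy frames for illustration only.
train_cols = pd.DataFrame({"Mesure_0": [1, 0], "Mesure_1": [0, 1]})
test_cols = pd.DataFrame({"Mesure_1": [1, 1], "extra": [5, 6]})

# Missing columns are created and filled with 0; 'extra' is dropped.
aligned = test_cols.reindex(columns=train_cols.columns, fill_value=0)
print(aligned)
```

<p>Matching the column order matters here because the fitted model expects exogenous columns in the same layout it was trained on.</p>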


<h2>Forecasting the <code>traffic_volume</code> for the given test set</h2>

In [56]:
# last_train = train.iloc[len(train)-1].name
first_predict = test_set.iloc[0].name
# print(last_train,first_predict)
# import datetime as dt

# start_dt = dt.datetime.strptime(last_train, '%Y-%m-%d %H:%M:%S')
# predict_dt = dt.datetime.strptime(first_predict, '%Y-%m-%d %H:%M:%S')
# diff = (predict_dt - start_dt) 
# days, seconds = diff.days, diff.seconds
# hours = days * 24 + seconds // 3600
# print(hours)

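# Build the exogenous matrix for the forecast horizon and drop the target column.
# add_constant() adds no 'const' column here because 'Mesure_1' is already constant.
exog1 = sm.add_constant(test_set).loc[first_predict:]
exog1 = exog1.drop(['Mesure'], axis=1)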

# print(pd.concat([exog,exog1]))
# predict = fit_res.predict(start=hours,end=hours,exog=exog1)
# print(predict)

print(exog1)
print(exog1.shape)
forecast = fit_res.forecast(steps = len(test_set),exog = exog1)
print(forecast,len(forecast),len(test_set))

            Mesure_1
mois                
2018-01-01         1
2018-01-01         1
2018-01-01         1
2018-01-01         1
2018-01-01         1
...              ...
2021-04-01         1
2021-04-01         1
2021-05-01         1
2021-05-01         1
2021-05-01         1

[2652 rows x 1 columns]
(2652, 1)
2652    58540.918895
2653    56767.507620
2654    55054.373866
2655    53399.468825
2656    51800.813328
            ...     
5299     6365.788522
5300     6365.788522
5301     6365.788522
5302     6365.788522
5303     6365.788522
Name: predicted_mean, Length: 2652, dtype: float64 2652 2652




In [57]:
result_data = pd.DataFrame(index=test_set.index, columns=['mois','Mesure'])
result_data.head()

mois (index),mois,Mesure
2018-01-01,,
2018-01-01,,
2018-01-01,,
2018-01-01,,
2018-01-01,,


In [58]:
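mois_col = result_data.columns.get_loc("mois")
mesure_col = result_data.columns.get_loc("Mesure")
for chk, value in enumerate(tqdm(forecast)):
    # The column is named 'mois' (not 'date_time'); .iloc[row, col] writes in place.
    result_data.iloc[chk, mois_col] = test_set.index[chk]
    result_data.iloc[chk, mesure_col] = value
result_data.head()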

100%|██████████| 2652/2652 [00:01<00:00, 1963.50it/s]


mois (index),mois,Mesure
2018-01-01,,58540.918895
2018-01-01,,56767.50762
2018-01-01,,55054.373866
2018-01-01,,53399.468825
2018-01-01,,51800.813328


In [59]:
result_data.to_csv('results.csv', header=['mois','Mesure'], index=False) 
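
<p>The result frame can also be built without a row-by-row loop. The sketch below uses made-up forecast values and dates. It passes the forecast through <code>to_numpy()</code> because the forecast is indexed by integer positions rather than dates, and pandas would otherwise try to align the two indexes.</p>

```python
import pandas as pd

# Toy stand-ins for the forecast and the test-set dates (illustrative values).
toy_forecast = pd.Series([3.0, 2.5, 2.0], index=[3, 4, 5], name="predicted_mean")
toy_dates = pd.to_datetime(["2018-01-01", "2018-01-01", "2018-02-01"])

# Pair each date with its forecast by position, not by index label.
toy_result = pd.DataFrame({"mois": toy_dates, "Mesure": toy_forecast.to_numpy()})
print(toy_result)
```

<p>With both columns filled this way, <code>to_csv(..., index=False)</code> writes the dates alongside the predictions.</p>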