# Tabular Playground July

In this notebook I do some exploratory data analysis and then fit a model to make predictions. The model will most likely be a sARIMA with exogenous variables.

## Data loading

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

train = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv', 
                    parse_dates=["date_time"])
target = train[['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']]
series = target.copy()
series = series.set_index(train.date_time)
train = train.set_index('date_time')
train = train.drop(['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides'], axis=1)
test = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv',
                  parse_dates=["date_time"])
test = test.set_index('date_time')

## Exploratory Data Analysis

In this part I will cover the following:
* Histograms
* Summaries with basic information (mean, stdev...)
* Autocorrelation and partial autocorrelation plots

In [None]:
train.head()

In [None]:
train.shape

### Histograms

In [None]:
fig = train.hist(figsize=(100, 100), bins=30)
[x.title.set_size(80) for x in fig.ravel()]
plt.show()

The first thing to notice is that the histograms are skewed. Moreover, some variables have a peak at zero. Later on we can correct the skewness with a Box-Cox transformation. For the zeros, we could either keep them as genuine values or treat them as NaNs and interpolate them. We will try both and see which one yields better results.
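The two ways of handling zeros mentioned above can be sketched on a toy series. This is a minimal illustration only: the index, the values and the name `sensor_demo` are made up, and linear interpolation in time is just one possible way to fill the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings; the two zeros stand in for suspected missing values.
idx = pd.date_range("2021-01-01", periods=6, freq=pd.Timedelta(hours=1))
sensor = pd.Series([10.0, 12.0, 0.0, 0.0, 18.0, 20.0], index=idx, name="sensor_demo")

# Option 1: keep the zeros as genuine measurements.
kept = sensor.copy()

# Option 2: treat the zeros as missing and fill them by linear interpolation in time.
interpolated = sensor.replace(0.0, np.nan).interpolate(method="time")

print(pd.DataFrame({"kept": kept, "interpolated": interpolated}))
```

Both versions can then go through the same pipeline, so the comparison only depends on how the zeros were treated.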

### Summary

In [None]:
train.describe()

The second thing to notice is that the variables are on very different scales. To solve this we may want to rescale all of them to a common range.

### ACF and PACF

In [None]:
from statsmodels.tsa.stattools import acf

def acf_plot(series, lags):
    fig, ax = plt.subplots(1,3, figsize=(50,10))
    fig.tight_layout(pad=10)
    # Labels follow the column order of `series`: CO, benzene, nitrogen oxides.
    labels = ['CO', 'Benzene', 'NOx']
    for i in range(3):
        # alpha=.05 also returns the 95% confidence interval of each coefficient.
        acf_points, confint = acf(series.iloc[:,i], nlags=lags, fft=False, alpha=.05)

        ax[i].stem(acf_points, linefmt='-', markerfmt='o', basefmt='black')
        # Centre the interval on zero to draw it as a significance band.
        confint_center = confint-np.array((acf_points,acf_points)).T
        confint_low = confint_center[:,0]
        confint_high = confint_center[:,1]
        ax[i].fill_between(range(lags+1),confint_low, confint_high, alpha=0.5)
        ax[i].set_title(labels[i] + ' acf', fontweight="bold", size=50)
        plt.setp(ax[i].get_xticklabels(), rotation='horizontal', fontsize=30)
        plt.setp(ax[i].get_yticklabels(), rotation='horizontal', fontsize=30)
    plt.show()
acf_plot(series, 100)

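The ACF plots show a clear seasonal pattern and suggest that the series are not stationary: the autocorrelations repeat with a regular period instead of dying out quickly. The shaded area is the 95% confidence band computed with `alpha=.05`; note that `acf` returns confidence intervals here, not p-values.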

In [None]:
from statsmodels.tsa.stattools import pacf

# Labels follow the column order of `series`: CO, benzene, nitrogen oxides.
labels = ['CO', 'Benzene', 'NOx']
def pacf_plot(series, lags):
    fig, ax = plt.subplots(1,3, figsize=(50,10))
    fig.tight_layout(pad=10)
    for i in range(3):
        pacf_points, confint = pacf(series.iloc[:,i], nlags=lags, alpha=.05)

        ax[i].stem(pacf_points, linefmt='-', markerfmt='o', basefmt='black')
        # Centre the 95% interval on zero to draw it as a significance band.
        confint_center = confint-np.array((pacf_points,pacf_points)).T
        confint_low = confint_center[:,0]
        confint_high = confint_center[:,1]
        ax[i].fill_between(range(lags+1),confint_low, confint_high, alpha=0.5)
        ax[i].set_title(labels[i] + ' pacf', fontweight="bold", size=50)
        plt.setp(ax[i].get_xticklabels(), rotation='horizontal', fontsize=30)
        plt.setp(ax[i].get_yticklabels(), rotation='horizontal', fontsize=30)
    plt.show()
pacf_plot(series,50)

The PACF plot suggests a seasonal period of about 22 lags, while the ACF suggests 24. A period of 24 is more plausible because the data is hourly, so it corresponds to one day. We will use this period to difference the series and to find a proper sARIMA model later on.
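To see why a daily cycle in hourly data shows up as a peak at lag 24, here is a minimal sketch on synthetic data. The helper `sample_acf` and the noisy sine wave are assumptions made for illustration; they are not the notebook's data or the statsmodels implementation.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (biased estimator)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)])

rng = np.random.default_rng(0)
t = np.arange(24 * 60)  # 60 synthetic days of hourly observations
y = np.sin(2 * np.pi * t / 24) + 0.2 * rng.standard_normal(t.size)

rho = sample_acf(y, 48)
# Look for the strongest positive autocorrelation around one day.
peak_lag = int(np.argmax(rho[12:37])) + 12
print("lag with the highest autocorrelation near one day:", peak_lag)
```

On real data the peak is less sharp, which may explain why two plots can point to slightly different periods.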

## sARIMA

To choose a proper sARIMA model we first need a stationary series. To get one, we first apply seasonal differencing and then regular differencing, which gives the parameter $d$. Then we plot the new ACF and PACF and see which model fits best.
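For reference, the steps above target the usual multiplicative seasonal ARIMA form. The notation below is the standard textbook one and is not taken from this notebook: $B$ is the backshift operator ($B y_t = y_{t-1}$), $s$ is the seasonal period and $\varepsilon_t$ is white noise.

$$
\phi_p(B)\,\Phi_P(B^{s})\,(1-B)^{d}\,(1-B^{s})^{D}\,y_t = \theta_q(B)\,\Theta_Q(B^{s})\,\varepsilon_t
$$

Seasonal differencing applies $(1-B^{s})^{D}$ and regular differencing applies $(1-B)^{d}$; the ACF and PACF of the differenced series then guide the choice of $p$, $q$, $P$ and $Q$.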

### Seasonal differencing

We can try many different values for the period, representing daily, weekly, monthly or yearly periodicity. I'll start with 24, which represents the daily one.

In [None]:
series_D = series.diff(periods=24)[24:]
acf_plot(series_D, 100)
pacf_plot(series_D, 100)

Now the ACF and PACF seem to show the pattern of an MA(1) in the seasonal part. Apart from that, the values are near zero, which supports the hypothesis that one seasonal difference ($D=1$) with period $s=24$ is a good choice.
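What the seasonal difference does can be checked on a toy example. This is a sketch under stated assumptions: the series below is synthetic (a perfect 24-hour cycle plus a linear trend with slope 0.01 per hour), not the competition data.

```python
import numpy as np
import pandas as pd

hours = np.arange(24 * 10)
daily_pattern = 5 + 2 * np.sin(2 * np.pi * hours / 24)  # repeats every 24 hours
trend = 0.01 * hours
y = pd.Series(daily_pattern + trend)

# y_t - y_{t-24}: the daily cycle cancels and only 24 * slope remains.
seasonal_diff = y.diff(periods=24).dropna()

# A further regular difference removes the remaining constant.
regular_diff = seasonal_diff.diff().dropna()

print(seasonal_diff.round(6).unique(), regular_diff.abs().max())
```

The first 24 values are lost in the seasonal difference and one more in the regular one, which is why the notebook slices with `[24:]` and `[1:]`.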

### Normal differencing

I'll use the variance criterion to choose the number of differences: simply stop differencing when the variance increases.
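As a minimal sketch of this rule, the hypothetical helper `choose_d` below keeps differencing while the variance goes down. It is tried on a synthetic random walk and on white noise, which are assumptions chosen because their behaviour under differencing is well known.

```python
import numpy as np

def choose_d(x, max_d=3):
    """Number of regular differences chosen by the variance criterion."""
    x = np.asarray(x, dtype=float)
    d = 0
    while d < max_d:
        diffed = np.diff(x)
        # Stop as soon as one more difference does not reduce the variance.
        if np.var(diffed) >= np.var(x):
            break
        x, d = diffed, d + 1
    return d

rng = np.random.default_rng(42)
white_noise = rng.standard_normal(2000)
random_walk = np.cumsum(white_noise)

print("random walk:", choose_d(random_walk))
print("white noise:", choose_d(white_noise))
```

The cells below do the same thing by hand, printing the variance after each difference.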

In [None]:
series_D.plot(subplots=True)
plt.show()
# Column-wise population variance; avoids the ambiguous axis of np.var on a DataFrame.
print('Var:', series_D.var(ddof=0))

In [None]:
series_D_d = series_D.diff()[1:]
series_D_d.plot(subplots=True)
plt.show()
print('Var:', series_D_d.var(ddof=0))

In [None]:
series_D_d2 = series_D_d.diff()[1:]
series_D_d2.plot(subplots=True)
plt.show()
print('Var:', series_D_d2.var(ddof=0))

As we can see, the variance decreases with one difference but increases with two, so we take $d=1$. Let's now see which sARIMA fits best. This time I plot the ACF with only 30 lags in order to see the non-seasonal part, while the PACF is kept at 100 lags.

In [None]:
acf_plot(series_D_d, 30)
pacf_plot(series_D_d, 100)

The plots suggest an MA(5) for the non-seasonal part. This is only a starting point for modelling, so we will try different models as well.
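The reasoning behind reading an MA order off the ACF is that an MA($q$) process has zero theoretical autocorrelation beyond lag $q$. A minimal sketch with a simulated MA(1) follows; the coefficient `theta = 0.8` and the helper `sample_acf` are assumptions made for illustration.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (biased estimator)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)])

rng = np.random.default_rng(1)
n, theta = 5000, 0.8
eps = rng.standard_normal(n + 1)
ma1 = eps[1:] + theta * eps[:-1]  # y_t = eps_t + theta * eps_{t-1}

rho = sample_acf(ma1, 5)
theoretical_rho1 = theta / (1 + theta**2)  # lag-1 autocorrelation of an MA(1)

print("lag 1: sample %.3f, theoretical %.3f" % (rho[1], theoretical_rho1))
print("lags 2 to 5:", np.round(rho[2:], 3))
```

For an MA(5), the same cut-off would be expected after lag 5.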

## Standardisation

Now it is time to apply the Box-Cox transformation mentioned before and scale all the data.

In [None]:
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='box-cox')
train_sc = pd.DataFrame(data = pt.fit_transform(train), columns=train.columns, index=train.index)

In [None]:
train_sc.describe()

And now the same for the time series. Box-Cox requires positive values, which the original series has but the differenced series does not, so I apply the Box-Cox transformation first and the differences afterwards.

In [None]:
series_sc = pd.DataFrame(data = pt.fit_transform(series), columns=series.columns, index=series.index)
series_sc_D_d = series_sc.diff(24)[24:].diff()[1:]

In [None]:
series_sc_D_d.plot(subplots=True)
plt.show()
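For intuition, the Box-Cox transform applied above and its inverse can be written in a few lines. This is a sketch only: here $\lambda$ is passed in by hand, whereas `PowerTransformer` estimates one $\lambda$ per column by maximum likelihood and, by default, also standardises the result. The function names are hypothetical.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform for strictly positive data."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    return np.log(x) if lam == 0 else (x**lam - 1.0) / lam

def inv_box_cox(z, lam):
    """Inverse of box_cox for the same lambda."""
    z = np.asarray(z, dtype=float)
    return np.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)

x = np.array([0.5, 1.0, 2.0, 8.0])
for lam in (0.0, 0.5):
    z = box_cox(x, lam)
    print(lam, np.round(z, 4), np.round(inv_box_cox(z, lam), 4))
```

The positivity check is the reason zeros have to be dealt with before the transform, and the inverse is what maps the model's predictions back to the original scale.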

## Model

Now that the data has been preprocessed and analysed we can start applying models to see how well they work.

In [None]:
from sklearn.model_selection import train_test_split

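# `series` and `target` hold the same values; shuffle=False keeps the chronological order.
X_train, X_val, y_train, y_val = train_test_split(series, target, shuffle=False, test_size=0.4)
# Fit the Box-Cox transform on the training part only and reuse it on the validation part.
X_train_sc = pd.DataFrame(data = pt.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_val_sc = pd.DataFrame(data = pt.transform(X_val), columns=X_val.columns, index=X_val.index)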

In [None]:
X_val.shape

In [None]:
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_log_error

def arima_plot(order, s_order):
    # Labels follow the column order of the targets: CO, benzene, nitrogen oxides.
    labels = ['CO', 'Benzene', 'NOx']
    preds = []
    residuals = []
    for i in range(3):
        model = ARIMA(X_train_sc.iloc[:,i], order=order, seasonal_order=s_order, freq='H')
        res = model.fit()

        preds.append(res.predict(start=X_val_sc.index[0], end=X_val_sc.index[-1]))
        # Keep each model's residuals so that every column plots its own.
        residuals.append(res.resid)
    # Undo the Box-Cox transform to score on the original scale.
    preds = pd.DataFrame(data=pt.inverse_transform(np.array(preds).T), 
                         columns=X_val.columns, index=X_val.index)
    # Column-wise RMSLE: root of each target's MSLE, then the mean over the targets.
    err = np.sqrt(mean_squared_log_error(y_val, preds, multioutput='raw_values')).mean()

    fig, ax = plt.subplots(3,3, figsize=(50,20))
    fig.tight_layout(pad=10)
    for i in range(3):
        X_train.iloc[:,i].plot(ax = ax[0,i])
        X_val.iloc[:,i].plot(color='red', ax = ax[0,i])
        preds.iloc[:,i].plot(color='green', ax = ax[0,i])
        ax[0,i].legend(['Train','Validation','Prediction'])
        ax[0,i].set_title(labels[i] + ' predictions', fontweight="bold", size=50)
        plt.setp(ax[0,i].get_xticklabels(), rotation='horizontal', fontsize=30)
        plt.setp(ax[0,i].get_yticklabels(), rotation='horizontal', fontsize=30)
        
        # plot residual errors (time series and density) of the i-th model
        residuals[i].plot(ax = ax[1,i])
        residuals[i].plot(kind='kde', ax = ax[2,i])
        ax[1,i].set_title(labels[i] + ' residuals', fontweight="bold", size=50)
        plt.setp(ax[1,i].get_xticklabels(), rotation='horizontal', fontsize=30)
        plt.setp(ax[1,i].get_yticklabels(), rotation='horizontal', fontsize=30)
    plt.show() 
    print('Validation RMSLE:', err)

arima_plot((0,1,5), (0,1,1,24))

As we can see, there is much room for improvement. Let's try some other models and see what happens.

In [None]:
arima_plot((0,1,5), (1,1,1,24))

In [None]:
arima_plot((5,1,0), (0,1,1,24))
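The score printed by `arima_plot` is labelled RMSLE. As a reference for comparing the models above, here is a minimal NumPy sketch of a mean column-wise root mean squared logarithmic error; the function name `columnwise_rmsle` is hypothetical, and it assumes that this column-wise average is the score of interest.

```python
import numpy as np

def columnwise_rmsle(y_true, y_pred):
    """Mean over columns of sqrt(mean((log1p(pred) - log1p(true))**2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true < 0) or np.any(y_pred < 0):
        raise ValueError("RMSLE is only defined for non-negative values")
    sq_log_err = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    per_column = np.sqrt(sq_log_err.mean(axis=0))  # one RMSLE per target
    return per_column.mean()

# Toy example with two targets and three time steps.
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y_pred = np.array([[1.5, 9.0], [2.0, 25.0], [2.5, 30.0]])
print(columnwise_rmsle(y_true, y_pred))
```

Because the error is taken on the log scale, it has to be computed after the predictions have been transformed back from the Box-Cox scale.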

## Final predictions

Finally, we train the model on the whole training set, predict the test set, and save the predictions to submit them and see the result.

In [None]:
final_preds = []
# Refit the Box-Cox transform on the full target series.
series_sc = pd.DataFrame(data = pt.fit_transform(series), columns=series.columns, index=series.index)
for i in range(3):
    # `order` is left at its default (0,0,0), so only the seasonal part is modelled.
    final_model = ARIMA(series_sc.iloc[:,i], seasonal_order=(0,1,2,24), freq='H')
    final_res = final_model.fit()
    final_preds.append(final_res.predict(start=test.index[0], end=test.index[-1]))
# Transform the forecasts back to the original scale.
preds_final = pd.DataFrame(data=pt.inverse_transform(np.array(final_preds).T), 
                         columns=series_sc.columns, index=test.index)

In [None]:
fig, ax = plt.subplots(3,1, figsize=(50,20))
series.target_benzene.plot(ax=ax[0])
preds_final.target_benzene.plot(ax=ax[0])

series.target_carbon_monoxide.plot(ax=ax[1])
preds_final.target_carbon_monoxide.plot(ax=ax[1])

series.target_nitrogen_oxides.plot(ax=ax[2])
preds_final.target_nitrogen_oxides.plot(ax=ax[2])
plt.show()

In [None]:
preds_final.reset_index().to_csv('submission.csv', index=False)

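The best result was achieved with a sARIMA(0,0,0)(0,1,1)24. Note that the final cell above fits `seasonal_order=(0,1,2,24)`, so set it to `(0,1,1,24)` to reproduce that model.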