## Importing Libraries and **Data files**

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
train = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv')

In [None]:
print("train shape", train.shape)
print("test shape", test.shape)

# Data Analysis

The data analysis will follow these steps:

* Check out categorical and numerical data
* Check for duplicate data
* Detect null values
* Explore the data with graphs and plots
* Build model and predict







## Generic data analysis 
Checks for **null values**, **data types**, and **duplicate data**

In [None]:
print("train shape", train.shape)
train.describe().T

In [None]:
# info() prints its summary directly and returns None, so it is called on its own
print('Data types of the columns \n')
train.info()
print('\n\n\n Total Null values\n', train.isnull().sum())

In [None]:
print('Total Duplicate values\n', train.duplicated().sum())

### *Inference* 1: 
* No null values
* All columns except the time column are continuous variables of float type
* I will convert the date-time column from string to datetime format
* Predictions can come from forecasting or regression methods such as:
 * *Simple Univariate time series*
 * *Multivariate time series*
 * *Regression model, since the target is a continuous variable*
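The string-to-datetime conversion mentioned above can also happen while the file is being read. Below is a minimal sketch, assuming a CSV laid out like train.csv; the two rows and their values are made up for illustration, and `demo` is a hypothetical name.

```python
import io
import pandas as pd

# Tiny stand-in for train.csv; the rows and values are invented.
csv_text = (
    "date_time,deg_C,target_carbon_monoxide\n"
    "2010-03-10 18:00:00,13.1,2.5\n"
    "2010-03-10 19:00:00,13.2,2.1\n"
)

# parse_dates converts the column while reading; index_col makes it the index.
demo = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date_time"],
    index_col="date_time",
)

print(type(demo.index))
print(demo.dtypes)
```

With the real file, the same two arguments would replace the separate `pd.to_datetime` and `set_index` calls.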

# Time Analysis:  

## Loading and preparing the data

In [None]:
# date_time has to be changed from object to datetime format; this can also be done directly while importing the data
train['date_time']=pd.to_datetime(train['date_time'])
train.set_index('date_time', inplace = True)

# dataset for univariate analysis
train_u = train.copy()
cols = ['deg_C', 'relative_humidity', 'absolute_humidity',
       'sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5']
train_u.drop(cols, axis = 1, inplace = True)

#renaming the columns
# cm - carbon monoxide; ben = benzene; no = nitrogen oxides
train_u.columns = ["cm", "ben", "no"]

train_u.head()


## Visualize the data
So we have hourly data from 10 March to 1 January of the following year.
* We can extract information or trends by month, week, day, and hour
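One way to look at patterns by hour, weekday, or month is to group on attributes of the datetime index. This is only a sketch on a synthetic hourly series (a daily sine wave plus noise) standing in for one pollutant column; none of the numbers come from the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series: a daily cycle plus noise (made-up data).
rng = np.random.default_rng(0)
n_hours = 24 * 28  # four weeks
index = pd.Timestamp("2010-03-10") + pd.to_timedelta(np.arange(n_hours), unit="h")
hours = index.hour.to_numpy()
values = 2.0 + np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.05, n_hours)
demo = pd.Series(values, index=index, name="cm")

# Average level for each hour of the day, day of the week, and month.
by_hour = demo.groupby(demo.index.hour).mean()
by_weekday = demo.groupby(demo.index.dayofweek).mean()
by_month = demo.groupby(demo.index.month).mean()

print(by_hour.round(2))
print(by_weekday.round(2))
print(by_month.round(2))
```

On the notebook's data the same `groupby` calls could be applied to a column of `train_u` once `date_time` is the index.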

In [None]:
train_u.plot(
    subplots = True, 
    layout = (3,1), 
    sharex = True, 
    figsize = (30,15) )

Plotting all the targets together shows that the pollutants share the same pattern. Toward the end, their values increase, which may indicate an overall upward trend.
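The visual impression that the targets move together can be backed with a correlation matrix. The sketch below uses three made-up columns driven by one shared component, so the numbers say nothing about the real pollutants; on the notebook's data the equivalent call would be `train_u.corr()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
common = rng.normal(size=500)  # shared driver (invented for the example)

# Three synthetic columns on different scales that all follow the shared driver.
demo = pd.DataFrame({
    "cm": common + rng.normal(scale=0.3, size=500),
    "ben": 2 * common + rng.normal(scale=0.3, size=500),
    "no": 50 * common + rng.normal(scale=20, size=500),
})

# Pearson correlation between every pair of columns.
corr = demo.corr()
print(corr.round(2))
```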


In [None]:
# Analyse each target variable separately, starting with carbon monoxide.
# train_cm is used later to decide on the model parameters

train_cm = train_u.copy()
train_cm.drop(columns = ['ben','no'], inplace = True)
train_cm.tail()

In [None]:
train_cm.plot(figsize = (20,10))

## Stationarity in the data

In [None]:
# Testing For Stationarity
from statsmodels.tsa.stattools import adfuller


# H0: the series has a unit root (non-stationary)
# H1: the series is stationary
def adfuller_test(levels):
    result = adfuller(levels)
    labels = ['ADF Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used']
    # adfuller returns more items than there are labels; zip stops after the first four
    for value, label in zip(result, labels):
        print(label + ' : ' + str(value))
    if result[1] <= 0.05:
        print("\n strong evidence against the null hypothesis (H0), reject the null hypothesis. Data has no unit root and is stationary")
    else:
        print("\n weak evidence against the null hypothesis, cannot reject a unit root; the time series is treated as non-stationary")

In [None]:
adfuller_test(train_cm['cm'])

## AR / MA ? model parameters


In [None]:
from pandas.plotting import autocorrelation_plot

plt.figure(figsize=(20,8))
autocorrelation_plot(train_cm['cm']).set_xlim([0,100]) # setting the limit to a manageable level
plt.xticks(np.arange(0, 101, 12)) # a tick every 12 lags, kept inside the x-limit
plt.show()

The pattern repeats every 24 lags, so there is a daily seasonal pattern that can be used when preparing the model later.
* The next step is to analyse the ACF and PACF for the time series model
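The 24-lag repetition can also be read off numerically with `Series.autocorr`. This is a sketch on a synthetic series with a 24-step cycle, not the pollutant data: the autocorrelation should be strongly positive at lag 24 and strongly negative half a cycle away at lag 12.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(24 * 30)

# Synthetic series with a 24-step cycle plus a little noise (made-up data).
demo = pd.Series(np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.1, t.size))

ac_12 = demo.autocorr(lag=12)  # half a cycle apart
ac_24 = demo.autocorr(lag=24)  # one full cycle apart

print("lag 12:", round(ac_12, 2))
print("lag 24:", round(ac_24, 2))
```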

In [None]:
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf

In [None]:
# AR(p) or MA(q) based on the ACF and PACF graphs without accounting for the cyclic nature

fig = plt.figure(figsize=(24,8))
ax1 = fig.add_subplot(211)
fig = plot_acf(train_cm['cm'], lags = 60, ax=ax1)
ax2 = fig.add_subplot(212)
fig = plot_pacf(train_cm['cm'], lags=60, ax=ax2)

* The ACF plot again shows a seasonal/cyclic pattern with a period of 24
* From the PACF plot, p is most likely 1 or 2
* The pattern needs to be removed by differencing at a lag of 24
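Differencing at a lag of 24 subtracts from each value the value one cycle earlier. The sketch below, on a synthetic series with a 24-step cycle, checks that pandas' built-in `diff(24)` matches the manual `shift(24)` subtraction and shows how much of the variation the step removes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(24 * 30)

# Synthetic series: a strong 24-step cycle plus noise (made-up data).
demo = pd.Series(3 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.2, t.size))

manual = demo - demo.shift(24)   # the form used in this notebook
builtin = demo.diff(24)          # equivalent pandas shortcut

# The first 24 values are NaN in both, because there is no earlier cycle.
same = np.allclose(manual.dropna(), builtin.dropna())

std_before = demo.std()
std_after = builtin.dropna().std()
print("identical:", same)
print("std before:", round(std_before, 2), "std after:", round(std_after, 2))
```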

In [None]:
train_cm['diff'] = train_cm['cm'] - train_cm['cm'].shift(24)
train_cm['diff'].plot(figsize = (24,6))

# post differencing 
fig = plt.figure(figsize=(24,10))
ax1 = fig.add_subplot(211)
fig = plot_acf(train_cm['diff'].iloc[24:], lags = 200, ax=ax1)
ax2 = fig.add_subplot(212)
fig = plot_pacf(train_cm['diff'].iloc[24:],lags = 200, ax=ax2)

For the sARIMA model, since seasonality is observed:
* AR(p): p = 1, possibly stretched to 2, since the PACF values drop off after that
* d = 1, since differencing has been applied once (strictly, the lag-24 difference above is a seasonal difference, which corresponds to D = 1 with a period of 24)
* MA(q): q = 1, since the effect carries forward into the later lags, which the ACF plot also supports


## Building sARIMA 

### Selecting the order

In [None]:
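# arima 111
# Note: statsmodels.tsa.arima_model is the legacy API (removed in statsmodels 0.13);
# newer versions provide ARIMA in statsmodels.tsa.arima.model instead
from statsmodels.tsa.arima_model import ARIMA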

arima_model = ARIMA(train_cm['cm'],order=(1,1,1))
model_fit = arima_model.fit()
train_cm['forecast111'] = model_fit.predict(start=6000 , end=7110 )
train_cm[['cm','forecast111']].plot(figsize=(20,10))

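The poor forecast supports the earlier observation about seasonality, so I need to try a SARIMA model. One caveat: with d = 1, the legacy ARIMA predict method returns predictions of the differenced series by default (typ='linear'), so part of the mismatch may come from plotting differences against levels.
* Note: for this starting project I am focusing only on the target variables; in a subsequent notebook I will try adding the exogenous variables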

In [None]:
# sarima 111: only order is passed (no seasonal_order), so no seasonal terms are fitted
import statsmodels.api as sm

sarima_model = sm.tsa.statespace.SARIMAX(train_cm['cm'],order=(1, 1, 1))
model_fit = sarima_model.fit()
train_cm['sforecast111'] = model_fit.predict(start=6000 , end=7110 )
train_cm[['cm','sforecast111']].plot(figsize=(20,10))

In [None]:
# sarima 211: only order is passed (no seasonal_order), so no seasonal terms are fitted
import statsmodels.api as sm

sarima_model = sm.tsa.statespace.SARIMAX(train_cm['cm'],order=(2, 1, 1))
model_fit = sarima_model.fit()
train_cm['sforecast211'] = model_fit.predict(start=6000 , end=7110 )
train_cm[['cm','sforecast211']].plot(figsize=(20,10))

sARIMA ***111*** and ***211*** both gave promising results compared to ARIMA
* Next, the forecast error is computed for each model
* The best model will be applied to the test CSV

### Accuracy test

Mean Absolute Percentage Error (MAPE):

`mape = np.mean(np.abs(forecast - actual)/np.abs(actual))`
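The formula can be wrapped in a small helper and checked by hand on a few numbers. This is a minimal sketch; the function name `mape` and the sample values are made up for the example.

```python
import numpy as np

def mape(forecast, actual):
    """Mean absolute percentage error, returned as a fraction (0.1 means 10%)."""
    forecast = np.asarray(forecast, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.mean(np.abs(forecast - actual) / np.abs(actual))

actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

# Per-point errors are 10%, 10% and 0%, so the mean is 0.2 / 3.
score = mape(forecast, actual)
print(score)
```

Since the error is divided by the actual value, the measure is undefined wherever the actual value is zero.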

In [None]:
train_cm.columns

In [None]:
# All three models are scored on the same window (rows 6000 to 7110)
actual = train_cm['cm'].iloc[6000:7111]

def window_mape(column):
    forecast = train_cm[column].iloc[6000:7111]
    return np.mean(np.abs(forecast - actual) / np.abs(actual))

model_111 = window_mape('forecast111')
model_s111 = window_mape('sforecast111')
model_s211 = window_mape('sforecast211')

print(' error - 111',model_111, '\n error - s111', model_s111, '\n error - s211', model_s211)

The lowest error is observed for the sARIMA 111 model.
* *There is still room for improvement if the exogenous variables are used in a multivariate time series analysis*

### Time to predict 

In [None]:
sub = pd.read_csv('../input/tabular-playground-series-jul-2021/sample_submission.csv')

# keep only the date_time column of the sample submission

cols = ['target_carbon_monoxide','target_benzene','target_nitrogen_oxides']
sub.drop(cols, axis = 1, inplace = True)
sub

In [None]:
test = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv')

# drop the exogenous columns for the univariate analysis

cols = ['deg_C', 'relative_humidity', 'absolute_humidity',
       'sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5']
test.drop(cols, axis = 1, inplace = True)
test

In [None]:
# cm - sarima 111
import statsmodels.api as sm

sarima_model = sm.tsa.statespace.SARIMAX(train_u['cm'],order=(1, 1, 1))
model_fit = sarima_model.fit(full_output = True)
pre_cm = model_fit.predict(start = pd.to_datetime('2011-01-01 00:00:00'), 
                           end=pd.to_datetime('2011-04-04 14:00:00'), dynamic = False)

In [None]:
pre_cm

In [None]:
# ben - sarima 111
import statsmodels.api as sm

sarima_model = sm.tsa.statespace.SARIMAX(train_u['ben'],order=(1, 1, 1))
model_fit = sarima_model.fit()
pre_ben = model_fit.predict(start = pd.to_datetime('2011-01-01 00:00:00'), 
                           end=pd.to_datetime('2011-04-04 14:00:00'), dynamic = False)

In [None]:
pre_ben

In [None]:
# no - sarima 111
import statsmodels.api as sm

sarima_model = sm.tsa.statespace.SARIMAX(train_u['no'],order=(1, 1, 1))
model_fit = sarima_model.fit()
pre_no = model_fit.predict(start = pd.to_datetime('2011-01-01 00:00:00'), 
                           end=pd.to_datetime('2011-04-04 14:00:00'), dynamic = False)

In [None]:
pre_no

In [None]:
final = pd.DataFrame({'cm':pre_cm, 'ben':pre_ben, 'no':pre_no}).reset_index()
final.columns = ['date_time', 'target_carbon_monoxide','target_benzene', 'target_nitrogen_oxides']
final

In [None]:
final.to_csv("Submission.csv", index = False)
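Before submitting, it can be worth checking that the forecast covers every hour of the prediction window exactly once. The sketch below builds the expected hourly timestamps for the start and end dates used above and compares a stand-in frame against them; `demo` and its zero-filled column are placeholders, not real predictions.

```python
import numpy as np
import pandas as pd

start = pd.Timestamp("2011-01-01 00:00:00")
end = pd.Timestamp("2011-04-04 14:00:00")

# Number of hourly timestamps in the window, both ends included.
n_hours = (end - start) // pd.Timedelta(hours=1) + 1
expected = start + pd.to_timedelta(np.arange(n_hours), unit="h")

# Stand-in for the forecast frame; zeros replace real predictions.
demo = pd.DataFrame({
    "date_time": expected,
    "target_carbon_monoxide": np.zeros(n_hours),
})

covers_window = (
    len(demo) == n_hours
    and demo["date_time"].is_unique
    and demo["date_time"].iloc[0] == start
    and demo["date_time"].iloc[-1] == end
)
print(n_hours, covers_window)
```

On the real output, the same comparison could be made between the `date_time` column of `final` and the `date_time` column of `test`.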