# Time Series data forecasting

# Importing Libraries

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math

import statsmodels.api as sm
from statsmodels.tsa.api import VAR #Vector AutoRegression

from statsmodels.tsa.stattools import adfuller #for the Augmented Dickey-Fuller test
from sklearn.metrics import mean_squared_error #for calculating the performance metric

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



# Data Preprocessing

In [1]:
df_train=pd.read_csv("../input/into-the-future/train.csv")
df_train.head()

In [1]:
df1=df_train[['time','feature_1','feature_2']].copy() #copy, so that later in-place changes do not act on a slice of df_train
df1.head()

In [1]:
df1.set_index('time', inplace=True)
df1.plot(figsize=(12,8))

Checking the info to make sure there are no missing values.

In [1]:
df1.info()

Splitting the train data for fitting the model: the first 85% of the rows for training and the remaining 15% held out for validation.

In [1]:
df1_train=df1[:int(0.85*len(df1))]
df1_test=df1[int(0.85*len(df1)):]
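
Because the rows are ordered in time, the split is a positional cut rather than a shuffled sample, so the model never sees observations that come after the ones it is evaluated on. A minimal sketch of the same idea on a made-up frame (the helper name `chronological_split` and the toy data are assumptions for illustration, not part of the notebook):

```python
import pandas as pd

def chronological_split(frame, train_fraction=0.85):
    """Split a time-ordered frame into an early part and a later part (no shuffling)."""
    cut = int(train_fraction * len(frame))
    return frame.iloc[:cut], frame.iloc[cut:]

# Toy data standing in for the real series
toy = pd.DataFrame({"feature_1": range(20), "feature_2": range(100, 120)})
toy_train, toy_holdout = chronological_split(toy)

print(len(toy_train), len(toy_holdout))
```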

The data looks *stationary*, but it is better to verify this, since the model may make poor predictions on *non-stationary data*.

I'll perform the ***Augmented Dickey-Fuller test***, a hypothesis test where:
* H0 : the series is non-stationary
* H1 : the series is stationary


In [1]:
def adfuller_test(data):
    result=adfuller(data,autolag='AIC')
    labels=["ADF Test Statistic","p-value","#Lags used","#Observations"]
    #zip stops after the four labels, so the remaining entries of result are not printed
    for val,lab in zip(result,labels):
        print(lab+":"+str(val))
    #taking the significance level as 0.05
    if result[1]>0.05 :
        print("Series is not stationary") #fail to reject the null hypothesis
    else:
        print("Series is stationary") #reject the null hypothesis
        

print("FEATURE 1 ")
adfuller_test(df1_train['feature_1'])
print("FEATURE 2 ")
adfuller_test(df1_train['feature_2'])

As we can see, both *feature_1* and *feature_2* are stationary, so we can proceed with training the model.

# Training and Testing Model

In [1]:
model = VAR(df1_train)
results = model.fit(maxlags=15, ic='aic')
results.summary()

Forecasting over the validation period and converting the result to a DataFrame for further use.

In [1]:
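predicted = results.forecast(results.endog, steps=len(df1_test)) #start the forecast from the end of the fitted data

labels=['feature_1','feature_2']
predicted=pd.DataFrame(predicted, columns=labels, index=df1_test.index) #reuse the time index of the held-out rows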
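
To make the multi-step forecast less of a black box: a VAR forecast is built recursively, with each predicted vector fed back in as the input for the next step. The sketch below fits a VAR with a single lag by ordinary least squares in NumPy on synthetic data and rolls it forward. It is a simplified illustration with the lag fixed at 1, not the statsmodels implementation, and every name in it is hypothetical.

```python
import numpy as np

def fit_var1(series):
    """Least-squares fit of y_t = c + A @ y_{t-1}; returns (c, A)."""
    lagged = np.hstack([np.ones((len(series) - 1, 1)), series[:-1]])
    coef, *_ = np.linalg.lstsq(lagged, series[1:], rcond=None)
    # Row 0 holds the intercepts; the remaining rows are A transposed
    return coef[0], coef[1:].T

def forecast_var1(c, A, last_obs, steps):
    """Recursive forecast: each prediction becomes the input for the next step."""
    out = np.empty((steps, len(last_obs)))
    current = np.asarray(last_obs, dtype=float)
    for k in range(steps):
        current = c + A @ current
        out[k] = current
    return out

# Synthetic stationary two-variable series (stand-in for feature_1 / feature_2)
rng = np.random.default_rng(1)
true_A = np.array([[0.5, 0.1], [0.0, 0.3]])
y = np.zeros((300, 2))
for t in range(1, 300):
    y[t] = true_A @ y[t - 1] + rng.normal(size=2)

c_hat, A_hat = fit_var1(y)
forecast = forecast_var1(c_hat, A_hat, y[-1], steps=5)
print(forecast.shape)
```

Because every step builds on the previous prediction, errors accumulate, which is why long horizons tend to drift toward the series mean.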

Checking that the predicted DataFrame and the df1_test DataFrame have the same shape.

In [1]:
print(predicted.shape)
df1_test.shape

# Measuring the Performance

Root Mean Squared Error (RMSE) for *feature_1* and *feature_2*

In [1]:
for i in labels:
    print('rmse for '+i+' is : '+str(math.sqrt(mean_squared_error(df1_test[i],predicted[i])))) #arguments are (actual, predicted)

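Plotting the actual and predicted values of *feature_2*

In [1]:
#plot by position so both curves share the same x axis
plt.plot(df1_test['feature_2'].values, label='actual')
plt.plot(predicted['feature_2'].values, label='predicted')
plt.legend()
plt.show()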

The RMSE on the held-out validation rows is **8.057180950988691** for *feature_1* and **242.59720453616242** for *feature_2*. I consider this accurate enough to proceed to the final predictions of *feature_2* for the test data.
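
RMSE is expressed in the units of each series, so the two values are not directly comparable unless the features share a scale. One way to put them on a common footing is to divide by the range of the actual values. A minimal sketch with made-up numbers (the function names and the choice of range normalization are assumptions, not something used in this analysis):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def range_normalized_rmse(actual, predicted):
    """RMSE divided by the range (max - min) of the actual values."""
    actual = np.asarray(actual, dtype=float)
    return rmse(actual, predicted) / float(actual.max() - actual.min())

# Made-up values, for illustration only
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 123.0, 129.0]

print(rmse(actual, predicted))
print(range_normalized_rmse(actual, predicted))
```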

# Final Prediction and Submission

Loading test.csv

In [1]:
df_test=pd.read_csv('../input/into-the-future/test.csv')
print(df_test.head())
df_test.shape

Performing the same data processing steps as for the train dataset.

In [1]:
data_test=df_test[['time','feature_1']].copy() #copy, so that set_index does not act on a slice of df_test
data_test.set_index('time', inplace=True)
data_test.head()

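Forecasting for the final test data. The model was fitted on the first 85% of the train data only, so the forecast has to run through the held-out period before it reaches test.csv; the number of steps is therefore the length of the held-out set plus the length of the test set.

In [1]:
final_prediction = results.forecast(results.endog, steps=len(data_test)+len(df1_test))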
final_prediction.shape

Converting to DataFrame for further computations.

In [1]:
final_prediction1 = pd.DataFrame(final_prediction,columns=['feature_1','feature_2'],index=range(len(df1_test), len(df1_test)+len(final_prediction), 1))
print(final_prediction1.shape)

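#skip the first len(df1_test) rows, which cover the held-out period; copy so columns can be added later
final_prediction2=final_prediction1[len(df1_test):].copy()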
final_prediction2.shape
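
The bookkeeping here is easy to get wrong by one window, so it may help to see it in isolation. The sketch below uses placeholder lengths and a dummy array in place of the real forecast; the numbers and names are illustrative and are not the actual dataset sizes.

```python
import numpy as np

n_holdout = 3   # rows held out from the train data (placeholder)
n_test = 4      # rows in the test data (placeholder)

# Stand-in for a forecast array: one row per step, one column per feature
full_forecast = np.arange((n_holdout + n_test) * 2).reshape(-1, 2)

# The first n_holdout rows overlap the held-out period;
# the remaining rows line up with the test rows
test_forecast = full_forecast[n_holdout:]
print(test_forecast.shape)
```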

Checking the RMSE for *feature_1* as a sanity check against overfitting, and plotting the predicted values of *feature_1* alongside the actual values of *feature_1* from test.csv

In [1]:
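#RMSE of the feature_1 forecast against the actual feature_1 in test.csv
print('rmse for feature_1 is : '+str(math.sqrt(mean_squared_error(data_test['feature_1'],final_prediction2['feature_1']))))

#plot by position, since the two frames carry different indexes
plt.plot(data_test['feature_1'].values, label='actual')
plt.plot(final_prediction2['feature_1'].values, label='predicted')
plt.legend()
plt.show()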

The RMSE for *feature_1* on test.csv is **33.40936553566021**, and the plot shows that the forecast follows the actual values reasonably well. I take this as indirect evidence that the *feature_2* forecast, which cannot be checked here, is of similar quality.

So, moving ahead with the submission.

In [1]:
final_prediction2['id'] = range(564, 564+len(final_prediction2), 1) #submission ids start at 564
final_prediction2.set_index('id',inplace=True)
final_sol =final_prediction2.drop(columns=['feature_1']) #keep only the feature_2 forecast

final_sol

In [1]:
final_sol.to_csv('Final_Solution.csv')

# Conclusion

I tested different types of models and different parameters, and this model gave the best predictions, with a comparatively better RMSE score.

Thank you for this opportunity; solving this was a great learning experience for me.
<br></br>I'll be waiting for the feedback.
<br></br>Thank You.