# Bullet Train

This time we are helping out SOV Investors with your data hacking skills. They are considering an investment in a new form of transportation, BulletTrain. BulletTrain uses jet propulsion technology to run on rails and move people at high speed! While BulletTrain has mastered the technology and holds the patent for its product, the investment would only make sense if they can get more than 1 million monthly users within the next 18 months.
 
You need to help SOV Ventures with the decision. They usually invest in B2C start-ups less than 4 years old that are looking for pre-series A funding. To support their decision, you need to forecast the traffic on BulletTrain for the next 7 months. You are provided with BulletTrain's traffic data since inception in the test file.

# Hypothesis Generation 
The first step is hypothesis generation: listing all the possible factors that can affect the outcome.<br>
Hypothesis generation is done before looking at the data, to avoid any bias that may result from observing it.<br>
It helps us point out the factors which might affect our dependent variable. Below are some of the hypotheses which I think can affect the passenger count (the dependent variable for this time series problem) on the BulletTrain:<br>
1.	There will be an increase in the traffic as the years pass by.<br>
•	Explanation - Population has a general upward trend with time, so I can expect more people to travel by BulletTrain. Also, generally companies expand their businesses over time leading to more customers travelling through BulletTrain.<br>
2.	The traffic will be high from May to October.<br>
•	Explanation - Tourist visits generally increase during this time period.<br>
3.	Traffic on weekdays will be more as compared to weekends/holidays.<br>
•	Explanation - People will go to office on weekdays and hence the traffic will be more. <br>
4.	Traffic during the peak hours will be high.<br>
•	Explanation - People will travel to work or college.<br>
We will try to validate each of these hypotheses based on the dataset. Now let’s have a look at the dataset.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats, integrate
from sklearn.model_selection import train_test_split
from sklearn import metrics
from statsmodels.tsa.api import ExponentialSmoothing, SimpleExpSmoothing, Holt 
%matplotlib inline
pd.options.display.float_format = '{:.2f}'.format
from sklearn.linear_model import LinearRegression
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 14

In [None]:
# Importing the train and test datasets


train = pd.read_csv("../input/Train.csv")
test = pd.read_csv("../input/Test.csv")

In [None]:
# Making copy of dataset

train_original=train.copy() 
test_original=test.copy()

In [None]:
train_original.shape, test_original.shape

In [None]:
print(train_original.head())
print (test_original.head())

In [None]:
train.info(), test.info()

•	ID and Count are in integer format while the Datetime is in object format for the train file.<br>
•	ID is in integer and Datetime is in object format for test file.


# Feature Extraction

First, extract the date and time from Datetime. We saw earlier that the data type of Datetime is object, so we first convert it to datetime format; otherwise we cannot extract features from it.

In [None]:
# Parse the Datetime strings (day-month-year hour:minute) into datetime64 values.
# The same explicit format is used for all four frames.
for frame in (train, test, test_original, train_original):
    frame['Datetime'] = pd.to_datetime(frame['Datetime'], format='%d-%m-%Y %H:%M')


In [None]:
train_original.head()

<b>To validate our hypotheses, we extract the year, month, day and hour from Datetime, <br>
 since the hypotheses concern the effect of hour, day, month and year on the passenger count.</b>

In [None]:
for i in (train, test, test_original, train_original):
    i['year']=i.Datetime.dt.year 
    i['month']=i.Datetime.dt.month 
    i['day']=i.Datetime.dt.day
    i['Hour']=i.Datetime.dt.hour 

One hypothesis concerns the traffic pattern on weekdays and weekends. So, a weekend variable is generated to visualize the impact of the weekend on traffic.<br>
• First extract the day of week from Datetime and then, based on the values, assign whether the day is a weekend or not.<br>
• Values of 5 and 6 represent weekend days.

In [None]:
train['day of week']=train['Datetime'].dt.dayofweek 
temp = train['Datetime']

In [None]:
temp.head()

Assign 1 if the day of week is a weekend and 0 if the day of week is not a weekend.

In [None]:
def applyer(row):
    if row.dayofweek == 5 or row.dayofweek == 6:
        return 1
    else:
        return 0 
temp2 = train['Datetime'].apply(applyer) 
train['weekend']=temp2
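The row-wise `apply` above can also be written as a vectorized expression. The following is a minimal sketch on a small hypothetical frame (the `sample` data is made up for illustration); it relies only on the fact that `dayofweek` numbers Monday as 0 and Sunday as 6.

```python
import pandas as pd

# Hypothetical sample: Friday 2012-08-24 through Monday 2012-08-27
sample = pd.DataFrame({"Datetime": pd.to_datetime(
    ["2012-08-24 10:00", "2012-08-25 10:00",
     "2012-08-26 10:00", "2012-08-27 10:00"])})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 mark the weekend
sample["weekend"] = sample["Datetime"].dt.dayofweek.isin([5, 6]).astype(int)
print(sample)
```

On the full dataset this form would avoid calling a Python function once per row, which usually matters for hourly data spanning several years.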

In [None]:
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()  # let matplotlib handle pandas datetime values on the axis

train.index = train['Datetime'] # indexing the Datetime to get the time period on the x-axis. 
df=train.drop('ID', axis=1)     # drop ID variable to get only the Datetime on x-axis. 
ts = df['Count'] 
plt.figure(figsize=(16,8)) 
plt.plot(ts, label='Passenger Count') 
plt.title('Time Series') 
plt.xlabel("Time(year-month)") 
plt.ylabel("Passenger count") 
plt.legend(loc='best')

# Recalling the hypotheses that we made earlier:

Traffic will increase as the years pass by <br>
Traffic will be high from May to October <br>
Traffic on weekdays will be more <br>
Traffic during the peak hours will be high <br>

# Exploratory Data Analysis

Let us try to verify our hypotheses using the actual data.<br>

Our first hypothesis was traffic will increase as the years pass by. So let’s look at yearly passenger count. <br>

In [None]:
ts.head()

In [None]:
# df is the train data with the ID column dropped

df.tail()

In [None]:
# different way of plotting passenger count for training dataset. 

train.Count.plot(figsize=(16, 8))
plt.title('Time Series') 
plt.xlabel("Time(year-month)") 
plt.ylabel("Passenger count") 

In [None]:
train.groupby('year')['Count'].mean().plot.bar(fontsize=14,figsize=(10,7),title='Yearly Passenger Count')

Exponential growth is noticed year by year, which validates our first hypothesis.

**The second hypothesis was increase in traffic from May to October. So, let’s see the relation between count and month.**

In [None]:
train.groupby('month')['Count'].mean().plot.bar(fontsize=14,figsize=(10,7), title='Monthly Passenger Count')

Here we see a decline in passenger count in the last three months, which seems to contradict our first hypothesis. So let's look at the monthly mean for each year. 

In [None]:
temp=train.groupby(['year', 'month'])['Count'].mean() 
temp.plot(figsize=(15,5), title= 'Passenger Count(Year& Month)', fontsize=14)

It is visible that October, November and December have a very low mean value in 2012, and the values for these months are not present in 2014.<br> 

Since there is an increasing trend in our time series, the mean value for the rest of the months is higher because of their larger passenger counts in 2014. Therefore, we get a smaller value for these 3 months.<br>

In the above line plot we can see an increasing trend in the monthly passenger count, and the growth is approximately exponential.<br>



In [None]:
train.groupby('day')['Count'].mean().plot.bar(fontsize=14,figsize=(10,7),title='Daily_PassengerCount')

From the daily passenger count, we are unable to gather much insight. So, it's time to look at the mean hourly passenger count, which addresses the hypothesis that the traffic will be higher during peak hours.

In [None]:
train.groupby('Hour')['Count'].mean().plot.bar(color='m', figsize=(10,7),fontsize=14,title='Hourly_PassengerCount')

It can be inferred that the peak traffic in the evening is at 7 PM. Then a decreasing trend is noticed till 5 AM.
After that the passenger count starts increasing again and peaks again between 11AM and 12 Noon.

**Next, we validate our hypothesis that the traffic will be higher on weekdays.**

In [None]:
train.groupby('weekend')['Count'].mean().plot.bar(fontsize=14,figsize=(10,7),title='Weekend_PassengerCount')

From the above graph, we can infer that the traffic is higher on weekdays than on weekends, which validates the hypothesis that the traffic will be higher on weekdays. <br>
Now, for the day-of-week passenger count, where 0 is Monday and 6 is Sunday.

In [None]:
train.groupby('day of week')['Count'].mean().plot.bar(fontsize=14,figsize=(10,7), title='Day of week_PassengerCount')

# Basic modeling techniques
    
Drop the ID variable as it has nothing to do with the passenger count.

In [None]:
train=train.drop('ID', axis=1)

The hourly time series is very noisy. So, we aggregate it to daily, weekly and monthly time series to reduce the noise and make it more stable, and hence easier for a model to learn.

In [None]:
# Datetime was already parsed earlier, so it can be used directly as the index
train.index = train['Datetime']
# Hourly time series 
hourly = train.resample('H').mean() 
# Converting to daily mean 
daily = train.resample('D').mean() 
# Converting to weekly mean 
weekly = train.resample('W').mean() 
# Converting to monthly mean 
monthly = train.resample('M').mean()

In [None]:
fig, axs = plt.subplots(4,1) 
hourly.Count.plot(figsize=(15,8), title= 'Hourly', fontsize=14, ax=axs[0]) 
daily.Count.plot(figsize=(15,8), title= 'Daily', fontsize=14, ax=axs[1])
weekly.Count.plot(figsize=(15,8), title= 'Weekly', fontsize=14, ax=axs[2]) 
monthly.Count.plot(figsize=(15,8), title= 'Monthly', fontsize=14, ax=axs[3]) 

From the graph, it is visible that the time series becomes more and more stable as we aggregate it on a daily, weekly and monthly basis.<br>

But it would be difficult to convert the monthly and weekly predictions to hourly predictions: we would first have to convert the monthly predictions to weekly, weekly to daily and daily to hourly, which becomes a very lengthy process. So, we will work on the daily time series.

In [None]:
train.shape, test.shape

In [None]:
# Datetime was already parsed earlier, so it can be used directly as the index
test.index = test['Datetime']
# Converting to daily mean 
test = test.resample('D').mean() 

In [None]:

# Datetime was already parsed earlier, so it can be used directly as the index
train.index = train['Datetime']
# Converting to daily mean 
train = train.resample('D').mean()

In [None]:
Train=train.loc['2012-08-25':'2014-06-24'] 
valid=train.loc['2014-06-25':'2014-09-25']

In [None]:
Train.Count.plot(figsize=(15,8), title= 'Daily Ridership', fontsize=14, label='train') 
valid.Count.plot(figsize=(15,8), title= 'Daily Ridership', fontsize=14, label='valid') 
plt.xlabel("Datetime") 
plt.ylabel("Passenger count") 
plt.legend(loc='best') 
plt.show()

Here, we predict the traffic for the validation part and then visualize how accurate our predictions are. Finally, we will make predictions for the test dataset.

We will consider the following methods for forecasting the time series:<br>
i) Naive Approach<br>
ii) Moving Average<br>
iii) Simple Exponential Smoothing<br>
iv) Holt’s Linear Trend Model



# Naive Approach 

In this forecasting technique, we assume that the next expected point is equal to the last observed point. So we can expect a straight horizontal line as the prediction.

In [None]:
dd= np.asarray(Train.Count) 
y_hat = valid.copy() 
y_hat['naive'] = dd[len(dd)-1] 
plt.figure(figsize=(12,8)) 
plt.plot(Train.index, Train['Count'], label='Train') 
plt.plot(valid.index,valid['Count'], label='Valid') 
plt.plot(y_hat.index,y_hat['naive'], label='Naive Forecast') 
plt.legend(loc='best') 
plt.title("Naive Forecast") 
plt.show()

Because the naive approach takes the next expected point to be equal to the last observed point, the predicted values form a straight horizontal line. This is what we see in the graph above.<br>

To check how accurate our predictions are, we use the rmse (Root Mean Square Error).<br>
rmse is the standard deviation of the residuals.<br>
Residuals measure how far the predicted values are from the actual values.<br>
The formula for rmse is <br>
$$\text{rmse} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(p_i - a_i)^2}$$
where $p_i$ is the predicted value, $a_i$ is the actual value and $N$ is the number of observations.
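To make the rmse definition concrete, here is a small hand-rolled sketch in plain Python. The `rmse` helper and the `predicted` and `actual` values are hypothetical and exist only for illustration; the notebook itself computes the metric with `sklearn.metrics.mean_squared_error`.

```python
from math import sqrt

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values, for illustration only
predicted = [10.0, 12.0, 14.0]
actual = [11.0, 12.0, 17.0]
print(rmse(predicted, actual))
```

Here the squared errors are 1, 0 and 9, so the result is the square root of 10/3. Note how the single large error dominates the score.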

In [None]:
# calculating RMSE to check the accuracy of our model on validation data set.

from sklearn.metrics import mean_squared_error 
from math import sqrt 
rms = sqrt(mean_squared_error(valid.Count, y_hat.naive)) 
print(rms)

We can infer that this method is not suitable for datasets with high variability. But we can reduce the rmse value by adopting different techniques.<br>

# Moving Average

In this technique we will take the average of the passenger counts for last few time periods only.


In [None]:
# Considering rolling mean for last 10, 20, 50 days and visualize the results.

y_hat_avg = valid.copy() 
y_hat_avg['moving_avg_forecast'] = Train['Count'].rolling(10).mean().iloc[-1] # average of last 10 observations. 
plt.figure(figsize=(15,5)) 
plt.plot(Train['Count'], label='Train') 
plt.plot(valid['Count'], label='Valid') 
plt.plot(y_hat_avg['moving_avg_forecast'], label='Moving Average Forecast using 10 observations') 
plt.legend(loc='best') 
plt.show() 
y_hat_avg = valid.copy() 
y_hat_avg['moving_avg_forecast'] = Train['Count'].rolling(20).mean().iloc[-1] # average of last 20 observations. 
plt.figure(figsize=(15,5)) 
plt.plot(Train['Count'], label='Train') 
plt.plot(valid['Count'], label='Valid') 
plt.plot(y_hat_avg['moving_avg_forecast'], label='Moving Average Forecast using 20 observations') 
plt.legend(loc='best') 
plt.show() 
y_hat_avg = valid.copy() 
y_hat_avg['moving_avg_forecast'] = Train['Count'].rolling(50).mean().iloc[-1] # average of last 50 observations. 
plt.figure(figsize=(15,5)) 
plt.plot(Train['Count'], label='Train') 
plt.plot(valid['Count'], label='Valid') 
plt.plot(y_hat_avg['moving_avg_forecast'], label='Moving Average Forecast using 50 observations') 
plt.legend(loc='best') 
plt.show()

**The predictions get weaker as the number of observations used for the rolling mean increases.**

In [None]:
# RMSE value for Moving Average (y_hat_avg now holds the last forecast, i.e. the 50-observation window)

rms = sqrt(mean_squared_error(valid.Count, y_hat_avg.moving_avg_forecast)) 
print(rms)
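To compare the window sizes on equal terms, the rmse can be computed once per window. The sketch below does this in plain Python on a hypothetical upward-trending series standing in for `Train` and `valid`; the helper names are illustrative and not part of the notebook.

```python
from math import sqrt

def moving_average_forecast(history, window, horizon):
    """Flat forecast: the mean of the last `window` observations, repeated."""
    last_mean = sum(history[-window:]) / window
    return [last_mean] * horizon

def rmse(predicted, actual):
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical upward-trending series standing in for Train and valid
history = [float(v) for v in range(1, 101)]    # 1, 2, ..., 100
future = [float(v) for v in range(101, 111)]   # 101, ..., 110

# One rmse per window size
scores = {w: rmse(moving_average_forecast(history, w, len(future)), future)
          for w in (10, 20, 50)}
for window, score in scores.items():
    print(window, round(score, 2))
```

On a trending series like this one, a longer window averages in older, lower values, so the flat forecast sits further below the future values and the rmse grows with the window.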


# Simple Exponential Smoothing

In this technique, we assign larger weights to more recent observations than to observations from the distant past.<br>
The weights decrease exponentially as observations come from further in the past; the smallest weights are associated with the oldest observations.<br>

NOTE - If we give the entire weight to the last observed value only, this method will be similar to the naive approach. So, we can say that naive approach is also a simple exponential smoothing technique where the entire weight is given to the last observed value.
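The note above can be checked with a few lines of plain Python. This is a minimal sketch of the smoothing recursion, assuming the level is initialised to the first observation (statsmodels may initialise it differently); `ses_forecast` and the sample `series` are hypothetical.

```python
def ses_forecast(series, alpha):
    """One-step-ahead SES forecast; the level starts at the first observation."""
    level = series[0]
    for value in series[1:]:
        # New level = weighted mix of the latest observation and the old level
        level = alpha * value + (1 - alpha) * level
    return level

series = [3.0, 5.0, 4.0, 8.0]
print(ses_forecast(series, alpha=0.6))
print(ses_forecast(series, alpha=1.0))  # all weight on the last observation
```

With `alpha=1.0` the old level is discarded at every step, so the forecast is simply the last observed value, which is the naive approach.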

In [None]:
#Here the predictions are made by assigning larger weight to the recent values and lesser weight to the old values.

from statsmodels.tsa.api import ExponentialSmoothing, SimpleExpSmoothing, Holt 

y_hat_avg = valid.copy() 
fit2 = SimpleExpSmoothing(np.asarray(Train['Count'])).fit(smoothing_level=0.6,optimized=False) 
y_hat_avg['SES'] = fit2.forecast(len(valid)) 
plt.figure(figsize=(16,8)) 
plt.plot(Train['Count'], label='Train') 
plt.plot(valid['Count'], label='Valid') 
plt.plot(y_hat_avg['SES'], label='SES') 
plt.legend(loc='best') 
plt.show()

In [None]:
rms = sqrt(mean_squared_error(valid.Count, y_hat_avg.SES)) 
print(rms)

We can infer that the fit of the model has improved as the rmse value has reduced.


# Holt’s Linear Trend Model

- It is an extension of simple exponential smoothing to allow forecasting of data with a trend.<br>
- This method takes into account the trend of the dataset. The forecast function in this method is a function of level and trend.<br>
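As a rough illustration of how the forecast depends on level and trend, the following plain-Python sketch implements the additive Holt recursion. The initial values (level from the first point, trend from the first difference) are an assumption made for this example, and `holt_forecast` is a hypothetical helper, not the statsmodels implementation.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Holt's linear trend: smooth a level and a trend, then extrapolate."""
    level = series[0]
    trend = series[1] - series[0]   # simple initial trend (an assumption)
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # h-step-ahead forecast: level + h * trend
    return [level + h * trend for h in range(1, horizon + 1)]

series = [10.0, 12.0, 14.0, 16.0, 18.0]
forecast = holt_forecast(series, alpha=0.3, beta=0.1, horizon=3)
print(forecast)
```

Because the sample series is perfectly linear, the smoothed trend stays at 2 per step and the forecast continues the line, unlike the flat forecasts of the previous methods.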

First, let's visualize the trend, seasonality and error in the series by decomposing the time series into four parts.<br>

- Observed, which is the original time series.<br>
- Trend, which shows the trend in the time series, i.e., increasing or decreasing behaviour of the time series.<br>
- Seasonal, which tells us about the seasonality in the time series.<br>
- Residual, which is obtained by removing any trend or seasonality in the time series.<br>

In [None]:
import statsmodels.api as sm 
sm.tsa.seasonal_decompose(Train.Count).plot() 
result = sm.tsa.stattools.adfuller(train.Count) 
plt.show()

<b>An increasing trend can be seen in the dataset, so now we will make a model based on the trend. </b>

In [None]:
y_hat_avg = valid.copy() 
# Note: newer statsmodels releases name the `smoothing_slope` argument `smoothing_trend`
fit1 = Holt(np.asarray(Train['Count'])).fit(smoothing_level = 0.3,smoothing_slope = 0.1) 
y_hat_avg['Holt_linear'] = fit1.forecast(len(valid)) 
plt.figure(figsize=(16,8)) 
plt.plot(Train['Count'], label='Train') 
plt.plot(valid['Count'], label='Valid') 
plt.plot(y_hat_avg['Holt_linear'], label='Holt_linear') 
plt.legend(loc='best') 
plt.show()

The inclined line appears because the model has taken the trend of the time series into consideration.

In [None]:
# Calculating the RMSE of the model

rms = sqrt(mean_squared_error(valid.Count, y_hat_avg.Holt_linear)) 
print(rms)

In [None]:
y_hat_avg.Holt_linear.head()


In [None]:
valid.Count.shape, y_hat_avg.Holt_linear.shape

The rmse value has decreased further with Holt linear Trend Model.

**Now predicting the passenger count for the test dataset using various models.**

# Holt’s Linear Trend Model on daily time series - Test Dataset

- We now use Holt’s linear trend model on the daily time series to make predictions on the test dataset.
- We will make predictions based on the daily time series and then distribute each daily prediction into hourly predictions.
- We have fitted Holt’s linear trend model on the train dataset and validated it using the validation dataset.

In [None]:
# loading the submission file.
# `your_local_path` must be defined as the directory that contains submission.csv.


submission=pd.read_csv(your_local_path+"submission.csv")

Only the ID and the corresponding Count are needed for the final submission.

In [None]:
# Making prediction for the test dataset.

predict=fit1.forecast(len(test))

In [None]:
# saving these predictions in test file in a new column.

test['prediction']=predict

In [None]:
test.head()

Point to remember: these are daily predictions.<br>

We have to convert these predictions to hourly basis. 
* To do so we will first calculate the ratio of passenger count for each hour of every day. 
* Then we will find the average ratio of passenger count for every hour and we will get 24 ratios. 
* Then to calculate the hourly predictions we will multiply the daily prediction with the hourly ratio.
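The conversion can be sketched on a tiny hypothetical dataset. To keep the numbers readable, the example uses three "hours" per day instead of 24, and it takes each hour's share of the total traffic as the ratio, which is how this notebook computes it. All names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical hourly history: two days, three "hours" each, to keep it small
history = pd.DataFrame({
    "Hour":  [0, 1, 2, 0, 1, 2],
    "Count": [10, 30, 60, 20, 60, 120],
})

# Share of total traffic that falls in each hour of the day
hourly_ratio = history.groupby("Hour")["Count"].sum() / history["Count"].sum()

# A daily forecast expressed as the mean count per hour, as in the daily series
daily_mean_forecast = 50.0
hours_per_day = len(hourly_ratio)   # 24 for the real data
hourly_forecast = daily_mean_forecast * hourly_ratio * hours_per_day
print(hourly_forecast)
```

Multiplying by the number of hours turns the daily mean back into a daily total before it is split by the ratios, so the hourly forecasts average back to the daily mean.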

In [None]:
# Calculating the hourly ratio of count 

train_original ['ratio']=train_original['Count']/train_original['Count'].sum()
# ratio = Count column individual value / sum of Count column values

# Grouping the hourly ratio 
temp=train_original.groupby(['Hour'])['ratio'].sum() 

# Turn the grouped Series into a DataFrame with 'Hour' and 'ratio' columns
temp2 = temp.reset_index()

# Merge Test_df and test_original on day, month and year 
merge=pd.merge(test, test_original, on=('day','month', 'year'), how='left') 
merge['Hour']=merge['Hour_y'] 
merge=merge.drop(['year', 'month', 'Datetime','Hour_x','Hour_y'], axis=1) 
# Predicting by merging merge and temp2 
prediction=pd.merge(merge, temp2, on='Hour', how='left') 

# Converting the ratio to the original scale 
prediction['Count']=prediction['prediction']*prediction['ratio']*24 
prediction['ID']=prediction['ID_y']



In [None]:
temp.head()

In [None]:
temp2.head()

In [None]:
# Dropping all other features from the submission file and keep ID and Count only.

submission=prediction.drop(['ID_x', 'day', 'ID_y','prediction','Hour', 'ratio'],axis=1) 

In [None]:
# Converting the final submission to csv format 
pd.DataFrame(submission, columns=['ID','Count']).to_csv('Holt linear.csv')

In [None]:
submission.head(), submission.shape

In [None]:
test.head(), test.shape


In [None]:
y_hat_avg.Holt_linear.shape, test.prediction.shape

In [None]:
prediction.head(), prediction.shape

# Holt-Winters model on daily time series

Datasets which show a similar pattern after fixed intervals of time are said to exhibit seasonality.

The above models don’t take into account the seasonality of the dataset while forecasting. Hence we need a method that takes into account both trend and seasonality to forecast future values.

One such algorithm that we can use in such a scenario is the Holt-Winters method. The idea behind Holt-Winters is to apply exponential smoothing to the seasonal components in addition to level and trend.

Let’s first fit the model on training dataset and validate it using the validation dataset.


In [None]:

y_hat_avg = valid.copy() 
fit1 = ExponentialSmoothing(np.asarray(Train['Count']) ,seasonal_periods=7 ,trend='add', seasonal='add',).fit() 
y_hat_avg['Holt_Winter'] = fit1.forecast(len(valid)) 
plt.figure(figsize=(16,8)) 
plt.plot( Train['Count'], label='Train') 
plt.plot(valid['Count'], label='Valid') 
plt.plot(y_hat_avg['Holt_Winter'], label='Holt_Winter') 
plt.legend(loc='best') 
plt.show()


In [None]:
rms = sqrt(mean_squared_error(valid.Count, y_hat_avg.Holt_Winter)) 
print(rms)

We can see that the rmse value has reduced a lot from this method. Let’s forecast the Counts for the entire length of the Test dataset.


In [None]:

predict=fit1.forecast(len(test))


Now we will convert these daily passenger counts into hourly passenger counts using the same approach which we followed above.


In [None]:

test['prediction']=predict
# Merge Test and test_original on day, month and year 
merge=pd.merge(test, test_original, on=('day','month', 'year'), how='left') 
merge['Hour']=merge['Hour_y'] 
merge=merge.drop(['year', 'month', 'Datetime','Hour_x','Hour_y'], axis=1) 

# Predicting by merging merge and temp2 
prediction=pd.merge(merge, temp2, on='Hour', how='left') 

# Converting the ratio to the original scale
prediction['Count']=prediction['prediction']*prediction['ratio']*24


Let’s drop all features other than ID and Count

In [None]:
prediction['ID']=prediction['ID_y'] 
submission=prediction.drop(['day','Hour','ratio','prediction', 'ID_x', 'ID_y'],axis=1) 

# Converting the final submission to csv format 
pd.DataFrame(submission, columns=['ID','Count']).to_csv('Holt winters.csv')

So far we have built models for the trend and for seasonality. Let's now move to another model that takes both the trend and the seasonality of the time series into account.

Let's consider the ARIMA model for time series forecasting.

# Introduction to ARIMA model

ARIMA stands for AutoRegressive Integrated Moving Average. It is specified by three ordered parameters (p,d,q).<br>

Here p is the order of the autoregressive model(number of time lags)<br>
d is the degree of differencing(number of times the data have had past values subtracted)<br>
q is the order of moving average model. We will discuss more about these parameters in next section.<br>

The ARIMA forecasting for a stationary time series is nothing but a linear (like a linear regression) equation.<br>

# What is a stationary time series?
There are three basic criteria for a series to be classified as a stationary series:<br>

The mean of the time series should not be a function of time. It should be constant.<br>
The variance of the time series should not be a function of time.<br>
The covariance of the ith term and the (i+m)th term should not be a function of time.<br>
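The three criteria can be written compactly as follows. The symbols $x_t$, $\mu$, $\sigma^2$ and $\gamma(m)$ are notation introduced here for illustration and do not appear elsewhere in this notebook:

$$
\mathbb{E}[x_t] = \mu, \qquad \operatorname{Var}(x_t) = \sigma^2, \qquad \operatorname{Cov}(x_i, x_{i+m}) = \gamma(m) \quad \text{for all } t \text{ and } i.
$$

In words, the mean and variance are constants, and the covariance depends only on the gap $m$ between the two terms, not on where in the series they sit.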

# Why do we have to make the time series stationary?
We make the series stationary to make the variables independent. Variables can be dependent in various ways, but can only be independent in one way. So, we will get more information when they are independent. Hence the time series must be stationary.<br>

If the time series is not stationary, we first have to make it stationary. To do so, we need to remove the trend and seasonality from the data. <br>

# Parameter tuning for ARIMA model
First of all we have to make sure that the time series is stationary. If the series is not stationary, we will make it stationary.<br>

# Stationarity Check

We use the Dickey-Fuller test to check the stationarity of the series.<br>
The intuition behind this test is that it determines how strongly a time series is defined by a trend.<br>
The null hypothesis of the test is that the time series is not stationary (has some time-dependent structure).<br>
The alternate hypothesis (rejecting the null hypothesis) is that the time series is stationary.<br>

The test results comprise a Test Statistic and some Critical Values for different confidence levels. If the Test Statistic is less than the Critical Value, we reject the null hypothesis and say that the series is stationary; otherwise we fail to reject the null hypothesis and treat the series as non-stationary.<br>

Let’s make a function which we can use to calculate the results of Dickey-Fuller test.

In [None]:
from statsmodels.tsa.stattools import adfuller 
def test_stationarity(timeseries):
    # Determine rolling statistics
    rolmean = timeseries.rolling(window=24).mean()
    rolstd = timeseries.rolling(window=24).std()
        #Plot rolling statistics:
    orig = plt.plot(timeseries, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)
        #Perform Dickey-Fuller test:
    print ('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])

    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print (dfoutput)

from matplotlib.pylab import rcParams 
rcParams['figure.figsize'] = 20,10
test_stationarity(train_original['Count'])

The statistics show that the time series is stationary, since Test Statistic < Critical Value, but we can still see an increasing trend in the data. So, we will first try to make the data more stationary. To do so, we need to remove the trend and seasonality from the data.

# Removing Trend
A trend exists when there is a long-term increase or decrease in the data. It does not have to be linear.<br>

We see an increasing trend in the data so we can apply transformation which penalizes higher values more than smaller ones, for example log transformation.<br>

We will take a rolling average here to remove the trend. We use a window size of 24, chosen on the basis that each day has 24 hours; note that Train holds the daily series here, so the window actually spans 24 days.<br>

In [None]:
Train_log = np.log(Train['Count']) 
valid_log = np.log(valid['Count'])
moving_avg = Train_log.rolling(24).mean()
plt.plot(Train_log) 
plt.plot(moving_avg, color = 'red') 
plt.show()


An increasing trend is observed. To make the time series stationary, this increasing trend needs to be removed.


In [None]:
train_log_moving_avg_diff = Train_log - moving_avg

Since we took the average of 24 values, rolling mean is not defined for the first 23 values. So let’s drop those null values.

In [None]:
train_log_moving_avg_diff.dropna(inplace = True) 
test_stationarity(train_log_moving_avg_diff)

As we can see, the Test Statistic is much smaller than the Critical Value. So, we can be confident that the trend is almost removed.<br>

Let’s now stabilize the mean of the time series which is also a requirement for a stationary time series.<br>

Differencing can help to make the series stable and eliminate the trend.<br>

In [None]:
train_log_diff = Train_log - Train_log.shift(1) 
test_stationarity(train_log_diff.dropna())

Now we will decompose the time series into trend and seasonality and will get the residual which is the random variation in the series.

# Removing Seasonality
By seasonality, we mean periodic fluctuations. A seasonal pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the year, the month, or day of the week).<br>
Seasonality is always of a fixed and known period.<br>
We will use seasonal decompose to decompose the time series into trend, seasonality and residuals.

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose 
# freq=24 sets the seasonal period; newer statsmodels releases name this argument `period`
decomposition = seasonal_decompose(pd.DataFrame(Train_log).Count.values, freq = 24) 

trend = decomposition.trend 
seasonal = decomposition.seasonal 
residual = decomposition.resid 

plt.subplot(411) 
plt.plot(Train_log, label='Original') 
plt.legend(loc='best') 
plt.subplot(412) 
plt.plot(trend, label='Trend') 
plt.legend(loc='best') 
plt.subplot(413) 
plt.plot(seasonal,label='Seasonality') 
plt.legend(loc='best') 
plt.subplot(414) 
plt.plot(residual, label='Residuals') 
plt.legend(loc='best') 
plt.tight_layout() 
plt.show()


We can see the trend, residuals and seasonality clearly in the above graph. The seasonal component shows a constant, repeating pattern in the count.

Let’s check stationarity of residuals.



In [None]:
train_log_decompose = pd.DataFrame(residual) 
train_log_decompose['date'] = Train_log.index 
train_log_decompose.set_index('date', inplace = True) 
train_log_decompose.dropna(inplace=True) 
test_stationarity(train_log_decompose[0])

It can be interpreted from the results that the residuals are stationary.

Now we will forecast the time series using different models.

# Forecasting the time series using ARIMA
First of all we will fit the ARIMA model on our time series. For that, we have to find the optimized values for the p, d, q parameters.

To find the optimized values of these parameters, we will use ACF(Autocorrelation Function) and PACF(Partial Autocorrelation Function) graph.

ACF is a measure of the correlation between the TimeSeries with a lagged version of itself.

PACF measures the correlation between the TimeSeries with a lagged version of itself but after eliminating the variations already explained by the intervening comparisons.

In [None]:
from statsmodels.tsa.stattools import acf, pacf 
lag_acf = acf(train_log_diff.dropna(), nlags=25) 
lag_pacf = pacf(train_log_diff.dropna(), nlags=25, method='ols')

# ACF and PACF plot


In [None]:
plt.plot(lag_acf) 
plt.axhline(y=0,linestyle='--',color='gray') 
plt.axhline(y=-1.96/np.sqrt(len(train_log_diff.dropna())),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(train_log_diff.dropna())),linestyle='--',color='gray') 
plt.title('Autocorrelation Function') 
plt.show() 
plt.plot(lag_pacf) 
plt.axhline(y=0,linestyle='--',color='gray') 
plt.axhline(y=-1.96/np.sqrt(len(train_log_diff.dropna())),linestyle='--',color='gray') 
plt.axhline(y=1.96/np.sqrt(len(train_log_diff.dropna())),linestyle='--',color='gray') 
plt.title('Partial Autocorrelation Function') 
plt.show()

p is the lag at which the PACF plot crosses the upper confidence bound for the first time. In this case p=1.

q is the lag at which the ACF plot crosses the upper confidence bound for the first time. In this case q=1.
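One way to read this rule programmatically is sketched below: walk through the lags and return the first one whose value has dropped below the upper bound 1.96/sqrt(n) drawn in the plots. The helper name `first_lag_inside_bound` and the toy PACF values are hypothetical, and reading the plot by eye remains the method used in this notebook.

```python
import numpy as np

def first_lag_inside_bound(values, n_obs):
    """Return the first lag whose value is below the upper 95% bound, or None."""
    upper = 1.96 / np.sqrt(n_obs)
    for lag, value in enumerate(values):
        if value < upper:
            return lag
    return None

# Made-up PACF values for illustration, as if computed from 100 observations
toy_pacf = [1.0, 0.5, 0.05, -0.1]
p_candidate = first_lag_inside_bound(toy_pacf, n_obs=100)
print(p_candidate)
```

The same helper could be applied to ACF values to get a candidate for q. Treat the result as a starting point and compare a few neighbouring orders on the validation set.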

Now that we have the p and q values, we can build the ARIMA model. We will build the AR and MA models separately and then combine them.

# AR model
The autoregressive model specifies that the output variable depends linearly on its own previous values.


A nonseasonal ARIMA model is classified as an "ARIMA(p,d,q)" model, where:

p is the number of autoregressive terms,<br>
d is the number of nonseasonal differences needed for stationarity, and<br>
q is the number of lagged forecast errors in the prediction equation.
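To illustrate what "depends linearly on its own previous values" means, the sketch below simulates an AR(1) process and recovers its coefficient by least squares. The coefficient 0.6, the series length and the random seed are arbitrary choices for the example, not values from the BulletTrain data.

```python
import numpy as np

rng = np.random.default_rng(0)
phi_true = 0.6          # hypothetical AR(1) coefficient
n = 2000

# Simulate x[t] = phi * x[t-1] + noise
x = np.zeros(n)
noise = rng.normal(size=n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + noise[t]

# Least-squares estimate of phi from regressing x[t] on x[t-1]
phi_hat = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])
print(round(phi_hat, 3))
```

The estimate lands close to the true coefficient, which is essentially what the AR part of an ARIMA fit does for each of its p lags.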

In [None]:
# Legacy ARIMA API: this module was removed in statsmodels 0.13, so an older version is required
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(Train_log, order=(2, 1, 0))  # here the q value is zero since it is just the AR model 
results_AR = model.fit(disp=-1)  
plt.plot(train_log_diff.dropna(), label='original') 
plt.plot(results_AR.fittedvalues, color='red', label='predictions') 
plt.legend(loc='best') 
plt.show()

Let’s plot the validation curve for the AR model.

The model predicts differences of the log series, so we have to convert its predictions back to the original scale.

The first step is to store the predicted results as a separate series and observe it.
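The idea behind the scale change can be checked on a tiny example: differencing a log series and then applying a cumulative sum, adding back the first log value and exponentiating should return the original counts. The sketch below uses made-up counts and is not the notebook's exact code.

```python
import numpy as np
import pandas as pd

# Toy daily counts (made-up values, not the BulletTrain data)
counts = pd.Series([10.0, 12.0, 15.0, 14.0, 20.0],
                   index=pd.date_range("2014-06-25", periods=5, freq="D"))

log_counts = np.log(counts)
log_diff = log_counts.diff().dropna()      # what a d=1 model works with

# Undo the differencing: cumulative sum of the changes plus the first log value
restored_log = log_counts.iloc[0] + log_diff.cumsum()
restored_log = restored_log.reindex(counts.index).fillna(log_counts.iloc[0])

restored = np.exp(restored_log)            # undo the log transform
print(restored.round(2).tolist())
```

With model predictions in place of the true differences, the same steps give the forecast on the original scale.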



In [None]:
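# Predicted differences of the log series over the validation period
AR_predict=results_AR.predict(start="2014-06-25", end="2014-09-25") 
# Cumulative sum turns the predicted differences into cumulative log changes
AR_predict=AR_predict.cumsum().shift().fillna(0) 
# Start from the first log count of the validation set and add the changes
AR_predict1=pd.Series(np.ones(valid.shape[0]) * np.log(valid['Count'].iloc[0]), index = valid.index) 
AR_predict1=AR_predict1.add(AR_predict,fill_value=0) 
# Undo the log transform
AR_predict = np.exp(AR_predict1)
plt.plot(valid['Count'], label = "Valid") 
plt.plot(AR_predict, color = 'red', label = "Predict") 
plt.legend(loc= 'best') 
# RMSE is the square root of the mean squared error
plt.title('RMSE: %.4f'% np.sqrt(np.mean((AR_predict - valid['Count'])**2))) 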
plt.show()

Here the red line shows the prediction for the validation set. Let’s build the MA model now.

# MA model
The moving-average model specifies that the output variable depends linearly on the current and various past values of a stochastic (imperfectly predictable) term.
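As a small illustration of this definition, the sketch below simulates an MA(1) process and checks its autocorrelation. For an MA(1) process with coefficient theta, the lag-1 autocorrelation is theta / (1 + theta^2) and all higher lags are zero in theory. The coefficient 0.5, the sample size and the seed are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.5             # hypothetical MA(1) coefficient
n = 5000

# Simulate x[t] = e[t] + theta * e[t-1]
e = rng.normal(size=n + 1)
x = e[1:] + theta * e[:-1]

def autocorr(series, lag):
    """Sample autocorrelation at a single positive lag."""
    centered = series - series.mean()
    return np.dot(centered[:-lag], centered[lag:]) / np.dot(centered, centered)

r1 = autocorr(x, 1)     # theory: theta / (1 + theta**2) = 0.4
r2 = autocorr(x, 2)     # theory: 0 for an MA(1) process
print(round(r1, 3), round(r2, 3))
```

This sharp cut-off after lag q is why the ACF plot is used to pick the MA order.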


In [None]:
model = ARIMA(Train_log, order=(0, 1, 2))  # here the p value is zero since it is just the MA model 
results_MA = model.fit(disp=-1)  
plt.plot(train_log_diff.dropna(), label='original') 
plt.plot(results_MA.fittedvalues, color='red', label='prediction') 
plt.legend(loc='best') 
plt.show()


In [None]:
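# Predicted differences of the log series over the validation period
MA_predict=results_MA.predict(start="2014-06-25", end="2014-09-25") 
# Cumulative sum turns the predicted differences into cumulative log changes
MA_predict=MA_predict.cumsum().shift().fillna(0) 
# Start from the first log count of the validation set and add the changes
MA_predict1=pd.Series(np.ones(valid.shape[0]) * np.log(valid['Count'].iloc[0]), index = valid.index) 
MA_predict1=MA_predict1.add(MA_predict,fill_value=0) 
# Undo the log transform
MA_predict = np.exp(MA_predict1)
plt.plot(valid['Count'], label = "Valid") 
plt.plot(MA_predict, color = 'red', label = "Predict") 
plt.legend(loc= 'best') 
# RMSE is the square root of the mean squared error
plt.title('RMSE: %.4f'% np.sqrt(np.mean((MA_predict - valid['Count'])**2))) 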
plt.show()


Now let’s combine these two models.


# Combined model



In [None]:
model = ARIMA(Train_log, order=(2, 1, 2))  
results_ARIMA = model.fit(disp=-1)  
plt.plot(train_log_diff.dropna(),  label='original') 
plt.plot(results_ARIMA.fittedvalues, color='red', label='predicted') 
plt.legend(loc='best') 
plt.show()

Let’s define two functions that convert the model’s predictions back to the original scale and plot them against a given set.


In [None]:
def check_prediction_diff(predict_diff, given_set):
    # Turn predicted log differences into cumulative log changes
    predict_diff= predict_diff.cumsum().shift().fillna(0)
    # Start from the first log count of the given set and add the changes
    predict_base = pd.Series(np.ones(given_set.shape[0]) * np.log(given_set['Count'].iloc[0]), index = given_set.index)
    predict_log = predict_base.add(predict_diff,fill_value=0)
    predict = np.exp(predict_log)  # undo the log transform

    plt.plot(given_set['Count'], label = "Given set")
    plt.plot(predict, color = 'red', label = "Predict")
    plt.legend(loc= 'best')
    # RMSE is the square root of the mean squared error
    plt.title('RMSE: %.4f'% np.sqrt(np.mean((predict - given_set['Count'])**2)))
    plt.show()
    
def check_prediction_log(predict_log, given_set):
    predict = np.exp(predict_log)  # undo the log transform
 
    plt.plot(given_set['Count'], label = "Given set")
    plt.plot(predict, color = 'red', label = "Predict")
    plt.legend(loc= 'best')
    plt.title('RMSE: %.4f'% np.sqrt(np.mean((predict - given_set['Count'])**2)))
    plt.show()

Let’s predict the values for the validation set.



In [None]:
ARIMA_predict_diff=results_ARIMA.predict(start="2014-06-25", end="2014-09-25")
check_prediction_diff(ARIMA_predict_diff, valid)

# SARIMAX model on daily time series

The SARIMAX model takes the seasonality of the time series into account, so we will build one on the daily time series.

In [None]:
import statsmodels.api as sm
y_hat_avg = valid.copy() 
fit1 = sm.tsa.statespace.SARIMAX(Train.Count, order=(2, 1, 4),seasonal_order=(0,1,1,7)).fit() 
y_hat_avg['SARIMA'] = fit1.predict(start="2014-6-25", end="2014-9-25", dynamic=True) 
plt.figure(figsize=(16,8)) 
plt.plot( Train['Count'], label='Train') 
plt.plot(valid['Count'], label='Valid') 
plt.plot(y_hat_avg['SARIMA'], label='SARIMA') 
plt.legend(loc='best') 
plt.show()

In the model above, order gives (p, d, q): the order of the autoregressive part (number of time lags), the degree of differencing (number of times past values have been subtracted from the data) and the order of the moving-average part.

Seasonal order gives the same three quantities for the seasonal component (AR order, differencing, MA order), followed by the periodicity.

In our case the periodicity is 7, since this is a daily time series and the pattern repeats every 7 days.
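The effect of the differencing terms can be seen on a toy series. The sketch below builds a made-up daily series with a linear trend and a weekly pattern, then applies a seasonal difference with period 7 followed by an ordinary first difference. It only illustrates the differencing part of the model, not the AR or MA terms.

```python
import numpy as np

# Toy daily series (made up): linear trend plus a repeating weekly pattern
weekly_pattern = np.array([5.0, 6.0, 6.5, 6.0, 7.0, 3.0, 2.0])
t = np.arange(42)
series = 0.5 * t + np.tile(weekly_pattern, 6)

# Seasonal difference with period 7 (the D=1, s=7 part of seasonal_order)
seasonal_diff = series[7:] - series[:-7]

# Ordinary first difference on top of that (the d=1 part of order)
both_diff = np.diff(seasonal_diff)
print(seasonal_diff[:3], both_diff[:3])
```

The seasonal difference removes the weekly pattern and leaves only the weekly increase from the trend. The first difference then removes that constant as well, leaving a flat series for the AR and MA terms to model.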

Let’s check the RMSE value for the validation part.

In [None]:
rms = sqrt(mean_squared_error(valid.Count, y_hat_avg.SARIMA)) 
print(rms)

Now we will forecast the time series for the test data, which runs from 2014-9-26 to 2015-4-26.

Note that these are daily predictions and we need hourly predictions, so we will distribute each daily prediction into hourly counts. To do so, we take the hourly distribution of passenger count from the train data as a set of ratios and distribute the predictions in the same ratio.
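The distribution step can be sketched on toy data as follows. The sketch assumes that the daily prediction is the mean hourly count for that day, which is why the product is multiplied by 24. The frame `toy_train` and the value 50.0 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy hourly training data (made up): two days, counts rising through the day
hours = np.tile(np.arange(24), 2)
toy_train = pd.DataFrame({"Hour": hours, "Count": hours + 1.0})

# Share of the daily total that falls in each hour
hourly_mean = toy_train.groupby("Hour")["Count"].mean()
ratio = hourly_mean / hourly_mean.sum()

# A daily forecast expressed as the mean hourly count for that day (assumed)
daily_prediction = 50.0
hourly_forecast = daily_prediction * ratio * 24
print(hourly_forecast.head())
```

The ratios sum to 1, so the 24 hourly forecasts average back to the daily prediction while following the hourly shape seen in the training data.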



In [None]:
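# Daily forecast for the test period; assumes fit1 is the SARIMAX model fitted above
predict=fit1.predict(start="2014-9-26", end="2015-4-26", dynamic=True)
test['prediction']=predict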
# Merge test and test_original on day, month and year 
merge=pd.merge(test, test_original, on=['day','month', 'year'], how='left') 
merge['Hour']=merge['Hour_y'] 
merge=merge.drop(['year', 'month', 'Datetime','Hour_x','Hour_y'], axis=1) 

# Predicting by merging merge and temp2 
prediction=pd.merge(merge, temp2, on='Hour', how='left') 

# Converting the ratio to the original scale 
prediction['Count']=prediction['prediction']*prediction['ratio']*24

# Drop all variables other than ID and Count

prediction['ID']=prediction['ID_y'] 
submission=prediction.drop(['day','Hour','ratio','prediction', 'ID_x', 'ID_y'],axis=1) 

# Converting the final submission to csv format 
pd.DataFrame(submission, columns=['ID','Count']).to_csv('SARIMAX.csv')

In [None]:
submission.head()

END