In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
from statsmodels.tsa.stattools import adfuller
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
# Reading the data
df = pd.read_csv("../input/portland-oregon-average-monthly-.csv")

In [None]:
# A glance on the data 
df.head()

In [None]:
df.columns = ["month", "average_monthly_ridership"]
df.head()

In [None]:
# getting some information about dataset
df.info()

In [None]:
df.dtypes

We need to convert the `month` column to datetime (it will later become the index) and the ridership column to a numeric type.

In [None]:
df.tail()

In [None]:
df = df.iloc[:-1, :].copy()   # removing last row; copy so later column assignments don't act on a view

Changing the data type of both columns:
* Convert the ```average_monthly_ridership``` column to a numeric type
* Convert the ```month``` column to datetime

In [None]:
df['average_monthly_ridership'] = pd.to_numeric(df['average_monthly_ridership'])

In [None]:
df['month'] = pd.to_datetime(df['month'], format = '%Y-%m')

In [None]:
df.dtypes

# Time Series Analysis

As you all know, data analysis matters a great deal to data scientists. It gives us a first understanding of the data and a strange but intriguing confidence in our prediction model. Time series analysis is no different, but time series problems call for their own kind of analysis. Before we move into that, let me introduce some jargon (just kidding, it is pure and simple English) that is frequently used in this problem domain.

**Trend**:- As the name suggests, the trend is the long-run change in the series as time increases. It is often non-linear. Sometimes we will say the trend is “changing direction” when it goes from increasing to decreasing.

**Level**:- The baseline value of the time series.

**Seasonal**:- As the name suggests, a pattern that repeats over a fixed period. In layman's terms, it is the seasonal variation of the data over time.

**Noise**:- Random variation in the data that the other components do not explain.
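To make these four terms concrete, here is a minimal sketch that builds a synthetic monthly series as the sum of a level, a trend, a seasonal pattern and noise. All the numbers are made up for illustration, and the additive form is an assumption (a multiplicative form is equally common).

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
t = np.arange(120)               # 10 years of monthly time steps

level = np.full(t.shape, 600.0)                # constant baseline
trend = 5.0 * t                                # steady upward drift
seasonal = 80.0 * np.sin(2 * np.pi * t / 12)   # pattern repeating every 12 steps
noise = rng.normal(scale=20.0, size=t.size)    # random fluctuation

# Additive model: the observed series is the sum of the four components
series = level + trend + seasonal + noise
print(series[:5])
```

Plotting `series` gives a curve with the same ingredients as a typical ridership series: a baseline, a drift, a yearly wave and some jitter.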

In [None]:
# Normal line plot so that we can see data variation
# We can observe that average number of riders is increasing most of the time
# We'll later see decomposed analysis of that curve
df.plot.line(x = 'month', y = 'average_monthly_ridership')
plt.show()

In the above plot the graph goes upwards, so there is an upward trend.
There is also a repeating pattern. It is not 100% consistent from start to end, but it hints at seasonal behavior with a period of about 1 year. There are different ways to verify this; I personally prefer visualizing ACF and PACF plots, which also show patterns at seasonal lags. Also, around 1967 the variance of the series is higher than in earlier years (not strictly), and the most commonly used way to stabilize the variance is a log transformation.

To model a time series, it has to be stationary, i.e. we have to remove changing variance, trend and seasonality. It might not be possible to remove them 100%, so we will run the Dickey-Fuller test and claim stationarity if it rejects its null hypothesis at the 5% or 1% level.
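Why does a log transformation help with variance? A small sketch on synthetic data, assuming the seasonal swing is proportional to the level of the series (the growth rate and amplitude are made up): on the raw scale the yearly swing grows with the level, on the log scale it stays constant.

```python
import numpy as np

t = np.arange(120)
level = 100.0 * np.exp(0.01 * t)                        # exponential growth
y = level * (1.0 + 0.2 * np.sin(2 * np.pi * t / 12))    # swing proportional to the level

def yearly_swing(x, year):
    """Peak-to-peak range of one 12-month window."""
    window = x[12 * year: 12 * (year + 1)]
    return window.max() - window.min()

raw_first, raw_last = yearly_swing(y, 0), yearly_swing(y, 9)
log_first, log_last = yearly_swing(np.log(y), 0), yearly_swing(np.log(y), 9)

print(raw_last / raw_first)   # greater than 1: the swing grows with the level
print(log_last / log_first)   # about 1: the swing is stable after the log
```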

In [None]:
rider = df[['average_monthly_ridership']]

# Log Transformation

In [None]:
log_ridership = np.log(df[['average_monthly_ridership']])

In [None]:
log_ridership.plot.line()

The plot does not change much, but the log transformation is a safe step that helps stabilize the variance.

## Trend Removal

We see an upward trend, so we will use the most common method of removing it: differencing of order 1, i.e. subtracting the previous term from each term.

In [None]:
# 1st order differencing
rider_single_diff = (log_ridership.diff()).dropna()  # 1st term will be NAN

#NOTE: diff(diff(X)) is 2nd order differencing 
rider_double_diff = (rider_single_diff.diff()).dropna()  

#seasonal differencing of order 1 (applied on top of the 1st order difference)
rider_single_seasonal_diff = (rider_single_diff.diff(periods=12)).dropna()  # first 12 terms will be NaN

rider_single_diff.plot.line()
plt.title('1st Order Diff')
plt.show()

rider_double_diff.plot.line()
plt.title('2nd Order Diff')
plt.show()

rider_single_seasonal_diff.plot.line()
plt.title('1st Order Seasonal Diff')
plt.show()

Observations from the above 3 graphs:

Graph A: 1st order differencing
It has removed the upward trend. However, the series is still not stationary because it shows seasonal behavior, which we will have to remove.

Graph B: 2nd order differencing
There is not much difference w.r.t. 1st order, so we will keep our model relatively simple and go ahead with 1st order differencing, i.e. d=1.

Graph C: 1st order seasonal differencing
Taking the seasonal difference of the 1st order differenced series has removed both the upward trend and the seasonality. Therefore d=1, D=1 and m=12 (seasonal period). Note that even without the seasonal difference the series might pass the stationarity test, but taking it should yield better results.
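The effect of the two differencing steps is easiest to see on a toy series. This is a sketch with a made-up monthly pattern on top of a linear trend: the 1st order difference removes the trend, and the seasonal difference (lag 12) on top of it removes the repeating pattern.

```python
import numpy as np
import pandas as pd

t = np.arange(48)
pattern = np.array([0, 3, 5, 8, 6, 4, 2, 1, 3, 6, 9, 4], dtype=float)  # made-up monthly pattern
series = pd.Series(0.5 * t + np.tile(pattern, 4))   # linear trend + exact yearly repeat

d1 = series.diff().dropna()             # removes the linear trend, seasonality remains
d1_D1 = d1.diff(periods=12).dropna()    # removes the repeating 12-step pattern

print(len(series), len(d1), len(d1_D1))   # 48 47 35: 1 + 12 observations are lost
print(d1_D1.abs().max())                  # essentially zero for this noise-free toy series
```

On real data the result is not exactly zero, of course; what is left should look like noise around a constant mean.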

Now we have to check if it passes the Dickey-Fuller test. I am hopeful it will!

In [None]:
#Perform Dickey–Fuller test:
print('Results of Dickey Fuller Test:')
dftest = adfuller(rider_single_seasonal_diff.average_monthly_ridership, autolag='AIC') #Note: the input should be a pandas Series, not a DataFrame
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)

Great! We can see that the p-value is very small. Smaller than the inverse of the speed of light!

Therefore, we reject the null hypothesis of a unit root, and I am happy to declare that our series is now stationary. So we proceed to model it.

Now there are different model classes available in Python for that. The following are the two most commonly used ones:

1) ARIMA(): You pass a series and the order (p, d, q). It will fit and generate forecasts.
To generate forecasts, we first need to explicitly create the dates on which to make future predictions. Also, it doesn't have a built-in option to take the seasonal difference, so you have to take that difference first, pass the differenced series, and after generating predictions reverse the differencing, which is a somewhat lengthy process.

2) SARIMAX(): You pass a series and the order (p,d,q)x(P,D,Q)x(m). It will fit and generate forecasts. Here we have the flexibility to simply define the number of periods over which future predictions are to be made, without manually creating dates first.

NOTE: Since neither has a built-in option for the log transformation, we have to pass the log transformed series and later reverse the transformation on the predictions to get meaningful values.
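The "reverse engineering" mentioned above boils down to two steps: a cumulative sum undoes the differencing, and an exponential undoes the log. A minimal round-trip sketch on made-up values (for 1st order differencing only; a seasonal difference is undone the same way, lag by lag):

```python
import numpy as np
import pandas as pd

y = pd.Series([100.0, 110.0, 121.0, 115.0, 130.0, 140.0])   # made-up values

log_y = np.log(y)
log_diff = log_y.diff().dropna()   # what a model without built-in differencing would see

# Undo the differencing: first log value plus the cumulative sum of the differences
restored_log = pd.concat([log_y.iloc[:1], log_y.iloc[0] + log_diff.cumsum()])
# Undo the log transformation
restored = np.exp(restored_log)

print(restored.round(6).tolist())
```

The same recipe applies to forecasts: start from the last observed log value, add the cumulative sum of the predicted differences, then exponentiate.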

## [Periodicity and Autocorrelation](https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/)

Autocorrelation is the best-known way to understand seasonal variation. We can calculate the correlation between time series observations and observations at previous time steps, called lags. Because the correlation is calculated with values of the same series at previous times, it is called a serial correlation, or an autocorrelation. In the plot, the vertical axis shows $C_h / C_0$, where:

> $C_h = \sum_{t = 1}^{n - h} (y(t) - \bar{y}) (y(t + h) - \bar{y}) / n$

> $C_0 = \sum_{t = 1}^{n} (y(t) - \bar{y})^2 / n$

Here $n$ is the number of observations and $\bar{y}$ is the sample mean. The horizontal axis represents the time lag (previous time steps) h.
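The definition can be checked with a few lines of NumPy. This sketch computes the lag-h autocovariance divided by the lag-0 autocovariance for a synthetic sine wave with period 12; `autocorr` is a hypothetical helper, not a library function.

```python
import numpy as np

def autocorr(y, h):
    """Sample autocorrelation at lag h: lag-h autocovariance over lag-0 autocovariance."""
    y = np.asarray(y, dtype=float)
    n = y.size
    dev = y - y.mean()
    c0 = np.sum(dev ** 2) / n
    ch = np.sum(dev[: n - h] * dev[h:]) / n
    return ch / c0

t = np.arange(120)
y = np.sin(2 * np.pi * t / 12)   # pure seasonal signal with period 12

print(autocorr(y, 0))    # 1 by construction
print(autocorr(y, 6))    # strongly negative: half a period apart
print(autocorr(y, 12))   # strongly positive: exactly one period apart
```

This is why a seasonal series shows spikes at multiples of its period in the ACF plot.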

Note that the ACF and PACF plots help us find (p,q)x(P,Q). Therefore we will use the log transformed series with d=1, D=1. However, if you want to confirm the seasonal period, you can skip the seasonal difference, and you will then see seasonal patterns in those plots.

In [None]:
#ACF and PACF plots:
import statsmodels.api as sm
sm.graphics.tsa.plot_acf(rider_single_seasonal_diff.values.squeeze(), lags=40)
plt.title('ACF')
plt.show()

sm.graphics.tsa.plot_pacf(rider_single_seasonal_diff.values.squeeze(), lags=40)
plt.title('PACF')
plt.show()

Observations from the above graphs:

1) ACF: Most of the correlations are just noise (therefore q=0). There are significant spikes at lags 11 and 12, which indicates Q=1.

2) PACF: Same story here as well: p=0 and P=1. Note that the correlations at around lags 24 and 36 are also significant, so we can try P=2 as well. We will not try P=3, as it would make the model very complicated and could lead to overfitting.
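What counts as "just noise"? A common rule of thumb is that, for white noise, about 95% of sample autocorrelations fall within ±1.96/√n. The sketch below applies that rule to a list of hypothetical autocorrelation values (illustrative numbers, not the output of this notebook). Note that the shaded band drawn by `plot_acf` may be computed differently, so treat this as an approximation.

```python
import numpy as np

def significant_lags(acf_values, n_obs, z=1.96):
    """Return the lags (>= 1) whose |autocorrelation| exceeds z / sqrt(n_obs)."""
    acf_values = np.asarray(acf_values, dtype=float)
    band = z / np.sqrt(n_obs)
    lags = np.arange(acf_values.size)
    mask = (np.abs(acf_values) > band) & (lags >= 1)   # lag 0 is always 1, skip it
    return lags[mask], band

# Hypothetical autocorrelations for lags 0..13
example_acf = [1.0, 0.05, -0.08, 0.02, 0.1, -0.04, 0.03,
               0.0, -0.06, 0.07, 0.01, 0.25, -0.4, 0.05]
lags, band = significant_lags(example_acf, n_obs=100)
print(lags, round(band, 3))
```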

# Forecasting

In [None]:
#for our model we need dates as indexes
df = df.set_index('month')

#doing log transformation on data (pass a Series, not a one-column DataFrame)
df['average_monthly_ridership'] = np.log(df['average_monthly_ridership'])

In [None]:
# Applying Seasonal ARIMA model to forecast the data
mod = sm.tsa.SARIMAX(df['average_monthly_ridership'], trend='n', order=(0,1,0), seasonal_order=(1,1,1,12))
results = mod.fit()
print(results.summary())

I tried different models with P=1 and P=2 (Model_1 and Model_2). Here is a brief comparison:

1) AIC: Almost the same

2) Prob(Q): Model_1 > Model_2 (a higher Ljung–Box p-value indicates less leftover correlation in the residuals of Model_1, which is good)

3) P>|z|: The lower the p-value, the better. It tells us about the significance of the coefficients of the AR and MA terms. The p-values are lower when P=1

The Ljung–Box test may be defined as:

H0: The data are independently distributed (i.e. the correlations in the population from which the sample is taken are 0, so that any observed correlations in the data result from randomness of the sampling process).

Ha: The data are not independently distributed; they exhibit serial correlation.

In our case, the p-value is not significant, so we fail to reject the null hypothesis and conclude that there is no evidence of serial correlation in the residuals.
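To see what the Q statistic measures, here is a hand-rolled sketch of the Ljung–Box statistic, Q = n(n+2) Σ r_k² / (n−k) over lags k = 1..max_lag, on synthetic data. `ljung_box_q` is a hypothetical helper. Under H0, Q is compared against a chi-square distribution; the hard-coded 18.307 is the 95% quantile for 10 degrees of freedom, which fits a raw series (for model residuals the degrees of freedom are usually reduced by the number of fitted parameters).

```python
import numpy as np

def ljung_box_q(x, max_lag=10):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    n = x.size
    dev = x - x.mean()
    c0 = np.sum(dev ** 2)
    q = 0.0
    for k in range(1, max_lag + 1):
        r_k = np.sum(dev[:-k] * dev[k:]) / c0   # sample autocorrelation at lag k
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q

CHI2_95_DF10 = 18.307   # 95% quantile of chi-square with 10 degrees of freedom

rng = np.random.default_rng(1)
white = rng.normal(size=300)   # independent draws
walk = np.cumsum(white)        # strongly autocorrelated

print(ljung_box_q(white))   # small Q means a large p-value: no evidence against H0
print(ljung_box_q(walk))    # far above the critical value: reject H0
```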

## To check your model

In [None]:
df['forecast'] = results.predict(start = 102, end= 120, dynamic= True)  
df[['average_monthly_ridership', 'forecast']].plot(figsize=(12, 8))
plt.show()

Note that the values are still log transformed.
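Because the model works on logs, any error measure is easier to interpret after converting back with the exponential. A small sketch, assuming you have held-out actual values to compare against; `mape_on_original_scale` is a hypothetical helper and the numbers are made up.

```python
import numpy as np

def mape_on_original_scale(log_actual, log_forecast):
    """Mean absolute percentage error after undoing the log transformation."""
    actual = np.exp(np.asarray(log_actual, dtype=float))
    forecast = np.exp(np.asarray(log_forecast, dtype=float))
    return float(np.mean(np.abs(forecast - actual) / actual) * 100.0)

# Made-up example: true values 100 and 200, forecasts 110 and 180
log_actual = np.log([100.0, 200.0])
log_forecast = np.log([110.0, 180.0])
print(mape_on_original_scale(log_actual, log_forecast))   # about 10 (percent)
```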

## To generate future forecasts
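Future predictions need a date index that continues past the last observed month. One way to build it is `pd.date_range` with a month-start frequency, sketched here. `future_month_index` is a hypothetical helper, and the last date used in the example is only illustrative.

```python
import pandas as pd

def future_month_index(last_date, no_of_months):
    """Month-start dates for the no_of_months months following last_date."""
    start = pd.Timestamp(last_date) + pd.DateOffset(months=1)
    return pd.date_range(start=start, periods=no_of_months, freq='MS', name='month')

future = future_month_index('1969-06-01', 10)   # illustrative last observed month
print(future[0], future[-1])
```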

In [None]:
def forcasting_future_months(df, no_of_months):
    # NOTE: relies on the fitted `results` object from the model cell above
    df_predict = df.reset_index()
    mon = df_predict['month']
    mon = mon + pd.DateOffset(months = no_of_months)
    future_dates = mon[-no_of_months:]  # only dates after the last observed month
    df_predict = df_predict.set_index('month')
    future = pd.DataFrame(index=future_dates, columns= df_predict.columns, dtype=float)
    df_predict = pd.concat([df_predict, future])
    n_obs = len(df)  # position of the first out-of-sample month
    df_predict['forecast'] = results.predict(start = n_obs, end = n_obs + no_of_months - 1, dynamic= True)
    df_predict[['average_monthly_ridership', 'forecast']].iloc[-no_of_months - 12:].plot(figsize=(12, 8))
    plt.show()
    return df_predict[-no_of_months:]

In [None]:
predicted = forcasting_future_months(df,10)

Converting the values back to the original scale by taking the exponential

In [None]:
df = df.apply(np.exp)
forecast = predicted.apply(np.exp)
final = pd.concat([df, forecast])  # DataFrame.append was removed in pandas 2.0
final[['average_monthly_ridership', 'forecast']].plot(figsize=(12, 8))

Hurray! You have made it to THE END.