# ARIMA and Seasonal ARIMA
## Autoregressive Integrated Moving Averages
The general process for ARIMA models is the following:

* Visualize the Time Series Data
* Make the time series data stationary
* Plot the Autocorrelation and Partial Autocorrelation Charts
* Construct the ARIMA Model or Seasonal ARIMA based on the data
* Use the model to make predictions

Let's go through these steps!

## Step 1: Understanding the Data

In [None]:
from datetime import datetime
import numpy as np             #for numerical computations like log,exp,sqrt etc
import pandas as pd            #for reading & storing data, pre-processing
import matplotlib.pyplot as plt #for visualization
#for making sure matplotlib plots are generated in Jupyter notebook itself
%matplotlib inline             
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.seasonal import seasonal_decompose
# statsmodels.tsa.arima_model was deprecated in 0.12 and removed in 0.13; use arima.model
from statsmodels.tsa.arima.model import ARIMA
from matplotlib.pylab import rcParams

In [None]:
df=pd.read_csv('../input/air-passengers/AirPassengers.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df.isna().sum()

In [None]:
# Convert Month into Datetime
df['Month']=pd.to_datetime(df['Month'])

In [None]:
df.head()

In [None]:
df.set_index('Month',inplace=True)

In [None]:
df.head()

In [None]:
df.describe()

## Step 2: Visualize the Data

In [None]:
df.plot()

In [None]:
### Testing For Stationarity

from statsmodels.tsa.stattools import adfuller

In [None]:
# H0: the series has a unit root (it is non-stationary)
# H1: the series is stationary

def adfuller_test(series):
    """Run the Augmented Dickey-Fuller test and print the key statistics."""
    result = adfuller(series)
    labels = ['ADF Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used']
    # adfuller returns more items than labels; zip keeps only the first four
    for value, label in zip(result, labels):
        print(label + ' : ' + str(value))
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis (H0): reject H0. The data has no unit root and is stationary.")
    else:
        print("Weak evidence against the null hypothesis (H0): cannot reject H0. The series may have a unit root, indicating it is non-stationary.")

In [None]:
adfuller_test(df['#Passengers'])

## Step 3: Differencing

In [None]:
#Determine rolling statistics
rolmean = df.rolling(window=12).mean() #window size 12 denotes 12 months, giving rolling mean at yearly level
rolstd = df.rolling(window=12).std()

In [None]:
#Plot rolling statistics
orig = plt.plot(df, color='blue', label='Original')
mean = plt.plot(rolmean, color='red', label='Rolling Mean')
std = plt.plot(rolstd, color='black', label='Rolling Std') #the title promises both statistics
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show(block=False)

In [None]:
#Estimating trend
df_logScale = np.log(df)
plt.plot(df_logScale)

In [None]:
#12-month rolling mean of the log series; subtracting it in the next cell removes the trend
movingAverage = df_logScale.rolling(window=12).mean()
movingSTD = df_logScale.rolling(window=12).std()
plt.plot(df_logScale)
plt.plot(movingAverage, color='red')

In [None]:
df_logScale_ma = df_logScale - movingAverage
df_logScale_ma.head(12)

In [None]:
#Remove NAN values
df_logScale_ma.dropna(inplace=True)
df_logScale_ma.head(10)

In [None]:
# Re-run the Dickey-Fuller test on the detrended log series (pass the column as a 1-D Series)
adfuller_test(df_logScale_ma['#Passengers'])

In [None]:
df_logScale_ma.plot()

#### Time Shift Transformation 

In [None]:
df_logScale_shift = df_logScale - df_logScale.shift()
plt.plot(df_logScale_shift)

In [None]:
df_logScale_shift.dropna(inplace=True)
df_logScale_shift.plot()

In [None]:
# Re-run the Dickey-Fuller test on the differenced log series (pass the column as a 1-D Series)
adfuller_test(df_logScale_shift['#Passengers'])
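
The time shift transformation above is a first difference of the log series, log(x_t) - log(x_{t-1}) = log(x_t / x_{t-1}). As a minimal sketch of why this can remove a trend, the cell below uses a made-up series that grows by a constant 2% per step. The `growth` value and the series itself are illustrative assumptions, not the passenger data. It uses only the standard library so the arithmetic is visible.

```python
import math

# Hypothetical series with a constant 2% growth per step (an exponential trend).
growth = 1.02
series = [100.0 * growth ** t for t in range(24)]

# The log transform turns the exponential trend into a straight line.
log_series = [math.log(x) for x in series]

# First difference of the logs: log(x_t) - log(x_{t-1}) = log(x_t / x_{t-1}).
log_diff = [curr - prev for prev, curr in zip(log_series, log_series[1:])]

# Every differenced value equals log(growth), so the trend is gone.
print(len(series), len(log_diff))
print(all(abs(d - math.log(growth)) < 1e-9 for d in log_diff))
```

Differencing drops one observation, which is why the notebook calls `dropna` after `shift()`. Real data will not difference to a constant, but the same mechanism removes the multiplicative trend.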

## Step 4: Final Thoughts on Autocorrelation and Partial Autocorrelation

* Identification of an AR model is often best done with the PACF.
    * For an AR model, the theoretical PACF “shuts off” past the order of the model. The phrase “shuts off” means that in theory the partial autocorrelations are equal to 0 beyond that point. Put another way, the lag of the last non-zero partial autocorrelation gives the order of the AR model. By the “order of the model” we mean the largest lag of x that is used as a predictor.

* Identification of an MA model is often best done with the ACF rather than the PACF.
    * For an MA model, the theoretical PACF does not shut off, but instead tapers toward 0 in some manner. A clearer pattern for an MA model is in the ACF. The ACF will have non-zero autocorrelations only at lags involved in the model.
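
The two rules above can be checked on simulated data. The sketch below is an illustration under stated assumptions: it simulates an AR(1) and an MA(1) process with a coefficient of 0.7 (an arbitrary choice), and computes the sample ACF and PACF by hand. The helper names `sample_acf` and `sample_pacf` are hypothetical and written here in plain Python; they are not the statsmodels functions. The PACF uses the Durbin-Levinson recursion.

```python
import random

def sample_acf(x, nlags):
    """Sample autocorrelations r_0..r_nlags of a sequence."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(nlags + 1)]

def sample_pacf(x, nlags):
    """Partial autocorrelations at lags 1..nlags (Durbin-Levinson recursion)."""
    r = sample_acf(x, nlags)
    pacf, phi_prev = [], []
    for k in range(1, nlags + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf

rng = random.Random(0)          # fixed seed so the run is repeatable
n = 20000
eps = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]

# AR(1): x_t = 0.7 * x_{t-1} + eps_t
ar = [0.0]
for t in range(1, n + 1):
    ar.append(0.7 * ar[-1] + eps[t])
ar = ar[1:]

# MA(1): x_t = eps_t + 0.7 * eps_{t-1}
ma = [eps[t] + 0.7 * eps[t - 1] for t in range(1, n + 1)]

ar_pacf = sample_pacf(ar, 5)    # expect one clear spike at lag 1, then near 0
ma_acf = sample_acf(ma, 5)      # expect one clear spike at lag 1, then near 0
ma_pacf = sample_pacf(ma, 5)    # expect a gradual taper rather than a cut-off

print("AR(1) PACF, lags 1-5:", [round(v, 3) for v in ar_pacf])
print("MA(1) ACF,  lags 1-5:", [round(v, 3) for v in ma_acf[1:]])
print("MA(1) PACF, lags 1-5:", [round(v, 3) for v in ma_pacf])
```

With sample data the values past the cut-off are only approximately zero, so in practice they are judged against the confidence bands that `plot_acf` and `plot_pacf` draw.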

In [None]:
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf

In [None]:
import statsmodels.api as sm

fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(df_logScale_shift.values.squeeze(),lags=40,ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(df_logScale_shift.values.squeeze(),lags=40,ax=ax2)

In [None]:
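# Fit on the passenger column explicitly so the cell still works after a 'forecast' column is added
model=ARIMA(df['#Passengers'],order=(2,1,2))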
model_fit=model.fit()

In [None]:
model_fit.summary()

In [None]:
model_fit.aic

In [None]:
df['forecast']=model_fit.predict(start=90,end=103,dynamic=True)
df[['#Passengers','forecast']].plot(figsize=(12,8))

In [None]:
model=sm.tsa.statespace.SARIMAX(df['#Passengers'],order=(2,1,2),seasonal_order=(2,1,2,12))
results=model.fit()

In [None]:
results.aic

In [None]:
results.summary()

In [None]:
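# End at the last observation: values predicted past the index would be dropped on assignment to df
df['forecast']=results.predict(start=90,end=len(df)-1,dynamic=True)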
df[['#Passengers','forecast']].plot(figsize=(12,8))