# ARIMA (Autoregressive Integrated Moving Average)

##### The general process for ARIMA models is the following:
    1. Visualize the Time Series Data
    2. Make the time series data stationary
    3. Plot the Autocorrelation (ACF) and Partial Autocorrelation (PACF) Charts
    4. Construct the ARIMA Model or Seasonal ARIMA based on the data
    5. Use the model to make predictions

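##### Formula of the Autoregressive (AR) model:
    yt = c + ϕ1 yt−1 + ϕ2 yt−2 + ... + ϕp yt−p + εt
    Here,
    p = order of the model (number of lags used): 1, 2, 3, ...
    yt−1, yt−2, ..., yt−p = lagged values of the series
    c = constant, ϕ1 ... ϕp = coefficients, εt = error term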
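
As a minimal sketch of the formula above, the one-step AR(p) prediction can be computed by hand. The function name `ar_predict` and the coefficient values are made up for illustration; they are not fitted to the champagne data, and the error term εt is left out because its expected value is zero.

```python
def ar_predict(history, c, phi):
    """One-step AR(p) prediction: c + phi[0]*y[t-1] + ... + phi[p-1]*y[t-p]."""
    p = len(phi)
    if len(history) < p:
        raise ValueError("need at least p past observations")
    # history[-1] is y[t-1], history[-2] is y[t-2], and so on
    lags = history[::-1][:p]
    return c + sum(coef * y for coef, y in zip(phi, lags))

# Hypothetical AR(2) coefficients, chosen only for illustration
prediction = ar_predict([1.0, 2.0, 3.0], c=0.5, phi=[0.5, 0.25])
print(prediction)  # 0.5 + 0.5*3.0 + 0.25*2.0 = 2.5
```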
    
<br/>
<br/>

## Load the dataset and Reconnaissance
Dataset used here: Perrin Freres monthly champagne sales

In [None]:
import pandas as pd
df = pd.read_csv('../input/perrin-freres-monthly-champagne-sales/Perrin Freres monthly champagne sales millions.csv')
df.head()

In [None]:
df.tail()

<br/>

### Checking whether there are any missing values in the dataset
There is one missing value in the 'Month' column and two in the other column.

In [None]:
df.isnull().sum()

### Removing the rows containing missing values

In [None]:
df.drop([105,106], axis = 0, inplace = True)

In [None]:
df.isnull().sum()
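
The row labels 105 and 106 are specific to this file. A minimal sketch of an alternative that does not hard-code them is `dropna`, shown here on a made-up toy frame (the values are invented, not taken from the dataset):

```python
import pandas as pd

# Toy frame with the same kind of problem: one fully empty row and one footer row
toy = pd.DataFrame({
    "Month": ["2000-01", "2000-02", None, "Footer text"],
    "Sales": [100.0, 200.0, None, None],
})

# Drop every row that has a missing value in any column
cleaned = toy.dropna(axis=0, how="any")

print(cleaned.isnull().sum())  # missing-value count per column after cleaning
print(len(cleaned))            # number of rows that survive
```

Whether this is safe depends on the data: it removes every incomplete row, so it only matches the hard-coded version when the incomplete rows are exactly the ones to discard.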

<br/>

### Changing the name of the 2nd column
We shouldn't break our teeth pronouncing any column name.

In [None]:
df.columns = ['Month', 'Sales']
df.head()

<br/>

### Shape of the dataset

In [None]:
df.shape

<br/>

### Data types of the variables

In [None]:
df.dtypes

<br/>

### Change the data type of the month column

In [None]:
df['Month'] = pd.to_datetime(df['Month'])

In [None]:
df.dtypes

#### After changing the data type, the data looks like this

In [None]:
df.head()

<br/>

### Convert the month column into index

In [None]:
df.set_index('Month', inplace = True)

In [None]:
df.head()

<br/>

## Plot the dataset

In [None]:
df.plot()

#### Note:
From the plot, the data looks seasonal, so it may not be stationary. To check, we can apply the 'Dickey Fuller Test' and see whether it is stationary or not.

<br/>

## Dickey Fuller Test

In [None]:
from statsmodels.tsa.stattools import adfuller

def adfuller_test(sales):
    result = adfuller(sales)
    labels = ['ADF test statistic', 'P-value', '#Lags used', 'Number of observations used']
    # adfuller returns more than four values; zip stops after the four labelled ones
    for value, label in zip(result, labels):
        print(label+' : '+str(value))
    if result[1] <= 0.05:
        print('Strong evidence against the null hypothesis (Ho): reject it. The data has no unit root and is stationary.')
    else:
        print('Weak evidence against the null hypothesis (Ho): the time series has a unit root, indicating it is non-stationary.')
        
        
adfuller_test(df['Sales'])

#### Note:
As the P-value is greater than 0.05, the Dickey Fuller Test fails to reject the null hypothesis, so we treat the data as non-stationary.  
Now it's time to make the data stationary.

<br/>

## Making the data stationary by Differencing (Integrated)

As the data is seasonal and each year has 12 months, the graph shows that one cycle of the data is 12 months long.  
That's why we shift by 12 here; the result of subtracting the shifted values from the 'Sales' column is stored in a new column named 'seasional_first_difference'.

In [None]:
df['seasional_first_difference'] = df['Sales'] - df['Sales'].shift(12)
df
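
A minimal sketch of what the seasonal difference does, on an invented series (a linear trend plus a spike every 12th month). It also shows that `diff(12)` is an equivalent shorthand for subtracting `shift(12)`.

```python
import pandas as pd

# Invented monthly series: linear trend plus a spike every 12th month
values = [i + (10 if i % 12 == 11 else 0) for i in range(36)]
toy = pd.Series(values, index=pd.date_range("2000-01-01", periods=36, freq="MS"))

# Seasonal difference written two equivalent ways
by_shift = toy - toy.shift(12)
by_diff = toy.diff(12)

print(by_shift.isna().sum())       # the first 12 values have no partner a year earlier
print(by_shift.dropna().unique())  # the yearly spike cancels out; only the trend step remains
```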

<br/>

## Again applying Dickey Fuller Test
Now we want to see whether the differenced data has become stationary.  
This time we must pay extra attention to 'dropna()': because of the shift by 12, the first 12 values of 'seasional_first_difference' are NaN, and we have to drop them before running the test.

In [None]:
adfuller_test(df['seasional_first_difference'].dropna())

#### Note:
Done! As the P-value is now less than 0.05, we can treat this series as stationary.

<br/>

## Plotting our new stationary data

In [None]:
df['seasional_first_difference'].plot()

<br/>

## Plotting ACF and PACF
ACF = Autocorrelation function  
PACF = Partial autocorrelation function
<br/>

ACF and PACF are used to find suitable lag values for the model.
##### PACF is most suitable for choosing the AR order.  
##### ACF is most suitable for choosing the MA order.
<br/>

Cut-off (shuts off) - an abrupt drop to near zero after some lag. For an AR process the PACF typically cuts off while the ACF decays gradually (roughly exponentially); for an MA process it is the other way around, and the ACF cuts off.
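
To make the ACF less of a black box, here is a minimal sketch of a common sample estimator written with NumPy. The helper name `sample_acf` and the input series are invented for illustration.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (biased estimator, divides by the lag-0 sum)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[lag:] * centered[:len(x) - lag]) / denom
        for lag in range(nlags + 1)
    ])

# Invented alternating series: negative correlation at lag 1, positive at lag 2
acf = sample_acf([1, -1, 1, -1, 1, -1, 1, -1], nlags=2)
print(acf)  # lag 0 is always 1.0; here lag 1 is -7/8 and lag 2 is 6/8
```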

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

In [None]:
import matplotlib.pyplot as plt     # needed for plt.figure below
import statsmodels.api as sm

# Drop the 12 leading NaN values created by the seasonal shift
seasonal_diff = df['seasional_first_difference'].dropna()

fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(seasonal_diff,lags=40,ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(seasonal_diff,lags=40,ax=ax2)

#### The ARIMA model takes 3 values.

###### p - AR order   : read it from the PACF, at the lag where the cut-off (shuts off) happens.  
###### d - Differencing    : how many times the series is differenced.
###### q - MA order : read it from the ACF, at the lag where the cut-off happens.  
<br/>

In our case,  
p = 1 (the PACF drops abruptly after lag 1)  
d = 1  
q = 0 or 1 (the ACF also drops abruptly after lag 1 rather than decaying gradually, so q = 1 is a reasonable choice and q = 0 can be tried as well. The models below use q = 1.)

<br/>

# ARIMA model
#### Note : Plain ARIMA should be selected when the data is not seasonal. Though we have seasonal data here, we are implementing ARIMA to see the process.

In [None]:
# The older statsmodels.tsa.arima_model module has been removed from recent statsmodels releases
from statsmodels.tsa.arima.model import ARIMA

model_arima = ARIMA(df['Sales'], order = (1,1,1))     #order = (p, d, q)
model_arima_fit = model_arima.fit()

### Model summary

In [None]:
model_arima_fit.summary()

<br/>

## Plotting the model

In [None]:
df['forecast_arima']=model_arima_fit.predict(start=90, end=103, dynamic=True)
df[['Sales','forecast_arima']].plot()

##### As the data is seasonal, the plain ARIMA model doesn't perform well

<br/>

# SARIMAX model

In [None]:
model_sarimax = sm.tsa.statespace.SARIMAX(df['Sales'],
                                          order = (1,1,1),                  # order = (p, d, q)
                                          seasonal_order = (1,1,1, 12))     # seasonal_order = (P, D, Q, s), s = 12 months per cycle
model_sarimax_fit = model_sarimax.fit()

<br/>

## Plotting the model

In [None]:
df['forecast_sarimax']=model_sarimax_fit.predict(start=90, end=103, dynamic=True)      # 90 and 103 are the index range to be predicted
df[['Sales','forecast_sarimax']].plot(figsize = (12, 8))
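
The plots compare the forecasts by eye. A number can help too; below is a minimal sketch of root mean squared error (RMSE). The helper name `rmse` and the sample values are invented for illustration.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Invented numbers, only to show the call
score = rmse([100, 200, 300], [110, 190, 330])
print(score)
```

In this notebook, something like `rmse(df['Sales'].iloc[90:104], df['forecast_sarimax'].iloc[90:104])` would score the forecast on the same window that was predicted, assuming the forecast column has been filled for those rows.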

<br/>
<br/>

### Creating an additional dataset for forecasting

In [None]:
from pandas.tseries.offsets import DateOffset
future_dates = [df.index[-1] + DateOffset(months = x) for x in range(0, 24)]

In [None]:
future_date_dataset = pd.DataFrame(index = future_dates[1:], columns = df.columns)
future_date_dataset.tail()
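
The same future index can also be sketched with `pd.date_range`. The starting month below is a stand-in chosen for illustration; in the notebook it would be `df.index[-1]`.

```python
import pandas as pd

# Hypothetical last observed month, used only for illustration
last_month = pd.Timestamp("1972-09-01")

# 24 month-start dates beginning at last_month; drop the first, which is already observed
future_index = pd.date_range(start=last_month, periods=24, freq="MS")[1:]

print(future_index[0], future_index[-1], len(future_index))
```

This leaves 23 future months, matching what `future_dates[1:]` produces with the `DateOffset` approach.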

### Now concatenate the new dataset with the existing one

In [None]:
merged_df = pd.concat([df, future_date_dataset])
merged_df.tail()

In [None]:
df.tail()

In [None]:
merged_df['forecast_sarimax'] = model_sarimax_fit.predict(start =104, end = 120, dynamic= True )
merged_df[['Sales', 'forecast_sarimax']].plot(figsize = (12, 8))

<br/>
<br/>

Gratitude: Krish Naik
### Feel free to share your thoughts, and if you find this helpful, please upvote. It will keep me motivated. Thanks!