## ARIMA and SARIMA
### on Perrin-freres-Monthly-Champagne-sales

### Autoregressive Integrated Moving Average
The general process for building an ARIMA model is as follows:
1. Visualize the time series data.
2. Make the time series data stationary.
3. Plot the autocorrelation (ACF) and partial autocorrelation (PACF) charts.
4. Construct the ARIMA and SARIMA models based on the data.
5. Use the models to make predictions.

In [None]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/perrin-freres-monthly-champagne-sales/perrin-freres-monthly-champagne.csv')
df.head(5)

In [None]:
df.columns = ['Months','Sales']
df.head(5)

In [None]:
df.isnull()

In [None]:
# Drop rows 105 and 106 (the trailing rows flagged by the null check above),
# then confirm that no nulls remain
df.drop([105, 106], axis=0, inplace=True)
df.isnull().sum()

In [None]:
df.dtypes

In [None]:
df['Months'] = pd.to_datetime(df['Months'])
df.dtypes

In [None]:
df.set_index('Months',inplace=True)
df.head(5)

## Visualize the Data

In [None]:
df.plot(figsize= (12,8))

Test for stationarity using the Augmented Dickey-Fuller (ADF) test.

In [None]:
from statsmodels.tsa.stattools import adfuller

In [None]:
# Null hypothesis (H0): the series has a unit root, i.e. it is not stationary.
# Rejecting H0 (p-value <= 0.05) is evidence that the series is stationary.
def adfuller_test(sales):
    result = adfuller(sales)
    labels = ['ADF Test statistic', 'p-value', '#lags used', 'Number of Observations']
    # zip stops after the first four entries of the result tuple
    for value, label in zip(result, labels):
        print(label + ': ' + str(value))
    if result[1] <= 0.05:
        print("The dataset is stationary: reject the null hypothesis")
    else:
        print("The dataset is not stationary: fail to reject the null hypothesis")

In [None]:
adfuller_test(df['Sales'])

The p-value is greater than 0.05, so we fail to reject the null hypothesis. The test gives no evidence of stationarity, so we treat the data as non-stationary.

## Differencing

In [None]:
df['Seasonal Sales Diff'] = df['Sales'] - df['Sales'].shift(12)
df

In [None]:
adfuller_test(df['Seasonal Sales Diff'].dropna())

In [None]:
df['Seasonal Sales Diff'].plot(figsize=(12,8))

After seasonal differencing, the data looks stationary.
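Seasonal differencing subtracts from each observation the value recorded 12 months earlier, which cancels any pattern that repeats every year. The sketch below illustrates this on a synthetic monthly series (invented for illustration, not the champagne data) made of a linear trend plus a fixed 12-month pattern.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data: 4 years of a linear trend plus a repeating yearly pattern
idx = pd.date_range("2000-01-01", periods=48, freq="MS")
pattern = np.tile(np.arange(12) * 10.0, 4)   # same 12 values every year
trend = 2.0 * np.arange(48)                  # increases by 2 per month
series = pd.Series(trend + pattern, index=idx)

# Same operation as in the notebook: value minus the value 12 months earlier
seasonal_diff = series - series.shift(12)

# pandas offers an equivalent shorthand
seasonal_diff_short = series.diff(12)

print(seasonal_diff.dropna().unique())
```

The yearly pattern cancels out and only the 12-month change in the trend remains. The first 12 values are `NaN` because they have no observation 12 months earlier, which is why the notebook calls `dropna()` before testing.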

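The AR order (p) is identified from the PACF: it is the lag after which the PACF shuts off, that is, drops to near zero.
The MA order (q) is identified from the ACF rather than the PACF: it is the lag after which the ACF shuts off.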
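The "shut off" behaviour can be checked on a simulated AR(1) process, where the true order is known. This is a minimal NumPy-only sketch: `sample_acf` is a hypothetical helper written for this example, and the coefficient 0.7, the seed and the series length are arbitrary choices. The lag-2 partial autocorrelation is computed from the first two autocorrelations as (r2 - r1^2) / (1 - r1^2).

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelations for lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([1.0] + [np.dot(x[:-k], x[k:]) / denom
                             for k in range(1, nlags + 1)])

rng = np.random.default_rng(0)   # arbitrary seed
n, phi = 20000, 0.7              # illustrative length and AR(1) coefficient
noise = rng.standard_normal(n)

# Simulate y[t] = phi * y[t-1] + noise[t]
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = phi * ar1[t - 1] + noise[t]

r = sample_acf(ar1, 3)
pacf_lag1 = r[1]                                   # PACF at lag 1 equals ACF at lag 1
pacf_lag2 = (r[2] - r[1] ** 2) / (1 - r[1] ** 2)   # PACF at lag 2

print("ACF  lags 1-3:", r[1:])
print("PACF lags 1-2:", pacf_lag1, pacf_lag2)
```

For an AR(1) process the ACF decays gradually, while the PACF is large at lag 1 and close to zero at lag 2, which is the pattern that suggests p = 1.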

In [None]:
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf

In [None]:
fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(211)
# dropna() removes the 12 leading NaNs created by the 12-month shift
fig = plot_acf(df['Seasonal Sales Diff'].dropna(), lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = plot_pacf(df['Seasonal Sales Diff'].dropna(), lags=40, ax=ax2)

Here we need to choose three values: p, d and q.
- p: number of autoregressive (AR) lags
- d: order of differencing (the Integrated part)
- q: number of moving-average (MA) lags
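For reference, these three orders enter the standard ARIMA(p, d, q) equation as shown below, written with the lag operator $L$ (where $L y_t = y_{t-1}$). The symbols $\phi_i$, $\theta_j$, $c$ and $\varepsilon_t$ are notation introduced here for illustration and do not appear elsewhere in this notebook.

$$
\left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d \, y_t = c + \left(1 + \sum_{j=1}^{q} \theta_j L^j\right)\varepsilon_t
$$

With $d = 1$, the factor $(1 - L)\,y_t = y_t - y_{t-1}$ is the ordinary first difference. The 12-month seasonal difference used earlier corresponds to $(1 - L^{12})\,y_t = y_t - y_{t-12}$.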

In [None]:
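# For non-seasonal data
# p=1, d=1, q can be 0 or 1
# The older statsmodels.tsa.arima_model module was removed in statsmodels 0.13;
# ARIMA now lives in statsmodels.tsa.arima.model
from statsmodels.tsa.arima.model import ARIMA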

In [None]:
model = ARIMA(df['Sales'],order=(1,1,1))
model_fit = model.fit()

In [None]:
model_fit.summary()

In [None]:
df.tail(20)

In [None]:
df['forecast'] = model_fit.predict(start=90,end=103,dynamic=True)
df[['Sales','forecast']].plot(figsize=(12,8))

In [None]:
import statsmodels.api as sm

In [None]:
model = sm.tsa.statespace.SARIMAX(df['Sales'],order=(1,1,1),seasonal_order=(1,1,1,12))
results = model.fit()

In [None]:
df['forecast'] = results.predict(start=90,end=103,dynamic=True)
df[['Sales','forecast']].plot(figsize=(12,8))

Add future dates to the dataset so the model can forecast beyond the observed data.

In [None]:
from pandas.tseries.offsets import DateOffset

In [None]:
future_dates = [df.index[-1]+ DateOffset(months=x) for x in range(0,24)]

In [None]:
future_dataset_df = pd.DataFrame(index=future_dates[1:],columns=df.columns)

In [None]:
future_dataset_df.tail(20)

In [None]:
future_dataset_df.shape

In [None]:
future_df = pd.concat([df,future_dataset_df])
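The same future index can also be built with `pd.date_range`, which avoids generating and then discarding the first element. The sketch below compares the two approaches; `last_month` is a hypothetical stand-in for `df.index[-1]`, and its value is assumed for illustration.

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset

# Hypothetical last observed month, standing in for df.index[-1]
last_month = pd.Timestamp("1972-09-01")

# Approach used in the notebook: offsets 0..23, then skip offset 0
manual = [last_month + DateOffset(months=x) for x in range(0, 24)][1:]

# Alternative: 23 month-start timestamps beginning one month after the last observation
future_index = pd.date_range(start=last_month + DateOffset(months=1),
                             periods=23, freq="MS")

# Empty frame with the future dates as index, ready to be concatenated
future_frame = pd.DataFrame(index=future_index, columns=["Sales"])
print(future_frame.shape)
```

Both approaches yield the same 23 monthly timestamps. The `date_range` version additionally carries an explicit monthly frequency on the index.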

In [None]:
future_df['forecast'] = results.predict(start=104,end=120,dynamic=True)
future_df[['Sales','forecast']].plot(figsize=(12,8))