# <center>Monthly Beer Production</center>

## Introduction
This notebook solves the following task:

Using only the following data (https://www.kaggle.com/sergiomora823/monthly-beer-production), please provide a forecast of monthly Australian beer production for the year 1996. Verbally summarize the forecast and give a comment on what you did, why you did what you did, and how you ended up with the final forecast. Use a Kaggle notebook.

The main areas covered are:

1. Overview of the data

2. Exploratory data analysis

3. SARIMA model

4. SARIMAX model

5. Prophet model

6. Comparison of models

7. Summary

### Importing required libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

import itertools
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose

import sklearn.metrics as sme
from math import sqrt

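try:
    from prophet import Prophet  # package name since Prophet 1.0
except ImportError:
    from fbprophet import Prophet  # package name in older releases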

import warnings
warnings.filterwarnings("ignore")

## Overview of the data

First, the dataset is loaded from a .csv file and basic information about it is displayed. A check for missing values finds none: the dataset contains no NaN values.

The dataset is a monthly time series of Australian beer production. It covers the period from January 1956 until August 1995.

In [None]:
df = pd.read_csv('../input/monthly-beer-production/datasets_56102_107707_monthly-beer-production-in-austr.csv', header=0)
df.head()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df.isnull().sum().sum()

The dataset consists of two columns, Month and Monthly beer production.


The Month column is converted to datetime type, and both columns are renamed for easier manipulation.

The datetime column is set as index.

In [None]:
df.Month = pd.to_datetime(df.Month)
df.columns = ['date', 'production']
df.set_index('date', inplace=True)
df.head()

## Exploratory data analysis

The series is transformed to its natural logarithm for further use.
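
The notebook does not state why logs are taken. One common motivation, given here as an assumption, is that seasonal swings which grow with the level of the series become roughly constant in size on the log scale. A minimal sketch on synthetic data (the series is made up for illustration and is not the beer dataset):

```python
import numpy as np

# Synthetic multiplicative series: a growing level times a fixed seasonal factor.
months = np.arange(240)                                  # 20 years of monthly steps
level = 100.0 * 1.005 ** months                          # steadily growing level
season = 1.0 + 0.2 * np.sin(2 * np.pi * months / 12)     # +/-20% seasonal swing
series = level * season

def yearly_range(values):
    """Peak-to-trough range within each block of 12 months."""
    blocks = values.reshape(-1, 12)
    return blocks.max(axis=1) - blocks.min(axis=1)

range_levels = yearly_range(series)
range_logs = yearly_range(np.log(series))

# Ratio of the last year's swing to the first year's swing.
ratio_levels = range_levels[-1] / range_levels[0]   # well above 1: swing grows with the level
ratio_logs = range_logs[-1] / range_logs[0]         # about 1: swing is stable in logs
print(ratio_levels, ratio_logs)
```

On the log scale an additive seasonal model is then a more natural fit, which is in line with the additive decomposition used later.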

In [None]:
df['log_production']=np.log(df['production'])

The following visualization shows beer production over the whole period, in both levels and logarithms. Several observations follow:

1. Production grew strongly until about 1974, after which the trend levelled off. In the early 1980s the trend turned slightly negative. Because the trend changes over time, the series appears to be non-stationary.

2. Production peaks around Christmas (December), which also falls in the Australian summer. It then declines and reaches its minimum during the mid-year holidays. This pattern repeats every year, so the series has a seasonal component.
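
The seasonal pattern in the second observation can also be checked numerically by averaging the series by calendar month. The sketch below assumes a frame shaped like `df` (a `DatetimeIndex` and a `production` column) but uses a small synthetic stand-in so that it runs on its own; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df: monthly index with a peak every December.
idx = pd.date_range("1956-01-01", periods=48, freq="MS")
month_numbers = idx.month.to_numpy()
demo = pd.DataFrame(
    {"production": 100 + 20 * np.cos(2 * np.pi * (month_numbers - 12) / 12)},
    index=idx,
)

# Average production per calendar month (1 = January, ..., 12 = December).
monthly_mean = demo["production"].groupby(demo.index.month).mean()

peak_month = monthly_mean.idxmax()     # calendar month with the highest average
trough_month = monthly_mean.idxmin()   # calendar month with the lowest average
print(peak_month, trough_month)
```

Applied to the real data, the same `groupby` on `df.index.month` would show which months carry the peak and the trough.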

In [None]:
fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(
    go.Scatter(x=df.index, y=df.production, name="Production"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df.index, y=df.log_production, name="Log Production"),
    secondary_y=True,
)

fig.update_layout(
    title_text="Beer Production"
)

fig.update_xaxes(title_text="Month")

fig.update_yaxes(title_text="Production", secondary_y=False)
fig.update_yaxes(title_text="Log Production", secondary_y=True)

fig.show()

Using the seasonal_decompose method, we decompose the time series into trend, seasonal and residual components, which are plotted together with the observed series.
The plot supports the initial observation that the data have a trend and are therefore not stationary.
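
To make clear what an additive decomposition does, here is a simplified sketch of the classical approach: estimate the trend with a centered moving average, then average the detrended values by calendar month to obtain the seasonal component. It illustrates the idea only and is not claimed to reproduce the `seasonal_decompose` output exactly. The function name `simple_additive_decompose` and the synthetic series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def simple_additive_decompose(series):
    """Split a monthly series into trend + seasonal + residual (period 12)."""
    period, half = 12, 6
    # 2 x 12 moving average: half weight on the two end points of a 13-month window.
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    core = np.convolve(series.to_numpy(), weights, mode="valid")
    pad = np.full(half, np.nan)          # no trend estimate for the first/last 6 months
    trend = pd.Series(np.r_[pad, core, pad], index=series.index)

    # Seasonal effect: mean detrended value per calendar month, centered at zero.
    detrended = series - trend
    by_month = detrended.groupby(series.index.month).mean()
    by_month = by_month - by_month.mean()
    seasonal = pd.Series(by_month.loc[series.index.month].to_numpy(), index=series.index)

    resid = series - trend - seasonal
    return trend, seasonal, resid

# Synthetic series: linear trend plus a fixed monthly pattern (illustrative only).
idx = pd.date_range("1956-01-01", periods=120, freq="MS")
t = np.arange(len(idx))
true_seasonal = 0.2 * np.sin(2 * np.pi * idx.month.to_numpy() / 12)
series = pd.Series(5.0 + 0.01 * t + true_seasonal, index=idx)

trend, seasonal, resid = simple_additive_decompose(series)
print(np.abs(seasonal.to_numpy() - true_seasonal).max())   # seasonal pattern is recovered
print(resid.dropna().abs().max())                          # residual is close to zero
```

On this noise-free example the residual is essentially zero; on real data it holds whatever the trend and seasonal parts do not explain.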

In [None]:
res = seasonal_decompose(df.log_production, model="additive")
#res = seasonal_decompose(df.production, method="additive")
fig = res.plot()
fig.set_figheight(8)
fig.set_figwidth(15)
plt.show()

The augmented Dickey-Fuller (ADF) test is used to check stationarity. At the 5% significance level, the test does not reject the unit-root null hypothesis, so the beer production series is treated as non-stationary.

In [None]:
stat_result = adfuller(df.log_production)
print('ADF Statistic: %f' % stat_result[0])
print('p-value: %f' % stat_result[1])
print('Critical Values:')
for key, value in stat_result[4].items():
    print('\t%s: %.3f' % (key, value))

The autocorrelation plot also indicates that the series is non-stationary, which is consistent with the result of the unit root test.
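
As a rough illustration of why a slowly decaying autocorrelation function points to non-stationarity, the sketch below computes sample autocorrelations by hand for a synthetic trending series and for white noise. The helper `sample_acf` is a hypothetical name, and both series are made up for illustration.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[: len(x) - k] * centered[k:]) / denom
        for k in range(max_lag + 1)
    ])

rng = np.random.default_rng(0)
noise = rng.normal(size=400)                  # stationary white noise
trending = 0.05 * np.arange(400) + noise      # linear trend plus the same noise

acf_trend = sample_acf(trending, 24)
acf_noise = sample_acf(noise, 24)

print(acf_trend[[1, 12, 24]])   # stays high even at long lags
print(acf_noise[[1, 12, 24]])   # close to zero at every lag
```

For the trending series the autocorrelations stay large at long lags, whereas for white noise they hover around zero.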

In [None]:
plt.figure(figsize=(20,5))
pd.plotting.autocorrelation_plot(df['log_production'])

### Splitting the dataset into train and test set

In [None]:
X = df.drop(['production'], axis=1)
start_date = '1957-01-01'
split_date = '1986-12-01'
train = X.loc[(X.index >= start_date) & (X.index <= split_date)].copy()
test = X.loc[X.index > split_date].copy()

In [None]:
test.rename(columns={'log_production': 'Test Set'}) \
    .join(train.rename(columns={'log_production': 'Training Set'}),how='outer') \
    .plot(figsize=(15,5), title='Beer production')
plt.show()

## SARIMA model

The exploratory analysis showed that the data are seasonal, so a seasonal ARIMA (SARIMA, Seasonal Auto Regressive Integrated Moving Average) model is used.

The ARIMA part of the model needs the parameters p, d and q:

p is the number of autoregressive (AR) lags; it can be estimated from the PACF plot

d is the degree of differencing

q is the order of the moving average (MA) part

The seasonal part of the model needs one more parameter, s:

s is the length of the seasonal cycle (12 months here)
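
To illustrate what differencing does, here is a small sketch on a synthetic series with a linear trend and a fixed 12-month pattern (both made up for illustration). One regular difference (d = 1) removes the linear trend, and one seasonal difference at lag 12 (seasonal differencing order D = 1 with s = 12, the values fixed in the grid search in this notebook) removes the repeating pattern.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus a pattern that repeats every 12 months.
idx = pd.date_range("1956-01-01", periods=60, freq="MS")
t = np.arange(len(idx))
series = pd.Series(2.0 * t + 10.0 * np.sin(2 * np.pi * t / 12), index=idx)

regular_diff = series.diff(1)         # d = 1: y_t - y_{t-1}
both_diffs = regular_diff.diff(12)    # D = 1, s = 12: also subtract the value 12 months earlier

max_remaining = both_diffs.dropna().abs().max()   # close to zero: trend and seasonality are gone
n_lost = int(both_diffs.isna().sum())             # observations lost to differencing
print(max_remaining, n_lost)
```

Each difference costs observations at the start of the sample: one for the regular difference and twelve for the seasonal one.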

Autocorrelogram and partial autocorrelogram can be used to estimate model parameters.

In [None]:
fig, ax = plt.subplots(1,2,figsize=(20,5))
plot_acf(df['log_production'], ax=ax[0])
plot_pacf(df['log_production'], ax=ax[1])
plt.show()

To find good parameters, a grid search is used to minimize AIC. The best parameters are used in the final model.
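
For orientation, the sketch below counts how many candidate models such a grid contains and shows the standard definition of AIC, $2k - 2\ln L$, where $k$ is the number of estimated parameters and $L$ the maximized likelihood. The ranges mirror the ones used in this notebook; the helper name `aic` is an assumption made for illustration.

```python
import itertools

# Same ranges as in the grid search: d and D are fixed at 1, s at 12.
p, d, q = range(0, 3), [1], range(0, 3)
P, D, Q, s = range(0, 3), [1], range(0, 3), [12]

pdq = list(itertools.product(p, d, q))                 # non-seasonal orders
seasonal_pdq = list(itertools.product(P, D, Q, s))     # seasonal orders
grid = list(itertools.product(pdq, seasonal_pdq))      # every combination of the two

print(len(pdq), len(seasonal_pdq), len(grid))  # 9 9 81

def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood
```

Every grid entry means one model fit, so widening any of the ranges multiplies the running time.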

In [None]:
# set parameters range
p,d,q = range(0,3),[1],range(0,3)
P,D,Q,s = range(0,3),[1],range(0,3),[12]

# list of all parameters combinations
pdq = list(itertools.product(p, d, q))
seasonal_pdq = list(itertools.product(P, D, Q, s))
all_param = list(itertools.product(pdq,seasonal_pdq))

params = [] 
params_s = [] 
aics = [] 
mses = [] 
cnt = 0 
for param in all_param: 
    
    mod = sm.tsa.statespace.SARIMAX(train,
                                order=param[0],
                                seasonal_order=param[1],
                                enforce_stationarity=False,
                                enforce_invertibility=False)

    results = mod.fit()

    pred = results.get_prediction("1987-01-01", "1995-08-01")

    params.append(param[0])
    params_s.append(param[1])
    aics.append(results.aic)
    mses.append(sme.mean_squared_error(test.log_production[1:],pred.predicted_mean[1:]))


    print('SARIMAX{}x{} - AIC:{} - MSE:{}'.format(param[0],
                                                  param[1],
                                                  results.aic,
                                                  mses[-1]))

# report the best parameter sets once the whole grid has been evaluated
min_ind = aics.index(min(aics))
bestparam = (params[min_ind],params_s[min_ind])
print('best_param_aic:',bestparam,' aic:',min(aics))
min_ind = mses.index(min(mses))
bestparam = (params[min_ind],params_s[min_ind])
print('best_param_mse:',bestparam,' mse:',min(mses))

print('DONE')

In [None]:
sarima = sm.tsa.statespace.SARIMAX(train, order=(2,1,2), seasonal_order=(0,1,1,12),
                                enforce_stationarity=False, enforce_invertibility=False).fit()
sarima.summary()

Residual analysis

In [None]:
res = sarima.resid
fig,ax = plt.subplots(2,1,figsize=(15,8))
fig = sm.graphics.tsa.plot_acf(res, ax=ax[0])
fig = sm.graphics.tsa.plot_pacf(res, ax=ax[1])
plt.show()

Prediction on test data and calculating MSE

In [None]:
pred_sarima = sarima.predict("1987-01-01", "1995-08-01")
print('SARIMA model MSE:{}'.format(sme.mean_squared_error(test.log_production,pred_sarima)))

Visualization of predicted data with target labels (test)

In [None]:
pd.DataFrame({'test':test.log_production,'pred':pred_sarima}).plot(figsize=(15,8))
plt.show()

## SARIMAX model

The SARIMAX model adds exogenous regressors to SARIMA.

Time-series features are created from the datetime index, together with lag features. Further work could include rolling-window and expanding-window features.
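
The following sketch shows how lag features behave at the start of the sample. Shifting by 12 months leaves the first 12 rows without a value, so dropping incomplete rows moves the first usable observation from January 1956 to January 1957, which matches the training start date used earlier. The frame is a synthetic stand-in with placeholder values, not the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: three years of monthly placeholder values from January 1956.
idx = pd.date_range("1956-01-01", periods=36, freq="MS")
demo = pd.DataFrame({"log_production": np.linspace(4.0, 5.0, len(idx))}, index=idx)

# Lags of 1, 3, 6 and 12 months, one column per lag.
lags = pd.concat(
    {f"t-{k}": demo["log_production"].shift(k) for k in [1, 3, 6, 12]},
    axis=1,
)

complete = lags.dropna()            # keep only rows where every lag is available
print(len(lags) - len(complete))    # 12
print(complete.index[0])            # 1957-01-01 00:00:00
```

The longest lag decides how many rows are lost, so adding a longer lag would shorten the usable sample further.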

In [None]:
def create_features(df):
    
    # datetime features
    df = df.copy()
    df['date'] = df.index
    df['month'] = df['date'].dt.month
    df['quarter'] = df['date'].dt.quarter
    df['year'] = df['date'].dt.year
    
    # lag features
    lags = pd.DataFrame(df.log_production.values, index=df.index)
    df_lags = pd.concat([lags.shift(i) for i in [1,3,6,12]], axis=1)
    df_lags.columns = ['t-1', 't-3', 't-6', 't-12']
   
    X = df[['month', 'quarter', 'year']].join(df_lags)
    return X

X = create_features(df)
display(X)

In [None]:
# splitting features
split_date = '1986-12-01'
X_train = X.loc[X.index <= split_date].copy().dropna()
X_test = X.loc[X.index > split_date].copy().dropna()

In [None]:
# set parameters range
p,d,q = range(0,3),[1],range(0,3)
P,D,Q,s = range(0,3),[1],range(0,3),[12]

# list of all parameters combinations
pdq = list(itertools.product(p, d, q))
seasonal_pdq = list(itertools.product(P, D, Q, s))
all_param = list(itertools.product(pdq,seasonal_pdq))

params = [] 
params_s = [] 
aics = [] 
mses = [] 
cnt = 0 
for param in all_param: 
    
    mod = sm.tsa.statespace.SARIMAX(train,
                                order=param[0],
                                exog = X_train,
                                seasonal_order=param[1],
                                
                                enforce_stationarity=False,
                                enforce_invertibility=False)

    results = mod.fit()

    pred = results.get_prediction("1987-01-01", "1995-08-01", exog=X_test)

    params.append(param[0])
    params_s.append(param[1])
    aics.append(results.aic)
    mses.append(sme.mean_squared_error(test.log_production[1:],pred.predicted_mean[1:]))


    print('SARIMAX{}x{} - AIC:{} - MSE:{}'.format(param[0],
                                                  param[1],
                                                  results.aic,
                                                  mses[-1]))

# report the best parameter sets once the whole grid has been evaluated
min_ind = aics.index(min(aics))
bestparam = (params[min_ind],params_s[min_ind])
print('best_param_aic:',bestparam,' aic:',min(aics))
min_ind = mses.index(min(mses))
bestparam = (params[min_ind],params_s[min_ind])
print('best_param_mse:',bestparam,' mse:',min(mses))

print('DONE')

In [None]:
sarimax = sm.tsa.statespace.SARIMAX(train,order=(2,1,1),seasonal_order=(0,1,1,12),exog = X_train,
                                  enforce_stationarity=False, enforce_invertibility=False,).fit()

In [None]:
sarimax.summary()

In [None]:
res = sarimax.resid
fig,ax = plt.subplots(2,1,figsize=(15,8))
fig = sm.graphics.tsa.plot_acf(res, ax=ax[0])
fig = sm.graphics.tsa.plot_pacf(res, ax=ax[1])
plt.show()

In [None]:
pred_sarimax = sarimax.predict("1987-01-01", "1995-08-01", exog=X_test)
print('SARIMAX model MSE:{}'.format(sme.mean_squared_error(test.log_production,pred_sarimax)))

In [None]:
pd.DataFrame({'test':test.log_production,'pred':pred_sarimax}).plot(figsize=(15,8))
plt.show()

## Prophet model

The Prophet model expects the columns to be named in a specific way: the datetime column as ds and the target variable as y. The columns are renamed accordingly.

In [None]:
prophet = Prophet()
prophet.fit(train.reset_index().rename(columns={'date':'ds','log_production':'y'}))
predictions = prophet.predict(df=test.reset_index().rename(columns={'date':'ds'}))
predictions.head()

Plot components of predictions.

In [None]:
fig = prophet.plot_components(predictions)

Visualize predictions with target data.

In [None]:
f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(15)
ax.plot(test.index, test['log_production'], color='r')
fig = prophet.plot(predictions, ax=ax)

Calculate MSE.

In [None]:
print('Prophet model MSE:{}'.format(sme.mean_squared_error(test.log_production,predictions['yhat'])))

Dataframe with resulting predictions.

In [None]:
pred_prophet = pd.DataFrame(predictions[["ds","yhat"]])
pred_prophet.columns = ["date", "yhat"]
pred_prophet.set_index("date", inplace=True)
display(pred_prophet)

## Comparison of models

To compare all models, the production values are transformed from logarithms back to levels. Several metrics are used for evaluation.
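
The sketch below shows, on made-up numbers, how log-scale predictions are mapped back to levels with `np.exp` and how MSE, RMSE and MAE are then computed by hand. The values are placeholders chosen for illustration and are not results from this notebook.

```python
import numpy as np

actual = np.array([150.0, 160.0, 140.0, 170.0])              # placeholder production levels
pred_log = np.log(np.array([145.0, 165.0, 150.0, 160.0]))    # placeholder predictions on the log scale

pred = np.exp(pred_log)          # back-transform from logs to levels
errors = actual - pred

mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error, in the units of the data
mae = np.mean(np.abs(errors))    # mean absolute error

print(round(mse, 2), round(rmse, 2), round(mae, 2))
```

Metrics computed on levels are expressed in production units, which makes them easier to interpret than errors on the log scale.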

In [None]:
df_results = test.copy()  # copy so that the test set itself is not modified
df_results["production"]=np.exp(df_results.log_production)
df_results["pred_sarima"]=np.exp(pred_sarima)
df_results["pred_sarimax"]=np.exp(pred_sarimax)
df_results["pred_prophet"]=np.exp(pred_prophet["yhat"])

In [None]:
# Build one long-format frame for plotting: one block of rows per series.
# The last observation is left out of the plot.
series_labels = {
    "production": "label",
    "pred_sarima": "SARIMA",
    "pred_sarimax": "SARIMAX",
    "pred_prophet": "Prophet",
}

x = df_results.index.to_numpy()[:-1]
frames = [
    pd.DataFrame({'month': x, 'production': df_results[col].to_numpy()[:-1], 'data': label})
    for col, label in series_labels.items()
]
df_result_plot = pd.concat(frames)

In [None]:
fig = px.line(df_result_plot, x="month", y="production", color='data', title='Beer Production Predictions')
fig.show()

In [None]:
def evaluate_model(y_test, y_pred, model):
    """Return a one-column table of error metrics, rounded to two decimals."""
    mse = sme.mean_squared_error(y_test, y_pred)
    metrics = {
        "R2 score": sme.r2_score(y_test, y_pred),
        "Mean Squared Error": mse,
        "RMSE": np.sqrt(mse),
        "Mean Absolute Error": sme.mean_absolute_error(y_test, y_pred),
        "Median Absolute Error": sme.median_absolute_error(y_test, y_pred),
    }
    return pd.Series(metrics, name=model).round(2).to_frame()

In [None]:
df_results_table = evaluate_model(df_results.production, df_results.pred_sarima, "SARIMA") \
        .join(evaluate_model(df_results.production, df_results.pred_sarimax, "SARIMAX")) \
        .join(evaluate_model(df_results.production, df_results.pred_prophet, "Prophet"))

display(df_results_table)

Based on these results, SARIMA is the best-performing model, so it is used to forecast beer production for 1996.

In [None]:
predictions = sarima.predict("1996-01-01", "1996-12-01")

In [None]:
predictions = np.exp(predictions)
display(predictions)

In [None]:
predictions.to_csv('submission.csv')

## Summary

Three methods were compared in this notebook for forecasting Australian beer production. The SARIMA model performed best on the test data, so it was used to forecast beer production for 1996. According to the resulting predictions, beer production in 1996 will be similar to that of the previous three years.