# Introduction 

Recently, I started working on forecasting time series with statistical methods. I took a course on this about five years ago and could not remember anything from it. So, as I always do, I googled how it is done. To my surprise, everyone talks about time series forecasting with ARIMA and SARIMA. However, when it comes to choosing the parameters, the explanations get blurry. 

There are some predefined libraries/modules that do this selection for you. However, if you are like me and want to know exactly how it is done, you are in the right place. Even if you use automatic methods, it doesn't hurt to know what happens inside the black box. 

So I spent some time searching, watching videos, reading books, etc., and I think I finally figured it out. At least, the results are better and the methodology makes sense. 

I then decided to share a summary, a sort of guide or cheat sheet, for selecting ARIMA/SARIMA parameters. 

# Methodology 

1. First, you should know the nature of the signal/series. I suggest plotting it to see whether you can recognize a pattern and whether the curve has a trend. Sometimes this is clear to the eye; other times you need to look closer. For example, you can look at the decomposition of the series using the seasonal_decompose method. The autocorrelation (ACF) and partial autocorrelation (PACF) of the series can also help. If the trend is not that clear, you can check whether the signal is stationary or not (a trend makes it non-stationary) with a hypothesis test such as the Augmented Dickey-Fuller test. 

2. Now you know the composition of your signal. Next comes the math part: we difference the signal until it no longer has a seasonal component or a trend. In other words, until it becomes stationary.

    a. If your signal has **Trend** and **Seasonality**, you will typically difference at the seasonal period first and then apply a simple difference to eliminate the trend, like **signal.diff(12).diff(1)** for a signal with a seasonality of 12 and a linear trend. Usually, a single seasonal differencing is enough. However, you might need more than one simple (trend) differencing. PS: the number of times you difference for seasonality and for trend corresponds to the D and d parameters of the SARIMA model (i.e. the Integrated part of SARIMA).
    
    b. If your signal has only **Trend**: easy, you only do simple differencing and you will use the ARIMA model since there is no **S**easonality. 
    
    c. If it has only **Seasonality**, you do the seasonal differencing; you will use the SARIMA model, and the trend differencing parameter "d" equals zero. 
    
    d. If there is **no** trend **nor** seasonality, your signal is already stationary and you can directly use ARIMA with d equal to zero. However, predicting it can be quite hard, because a stationary series with little autocorrelation is close to random noise (white noise). 
    
3. Once your signal is stationary, we identify the other parameters: p and P for the auto-regressive and seasonal auto-regressive components, and q and Q for the moving-average and seasonal moving-average components. **The Tricky Part**: 

    a. The ACF plot shows how correlated your signal is with its previous lags. This plot helps identify the moving-average components. Look at the low-order lags and see which ones are significant; a common rule of thumb is to take as **q** the last significant lag before the ACF drops inside the confidence band. Now, if you have a seasonal component, look at the multiples of the seasonal period; the number of significant seasonal lags is your **Q**. 
    
    b. The PACF plot gives an idea of the auto-regressive part of the model. As with the ACF, we identify **p** from the low-order lags, and **P** (if there is a seasonal component) from the multiples of the seasonal period. 
    
This gives you a good starting point. You can always turn to automated modules, and you can push the results further with a grid search. 
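To make step 2 more concrete, here is a minimal sketch on a synthetic monthly signal (an assumption for illustration, not one of the datasets used below). I deliberately use a curved trend so that both differences do visible work; with an exactly linear trend, the seasonal difference alone would already leave a constant.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (assumption): curved trend + fixed 12-month pattern, no noise
n_months = 72
t = np.arange(n_months)
seasonal_pattern = np.array([5, 3, 0, -2, -4, -5, -4, -2, 0, 2, 4, 6], dtype=float)
signal = pd.Series(
    0.5 * t**2 + seasonal_pattern[t % 12],
    index=pd.date_range("2000-01-01", periods=n_months, freq="MS"),
)

# Seasonal differencing (D = 1, m = 12): the repeating pattern cancels out,
# but a trend is still left (the values keep increasing).
seasonal_diff = signal.diff(12).dropna()

# Simple differencing (d = 1): removes the remaining trend, leaving a constant.
stationary = seasonal_diff.diff(1).dropna()

print(seasonal_diff.head(3).tolist())  # still trending
print(stationary.unique())             # a single constant value
```

Each differencing step costs observations (12 for the seasonal one, 1 for the simple one), which is why the examples below call dropna() before testing for stationarity.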

Now, let's go for some examples of this. 

##### Importing needed modules

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib.pyplot as plt 
import seaborn as sns 
import warnings 
warnings.filterwarnings('ignore')
import itertools
from statsmodels.tsa.stattools import adfuller
from sklearn.metrics import mean_squared_error
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
import statsmodels.api as sm

##### Defining helper functions 

In [None]:
def check_stationarity(series):
    result = adfuller(series)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    if result[1]<0.05:
        print('Time Series is stationary')
    else: 
        print('Time Series is not stationary')

In [None]:
def series_decomposition(series, method='additive'):
    result = seasonal_decompose(series, model=method)
    result.plot()
    plt.show()

In [None]:
def plot_acf_pacf_graphs(series):
    fig, ax = plt.subplots(2,1)
    fig = sm.graphics.tsa.plot_acf(series, lags=25, ax=ax[0])
    fig = sm.graphics.tsa.plot_pacf(series, lags=25, ax=ax[1])
    plt.tight_layout()
    plt.show()

In [None]:
def arima_modeling(series, params):
    mod = sm.tsa.arima.ARIMA(series,order=params)
    results = mod.fit()
    print('ARIMA{} - AIC:{}'.format(params,results.aic))
    print(results.summary())
    results.plot_diagnostics(figsize=(18, 8))
    plt.show()

In [None]:
def arima_prediction(series, params, start_point):
    model = sm.tsa.arima.ARIMA(series,order=params).fit()
    pred = model.get_prediction(start=start_point, dynamic=False)
    pred_ci = pred.conf_int()
    ax = series.plot(label='observed')
    pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 4))
    ax.fill_between(pred_ci.index, pred_ci.iloc[:, 0], pred_ci.iloc[:, 1], color='k', alpha=.2)
    ax.set_xlabel('Date')
    ax.set_ylabel('Quantity')
    plt.legend()
    plt.show()

In [None]:
def arima_walk_forward_validation(series, params, test_size):
    n_train = int(len(series) * (1-test_size))
    train, test = series.values[0:n_train], series.values[n_train:len(series)]
    history = [x for x in train]
    predictions = list()
    # walk-forward validation
    for t in range(len(test)):
        model = sm.tsa.arima.ARIMA(history, order=params)
        model_fit = model.fit()
        output = model_fit.forecast()
        yhat = output[0]
        predictions.append(yhat)
        obs = test[t]
        history.append(obs)
    # evaluate forecasts
    rmse = np.sqrt(mean_squared_error(test, predictions))
    print('Test RMSE: %.3f' % rmse)
    # plot forecasts against actual outcomes
    plt.plot(test)
    plt.plot(predictions, color='red')
    plt.show()
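The walk-forward idea used above can be sketched without any model library. In this minimal version, `walk_forward_rmse` and `last_value_forecast` are hypothetical names, and the "model" is a naive stand-in (predict the last observed value) rather than ARIMA; only the expanding-window loop is the point.

```python
import numpy as np

def walk_forward_rmse(values, forecast_fn, test_size=0.2):
    """Expanding-window, one-step-ahead validation; returns the test RMSE."""
    values = np.asarray(values, dtype=float)
    n_train = int(len(values) * (1 - test_size))
    history = list(values[:n_train])
    errors = []
    for actual in values[n_train:]:
        prediction = forecast_fn(history)  # forecast using past data only
        errors.append(actual - prediction)
        history.append(actual)             # then reveal the true observation
    return float(np.sqrt(np.mean(np.square(errors))))

def last_value_forecast(history):
    """Hypothetical stand-in model: the next value equals the last one."""
    return history[-1]

# Toy series 0, 1, ..., 7: the last 2 points form the test set
print(walk_forward_rmse(np.arange(8), last_value_forecast, test_size=0.25))
```

The key design choice is that the true observation, not the prediction, is appended to the history at each step, so every forecast is a genuine one-step-ahead forecast.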

In [None]:
def arima_walk_forward_forecast(series, params, steps=5):
    history = series.copy()
    predictions = [history.iloc[-1]]
    predictions_ci_min = [history.iloc[-1]]
    predictions_ci_max = [history.iloc[-1]]
    predictions_ci_index = [history.index[-1]]
    for t in range(steps):
        model = sm.tsa.arima.ARIMA(history, order=params)
        model_fit = model.fit()
        # Compute the one-step forecast once and reuse it
        forecast = model_fit.get_forecast()
        forecast_ci = forecast.conf_int()
        predictions.append(forecast.predicted_mean.iloc[0])
        predictions_ci_min.append(forecast_ci.iloc[0, 0])
        predictions_ci_max.append(forecast_ci.iloc[0, 1])
        predictions_ci_index.append(forecast_ci.index[0])
        # Series.append was removed in pandas 2.0; pd.concat replaces it
        history = pd.concat([history, forecast.predicted_mean])
    plt.figure(figsize=(14, 4))
    plt.plot(predictions_ci_index, predictions, label='Walk-Forward ahead Forecast', alpha=.7, color='red')
    plt.plot(series, label='observed', color='blue')
    plt.fill_between(predictions_ci_index, predictions_ci_min, predictions_ci_max, color='k', alpha=.2)
    plt.xlabel('Date')
    plt.ylabel('Quantity')
    plt.legend()
    plt.show()

In [None]:
def sarimax_modeling(series, params, s_params):
    model = sm.tsa.statespace.SARIMAX(series, order=params, 
                                      seasonal_order=s_params).fit(maxiter=50, method='powell')
    print('SARIMAX{}x{} - AIC:{}'.format(params, s_params, model.aic))
    print(model.summary())
    model.plot_diagnostics(figsize=(18, 8))
    plt.show()

In [None]:
def sarimax_prediction(series, params, s_params, start_point):
    model = sm.tsa.statespace.SARIMAX(series, order=params, 
                                      seasonal_order=s_params).fit(maxiter=50, method='powell', disp=False)
    pred = model.get_prediction(start=start_point, dynamic=False)
    pred_ci = pred.conf_int()
    ax = series.plot(label='observed')
    pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 4))
    ax.fill_between(pred_ci.index, pred_ci.iloc[:, 0], pred_ci.iloc[:, 1], color='k', alpha=.2)
    ax.set_xlabel('Date')
    ax.set_ylabel('Quantity')
    plt.legend()
    plt.show()

In [None]:
def sarimax_walk_forward_validation(series, params, s_params, test_size):
    n_train = int(len(series) * (1-test_size))
    train, test = series.values[0:n_train], series.values[n_train:len(series)]
    history = [x for x in train]
    predictions = list()
    # walk-forward validation
    for t in range(len(test)):
        model = sm.tsa.statespace.SARIMAX(history, order=params, seasonal_order=s_params)
        model_fit = model.fit(maxiter=50, method='powell', disp=False)
        output = model_fit.forecast()
        yhat = output[0]
        predictions.append(yhat)
        obs = test[t]
        history.append(obs)
    # evaluate forecasts
    rmse = np.sqrt(mean_squared_error(test, predictions))
    print('Test RMSE: %.3f' % rmse)
    # plot forecasts against actual outcomes
    plt.plot(test)
    plt.plot(predictions, color='red')
    plt.show()

In [None]:
def sarimax_walk_forward_forecast(series, params, s_params, steps=5):
    history = series.copy()
    predictions = [history.iloc[-1]]
    predictions_ci_min = [history.iloc[-1]]
    predictions_ci_max = [history.iloc[-1]]
    predictions_ci_index = [history.index[-1]]
    for t in range(steps):
        model = sm.tsa.statespace.SARIMAX(history, order=params, seasonal_order=s_params)
        model_fit = model.fit(maxiter=50, method='powell', disp=False)
        # Compute the one-step forecast once and reuse it
        forecast = model_fit.get_forecast()
        forecast_ci = forecast.conf_int()
        predictions.append(forecast.predicted_mean.iloc[0])
        predictions_ci_min.append(forecast_ci.iloc[0, 0])
        predictions_ci_max.append(forecast_ci.iloc[0, 1])
        predictions_ci_index.append(forecast_ci.index[0])
        # Series.append was removed in pandas 2.0; pd.concat replaces it
        history = pd.concat([history, forecast.predicted_mean])
    plt.figure(figsize=(14, 4))
    plt.plot(predictions_ci_index, predictions, label='Walk-Forward ahead Forecast', alpha=.7, color='red')
    plt.plot(series, label='observed', color='blue')
    plt.fill_between(predictions_ci_index, predictions_ci_min, predictions_ci_max, color='k', alpha=.2)
    plt.xlabel('Date')
    plt.ylabel('Quantity')
    plt.legend()
    plt.show()

## Example 1: Beer Sales

In [None]:
beer = pd.read_csv('../input/for-simple-exercises-time-series-forecasting/BeerWineLiquor.csv')
beer.head()

In [None]:
beer['date'] = pd.to_datetime(beer['date'])
beer = beer.set_index('date')
beer.index = pd.DatetimeIndex(beer.index.values, freq=beer.index.inferred_freq)

In [None]:
beer.head()

Let's plot the beer sales.

In [None]:
plt.plot(beer)
plt.plot(beer.rolling(window=12).mean())
plt.show()

Well, it seems like we have a trend since the moving average is increasing. Plus, it looks like we have a seasonal pattern in the data. 

Let's decompose the time series and verify this. 

In [None]:
series_decomposition(beer)

As suspected, we have a "linear" trend plus a seasonal pattern. I suspect the seasonal period is 12 months. 

Let's validate this with a statistical test of stationarity.

In [None]:
check_stationarity(beer)

Our time series is not stationary and it has both a trend and a seasonal component. Thus, we will use the SARIMAX model. Therefore, we need to define the following parameters: **SARIMAX=(p,d,q)x(P,D,Q,m)**, where: 
- p is the order of the auto-regressive component, 
- d is the order of the differencing used to eliminate the trend, 
- q is the order of the moving-average component,  
- P is the order of the seasonal auto-regressive component, 
- D is the order of the seasonal differencing, 
- Q is the order of the seasonal moving-average component, and 
- m is the seasonal period.
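For reference, these parameters fit together in the usual textbook form of the seasonal ARIMA model. This is a sketch of the standard notation, not something taken from the statsmodels output; the symbols $B$, $\phi$, $\Phi$, $\theta$, $\Theta$ and $\varepsilon_t$ are introduced here, and sign conventions for the moving-average terms vary between references.

```latex
\phi_p(B)\,\Phi_P(B^m)\,(1-B)^d\,(1-B^m)^D\,y_t \;=\; \theta_q(B)\,\Theta_Q(B^m)\,\varepsilon_t
\qquad\text{with}\qquad B\,y_t = y_{t-1},
\\[4pt]
\phi_p(B) = 1 - \phi_1 B - \dots - \phi_p B^p,
\qquad
\theta_q(B) = 1 + \theta_1 B + \dots + \theta_q B^q .
```

Here $(1-B)^d$ is the simple differencing, $(1-B^m)^D$ is the seasonal differencing, and $\Phi_P$, $\Theta_Q$ are the seasonal counterparts of $\phi_p$, $\theta_q$ written in powers of $B^m$.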

Let's start differencing the time series. 

PS we have both a trend and a seasonal component. 

In [None]:
plot_acf_pacf_graphs(beer)

Using the ACF and PACF of the time series, we can see very significant lags at the multiples of 12. From this we can deduce that the seasonal period is 12 months. 
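Reading the seasonal period off the ACF can also be done numerically. Below is a minimal sketch on a synthetic signal (an assumption, not the beer data); `sample_acf` is a small hypothetical helper implementing the standard sample autocorrelation, and the band 1.96/sqrt(n) is the usual approximate 95% limit under a white-noise assumption. On real data with a trend, I would difference or detrend first.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (standard biased estimator)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom for k in range(max_lag + 1)])

# Synthetic monthly signal (assumption): a 12-month sine wave plus a little noise
rng = np.random.default_rng(0)
n = 240
t = np.arange(n)
series = np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal(n)

acf = sample_acf(series, max_lag=24)
band = 1.96 / np.sqrt(n)  # approximate 95% band under the white-noise assumption

# The lag (>= 1) with the strongest positive autocorrelation is the candidate period
seasonal_lag = int(np.argmax(acf[1:])) + 1

print("approximate band: +/-%.3f" % band)
print("lag 6: %.2f, lag 12: %.2f, lag 24: %.2f" % (acf[6], acf[12], acf[24]))
print("candidate seasonal period:", seasonal_lag)
```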

In [None]:
plt.plot(beer.diff(12))
plt.plot(beer.diff(12).rolling(window=10).mean())
plt.show()

Here, we have eliminated the seasonal component. Still, we can see a trend in our series, hence the need for a simple (first-order) differencing. 

In [None]:
plt.plot(beer.diff(12).diff(1))
plt.plot(beer.diff(12).diff(1).rolling(window=10).mean())
plt.show()

Now, this looks more stationary. Let's check it using the statistical test. 

In [None]:
check_stationarity(beer.diff(12).diff(1).dropna())

The time series is now stationary. Thus, we have the following parameters already found: 
- m = 12 
- d = 1 
- D = 1

Let's now find the other components.

In [None]:
diff_beer = beer.diff(12).diff(1).dropna()

In [None]:
plot_acf_pacf_graphs(diff_beer)

Now, this is the tricky part. We should look at the ACF and PACF plots to determine the components. We look at the multiples of the seasonal period to detect the seasonal components and at the low-order lags for the ARIMA components. 

- ACF: we can see there are several significant low-order lags (1, 3, and 4). Let's set q = 3. 
- ACF: the seasonal component seems to be 0 here, since lag 12 is not significant and lag 24 is only slightly significant. We can ignore this or we can set the component to 1. Let's start with Q = 0.
- PACF: we also have some significant low-order lags. Let's set p = 2. 
- PACF: lag 12 is ambiguous; we can consider P to be 0 or 1. Let's start with P = 0.
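To see why the ACF is the plot to read for the moving-average order, here is a small simulation sketch (synthetic data and a hypothetical `sample_acf` helper, both assumptions). For an MA(1) process the ACF should drop inside the band right after lag 1, whereas for an AR(1) process it decays slowly, which is why the AR order is read from the PACF instead.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (standard biased estimator)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom for k in range(max_lag + 1)])

rng = np.random.default_rng(42)
n = 5000
eps = rng.standard_normal(n + 1)

# MA(1): y_t = eps_t + 0.8 * eps_{t-1}
ma1 = eps[1:] + 0.8 * eps[:-1]

# AR(1): y_t = 0.8 * y_{t-1} + eps_t
ar1 = np.zeros(n)
for i in range(1, n):
    ar1[i] = 0.8 * ar1[i - 1] + eps[i]

ma_acf = sample_acf(ma1, 4)
ar_acf = sample_acf(ar1, 4)
band = 1.96 / np.sqrt(n)  # approximate 95% band under the white-noise assumption

print("band: +/-%.3f" % band)
print("MA(1) ACF, lags 1-4:", np.round(ma_acf[1:], 2))  # cuts off after lag 1
print("AR(1) ACF, lags 1-4:", np.round(ar_acf[1:], 2))  # decays gradually
```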

So now we have our model **SARIMAX=(2,1,3)x(0,1,0,12)**.

In [None]:
sarimax_modeling(beer, params=(2,1,3), s_params=(0,1,0,12))

We can see that all the coefficients of the model have fairly large values. Plus, all the components of the model are statistically significant (see the P>|z| column).

Let's see how this model performs on predictions and on walk-forward validation.

In [None]:
beer.head()

In [None]:
sarimax_prediction(beer, params=(2,1,3), s_params=(0,1,0,12), start_point=pd.to_datetime('2016-01-01'))

Seems like a fair prediction. 

In [None]:
sarimax_walk_forward_validation(beer,params=(2,1,3), s_params=(0,1,0,12), test_size=0.3)

Not bad at all. :) 

Now, let's forecast the sales for the next year. 

In [None]:
beer

In [None]:
sarimax_walk_forward_forecast(beer['beer'],params=(2,1,3), s_params=(0,1,0,12), steps=12)

## Example 2: Monthly Average Temperature

In [None]:
temp = pd.read_csv('../input/open-time-series-data/dataset/Meteorology/Monthly New York City average temperature degrees C  Jan 1946  Dec 195.csv')
temp.head()

In [None]:
temp.drop(columns=['Unnamed: 0'], inplace=True)

In [None]:
len(pd.date_range('1946-01-01','1960-01-01' , freq='1M')), len(temp)

In [None]:
temp['date'] = pd.date_range('1946-01-01','1960-01-01' , freq='1M')

In [None]:
temp = temp.set_index('date')
temp.index = pd.DatetimeIndex(temp.index.values, freq=temp.index.inferred_freq)

In [None]:
plt.plot(temp['x'])
plt.show()

It does not seem like we have a trend here. Let's check stationarity.

In [None]:
check_stationarity(temp['x'])

This is a temperature series. So I guess we have a seasonal component.

In [None]:
series_decomposition(temp['x'])

We validate this with the ACF and PACF graphs. 

In [None]:
plot_acf_pacf_graphs(temp['x'])

On the ACF and PACF we have significant lags at the multiples of 12. Therefore, this time series has a seasonal component with a period of 12 months. 

Let's difference out the seasonal component.

In [None]:
plt.plot(temp['x'].diff(12))
plt.plot(temp['x'].diff(12).rolling(window=10).mean())
plt.show()

In [None]:
check_stationarity(temp['x'].diff(12).dropna())

So this series will be modeled with SARIMAX=(p,d,q)x(P,D,Q,m). We have already identified the following parameters: 
- m = 12 
- D = 1 
- d = 0 since no differencing is needed to eliminate a trend. 

In [None]:
diff_temp = temp['x'].diff(12).dropna()

In [None]:
plot_acf_pacf_graphs(diff_temp)

From these plots we can identify the following: 
- ACF: we can see only two significant lags: lag 1, which implies **q = 1**, and lag 12, which implies **Q = 1**. 
- PACF: we can see that lag 1 is significant, which implies **p = 1**, and two multiples of 12 are significant: lag 12 and lag 24. Thus, we set **P = 2**.

This leads to the model **SARIMAX=(1,0,1)x(2,1,1,12)**.

In [None]:
sarimax_modeling(temp, params=(1,0,1), s_params=(2,1,1,12))

In [None]:
sarimax_prediction(temp, params=(1,0,1), s_params=(2,1,1,12), start_point=pd.to_datetime('1957-01-31'))

In [None]:
sarimax_walk_forward_validation(temp, params=(1,0,1), s_params=(2,1,1,12), test_size=0.1)

How about predicting the future? 

In [None]:
sarimax_walk_forward_forecast(temp['x'], params=(1,0,1), s_params=(2,1,1,12), steps=12)

## Example 3: Annual Average Temperature

In [None]:
temperature = pd.read_csv('../input/open-time-series-data/dataset/Meteorology/Average annual temperature central England 1723  1970.csv')
temperature.head()

In [None]:
temperature.drop(columns=['Unnamed: 0'], inplace=True)

In [None]:
len(pd.date_range('1723-01-01','1971-01-01' , freq='1Y')), len(temperature)

In [None]:
temperature['date'] = pd.date_range('1723-01-01','1971-01-01' , freq='1Y')
temperature = temperature.set_index('date')
temperature.index = pd.DatetimeIndex(temperature.index.values, freq=temperature.index.inferred_freq)

In [None]:
plt.plot(temperature['x'])
plt.show()

In [None]:
series_decomposition(temperature['x'])

We can see that we don't have a seasonal component, and the trend seems random. This suggests that the series is already stationary. Let's verify this. 

In [None]:
check_stationarity(temperature['x'])

As suspected, this series is already stationary. Thus, no differencing is needed, so d = 0. And since there is no seasonal component, all we need is an ARIMA model. 

In [None]:
plot_acf_pacf_graphs(temperature['x'])

From the figure, we identify the following parameters: 
- ACF, we have a significant lag at 2. Thus, q=2.
- PACF, we have also significant lag at 2. Therefore, p=2.

In [None]:
arima_modeling(temperature['x'], (2,0,2))

In [None]:
arima_prediction(temperature['x'], (2,0,2), start_point=pd.to_datetime('1923-12-31'))

In [None]:
arima_walk_forward_validation(temperature['x'], (2,0,2), test_size=0.2)

In [None]:
arima_walk_forward_forecast(temperature['x'], (2,0,2), steps=12)

This is about the limit of the ARIMA model here: its forecasting capacity is very limited. Maybe it is not well suited to the problem considered, i.e. the annual average temperature, mostly because temperature changes depend on a variety of factors, such as CO and CO2 emissions.  

## Example 4: Annual changes in the Earth's rotation

In [None]:
df = pd.read_csv('../input/open-time-series-data/dataset/Physics/Annual changes in the earths rotation day length sec105 18211970.csv')
df.head()

In [None]:
df.drop(columns=['Unnamed: 0'], inplace=True)

In [None]:
len(pd.date_range('1821-01-01','1971-01-01' , freq='1Y')), len(df)

In [None]:
df['date'] = pd.date_range('1821-01-01','1971-01-01', freq='1Y')
df = df.set_index('date')
df.index = pd.DatetimeIndex(df.index.values, freq=df.index.inferred_freq)

In [None]:
plt.plot(df['x'])
plt.show()

In [None]:
check_stationarity(df['x'])

So the series is not stationary; there is a trend visible in the plot. Let's see if there is any seasonality.

In [None]:
series_decomposition(df['x'])

So no seasonality, and a trend. This calls for ARIMA model. 

In [None]:
plt.plot(df['x'].diff(1))
plt.show()

In [None]:
check_stationarity(df['x'].diff(1).dropna())

Good, so the d parameter is equal to 1. Now, let's try to find the other parameters. 

In [None]:
plot_acf_pacf_graphs(df['x'].diff(1).dropna())

The ACF suggests that we need a moving-average component with q = 2 (lag 2 is significant). 
Maybe also an auto-regressive component with p = 4. 

Let's try these components together and separately. So we will study 3 models: 
1. ARIMA(4,1,2)
2. ARIMA(4,1,0)
3. ARIMA(0,1,2)

These models are compared: 

In [None]:
arima_modeling(df['x'], (4,1,2))

In [None]:
arima_modeling(df['x'], (4,1,0))

In [None]:
arima_modeling(df['x'], (0,1,2))

Based on the AIC we can see that the best model is ARIMA(4,1,2). 
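As a reminder of what this comparison measures, the AIC trades goodness of fit against the number of estimated parameters: AIC = 2k - 2 ln(L), and lower is better. The sketch below uses hypothetical log-likelihoods and parameter counts (they are NOT the values from the fits above) just to show that a richer model with a slightly better likelihood can still lose.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidates (illustrative numbers only, not results from this notebook)
candidates = {
    "large": {"llf": -100.0, "k": 7},  # e.g. 4 AR + 2 MA + 1 noise variance
    "small": {"llf": -101.0, "k": 3},  # e.g. 2 MA + 1 noise variance
}

scores = {name: aic(c["llf"], c["k"]) for name, c in candidates.items()}
best = min(scores, key=scores.get)

print(scores)
print("preferred by AIC:", best)
```

The large model fits slightly better (higher log-likelihood), but its four extra parameters cost more than the fit gains, so the small one gets the lower AIC.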

Let's see their performance in the predictions. 

In [None]:
arima_prediction(df['x'], (4,1,2), start_point=pd.to_datetime('1945-12-31'))

In [None]:
arima_prediction(df['x'], (4,1,0), start_point=pd.to_datetime('1945-12-31'))

In [None]:
arima_prediction(df['x'], (0,1,2), start_point=pd.to_datetime('1945-12-31'))

Their performance is quite similar; we cannot decide from the plots alone. This result is to be expected, since the AIC doesn't vary much from one model to the other. Let's walk-forward validate them. 

In [None]:
arima_walk_forward_validation(df['x'], (4,1,2), test_size=0.2)

In [None]:
arima_walk_forward_validation(df['x'], (4,1,0), test_size=0.2)

In [None]:
arima_walk_forward_validation(df['x'], (0,1,2), test_size=0.2)

Surprisingly, the results from the walk-forward validation are better for ARIMA(4,1,0). 

Let's predict the evolution of this measure in the next 4 years. 

In [None]:
arima_walk_forward_forecast(df['x'], (4,1,2), steps=4)

In [None]:
arima_walk_forward_forecast(df['x'], (4,1,0), steps=4)

In [None]:
arima_walk_forward_forecast(df['x'], (0,1,2), steps=4)

We can see that the forecasted values of the first two models vary quite a bit, while those of the third one stabilize quickly. This is explained by the third model having a moving-average component without an auto-regressive one: beyond q steps ahead, a pure MA(q) model has no past shocks left to use, so its prediction settles on the mean of the (differenced) series, which gives a flat forecast. 
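This mean-reversion of pure moving-average forecasts can be checked by hand. The sketch below assumes known coefficients and residuals (hypothetical numbers) and computes textbook multi-step MA(q) forecasts; `ma_forecast` is a hypothetical helper. The walk-forward function above refits the model at each step, so its numbers will differ, but the mechanism is the same.

```python
import numpy as np

def ma_forecast(theta, last_residuals, steps, mean=0.0):
    """Multi-step forecasts of y_t = mean + e_t + sum_i theta[i-1] * e_{t-i}.

    last_residuals[0] is the most recent residual e_t, last_residuals[1] is e_{t-1}, ...
    Future shocks are unknown and replaced by their expected value, 0.
    """
    q = len(theta)
    forecasts = []
    for h in range(1, steps + 1):
        # Only already-observed residuals contribute: terms with i >= h
        known = sum(theta[i - 1] * last_residuals[i - h] for i in range(h, q + 1))
        forecasts.append(mean + known)
    return np.array(forecasts)

# Hypothetical MA(2) with theta = (0.5, 0.3), e_t = 2.0 and e_{t-1} = -1.0
forecasts = ma_forecast([0.5, 0.3], [2.0, -1.0], steps=4)
print(forecasts)  # the last two entries equal the mean (0 here)
```

For ARIMA(0,1,2), these forecasts apply to the differenced series, so once they hit the mean the forecast of the level itself stops moving (or moves by a constant amount if the mean is not zero).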

So finally, we could keep either the first or the second model. 

Of course, we can push further and optimize the parameters using a grid search (but this is not the objective of this notebook). 
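For the curious, a grid search over (p, d, q) can be sketched in a few lines. `grid_search` and `toy_score` are hypothetical names; the scorer here is a stand-in so the sketch runs on its own, and in practice you would plug in something like the AIC of a fitted model, as shown in the comment.

```python
import itertools

def grid_search(score_fn, p_values, d_values, q_values):
    """Return (best_order, best_score) over all (p, d, q); lower score is better."""
    best_order, best_score = None, float("inf")
    for order in itertools.product(p_values, d_values, q_values):
        try:
            score = score_fn(order)
        except Exception:
            continue  # skip orders for which the model cannot be fitted
        if score < best_score:
            best_order, best_score = order, score
    return best_order, best_score

# With statsmodels, the scorer could be (assuming `series` holds the data):
#   score_fn = lambda order: sm.tsa.arima.ARIMA(series, order=order).fit().aic

# Stand-in scorer (assumption): pretends that (1, 1, 1) is the optimum
def toy_score(order):
    p, d, q = order
    return (p - 1) ** 2 + (d - 1) ** 2 + (q - 1) ** 2

print(grid_search(toy_score, range(3), range(2), range(3)))
```

Starting the grid around the values read from the ACF and PACF plots keeps the search small, which matters because every candidate means one model fit.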


# Conclusion and discussion: 

Let me review the objective of this notebook and what it covers: 
- I wanted to share my understanding of how the parameters of ARIMA and SARIMA models are selected, since most of the material available online does not explain this point, 
- I explored different examples to better show how this selection of parameters is done:
- an example with trend and seasonality, 
- an example with only seasonality, 
- an example with neither trend nor seasonality, and 
- an example with only trend. 

Now, you don't have to go through this selection process for every time series you try to forecast or analyze. There are predefined libraries that optimize the parameters, with or without a grid search, for example "pmdarima". Still, I personally prefer this longer method. 

There are other methods of time series forecasting, like recurrent neural networks, that can in some cases perform better than statistical methods. Such models can integrate more variables than the signal alone. However, statistical methods can add some diversity to your models (say, if you want to build an ensemble), or you could use their predictions as an input to other models. 

I hope this notebook is of help to you. Please let me know if there are improvements to be made. 