# How to choose ARIMA Orders using **Auto_Arima**

Before we can apply an ARIMA forecasting model, we need to review the components of one.<br>
ARIMA, or Autoregressive Integrated Moving Average, is actually a combination of 3 models:
* <strong>AR(p)</strong> Autoregression - a regression model that uses the dependence between the current observation and its own lagged values; p is the number of lags included.
* <strong>I(d)</strong> Integration - uses differencing of observations (subtracting the observation at the previous time step from the current observation) in order to make the time series stationary; d is the number of times the series is differenced.
* <strong>MA(q)</strong> Moving Average - a model that uses the dependence between the current observation and the residual errors from previous time steps; q is the number of lagged errors included.
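
To make the I(d) component concrete, here is a minimal sketch of first- and second-order differencing with NumPy. The toy series is made up for illustration and is not taken from either dataset used in this notebook.

```python
import numpy as np

# Hypothetical toy series with a curved upward trend (illustrative values only)
y = np.array([10.0, 12.0, 15.0, 19.0, 24.0, 30.0])

# d = 1: subtract the previous observation from the current one
first_diff = np.diff(y, n=1)

# d = 2: difference the already-differenced series once more
second_diff = np.diff(y, n=2)

print(first_diff)
print(second_diff)
```

In this toy case the first difference still trends upward while the second difference is constant, so two differencing passes would be needed to remove the trend.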



In [3]:
from google.colab import drive
drive.mount('/content/drive')




## Import Libs and load datasets

In [0]:
import pandas as pd
import numpy as np
%matplotlib inline

# Load a non-stationary dataset
df1 = pd.read_csv('drive/My Drive/Data/airline_passengers.csv',index_col='Month',parse_dates=True)
df1.index.freq = 'MS'

# Load a stationary dataset
df2 = pd.read_csv('drive/My Drive/Data/DailyTotalFemaleBirths.csv',index_col='Date',parse_dates=True)
df2.index.freq = 'D'

## pmdarima Auto-ARIMA


In [1]:
!pip install pmdarima


In [5]:
from pmdarima import auto_arima

# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")



In [6]:
auto_arima(df2['Births'])

ARIMA(maxiter=50, method='lbfgs', order=(1, 1, 1), out_of_sample_size=0,
      scoring='mse', scoring_args=None, seasonal_order=(0, 0, 0, 0),
      with_intercept=True)

In [7]:
auto_arima(df2['Births'],error_action='ignore').summary()

| Dep. Variable: | y | No. Observations: | 365 |
|---|---|---|---|
| Model: | SARIMAX(1, 1, 1) | Log Likelihood | -1226.077 |
| Date: | Wed, 06 May 2020 | AIC | 2460.154 |
| Time: | 09:25:25 | BIC | 2475.743 |
| Sample: | 0 - 365 | HQIC | 2466.35 |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 0.0132 | 0.014 | 0.975 | 0.330 | -0.013 | 0.040 |
| ar.L1 | 0.1299 | 0.059 | 2.217 | 0.027 | 0.015 | 0.245 |
| ma.L1 | -0.9694 | 0.016 | -62.235 | 0.000 | -1.000 | -0.939 |
| sigma2 | 48.9989 | 3.432 | 14.279 | 0.000 | 42.273 | 55.725 |

| Ljung-Box (Q): | 36.69 | Jarque-Bera (JB): | 26.17 |
|---|---|---|---|
| Prob(Q): | 0.62 | Prob(JB): | 0.0 |
| Heteroskedasticity (H): | 0.97 | Skew: | 0.58 |
| Prob(H) (two-sided): | 0.85 | Kurtosis: | 3.62 |


This shows a recommended (p,d,q) ARIMA Order of (1,1,1), with no seasonal_order component.

We can see how this was determined by looking at the stepwise results. The recommended order is the one with the lowest <a href='https://en.wikipedia.org/wiki/Akaike_information_criterion'>Akaike information criterion</a> (AIC) score. Note that the recommended model may <em>not</em> be the one with the closest fit to the training data. The AIC penalizes each additional estimated parameter, so it favors the simpler model when two candidates fit about equally well, with the aim of identifying the best <em>forecasting</em> model.
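
As a quick check of how the AIC trades fit against complexity, the sketch below recomputes the score from the summary above using the standard definition AIC = 2k - 2 ln(L). The helper name `aic` is ours, and counting `sigma2` as one of the k = 4 estimated parameters is an assumption that happens to reproduce the reported value.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Values read from the summary above: log-likelihood -1226.077 and
# 4 estimated parameters (intercept, ar.L1, ma.L1, sigma2).
births_aic = aic(-1226.077, 4)
print(round(births_aic, 3))
```

A model with one more parameter has to raise the log-likelihood by more than 1 to end up with a lower AIC.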

In [8]:
stepwise_fit = auto_arima(df2['Births'], start_p=0, start_q=0,
                          max_p=6, max_q=3, m=12,  # m has no effect here because seasonal=False
                          seasonal=False,
                          d=None, trace=True,      # d=None lets auto_arima choose d; trace prints each fit
                          error_action='ignore',   # silently skip orders that fail to fit
                          suppress_warnings=True,  # hide convergence warnings
                          stepwise=True)           # use the stepwise search rather than a full grid search

stepwise_fit.summary()

Performing stepwise search to minimize aic
Fit ARIMA(0,1,0)x(0,0,0,0) [intercept=True]; AIC=2650.760, BIC=2658.555, Time=0.023 seconds
Fit ARIMA(1,1,0)x(0,0,0,0) [intercept=True]; AIC=2565.234, BIC=2576.925, Time=0.057 seconds
Fit ARIMA(0,1,1)x(0,0,0,0) [intercept=True]; AIC=2463.584, BIC=2475.275, Time=0.150 seconds
Fit ARIMA(0,1,0)x(0,0,0,0) [intercept=False]; AIC=2648.768, BIC=2652.665, Time=0.018 seconds
Fit ARIMA(1,1,1)x(0,0,0,0) [intercept=True]; AIC=2460.154, BIC=2475.743, Time=0.288 seconds
Fit ARIMA(2,1,1)x(0,0,0,0) [intercept=True]; AIC=2461.271, BIC=2480.757, Time=0.406 seconds
Fit ARIMA(1,1,2)x(0,0,0,0) [intercept=True]; AIC=2460.689, BIC=2480.175, Time=0.737 seconds
Near non-invertible roots for order (1, 1, 2)(0, 0, 0, 0); setting score to inf (at least one inverse root too close to the border of the unit circle: 0.996)
Fit ARIMA(0,1,2)x(0,0,0,0) [intercept=True]; AIC=2460.722, BIC=2476.311, Time=0.297 seconds
Fit ARIMA(2,1,0)x(0,0,0,0) [intercept=True]; AIC=2536.154, BIC

| Dep. Variable: | y | No. Observations: | 365 |
|---|---|---|---|
| Model: | SARIMAX(1, 1, 1) | Log Likelihood | -1226.077 |
| Date: | Wed, 06 May 2020 | AIC | 2460.154 |
| Time: | 09:26:12 | BIC | 2475.743 |
| Sample: | 0 - 365 | HQIC | 2466.35 |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 0.0132 | 0.014 | 0.975 | 0.330 | -0.013 | 0.040 |
| ar.L1 | 0.1299 | 0.059 | 2.217 | 0.027 | 0.015 | 0.245 |
| ma.L1 | -0.9694 | 0.016 | -62.235 | 0.000 | -1.000 | -0.939 |
| sigma2 | 48.9989 | 3.432 | 14.279 | 0.000 | 42.273 | 55.725 |

| Ljung-Box (Q): | 36.69 | Jarque-Bera (JB): | 26.17 |
|---|---|---|---|
| Prob(Q): | 0.62 | Prob(JB): | 0.0 |
| Heteroskedasticity (H): | 0.97 | Skew: | 0.58 |
| Prob(H) (two-sided): | 0.85 | Kurtosis: | 3.62 |
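
The selection step can be reproduced by hand from the trace. The sketch below is not pmdarima's stepwise algorithm, which also decides which neighbouring orders to try next; it only shows the final "keep the lowest AIC" comparison, using the values from the complete trace lines above. The truncated last trace line is left out.

```python
import math

# (order, with_intercept, AIC) copied from the complete lines of the trace above.
# ARIMA(1,1,2) is scored as inf because the trace reports near non-invertible
# roots for it, even though its raw AIC was 2460.689.
candidates = [
    ((0, 1, 0), True, 2650.760),
    ((1, 1, 0), True, 2565.234),
    ((0, 1, 1), True, 2463.584),
    ((0, 1, 0), False, 2648.768),
    ((1, 1, 1), True, 2460.154),
    ((2, 1, 1), True, 2461.271),
    ((1, 1, 2), True, math.inf),
    ((0, 1, 2), True, 2460.722),
]

# Keep the candidate with the smallest AIC
best_order, best_intercept, best_aic = min(candidates, key=lambda c: c[2])
print(best_order, best_intercept, best_aic)
```

Among these candidates the minimum is ARIMA(1,1,1) with an intercept, which matches the model reported in the summary.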


___
Now let's look at the non-stationary, seasonal <strong>Airline Passengers</strong> dataset:

In [9]:
stepwise_fit = auto_arima(df1['Thousands of Passengers'], start_p=1, start_q=1,
                          max_p=3, max_q=3, m=12,  # m=12: monthly data with a yearly cycle
                          start_P=0, seasonal=True,
                          d=None, D=1, trace=True, # choose d automatically; fix one seasonal difference
                          error_action='ignore',   # silently skip orders that fail to fit
                          suppress_warnings=True,  # hide convergence warnings
                          stepwise=True)           # use the stepwise search rather than a full grid search

stepwise_fit.summary()

Performing stepwise search to minimize aic
Fit ARIMA(1,1,1)x(0,1,1,12) [intercept=True]; AIC=1024.824, BIC=1039.200, Time=0.737 seconds
Fit ARIMA(0,1,0)x(0,1,0,12) [intercept=True]; AIC=1033.479, BIC=1039.229, Time=0.031 seconds
Fit ARIMA(1,1,0)x(1,1,0,12) [intercept=True]; AIC=1022.316, BIC=1033.817, Time=0.571 seconds
Fit ARIMA(0,1,1)x(0,1,1,12) [intercept=True]; AIC=1022.904, BIC=1034.405, Time=0.668 seconds
Fit ARIMA(0,1,0)x(0,1,0,12) [intercept=False]; AIC=1031.508, BIC=1034.383, Time=0.029 seconds
Fit ARIMA(1,1,0)x(0,1,0,12) [intercept=True]; AIC=1022.343, BIC=1030.968, Time=0.130 seconds
Fit ARIMA(1,1,0)x(2,1,0,12) [intercept=True]; AIC=1021.137, BIC=1035.513, Time=1.749 seconds
Fit ARIMA(1,1,0)x(2,1,1,12) [intercept=True]; AIC=1017.166, BIC=1034.417, Time=5.850 seconds
Near non-invertible roots for order (1, 1, 0)(2, 1, 1, 12); setting score to inf (at least one inverse root too close to the border of the unit circle: 0.998)
Fit ARIMA(1,1,0)x(1,1,1,12) [intercept=True]; AIC=102

| Dep. Variable: | y | No. Observations: | 144 |
|---|---|---|---|
| Model: | SARIMAX(0, 1, 1)x(2, 1, 1, 12) | Log Likelihood | -501.92 |
| Date: | Wed, 06 May 2020 | AIC | 1015.841 |
| Time: | 09:27:03 | BIC | 1033.092 |
| Sample: | 0 - 144 | HQIC | 1022.85 |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 0.0003 | 0.032 | 0.010 | 0.992 | -0.062 | 0.063 |
| ma.L1 | -0.4243 | 0.068 | -6.211 | 0.000 | -0.558 | -0.290 |
| ar.S.L12 | 0.6656 | 0.155 | 4.296 | 0.000 | 0.362 | 0.969 |
| ar.S.L24 | 0.3330 | 0.096 | 3.479 | 0.001 | 0.145 | 0.521 |
| ma.S.L12 | -0.9754 | 1.265 | -0.771 | 0.441 | -3.454 | 1.503 |
| sigma2 | 110.3992 | 117.291 | 0.941 | 0.347 | -119.486 | 340.285 |

| Ljung-Box (Q): | 53.12 | Jarque-Bera (JB): | 7.57 |
|---|---|---|---|
| Prob(Q): | 0.08 | Prob(JB): | 0.02 |
| Heteroskedasticity (H): | 2.83 | Skew: | 0.1 |
| Prob(H) (two-sided): | 0.0 | Kurtosis: | 4.16 |
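
The `D=1, m=12` setting in the airline search corresponds to one round of seasonal differencing: each value has the value from the same month one year earlier subtracted from it. Here is a minimal sketch with NumPy on a made-up series (a linear trend plus a repeating 12-month pattern), not on the airline data itself.

```python
import numpy as np

m = 12                 # seasonal period: 12 months per year
t = np.arange(48)      # four hypothetical years of monthly time steps

# Hypothetical series: linear trend plus a pattern that repeats every 12 months
seasonal_pattern = np.tile(np.arange(m, dtype=float), 4)
y = 2.0 * t + seasonal_pattern

# D = 1: subtract the value from the same month one year earlier
seasonal_diff = y[m:] - y[:-m]

# d = 1 on top of that: ordinary first difference of the seasonally differenced series
regular_diff = np.diff(seasonal_diff)

print(seasonal_diff[:5])
print(regular_diff[:5])
```

In this toy series the seasonal difference removes the repeating pattern and leaves only a constant that comes from the trend, and the extra regular difference removes that as well. Real data such as the airline series will not reduce this cleanly.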
