# ARIMA and SARIMA Forecasting (statsmodels)

## Why ARIMA?
- ARIMA captures autocorrelation with a compact set of parameters.
- Differencing removes trend and makes the series approximately stationary.
- Seasonal ARIMA (SARIMA) adds seasonal autoregressive and moving-average terms.

## Notation
ARIMA(p, d, q):
- **p**: autoregressive lags
- **d**: order of differencing
- **q**: moving-average lags

SARIMA(p, d, q) x (P, D, Q, s):
- **P, D, Q**: seasonal counterparts of p, d, and q, applied at multiples of lag s
- **s**: seasonal period (e.g., 12 for monthly data)

## Core equations
ARIMA can be written as:

\[
(1 - \phi_1 L - \cdots - \phi_p L^p)(1 - L)^d y_t
= (1 + \theta_1 L + \cdots + \theta_q L^q)\varepsilon_t
\]
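
As a worked instance of the equation above, take ARIMA(1, 1, 1), so p = d = q = 1. Writing \(\phi\) and \(\theta\) for \(\phi_1\) and \(\theta_1\) (shorthand used only here) and expanding the lag polynomials gives an explicit recursion in the levels of the series:

\[
(1 - \phi L)(1 - L) y_t = (1 + \theta L)\varepsilon_t
\quad\Longrightarrow\quad
y_t = (1 + \phi) y_{t-1} - \phi y_{t-2} + \varepsilon_t + \theta \varepsilon_{t-1}
\]

The product \((1 - \phi L)(1 - L) = 1 - (1 + \phi)L + \phi L^2\) has a root at \(L = 1\), which is the unit root introduced by differencing.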

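SARIMA multiplies both sides by seasonal polynomials in \(L^s\) and adds a seasonal difference:

\[
\Phi(L^s) (1 - \phi_1 L - \cdots - \phi_p L^p) (1 - L^s)^D (1 - L)^d y_t
= \Theta(L^s) (1 + \theta_1 L + \cdots + \theta_q L^q)\varepsilon_t
\]

where \(\Phi(L^s) = 1 - \Phi_1 L^s - \cdots - \Phi_P L^{Ps}\) and \(\Theta(L^s) = 1 + \Theta_1 L^s + \cdots + \Theta_Q L^{Qs}\).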
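
To see which lags a multiplicative seasonal model actually touches, the left-hand-side polynomials can be multiplied numerically. The sketch below does this for SARIMA(1, 1, 1) x (1, 1, 1, 12); the coefficient values `phi` and `Phi` and the helper `lag_poly` are hypothetical and chosen only for illustration.

```python
import numpy as np

def lag_poly(coefs_by_lag, max_lag):
    """Return an array c where c[k] is the coefficient on L**k."""
    c = np.zeros(max_lag + 1)
    for lag, value in coefs_by_lag.items():
        c[lag] = value
    return c

s = 12
phi, Phi = 0.5, 0.3  # hypothetical AR and seasonal AR coefficients

ar = lag_poly({0: 1.0, 1: -phi}, 1)      # 1 - phi L
sar = lag_poly({0: 1.0, s: -Phi}, s)     # 1 - Phi L^s
diff = lag_poly({0: 1.0, 1: -1.0}, 1)    # 1 - L
sdiff = lag_poly({0: 1.0, s: -1.0}, s)   # 1 - L^s

# Multiplying polynomials in L is a convolution of their coefficient arrays
lhs = ar
for poly in (sar, diff, sdiff):
    lhs = np.convolve(lhs, poly)

nonzero_lags = np.flatnonzero(np.abs(lhs) > 1e-12)
print("highest lag:", len(lhs) - 1)
print("lags with nonzero coefficients:", nonzero_lags.tolist())
```

With only two estimated AR-side coefficients, the expanded polynomial reaches back 26 periods, which is why a seasonal model stays compact in parameters while still linking observations a year or two apart.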


In [None]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go

rng = np.random.default_rng(42)
n = 240
period = 12
t = np.arange(n)
trend = 0.03 * t
seasonal = 2.0 * np.sin(2 * np.pi * t / period)

# Simple ARMA-style noise
noise = rng.normal(scale=1.0, size=n)
arma = np.zeros(n)
for i in range(1, n):
    arma[i] = 0.6 * arma[i - 1] + noise[i] + 0.4 * noise[i - 1]

series = 10 + trend + seasonal + arma
# Month-end frequency; pandas >= 2.2 prefers the alias "ME" over "M"
index = pd.date_range("2010-01-01", periods=n, freq="M")
y = pd.Series(series, index=index, name="y")

fig = go.Figure()
fig.add_trace(go.Scatter(x=y.index, y=y, name="series"))
fig.update_layout(title="Synthetic seasonal series", xaxis_title="Time", yaxis_title="Value")
fig

## Train/test split
Use the last two years as a holdout set.


In [None]:
train = y.iloc[:-24]
test = y.iloc[-24:]

fig = go.Figure()
fig.add_trace(go.Scatter(x=train.index, y=train, name="train"))
fig.add_trace(go.Scatter(x=test.index, y=test, name="test"))
fig.update_layout(title="Train vs test", xaxis_title="Time", yaxis_title="Value")
fig

## Differencing and ACF/PACF
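A first difference removes a linear trend; a seasonal difference at lag s removes a stable seasonal pattern, and the two are often applied together.
ACF and PACF plots of the differenced series can guide the choice of p and q: a sharp cutoff in the PACF points to an AR order, and a sharp cutoff in the ACF points to an MA order.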


In [None]:
from statsmodels.tsa.stattools import acf, pacf

# First difference
train_diff = train.diff().dropna()

acf_vals = acf(train_diff, nlags=36, fft=True)
pacf_vals = pacf(train_diff, nlags=36)
lags = np.arange(len(acf_vals))

fig = go.Figure()
fig.add_trace(go.Bar(x=lags, y=acf_vals, name="ACF"))
fig.update_layout(title="ACF of differenced series", xaxis_title="Lag", yaxis_title="Correlation")
fig.show()

fig = go.Figure()
fig.add_trace(go.Bar(x=lags, y=pacf_vals, name="PACF"))
fig.update_layout(title="PACF of differenced series", xaxis_title="Lag", yaxis_title="Partial correlation")
fig
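
The cell above applies only a first difference. To illustrate what the seasonal difference adds, here is a small sketch on the deterministic part of the synthetic series (trend plus sine, no noise, which is an assumption made to isolate the effect). The names `signal`, `first_diff`, and `both_diff` are illustrative.

```python
import numpy as np
import pandas as pd

period = 12
t = np.arange(240)

# Deterministic trend + seasonality only (no noise), to isolate the effect
signal = pd.Series(10 + 0.03 * t + 2.0 * np.sin(2 * np.pi * t / period))

first_diff = signal.diff().dropna()              # (1 - L) y_t
both_diff = signal.diff(period).diff().dropna()  # (1 - L)(1 - L^s) y_t

print(f"std after first difference:           {first_diff.std():.4f}")
print(f"std after seasonal + first difference: {both_diff.std():.4f}")
```

The first difference removes the trend but leaves a seasonal oscillation, while the combined difference reduces this noise-free signal to zero up to floating-point error. On real data the remainder is the stochastic part that the ARMA terms are meant to model.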

## Fit a SARIMA model
We start with a reasonable baseline order and refine later based on diagnostics.


In [None]:
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(
    train,
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, period),
    enforce_stationarity=False,
    enforce_invertibility=False,
)
result = model.fit(disp=False)
result.summary()


## Forecast and visualize


In [None]:
forecast = result.get_forecast(steps=len(test))
forecast_mean = forecast.predicted_mean
forecast_ci = forecast.conf_int()

fig = go.Figure()
fig.add_trace(go.Scatter(x=train.index, y=train, name="train"))
fig.add_trace(go.Scatter(x=test.index, y=test, name="test"))
fig.add_trace(go.Scatter(x=forecast_mean.index, y=forecast_mean, name="forecast"))
fig.add_trace(
    go.Scatter(
        x=forecast_ci.index,
        y=forecast_ci.iloc[:, 0],
        name="lower",
        line=dict(width=0),
        showlegend=False,
    )
)
fig.add_trace(
    go.Scatter(
        x=forecast_ci.index,
        y=forecast_ci.iloc[:, 1],
        name="upper",
        line=dict(width=0),
        fill="tonexty",
        fillcolor="rgba(0, 100, 80, 0.2)",
        showlegend=False,
    )
)
fig.update_layout(title="SARIMA forecast with confidence interval", xaxis_title="Time", yaxis_title="Value")
fig

## Accuracy metrics


In [None]:
mae = np.mean(np.abs(test - forecast_mean))
rmse = np.sqrt(np.mean((test - forecast_mean) ** 2))

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")


## Residual diagnostics
Well-specified models leave residuals that look like white noise.


In [None]:
resid = result.resid.dropna()

fig = go.Figure()
fig.add_trace(go.Histogram(x=resid, nbinsx=30, name="residuals"))
fig.update_layout(title="Residual distribution", xaxis_title="Residual", yaxis_title="Count")
fig.show()

resid_acf = acf(resid, nlags=36, fft=True)

fig = go.Figure()
fig.add_trace(go.Bar(x=np.arange(len(resid_acf)), y=resid_acf, name="ACF"))
fig.update_layout(title="Residual ACF", xaxis_title="Lag", yaxis_title="Correlation")
fig

## Practical tips
- Start with small orders and increase only if diagnostics justify it.
- Use differencing to remove trend; use seasonal differencing for strong seasonality.
- Compare models with AIC/BIC and check residual autocorrelation.
- Always benchmark against simple baselines (last value, seasonal naive).
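
The two baselines named in the last tip can be sketched with NumPy and pandas alone. The example below builds a simplified version of the synthetic series (plain Gaussian noise instead of the ARMA noise, an assumption made to keep it short) and scores both baselines on a 24-month holdout; `naive`, `seasonal_naive`, and `mae` are illustrative names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n, period, horizon = 240, 12, 24
t = np.arange(n)
values = 10 + 0.03 * t + 2.0 * np.sin(2 * np.pi * t / period) + rng.normal(size=n)
y_demo = pd.Series(values, index=pd.date_range("2010-01-01", periods=n, freq="MS"))
train_demo, test_demo = y_demo.iloc[:-horizon], y_demo.iloc[-horizon:]

# Last-value (naive) forecast: repeat the final training observation
naive = pd.Series(train_demo.iloc[-1], index=test_demo.index)

# Seasonal naive forecast: repeat the last full season of the training data
last_season = train_demo.iloc[-period:].to_numpy()
reps = int(np.ceil(horizon / period))
seasonal_naive = pd.Series(np.tile(last_season, reps)[:horizon], index=test_demo.index)

def mae(actual, predicted):
    """Mean absolute error between two aligned series."""
    return float(np.mean(np.abs(actual - predicted)))

mae_naive = mae(test_demo, naive)
mae_seasonal = mae(test_demo, seasonal_naive)
print(f"MAE naive:          {mae_naive:.3f}")
print(f"MAE seasonal naive: {mae_seasonal:.3f}")
```

A SARIMA model is only worth its extra complexity if its holdout error is clearly lower than these numbers computed on the same split.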

## Exercises
1. Change the seasonal period to 6 or 24 and refit SARIMA. How do the forecasts change?
2. Try ARIMA(2, 1, 2) without seasonal terms. Compare MAE and RMSE.
3. Replace the synthetic series with a real dataset and repeat the workflow.

## Further reading
- Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis: Forecasting and Control.
- Hyndman, R. J. and Athanasopoulos, G. Forecasting: Principles and Practice.
- Statsmodels documentation for SARIMAX and ARIMA.
