# Autoregressive (AR) models (from scratch NumPy + Plotly)

An **autoregressive (AR)** model explains a time series using its own past values.

In this notebook you will:
- Precisely define the **AR(p)** model in math (LaTeX)
- Understand the **algorithm** (fit → diagnose → forecast)
- Learn the **assumptions**, especially **stationarity**
- Choose the lag order **p** (AIC/BIC intuition + implementation)
- Implement AR fitting + forecasting in **pure NumPy** (no statsmodels)
- Visualize with **Plotly**:
  - effect of different lag orders
  - prediction vs actual
  - residual behavior


In [None]:
import os
import sys

import numpy as np
import plotly
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

# Plotly notebooks: default to the Jupyter "notebook" renderer unless PLOTLY_RENDERER overrides it.
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
pio.templates.default = "plotly_white"

print("Python:", sys.version.split()[0])
print("NumPy:", np.__version__)
print("Plotly:", plotly.__version__)


## 1) What the AR model is (algorithm view)

### Model idea
An AR model says:
> **Today is a linear combination of the last _p_ values plus noise.**

### Fit / forecast algorithm (high-level)
Given a univariate series \(y_0, y_1, \dots, y_{T-1}\):

1. **Choose a lag order** \(p\) (domain knowledge, AIC/BIC, validation).
2. Build a **lagged design matrix** \(X\) where each row contains \([1, y_{t-1}, \dots, y_{t-p}]\).
3. **Estimate coefficients** \(\beta = [c, \varphi_1, \dots, \varphi_p]^\top\) by least squares.
4. Produce predictions:
   - **one-step ahead**: use real past values (best-case prediction)
   - **multi-step forecast**: feed your own predictions back in (real forecasting)
5. **Diagnose residuals** (should look like white noise):
   - mean near 0, constant variance
   - weak autocorrelation (ACF near 0 for lags > 0)
6. Iterate on \(p\) (and preprocessing like differencing/detrending) until diagnostics are acceptable.


## 2) The AR(p) model in LaTeX + step-by-step math

### Scalar form
The **AR(p)** model is:

$$
y_t = c + \sum_{i=1}^{p} \varphi_i\, y_{t-i} + \varepsilon_t
$$

- \(y_t\): value at time \(t\)
- \(c\): intercept (drift)
- \(\varphi_i\): lag-\(i\) coefficient
- \(\varepsilon_t\): noise term (ideally **white noise**)

If the process is stationary, the unconditional mean exists and is
$$
\mu = \mathbb{E}[y_t] = \frac{c}{1 - \sum_{i=1}^p \varphi_i}.
$$
Stationarity guarantees that the denominator is nonzero, because \(z = 1\) cannot be a root of the characteristic polynomial.
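
The formula follows from taking expectations on both sides of the AR(p) equation. A short derivation, assuming only stationarity (so \(\mathbb{E}[y_{t-i}] = \mu\) for every lag) and zero-mean noise:

$$
\mu = \mathbb{E}[y_t] = c + \sum_{i=1}^{p} \varphi_i\, \mathbb{E}[y_{t-i}] + \mathbb{E}[\varepsilon_t]
    = c + \mu \sum_{i=1}^{p} \varphi_i
\quad\Longrightarrow\quad
\mu \Big(1 - \sum_{i=1}^{p} \varphi_i\Big) = c .
$$

As a numeric check with the AR(3) parameters simulated later in this notebook (\(c = 0.2\), \(\varphi = (0.65, -0.25, 0.15)\)):

$$
\sum_{i=1}^{3} \varphi_i = 0.55, \qquad \mu = \frac{0.2}{1 - 0.55} = \frac{0.2}{0.45} \approx 0.444 .
$$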

### Matrix (regression) form
For \(t = p, p+1, \dots, T-1\), define a row vector
$$
x_t = [1,\; y_{t-1},\; y_{t-2},\; \dots,\; y_{t-p}] 
$$
and stack these into a matrix \(X\). Also stack the targets into a vector \(\mathbf{y} = [y_p, y_{p+1}, \dots, y_{T-1}]^\top\).

Then:
$$
\mathbf{y} = X\,\beta + \varepsilon, \quad \beta = [c, \varphi_1, \dots, \varphi_p]^\top
$$
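
To make the indexing concrete, here is a minimal sketch that builds \(X\) and \(\mathbf{y}\) for a toy series with \(p = 2\). The series values and the names `y_toy`, `X_toy`, and `target_toy` are illustrative assumptions, not part of the pipeline used later.

```python
import numpy as np

# Toy series (illustrative values only): y_0 .. y_5
y_toy = np.array([10.0, 11.0, 13.0, 16.0, 20.0, 25.0])
p = 2
T = y_toy.size

# One row per target time t = p .. T-1: [1, y_{t-1}, y_{t-2}]
rows = [[1.0] + [y_toy[t - i] for i in range(1, p + 1)] for t in range(p, T)]
X_toy = np.array(rows)

# Targets aligned with the rows of X_toy: y_p .. y_{T-1}
target_toy = y_toy[p:]

print(X_toy)
print(target_toy)
```

Each row pairs a target \(y_t\) with the \(p\) values immediately before it, so the first \(p\) observations only ever appear as features.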

### Estimation (least squares)
We choose \(\hat\beta\) to minimize squared prediction error:
$$
\hat\beta = \arg\min_{\beta} \|\mathbf{y} - X\beta\|_2^2
$$

A closed form exists if \(X^\top X\) is invertible:
$$
\hat\beta = (X^\top X)^{-1} X^\top \mathbf{y}
$$

In practice we do not form \((X^\top X)^{-1}\) explicitly. We call `np.linalg.lstsq`, which solves the least-squares problem with a numerically stable SVD-based routine (still "low-level": just NumPy).
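
The two routes can be compared on a small problem. The sketch below uses a made-up regression (the names `X_demo`, `y_demo`, and `beta_demo` are hypothetical) and checks that the normal equations and `np.linalg.lstsq` agree when \(X\) is well conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small regression problem: 50 rows, intercept + 2 features.
X_demo = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
beta_demo = np.array([0.5, 1.5, -2.0])
y_demo = X_demo @ beta_demo + 0.1 * rng.normal(size=50)

# Closed form via the normal equations (solve, rather than an explicit inverse).
beta_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# SVD-based least squares; usually the safer choice when X is ill-conditioned.
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))
```

With highly correlated lag columns (common for slowly varying series), \(X^\top X\) can be close to singular, which is where the two approaches may start to differ.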


## 3) Assumptions, requirements, and stationarity

### Core assumptions (what AR is "promising")
- **Linearity in the past**: \(y_t\) is a linear function of \(y_{t-1},\dots,y_{t-p}\).
- **Equally spaced observations**: the meaning of "one lag" is stable.
- **Parameter stability**: coefficients \(\varphi_i\) do not change over the segment you fit.
- **Noise behavior**: residuals \(\varepsilon_t\) should be approximately **uncorrelated**, with roughly constant variance. (Gaussianity is only needed if you want exact likelihood-based inference; OLS fitting works more generally.)

### Stationarity requirements (the important one)
AR models are usually used for **covariance-stationary** series (constant mean/variance, autocovariance depends only on lag). For AR(p), this is governed by the **characteristic polynomial**:
$$
\Phi(z) = 1 - \varphi_1 z - \varphi_2 z^2 - \dots - \varphi_p z^p
$$
The AR(p) process is stationary if **all roots** of \(\Phi(z)=0\) satisfy
$$
|z| > 1 \quad \text{(roots outside the unit circle)}.
$$

Special case AR(1):
$$
y_t = c + \varphi_1 y_{t-1} + \varepsilon_t \quad \Rightarrow \quad \text{stationary iff } |\varphi_1| < 1.
$$
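
The root condition can be checked numerically. This is a minimal sketch with a hypothetical helper `char_roots`; the one detail to watch is that `np.roots` expects coefficients ordered from the highest power down to the constant term.

```python
import numpy as np

def char_roots(phi):
    """Roots of Phi(z) = 1 - phi_1 z - ... - phi_p z^p (hypothetical helper)."""
    phi = np.asarray(phi, dtype=float)
    # Highest power first: [-phi_p, ..., -phi_1, 1]
    coeffs = np.concatenate((-phi[::-1], [1.0]))
    return np.roots(coeffs)

for phi in ([0.5], [1.02], [0.65, -0.25, 0.15]):
    roots = char_roots(phi)
    is_stationary = bool(np.all(np.abs(roots) > 1.0))
    print(phi, np.round(np.abs(roots), 3), is_stationary)
```

For AR(1) the single root is \(z = 1/\varphi_1\), so \(|z| > 1\) is the same statement as \(|\varphi_1| < 1\).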

### When AR is used / what patterns it fits well
AR is a good baseline when the present depends on the recent past:
- **Inertia / smoothing**: values move gradually (sensor readings, latency signals).
- **Mean reversion after incidents**: a shock causes a jump that decays over a few steps.
- **Short-memory autocorrelation**: dependence mostly captured by a small number of lags.
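- **Oscillations**: damped cycles can be captured by AR(2) and higher orders, whose characteristic polynomial can have complex roots; simple sign alternation already appears in AR(1) with \(\varphi_1 < 0\).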

AR is *not* a great fit (without preprocessing/extensions) for:
- strong **trend** or **unit-root** behavior (needs differencing → ARIMA)
- strong **seasonality** (needs seasonal terms/features)
- structural breaks / regime changes (coefficients change)
- heavy heteroskedasticity (e.g., volatility clustering)


## 4) How lag order \(p\) is chosen (intuition)

### Practical ways to pick \(p\)
- **Information criteria** (AIC/BIC): trade off fit vs complexity.
- **Validation**: choose \(p\) that minimizes forecast error on a holdout period.
- **(Partial) autocorrelation**: ACF/PACF heuristics can suggest a cutoff.
- **Domain knowledge**: if physics says effects last ~3 steps, start near \(p=3\).

### AIC/BIC (what they do)
If you assume Gaussian noise, the maximized log-likelihood depends on the data only through the residual variance. Up to additive constants that do not affect the comparison when models share the same \(n\), you can use:
$$
\mathrm{AIC}(p) = 2k + n\log(\mathrm{RSS}/n) + \text{const}
$$
$$
\mathrm{BIC}(p) = k\log(n) + n\log(\mathrm{RSS}/n) + \text{const}
$$
where:
- \(\mathrm{RSS} = \sum_t \hat\varepsilon_t^2\)
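- \(n\) is the number of fitted points (\(T-p\)); because \(n\) shrinks as \(p\) grows, comparisons across \(p\) are approximate unless every model is fit on the same targets (the effect is negligible when \(T \gg p\))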
- \(k\) is the number of parameters (\(p+1\) if you include an intercept)

BIC penalizes complexity harder than AIC, so BIC tends to choose smaller \(p\).
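
A quick numeric illustration of that difference, using made-up RSS values (the helper `aic_bic` and the numbers are assumptions for this sketch): the larger model lowers RSS only slightly, which is enough to win under AIC but not under BIC.

```python
import numpy as np

def aic_bic(rss, n, k):
    """AIC/BIC up to an additive constant (Gaussian-noise assumption)."""
    fit_term = n * np.log(rss / n)
    return 2 * k + fit_term, k * np.log(n) + fit_term

n = 100
# Hypothetical fits: one extra parameter lowers RSS from 50 to 49.
aic_small, bic_small = aic_bic(rss=50.0, n=n, k=2)
aic_big, bic_big = aic_bic(rss=49.0, n=n, k=3)

print("AIC prefers larger model:", aic_big < aic_small)
print("BIC prefers larger model:", bic_big < bic_small)
```

The per-parameter penalty is 2 for AIC and \(\log n\) for BIC, so BIC is the stricter criterion whenever \(n > e^2 \approx 7.4\).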


In [None]:
def simulate_ar(
    phi: np.ndarray,
    *,
    c: float = 0.0,
    sigma: float = 1.0,
    n: int = 600,
    burn_in: int = 200,
    seed: int = 42,
) -> np.ndarray:
    """Simulate a univariate AR(p): y_t = c + sum_i phi_i y_{t-i} + eps_t."""
    phi = np.asarray(phi, dtype=float)
    p = int(phi.size)
    if p < 1:
        raise ValueError("phi must have length >= 1")
    if n <= 0:
        raise ValueError("n must be positive")
    if burn_in < 0:
        raise ValueError("burn_in must be >= 0")

    rng = np.random.default_rng(seed)
    eps = rng.normal(loc=0.0, scale=sigma, size=n + burn_in)
    y = np.zeros(n + burn_in, dtype=float)

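    # Start at t=p so y_{t-i} exists for every lag.
    for t in range(p, n + burn_in):
        # Reverse the window y[t-p:t]. The slice y[t-1 : t-p-1 : -1] would be
        # empty at t == p because its stop index becomes -1.
        lags = y[t - p : t][::-1]  # [y_{t-1}, ..., y_{t-p}]
        y[t] = c + float(phi @ lags) + eps[t]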

    return y[burn_in:]


def make_lagged_matrix(y: np.ndarray, p: int, *, include_intercept: bool = True):
    """Build (X, y_target) for AR(p) regression.

    For t = p..T-1:
      y_target[t-p] = y[t]
      X[t-p] = [1, y[t-1], ..., y[t-p]]
    """
    y = np.asarray(y, dtype=float)
    p = int(p)
    if p < 1:
        raise ValueError("p must be >= 1")
    if y.ndim != 1:
        raise ValueError("y must be 1D")
    n = y.size
    if n <= p:
        raise ValueError(f"Need at least p+1 points; got n={n}, p={p}")

    # Columns are lag-1, lag-2, ..., lag-p.
    lag_cols = [y[p - i : n - i] for i in range(1, p + 1)]
    X = np.column_stack(lag_cols)
    if include_intercept:
        X = np.column_stack([np.ones(n - p, dtype=float), X])
    y_target = y[p:]
    return X, y_target


def fit_ar_ols(y: np.ndarray, p: int, *, include_intercept: bool = True):
    """Fit AR(p) by OLS and return a small result dict."""
    X, y_target = make_lagged_matrix(y, p, include_intercept=include_intercept)
    beta, *_ = np.linalg.lstsq(X, y_target, rcond=None)
    y_hat_target = X @ beta
    resid = y_target - y_hat_target
    rss = float(resid @ resid)
    n_eff = int(y_target.size)
    k = int(X.shape[1])
    sigma2 = rss / n_eff

    y_hat = np.full_like(np.asarray(y, dtype=float), np.nan)
    y_hat[p:] = y_hat_target

    return {
        "p": int(p),
        "include_intercept": bool(include_intercept),
        "beta": beta,
        "y_hat": y_hat,
        "resid": resid,
        "rss": rss,
        "sigma2": sigma2,
        "n_eff": n_eff,
        "k": k,
    }


def forecast_ar(beta: np.ndarray, y_history: np.ndarray, p: int, *, steps: int, include_intercept: bool = True):
    """Iterative multi-step forecast using the model's own predictions."""
    beta = np.asarray(beta, dtype=float)
    y_hist = list(np.asarray(y_history, dtype=float).tolist())
    p = int(p)
    if steps < 1:
        return np.array([], dtype=float)
    if len(y_hist) < p:
        raise ValueError(f"Need at least p history points; got {len(y_hist)}")

    out = []
    for _ in range(int(steps)):
        lags = np.array(y_hist[-1 : -p - 1 : -1], dtype=float)
        x = np.concatenate(([1.0], lags)) if include_intercept else lags
        y_next = float(x @ beta)
        y_hist.append(y_next)
        out.append(y_next)
    return np.array(out, dtype=float)


def aic_bic_from_rss(rss: float, n: int, k: int):
    """AIC/BIC up to additive constants (sufficient for comparing p)."""
    rss = float(rss)
    n = int(n)
    k = int(k)
    if n <= 0:
        raise ValueError("n must be positive")
    if rss <= 0:
        rss = 1e-12
    aic = 2 * k + n * np.log(rss / n)
    bic = k * np.log(n) + n * np.log(rss / n)
    return float(aic), float(bic)


def ar_stationary(phi: np.ndarray):
    """Check covariance-stationarity for AR(p) via characteristic roots."""
    phi = np.asarray(phi, dtype=float)
    # np.roots expects coefficients from the highest power down to the constant,
    # so Phi(z) = 1 - phi1 z - ... - phip z^p becomes [-phip, ..., -phi1, 1].
    coeffs = np.concatenate((-phi[::-1], [1.0]))
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0)), roots


def acf(x: np.ndarray, nlags: int = 30):
    """Autocorrelation function for lags 0..nlags (simple, biased estimator)."""
    x = np.asarray(x, dtype=float)
    x = x - np.mean(x)
    denom = float(x @ x)
    out = np.empty(int(nlags) + 1, dtype=float)
    out[0] = 1.0
    for k in range(1, int(nlags) + 1):
        out[k] = float(x[:-k] @ x[k:]) / denom
    return out


## 5) Synthetic example (so we know the truth)

We'll simulate a stationary AR(3) series, split it into train/test, and:
- pick \(p\) with AIC/BIC
- show how forecasts change as \(p\) changes
- examine residuals


In [None]:
# Ground-truth AR(3)
phi_true = np.array([0.65, -0.25, 0.15])
c_true = 0.2
sigma_true = 0.7

is_stat, roots = ar_stationary(phi_true)
print("Stationary (true process)?", is_stat)
print("Characteristic roots:", np.round(roots, 3))

y = simulate_ar(phi_true, c=c_true, sigma=sigma_true, n=700, burn_in=300, seed=7)
t = np.arange(y.size)

train_n = 520
y_train = y[:train_n]
y_test = y[train_n:]
t_test = t[train_n:]

fig = go.Figure()
fig.add_trace(go.Scatter(x=t, y=y, mode="lines", name="series"))
fig.add_vline(x=train_n, line_dash="dash", line_color="black")
fig.update_layout(title="Synthetic AR(3) series (train/test split)", xaxis_title="t", yaxis_title="y")
fig

## 6) Lag-order selection with AIC/BIC (NumPy)

We'll fit AR(p) for \(p = 1, \dots, P_{\max}\) on the training set and compute AIC/BIC (up to constants). A good \(p\) usually:
- makes residuals close to white noise
- avoids unnecessary complexity


In [None]:
P_MAX = 15

ps = np.arange(1, P_MAX + 1)
aics = []
bics = []
rss_list = []
mse_dyn = []

for p in ps:
    fit = fit_ar_ols(y_train, p, include_intercept=True)
    aic, bic = aic_bic_from_rss(fit["rss"], fit["n_eff"], fit["k"])
    aics.append(aic)
    bics.append(bic)
    rss_list.append(fit["rss"])

    # (Optional sanity metric) dynamic multi-step MSE on the test horizon
    y_fc = forecast_ar(fit["beta"], y_train, p, steps=y_test.size, include_intercept=True)
    mse_dyn.append(float(np.mean((y_test - y_fc) ** 2)))

aics = np.array(aics)
bics = np.array(bics)
mse_dyn = np.array(mse_dyn)

p_best_aic = int(ps[np.argmin(aics)])
p_best_bic = int(ps[np.argmin(bics)])
print("Best p by AIC:", p_best_aic)
print("Best p by BIC:", p_best_bic)

fig = make_subplots(rows=1, cols=2, subplot_titles=["AIC (lower is better)", "BIC (lower is better)"])
fig.add_trace(go.Scatter(x=ps, y=aics, mode="lines+markers", name="AIC"), row=1, col=1)
fig.add_vline(x=p_best_aic, line_dash="dash", line_color="#1f77b4", row=1, col=1)

fig.add_trace(go.Scatter(x=ps, y=bics, mode="lines+markers", name="BIC"), row=1, col=2)
fig.add_vline(x=p_best_bic, line_dash="dash", line_color="#1f77b4", row=1, col=2)

fig.update_xaxes(title_text="p", row=1, col=1)
fig.update_xaxes(title_text="p", row=1, col=2)
fig.update_layout(title="Lag-order selection on training data")
fig.show()

fig = go.Figure()
fig.add_trace(go.Scatter(x=ps, y=mse_dyn, mode="lines+markers", name="Dynamic forecast MSE"))
fig.update_layout(title="Forecast error vs lag order (dynamic multi-step)", xaxis_title="p", yaxis_title="MSE")
fig

## 7) Effect of different lag orders (forecast behavior)

To show the impact of \(p\), we'll fit several AR(p) models and compare their **multi-step forecasts** over the same test window. This is where under/over-specifying \(p\) often becomes visible.
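
One thing to expect in such plots: for a stationary model, recursive multi-step forecasts decay toward the unconditional mean as the horizon grows. The sketch below shows this for AR(1), where the recursion has a closed form; the parameter values `c`, `phi1`, and `y_last` are hypothetical.

```python
import numpy as np

# Hypothetical AR(1) parameters and last observed value (illustrative only).
c, phi1, y_last = 0.2, 0.7, 3.0
mu = c / (1.0 - phi1)  # unconditional mean of the stationary AR(1)

# Recursive forecast: each step feeds the previous prediction back in.
steps = 20
recursive = []
prev = y_last
for _ in range(steps):
    prev = c + phi1 * prev
    recursive.append(prev)
recursive = np.array(recursive)

# Closed form for AR(1): mu + phi1**h * (y_last - mu), h = 1..steps
h = np.arange(1, steps + 1)
closed_form = mu + phi1**h * (y_last - mu)

print(np.allclose(recursive, closed_form))
print(round(float(recursive[-1]), 4), "vs mean", round(mu, 4))
```

This is why long-horizon forecasts from different lag orders tend to look similar: they all flatten out near the estimated mean, and most of the differences appear in the first few steps.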


In [None]:
P_SHOW = [1, 2, 3, 6, 12]

fig = go.Figure()
fig.add_trace(go.Scatter(x=t_test, y=y_test, mode="lines", name="actual (test)", line=dict(color="black")))

for p in P_SHOW:
    fit = fit_ar_ols(y_train, p, include_intercept=True)
    y_fc = forecast_ar(fit["beta"], y_train, p, steps=y_test.size, include_intercept=True)
    mse = float(np.mean((y_test - y_fc) ** 2))
    fig.add_trace(
        go.Scatter(
            x=t_test,
            y=y_fc,
            mode="lines",
            name=f"AR({p}) forecast (MSE={mse:.3f})",
        )
    )

fig.update_layout(
    title="Effect of lag order: multi-step forecasts on the same test window",
    xaxis_title="t",
    yaxis_title="y",
)
fig

## 8) Prediction vs actual (one-step ahead)

A **one-step-ahead** prediction uses the true lagged values at every time step (it does not feed predictions back in). This is often used for:
- measuring in-sample fit
- creating residuals for diagnostics / anomaly detection
- comparing models fairly before doing true multi-step forecasting


In [None]:
p_best = p_best_bic
fit_best = fit_ar_ols(y_train, p_best, include_intercept=True)

# One-step predictions on the full series using true lags.
X_full, y_target_full = make_lagged_matrix(y, p_best, include_intercept=True)
y_hat_target_full = X_full @ fit_best["beta"]
y_hat_full = np.full_like(y, np.nan)
y_hat_full[p_best:] = y_hat_target_full

fig = go.Figure()
fig.add_trace(go.Scatter(x=t, y=y, mode="lines", name="actual", line=dict(color="black")))
fig.add_trace(go.Scatter(x=t, y=y_hat_full, mode="lines", name=f"AR({p_best}) one-step prediction"))
fig.add_vline(x=train_n, line_dash="dash", line_color="black")
fig.update_layout(title="Prediction vs actual (one-step ahead)", xaxis_title="t", yaxis_title="y")
fig

## 9) Residual behavior (diagnostics)

After fitting AR(p), residuals should ideally behave like **white noise**:
- no obvious trend or seasonality left in residuals
- approximately constant variance
- residual ACF near 0 for lags > 0

These diagnostics help answer:
- Is \(p\) too small? (residuals still autocorrelated)
- Is the series non-stationary / missing trend or seasonality?
- Are there **incidents** (spikes) not explained by autoregression?
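
A common rule of thumb for "ACF near 0" is the approximate 95% band \(\pm 1.96/\sqrt{n}\), which holds under the assumption that the residuals are white noise. The sketch below applies it to a deterministic, strongly autocorrelated toy series so the outcome is easy to verify; `acf_simple` and `resid_demo` are hypothetical names.

```python
import numpy as np

def acf_simple(x, nlags):
    """Biased sample ACF for lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = float(x @ x)
    return np.array([1.0] + [float(x[:-k] @ x[k:]) / denom for k in range(1, nlags + 1)])

# Deterministic toy "residuals" that alternate in sign: +1, -1, +1, ...
resid_demo = np.tile([1.0, -1.0], 50)

rho = acf_simple(resid_demo, nlags=5)
band = 1.96 / np.sqrt(resid_demo.size)  # approximate 95% band under white noise
outside = np.abs(rho[1:]) > band

print(np.round(rho, 2))
print("band:", round(float(band), 3), "lags outside:", int(outside.sum()))
```

For genuinely white residuals, roughly 1 in 20 lags is expected to fall outside the band by chance, so a single marginal exceedance is not by itself evidence of misspecification.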


In [None]:
# Residuals from the training fit (one-step ahead on training window)
resid = fit_best["resid"]
fitted = y_train[p_best:] - resid

res_acf = acf(resid, nlags=30)
lags = np.arange(res_acf.size)

fig = make_subplots(
    rows=2,
    cols=2,
    subplot_titles=[
        "Residuals over time (train)",
        "Residual histogram (train)",
        "Residual ACF (train)",
        "Residuals vs fitted (train)",
    ],
)

fig.add_trace(
    go.Scatter(x=np.arange(resid.size), y=resid, mode="lines", name="residual"),
    row=1,
    col=1,
)
fig.add_hline(y=0, line_color="gray", line_width=1, row=1, col=1)

fig.add_trace(go.Histogram(x=resid, nbinsx=40, name="residuals"), row=1, col=2)

fig.add_trace(go.Bar(x=lags, y=res_acf, name="ACF"), row=2, col=1)
fig.add_hline(y=0, line_color="gray", line_width=1, row=2, col=1)

fig.add_trace(go.Scatter(x=fitted, y=resid, mode="markers", name="resid vs fitted", opacity=0.6), row=2, col=2)
fig.add_hline(y=0, line_color="gray", line_width=1, row=2, col=2)

fig.update_layout(title=f"Residual diagnostics for AR({p_best}) (fit on training)")
fig.update_xaxes(title_text="index", row=1, col=1)
fig.update_xaxes(title_text="residual", row=1, col=2)
fig.update_xaxes(title_text="lag", row=2, col=1)
fig.update_xaxes(title_text="fitted", row=2, col=2)
fig.update_yaxes(title_text="residual", row=1, col=1)
fig.update_yaxes(title_text="count", row=1, col=2)
fig.update_yaxes(title_text="ACF", row=2, col=1)
fig.update_yaxes(title_text="residual", row=2, col=2)
fig

## 10) Stationarity intuition (quick demo)

A stationary AR(1) with \(|\varphi_1| < 1\) tends to "forget" shocks (effects decay geometrically).

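If \(|\varphi_1| \ge 1\), the process is **not** covariance-stationary. Started from a fixed initial value, the variance grows linearly with \(t\) when \(|\varphi_1| = 1\) (a random walk) and geometrically when \(|\varphi_1| > 1\). This is why AR is commonly paired with **differencing** (ARIMA) when the data has a trend/unit root.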
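
Both statements can be read off the AR(1) recursion. Unrolling it \(h\) steps forward (a standard manipulation, shown here as a sketch) gives

$$
y_{t+h} = c \sum_{j=0}^{h-1} \varphi_1^{\,j} + \varphi_1^{\,h}\, y_t + \sum_{j=0}^{h-1} \varphi_1^{\,j}\, \varepsilon_{t+h-j},
$$

so the weight of a shock \(\varepsilon_t\) on \(y_{t+h}\) is \(\varphi_1^{\,h}\), which decays geometrically when \(|\varphi_1| < 1\). For the variance, assuming \(\varepsilon_t\) has variance \(\sigma^2\) and is uncorrelated with \(y_{t-1}\), stationarity requires

$$
\gamma_0 = \mathrm{Var}(y_t) = \varphi_1^2\, \gamma_0 + \sigma^2
\quad\Longrightarrow\quad
\gamma_0 = \frac{\sigma^2}{1 - \varphi_1^2},
$$

which is finite and positive only for \(|\varphi_1| < 1\) and diverges as \(|\varphi_1| \to 1\).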


In [None]:
y_stat = simulate_ar(np.array([0.7]), c=0.0, sigma=1.0, n=250, burn_in=200, seed=0)
y_nonstat = simulate_ar(np.array([1.02]), c=0.0, sigma=1.0, n=250, burn_in=200, seed=0)

fig = make_subplots(rows=2, cols=1, shared_xaxes=True, subplot_titles=["Stationary AR(1): φ=0.7", "Non-stationary AR(1): φ=1.02"])
fig.add_trace(go.Scatter(y=y_stat, mode="lines", name="φ=0.7"), row=1, col=1)
fig.add_trace(go.Scatter(y=y_nonstat, mode="lines", name="φ=1.02"), row=2, col=1)
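fig.update_layout(title="Stationarity matters: |φ|<1 vs |φ|>1")
# With stacked subplots, label the shared x-axis on the bottom row and every y-axis.
fig.update_xaxes(title_text="t", row=2, col=1)
fig.update_yaxes(title_text="y")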
fig

## 11) Summary (what to remember)

- AR(p) is a **linear** model of \(y_t\) from its last \(p\) values.
- Fitting AR(p) can be done as a standard **least squares regression** on lagged features.
- **Stationarity** is crucial: for AR(p) this is a root condition on the characteristic polynomial.
- Choose \(p\) using **AIC/BIC**, validation, and especially **residual diagnostics**.
- In incident-heavy signals, AR is often used as a **baseline** and incidents show up as large residuals.

## Exercises
1. Simulate AR(2) with complex roots (oscillations). What do the series and ACF look like?
2. Fit AR(p) with and without an intercept. How does centering the series change the interpretation of \(c\)?
3. Try a trending series, apply first differencing, then refit AR(p). Compare diagnostics.
