# 📈 Time Series Analysis - ARIMA Model

## Overview
ARIMA (AutoRegressive Integrated Moving Average) is a widely used forecasting method for time series that are stationary or can be made stationary by differencing.

### Components:
- **AR(p)**: AutoRegressive - uses past values to predict future
- **I(d)**: Integrated - differencing to achieve stationarity
- **MA(q)**: Moving Average - uses past forecast errors

### Model Notation: ARIMA(p, d, q)
- **p**: Number of lag observations (AR order)
- **d**: Degree of differencing (Integration order)
- **q**: Number of lagged forecast errors (MA order)
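
For reference, the three orders can be read off a single equation. The sketch below uses the standard backshift-operator notation; the symbols $B$, $\phi_i$, $\theta_j$, $c$ and $\varepsilon_t$ are not defined elsewhere in this notebook and are introduced here only for illustration.

$$
\left(1 - \sum_{i=1}^{p} \phi_i B^i\right)(1 - B)^d \, y_t = c + \left(1 + \sum_{j=1}^{q} \theta_j B^j\right)\varepsilon_t, \qquad B^k y_t = y_{t-k}
$$

For example, ARIMA(1,1,1) applied to the first difference $y'_t = y_t - y_{t-1}$ reduces to:

$$
y'_t = c + \phi_1 \, y'_{t-1} + \varepsilon_t + \theta_1 \, \varepsilon_{t-1}
$$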

### Common ARIMA Models in Finance:
- ARIMA(1,0,0): Simple AR - mean reversion
- ARIMA(0,1,0): Random walk - stock prices
- ARIMA(1,1,1): Balanced model - general forecasting
- ARIMA(2,1,2): Complex series - economic indicators
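
To make the first two models in the list concrete, here is a minimal simulation sketch using only NumPy. The helper names (`simulate_ar1`, `lag1_autocorr`), the coefficient 0.7 and the seed are arbitrary choices for illustration, not part of the notebook's analysis.

```python
import numpy as np

def simulate_ar1(phi, n, rng):
    """Simulate y_t = phi * y_{t-1} + e_t with standard normal noise."""
    noise = rng.standard_normal(n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + noise[t]
    return y

def lag1_autocorr(x):
    """Sample correlation between x_t and x_{t-1}."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

rng = np.random.default_rng(42)               # arbitrary seed
ar1 = simulate_ar1(phi=0.7, n=5000, rng=rng)  # ARIMA(1,0,0): mean-reverting
steps = rng.standard_normal(5000)
random_walk = np.cumsum(steps)                # ARIMA(0,1,0): sum of shocks

print(f"AR(1) lag-1 autocorrelation:       {lag1_autocorr(ar1):.3f}")
print(f"Random walk lag-1 autocorrelation: {lag1_autocorr(random_walk):.3f}")
print(f"Differenced walk lag-1 autocorr.:  {lag1_autocorr(np.diff(random_walk)):.3f}")
```

The AR(1) series should show a lag-1 autocorrelation close to its coefficient, the random walk one close to 1, and differencing the walk recovers the uncorrelated shocks, which is what d=1 is meant to achieve.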

---

In [None]:
# ═══════════════════════════════════════════════════════════════
# ARIMA CHEAT SHEET
# ═══════════════════════════════════════════════════════════════
# ARIMA = model for time series that are stationary after differencing
#
# STATIONARITY
#   Stationary     : E[X] constant, Var constant, Cov depends only on the lag
#   Non-stationary : E[X] changes over time, Var(t), Cov changes over time
#   Check: ACF decays slowly → non-stationary → difference the series
#
# COMPONENTS
#   AR (p) : The future depends on past values (autoregressive)
#   I  (d) : Differencing to make the series stationary
#   MA (q) : Correction using past forecast errors
#
# FINDING (p, d, q)
#   p → PACF (cuts off after lag p)
#   d → ADF test (p-value > 0.05 → d=1, otherwise d=0)
#   q → ACF (cuts off after lag q)
#
# COMMON MODELS IN FINANCE
#   ARIMA(1,0,0) → Simple stationary returns
#   ARIMA(0,1,0) → Random walk (prices)
#   ARIMA(1,1,1) → Standard balanced model
#   ARIMA(2,1,2) → Complex series
# ═══════════════════════════════════════════════════════════════

In [None]:
# ═══════════════════════════════════════════════════════════════
# ACF CHEAT SHEET (AutoCorrelation Function)
# ═══════════════════════════════════════════════════════════════
# ACF = Correlation between X_t and X_{t-k} (total, including indirect effects)
# ρ(k) = Corr(X_t, X_{t-k}) ∈ [-1, 1]
#
# INTERPRETATION
#   ρ = +1 : Perfect positive correlation
#   ρ = 0  : No correlation
#   ρ = -1 : Perfect negative correlation
#
# ACF PATTERNS
#   Slow decay                    → Non-stationary
#   Spike at lag 1, then cutoff   → MA(1) (an AR(1) decays geometrically instead)
#   Regular oscillations          → Seasonality
#   White noise                   → No structure
#
# USES
#   ✓ Check stationarity
#   ✓ Identify q (MA order)
#   ✓ Check model residuals
#   ✓ Detect seasonality
# ═══════════════════════════════════════════════════════════════
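
The definition of ρ(k) above can be computed by hand in a few lines. This is a minimal NumPy sketch of the usual sample estimator (autocovariance at lag k divided by the lag-0 autocovariance); `sample_acf` and the two demo series are illustrative names, not part of the notebook.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation rho(k) for k = 0..nlags.

    Both numerator and denominator are computed around the overall mean,
    so rho(0) is exactly 1.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[k:] * centered[:n - k]) / denom
        for k in range(nlags + 1)
    ])

rng = np.random.default_rng(0)                 # arbitrary seed
white_noise = rng.standard_normal(500)
trending = np.arange(500) + white_noise        # strongly non-stationary

print("white noise:", np.round(sample_acf(white_noise, 3), 2))
print("trending   :", np.round(sample_acf(trending, 3), 2))
```

The trending series keeps autocorrelations near 1 across lags (the "slow decay" pattern), while the white-noise values beyond lag 0 should sit near zero.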

In [None]:
# ═══════════════════════════════════════════════════════════════
# PACF CHEAT SHEET (Partial AutoCorrelation Function)
# ═══════════════════════════════════════════════════════════════
# PACF = Direct correlation between X_t and X_{t-k} (intermediate lags removed)
# Difference: ACF = total / PACF = direct only
#
# PACF PATTERNS
#   Sharp cutoff after lag p     → AR(p)
#   Exponential decay            → MA(q)
#   Damped oscillations          → Mixed ARMA
#   All lags non-significant     → White noise
#
# ACF vs PACF: IDENTIFYING (p, q)
#   AR(p)      : ACF decays    / PACF cuts off at p
#   MA(q)      : ACF cuts off at q / PACF decays
#   ARMA(p,q)  : Both decay
#
# RULE OF THUMB
#   p (AR) → Look at where the PACF cuts off
#   q (MA) → Look at where the ACF cuts off
#
# USES
#   ✓ Identify the order p of the AR part
#   ✓ Distinguish AR from MA
#   ✓ Tune ARIMA parameters
# ═══════════════════════════════════════════════════════════════
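
One way to see why the PACF cuts off for an AR process and decays for an MA process is to compute it from the autocorrelations with the Durbin-Levinson recursion. The sketch below feeds in theoretical autocorrelations for an AR(1) with coefficient 0.6 and an MA(1) with coefficient 0.5; both coefficients and the function name `pacf_from_acf` are illustrative assumptions.

```python
import numpy as np

def pacf_from_acf(rho):
    """Partial autocorrelations from autocorrelations rho[0..K] (rho[0] == 1)
    using the Durbin-Levinson recursion."""
    rho = np.asarray(rho, dtype=float)
    max_lag = len(rho) - 1
    pacf = np.zeros(max_lag + 1)
    pacf[0] = 1.0
    phi_prev = np.array([])  # phi_prev[j-1] holds phi_{k-1,j} for j = 1..k-1
    for k in range(1, max_lag + 1):
        # For k = 1 both dot products are over empty arrays and equal 0
        num = rho[k] - np.dot(phi_prev, rho[k - 1:0:-1])
        den = 1.0 - np.dot(phi_prev, rho[1:k])
        phi_kk = num / den
        # Update the order-k coefficients, then append the new last one
        phi_prev = np.append(phi_prev - phi_kk * phi_prev[::-1], phi_kk)
        pacf[k] = phi_kk
    return pacf

# Theoretical ACF of an AR(1) with phi = 0.6: rho(k) = 0.6 ** k
ar1_rho = 0.6 ** np.arange(5)
# Theoretical ACF of an MA(1) with theta = 0.5: rho(1) = theta / (1 + theta^2)
ma1_rho = np.array([1.0, 0.5 / 1.25, 0.0, 0.0, 0.0])

print("AR(1) PACF:", np.round(pacf_from_acf(ar1_rho), 3))
print("MA(1) PACF:", np.round(pacf_from_acf(ma1_rho), 3))
```

The AR(1) result is non-zero only at lag 1 (a sharp cutoff), whereas the MA(1) partial autocorrelations shrink gradually with alternating sign.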

## 1. Setup & Data Loading

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Time series specific
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.api as sm

# Warnings
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
plt.rcParams['figure.figsize'] = (14, 6)

In [None]:
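# Load dataset - Airline Passengers (classic time series)
airline = sm.datasets.get_rdataset('AirPassengers', 'datasets').data
# 'time' is a decimal year (1949.0, 1949.083, ...); pd.to_datetime would read
# those floats as nanoseconds since 1970, so build the monthly index explicitly
airline.index = pd.date_range(start='1949-01-01', periods=len(airline), freq='MS')
airline = airline.drop(columns='time')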

# Extract series
series = airline['value']

print("Dataset loaded:")
print(f"Period: {series.index.min()} to {series.index.max()}")
print(f"Length: {len(series)} observations")
display(series.head())

## 2. Check for Stationarity (Critical Step)

In [None]:
# Augmented Dickey-Fuller Test for stationarity
# H0: Series is non-stationary (has unit root)
# H1: Series is stationary
# If p-value < 0.05 → Reject H0 → Series is stationary

def adf_test(series, name=''):
    """Perform Augmented Dickey-Fuller test"""
    result = adfuller(series.dropna())
    print(f'\nADF Test Results - {name}')
    print('='*50)
    print(f'ADF Statistic: {result[0]:.4f}')
    print(f'p-value: {result[1]:.4f}')
    print(f'Critical Values:')
    for key, value in result[4].items():
        print(f'  {key}: {value:.3f}')
    
    if result[1] <= 0.05:
        print(f"\n✓ Stationary (p-value = {result[1]:.4f} <= 0.05)")
        print("  → No further differencing needed")
    else:
        print(f"\n✗ Unit root not rejected (p-value = {result[1]:.4f} > 0.05)")
        print("  → Treat as non-stationary and apply (further) differencing")
    return result[1]

# Test original series
p_value_original = adf_test(series, 'Original Series')
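
To show what the test statistic measures, here is a simplified sketch of the regression behind it: the plain Dickey-Fuller form with a constant and no lagged-difference terms, so it is not equivalent to `adfuller`. The function name, the simulated series and the seed are assumptions for illustration; -2.86 is the asymptotic 5% Dickey-Fuller critical value for the constant-only case.

```python
import numpy as np

def dickey_fuller_t(y):
    """t-statistic of gamma in  dy_t = alpha + gamma * y_{t-1} + e_t.

    Plain Dickey-Fuller regression (constant, no lagged differences).
    gamma = 0 corresponds to a unit root.
    """
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(1)            # arbitrary seed
noise = rng.standard_normal(500)

random_walk = np.cumsum(noise)            # unit root: gamma = 0
stationary = np.zeros(500)                # AR(1) with phi = 0.5: gamma = -0.5
for t in range(1, 500):
    stationary[t] = 0.5 * stationary[t - 1] + noise[t]

CRITICAL_5PCT = -2.86  # asymptotic value; not the usual Student-t cutoff
for name, y in [("random walk", random_walk), ("AR(1)", stationary)]:
    t_stat = dickey_fuller_t(y)
    verdict = "reject unit root" if t_stat < CRITICAL_5PCT else "cannot reject unit root"
    print(f"{name:<12} t = {t_stat:7.2f} → {verdict}")
```

The statistic is compared against Dickey-Fuller critical values rather than the normal or t distribution, which is why `adfuller` reports its own critical values.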

## 3. Make Series Stationary (Differencing)

In [None]:
# Apply differencing to remove trend
# First difference: Y'_t = Y_t - Y_{t-1}
series_diff = series.diff().dropna()

# Plot original vs differenced
fig, axes = plt.subplots(2, 1, figsize=(14, 8), sharex=True)

# Original
series.plot(ax=axes[0], color='blue', linewidth=2)
axes[0].set_title('Original Series (Non-Stationary)', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Passengers', fontsize=11)
axes[0].grid(True, alpha=0.3)

# Differenced
series_diff.plot(ax=axes[1], color='red', linewidth=2)
axes[1].set_title('First Difference (d=1)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Change in Passengers', fontsize=11)
axes[1].set_xlabel('Date', fontsize=11)
axes[1].axhline(y=0, color='black', linestyle='--', linewidth=0.8)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Test differenced series
p_value_diff = adf_test(series_diff, 'First Differenced Series')
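
The effect of d can be checked on synthetic trends where the answer is known exactly. This is a small NumPy sketch with made-up series (`linear`, `quadratic`); it also shows how a first difference is undone with a cumulative sum, which is what a forecast on the original scale relies on.

```python
import numpy as np

t = np.arange(10, dtype=float)
linear = 5.0 + 2.0 * t            # linear trend
quadratic = 1.0 + 0.5 * t ** 2    # quadratic trend

d1_linear = np.diff(linear)         # constant slope: all 2.0
d1_quad = np.diff(quadratic)        # still trending: t + 0.5
d2_quad = np.diff(quadratic, n=2)   # constant: all 1.0

# Undo the first difference: initial level plus the cumulative sum of changes
restored = np.concatenate([[linear[0]], linear[0] + np.cumsum(d1_linear)])

print("d=1 on linear trend   :", d1_linear)
print("d=1 on quadratic trend:", d1_quad)
print("d=2 on quadratic trend:", d2_quad)
print("restored == original  :", np.allclose(restored, linear))
```

One difference is enough for a linear trend, while a quadratic trend needs two, which matches the d=1 and d=2 cases.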

## 4. Identify p and q using ACF/PACF

In [None]:
# Plot ACF and PACF to determine ARIMA orders
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# ACF plot - identifies q (MA order)
plot_acf(series_diff, lags=40, ax=axes[0])
axes[0].set_title('ACF - AutoCorrelation Function', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Lag', fontsize=11)
axes[0].set_ylabel('ACF', fontsize=11)

# PACF plot - identifies p (AR order)
plot_pacf(series_diff, lags=40, ax=axes[1])
axes[1].set_title('PACF - Partial AutoCorrelation Function', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Lag', fontsize=11)
axes[1].set_ylabel('PACF', fontsize=11)

plt.tight_layout()
plt.show()

print("\nHow to read ACF/PACF:")
print("="*50)
print("\n1. Look where ACF cuts off → This is q (MA order)")
print("2. Look where PACF cuts off → This is p (AR order)")
print("3. Blue shaded area = 95% confidence interval")
print("4. Spikes outside blue area = significant correlations")
print("\nCommon patterns:")
print("- AR(p): PACF cuts at lag p, ACF decays gradually")
print("- MA(q): ACF cuts at lag q, PACF decays gradually")
print("- ARMA(p,q): Both decay gradually")

## 5. Fit ARIMA Model

In [None]:
# Based on ACF/PACF analysis, let's try ARIMA(1,1,1)
# p=1: One AR term (PACF shows significance at lag 1)
# d=1: First differencing (we already confirmed series needs it)
# q=1: One MA term (ACF shows significance at lag 1)

# Fit model
model = ARIMA(series, order=(1, 1, 1))
model_fit = model.fit()

# Display model summary
print(model_fit.summary())

# Key metrics to check:
print("\nKey Model Diagnostics:")
print("="*50)
print(f"AIC (Akaike Information Criterion): {model_fit.aic:.2f}")
print(f"BIC (Bayesian Information Criterion): {model_fit.bic:.2f}")
print("  → Lower AIC/BIC = better model")
print("\nCoefficients:")
print(model_fit.params)
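
Both criteria are simple functions of the maximised log-likelihood. The sketch below writes out the standard formulas; the two candidate models and their log-likelihood values are invented purely to show how the two penalties can disagree.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

n_obs = 144  # hypothetical sample size
# Hypothetical fits: (log-likelihood, number of estimated parameters)
candidates = {"small model": (-700.0, 3), "large model": (-697.5, 5)}

for name, (llf, k) in candidates.items():
    print(f"{name}: AIC = {aic(llf, k):.2f}, BIC = {bic(llf, k, n_obs):.2f}")
```

With these made-up numbers the larger model wins on AIC while the smaller one wins on BIC, because the BIC penalty per parameter, ln(n), exceeds 2 once n is larger than about 7.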

## 6. Model Diagnostics (Residual Analysis)

In [None]:
# Check if residuals are white noise (good model)
# Good residuals should:
# 1. Have mean ≈ 0
# 2. Have constant variance
# 3. Be normally distributed
# 4. Show no autocorrelation

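# Drop the first residual: with d=1 it is the first observation minus an
# uninformed starting prediction, not a genuine one-step forecast error
residuals = model_fit.resid.iloc[1:]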

fig, axes = plt.subplots(2, 2, figsize=(14, 8))

# Residuals over time
residuals.plot(ax=axes[0, 0], color='red', alpha=0.7)
axes[0, 0].set_title('Residuals Over Time', fontsize=12, fontweight='bold')
axes[0, 0].axhline(y=0, color='black', linestyle='--')
axes[0, 0].grid(True, alpha=0.3)

# Residuals histogram
residuals.hist(bins=30, ax=axes[0, 1], edgecolor='black', alpha=0.7)
axes[0, 1].set_title('Residuals Distribution', fontsize=12, fontweight='bold')
axes[0, 1].axvline(x=0, color='red', linestyle='--', linewidth=2)
axes[0, 1].set_xlabel('Residual Value')
axes[0, 1].grid(True, alpha=0.3)

# ACF of residuals
plot_acf(residuals.dropna(), lags=40, ax=axes[1, 0])
axes[1, 0].set_title('ACF of Residuals', fontsize=12, fontweight='bold')

# Q-Q plot (normality test)
from scipy import stats
stats.probplot(residuals.dropna(), dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot (Normality Test)', fontsize=12, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nResidual Statistics:")
print("="*50)
print(f"Mean: {residuals.mean():.4f} (should be ≈ 0)")
print(f"Std Dev: {residuals.std():.4f}")
print(f"Skewness: {residuals.skew():.4f} (should be ≈ 0)")
print(f"Kurtosis: {residuals.kurtosis():.4f} (should be ≈ 0)")
print("\n→ Good residuals = random noise with no patterns")
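
The "no autocorrelation" check can also be made numeric with the Ljung-Box statistic, which pools the first h residual autocorrelations into one number. Below is a hand-rolled NumPy sketch run on simulated data rather than on this notebook's residuals; the function name and series are illustrative. statsmodels ships its own version as `acorr_ljungbox` in `statsmodels.stats.diagnostic`.

```python
import numpy as np

def ljung_box_q(x, h):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..h} r_k^2 / (n - k)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    lags = np.arange(1, h + 1)
    r = np.array([np.sum(centered[k:] * centered[:n - k]) / denom for k in lags])
    return n * (n + 2) * np.sum(r ** 2 / (n - lags))

CHI2_95_DF10 = 18.307  # 95th percentile of chi-square with 10 degrees of freedom

rng = np.random.default_rng(7)            # arbitrary seed
white = rng.standard_normal(500)
autocorrelated = np.zeros(500)            # stand-in for badly behaved residuals
for t in range(1, 500):
    autocorrelated[t] = 0.8 * autocorrelated[t - 1] + white[t]

for name, x in [("white noise", white), ("AR(1) leftovers", autocorrelated)]:
    q = ljung_box_q(x, h=10)
    flag = "autocorrelation remains" if q > CHI2_95_DF10 else "consistent with white noise"
    print(f"{name:<16} Q(10) = {q:8.2f} → {flag}")
```

When the input is the residual series of a fitted ARMA(p, q), the reference distribution is usually taken with h - p - q degrees of freedom rather than h, so the threshold above is only a rough guide in that setting.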

## 7. Forecasting

In [None]:
# Make predictions
# In-sample: Fitted values for existing data
# Out-of-sample: Forecast future values

# Forecast next 24 months
n_periods = 24
forecast_object = model_fit.get_forecast(steps=n_periods)
forecast_mean = forecast_object.predicted_mean
forecast_ci = forecast_object.conf_int(alpha=0.05)  # 95% confidence interval

# Create forecast index
forecast_index = pd.date_range(
    start=series.index[-1] + pd.DateOffset(months=1),
    periods=n_periods,
    freq='MS'
)

# Plot forecast
fig, ax = plt.subplots(figsize=(14, 7))

# Historical data
series.plot(ax=ax, label='Historical', linewidth=2, color='blue')

# Forecast
forecast_series = pd.Series(forecast_mean.values, index=forecast_index)
forecast_series.plot(ax=ax, label='Forecast', linewidth=2, color='red', linestyle='--')

# Confidence interval
ax.fill_between(
    forecast_index,
    forecast_ci.iloc[:, 0],
    forecast_ci.iloc[:, 1],
    color='pink',
    alpha=0.3,
    label='95% Confidence Interval'
)

ax.set_title('ARIMA(1,1,1) Forecast - Next 24 Months', fontsize=16, fontweight='bold')
ax.set_xlabel('Date', fontsize=12)
ax.set_ylabel('Passengers (thousands)', fontsize=12)
ax.legend(loc='best', fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Display forecast values
print("\nForecast Summary (First 6 months):")
print("="*60)
forecast_df = pd.DataFrame({
    'Forecast': forecast_mean[:6],
    'Lower CI': forecast_ci.iloc[:6, 0],
    'Upper CI': forecast_ci.iloc[:6, 1]
}, index=forecast_index[:6])
display(forecast_df)
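
A forecast is easier to judge against held-out data and a simple baseline. The sketch below defines common error metrics and a seasonal-naive baseline (repeat the last observed season); the function names and the toy numbers are assumptions for illustration, not results from this notebook.

```python
import numpy as np

def forecast_errors(actual, predicted):
    """MAE, RMSE and MAPE (in percent) between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    return {
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAPE_%": 100 * np.mean(np.abs(err / actual)),
    }

def seasonal_naive(train, horizon, season=12):
    """Baseline forecast: repeat the last observed season."""
    last_season = np.asarray(train, dtype=float)[-season:]
    reps = int(np.ceil(horizon / season))
    return np.tile(last_season, reps)[:horizon]

# Toy example with made-up numbers
print(forecast_errors(actual=[100, 200], predicted=[110, 180]))
print(seasonal_naive(train=[1, 2, 3, 4], horizon=5, season=2))
```

In this notebook, one possible use would be to fit the model on `series[:-24]`, forecast 24 steps, and compare the errors against `seasonal_naive(series[:-24], 24)` on the held-out `series[-24:]`.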

## 8. Model Comparison (Grid Search)

In [None]:
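# Try multiple ARIMA combinations to find best model
# Use AIC/BIC to compare models (lower is better)
# Caveat: AIC/BIC are only strictly comparable between models with the same d,
# because differencing changes the data the likelihood is computed on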

print("Testing different ARIMA models...\n")
print("="*70)
print(f"{'Model':<15} {'AIC':>12} {'BIC':>12} {'Status'}")
print("="*70)

best_aic = np.inf
best_order = None
results_dict = {}

# Test different combinations
p_values = range(0, 3)
d_values = range(0, 2)
q_values = range(0, 3)

for p in p_values:
    for d in d_values:
        for q in q_values:
            try:
                model = ARIMA(series, order=(p, d, q))
                fitted = model.fit()
                
                aic = fitted.aic
                bic = fitted.bic
                results_dict[(p,d,q)] = {'AIC': aic, 'BIC': bic}
                
                status = "✓" if aic < best_aic else ""
                if aic < best_aic:
                    best_aic = aic
                    best_order = (p, d, q)
                
                print(f"ARIMA({p},{d},{q})      {aic:12.2f} {bic:12.2f}   {status}")
                
            except Exception as exc:
                # A bare except would also swallow KeyboardInterrupt
                print(f"ARIMA({p},{d},{q})      Failed to fit ({type(exc).__name__})")

print("="*70)
print(f"\n✓ Best model: ARIMA{best_order}")
print(f"  AIC: {best_aic:.2f}")

## 9. Key Takeaways & Best Practices

### ARIMA Workflow:
1. **Check stationarity** (ADF test)
2. **Make stationary** (differencing if needed)
3. **Identify p, q** (ACF/PACF plots)
4. **Fit model** (ARIMA)
5. **Diagnose residuals** (should be white noise)
6. **Forecast** (with confidence intervals)

### Choosing ARIMA Orders:

**d (Differencing)**:
- d=0: Series already stationary
- d=1: Linear trend removed (most common)
- d=2: Quadratic trend (rare)

**p (AR order)**:
- Look at PACF cutoff
- Start with p=1 or p=2
- Higher p = more complex memory

**q (MA order)**:
- Look at ACF cutoff
- Start with q=1 or q=2
- Higher q = more error correction

### Model Selection:
- **AIC**: Rewards fit, penalizes complexity
- **BIC**: Stronger penalty for complexity
- Lower = better
- Prefer simpler models if AIC/BIC similar

### Common Issues:
1. **Non-stationary residuals**: Need more differencing or different model
2. **Seasonality**: Use SARIMA instead
3. **Outliers**: Pre-process or use robust methods
4. **Structural breaks**: Split series or use regime-switching models
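
For the seasonality case, the step SARIMA adds is a seasonal difference, y_t - y_{t-s}. Here is a minimal NumPy sketch on a synthetic monthly series (linear trend plus a 12-month sine cycle); `seasonal_diff` and `series_demo` are illustrative names. In statsmodels, seasonal orders are passed through a `seasonal_order=(P, D, Q, s)` argument, for example on `SARIMAX`.

```python
import numpy as np

def seasonal_diff(x, season=12):
    """Seasonal difference: x_t - x_{t-season}."""
    x = np.asarray(x, dtype=float)
    return x[season:] - x[:-season]

months = np.arange(48)
seasonal_part = 10 * np.sin(2 * np.pi * months / 12)   # repeats every 12 months
series_demo = 100 + 2.0 * months + seasonal_part       # trend + seasonality

sd = seasonal_diff(series_demo, season=12)

print("length after seasonal differencing:", len(sd))
print("first values:", np.round(sd[:4], 6))
```

The seasonal cycle cancels exactly and the linear trend collapses to a constant (slope times 12), so a single seasonal difference leaves nothing to model in this toy series.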

### Extensions:
- **SARIMA**: Seasonal ARIMA for seasonal data
- **ARIMAX**: ARIMA with external regressors
- **GARCH**: For volatility modeling (finance)
- **Prophet**: Facebook's time series tool (easier to use)

---

### Applications in Finance:
- ✅ Stock price forecasting (with caution - often random walk)
- ✅ Economic indicators (GDP, inflation, unemployment)
- ✅ Interest rate modeling
- ✅ Demand forecasting (inventory management)
- ✅ Risk metrics (VaR components)

**Important**: ARIMA assumes linear relationships. For complex, non-linear series, consider ML methods (LSTM, XGBoost).

---
*End of Time Series Analysis series. Next: Advanced topics (GARCH, VAR, Cointegration)*