# SARIMAX Model

## Content
* Elements
* Data Preprocessing
* Model Identification
* Model Estimation
* Model Verification
* Model Use

Import required tools

In [None]:
import time
import itertools
import joblib
import numpy as np
import scipy as sp
import pandas as pd
import statsmodels.api as sm  # `sm.OLS` and `sm.add_constant` live in statsmodels.api
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Get required config

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

## Elements
The $SARIMAX$ model extends the $SARIMA$ model by incorporating exogenous variables.  

-----

### Exogenous Variables in Time Series Definition
**Endogenous Variables**  
These are included in the model as the main time series data. The model aims to explain or predict these values based on historical data and other included factors.
  
**Exogenous Variables**  
These are included in the model as additional inputs that might affect the endogenous variable. 
  
**Exogenous Variables Considerations**  
Exogenous variables do not always help to build a better model, so their relevance must be evaluated. This can be done both before and after training. 
* Exogenous variables relevance check before training:
    * ???
* Exogenous variables relevance check after training: 
    * ???  
  
It is important to note that normalizing the exogenous variables before working with them is highly recommended. 
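
The normalization step can be sketched as follows. This is a minimal example on synthetic data; the column names `rate` and `fx` and the cut-off date are hypothetical. It assumes the scaler should be fitted on the historical window only, so that projected future values do not influence the scaling parameters.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic monthly exogenous data (hypothetical columns, for illustration only)
rng = np.random.default_rng(42)
idx = pd.date_range("2020-01-01", periods=60, freq="MS")
exog_raw = pd.DataFrame(
    {"rate": rng.normal(5.0, 1.0, 60), "fx": rng.normal(20.0, 3.0, 60)},
    index=idx,
)

# Split into history (used for training) and future projections
cutoff = "2024-07-01"
hist_raw = exog_raw[exog_raw.index < cutoff]
future_raw = exog_raw[exog_raw.index >= cutoff]

# Fit the scaler on history only, then apply the same transform to the future
scaler = StandardScaler()
exog_hist = pd.DataFrame(
    scaler.fit_transform(hist_raw), columns=hist_raw.columns, index=hist_raw.index
)
exog_future = pd.DataFrame(
    scaler.transform(future_raw), columns=future_raw.columns, index=future_raw.index
)

# After scaling, each historical column has mean ~0 and (population) std ~1
print(exog_hist.mean().round(6))
print(exog_hist.std(ddof=0).round(6))
```

Reusing `scaler.transform` on the projections keeps both windows on the same scale, which is what the fitted model will expect at forecast time.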

-----
### Specification
**SARIMAX(p,d,q)(P,D,Q)_E**   
**Seasonal Autoregressive Integrated Moving Averages Model with eXogenous Variables**  
Let $\{Z_t\}$ be a time series with seasonal period $E$ that describes a phenomenon that might be influenced by the variables $\{X_1\}$, $\{X_2\}$, ..., $\{X_n\}$. A seasonal autoregressive integrated moving averages model with exogenous variables is expressed as follows:
$$\phi(B)\Phi(B^{E})\nabla^d\nabla^D_E Z_t = \theta(B)\Theta(B^{E})a_t + \sum_{i = 1}^{n} \beta_i X_{i,t}$$
Where:
* $\phi(B) = 1-\phi_1B - \phi_2B^{2} - ...- \phi_pB^{p}$ is a backshift polynomial of order $p$ that represents the autoregressive part of the model. Thus, the current value of the time series is regressed on its own $p$ previous values, that is, $Z_{t-1}, Z_{t-2}, ..., Z_{t-p}$.
* $\theta(B) = 1-\theta_1B - \theta_2B^{2} - ...- \theta_qB^{q}$ is a backshift polynomial of order $q$ that represents the moving averages part of the model. Thus, the current value of the time series can be expressed as a linear combination of $q$ past error terms, that is, $a_{t-1}, a_{t-2}, ..., a_{t-q}$.
* $\Phi(B^E) = 1-\Phi_1B^E - \Phi_2B^{2E} - ...- \Phi_PB^{PE}$ is a seasonal backshift polynomial of order $P$ that represents the seasonal autoregressive part of the model. Thus, the current value of the time series is regressed on its own $P$ previous seasonal values, that is, $Z_{t-E}, Z_{t-2E}, ..., Z_{t-PE}$.
* $\Theta(B^E) = 1-\Theta_1B^E - \Theta_2B^{2E} - ...- \Theta_QB^{QE}$ is a seasonal backshift polynomial of order $Q$ that represents the seasonal moving averages part of the model. Thus, the current value of the time series can be expressed as a linear combination of $Q$ past seasonal error terms, that is, $a_{t-E}, a_{t-2E}, ..., a_{t-QE}$.
* $\nabla^{d} = (1-B)^{d}$ is the regular difference operator of order $d$.
* $\nabla_{E}^{D} = (1-B^{E})^{D}$ is the seasonal difference operator of order $D$.
* $\beta_i$ is the regression coefficient of the exogenous variable $X_i$, and $X_{i,t}$ is its value at time $t$.
* $a_t$ is a white noise process.

The main process $\{Z_t\}$ is referred to as the endogenous variable, while the processes $X_1, X_2,...,X_n$ are referred to as exogenous variables or covariates. The idea behind the SARIMAX model is that the exogenous variables may improve the performance of the model.
 
-----
### Examples of Exogenous Variables
Some examples of exogenous variables may include the following:
* Macroeconomic Variables
    * Interest Rates
    * Inflation Rates
    * Central Bank Reserves
    * GDP
    * Unemployment Rates
    * Exchange Rates
    * Consumer Confidence Index
    * Income Levels
    * Population Growth
    * Urbanization Trends
* Financial and Business Metrics
    * Stock Prices
    * Corporate Earnings
    * Operational Costs
* Market-Specific Factors
    * Price of Product
    * Product Availability
    * Competitor Pricing and Availability
    * Product Market Share
    * Marketing and Promotions
    * Consumer Trends and Preferences
    * Supply Chain Metrics
    * Supply and Demand Metrics
* External Events
    * Regulatory Changes
    * Geopolitical Events
    * Natural Disasters
* Commodity Prices
    * Energy Commodities
    * Metal Commodities
    * Agricultural Commodities
    * Livestock and Animal Products
    * Soft Commodities

Nevertheless, these must be adapted to the use case at hand. 

-----
### Forecasting with Exogenous Variables
Let $\{Z_t\}$ be a time series with seasonal period $E$ that describes a phenomenon that might be influenced by the variables $\{X_1\}$, $\{X_2\}$, ..., $\{X_n\}$. Suppose you have $N$ data points for the endogenous variable $Z_t$ and your forecast horizon is $K$ periods ahead. To produce such a forecast, the exogenous variables $X_1,X_2,...,X_n$ must have $N+K$ data points. Therefore, the exogenous variables must be projected $K$ periods forward. 

## Data Preprocessing

### Load data

In [None]:
# Load data
loading_path = r'Data_TS\df_inflation.csv'
df = pd.read_csv(loading_path)
df.head()

In [None]:
## Basic data manipulations
# Make sure dates are fine
df['DS'] = pd.to_datetime(df['DS'], format='mixed')
# Change column names
df = df.rename(columns = {'DS' : 'ds', 'INFLACION' : 'y'})
# Separate exogenous and endogenous variables
df_endog = df[['ds', 'y']]
df_exog = df.drop(columns = ['y'])


## For endogenous variable
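# Get required dates (split off the future part before overwriting df_endog,
# otherwise the future frame is always empty)
df_endog_future = df_endog[df_endog['ds'] >= '2025-01-01'] # For final forecast evaluation if data exists
df_endog = df_endog[df_endog['ds'] < '2025-01-01']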
# Set endogenous data
endog = pd.Series(df_endog['y'].values, index = df_endog['ds'], name = 'y')
endog_future = pd.Series(df_endog_future['y'].values, index = df_endog_future['ds'])


## For exogenous variables
# Normalize data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
exog = pd.DataFrame(
    scaler.fit_transform(df_exog.drop(columns = ['ds'])),
    columns = ['MXN/USD', 'AG_MONET_M2', 'TASA_INT', 'REMESAS', 'SM_NOM_GENERAL', 'SM_NOM_ZLFN', 'PIB_PCONST_VAR_ANUAL_TRIM'],
    index = df_exog['ds']
)
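# Get required dates (split off the future part before overwriting exog,
# otherwise the future frame is always empty)
exog_future = exog[exog.index >= '2025-01-01'] # Projections needed in order to forecast
exog = exog[exog.index < '2025-01-01']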

In [None]:
# Plot endogenous variable
plt.figure(figsize=(12, 6))
sns.lineplot(data = endog, marker = 'o')
plt.title("Time Series Plot")
plt.xlabel("Date")
plt.ylabel("Value")
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
# Plot exogenous variables
plt.figure(figsize=(12, 6))
sns.lineplot(data = exog, marker = 'o')
plt.title("Time Series Plot")
plt.xlabel("Date")
plt.ylabel("Value")
plt.grid(True)
plt.tight_layout()
plt.show()

### Seasonality Determination

**Seasonality determination by visual inspection**  
The goal is to look for patterns or repetitive cycles that recur at regular intervals. Check if there are obvious seasonal fluctuations.

In [None]:
# Plot data
plt.figure(figsize=(18, 10))
sns.lineplot(data = endog, marker = 'o')
plt.title("Time Series Plot")
plt.xlabel("Date")
plt.ylabel("Value")
plt.grid(True)
plt.xticks(endog.index, rotation=45, fontsize = 8)
plt.tight_layout()
plt.show()

**Time Series Decomposition**  
Time series decomposition assumes that the series is composed of three main components: trend-cycle, which represents the long-term movement of the series; seasonality, which captures effects repeated annually with some consistency; and irregularity, which characterizes movements that are unpredictable and treated as random. We'll focus on the seasonal part. 


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

## Decompose the time series
# 52 for weekly time series
# 12 for monthly time series
# 4 for quarterly time series
result = seasonal_decompose(endog, model='additive', period = 12) 

plt.figure(figsize=(14, 8), dpi=100)
plt.plot(result.seasonal)
plt.grid(True)
plt.xticks(endog.index, rotation=45, fontsize = 8)
plt.title('Seasonality')
plt.show()

**ACF**  
Use the ACF to detect seasonal periods by identifying regular patterns or peaks. 
* In the ACF plot, significant peaks at regular intervals indicate the presence of seasonality. For example, if your data is monthly and you see peaks every 12 lags, this suggests a seasonal period of 12 months.
* The pattern of peaks will repeat with the seasonal period. For instance, if the ACF has a peak at lag 12, 24, 36, etc., this implies a seasonal period of 12.

In [None]:
from statsmodels.graphics.tsaplots import plot_acf

# Plot ACF on an explicit axis; plot_acf opens its own figure when no axis is passed
fig, ax = plt.subplots(figsize=(28, 16))
plot_acf(endog, lags = endog.shape[0] - 1, ax = ax) 

# Customize x-axis ticks
plt.xticks(range(0, endog.shape[0], 12), rotation=45, fontsize=8)

plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.title('Autocorrelation Function (ACF)')
plt.grid(True)
plt.show()

### Relevance Analysis for Exogenous Variables (Before Training)

In [None]:
# Set data
df_present = pd.concat([endog, exog], axis = 1)
endog_var = endog.name
exog_vars = exog.columns
df_present.head()

**Correlation Analysis**  
This analysis covers the following:
* Pearson and Spearman correlation of each exogenous variable with the endogenous variable, in order to detect linear and monotonic (possibly non-linear) relationships, respectively. 
* Pearson and Spearman correlation of the exogenous variables with each other, in order to detect and avoid multicollinearity.

In [None]:
# Pearson correlation for each exogenous variable with endogenous variable
corr_pearson = pd.DataFrame(df_present.corr(method = 'pearson').iloc[0,1:])

# Plot
plt.figure(figsize=(10, 6))
sns.heatmap(corr_pearson, annot=True, cmap='coolwarm_r', fmt='.2f', square=True, cbar_kws={"shrink": .8})
plt.title('Pearson Correlations')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Spearman correlation for each exogenous variable with endogenous variable
corr_spearman = pd.DataFrame(df_present.corr(method = 'spearman').iloc[0,1:])

# Plot
plt.figure(figsize=(10, 6))
sns.heatmap(corr_spearman, annot=True, cmap='coolwarm_r', fmt='.2f', square=True, cbar_kws={"shrink": .8})
plt.title('Spearman Correlations')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Pearson correlation between exogenous variables
corr_exog_pearson  = pd.DataFrame(df_present[exog_vars].corr(method = 'pearson'))

# Plot
plt.figure(figsize=(10, 6))
custom_cmap = sns.diverging_palette(145, 10, s=85, l=45, as_cmap=True) 
sns.heatmap(corr_exog_pearson, annot=True, cmap = 'coolwarm_r', fmt='.2f', square=True, cbar_kws={"shrink": .8})
plt.title('Pearson Correlations between Exogenous Variables')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Spearman correlation between exogenous variables
corr_exog_spearman  = pd.DataFrame(df_present[exog_vars].corr(method = 'spearman'))

# Plot
plt.figure(figsize=(10, 6))
custom_cmap = sns.diverging_palette(145, 10, s=85, l=45, as_cmap=True) 
sns.heatmap(corr_exog_spearman, annot=True, cmap = 'coolwarm_r', fmt='.2f', square=True, cbar_kws={"shrink": .8})
plt.title('Spearman Correlations between Exogenous Variables')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
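
Besides reading the heatmaps, highly correlated pairs of exogenous variables can be listed programmatically. The sketch below uses synthetic data and an assumed threshold of `0.8`; both the threshold and the column names are illustrative choices.

```python
import numpy as np
import pandas as pd

# Synthetic exogenous variables; "b" is built to be nearly collinear with "a"
rng = np.random.default_rng(7)
a = rng.normal(size=200)
exog_demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

threshold = 0.8  # assumed cut-off; adjust to the use case
corr = exog_demo.corr(method="pearson").abs()

# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]
print(high_pairs)
```

For each reported pair, one option is to keep only the variable that is more strongly related to the endogenous variable.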

**Lagged Cross-Correlation**  
Consider time series $\{Y_t\}$ (endogenous variable) and $\{X_t\}$ (exogenous variable). The lagged cross-correlation consists of calculating $corr(Y_t, X_{t-p})$ for $p\in\mathbb{N}$, typically using $p = 1,2,3$, in order to see how the lagged variable $X_{t-p}$ is related to $Y_t$. 

In [None]:
# Function to calculate lagged correlations
def past_lag_correlations(exog: pd.DataFrame, endog: pd.Series, exog_lags=[1, 2, 3]) -> pd.DataFrame:
    """
    Compute correlations between lagged exogenous variables and the endogenous variable.
    
    Parameters:
        exog : Exogenous variables, indexed by timestamp (pd.DataFrame)
        endog : Endogenous variable, indexed by timestamp (pd.Series)
        exog_lags : List of integer lags to compute (list)
        
    Returns:
        pd.DataFrame: Correlation matrix with lags as rows and exog variable names as columns.
    """
    results = pd.DataFrame(index=exog_lags, columns=exog.columns)

    for lag in exog_lags:
        # Lag all exogenous variables by `lag` steps into the past
        exog_lagged = exog.shift(lag)
        
        # Align lagged exog and endog on timestamps, drop missing data
        df = pd.concat([endog, exog_lagged], axis=1).dropna()
        
        # Compute correlation between endog and each lagged exog column
        for col in exog.columns:
            corr = df[endog.name].corr(df[col])
            results.at[lag, col] = corr

    return results.astype(float)

In [None]:
# Calculate lagged correlations
lagged_corrs = past_lag_correlations(exog = exog, endog = endog, exog_lags=[1, 2, 3])
lagged_corrs.head()

In [None]:
# Plot heatmaps
for lag in lagged_corrs.index:
    plt.figure(figsize=(8, 4))
    sns.heatmap(lagged_corrs.loc[[lag]], annot=True, cmap='coolwarm_r', fmt='.2f', cbar_kws={"shrink": .8})
    plt.title(f'Lagged Correlations (Lag = {lag})')
    plt.xticks(rotation=45)
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()

**Regression Analysis**  
Consider time series $\{Y_t\}$ (endogenous variable) and $\{X_t\}$ (exogenous variable). The regression analysis consists of fitting a linear regression $Y_t = \beta_0 + \beta_1 X_{t-1} + \beta_2 X_{t-2} + \dots + \beta_p X_{t-p} + \varepsilon_t$ with $p\in\mathbb{N}$, typically using $p = 1,2,3$, in order to assess the statistical significance of the lagged exogenous variables with respect to the endogenous variable through $t$-tests. A coefficient whose $p$-value is below $0.05$ suggests statistical significance, and thus a relevant variable. A $p$-value of $0.05$ or above suggests non-significance, and thus an irrelevant variable at that lag. 

In [None]:
# Function to run OLS regression with lagged exogenous variables
def ols_lagged_regression(endog: pd.Series, exog: pd.DataFrame, exog_lags=[1, 2, 3]):
    """
    Fit an OLS regression of endogenous variable on lagged exogenous variables.

    IN:
    -> exog : Exogenous variables, indexed by timestamp (pd.DataFrame)
    -> endog : Endogenous variable, indexed by timestamp (pd.Series)
    -> exog_lags : List of integer lags to compute (list)

    OUT:
    -> results (RegressionResultsWrapper): The fitted OLS model result
    """
    # Create lagged versions of exog variables
    X_lagged = pd.concat({
        f"{col}_lag{lag}": exog[col].shift(lag)
        for col in exog.columns
        for lag in exog_lags
    }, axis=1)
    print(X_lagged.head(20))

    # Combine and drop missing rows
    df_OLS = pd.concat([endog, X_lagged], axis=1).dropna()
    y = df_OLS[endog.name]
    X = df_OLS.drop(columns=[endog.name])

    # Add constant term
    X = sm.add_constant(X)

    # Fit OLS
    model = sm.OLS(y, X).fit()

    return model


Pre-training checks
* OLS regression
* Granger Causality Test
* Mutual Information
* Variance and Stationarity Check   

Post-training checks
* Statistical significance
* Joint interactions among variables
* True contribution to AIC/BIC 


In [None]:
# Option A: Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
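
A minimal sketch of how RFE could be applied with these two imports is shown below. The synthetic features, the target, and `n_features_to_select=2` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic standardized features; only "x1" and "x3" drive the target
rng = np.random.default_rng(21)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x1", "x2", "x3", "x4"])
y = 3.0 * X["x1"] - 2.0 * X["x3"] + rng.normal(scale=0.5, size=200)

# Recursively drop the feature with the least important coefficient
selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

# support_ marks the kept features; ranking_ is 1 for kept, higher for dropped
selected = list(X.columns[selector.support_])
print(selected)
print(dict(zip(X.columns, selector.ranking_)))
```

Because RFE with a linear estimator ranks features by coefficient size, it assumes the features are on a comparable scale, and it does not account for the temporal structure of the data. It is best treated as one input among the other relevance checks.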