<a href="https://colab.research.google.com/github/zuzka05/stat_learn/blob/main/endogenous_vs_exogenous.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering: Endogenous vs Exogenous

A few people on Discord have asked me to cover feature engineering. dkbdu recently asked a great question there, so I thought I would create a notebook that goes into more detail about endogenous vs exogenous features and how you can create features from this knowledge.

In this notebook, you will learn:

1. What are endogenous features?
2. What are exogenous features?
3. Creating features and models using this knowledge.  

In [1]:
import numpy as np
import pandas as pd
import polars as pl

### Load BTCUSDT

In [3]:
btcusdt_ts = pl.read_csv('BTCUSDT-1h-ohlc.csv', try_parse_dates=True)
btcusdt_ts

open_time,open,high,low,close,volume,close_time,quote_volume,count,taker_buy_volume,taker_buy_quote_volume,ignore
datetime[μs],f64,f64,f64,f64,f64,datetime[μs],f64,i64,f64,f64,i64
2024-01-01 00:00:00,42314.0,42603.2,42289.6,42503.5,8459.477,2024-01-01 00:59:59.999,3.5920e8,88278,4687.976,1.9903e8,0
2024-01-01 01:00:00,42503.5,42832.0,42462.0,42647.9,9043.411,2024-01-01 01:59:59.999,3.8597e8,90351,4783.838,2.0418e8,0
2024-01-01 02:00:00,42647.9,42676.9,42530.0,42620.4,4653.067,2024-01-01 02:59:59.999,1.9822e8,52550,2141.259,9.1221e7,0
2024-01-01 03:00:00,42620.5,42630.0,42270.0,42369.8,8119.88,2024-01-01 03:59:59.999,3.4442e8,84345,3501.132,1.4847e8,0
2024-01-01 04:00:00,42369.8,42439.8,42235.2,42436.6,6356.536,2024-01-01 04:59:59.999,2.6915e8,68026,3155.432,1.3363e8,0
…,…,…,…,…,…,…,…,…,…,…,…
2025-12-20 19:00:00,88277.1,88298.0,88157.0,88196.3,684.85,2025-12-20 19:59:59.999,6.0420e7,23228,315.869,2.7867e7,0
2025-12-20 20:00:00,88196.4,88285.2,88175.3,88250.4,514.301,2025-12-20 20:59:59.999,4.5370e7,17316,272.216,2.4014e7,0
2025-12-20 21:00:00,88250.3,88434.9,88140.3,88179.6,1055.058,2025-12-20 21:59:59.999,9.3132e7,29069,528.547,4.6660e7,0
2025-12-20 22:00:00,88179.6,88306.1,88111.1,88244.1,677.809,2025-12-20 22:59:59.999,5.9783e7,23504,353.874,3.1212e7,0


### Load ETHUSDT

In [4]:
ethusdt_ts = pl.read_csv('ETHUSDT-1h-ohlc.csv', try_parse_dates=True)
ethusdt_ts

open_time,open,high,low,close,volume,close_time,quote_volume,count,taker_buy_volume,taker_buy_quote_volume,ignore
datetime[μs],f64,f64,f64,f64,f64,datetime[μs],f64,i64,f64,f64,i64
2024-01-01 00:00:00,2283.84,2299.16,2282.97,2297.41,75593.183,2024-01-01 00:59:59.999,1.7340e8,75195,40291.257,9.2415e7,0
2024-01-01 01:00:00,2297.41,2308.74,2294.77,2305.71,58591.895,2024-01-01 01:59:59.999,1.3494e8,60112,31397.246,7.2306e7,0
2024-01-01 02:00:00,2305.72,2306.95,2293.15,2295.19,44611.328,2024-01-01 02:59:59.999,1.0258e8,45679,19798.81,4.5524e7,0
2024-01-01 03:00:00,2295.19,2296.83,2272.04,2275.59,88559.028,2024-01-01 03:59:59.999,2.0211e8,87063,40900.389,9.3320e7,0
2024-01-01 04:00:00,2275.59,2281.71,2266.67,2281.45,80010.451,2024-01-01 04:59:59.999,1.8206e8,82831,40686.056,9.2595e7,0
…,…,…,…,…,…,…,…,…,…,…,…
2025-12-20 19:00:00,2976.91,2979.6,2973.33,2976.57,13658.304,2025-12-20 19:59:59.999,4.0662e7,37741,6510.24,1.9381e7,0
2025-12-20 20:00:00,2976.57,2984.75,2976.0,2983.07,17926.407,2025-12-20 20:59:59.999,5.3429e7,38100,10565.718,3.1493e7,0
2025-12-20 21:00:00,2983.06,2984.62,2977.18,2977.5,14494.642,2025-12-20 21:59:59.999,4.3216e7,39439,5998.943,1.7887e7,0
2025-12-20 22:00:00,2977.51,2979.13,2973.07,2976.03,18548.207,2025-12-20 22:59:59.999,5.5189e7,38181,8989.708,2.6749e7,0


### Join ETH and BTC together

In [5]:
btcusdt_renamed = btcusdt_ts.rename({c: f"btcusdt_{c}" for c in btcusdt_ts.columns})
ethusdt_renamed = ethusdt_ts.rename({c: f"ethusdt_{c}" for c in ethusdt_ts.columns})
ts = pl.concat([btcusdt_renamed, ethusdt_renamed], how="horizontal")
ts

btcusdt_open_time,btcusdt_open,btcusdt_high,btcusdt_low,btcusdt_close,btcusdt_volume,btcusdt_close_time,btcusdt_quote_volume,btcusdt_count,btcusdt_taker_buy_volume,btcusdt_taker_buy_quote_volume,btcusdt_ignore,ethusdt_open_time,ethusdt_open,ethusdt_high,ethusdt_low,ethusdt_close,ethusdt_volume,ethusdt_close_time,ethusdt_quote_volume,ethusdt_count,ethusdt_taker_buy_volume,ethusdt_taker_buy_quote_volume,ethusdt_ignore
datetime[μs],f64,f64,f64,f64,f64,datetime[μs],f64,i64,f64,f64,i64,datetime[μs],f64,f64,f64,f64,f64,datetime[μs],f64,i64,f64,f64,i64
2024-01-01 00:00:00,42314.0,42603.2,42289.6,42503.5,8459.477,2024-01-01 00:59:59.999,3.5920e8,88278,4687.976,1.9903e8,0,2024-01-01 00:00:00,2283.84,2299.16,2282.97,2297.41,75593.183,2024-01-01 00:59:59.999,1.7340e8,75195,40291.257,9.2415e7,0
2024-01-01 01:00:00,42503.5,42832.0,42462.0,42647.9,9043.411,2024-01-01 01:59:59.999,3.8597e8,90351,4783.838,2.0418e8,0,2024-01-01 01:00:00,2297.41,2308.74,2294.77,2305.71,58591.895,2024-01-01 01:59:59.999,1.3494e8,60112,31397.246,7.2306e7,0
2024-01-01 02:00:00,42647.9,42676.9,42530.0,42620.4,4653.067,2024-01-01 02:59:59.999,1.9822e8,52550,2141.259,9.1221e7,0,2024-01-01 02:00:00,2305.72,2306.95,2293.15,2295.19,44611.328,2024-01-01 02:59:59.999,1.0258e8,45679,19798.81,4.5524e7,0
2024-01-01 03:00:00,42620.5,42630.0,42270.0,42369.8,8119.88,2024-01-01 03:59:59.999,3.4442e8,84345,3501.132,1.4847e8,0,2024-01-01 03:00:00,2295.19,2296.83,2272.04,2275.59,88559.028,2024-01-01 03:59:59.999,2.0211e8,87063,40900.389,9.3320e7,0
2024-01-01 04:00:00,42369.8,42439.8,42235.2,42436.6,6356.536,2024-01-01 04:59:59.999,2.6915e8,68026,3155.432,1.3363e8,0,2024-01-01 04:00:00,2275.59,2281.71,2266.67,2281.45,80010.451,2024-01-01 04:59:59.999,1.8206e8,82831,40686.056,9.2595e7,0
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2025-12-20 19:00:00,88277.1,88298.0,88157.0,88196.3,684.85,2025-12-20 19:59:59.999,6.0420e7,23228,315.869,2.7867e7,0,2025-12-20 19:00:00,2976.91,2979.6,2973.33,2976.57,13658.304,2025-12-20 19:59:59.999,4.0662e7,37741,6510.24,1.9381e7,0
2025-12-20 20:00:00,88196.4,88285.2,88175.3,88250.4,514.301,2025-12-20 20:59:59.999,4.5370e7,17316,272.216,2.4014e7,0,2025-12-20 20:00:00,2976.57,2984.75,2976.0,2983.07,17926.407,2025-12-20 20:59:59.999,5.3429e7,38100,10565.718,3.1493e7,0
2025-12-20 21:00:00,88250.3,88434.9,88140.3,88179.6,1055.058,2025-12-20 21:59:59.999,9.3132e7,29069,528.547,4.6660e7,0,2025-12-20 21:00:00,2983.06,2984.62,2977.18,2977.5,14494.642,2025-12-20 21:59:59.999,4.3216e7,39439,5998.943,1.7887e7,0
2025-12-20 22:00:00,88179.6,88306.1,88111.1,88244.1,677.809,2025-12-20 22:59:59.999,5.9783e7,23504,353.874,3.1212e7,0,2025-12-20 22:00:00,2977.51,2979.13,2973.07,2976.03,18548.207,2025-12-20 22:59:59.999,5.5189e7,38181,8989.708,2.6749e7,0


In [6]:
# number of steps ahead that we forecast (1 step = one 1h bar)
forecast_horizon = 1

# log return per bar: log(close_t / close_{t-1}); the first row has no previous close, so it is dropped
ts = ts.with_columns(
    (pl.col('btcusdt_close')/pl.col('btcusdt_close').shift()).log().alias('btcusdt_close_log_return'),
    (pl.col('ethusdt_close')/pl.col('ethusdt_close').shift()).log().alias('ethusdt_close_log_return'),
).drop_nulls()

ts.select('btcusdt_close_log_return','ethusdt_close_log_return')

btcusdt_close_log_return,ethusdt_close_log_return
f64,f64
0.003392,0.003606
-0.000645,-0.004573
-0.005897,-0.008576
0.001575,0.002572
-0.003872,-0.001531
…,…
-0.000916,-0.000111
0.000613,0.002181
-0.000803,-0.001869
0.000731,-0.000494


### Auto-Regression

The main principle behind applying machine learning to trading is auto-regression: predicting the future value of a series from its own past values. It's a very simple concept.

In [7]:
ts = ts.with_columns(
    pl.col('btcusdt_close_log_return').shift().alias('btcusdt_close_log_return_lag_1'),
    pl.col('ethusdt_close_log_return').shift().alias('ethusdt_close_log_return_lag_1'),
).drop_nulls()
ar_cols = ['btcusdt_close_log_return', 'btcusdt_close_log_return_lag_1', 'ethusdt_close_log_return','ethusdt_close_log_return_lag_1']
ts.select(ar_cols)

btcusdt_close_log_return,btcusdt_close_log_return_lag_1,ethusdt_close_log_return,ethusdt_close_log_return_lag_1
f64,f64,f64,f64
-0.000645,0.003392,-0.004573,0.003606
-0.005897,-0.000645,-0.008576,-0.004573
0.001575,-0.005897,0.002572,-0.008576
-0.003872,0.001575,-0.001531,0.002572
0.003559,-0.003872,0.002403,-0.001531
…,…,…,…
-0.000916,0.001503,-0.000111,0.000138
0.000613,-0.000916,0.002181,-0.000111
-0.000803,0.000613,-0.001869,0.002181
0.000731,-0.000803,-0.000494,-0.001869


### Look at the auto-correlation

We can see that the BTCUSDT lag has a negative correlation with the current return (about -0.013), which suggests a tiny amount of mean reversion.

In [8]:
ts[['btcusdt_close_log_return','btcusdt_close_log_return_lag_1']].corr()

btcusdt_close_log_return,btcusdt_close_log_return_lag_1
f64,f64
1.0,-0.012863
-0.012863,1.0


For ETH, we can see there's a tiny positive correlation (about 0.004), so there seems to be a very small degree of momentum, although a value this close to zero may not be distinguishable from noise.

In [9]:
ts[['ethusdt_close_log_return','ethusdt_close_log_return_lag_1']].corr()

ethusdt_close_log_return,ethusdt_close_log_return_lag_1
f64,f64
1.0,0.004275
0.004275,1.0
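The lag-1 correlations above can be extended to several lags with a small helper. The sketch below uses a simulated AR(1) series with a known coefficient rather than the BTCUSDT data, so the result can be checked against theory (for an AR(1) process the autocorrelation at lag $k$ is $\phi^k$). The `autocorr` helper and the value of `phi` are assumptions chosen for illustration.

```python
import numpy as np


def autocorr(x: np.ndarray, lag: int) -> float:
    """Pearson correlation between x[t] and x[t - lag] (hypothetical helper)."""
    return float(np.corrcoef(x[lag:], x[:-lag])[0, 1])


# Simulate r_t = phi * r_{t-1} + noise with a known, negative phi (mean reversion).
rng = np.random.default_rng(0)
phi = -0.3
n = 20_000
noise = rng.normal(size=n)
r = np.zeros(n)
for t in range(1, n):
    r[t] = phi * r[t - 1] + noise[t]

# Compare the sample autocorrelation with the theoretical value phi ** lag.
for lag in (1, 2, 3):
    print(lag, round(autocorr(r, lag), 3), round(phi**lag, 3))
```

On real returns the same helper could be applied to the log-return column after converting it to a NumPy array; sample autocorrelations of roughly $1/\sqrt{N}$ or smaller are hard to separate from noise.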


## What are Endogenous Variables?

In the context of auto-regression, endogenous variables are the variables modeled within the system itself: their values are explained by the model from their own past values. They are the "dependent" variables that the auto-regressive model forecasts using lagged versions of themselves, as opposed to exogenous variables, which are determined outside the model.

If we are predicting the future log return of BTCUSDT then the endogenous variables are its lags.

### Auto-Regressive Model with Endogenous Variables

$$
r_{t}^{\text{BTC}} = \alpha + \sum_{i=1}^{p} \beta_i r_{t-i}^{\text{BTC}} + \epsilon_t
$$

Where:
- $r_{t}^{\text{BTC}}$ is the BTCUSDT log return at time $t$ (endogenous variable)
- $r_{t-i}^{\text{BTC}}$ are the lagged BTCUSDT log returns (endogenous, auto-regressive terms)
- $\alpha$ is the intercept
- $\beta_i$ are the auto-regressive coefficients
- $\epsilon_t$ is the error term
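In practice the intercept and the auto-regressive coefficients are usually estimated from data. The following is a minimal sketch of one way to do that with ordinary least squares, run on a simulated AR(2) series so the recovered coefficients can be compared with the true ones. The `lag_matrix` helper and the simulation parameters are assumptions for illustration, not part of the original notebook.

```python
import numpy as np


def lag_matrix(r: np.ndarray, p: int):
    """Return (X, y): row t of X is [1, r[t-1], ..., r[t-p]] and y is r[t]."""
    n = len(r)
    cols = [np.ones(n - p)] + [r[p - i : n - i] for i in range(1, p + 1)]
    return np.column_stack(cols), r[p:]


# Simulate r_t = alpha + beta_1 * r_{t-1} + beta_2 * r_{t-2} + eps_t.
rng = np.random.default_rng(1)
alpha, betas = 0.001, np.array([-0.2, 0.1])
n = 50_000
eps = rng.normal(scale=0.01, size=n)
r = np.zeros(n)
for t in range(2, n):
    r[t] = alpha + betas[0] * r[t - 1] + betas[1] * r[t - 2] + eps[t]

# Ordinary least squares: the first coefficient is the intercept, the rest are the lag weights.
X, y = lag_matrix(r, p=2)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat, beta_hat = coef[0], coef[1:]
print(alpha_hat, beta_hat)
```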


In more concrete terms, if we are predicting the BTCUSDT log return from its previous values, then it's an auto-regressive model based on endogenous variables.


In [10]:
weight, bias = -0.0001, -0.0000001
ts.select('btcusdt_close_log_return', (weight * pl.col('btcusdt_close_log_return_lag_1') + bias).alias('y_hat'))

btcusdt_close_log_return,y_hat
f64,f64
-0.000645,-4.3916e-7
-0.005897,-3.5498e-8
0.001575,4.8972e-7
-0.003872,-2.5754e-7
0.003559,2.8721e-7
…,…
-0.000916,-2.5032e-7
0.000613,-8.4281e-9
-0.000803,-1.6132e-7
0.000731,-1.9742e-8


In [None]:
"""
You've manually set weights close to zero to show what happens when a model learns nothing. This would give you an R² close to 0 (about as good as always predicting the mean), since you're predicting near-zero while actual returns vary significantly.
The lesson: This illustrates why your neural network models struggled - if the model doesn't learn appropriate coefficient magnitudes (around -0.09 to -0.14 like PyTorch found), it will have zero predictive power. The TensorFlow model with +0.64 overshot, while this -0.0001 model is basically sleeping on the job.
"""

Here, we are modelling auto-regressive mean reversion behaviour as the weight (coefficient) is negative.
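To see how a near-zero hand-set weight scores, R² can be computed directly. The sketch below does this on simulated, uncorrelated returns (not the BTCUSDT data); `r_squared` is a small helper defined here and the simulation scale is an assumption.

```python
import numpy as np


def r_squared(y: np.ndarray, y_hat: np.ndarray) -> float:
    """R^2 = 1 - SS_res / SS_tot, measured against the mean of y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Simulated hourly-style returns with no true auto-correlation.
rng = np.random.default_rng(2)
returns = rng.normal(loc=0.0, scale=0.005, size=10_000)
y, lag_1 = returns[1:], returns[:-1]

# Same hand-set parameters as the cell above.
weight, bias = -0.0001, -0.0000001
y_hat = weight * lag_1 + bias

r2 = r_squared(y, y_hat)
print(r2)
```

Because the predictions are almost constant and near the mean of the target, the residual sum of squares is almost equal to the total sum of squares, so R² lands very near zero.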

## What are Exogenous Variables?

In a model predicting future BTCUSDT log returns using previous lags of BTCUSDT itself, the BTCUSDT log return is the endogenous variable because it's what the model is forecasting based on its own history.

The previous lag of ETHUSDT log return would be an exogenous variable because it comes from outside the BTCUSDT system - it's an external input that may influence BTCUSDT but isn't being predicted by this particular auto-regressive model.

### Auto-Regressive Model with Exogenous Variable

$$
r_{t}^{\text{BTC}} = \alpha + \sum_{i=1}^{p} \beta_i r_{t-i}^{\text{BTC}} + \gamma r_{t-1}^{\text{ETH}} + \epsilon_t
$$

Where:
- $r_{t}^{\text{BTC}}$ is the BTCUSDT log return at time $t$ (endogenous variable)
- $r_{t-i}^{\text{BTC}}$ are the lagged BTCUSDT log returns (endogenous, auto-regressive terms)
- $r_{t-1}^{\text{ETH}}$ is the lagged ETHUSDT log return (exogenous variable)
- $\alpha$ is the intercept
- $\beta_i$ are the auto-regressive coefficients
- $\gamma$ is the coefficient for the exogenous variable
- $\epsilon_t$ is the error term

In more concrete terms, this is how you model this in code:

In [11]:
# hand-set weights for illustration: w1 on the BTC lag, w2 on the ETH lag
w1, w2 = -0.0002, 0.0001
bias = -0.0000001

btc_lag = pl.col('btcusdt_close_log_return_lag_1')
eth_lag = pl.col('ethusdt_close_log_return_lag_1')
y_hat = w1 * btc_lag + w2 * eth_lag + bias

ts.select('btcusdt_close_log_return', y_hat.alias('y_hat'))

btcusdt_close_log_return,y_hat
f64,f64
-0.000645,-4.1770e-7
-0.005897,-4.2830e-7
0.001575,2.2181e-7
-0.003872,-1.5789e-7
0.003559,5.2133e-7
…,…
-0.000916,-3.8687e-7
0.000613,7.2058e-8
-0.000803,-4.5092e-9
0.000731,-1.2638e-7
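Instead of hand-setting `w1` and `w2`, both can be estimated jointly by least squares. The sketch below does this on simulated data in which the ETH lag genuinely leads BTC; that lead-lag effect is built into the simulation as an assumption and is not a finding about real markets. All parameter values are made up for illustration.

```python
import numpy as np

# Simulate BTC returns that depend on their own lag and on ETH's lag.
rng = np.random.default_rng(3)
n = 50_000
true_bias, true_w1, true_w2 = 0.0, -0.05, 0.10

eth = rng.normal(scale=0.01, size=n)
noise = rng.normal(scale=0.01, size=n)
btc = np.zeros(n)
for t in range(1, n):
    btc[t] = true_bias + true_w1 * btc[t - 1] + true_w2 * eth[t - 1] + noise[t]

# Design matrix: intercept, BTC lag (endogenous), ETH lag (exogenous).
X = np.column_stack([np.ones(n - 1), btc[:-1], eth[:-1]])
y = btc[1:]

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
bias_hat, w1_hat, w2_hat = coef
print(bias_hat, w1_hat, w2_hat)
```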


In [None]:
"""
You're predicting BTC returns using both its own lag (auto-regressive) AND ETH's lag (exogenous).
This captures two effects: BTC mean reversion + the BTC-ETH lead-lag relationship.
The model assumes ETH movements help predict future BTC movements.

Coefficient interpretation
w1 = -0.0002 (BTC lag): weak mean reversion in BTC (w1 < 0), i.e. BTC tends to reverse after a move
w2 = +0.0001 (ETH lag): cross-asset momentum (w2 > 0), i.e. when ETH had a positive return, BTC tends to follow

BTC and ETH are highly correlated: their returns over the same bar move together (correlation ~0.8-0.9).
That same-bar correlation does not by itself mean w2 should be large, because w2 measures how well ETH's previous return predicts BTC's next return.
ETH sometimes leads/lags BTC due to different liquidity and trader bases.
With 1h bars, a big ETH move in the previous hour might predict BTC's move in the current hour.
"""

## Auto-Regressive Model with Relative Value Constraint

$$
r_{t}^{\text{BTC}} = \alpha + \sum_{i=1}^{p} \beta_i r_{t-i}^{\text{BTC}} - \gamma r_{t-1}^{\text{ETH}} + \epsilon_t
$$

Where:
- $r_{t}^{\text{BTC}}$ is the BTCUSDT log return at time $t$ (endogenous variable)
- $r_{t-i}^{\text{BTC}}$ are the lagged BTCUSDT log returns (endogenous, auto-regressive terms)
- $r_{t-1}^{\text{ETH}}$ is the lagged ETHUSDT log return (exogenous variable)
- $\alpha$ is the intercept
- $\beta_i$ are the auto-regressive coefficients
- $\gamma \geq 0$ is constrained to be non-negative (enforcing negative relationship via the minus sign)
- $\epsilon_t$ is the error term

This models the relative value dynamics where the expectation of BTCUSDT future log return is greater when the previous ETHUSDT log return is lower (i.e., when ETH underperforms, BTC is expected to outperform).

In [12]:
# hand-set weights; the ETH term is subtracted below, so w2 plays the role of gamma >= 0
w1, w2 = -0.0002, 0.0001
bias = -0.0000001

btc_lag = pl.col('btcusdt_close_log_return_lag_1')
eth_lag = pl.col('ethusdt_close_log_return_lag_1')
y_hat = w1 * btc_lag - w2 * eth_lag + bias

ts.select('btcusdt_close_log_return', y_hat.alias('y_hat'))

btcusdt_close_log_return,y_hat
f64,f64
-0.000645,-0.000001
-0.005897,4.8631e-7
0.001575,0.000002
-0.003872,-6.7226e-7
0.003559,8.2751e-7
…,…
-0.000916,-4.1442e-7
0.000613,9.4230e-8
-0.000803,-4.4078e-7
0.000731,2.4741e-7
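When the weights are fitted rather than hand-set, the non-negativity constraint on $\gamma$ has to be enforced by the fitting procedure. A minimal sketch for the one-lag case: fit ordinary least squares, and if the estimated $\gamma$ comes out negative, pin it to zero and refit the remaining terms. This shortcut relies on there being a single bound constraint on a convex least-squares problem; with several constrained coefficients a bounded solver would be needed. The function name and the simulated inputs are assumptions for illustration.

```python
import numpy as np


def fit_relative_value(btc_lag, eth_lag, y):
    """Least squares for y = alpha + beta * btc_lag - gamma * eth_lag with gamma >= 0."""
    ones = np.ones_like(y)
    # Negating the ETH column makes its coefficient equal to gamma in the formula.
    X_full = np.column_stack([ones, btc_lag, -eth_lag])
    alpha, beta, gamma = np.linalg.lstsq(X_full, y, rcond=None)[0]
    if gamma < 0:
        # The unconstrained optimum violates the bound, so the constrained
        # optimum sits on the boundary gamma = 0: refit without the ETH term.
        alpha, beta = np.linalg.lstsq(X_full[:, :2], y, rcond=None)[0]
        gamma = 0.0
    return float(alpha), float(beta), float(gamma)


# Simulated regressors (independent draws, used only to exercise the fit).
rng = np.random.default_rng(4)
n = 20_000
btc_lag = rng.normal(scale=0.01, size=n)
eth_lag = rng.normal(scale=0.01, size=n)
noise = rng.normal(scale=0.01, size=n)

y_rv = -0.05 * btc_lag - 0.10 * eth_lag + noise   # true gamma = +0.10 (feasible)
y_mom = -0.05 * btc_lag + 0.10 * eth_lag + noise  # true gamma = -0.10 (infeasible)

fit_rv = fit_relative_value(btc_lag, eth_lag, y_rv)
fit_mom = fit_relative_value(btc_lag, eth_lag, y_mom)
print(fit_rv)
print(fit_mom)
```

In the second case the data favour a positive ETH coefficient, so the constrained fit switches the ETH term off entirely instead of returning a negative $\gamma$.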


## Auto-Regressive Spread Model

$$
r_{t}^{\text{BTC}} = \alpha + \sum_{i=1}^{p} \beta_i \left(r_{t-i}^{\text{BTC}} - r_{t-i}^{\text{ETH}}\right) + \epsilon_t
$$

Where:
- $r_{t}^{\text{BTC}}$ is the BTCUSDT log return at time $t$ (endogenous variable)
- $\left(r_{t-i}^{\text{BTC}} - r_{t-i}^{\text{ETH}}\right)$ is the lagged return spread at lag $i$ (single feature per lag)
- $\alpha$ is the intercept
- $\beta_i$ are the spread coefficients for each lag
- $\epsilon_t$ is the error term

This models the spread dynamics. The expectation of the future BTCUSDT log return depends on the historical spread values. The sign and magnitude of $\beta_i$ determine whether the model exhibits mean reversion ($\beta_i < 0$) or momentum ($\beta_i > 0$) behavior in the relative value between BTC and ETH.
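For a single lag ($p = 1$), expanding the spread term shows how this model relates to the exogenous-variable form, where the ETH lag has its own coefficient $\gamma$. This is only algebra on the equation above, written out as a sketch:

$$
\beta_1 \left(r_{t-1}^{\text{BTC}} - r_{t-1}^{\text{ETH}}\right) = \beta_1 r_{t-1}^{\text{BTC}} + \gamma r_{t-1}^{\text{ETH}}, \quad \gamma = -\beta_1
$$

So the spread model is the two-feature model with the ETH coefficient tied to the negative of the BTC coefficient, which leaves one weight to fit per lag instead of two.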

In [13]:
w1 = -0.0001
bias = -0.0000001

btc_lag = pl.col('btcusdt_close_log_return_lag_1')
eth_lag = pl.col('ethusdt_close_log_return_lag_1')
spread_lag = btc_lag - eth_lag
y_hat = w1 * spread_lag + bias

ts.select('btcusdt_close_log_return', y_hat.alias('y_hat'))

btcusdt_close_log_return,y_hat
f64,f64
-0.000645,-7.8536e-8
-0.005897,-4.9280e-7
0.001575,-3.6791e-7
-0.003872,-3.5068e-10
0.003559,1.3412e-7
…,…
-0.000916,-2.3655e-7
0.000613,-1.9514e-8
-0.000803,5.6812e-8
0.000731,-2.0664e-7
