<a href="https://colab.research.google.com/github/zuzka05/stat_learn/blob/main/linear_vs_xgb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression vs XGBoost

This notebook compares the predictive performance and trading profitability of linear regression against an XGBoost regressor.
XGBoost has a strong track record on tabular data (including time series expressed as lagged features) and has been used in several winning Kaggle solutions.

### Install deps

In [1]:
%pip install scikit-learn xgboost pandas



### Load Data

We will use 1-hour OHLC data for the MELANIA crypto coin (the `MELANIAUSDT` pair). The choice of asset does not matter for this comparison.

In [2]:
import pandas as pd

df = pd.read_csv('MELANIAUSDT-1h-ohlc.csv')

df

Unnamed: 0,open_time,open,high,low,close,volume,close_time,quote_volume,count,taker_buy_volume,taker_buy_quote_volume,ignore
0,2025-01-20T09:00:00.000+0000,11.0000,11.4810,9.5700,10.6620,11384199.07,2025-01-20T09:59:59.999+0000,1.231390e+08,260948,5445930.66,5.899404e+07,0
1,2025-01-20T10:00:00.000+0000,10.6610,11.6940,10.5000,10.8480,12075271.07,2025-01-20T10:59:59.999+0000,1.343633e+08,382369,5854709.03,6.518769e+07,0
2,2025-01-20T11:00:00.000+0000,10.8480,11.0840,9.3810,9.5930,12843489.69,2025-01-20T11:59:59.999+0000,1.291159e+08,395669,6221973.06,6.251548e+07,0
3,2025-01-20T12:00:00.000+0000,9.5920,10.1100,7.8890,8.5610,21465129.93,2025-01-20T12:59:59.999+0000,1.919756e+08,484420,10232073.31,9.171677e+07,0
4,2025-01-20T13:00:00.000+0000,8.5650,8.7850,7.4100,7.9000,19772796.14,2025-01-20T13:59:59.999+0000,1.590931e+08,463466,9713763.14,7.826701e+07,0
...,...,...,...,...,...,...,...,...,...,...,...,...
7402,2025-11-24T19:00:00.000+0000,0.1325,0.1337,0.1292,0.1298,2802078.94,2025-11-24T19:59:59.999+0000,3.680209e+05,3166,1125561.23,1.479018e+05,0
7403,2025-11-24T20:00:00.000+0000,0.1299,0.1325,0.1294,0.1322,2156768.69,2025-11-24T20:59:59.999+0000,2.820491e+05,2749,1351718.72,1.767220e+05,0
7404,2025-11-24T21:00:00.000+0000,0.1321,0.1322,0.1290,0.1297,1666444.52,2025-11-24T21:59:59.999+0000,2.173586e+05,2703,656671.32,8.559823e+04,0
7405,2025-11-24T22:00:00.000+0000,0.1297,0.1312,0.1297,0.1304,709641.03,2025-11-24T22:59:59.999+0000,9.254184e+04,1311,401705.44,5.238444e+04,0


### Add Log Returns and Their Lags

In [3]:
import numpy as np

df['close_log_return'] = np.log(df['close'] / df['close'].shift())
df['close_log_return_lag_1'] = df['close_log_return'].shift()
df['close_log_return_lag_2'] = df['close_log_return'].shift(2)
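
Log returns are used because they add up across time, which is what later lets the notebook sum per-bar returns and convert back with `np.exp(...) - 1`. A minimal sketch of that property, using made-up prices rather than the dataset:

```python
import numpy as np

# Toy closing prices (hypothetical values, not taken from the dataset).
close = np.array([100.0, 110.0, 99.0, 108.9])

# Log return for each step: log(p_t / p_{t-1}), the same formula as the cell above.
log_returns = np.log(close[1:] / close[:-1])

# Log returns add across time, so their sum recovers the total simple return.
total_simple_return = np.exp(log_returns.sum()) - 1
direct_simple_return = close[-1] / close[0] - 1

print(total_simple_return, direct_simple_return)
```

Both printed values should agree up to floating-point error.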

### Examine Serial Correlation

In [4]:
df[['close_log_return','close_log_return_lag_1','close_log_return_lag_2']].corr()

Unnamed: 0,close_log_return,close_log_return_lag_1,close_log_return_lag_2
close_log_return,1.0,-0.09287,0.063261
close_log_return_lag_1,-0.09287,1.0,-0.092853
close_log_return_lag_2,0.063261,-0.092853,1.0


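There is a small, yet statistically significant at this sample size, negative autocorrelation between the close log return and its first lag: -0.09287. The second lag shows a weaker positive correlation of 0.063261.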
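
One rough way to judge whether correlations of this size are distinguishable from zero is the usual ±1.96/√n band for a sample autocorrelation. The sketch below assumes roughly 7,400 observations (read off the row index above) and, more importantly, assumes independent returns, which is a strong simplification for financial data:

```python
import numpy as np

# Correlations copied from the table above.
correlations = {"lag_1": -0.09287, "lag_2": 0.063261}
n = 7400  # approximate sample size (assumption, read off the row index)

# Under independence, a sample autocorrelation has standard error of about 1/sqrt(n),
# so the approximate 95% band is +/- 1.96/sqrt(n).
band = 1.96 / np.sqrt(n)

for name, r in correlations.items():
    z_score = r * np.sqrt(n)
    outside = abs(r) > band
    print(f"{name}: r={r:+.5f}, z={z_score:+.2f}, outside 95% band: {outside}")
```

Heavy tails and volatility clustering widen the true band, so this is a sanity check rather than a formal test.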

### Split Data into Test/Train using time split

In [5]:
def time_split(df, train_size=0.75):
    i = int(len(df) * train_size)
    return df[:i], df[i:]

df = df.dropna()
df_train, df_test = time_split(df)
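
The point of a time split is that no shuffling happens, so every training row precedes every test row. A small sketch of that property, with a NumPy array of hypothetical hourly indices standing in for the DataFrame (positional slicing behaves the same way):

```python
import numpy as np

def time_split(data, train_size=0.75):
    """Split an ordered sequence into an earlier train part and a later test part."""
    i = int(len(data) * train_size)
    return data[:i], data[i:]

# Hypothetical stand-in for the DataFrame: 20 ordered hourly timestamps.
timestamps = np.arange(20)
train, test = time_split(timestamps)

# No shuffling: the latest training timestamp is earlier than the first test timestamp.
print(len(train), len(test), train.max() < test.min())
```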

### Empirically check the split is correct

In [6]:
df_train

Unnamed: 0,open_time,open,high,low,close,volume,close_time,quote_volume,count,taker_buy_volume,taker_buy_quote_volume,ignore,close_log_return,close_log_return_lag_1,close_log_return_lag_2
3,2025-01-20T12:00:00.000+0000,9.5920,10.1100,7.8890,8.5610,21465129.93,2025-01-20T12:59:59.999+0000,1.919756e+08,484420,10232073.31,9.171677e+07,0,-0.113817,-0.122947,0.017295
4,2025-01-20T13:00:00.000+0000,8.5650,8.7850,7.4100,7.9000,19772796.14,2025-01-20T13:59:59.999+0000,1.590931e+08,463466,9713763.14,7.826701e+07,0,-0.080354,-0.113817,-0.122947
5,2025-01-20T14:00:00.000+0000,7.9000,9.3620,7.9000,9.2330,17759999.80,2025-01-20T14:59:59.999+0000,1.527149e+08,501861,9025218.75,7.759323e+07,0,0.155921,-0.080354,-0.113817
6,2025-01-20T15:00:00.000+0000,9.2330,9.2790,7.4440,7.6960,22842738.76,2025-01-20T15:59:59.999+0000,1.887794e+08,581312,10730193.18,8.871690e+07,0,-0.182083,0.155921,-0.080354
7,2025-01-20T16:00:00.000+0000,7.7000,8.1910,7.3880,8.0420,18755577.50,2025-01-20T16:59:59.999+0000,1.467476e+08,541512,9096828.79,7.122142e+07,0,0.043977,-0.182083,0.155921
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5551,2025-09-08T16:00:00.000+0000,0.2019,0.2028,0.2000,0.2004,2091935.56,2025-09-08T16:59:59.999+0000,4.217470e+05,2319,979580.96,1.975355e+05,0,-0.007952,0.001982,0.015496
5552,2025-09-08T17:00:00.000+0000,0.2005,0.2019,0.1998,0.2004,1628770.10,2025-09-08T17:59:59.999+0000,3.268314e+05,3380,885163.20,1.777163e+05,0,0.000000,-0.007952,0.001982
5553,2025-09-08T18:00:00.000+0000,0.2004,0.2018,0.2000,0.2011,2445280.59,2025-09-08T18:59:59.999+0000,4.912744e+05,3025,1062316.67,2.134323e+05,0,0.003487,0.000000,-0.007952
5554,2025-09-08T19:00:00.000+0000,0.2012,0.2015,0.1990,0.1998,2312755.13,2025-09-08T19:59:59.999+0000,4.627622e+05,3218,1046063.72,2.091958e+05,0,-0.006485,0.003487,0.000000


In [7]:
df_test

Unnamed: 0,open_time,open,high,low,close,volume,close_time,quote_volume,count,taker_buy_volume,taker_buy_quote_volume,ignore,close_log_return,close_log_return_lag_1,close_log_return_lag_2
5556,2025-09-08T21:00:00.000+0000,0.1990,0.2007,0.1988,0.2002,1092088.45,2025-09-08T21:59:59.999+0000,218236.491004,1710,538461.47,107655.530634,0,0.006515,-0.004515,-0.006485
5557,2025-09-08T22:00:00.000+0000,0.2001,0.2006,0.1989,0.1993,2279239.84,2025-09-08T22:59:59.999+0000,455394.303374,1972,1015413.56,202707.556904,0,-0.004506,0.006515,-0.004515
5558,2025-09-08T23:00:00.000+0000,0.1994,0.1997,0.1987,0.1991,559000.51,2025-09-08T23:59:59.999+0000,111348.526789,839,268537.57,53478.448064,0,-0.001004,-0.004506,0.006515
5559,2025-09-09T00:00:00.000+0000,0.1992,0.2009,0.1987,0.2008,2646895.00,2025-09-09T00:59:59.999+0000,529172.646929,2978,1705709.64,340908.933952,0,0.008502,-0.001004,-0.004506
5560,2025-09-09T01:00:00.000+0000,0.2007,0.2011,0.1986,0.1992,1602138.76,2025-09-09T01:59:59.999+0000,319931.180180,1880,736448.14,147131.266583,0,-0.008000,0.008502,-0.001004
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7402,2025-11-24T19:00:00.000+0000,0.1325,0.1337,0.1292,0.1298,2802078.94,2025-11-24T19:59:59.999+0000,368020.932035,3166,1125561.23,147901.843397,0,-0.020588,-0.005269,0.066743
7403,2025-11-24T20:00:00.000+0000,0.1299,0.1325,0.1294,0.1322,2156768.69,2025-11-24T20:59:59.999+0000,282049.053477,2749,1351718.72,176722.012247,0,0.018321,-0.020588,-0.005269
7404,2025-11-24T21:00:00.000+0000,0.1321,0.1322,0.1290,0.1297,1666444.52,2025-11-24T21:59:59.999+0000,217358.594789,2703,656671.32,85598.225699,0,-0.019092,0.018321,-0.020588
7405,2025-11-24T22:00:00.000+0000,0.1297,0.1312,0.1297,0.1304,709641.03,2025-11-24T22:59:59.999+0000,92541.838018,1311,401705.44,52384.437988,0,0.005383,-0.019092,0.018321


### Create Features and Target

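The features are the inputs to the model and the target is what we want to predict. Here the only feature is the second lag of the close log return, and the target is the current close log return.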

* X => denotes the model input
* y => denotes the model output

In [8]:
features = ['close_log_return_lag_2']
target = 'close_log_return'

X_train, X_test = df_train[features], df_test[features]
y_train, y_test = df_train[target], df_test[target]

In [9]:
X_train = X_train.to_numpy().astype("float32")
X_test  = X_test.to_numpy().astype("float32")
y_train = y_train.to_numpy().astype("float32")
y_test  = y_test.to_numpy().astype("float32")

## Linear Regression

In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# 1. Create the model
model = LinearRegression()

# 2. Train (fit) the model
model.fit(X_train, y_train)

# 3. Predict on the test set
y_pred = model.predict(X_test)

# 4. Evaluate
linear_mse = mean_squared_error(y_test, y_pred)
linear_r2 = r2_score(y_test, y_pred)

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("MSE:", linear_mse)
print("R² score:", linear_r2)

df_test = df_test.copy()
df_test['y_hat'] = y_pred
df_test['dir_signal'] = np.sign(y_pred)
df_test['trade_log_return'] = df_test['dir_signal']  * df_test['close_log_return']

linear_ev = df_test['trade_log_return'].mean()
print(f'EV = {linear_ev}')
linear_return = np.exp(df_test['trade_log_return'].sum()) - 1
print(f'Total Return = {linear_return}')
linear_std = df_test['trade_log_return'].std()
linear_sharpe = linear_ev / linear_std * np.sqrt(24 * 365)
print(f'Sharpe = {linear_sharpe}')


Coefficients: [0.06194299]
Intercept: -0.0006537067
MSE: 0.0003216602490283549
R² score: 0.003679037094116211
EV = 0.000527362352889373
Total Return = 1.6542117426771465
Sharpe = 2.747233547572052
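
The same three trading metrics are computed again for XGBoost later on, so they could be wrapped in one helper to guarantee both models are scored identically. A minimal sketch; `evaluate_signal` and `periods_per_year` are hypothetical names, and the inputs are toy numbers:

```python
import numpy as np

def evaluate_signal(y_pred, realized_log_returns, periods_per_year=24 * 365):
    """Score a sign-based strategy: long when y_pred > 0, short when y_pred < 0."""
    signal = np.sign(y_pred)
    trade_log_returns = signal * realized_log_returns
    ev = trade_log_returns.mean()                       # mean log return per bar
    total_return = np.exp(trade_log_returns.sum()) - 1  # compounded simple return
    std = trade_log_returns.std(ddof=1)                 # sample std, like pandas .std()
    sharpe = ev / std * np.sqrt(periods_per_year)       # annualized, zero risk-free rate
    return {"ev": ev, "total_return": total_return, "sharpe": sharpe}

# Toy inputs (hypothetical): three correct direction calls and one wrong one.
y_pred = np.array([0.01, -0.02, 0.005, 0.01])
realized = np.array([0.02, -0.01, 0.01, -0.005])
stats = evaluate_signal(y_pred, realized)
print(stats)
```

The annualization factor `24 * 365` matches the hourly bars used in this notebook.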


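The total return is well above 100% over the test period, meaning the strategy more than doubles the starting capital before any trading costs. This is impressive for a univariate model: it is about as basic as a model can be, yet it is profitable on this sample. Can a more complex model be more profitable?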
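
The backtest above does not charge anything for trading. One way to see how sensitive the result might be is to subtract a proportional cost whenever the position changes. This is a sketch under stated assumptions: `net_trade_log_returns` and `fee_per_unit_turnover` are hypothetical names, and the fee level is illustrative, not an exchange's actual schedule.

```python
import numpy as np

def net_trade_log_returns(signal, log_returns, fee_per_unit_turnover):
    """Subtract a proportional cost each time the position changes.

    fee_per_unit_turnover is a hypothetical cost in log-return units,
    e.g. 0.001 for 10 basis points per unit of position traded.
    """
    signal = np.asarray(signal, dtype=float)
    gross = signal * np.asarray(log_returns, dtype=float)
    # Turnover: size of the position change at each bar (starting from flat).
    previous = np.concatenate(([0.0], signal[:-1]))
    turnover = np.abs(signal - previous)
    return gross - fee_per_unit_turnover * turnover

signal = np.array([1, -1, -1, 1])                    # toy positions
log_returns = np.array([0.01, -0.02, 0.005, 0.01])   # toy realized returns
net = net_trade_log_returns(signal, log_returns, fee_per_unit_turnover=0.001)
print(net)
```

Flipping from long to short counts as two units of turnover, so with a per-bar EV of about 0.0005 even a small fee can remove most of the edge on bars where the signal flips.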

In [None]:
"""
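Despite very low predictive accuracy, the strategy is profitable because it only needs a directional edge, not precise predictions.
The fitted coefficient on the lag-2 return is positive, so the model picks up a slight positive relationship between the feature and the next return, which compounds when trading every hour over the full test period.
This is a common pattern in quantitative trading: exploiting a weak but persistent signal by trading frequently.
The R² being near zero isn't concerning in itself; what matters is that sign(y_pred) agrees with the actual return direction often enough to produce a positive expected return. No trading costs are modelled here, so the figures are before costs.

With a positive coefficient on the lag-2 return, the model is picking up short-term continuation at lag 2 rather than mean reversion.
Because the intercept is negative, the signal is short whenever the lag-2 return is below roughly 0.0106 (0.00065 / 0.0619), so part of the profit may reflect the price decline over the test period (close falls from about 0.20 to about 0.13).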
"""

## XGBoost

XGBoost is well known for winning Kaggle competitions. It is a non-linear model that works well with tabular data. Surely it must outperform a basic model that is taught in high school? Let's find out!

In [None]:
import xgboost as xgb

# 1. Create the model with default parameters (try experimenting with different hyperparameters)
model = xgb.XGBRegressor(
    # n_estimators=100,          # small number of trees
    # max_depth=2,               # very shallow trees
    # learning_rate=0.05,        # slow learning
    # subsample=0.8,             # row subsampling
    # colsample_bytree=1.0,      # only one feature anyway
    # min_child_weight=5,        # prevents small noisy splits
    # reg_alpha=0.1,             # L1 regularization
    # reg_lambda=1.0,            # L2 regularization
    # gamma=0.1,                 # minimum split gain
    # objective="reg:squarederror",
    # random_state=42
)

# 2. Train (fit) the model
model.fit(X_train, y_train)

# 3. Predict on the test set
y_pred = model.predict(X_test)

# 4. Evaluate Performance
xgb_mse = mean_squared_error(y_test, y_pred)
xgb_r2 = r2_score(y_test, y_pred)

print("MSE:", xgb_mse)
print("R² score:", xgb_r2)

# 5. Evaluate Prediction
df_test = df_test.copy()
df_test['y_hat'] = y_pred
df_test['dir_signal'] = np.sign(y_pred)
df_test['trade_log_return'] = df_test['dir_signal']  * df_test['close_log_return']
xgb_ev = df_test['trade_log_return'].mean()
print(f'EV = {xgb_ev}')
xgb_return = np.exp(df_test['trade_log_return'].sum()) - 1
print(f'Total Return = {xgb_return}')
xgb_std = df_test['trade_log_return'].std()
xgb_sharpe = xgb_ev / xgb_std * np.sqrt(24 * 365)
print(f'Sharpe = {xgb_sharpe}')


MSE: 0.0003310023748781532
R² score: -0.02478170394897461
EV = 0.00046680758713326965
Total Return = 1.3716692596654263
Sharpe = 2.4309822642772163


In [None]:
"""
XGBoost Default Parameters Explained
Tree Structure:

n_estimators=100 - Number of boosting rounds (trees built sequentially)
max_depth=6 - Maximum tree depth; deeper = more complex interactions but higher overfit risk
min_child_weight=1 - Minimum sum of instance weights needed in a child; higher = more conservative splits

Learning:

learning_rate=0.3 - Shrinkage applied to each tree; lower = slower learning, needs more trees
gamma=0 - Minimum loss reduction required to split; higher = more conservative
subsample=1.0 - Fraction of samples used per tree; <1.0 adds randomness, prevents overfitting
colsample_bytree=1.0 - Fraction of features used per tree; less relevant with single feature

Regularization:

reg_alpha=0 - L1 regularization on leaf weights; encourages sparsity
reg_lambda=1 - L2 regularization on leaf weights; smooths predictions

Objective:

objective='reg:squarederror' - Loss function (squared error for regression)

Context for this strategy:
With only one feature (the lag-2 return), XGBoost can in principle capture non-linear patterns that linear regression misses, such as asymmetric or regime-dependent responses to past returns.
The commented-out hyperparameters are a more heavily regularized configuration intended to reduce overfitting, which is a sensible thing to try on noisy hourly financial data.

"""

## Conclusion

Here's the summary:

In [None]:
summary = pd.DataFrame({
    'model': ['linear','xgb'],
    'mse': [linear_mse, xgb_mse],
    'r2': [linear_r2, xgb_r2],
    'ev': [linear_ev, xgb_ev],
    'total_return': [linear_return, xgb_return],
    'sharpe':[linear_sharpe, xgb_sharpe],
})
summary

Unnamed: 0,model,mse,r2,ev,total_return,sharpe
0,linear,0.000322,0.003779,0.000587,1.96069,3.056018
1,xgb,0.000331,-0.024782,0.000467,1.371669,2.430982
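
The `total_return` column is a fraction, so a value of 1.96 means a gain of 196%, not 96%. A small sketch of the conversion, using the two values from the table above (`describe_total_return` is a hypothetical helper name):

```python
def describe_total_return(total_return):
    """Express a fractional total return as a percentage gain and a growth multiple."""
    percent_gain = total_return * 100
    growth_multiple = 1 + total_return
    return percent_gain, growth_multiple

# Values copied from the summary table.
for name, total_return in [("linear", 1.96069), ("xgb", 1.371669)]:
    pct, multiple = describe_total_return(total_return)
    print(f"{name}: +{pct:.0f}% (capital x{multiple:.2f})")
```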


In [None]:
"""
Simpler is better here - Linear regression outperforms XGBoost on both prediction accuracy and trading returns.
Single feature limitation - With only one lagged return as the feature, there is little non-linearity for XGBoost to exploit. The extra flexibility mostly fits noise.
Both strategies are profitable - Even the "worse" model generates 137% returns with a 2.4 Sharpe, suggesting the underlying signal (the lag-2 relationship in hourly MELANIA returns) may carry a real edge, although trading costs are not modelled.
Default XGBoost too aggressive - The commented-out regularization parameters would likely help. The current defaults (max_depth=6, learning_rate=0.3) appear to overfit to training noise.
"""

The most basic model in machine learning beats the XGBoost model on every metric measured here. The most important metric is the total return: in the summary table it is 196% for linear regression versus 137% for XGBoost. Linear regression also has better risk-adjusted returns, as shown by its higher Sharpe ratio, so it earns more return per unit of risk as well as more return overall.

**Key Lesson** = **Simplicity** over **Complexity**