<a href="https://colab.research.google.com/github/zuzka05/stat_learn/blob/main/comparison_ml_tools.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Comparison of scikit-learn vs Keras vs PyTorch

This notebook compares how to build the same linear regression in scikit-learn, Keras and PyTorch.

### Load Data

We will use 1-hour OHLC data for the Melania crypto coin (MELANIAUSDT). The underlying instrument doesn't matter; I chose it because it's a small dataset.

In [1]:
import pandas as pd

df = pd.read_csv('MELANIAUSDT-1h-ohlc.csv')

df

Unnamed: 0,open_time,open,high,low,close,volume,close_time,quote_volume,count,taker_buy_volume,taker_buy_quote_volume,ignore
0,2025-01-20T09:00:00.000+0000,11.0000,11.4810,9.5700,10.6620,11384199.07,2025-01-20T09:59:59.999+0000,1.231390e+08,260948,5445930.66,5.899404e+07,0
1,2025-01-20T10:00:00.000+0000,10.6610,11.6940,10.5000,10.8480,12075271.07,2025-01-20T10:59:59.999+0000,1.343633e+08,382369,5854709.03,6.518769e+07,0
2,2025-01-20T11:00:00.000+0000,10.8480,11.0840,9.3810,9.5930,12843489.69,2025-01-20T11:59:59.999+0000,1.291159e+08,395669,6221973.06,6.251548e+07,0
3,2025-01-20T12:00:00.000+0000,9.5920,10.1100,7.8890,8.5610,21465129.93,2025-01-20T12:59:59.999+0000,1.919756e+08,484420,10232073.31,9.171677e+07,0
4,2025-01-20T13:00:00.000+0000,8.5650,8.7850,7.4100,7.9000,19772796.14,2025-01-20T13:59:59.999+0000,1.590931e+08,463466,9713763.14,7.826701e+07,0
...,...,...,...,...,...,...,...,...,...,...,...,...
7402,2025-11-24T19:00:00.000+0000,0.1325,0.1337,0.1292,0.1298,2802078.94,2025-11-24T19:59:59.999+0000,3.680209e+05,3166,1125561.23,1.479018e+05,0
7403,2025-11-24T20:00:00.000+0000,0.1299,0.1325,0.1294,0.1322,2156768.69,2025-11-24T20:59:59.999+0000,2.820491e+05,2749,1351718.72,1.767220e+05,0
7404,2025-11-24T21:00:00.000+0000,0.1321,0.1322,0.1290,0.1297,1666444.52,2025-11-24T21:59:59.999+0000,2.173586e+05,2703,656671.32,8.559823e+04,0
7405,2025-11-24T22:00:00.000+0000,0.1297,0.1312,0.1297,0.1304,709641.03,2025-11-24T22:59:59.999+0000,9.254184e+04,1311,401705.44,5.238444e+04,0


### Add Auto-Regression Log Returns

In [2]:
import numpy as np

df['close_log_return'] = np.log(df['close'] / df['close'].shift())
df['close_log_return_lag_1'] = df['close_log_return'].shift()
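
To make the two `shift()` calls concrete, here is a minimal sketch on a hypothetical five-price series (the prices are made up for illustration and are not from the dataset). It shows that the first row has no log return and the first two rows have no lagged return, which is why rows are dropped before splitting.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, made up for illustration only
toy = pd.DataFrame({"close": [100.0, 110.0, 99.0, 99.0, 108.9]})

# Log return: ln(close_t / close_{t-1}); the first row has no previous close
toy["close_log_return"] = np.log(toy["close"] / toy["close"].shift())
# Lag 1: the previous row's log return, so the first two rows are NaN here
toy["close_log_return_lag_1"] = toy["close_log_return"].shift()

print(toy)

# Rows with both a return and a lagged return, i.e. usable for the regression
n_complete = len(toy.dropna())
print(n_complete)
```

Only three of the five toy rows survive `dropna()`, mirroring how the real frame loses its first two rows.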

### Examine Serial Correlation

In [3]:
df[['close_log_return','close_log_return_lag_1']].corr()

Unnamed: 0,close_log_return,close_log_return_lag_1
close_log_return,1.0,-0.09287
close_log_return_lag_1,-0.09287,1.0


In [None]:
# The -0.09287 correlation suggests weak mean reversion in MELANIA returns at this frequency.
# Feature value: the lagged return may carry a small amount of predictive signal for the ML models - negative past returns weakly predict positive future returns.


There is a tiny, yet significant, negative autocorrelation between the close log return and its first lag: -0.09287
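
A rough way to sanity-check "tiny, yet significant" is the large-sample rule that, with no true autocorrelation, a sample correlation has a standard error of about 1/sqrt(n). The sketch below assumes roughly 7,400 return pairs (an approximation, not an exact count from the notebook) and independent, identically distributed returns. Crypto returns are heteroskedastic, so the real standard error is probably larger than this estimate.

```python
import math

r = -0.09287   # lag-1 correlation reported above
n = 7400       # ASSUMPTION: approximate number of return pairs, not an exact count

# Under the null of no autocorrelation, r is roughly normal with std 1/sqrt(n)
standard_error = 1.0 / math.sqrt(n)
z_score = r / standard_error

# Share of variance a linear fit on the lag could explain in-sample
r_squared = r ** 2

print(f"z = {z_score:.2f}, in-sample R^2 = {r_squared:.4f}")
```

The z-score is far beyond the usual 1.96 threshold, yet the squared correlation is below 1%, so the effect is detectable but explains very little of the variance.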

### Split Data into Test/Train using time split

In [4]:
def time_split(df, train_size=0.75):
    i = int(len(df) * train_size)
    return df[:i], df[i:]

df = df.dropna()
df_train, df_test = time_split(df)
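
`time_split` relies only on positional slicing, so the sketch below uses a plain Python list as a stand-in for the DataFrame (an assumption made to keep the example dependency-free). It illustrates the property that matters for time series: every training row comes before every test row, and nothing is shuffled.

```python
def time_split(rows, train_size=0.75):
    # Same logic as the notebook's helper: cut once, never shuffle
    i = int(len(rows) * train_size)
    return rows[:i], rows[i:]

# Hypothetical hourly timestamps 0..7, oldest first
hours = list(range(8))
train_hours, test_hours = time_split(hours)

print(train_hours)  # [0, 1, 2, 3, 4, 5]
print(test_hours)   # [6, 7]
```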

### Empirically check the split is correct

In [5]:
df_train

Unnamed: 0,open_time,open,high,low,close,volume,close_time,quote_volume,count,taker_buy_volume,taker_buy_quote_volume,ignore,close_log_return,close_log_return_lag_1
2,2025-01-20T11:00:00.000+0000,10.8480,11.0840,9.3810,9.5930,12843489.69,2025-01-20T11:59:59.999+0000,1.291159e+08,395669,6221973.06,6.251548e+07,0,-0.122947,0.017295
3,2025-01-20T12:00:00.000+0000,9.5920,10.1100,7.8890,8.5610,21465129.93,2025-01-20T12:59:59.999+0000,1.919756e+08,484420,10232073.31,9.171677e+07,0,-0.113817,-0.122947
4,2025-01-20T13:00:00.000+0000,8.5650,8.7850,7.4100,7.9000,19772796.14,2025-01-20T13:59:59.999+0000,1.590931e+08,463466,9713763.14,7.826701e+07,0,-0.080354,-0.113817
5,2025-01-20T14:00:00.000+0000,7.9000,9.3620,7.9000,9.2330,17759999.80,2025-01-20T14:59:59.999+0000,1.527149e+08,501861,9025218.75,7.759323e+07,0,0.155921,-0.080354
6,2025-01-20T15:00:00.000+0000,9.2330,9.2790,7.4440,7.6960,22842738.76,2025-01-20T15:59:59.999+0000,1.887794e+08,581312,10730193.18,8.871690e+07,0,-0.182083,0.155921
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5550,2025-09-08T15:00:00.000+0000,0.2016,0.2042,0.2008,0.2020,4680366.57,2025-09-08T15:59:59.999+0000,9.482345e+05,4510,2280901.61,4.620526e+05,0,0.001982,0.015496
5551,2025-09-08T16:00:00.000+0000,0.2019,0.2028,0.2000,0.2004,2091935.56,2025-09-08T16:59:59.999+0000,4.217470e+05,2319,979580.96,1.975355e+05,0,-0.007952,0.001982
5552,2025-09-08T17:00:00.000+0000,0.2005,0.2019,0.1998,0.2004,1628770.10,2025-09-08T17:59:59.999+0000,3.268314e+05,3380,885163.20,1.777163e+05,0,0.000000,-0.007952
5553,2025-09-08T18:00:00.000+0000,0.2004,0.2018,0.2000,0.2011,2445280.59,2025-09-08T18:59:59.999+0000,4.912744e+05,3025,1062316.67,2.134323e+05,0,0.003487,0.000000


In [6]:
df_test

Unnamed: 0,open_time,open,high,low,close,volume,close_time,quote_volume,count,taker_buy_volume,taker_buy_quote_volume,ignore,close_log_return,close_log_return_lag_1
5555,2025-09-08T20:00:00.000+0000,0.1998,0.2005,0.1974,0.1989,3011576.01,2025-09-08T20:59:59.999+0000,598252.878467,3724,1354996.32,269221.298790,0,-0.004515,-0.006485
5556,2025-09-08T21:00:00.000+0000,0.1990,0.2007,0.1988,0.2002,1092088.45,2025-09-08T21:59:59.999+0000,218236.491004,1710,538461.47,107655.530634,0,0.006515,-0.004515
5557,2025-09-08T22:00:00.000+0000,0.2001,0.2006,0.1989,0.1993,2279239.84,2025-09-08T22:59:59.999+0000,455394.303374,1972,1015413.56,202707.556904,0,-0.004506,0.006515
5558,2025-09-08T23:00:00.000+0000,0.1994,0.1997,0.1987,0.1991,559000.51,2025-09-08T23:59:59.999+0000,111348.526789,839,268537.57,53478.448064,0,-0.001004,-0.004506
5559,2025-09-09T00:00:00.000+0000,0.1992,0.2009,0.1987,0.2008,2646895.00,2025-09-09T00:59:59.999+0000,529172.646929,2978,1705709.64,340908.933952,0,0.008502,-0.001004
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7402,2025-11-24T19:00:00.000+0000,0.1325,0.1337,0.1292,0.1298,2802078.94,2025-11-24T19:59:59.999+0000,368020.932035,3166,1125561.23,147901.843397,0,-0.020588,-0.005269
7403,2025-11-24T20:00:00.000+0000,0.1299,0.1325,0.1294,0.1322,2156768.69,2025-11-24T20:59:59.999+0000,282049.053477,2749,1351718.72,176722.012247,0,0.018321,-0.020588
7404,2025-11-24T21:00:00.000+0000,0.1321,0.1322,0.1290,0.1297,1666444.52,2025-11-24T21:59:59.999+0000,217358.594789,2703,656671.32,85598.225699,0,-0.019092,0.018321
7405,2025-11-24T22:00:00.000+0000,0.1297,0.1312,0.1297,0.1304,709641.03,2025-11-24T22:59:59.999+0000,92541.838018,1311,401705.44,52384.437988,0,0.005383,-0.019092


### Create Features and Target

The features are the input to the model and the target is what we want to predict.

* X => denotes the model input
* y => denotes the model output

In [7]:
features = ['close_log_return_lag_1']
target = 'close_log_return'

X_train, X_test = df_train[features], df_test[features]
y_train, y_test = df_train[target], df_test[target]

In [8]:
X_train = X_train.to_numpy().astype("float32")
X_test  = X_test.to_numpy().astype("float32")
y_train = y_train.to_numpy().astype("float32")
y_test  = y_test.to_numpy().astype("float32")

### scikit-learn

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# 1. Create the model
model = LinearRegression()

# 2. Train (fit) the model
model.fit(X_train, y_train)

# 3. Predict on the test set
y_pred = model.predict(X_test)

# 4. Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("MSE:", mse)
print("R² score:", r2)


Coefficients: [-0.1366018]
Intercept: -0.0008170026
MSE: 0.00033087999327108264
R² score: -0.025400757789611816
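
For a single feature, ordinary least squares has a simple closed form: the slope is cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). R² is 1 - SS_res / SS_tot, where SS_tot is measured around the mean of the evaluated targets, so a negative test R² means the model did worse on the test set than simply predicting the test-set mean. The sketch below checks both facts on synthetic data (generated here for illustration, not the notebook's dataset).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for (lagged return, return); NOT the notebook's data
x = rng.normal(0.0, 0.02, size=500)
y = -0.1 * x + rng.normal(0.0, 0.02, size=500)

# Closed-form OLS for one feature
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

def r2(y_true, y_hat):
    ss_res = np.sum((y_true - y_hat) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean
    return 1.0 - ss_res / ss_tot

# In-sample OLS with an intercept can never score below zero
r2_fit = r2(y, slope * x + intercept)
# A constant prediction that is off from the mean scores below zero
r2_bad = r2(y, np.full_like(y, y.mean() + 0.01))

print(slope, intercept, r2_fit, r2_bad)
```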


### PyTorch

In [10]:
import torch
import torch.nn as nn
import torch.optim as optim

# Convert to tensors
X_train_t = torch.tensor(X_train)
y_train_t = torch.tensor(y_train).view(-1, 1)
X_test_t  = torch.tensor(X_test)

# 1. Create the model
model = nn.Linear(X_train.shape[1], 1)

# Define loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# 2. Train the model
epochs = 1000
for epoch in range(epochs):
    model.train()

    # Forward pass
    y_pred = model(X_train_t)

    # Compute loss
    loss = criterion(y_pred, y_train_t)

    # Backprop
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# 3. Predict on the test set
model.eval()
with torch.no_grad():
    y_pred_t = model(X_test_t).flatten()
y_pred = y_pred_t.numpy()

# 4. Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Extract coefficients and intercept
weights = model.weight.detach().numpy().flatten()
bias = model.bias.detach().numpy()[0]

print("Coefficients:", weights)
print("Intercept:", bias)
print("MSE:", mse)
print("R² score:", r2)


Coefficients: [-0.13660182]
Intercept: -0.0008170026
MSE: 0.00033087999327108264
R² score: -0.025400757789611816
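
The PyTorch numbers match scikit-learn because the mean squared error of a linear model is convex, so an iterative optimizer that runs long enough ends up at the same least-squares minimum. The sketch below illustrates this with plain full-batch gradient descent in NumPy, swapped in for Adam to keep the code short. The data are synthetic and unit-scale, and the step size and iteration count are choices for this toy problem only.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic unit-scale data; NOT the notebook's returns
x = rng.normal(size=400)
y = -0.3 * x + 0.05 + rng.normal(scale=0.5, size=400)

w, b = 0.0, 0.0          # parameters of y_hat = w * x + b
lr = 0.1                 # step size chosen for this toy problem
for _ in range(500):
    err = w * x + b - y                  # residuals of the current fit
    grad_w = 2.0 * np.mean(err * x)      # d(MSE)/dw
    grad_b = 2.0 * np.mean(err)          # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

# Least-squares reference solution (slope first, then intercept)
ols_w, ols_b = np.polyfit(x, y, 1)
print(w, b, ols_w, ols_b)
```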


### Keras

In [11]:
from tensorflow import keras
from tensorflow.keras import layers, Input
from sklearn.metrics import mean_squared_error, r2_score

# 1. Create the model
model = keras.Sequential([
    Input((X_train.shape[1],)),
    layers.Dense(1)
])

model.compile(optimizer="adam", loss="mse")

# 2. Train
model.fit(X_train, y_train, epochs=1000, verbose=0, batch_size=len(X_train))

# 3. Predict
y_pred = model.predict(X_test).flatten()

# 4. Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Extract coefficients and intercept
dense = model.layers[0]
weights = dense.get_weights()[0].flatten()
bias = dense.get_weights()[1][0]

print("Coefficients:", weights)
print("Intercept:", bias)
print("MSE:", mse)
print("R² score:", r2)


58/58 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Coefficients: [0.6406051]
Intercept: -0.00026124957
MSE: 0.0004463524091988802
R² score: -0.3832510709762573


## Conclusion

To conclude, I provide the following recommendations:

* Beginners => Start with scikit-learn. It's less intimidating than Keras and PyTorch, but move to Keras or PyTorch once you're ready to build more sophisticated models.
* Quant Researchers => PyTorch. It gives you fine-grained control over the ML pipeline and lets you quickly test new research ideas.

### scikit-learn

Pros:

* Very simple API that is the easiest to use
* Great for small data sets
* Perfect API for beginners

Cons:

* It gives little ability to customize the learning process
* `LinearRegression` solves least squares directly on the full data set rather than iteratively, so it won't work well for very large data sets
* It provides little flexibility to research new techniques and models

### Keras

Pros:

* Human-friendly API that lets you quickly prototype new models
* Removes a lot of the boilerplate code for training an ML model
* Supports different backends (TensorFlow, JAX, PyTorch) and runs on CPU or GPU

Cons:

* You have to learn the API to really understand how to customize it.
* If you need functionality that doesn't exist, it's hard to add it.
* The code is less Pythonic

### PyTorch

Pros:

* Industry Standard: a lot of tech and quant companies are using it now
* Research friendly: it is very easy to try out your research ideas
* Extremely customizable: you have fine-grained control over every detail of the model building
* Pythonic: it feels more like writing textbook Python code than using a framework that does lots of voodoo magic.
* Works well with large data sets
* Easier to debug

Cons:

* You will need to write the boilerplate training code that Keras would take care of for you
* Not beginner-friendly, especially compared to scikit-learn
* You will write more code compared to scikit-learn