# Naive Models

This notebook tests the performance of some simple models that use only the features at time $t$ to predict the response at time $t$, ignoring any information contained in the past. The models first apply feature reduction, using either principal component analysis or partial least squares with respect to the response. Classifiers are then trained on the reduced features to predict whether the response is positive or negative. The classifiers we tried are gradient-boosted decision trees and support vector machines with a Gaussian kernel (approximated by a Nyström feature map followed by a linear classifier).

The tree ensemble seems highly overfit and the linear model highly underfit.

If one prevents temporal data leakage by choosing training and test sets that consist of different days, the best result seems to be an F1 score of about 57%, with the recall much higher than the precision (the precision is generally poor, at around 52%).
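
The recall implied by those two figures can be backed out from the definition $F_1 = 2PR/(P+R)$. The following is a minimal sketch that takes the approximate values quoted above (0.57 and 0.52) as inputs; the helper name `recall_from_f1` is hypothetical and not part of the notebook.

```python
def recall_from_f1(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for the recall R."""
    denom = 2.0 * precision - f1
    if denom <= 0:
        raise ValueError("F1 must be smaller than twice the precision")
    return f1 * precision / denom


# approximate figures quoted in the text, not freshly measured values
implied_recall = recall_from_f1(f1=0.57, precision=0.52)
print(f"Implied recall: {implied_recall:.2f}")
```

With these inputs the implied recall comes out at roughly 0.63, which is consistent with the recall being well above the precision.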

In [None]:
%%capture
%pip install datatable

import os

import numpy as np

import datatable as dt
import pandas as pd

from sklearn.cross_decomposition import PLSRegression
# from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
# from sklearn.kernel_approximation import Nystroem
# from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
# from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from xgboost import XGBClassifier

Load the data.

In [None]:
# location of data files
comp_folder = os.path.join(os.pardir, "input", "jane-street-market-prediction")

# read the data with datatable, then convert to pandas (faster than reading with pandas directly)
df = dt.fread(os.path.join(comp_folder, "train.csv")).to_pandas()
df.set_index("ts_id", inplace=True)

# reduce memory usage
df = df.astype({c: np.float32 for c in df.select_dtypes(include="float64").columns})

# split into training and test sets
# train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# split by date, leaving a gap (days 350-399) between the two sets
# to reduce temporal correlations between training and test
train_df = df[df["date"] < 350]
test_df = df[df["date"] >= 400]

# split into features and target
feat_cols = [c for c in train_df.columns if "feature" in c]
train_X = train_df[feat_cols]
test_X = test_df[feat_cols]
train_y = train_df["resp"]
test_y = test_df["resp"]
train_weights = train_df["weight"]
test_weights = test_df["weight"]
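
The date-based split above can be tried out in isolation on toy data. This is a minimal sketch, not the notebook's own code: the frame `toy`, the helper `split_by_date` and the cut-off days are made up for illustration. It only checks that no day ends up on both sides and that the days between the two cut-offs are dropped as a buffer.

```python
import pandas as pd

# toy stand-in for the competition data: 10 days, 3 rows per day
toy = pd.DataFrame({"date": [d for d in range(10) for _ in range(3)]})
toy["resp"] = range(len(toy))


def split_by_date(frame, train_end, test_start):
    """Rows with date < train_end go to training, rows with
    date >= test_start go to test; days in between are dropped."""
    if test_start < train_end:
        raise ValueError("test_start must not precede train_end")
    train = frame[frame["date"] < train_end]
    test = frame[frame["date"] >= test_start]
    return train, test


toy_train, toy_test = split_by_date(toy, train_end=6, test_start=8)

# no day appears on both sides; days 6 and 7 act as a buffer
shared_days = set(toy_train["date"]) & set(toy_test["date"])
print(len(toy_train), len(toy_test), shared_days)
```

A wider buffer discards more data but makes it less likely that slowly decaying correlations carry over from the training days into the test days.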

Train a model.

In [None]:
# scale the targets to unit variance (no centering, so the sign is preserved)
train_y = train_y / train_y.std()

# targets as classification problem
train_y_pos = train_y.gt(0).astype(int)

# replace missing values by median
imp = SimpleImputer(strategy="median")
flow = imp.fit_transform(train_X)

# z-score the features
ss = StandardScaler()
flow = ss.fit_transform(flow)

# rotate features onto directions that have maximal
# covariance with the response
pls = PLSRegression(n_components=60) # 40 PCA components carry 95% variance
pls.fit(flow, train_y)
flow = pls.transform(flow)

# pca = PCA(n_components=50)
# flow = pca.fit_transform(flow)

# train an ensemble of gradient boosted
# decision trees on the new features
clf = XGBClassifier(
    n_estimators=200,
    max_depth=11,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.7,
    random_state=42,
    tree_method="gpu_hist",  # needs a GPU; XGBoost >= 2.0 prefers tree_method="hist" with device="cuda"
)
clf.fit(flow, train_y_pos)
pred = clf.predict(flow)

# transform the features with an approximate
# RBF kernel
# ker = Nystroem(kernel="rbf", n_components=300, random_state=42)
# flow = ker.fit_transform(flow)

# re-z-score the features
# ss2 = StandardScaler()
# flow = ss2.fit_transform(flow)

# fit a linear classifier with logistic loss
# (the loss is spelled "log_loss" from scikit-learn 1.1 onwards)
# svc = SGDClassifier(loss="log", random_state=42)
# svc.fit(flow, train_y_pos)
# pred = svc.predict(flow)

# metric on training data
print("TRAINING SET:")
print("Confusion matrix:")
print(confusion_matrix(train_y_pos, pred))
print(f"Precision: {precision_score(train_y_pos, pred)}")
print(f"Recall: {recall_score(train_y_pos, pred)}")
print(f"F1: {f1_score(train_y_pos, pred)}")

Evaluate on test set.

In [None]:
flow = imp.transform(test_X)
flow = ss.transform(flow)
flow = pls.transform(flow)
# flow = pca.transform(flow)
pred = clf.predict(flow)
# flow = ker.transform(flow)
# flow = ss2.transform(flow)
# pred = svc.predict(flow)

test_y_pos = test_y.gt(0).astype(int)

print("TEST SET:")
print("Confusion matrix:")
print(confusion_matrix(test_y_pos, pred))
print(f"Precision: {precision_score(test_y_pos, pred)}")
print(f"Recall: {recall_score(test_y_pos, pred)}")
print(f"F1: {f1_score(test_y_pos, pred)}")
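
The notebook extracts `train_weights` and `test_weights` but the metrics above treat every row equally. One possible variation, sketched here on made-up labels and weights, is to pass the weights through the `sample_weight` argument that the scikit-learn metric functions accept. Whether weighted metrics are the right yardstick for this problem is an assumption, not something the results above establish.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# toy labels, predictions and weights (made up for illustration)
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0])
w = np.array([2.0, 0.0, 1.0, 1.0, 3.0])

# unweighted: every row counts the same
plain_f1 = f1_score(y_true, y_pred)

# weighted: a zero-weight row has no influence, heavy rows count more
weighted_precision = precision_score(y_true, y_pred, sample_weight=w)
weighted_recall = recall_score(y_true, y_pred, sample_weight=w)
weighted_f1 = f1_score(y_true, y_pred, sample_weight=w)

print(f"Unweighted F1: {plain_f1:.3f}")
print(f"Weighted precision: {weighted_precision:.3f}")
print(f"Weighted recall: {weighted_recall:.3f}")
print(f"Weighted F1: {weighted_f1:.3f}")
```

In the evaluation cell the same idea would amount to adding `sample_weight=test_weights` to each metric call.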