# Jane Street Market Prediction

## Scikit-learn 0.24 install

Required for TimeSeriesSplit with the gap parameter. For more details on getting the install file, check [this notebook](https://www.kaggle.com/heylav/time-series-split-with-gap-using-just-sklearn).

In [None]:
# install from file; ignore the message regarding autogluon-core since that is not used
!pip install ../input/scikitlearn024/scikit_learn-0.24.0-cp37-cp37m-manylinux2010_x86_64.whl > /dev/null

In [None]:
# success check (should say 0.24)
!pip freeze | grep scikit-learn

## Data loading

Read the training data.

In [None]:
import pandas as pd

train_data = pd.read_csv("/kaggle/input/jane-street-market-prediction/train.csv", index_col='ts_id')

## Data filtering

* Operations with non-positive weight do not affect the score, so they are dropped.

* To save RAM, numeric (int or float) columns are converted from 64-bit to 32-bit types.

In [None]:
import numpy as np

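# keep only the operations that count for scoring
train_data = train_data[train_data["weight"] > 0]
# renumber the rows 0..n-1 so that index labels match row positions
# (ts_id was the index, so it is dropped here and cannot be set as the index again)
train_data.reset_index(drop=True, inplace=True)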

train_data = train_data.astype({col: np.float32 for col in train_data.select_dtypes('float64').columns})
train_data = train_data.astype({col: np.int32 for col in train_data.select_dtypes('int64').columns})
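
As a rough check of how much the 32-bit conversion saves, the sketch below applies the same `astype` pattern to a small synthetic frame and compares the memory used before and after. The frame `toy` and its contents are made up for illustration; only the column-selection and casting pattern mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (values are illustrative only).
toy = pd.DataFrame({
    "feature_0": np.random.default_rng(0).normal(size=1000),  # float64
    "date": np.arange(1000, dtype=np.int64) // 100,           # int64
})

bytes_before = toy.memory_usage(index=False).sum()

# Same pattern as above: pick columns by dtype and cast them to 32 bits.
float_cols = toy.select_dtypes("float64").columns
int_cols = toy.select_dtypes("int64").columns
toy_small = toy.astype({col: np.float32 for col in float_cols})
toy_small = toy_small.astype({col: np.int32 for col in int_cols})

bytes_after = toy_small.memory_usage(index=False).sum()
print("bytes before:", bytes_before, "bytes after:", bytes_after)
```

The trade-off is precision: `float32` keeps roughly 7 significant decimal digits, and `int32` only holds values up to about 2.1 billion, so the conversion is safe only when the data fits those limits.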

## Features and target definition

Define the predictors (x) and the variable to predict (y).

In [None]:
# the target variable is the action (1 to make the trading operation, 0 to skip it);
# for training, an operation counts as positive if its return (resp) exceeds a small threshold;
# the returns at the other time horizons are not used in the evaluation metric
train_data['action'] = (train_data['resp'] > 0.0001).astype('int')

# the only target variable is the action: this is a binary classification problem
full_y_train = train_data['action']

# the predictor variables are the feature columns
x_cols = ['feature_' + str(i) for i in range(0, 130)]
full_x_train = train_data[x_cols]

# date series to be used in splitting
date_series = train_data['date']

## Time Series Splitting

The competition data is a time series, so to avoid potential information leakage we use a splitting technique in which validation data is always temporally later than training data.

Find the date range: dates are the splitting unit, so that operations from the same day never end up in more than one split (which could happen if we split at operation level instead).

In [None]:
min_date = date_series.min()
max_date = date_series.max()

dates = list(range(min_date, max_date + 1))
print(dates)

Perform the date-level splitting.

In [None]:
from sklearn.model_selection import TimeSeriesSplit

split_num = 5
date_gap_num = 1

splitter = TimeSeriesSplit(n_splits=split_num, gap=date_gap_num)
date_splits = list(splitter.split(dates))
for i, (train_dates, valid_dates) in enumerate(date_splits):
    print("Date split #{}\ntrain: {}\nvalid: {}\n\n".format(i, train_dates, valid_dates))
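
To make the effect of `gap` easier to see, here is a minimal sketch on a toy range of 12 dates with 3 splits. The names `toy_dates`, `toy_splitter` and `toy_splits` are illustrative and not part of the notebook; the `gap` argument requires scikit-learn 0.24 or later.

```python
from sklearn.model_selection import TimeSeriesSplit

# Toy date range; the real notebook uses every date in the training data.
toy_dates = list(range(12))

# gap=1 leaves one date unused between the end of training and the start of validation.
toy_splitter = TimeSeriesSplit(n_splits=3, gap=1)
toy_splits = [(train.tolist(), valid.tolist()) for train, valid in toy_splitter.split(toy_dates)]

for train, valid in toy_splits:
    print("train:", train, "valid:", valid)
```

In every split the training dates all precede the validation dates, and the date immediately before each validation block is left out of training. The training window also grows from one split to the next, so the early splits train on far less data than the last one.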

Find the data frame indices associated with each split, based on dates. Note: if you want whole data frame or series splits instead of indices, see Version 1 of this notebook.

In [None]:
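def get_indices_from_dates(split_dates):
    # 0-based row positions (what cross_val_score expects) of the operations on the given dates;
    # split_dates holds positions in `dates`, which equal the date values when the range starts at 0
    return np.flatnonzero(date_series.isin(split_dates).to_numpy()).tolist()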

In [None]:
split_indices = list()
for train_dates, valid_dates in date_splits:
    train_indices = get_indices_from_dates(train_dates)
    valid_indices = get_indices_from_dates(valid_dates)
    split_indices.append((train_indices, valid_indices))
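
The mapping from dates to row positions can be checked on a tiny example. The sketch below uses a made-up `toy_date_series` and a hypothetical helper `positions_for_dates`; it is one possible way to get 0-based positions, and it does not depend on the index labels of the series.

```python
import numpy as np
import pandas as pd

# Made-up date column: several operations per date, sorted by time.
toy_date_series = pd.Series([0, 0, 1, 1, 1, 2, 3, 3], name="date")

def positions_for_dates(series, wanted_dates):
    """Return the 0-based row positions whose date is in wanted_dates (hypothetical helper)."""
    mask = series.isin(wanted_dates).to_numpy()
    return np.flatnonzero(mask)

# Train on dates 0 and 1, skip date 2 as the gap, validate on date 3.
toy_train_pos = positions_for_dates(toy_date_series, [0, 1])
toy_valid_pos = positions_for_dates(toy_date_series, [3])
print("train rows:", toy_train_pos, "valid rows:", toy_valid_pos)
```

Because all operations of a date share the same mask value, a date never ends up partly in training and partly in validation.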

## Cross validation

Use the splits to run cross validation using a classification model.

In [None]:
from xgboost import XGBClassifier

# Replace with your own model; this is just a non-optimized example with few estimators for a fast run time
# (tree_method='gpu_hist' needs a GPU runtime)
model = XGBClassifier(tree_method='gpu_hist', n_estimators=10)

In [None]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model, full_x_train.values, full_y_train, cv=split_indices, scoring='f1')
print("Scores:", cv_scores, "\tmean score:", np.mean(cv_scores))
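
For reference, `cross_val_score` accepts any iterable of `(train, validation)` position arrays as `cv`, which is what makes the precomputed date-based splits usable here. The sketch below shows that mechanism on synthetic data. It swaps in `LogisticRegression` so it runs without a GPU or XGBoost, and the data, the split boundaries and the `toy_` names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (not related to the competition data).
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 4))
toy_y = (toy_X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Hand-made splits: validation rows always come after the training rows, with a gap in between.
toy_cv = [
    (np.arange(0, 80), np.arange(100, 150)),
    (np.arange(0, 130), np.arange(150, 200)),
]

toy_scores = cross_val_score(LogisticRegression(), toy_X, toy_y, cv=toy_cv, scoring="f1")
print("Scores:", toy_scores, "\tmean score:", toy_scores.mean())
```

One score is returned per split, in the order the splits were given, so the last entry corresponds to the most recent validation period.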