# Kaggle Competition : Jane Street Market Prediction

**Competition Description**

“Buy low, sell high.” It sounds so easy….

In reality, trading for profit has always been a difficult problem to solve, even more so in today’s fast-moving and complex financial markets. Electronic trading allows for thousands of transactions to occur within a fraction of a second, resulting in nearly unlimited opportunities to potentially find and take advantage of price differences in real time.
In a perfectly efficient market, buyers and sellers would have all the agency and information needed to make rational trading decisions. As a result, products would always remain at their “fair values” and never be undervalued or overpriced. However, financial markets are not perfectly efficient in the real world.
Developing trading strategies to identify and take advantage of inefficiencies is challenging. Even if a strategy is profitable now, it may not be in the future, and market volatility makes it impossible to predict the profitability of any given trade with certainty. As a result, it can be hard to distinguish good luck from having made a good trading decision.
In the first three months of this challenge, you will build your own quantitative trading model to maximize returns using market data from a major global stock exchange. Next, you’ll test the predictiveness of your models against future market returns and receive feedback on the leaderboard.
Your challenge will be to use the historical data, mathematical tools, and technological tools at your disposal to create a model that gets as close to certainty as possible. You will be presented with a number of potential trading opportunities, which your model must choose whether to accept or reject.
In general, if one is able to generate a highly predictive model which selects the right trades to execute, they would also be playing an important role in sending the market signals that push prices closer to “fair” values. That is, a better model will mean the market will be more efficient going forward. However, developing good models will be challenging for many reasons, including a very low signal-to-noise ratio, potential redundancy, strong feature correlation, and difficulty of coming up with a proper mathematical formulation.
Jane Street has spent decades developing their own trading models and machine learning solutions to identify profitable opportunities and quickly decide whether to execute trades. These models help Jane Street trade thousands of financial products each day across 200 trading venues around the world.
Admittedly, this challenge far oversimplifies the depth of the quantitative problems Jane Streeters work on daily, and Jane Street is happy with the performance of its existing trading model for this particular question. However, there’s nothing like a good puzzle, and this challenge will hopefully serve as a fun introduction to a type of data science problem that a Jane Streeter might tackle on a daily basis. Jane Street looks forward to seeing the new and creative approaches the Kaggle community will take to solve this trading challenge.

**Data Description**

This dataset contains an anonymized set of features, feature_{0...129}, representing real stock market data. Each row in the dataset represents a trading opportunity, for which you will be predicting an action value: 1 to make the trade and 0 to pass on it. Each trade has an associated weight and resp, which together represent a return on the trade. The date column is an integer which represents the day of the trade, while ts_id represents a time ordering. In addition to anonymized feature values, you are provided with metadata about the features in features.csv.

In the training set, train.csv, you are provided a resp value, as well as several other resp_{1,2,3,4} values that represent returns over different time horizons. These variables are not included in the test set. Trades with weight = 0 were intentionally included in the dataset for completeness, although such trades will not contribute towards the scoring evaluation.

This is a code competition that relies on a time-series API to ensure models do not peek forward in time. To use the API, follow the instructions on the Evaluation page. When you submit your notebook, it will be rerun on an unseen test:

- During the model training phase of the competition, this unseen test set is comprised of approximately 1 million rows of historical data.
- During the live forecasting phase, the test set will use periodically updated live market data.

Note that during the second (forecasting) phase of the competition, the notebook time limits will scale with the number of trades presented in the test set. Refer to the Code Requirements for details.

**Files**

- train.csv - the training set, contains historical data and returns
- example_test.csv - a mock test set which represents the structure of the unseen test set. You will not be directly using the test set or sample submission in this competition, as the time-series API will get/set the test set and predictions.
- example_sample_submission.csv - a mock sample submission file in the correct format
- features.csv - metadata pertaining to the anonymized features


**Evaluation**

This competition is evaluated on a utility score. Each row in the test set represents a trading opportunity for which you will be predicting an action value, 1 to make the trade and 0 to pass on it. Each trade j has an associated weight and resp, which represents a return.

For each date i, we define:

   $p_i = \sum_{j} (weight_{ij} * resp_{ij} * action_{ij})$,

   $t = \frac{\sum_i p_i}{\sqrt{\sum_i p_i^2}} * \sqrt{\frac{250}{|i|}}$,

                       
where $|i|$ is the number of unique dates in the test set. The utility is then defined as:

   $u = \min{(\max{(t, 0)}, 6)} \sum_i p_i$
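
To make the definition concrete, here is a minimal NumPy sketch of the utility on made-up toy data. The function name `utility_score` and the numbers are illustrative assumptions, not part of the competition material; the only point is that $p_i$ is summed per date before $t$ and $u$ are computed.

```python
import numpy as np

def utility_score(dates, weights, resps, actions):
    """Utility u = min(max(t, 0), 6) * sum_i p_i, with p_i summed per date."""
    dates = np.asarray(dates)
    pnl = np.asarray(weights) * np.asarray(resps) * np.asarray(actions)
    # p_i: total weighted return of the accepted trades on each date i
    unique_dates, date_idx = np.unique(dates, return_inverse=True)
    p = np.bincount(date_idx, weights=pnl)
    # |i| is the number of unique dates
    t = p.sum() / np.sqrt((p ** 2).sum()) * np.sqrt(250 / len(unique_dates))
    return min(max(t, 0.0), 6.0) * p.sum()

# Toy data (made up): two dates, two trades each
dates = [0, 0, 1, 1]
weights = [1.0, 2.0, 1.0, 1.0]
resps = [0.5, -0.25, 1.0, 0.5]

u_good = utility_score(dates, weights, resps, actions=[1, 0, 1, 1])
u_bad = utility_score(dates, weights, resps, actions=[0, 1, 0, 0])
print(u_good, u_bad)
```

With the first set of actions, $p = (0.5, 1.5)$, so $t \approx 14.1$ is clipped to 6 and $u = 6 \times 2 = 12$. With the second, only a losing trade is accepted, $t$ is negative and the utility is clipped to zero. If every trade is rejected, the ratio is $0/0$ and the score is undefined.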
 

 $\rule{20cm}{0.1pt}$

## Notebook details

In the first part of the model (training only), 29 MLPs are trained; the inputs of each MLP are the features attached to one tag in features.csv. The output of each model is the probability that $\textit{resp} > 0$.

In the second part (cf. Pytorch -- TagModels + Stacking 2/2), a new, larger MLP combines these 29 outputs into a final probability.

The objective is to build a global model that uses the features the way Jane Street's metadata groups them, i.e. by tag.
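
As an illustration of this grouping, the sketch below builds one list of column indices per tag from a tiny, made-up stand-in for features.csv and slices a feature matrix with it. The dictionary `feature_tags`, the helper `columns_per_tag` and the 4-feature matrix are hypothetical; the real file has 130 features and 29 tags.

```python
import numpy as np

# Hypothetical miniature of features.csv: one entry per feature, one boolean per tag
feature_tags = {
    "feature_0": {"tag_0": False, "tag_1": False},
    "feature_1": {"tag_0": True, "tag_1": False},
    "feature_2": {"tag_0": True, "tag_1": True},
    "feature_3": {"tag_0": False, "tag_1": True},
}

def columns_per_tag(feature_tags, always_include=0):
    """Map each tag to the column indices of its features, plus one shared column."""
    groups = {}
    for name, tags in feature_tags.items():
        idx = int(name.split("_")[-1])  # "feature_12" -> 12
        for tag, flagged in tags.items():
            if flagged:
                groups.setdefault(tag, []).append(idx)
    for cols in groups.values():
        if always_include not in cols:
            cols.append(always_include)  # e.g. feature_0 is given to every sub-model
    return groups

groups = columns_per_tag(feature_tags)
X = np.arange(12, dtype=float).reshape(3, 4)  # 3 trades x 4 features
X_tag0 = X[:, groups["tag_0"]]                # input seen by the tag_0 sub-model
print(groups, X_tag0.shape)
```

Each sub-model then only sees the columns of its own tag, and the stacking model sees one probability per tag.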

$\underline{\textit{NB :}}$
- $\textit{feature_0}$ is added to each tag.
- Data with $\textit{date} \leq 85$ are deleted (cf. https://www.kaggle.com/c/jane-street-market-prediction/discussion/201930)
- Data with $\textit{weight} = 0$ are deleted
- A customized loss is built manually in order to maximize the first part of the $\textit{utility}$ evaluation function (Kaggle competition metric): $1 - \frac{\sum_{i} {p(pred)_i}}{\sum_{i} {p_i}}$

    where :
    
    $p(pred)_i = \sum_{j} (weight_{ij} * resp_{ij} * action(pred)_{ij})$,

    $p_i = \sum_{j} (weight_{ij} * resp_{ij} * action_{ij})$ and

    $action_{ij} = \mathbb{1}_{\{resp_{ij} > 0\}}$
            
            
- Early stopping is used ($\searrow$ overfitting).
- The train set used for training TagModel1 differs from the train set used for training TagModel2, etc. ($\searrow$ overfitting).
- Time ordering is, or will be, partially broken by:
    - the deletion of data with $\textit{weight = 0}$.
    - the gap between train set and eval set (Private LeaderBoard) ?
    - the delay between trades is not equal ?
    - the fact that the model is not a RNN.
    
    So, I decided to fully break time ordering by shuffling data before training. It seems to have the effect of reducing overfitting too.

- Shuffling makes $|i|$ inconsistent on the validation set, so I added an option to the utility function that fixes $\sqrt{\frac{250}{|i|}} = 1$.
    
- $resp_1, resp_2, resp_3, resp_4$ are not used but it could be a good idea to integrate them.

- Warning GPU quota: this notebook takes about 9 hours to run on my computer, so the weights of each TagModel are available in ../input/weights-tagmodels.
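
The customized loss from the notes above can be sketched in plain NumPy to see how it behaves on a batch. The function name `p1_utility_loss` and the toy values are assumptions made for illustration; the notebook's actual loss is the PyTorch module defined further down.

```python
import numpy as np

def p1_utility_loss(pred, weights, resps):
    """1 - sum(w * r * pred) / sum(w * r * 1[r > 0]) over one batch."""
    pred, weights, resps = map(np.asarray, (pred, weights, resps))
    w_r = weights * resps
    best = (w_r * (resps > 0)).sum()  # return captured by the ideal actions
    got = (w_r * pred).sum()          # return captured by the predictions
    return 1.0 - got / best

# Toy batch (made up): three trades, the second one loses money
weights = np.array([1.0, 2.0, 1.0])
resps = np.array([0.5, -0.25, 1.0])

loss_perfect = p1_utility_loss([1, 0, 1], weights, resps)
loss_reject_all = p1_utility_loss([0, 0, 0], weights, resps)
loss_accept_all = p1_utility_loss([1, 1, 1], weights, resps)
print(loss_perfect, loss_reject_all, loss_accept_all)
```

The loss is 0 when the predictions match the ideal actions, 1 when every trade is rejected, and in between when losing trades are accepted. It is undefined for a batch with no positive `resp`, since the denominator is then zero.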

**Load Libraries**

In [None]:
import os
import sys
import json
import time
import random
from datetime import datetime

import numpy as np
import pandas as pd
import datatable as dt
from tqdm.notebook import tqdm
from sklearn.metrics import accuracy_score, precision_score, recall_score
from scipy.special import expit

import torch
import torch.nn as nn

print(f"Pytorch version : {torch.__version__}")

if torch.cuda.is_available():
    print(f"GPU : {torch.cuda.get_device_name()} available")
else:
    print("Error : GPU not available")
    sys.exit(1)

**Functions**

In [None]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    return None

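def utility(dates, weights, true_resp, actions, use_mult=False):
    """Jane Street evaluation metric (p_i is aggregated per date)"""
    # p_i = sum over the trades j of date i of weight * resp * action
    pnl = np.asarray(weights) * np.asarray(true_resp) * np.asarray(actions)
    unique_dates, date_idx = np.unique(dates, return_inverse=True)
    Pi = np.bincount(date_idx, weights=pnl)
    if use_mult:
        # |i| = number of unique dates in the evaluated set
        mult = np.sqrt(250 / len(unique_dates))
    else:
        mult = 1
    sum_Pi = Pi.sum()
    norm_Pi = np.sqrt((Pi ** 2).sum())
    t = (sum_Pi / norm_Pi) * mult
    u = min(max(t, 0), 6) * sum_Pi
    return u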

def compute_utility_many(predictions, dates, weights, true_resp, interval=np.linspace(0, 1, 101)):
    """Given predicted probabilities, compute the utility for several thresholds"""
    lst_ut = []
    for v in interval:
        actions = (predictions > v).astype(int)
        ut = utility(dates=dates, weights=weights, true_resp=true_resp, actions=actions)
        lst_ut.append((v, ut))
    return lst_ut

def build_dic_records(datasets=(), dic_tags=None, records=()):
    """Build the nested dict (dataset > tag > record) that stores the training history"""
    dic_tags = dic_tags or {}  # avoid len(None) when no tags are given
    dic = {}
    for dataset in datasets:
        dic[dataset] = {}
        for tag in range(len(dic_tags)):
            dic[dataset]['tag_'+str(tag)] = {}
            for record in records:
                dic[dataset]['tag_'+str(tag)][record] = []
    return dic
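
A list of `(threshold, utility)` pairs such as the one returned by `compute_utility_many` can then be reduced to a single operating threshold. The sketch below is a simplified, self-contained illustration: `best_threshold` is a hypothetical helper, the data are made up, and the score is a stand-in (total weighted return of the accepted trades) rather than the full utility.

```python
import numpy as np

def best_threshold(pairs):
    """Return the (threshold, score) pair with the highest score."""
    return max(pairs, key=lambda pair: pair[1])

# Hypothetical predictions and weighted returns (weight * resp) for five trades
predictions = np.array([0.9, 0.6, 0.4, 0.2, 0.7])
weighted_resp = np.array([1.0, -0.5, 0.3, -0.2, 0.8])

# Simplified stand-in for utility(): sum of weight * resp over accepted trades
pairs = [
    (float(v), float(weighted_resp[predictions > v].sum()))
    for v in (0.1, 0.3, 0.5, 0.65, 0.8)
]
threshold, score = best_threshold(pairs)
print(threshold, score)
```

On this toy data the best cut-off is 0.65, which keeps only the two most confident (and profitable) trades. A threshold picked this way should be chosen on validation data, not on the training set.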

**Settings**

In [None]:
SEED = 2021
DATE_NOW = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
DEVICE = torch.device("cuda")

PATH_ROOT = "/kaggle/input/jane-street-market-prediction"
PATH_WEIGHT_TAGMODEL = "../input/weights-tagmodels"
PATH_DATA = os.path.join(PATH_ROOT, "train.csv")
PATH_FEATURES =  os.path.join(PATH_ROOT, "features.csv")

LST_FEATURES = ["feature_"+str(n_feat) for n_feat in range(0, 130, 1)]
LST_TARGETS = ["resp", "resp_1", "resp_2", "resp_3", "resp_4"]

SIZE_TRAIN = .85
TRAINING_TAGS = False
SAVE_DICT = False

**Preprocessing**

In [None]:
%%time
seed_everything(seed=SEED)

data = dt.fread(PATH_DATA).to_pandas()  # fast loading
data = data[data.date > 85]  # keep date > 85 (drops date <= 85)
data = data.sample(frac=1)  # shuffle
data.reset_index(drop=True, inplace=True)
print(f"shape : {data.shape}")

df_tags = pd.read_csv(PATH_FEATURES, index_col="feature")
dic_tags = {}
for n, tag in enumerate(df_tags.columns):
    lst_features = df_tags[tag][df_tags[tag] == True].index.tolist()
    # take the number after the "feature_" prefix (str.strip removes characters, not a prefix)
    lst_num_features = [int(e.split("_")[-1]) for e in lst_features]
    dic_tags[str(n)] = lst_num_features

# add feature_0 in all tags
for e in dic_tags.keys():
    dic_tags[e].append(0)
    
f_mean = data[LST_FEATURES[1:]].mean()
data = data.loc[data.weight > 0].reset_index(drop = True)  # delete 0 weight data
data[LST_FEATURES[1:]] = data[LST_FEATURES[1:]].fillna(f_mean)  # filling NaN by mean of each feature
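
The order of the last three steps matters: the feature means are computed on all rows, including the zero-weight ones, and only then are zero-weight rows dropped and missing values filled. A minimal NumPy sketch of that order, on a made-up 4 x 2 table (all names and values here are illustrative):

```python
import numpy as np

# Hypothetical toy table: 4 trades x 2 features, with missing values
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0],
              [5.0, 6.0]])
weight = np.array([0.0, 1.0, 2.0, 0.0])

col_mean = np.nanmean(X, axis=0)  # means over all rows, NaN ignored
keep = weight > 0                 # drop zero-weight trades
X_kept = X[keep]
# replace each NaN by the mean of its column
X_filled = np.where(np.isnan(X_kept), col_mean, X_kept)
print(col_mean, X_filled)
```

`np.nanmean` plays the role of `DataFrame.mean()`, which skips NaN by default. Note that the zero-weight rows still contribute to the means even though they are not used for training.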

**Pytorch utils**

In [None]:
class EarlyStopping:
    
    def __init__(self, patience=7, mode="max", delta=0.0, verbose=False, trace_func=print, path="checkpoint.pt"):
        
        self.patience = patience
        self.counter = 0
        self.mode = mode
        self.best_score = None
        self.early_stop = False
        self.delta = delta
        self.verbose = verbose
        self.trace_func = trace_func
        self.path = path
        
        if self.mode == "min":
            self.val_score = np.inf  # np.Inf was removed in NumPy 2.0

        else:
            self.val_score = -np.inf

    def __call__(self, epoch_score, model):

        if self.mode == "min":
            score = -1.0 * epoch_score
            
        elif self.mode == "max":
            score = np.copy(epoch_score)
        
        # first epoch
        if self.best_score is None:
            self.best_score = score
            self.save_checkpoint(epoch_score, model)
        
        # best score NOT modified
        elif score < self.best_score + self.delta:
            self.counter += 1
            self.trace_func(f"EarlyStopping counter: {self.counter} out of {self.patience}")
            if self.counter >= self.patience:
                self.early_stop = True
                
        # best score modified        
        else:
            self.best_score = score
            self.save_checkpoint(epoch_score, model)
            self.counter = 0

    def save_checkpoint(self, epoch_score, model):
        """
        Save the model weights when the monitored validation metric improves.
        """
        if self.verbose:
            self.trace_func(f'Validation metric moving : ({self.val_score:.6f} --> {epoch_score:.6f}).  Saving model ...')

        # skip the checkpoint when the metric is NaN or infinite
        # (a membership test against np.nan does not catch a computed NaN)
        if np.isfinite(epoch_score):
            torch.save(model.state_dict(), self.path)

        self.val_score = epoch_score

class BuildDataset:
    
    def __init__(self, df, col_x, target):
        self.col_x = col_x.copy()
        self.target = target.copy()
        self.X = df[self.col_x].values
        self.y = (df[self.target] > 0).astype('int').values
        self.weights = df.weight.values
        self.resps = df.resp.values
        self.actions = (df.resp > 0 ).astype('int').values

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return {
            'features': torch.tensor(self.X[idx], dtype=torch.float),
            'label': torch.tensor(self.y[idx], dtype=torch.float),
            'weights': torch.tensor(self.weights[idx], dtype=torch.float),
            'resps': torch.tensor(self.resps[idx], dtype=torch.float),
            'actions': torch.tensor(self.actions[idx], dtype=torch.float)
        }
    
class P1UtilityLoss(torch.nn.Module):
    """
    Customized loss based on the first part of the utility evaluation metric:
    1 - sum(w * r * pred) / sum(w * r * true_actions)
    """

    def __init__(self, threshold=.5):
        super().__init__()
        self.threshold = torch.tensor(threshold)  # stored but not used in forward

    def forward(self, true_actions, pred, weights, resps):
        w_r = weights * resps
        pi_true_sum = torch.sum(w_r * true_actions)
        pi_pred_sum = torch.sum(w_r * pred)
        return 1 - pi_pred_sum / pi_true_sum

class TagModel(torch.nn.Module):
    """
    MLP : batchnorm0 > dropout > dense0 > relu > batchnorm1 > dropout > dense1 > sigmoid
    """
    
    def __init__(self, dic_tags, tag_number, input_size, output_size, threshold=.5, rate_dropout=.1):
        
        super(TagModel, self).__init__()
        self.dic_tags = dic_tags
        self.tag_number = str(tag_number)
        self.input_size = input_size
        self.output_size = output_size
        self.rate_dropout = rate_dropout
        
        self.sigmoid = torch.nn.Sigmoid()
        self.dropout = nn.Dropout(self.rate_dropout)
        self.batch_norm0 = nn.BatchNorm1d(len(self.dic_tags[self.tag_number]))
        self.batch_norm1 = nn.BatchNorm1d(2 * len(self.dic_tags[self.tag_number]))
        self.dense0 = torch.nn.Linear(
            len(self.dic_tags[self.tag_number]),
            2 * len(self.dic_tags[self.tag_number])
        )
        self.dense1 = torch.nn.Linear(
            2 * len(self.dic_tags[self.tag_number]),
            self.output_size
        )
        self.relu = torch.nn.ReLU()
        
    def forward(self, x):
        
        x = self.batch_norm0(x[:, self.dic_tags[self.tag_number]])
        x = self.dropout(x)
        x = self.dense0(x)
        
        x = self.relu(x)
        
        x = self.batch_norm1(x)
        x = self.dropout(x)
        x = self.dense1(x)
        
        x = self.sigmoid(x)
        return x
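
The patience logic of `EarlyStopping` in `"min"` mode can be followed on a plain list of validation losses. The helper below is a simplified, framework-free sketch written for illustration (the name `stopping_epoch` and the loss values are assumptions); it ignores checkpoint saving and only tracks the counter.

```python
def stopping_epoch(val_losses, patience=3, delta=0.0):
    """Return (epoch at which training stops, epoch of the best loss)."""
    best, best_epoch, counter = None, None, 0
    for epoch, loss in enumerate(val_losses):
        if best is None or loss <= best - delta:
            # improvement: this is where a checkpoint would be saved
            best, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: the best value is reached at epoch 1
losses = [0.9, 0.8, 0.85, 0.82, 0.81, 0.7]
stop_short = stopping_epoch(losses, patience=3)
stop_long = stopping_epoch(losses, patience=5)
print(stop_short, stop_long)
```

With `patience=3` training stops at epoch 4 and never sees the better loss at epoch 5; with `patience=5` it does. This is the trade-off behind the small patience used for the TagModels: less overfitting and shorter runs, at the risk of stopping on a plateau.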

**Training all TagModel**

In [None]:
# Model : architecture
batch_size = 16384
input_size = 130
output_size = 1

# Model : training
num_epochs = 100
learning_rate = 0.001
threshold =.5
es_mode = "min"
patience = 3

dic_records = build_dic_records(
            datasets=["train", "val"],
            dic_tags=dic_tags,
            records=["lst_loss_epoch", "lst_loss_batch", "lst_utility", "lst_accuracy", "lst_precision", "lst_recall"]
)

if TRAINING_TAGS:

    pbar_tag = tqdm(total=len(dic_tags), position=0)

    for tag in range(len(dic_tags)):
        str_tag = str(tag)

        # sample data
        data = data.sample(frac=1)
        data.reset_index(drop=True, inplace=True)

        n_train = int(data.shape[0] * SIZE_TRAIN)
        train_index = list(range(n_train))
        # validation starts right after the last training row (no shared row)
        val_index = list(range(n_train, data.shape[0]))

        # build sets for training
        train = data.loc[train_index]
        val = data.loc[val_index]
        train_set = BuildDataset(df=train, col_x=LST_FEATURES, target=LST_TARGETS)
        val_set = BuildDataset(df=val, col_x=LST_FEATURES, target=LST_TARGETS)
        train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)
        val_loader = torch.utils.data.DataLoader(val_set, batch_size=batch_size)

        # compute and save perfect utility for each set
        perfect_train_utility = utility(dates=train.date.values, weights=train.weight.values, true_resp=train.resp.values, actions=(train.resp > 0).astype(int))
        perfect_val_utility = utility(dates=val.date.values, weights=val.weight.values, true_resp=val.resp.values, actions=(val.resp > 0).astype(int))
        dic_records["train"]["tag_"+str_tag]["perfect_utility"] = perfect_train_utility
        dic_records["val"]["tag_"+str_tag]["perfect_utility"] = perfect_val_utility
        dic_records["train"]["tag_"+str_tag]["lst_index"] = train_index
        dic_records["val"]["tag_"+str_tag]["lst_index"] = val_index

        # define utils for model
        torch.cuda.empty_cache()
        model = TagModel(dic_tags, tag, input_size, output_size).to(DEVICE)
        p1_utility_loss = P1UtilityLoss(threshold)
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
        path_model = f"./TagModel{str_tag}_{DATE_NOW}.pt"
        early_stopping = EarlyStopping(mode=es_mode, patience=patience, verbose=True, path=path_model)

        pbar_epoch = tqdm(total=num_epochs, position=1)

        for epoch in range(num_epochs):

            dic_records["train"]["tag_"+str_tag]["lst_loss_batch"] = []
            dic_records["val"]["tag_"+str_tag]["lst_loss_batch"] = []
            train_outputs = np.empty(shape=(len(train)))
            val_outputs = np.empty(shape=(len(val)))

            pbar_batch = tqdm(total=len(train_loader), position=2)

            for i, train_batch in enumerate(train_loader):

                # Allow training <=> moving weights
                model.train()

                # Remember index on train data
                start_ind = i * batch_size
                end_ind = start_ind + batch_size

                # Send to GPU
                X = train_batch["features"].to(DEVICE)
                y = train_batch["label"].to(DEVICE)
                w = train_batch["weights"].to(DEVICE)
                r = train_batch["resps"].to(DEVICE)
                a = train_batch["actions"].to(DEVICE)

                # Forward pass
                outputs = model(X.float())
                loss = p1_utility_loss(a, outputs[:, 0], w, r)  # tensors are already on DEVICE
                # Backward and optimize
                optimizer.zero_grad() 
                loss.backward()
                optimizer.step()

                # Save 
                dic_records["train"]["tag_"+str_tag]["lst_loss_batch"].append(loss.item())
                train_outputs[start_ind:end_ind] = outputs[:,0].cpu().detach().numpy()
                pbar_batch.update(1)

            # compute loss on validation
            model.eval()
            with torch.no_grad():
                for i_, val_batch in enumerate(val_loader):

                    # Remember index on validation data
                    start_ind_ = i_ * batch_size
                    end_ind_ = start_ind_ + batch_size

                    # Send to GPU
                    X_ = val_batch["features"].to(DEVICE)
                    y_ = val_batch["label"].to(DEVICE)
                    w_ = val_batch["weights"].to(DEVICE)
                    r_ = val_batch["resps"].to(DEVICE)
                    a_ = val_batch["actions"].to(DEVICE)

                    # Compute
                    outputs_ = model(X_.float())
                    loss_ = p1_utility_loss(a_, outputs_[:, 0], w_, r_)  # tensors are already on DEVICE
                    # Save
                    dic_records["val"]["tag_"+str_tag]["lst_loss_batch"].append(loss_.item())
                    val_outputs[start_ind_:end_ind_] = outputs_[:,0].cpu().detach().numpy()
            dic_records["train"]["tag_"+str_tag]["lst_loss_epoch"].append(np.mean(dic_records["train"]["tag_"+str_tag]["lst_loss_batch"]))
            dic_records["val"]["tag_"+str_tag]["lst_loss_epoch"].append(np.mean(dic_records["val"]["tag_"+str_tag]["lst_loss_batch"]))

            # Metrics on thresholded predictions (early stopping itself uses the validation loss below)
            train_actions = (train_outputs > threshold).astype(int)
            train_utility = utility(dates=train.date.values, weights=train.weight.values, true_resp=train.resp.values, actions=train_actions)
            train_accuracy = accuracy_score((train.resp > 0).astype(int), train_actions)
            train_precision = precision_score((train.resp > 0).astype(int), train_actions)
            train_recall = recall_score((train.resp > 0).astype(int), train_actions)
            dic_records["train"]["tag_"+str_tag]["lst_utility"].append(train_utility)
            dic_records["train"]["tag_"+str_tag]["lst_accuracy"].append(train_accuracy)
            dic_records["train"]["tag_"+str_tag]["lst_precision"].append(train_precision)
            dic_records["train"]["tag_"+str_tag]["lst_recall"].append(train_recall)
            
            val_actions = (val_outputs > threshold).astype(int)
            val_utility = utility(dates=val.date.values, weights=val.weight.values, true_resp=val.resp.values, actions=val_actions)
            val_accuracy = accuracy_score((val.resp > 0).astype(int), val_actions)
            val_precision = precision_score((val.resp > 0).astype(int), val_actions)
            val_recall = recall_score((val.resp > 0).astype(int), val_actions)
            dic_records["val"]["tag_"+str_tag]["lst_utility"].append(val_utility)
            dic_records["val"]["tag_"+str_tag]["lst_accuracy"].append(val_accuracy)
            dic_records["val"]["tag_"+str_tag]["lst_precision"].append(val_precision)
            dic_records["val"]["tag_"+str_tag]["lst_recall"].append(val_recall)
            
            msg1 = '~~~ Epoch [{}/{}], Loss train: {:.4f}, Loss val {:.4f}'.format(
                    epoch + 1,
                    num_epochs,
                    dic_records["train"]["tag_"+str_tag]["lst_loss_epoch"][-1],
                    dic_records["val"]["tag_"+str_tag]["lst_loss_epoch"][-1]
            )
            msg2 = '~~~ train utility: {:.4f}, val utility {:.4f}'.format(
                    dic_records["train"]["tag_"+str_tag]["lst_utility"][-1],
                    dic_records["val"]["tag_"+str_tag]["lst_utility"][-1]
            )
            pbar_epoch.update(1)
            pbar_epoch.write(msg1)
            pbar_epoch.write(msg2)

            early_stopping(dic_records["val"]["tag_"+str_tag]["lst_loss_epoch"][-1], model)
            if early_stopping.early_stop:
                print("Early stopping")
                break
                
        pbar_tag.update(1)
        pbar_tag.write(f"End training TagModel {str_tag}")
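
Because each TagModel is trained on a fresh shuffle, it is worth checking that a contiguous split of the shuffled rows leaves no row shared between the train and validation sets. The sketch below is a standalone illustration with the standard library only; `shuffled_split` is a hypothetical helper, not the notebook's code, and it splits positions rather than a DataFrame.

```python
import random

def shuffled_split(n_rows, train_fraction=0.85, seed=2021):
    """Shuffle row positions, then cut them into two disjoint contiguous parts."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    cut = int(n_rows * train_fraction)
    # order[:cut] and order[cut:] cannot overlap: position cut belongs to validation only
    return order[:cut], order[cut:]

train_idx, val_idx = shuffled_split(n_rows=20)
overlap = set(train_idx) & set(val_idx)
print(len(train_idx), len(val_idx), overlap)
```

The key detail is that the validation range starts at `cut`, the first position after the training range, and not at the last training position. Using a different seed per tag would reproduce the "different train set per TagModel" behaviour described in the notes.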

**Save dict**

In [None]:
if SAVE_DICT:
    with open("./dic_records_training_TagModel.json", "w") as out:  
        json.dump(dic_records, out) 