## Notes about the data: 
> 1. The training data has over 2 million rows, which is the largest I have dealt with so far.
> 2. <strike>The difference between train and test data is resp columns, which I don't understand what they signify so far.</strike> 
> 3. resp represent market returns over different time horizons
> 4. weight and resp represent the return on the trade, but resp isn't included in the test data, so one thought is to predict resp in the test data based on features and then predict the action.
> 5. <strike>action isn't included in the training set, why?</strike>
> 6. Point no. 4 was on point, but I didn't comprehend that I would use the return to determine the action directly. 
> 7. What should be done here is to simply set the action to 1 if the return is positive, or 0 if the return is negative. 
> 8. The scoring function is a utility function computed from the real test <strike>resps</strike> resp values, which are present during evaluation but absent at prediction time. Each action we set to 1 therefore either raises the utility sum (if that trade's resp is positive) or lowers it (if that trade's resp is negative).
> 9. Also, the weight of a trade scales the effect that a single trade has on the scoring function.
> 10. Should the sign of <strike>resps sum</strike> resp be enough for setting an action to 1 or 0, or should there be a threshold? If there shall be a threshold, how would that be determined?
> 11. resp isn't the sum of resp_{1, 2, 3, 4}
> 12. Weights are heavily right skewed, which could mean that there are two strategies here: focusing on the high-weighted trades, and focusing on the low-weighted trades. Maybe for each range there should be a different model.
> 13. Different weight ranges should be treated differently by a model. Or maybe different models would be suited for different weight ranges, as the resp distribution differs according to weight. Also, maybe each model's outputs should be clipped to the resp range of its specific weight range?
> 14. resp_4 is most similar to resp in distribution, and also highly correlated with resp.
> 15. Trade counts evidently differ between days. Would this have any implication during training?
> 16. Mean resp_{1, 2, 3, 4} differ over days and could be used for feature engineering through mean encodings and more.
> 17. Several feature blocks show high correlation with other blocks or with each other.
> 18. Feature ranges differ greatly and need standardization before training.
> 19. Most popular cv strategy in kernels is GroupTimeSeriesSplit.
> 20. Should we use rows with weight 0? For example, we could predict resps instead of action, then calculate the action based on (weight * resp) > 0. I guess this would boost our model's ability by giving it more data?
> 21. What if we predicted resp_{1, 2, 3, 4} and then decided a majority vote on action based on all?
> 22. If we chose action to be resp > 0, should we evaluate model performance based on precision and recall, or ROC AUC? That might depend on the distribution of action.
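The majority-vote idea from note 21 can be sketched on toy data. This is only an illustration under assumptions: the frame `toy_pred` stands in for model predictions of resp_{1, 2, 3, 4}, its values are made up, and the helper name `majority_vote_action` and its `min_votes` parameter are hypothetical.

```python
import pandas as pd

# Toy frame standing in for predicted returns per horizon (made-up values).
toy_pred = pd.DataFrame({
    "resp_1": [0.01, -0.02, 0.03],
    "resp_2": [0.02, -0.01, -0.01],
    "resp_3": [-0.01, -0.03, 0.02],
    "resp_4": [0.03, 0.01, -0.02],
})


def majority_vote_action(pred_df, min_votes=3):
    """Return 1 where at least `min_votes` horizons predict a positive return."""
    votes = (pred_df > 0).sum(axis=1)
    return (votes >= min_votes).astype(int)


toy_action = majority_vote_action(toy_pred)
print(toy_action.tolist())
```

With four horizons a 2-2 tie is possible, so `min_votes=3` treats a tie as "no trade"; whether that is the better choice would have to be checked against the utility score.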

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import cudf
import janestreet
import gc

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
PLOT = False

In [None]:
# load data
dir_path = '/kaggle/input/jane-street-market-prediction/'

train = cudf.read_csv(dir_path+'train.csv')
# feats = pd.read_csv(dir_path+'features.csv')
# test = pd.read_csv(dir_path+'example_test.csv')
# submission = pd.read_csv(dir_path+'example_sample_submission.csv')

# Data Exploration

## What are the dimensions of the data?

In [None]:
print("Dimensions\n")
print("TRAIN:", train.shape)
# print("FEATS:", feats.shape)
# print("TEST:", test.shape)
# print("SUB:", submission.shape)

## Which features aren't present in test data?

In [None]:
# cols difference between train and test
if PLOT:
    set(train.columns.values.tolist()).difference(test.columns.values.tolist())

## What does the training data look like?

In [None]:
if PLOT:
    train.head()

## How many unique days do we have in the training data?

In [None]:
# How many days do we have?
if PLOT:
    train.date.nunique()

**We have 500 unique days of data in the training set.**

## Is resp the sum of resp_{1, 2, 3, 4}?

In [None]:
# Is resp the sum resp_{1, 2, 3, 4}?
if PLOT:
    train[[f'resp_{i}' for i in range(1, 5)]].sum(axis=1) == train['resp']

**resp isn't the sum of resp_{1, 2, 3, 4}**
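Because these columns are floats, an exact `==` comparison can report a mismatch that is only rounding noise. A minimal sketch of a tolerance-based check follows, assuming a small made-up pandas frame in place of the real training data; `np.isclose` uses its default tolerances here.

```python
import numpy as np
import pandas as pd

# Made-up rows: the first one sums to resp, the second one does not.
toy = pd.DataFrame({
    "resp_1": [0.1, 0.2],
    "resp_2": [0.1, 0.1],
    "resp_3": [0.1, -0.1],
    "resp_4": [0.1, 0.3],
    "resp":   [0.4, 0.25],
})

resp_cols = [f"resp_{i}" for i in range(1, 5)]
resp_sum = toy[resp_cols].sum(axis=1)

# Element-wise comparison with a floating point tolerance.
matches = np.isclose(resp_sum, toy["resp"])
share_equal = matches.mean()
print(f"share of rows where resp equals the sum: {share_equal:.2f}")
```

A share close to 0 on the real data would support the conclusion above more firmly than an exact comparison.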

## How are weights distributed?

In [None]:
# What is the distribution of weights
if PLOT:
    n = 2
    bins = np.arange(0, 167+n, n)
    train.weight.hist(bins=bins);
    plt.xlim(0, 50);

**Weights are heavily right skewed, which could mean that there are two strategies here: focusing on the high-weighted trades, and focusing on the low-weighted trades. Maybe for each range there should be a different model.**
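One way to explore the "different model per weight range" idea is to bucket the weights first and look at resp within each bucket. The sketch below uses synthetic data, and the bin edges and labels are arbitrary assumptions, not values derived from the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: right-skewed weights and small, noisy returns.
toy = pd.DataFrame({
    "weight": rng.exponential(scale=3.0, size=1000),
    "resp": rng.normal(scale=0.02, size=1000),
})

# Hypothetical bucket edges; real edges would come from the weight histogram.
bins = [0, 1, 5, 20, np.inf]
labels = ["tiny", "low", "mid", "high"]
toy["weight_bucket"] = pd.cut(toy["weight"], bins=bins, labels=labels,
                              include_lowest=True)

# Row count and resp spread per bucket.
summary = toy.groupby("weight_bucket", observed=False)["resp"].agg(["count", "std"])
weight_skew = toy["weight"].skew()
print(summary)
print("weight skewness:", weight_skew)
```

If the resp spread differs clearly between buckets on the real data, training one model per bucket (or adding the bucket as a feature) becomes easier to justify.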

## Are weights and resps in any way related?

In [None]:
# Are weights and resps in any way related?
if PLOT:
    train.plot(kind='scatter', x='weight', y='resp', figsize=(8, 8));
    plt.xticks(np.arange(0, 170, 10));

**Trades become sparser as weight increases, and the spread of resp shrinks with increasing weight as well. This may indicate that a model should treat different weight ranges differently, or that different models would suit different ranges. Perhaps each model's outputs should also be clipped to the resp range observed in its weight range?**

## How are resp and resp_{1, 2, 3, 4} distributed and are they related?

In [None]:
# we could answer this question using a scatter matrix

if PLOT:
    resp_cols = train.columns[train.columns.str.contains('resp')]

    sns.pairplot(train[resp_cols], plot_kws={'alpha': 0.1}, 
                                   diag_kws={'bins': 50});

### Distributions
1. resp_4 is most similar to resp in distribution.


### Correlations
1. resp_1 is most correlated with resp_2, and its correlations with resp_3, resp_4 and resp decrease gradually.
2. resp_2 is most correlated with resp_1 and resp_3.
3. resp_3 shows moderate correlations with resp_1, resp_4 and resp.
4. resp_4 is highly correlated with resp.

## How many trades do we have per day?

In [None]:
if PLOT:
    train.date.value_counts().sort_index().plot(figsize=(12, 6), title='Trades over days',
                                                xlabel='Day',
                                                ylabel='count');

**Trade counts evidently differ between days. Would this have any implication during training?**

## How do mean returns differ each day?

In [None]:
if PLOT:
    fig, axes = plt.subplots(3, 2, figsize=(14, 10))
    group_date = train.groupby('date')

    plt.subplots_adjust(hspace=0.5)
    group_date.resp.mean().plot(title='Mean resp over Days',
                                           xlabel='Day',
                                           ylabel='Mean resp', ax=axes[0][0]);

    group_date.resp_1.mean().plot(title='Mean resp_1 over Days',
                                           xlabel='Day',
                                           ylabel='Mean resp_1', ax=axes[0][1]);

    group_date.resp_2.mean().plot(title='Mean resp_2 over Days',
                                           xlabel='Day',
                                           ylabel='Mean resp_2', ax=axes[1][0]);

    group_date.resp_3.mean().plot(title='Mean resp_3 over Days',
                                           xlabel='Day',
                                           ylabel='Mean resp_3', ax=axes[1][1]);

    group_date.resp_4.mean().plot(title='Mean resp_4 over Days',
                                           xlabel='Day',
                                           ylabel='Mean resp_4', ax=axes[2][0]);

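    # select several columns with a list; tuple-style selection on a groupby is not supported in recent pandas
    group_date[['resp_1', 'resp_2', 'resp_3', 'resp_4']].mean().sum(axis=1).plot(title='Mean resp_{1, 2, 3, 4} sum over Days',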
                                           xlabel='Day',
                                           ylabel='Mean resp_{1, 2, 3, 4} sum', ax=axes[2][1]);
    
    del group_date
    
    rubbish = gc.collect()

**I guess that, since they differ from day to day, we could predict resp_{1, 2, 3, 4} and use them for mean encodings, or predict their daily means directly without predicting the individual values. I guess there are a lot of uses for predicting them, using them as features, and engineering more features from them.**
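A per-day mean encoding could look roughly like the sketch below. It runs on a made-up frame and shifts the daily mean by one day so that a row never sees the mean of its own day; the column name `resp_1_prev_day_mean` is hypothetical. Note that resp columns are absent from the test data, so at inference such a feature would itself have to be predicted.

```python
import pandas as pd

# Made-up trades over three days.
toy = pd.DataFrame({
    "date": [0, 0, 1, 1, 2],
    "resp_1": [0.01, 0.03, -0.02, 0.00, 0.05],
})

# Mean return per day, indexed by date in ascending order.
daily_mean = toy.groupby("date")["resp_1"].mean()

# Shift by one position so each day maps to the previous day's mean.
prev_day_mean = daily_mean.shift(1)

# Broadcast the per-day value back onto the individual trades.
toy["resp_1_prev_day_mean"] = toy["date"].map(prev_day_mean)
print(toy)
```

The first day has no previous day, so its rows get NaN, which would then be handled by whatever missing-value strategy the model uses.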

## How are features related?

In [None]:
if PLOT:
    heatmap_cols = train.columns[1: -1]
    heatmap_corr = train[heatmap_cols].corr()

    plt.figure(figsize=(14, 14))
    sns.heatmap(heatmap_corr, cmap='coolwarm', alpha=0.75)
    plt.title('Correlation between features', fontsize=15, weight='bold')
    
    del heatmap_corr
    rubbish = gc.collect()

**Some feature blocks show high correlation, which could indicate that their source might be the same? Unfortunately I have no idea what these anonymized features could be, but it wouldn't hurt to look into their distributions.**

## How are features distributed?

In [None]:
if PLOT:
    plt.figure(figsize=(16, 16))

    sns.set_style('whitegrid')

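    # sample 16 distinct feature indices (the default samples with replacement and can repeat a feature)
    rand_feats = np.random.choice(130, 16, replace=False)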
    for i, col in enumerate(rand_feats):
        plt.subplot(4, 4, i+1)
        plt.hist(train.loc[:, f'feature_{col}'], bins=50)
        plt.title(f'feature_{col}')

**The ranges of the features are extremely different, which indicates that some type of standardization should be carried out before training any model.**
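A minimal sketch of standardization follows, assuming synthetic arrays in place of the real features. The statistics come from the training split only and are reused on the validation split, so no validation information leaks into training. Scaling matters mainly for linear or neural models; tree-based boosters split on thresholds and are largely insensitive to it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic features with very different ranges.
X_train = rng.normal(loc=[0.0, 50.0], scale=[1.0, 20.0], size=(200, 2))
X_val = rng.normal(loc=[0.0, 50.0], scale=[1.0, 20.0], size=(50, 2))

# Fit the statistics on the training split only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
std[std == 0] = 1.0  # guard against constant columns

# Apply the same transformation to both splits.
X_train_std = (X_train - mean) / std
X_val_std = (X_val - mean) / std

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```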

## Understanding the utility function

$p_i = \sum_j(weight_{ij} * resp_{ij} * action_{ij})$

> **First, for each date, we sum the products of the weight, return and action of each trade.**

$t = \frac{\sum p_i }{\sqrt{\sum p_i^2}} * \sqrt{\frac{250}{|i|}}$

> **Then we sum the per-date p values and divide by the square root of the sum of their squares, then multiply by the square root of 250 divided by the number of dates.**

$u = \min(\max(t, 0), 6) \sum p_i$

> **Then t is floored at 0 and capped at 6, i.e. clipped to the range [0, 6], and the result is multiplied by the sum of the per-date p values.**

**What all of this means is that we need t to be positive to have a score above 0, and that t stops helping once it reaches 6, because the multiplier is capped there; beyond that point only a larger sum of p values raises the score. This in turn depends on the action threshold. Let's try different thresholds for action in the training set and calculate the resulting ts and us.**
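To make the three formulas concrete, here is a small worked sketch on four made-up trades. The function name `utility_score` is hypothetical; it only assumes a pandas frame with `date`, `weight`, `resp` and `action` columns.

```python
import numpy as np
import pandas as pd


def utility_score(df):
    """Return (t, u) for a frame with date, weight, resp and action columns."""
    # p_i: sum of weight * resp * action over the trades of each date
    p = (df["weight"] * df["resp"] * df["action"]).groupby(df["date"]).sum()
    denom = np.sqrt((p ** 2).sum())
    if denom == 0:
        # no trade taken at all: t is undefined, so score it as zero
        return 0.0, 0.0
    t = p.sum() / denom * np.sqrt(250 / len(p))
    u = min(max(t, 0), 6) * p.sum()
    return float(t), float(u)


toy = pd.DataFrame({
    "date":   [0, 0, 1, 1],
    "weight": [1.0, 2.0, 1.0, 0.5],
    "resp":   [0.01, -0.02, 0.03, 0.02],
    "action": [1, 0, 1, 1],
})

t, u = utility_score(toy)
print(t, u)
```

Here p is 0.01 for date 0 and 0.04 for date 1, so t is well above 6 and the multiplier is capped: u is 6 times the sum of p, i.e. 0.3.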

In [None]:
def determine_action(df, thresh):
    """Determines action based on defined threshold."""
    # action is 1 when the weighted return exceeds the threshold, else 0
    action = (df.weight * df.resp > thresh).astype(int)
    return action

def date_weighted_resp(df):
    """Calculates the sum of weight, resp, action product."""
    cols = ['weight', 'resp', 'action']
    weighted_resp = np.prod(df[cols], axis=1)
    return weighted_resp.sum()

def calculate_t(dates_p):
    """Calculate t based on dates sum of weighted returns"""
    e_1 =  dates_p.sum() / np.sqrt((dates_p**2).sum())
    e_2 = np.sqrt(250/np.abs(len(dates_p)))
    return e_1 * e_2

def calculate_u(df, thresh):
    """Calculates utility score, and return t and u."""
    df = df.copy()
    
    # determines action based on threshold
    df['action'] = determine_action(df, thresh)
    
    # calculates sum of dates weighted returns
    dates_p = df.groupby('date').apply(date_weighted_resp)
        
    # calculate t
    t = calculate_t(dates_p)
    

    return t, min(max(t, 0), 6) * dates_p.sum()

In [None]:
# # Testing function time
# thresh = 0

# %time u = calculate_u(train, thresh)

In [None]:
if PLOT:
    threshs = np.linspace(-0.5, 0.5, 100)
    ts = []
    us = []

    for thresh in threshs:
        t, u = calculate_u(train, thresh)
        ts.append(t)
        us.append(u)
        
    fig, axes = plt.subplots(1, 2, figsize=(14, 4))
    axes[0].plot(threshs, ts)
    axes[0].set_title('Different t scores by threshold')
    axes[0].set_xlabel('Threshold')

    axes[1].plot(threshs, us)
    axes[1].set_title('Different u scores by threshold')
    axes[1].set_xlabel('Threshold');

# Data Preparation

## Filling missing values

Since my baseline model will be XGBoost, I'll set NaNs to -999.


## Rows with 0 weight

I'll also begin by training the model on all data, even data with weights set to zero.


## Features

I'll use only the anonymized features, feature_0 through feature_129.

In [None]:
# NaN filling
train = train.fillna(-999)

# Action determination
train['action'] = (train['weight'] * train['resp'] > 0).astype(int)

# Features
features = train.columns[train.columns.str.contains('feature')]
# train[features] = train[features].astype(np.float64)


# Modelling

## Model Parameters

<strike>I'll stick with parameters provided by this Yirun's kernel: https://www.kaggle.com/gogo827jz/jane-street-xgboost-grouptimesplitkfold</strike>

At first I was going to tackle this as a classification problem, but then I decided that I'd like to begin by tackling it as a regression one. So I'll start with just regular parameters to check.

## Problem definition

My hunch is to treat the problem as regression, then determine the action after the model using the rule (weight * resp) > thresh, where thresh is 0 for now, although I think it could be changed.
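The post-model rule can be sketched as follows. The arrays are made up and stand in for trade weights and regression outputs; the helper name `action_from_predicted_resp` is hypothetical.

```python
import numpy as np


def action_from_predicted_resp(weight, pred_resp, thresh=0.0):
    """Return 1 where weight * predicted resp exceeds thresh, else 0."""
    weight = np.asarray(weight, dtype=float)
    pred_resp = np.asarray(pred_resp, dtype=float)
    return (weight * pred_resp > thresh).astype(int)


toy_weight = np.array([0.0, 1.5, 2.0, 10.0])
toy_pred = np.array([0.02, -0.01, 0.004, 0.001])

print(action_from_predicted_resp(toy_weight, toy_pred))
print(action_from_predicted_resp(toy_weight, toy_pred, thresh=0.009))
```

A zero-weight trade never passes a non-negative threshold, which is harmless because such trades contribute nothing to the utility score anyway.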

In [None]:
seed = 1995

# params = {'booster': 'gblinear',
#           'objective': 'reg:squarederror',
#           'ntree_limit' : 0,
#           'colsample_bytree': 0.5,
#           'learning_rate': 0.05,
#           'max_depth': 5,
#           'alpha': 10,
#           'n_estimators': 10, 
#           'tree_method': 'gpu_hist', 
#           'random_state': seed,
#          }

# target = 'resp'


params = {'learning_rate': 0.050055662268729532,
          'max_depth': 6, 
          'gamma': 0.07902741481945934, 
          'min_child_weight': 9.9404564544994, 
          'subsample': 0.7001330243186357, 
          'colsample_bytree': 0.7064645381596891, 
          'objective': 'binary:logistic',
          'eval_metric': 'auc', 
          'tree_method': 'gpu_hist', 
          'random_state': seed,
         }

target = 'action'

## CV Strategy

I don't know the best strategy here, so maybe I should just start simple with a normal k-fold, then try more complex approaches once this one passes the test.

<strike>A couple of ideas that I have are:
1. Splitting Stratified Kfold according to reset ts_id based on day so each fold has equal number of ordered trades.
2. Splitting Stratified Kfold based on date.
3. Splitting Stratified Kfold based on action.
4. Splitting Multiple Stratified Kfold based on any combination of the 3 previous splits.</strike>

I took a look at the kernels, and the consensus seems to be that time series splits are the popular strategy here. As far as I understand, GroupTimeSeriesSplit splits the data by its temporal group, which here is date, in ascending order. For example, if the number of folds equaled the number of dates, the first training set would be just date 1 with date 2 as validation, and so on, until the training set is dates 1-499 and validation is date 500. That is of course a leave-one-group-out scheme walked forward in time, but it simplifies the idea behind it.
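The expanding-window idea can be illustrated with a few lines of NumPy. This is a simplified sketch, not the scikit-learn implementation: the generator name `expanding_group_splits` is hypothetical, and it assumes the group labels (dates) sort chronologically.

```python
import numpy as np


def expanding_group_splits(groups, n_splits):
    """Yield (train_idx, test_idx) with whole groups kept together and
    every test block strictly later than its training block."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)  # assumes sorted labels are chronological
    n_groups = len(unique_groups)
    test_size = n_groups // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough groups for the requested number of splits")
    starts = range(n_groups - n_splits * test_size, n_groups, test_size)
    for start in starts:
        train_groups = unique_groups[:start]
        test_groups = unique_groups[start:start + test_size]
        train_idx = np.flatnonzero(np.isin(groups, train_groups))
        test_idx = np.flatnonzero(np.isin(groups, test_groups))
        yield train_idx, test_idx


# Ten trades spread over six dates.
dates = np.array([0, 0, 1, 1, 2, 3, 3, 4, 5, 5])
splits = list(expanding_group_splits(dates, n_splits=2))
for train_idx, test_idx in splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

Each training set is a superset of the previous one, and no date is ever split between training and validation.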


## GroupTimeSplitKFold


In [None]:
from sklearn.model_selection._split import _BaseKFold, indexable, _num_samples
from sklearn.utils.validation import _deprecate_positional_args

# https://github.com/getgaurav2/scikit-learn/blob/d4a3af5cc9da3a76f0266932644b884c99724c57/sklearn/model_selection/_split.py#L2243
class GroupTimeSeriesSplit(_BaseKFold):
    """Time Series cross-validator variant with non-overlapping groups.
    Provides train/test indices to split time series data samples
    that are observed at fixed time intervals according to a
    third-party provided group.
    In each split, test indices must be higher than before, and thus shuffling
    in cross validator is inappropriate.
    This cross-validation object is a variation of :class:`KFold`.
    In the kth split, it returns first k folds as train set and the
    (k+1)th fold as test set.
    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).
    Note that unlike standard cross-validation methods, successive
    training sets are supersets of those that come before them.
    Read more in the :ref:`User Guide <cross_validation>`.
    Parameters
    ----------
    n_splits : int, default=5
        Number of splits. Must be at least 2.
    max_train_size : int, default=None
        Maximum size for a single training set.
    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import GroupTimeSeriesSplit
    >>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',\
                           'b', 'b', 'b', 'b', 'b',\
                           'c', 'c', 'c', 'c',\
                           'd', 'd', 'd'])
    >>> gtss = GroupTimeSeriesSplit(n_splits=3)
    >>> for train_idx, test_idx in gtss.split(groups, groups=groups):
    ...     print("TRAIN:", train_idx, "TEST:", test_idx)
    ...     print("TRAIN GROUP:", groups[train_idx],\
                  "TEST GROUP:", groups[test_idx])
    TRAIN: [0, 1, 2, 3, 4, 5] TEST: [6, 7, 8, 9, 10]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a']\
    TEST GROUP: ['b' 'b' 'b' 'b' 'b']
    TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [11, 12, 13, 14]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b']\
    TEST GROUP: ['c' 'c' 'c' 'c']
    TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]\
    TEST: [15, 16, 17]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c']\
    TEST GROUP: ['d' 'd' 'd']
    """
    @_deprecate_positional_args
    def __init__(self,
                 n_splits=5,
                 *,
                 max_train_size=None
                 ):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_size = max_train_size

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Group labels for the samples used while splitting the dataset into
            train/test set.
        Yields
        ------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        if groups is None:
            raise ValueError(
                "The 'groups' parameter should not be None")
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        group_dict = {}
        u, ind = np.unique(groups, return_index=True)
        unique_groups = u[np.argsort(ind)]
        n_samples = _num_samples(X)
        n_groups = _num_samples(unique_groups)
        for idx in np.arange(n_samples):
            if (groups[idx] in group_dict):
                group_dict[groups[idx]].append(idx)
            else:
                group_dict[groups[idx]] = [idx]
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds={0} greater than"
                 " the number of groups={1}").format(n_folds,
                                                     n_groups))
        group_test_size = n_groups // n_folds
        group_test_starts = range(n_groups - n_splits * group_test_size,
                                  n_groups, group_test_size)
        for group_test_start in group_test_starts:
            train_array = []
            test_array = []
            for train_group_idx in unique_groups[:group_test_start]:
                train_array_tmp = group_dict[train_group_idx]
                train_array = np.sort(np.unique(
                                      np.concatenate((train_array,
                                                      train_array_tmp)),
                                      axis=None), axis=None)
            train_end = train_array.size
            if self.max_train_size and self.max_train_size < train_end:
                train_array = train_array[train_end -
                                          self.max_train_size:train_end]
            for test_group_idx in unique_groups[group_test_start:
                                                group_test_start +
                                                group_test_size]:
                test_array_tmp = group_dict[test_group_idx]
                test_array = np.sort(np.unique(
                                              np.concatenate((test_array,
                                                              test_array_tmp)),
                                     axis=None), axis=None)
            yield [int(i) for i in train_array], [int(i) for i in test_array]


In [None]:
# Just for testing the pipeline's readiness

# train = train.iloc[:10000]

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, roc_auc_score, roc_curve, precision_recall_curve
import xgboost as xgb

n_splits = 5
thresh = 0

# oof validation prediction array
oof = np.zeros(len(train['action']))

# validation indices in case of time series split
val_idx_all = []

# a list to store k-folds models
models = []

# cv strategy
gkf = GroupTimeSeriesSplit(n_splits=n_splits)
# kfold = KFold(n_splits=n_splits)


for fold, (train_idx, val_idx) in enumerate(gkf.split(train.action.values.get(), groups=train.date.values.get())):
    
    X_train, X_val = train.loc[train_idx, features].values.get(), train.loc[val_idx, features].values.get()
    y_train, y_val = train.loc[train_idx, target].values.get(), train.loc[val_idx, target].values.get()
    
    # init dmatrix for optimized learning
    D_train = xgb.DMatrix(X_train, label=y_train)
    D_val = xgb.DMatrix(X_val, label=y_val)
    
    # training and evaluation score
    xg_reg = xgb.train(params, D_train, 10000, [(D_val, 'eval')], early_stopping_rounds=100, verbose_eval=0)

#     xg_reg = xgb.XGBRegressor(**params)
#     xg_reg.fit(X_train, y_train)
    
    # storing validation predictions and scoring them with ROC AUC
    oof[val_idx] += xg_reg.predict(D_val, ntree_limit=0)
    score = roc_auc_score(y_val, oof[val_idx])
    print(f'FOLD {fold} ROC AUC:\t {score}')
    
    # appending model to list of models for further inferences
    models.append(xg_reg)
    
    # appending val_idx in case of group time series split
    val_idx_all.append(val_idx)
    
    # deleting excess data to avoid running out of memory
    del X_train, X_val, y_train, y_val, D_train, D_val
    gc.collect()
    

# concatenation of all val_idx for further accessing
val_idx = np.concatenate(val_idx_all)

In [None]:
# weighting the out-of-fold predictions by trade weight
# (with target = 'action' and binary:logistic, oof holds probabilities rather than resp)
oof_weighted_resp = train.loc[val_idx, 'weight'].values.get() * oof[val_idx]

# calculating action the same way as in train: (weight * prediction) > thresh
oof_action = (oof_weighted_resp > thresh).astype(int)

# holding a resp only array with trades having zero weight set to 0 action
oof_zero = np.where(train.loc[val_idx, 'weight'].values.get() == 0, 0, oof[val_idx])

# settings targets
targets_val = train.loc[val_idx, 'action'].values.get()

auc_oof = roc_auc_score(targets_val, oof_action)
print(auc_oof)

## Evaluation of different (weight x resp) thresholds using ROC AUC

In [None]:
fpr, tpr, thresholds = roc_curve(targets_val, oof_weighted_resp)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--', label='Random')  # dashed diagonal
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc='lower right')
    plt.grid()
    
plot_roc_curve(fpr, tpr, 'XGB')

## Evaluation of different resp thresholds using ROC AUC

In [None]:
fpr, tpr, thresholds = roc_curve(targets_val, oof_zero)
plot_roc_curve(fpr, tpr, 'XGB')

## Evaluation of different (weight x resp) thresholds using Precision/Recall curve

In [None]:
def plot_precision_recall_curve(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], 'b--', label='Precision')
    plt.plot(thresholds, recalls[:-1], 'g-', label='Recall')
    plt.xlabel('Thresholds')
    plt.legend(loc='lower left')
    plt.grid()
    
precisions, recalls, thresholds = precision_recall_curve(targets_val, oof_weighted_resp)
plot_precision_recall_curve(precisions, recalls, thresholds)

## Evaluation of different resp thresholds using Precision/Recall curve

In [None]:
precisions, recalls, thresholds = precision_recall_curve(targets_val, oof_zero)
plot_precision_recall_curve(precisions, recalls, thresholds)

## Evaluating different (weight * resp) thresholds based on utility score

In [None]:
threshs = np.linspace(-0.5, 0.5, 100)
ts = []
us = []


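# score only the out-of-fold rows; calculate_u would overwrite the predicted
# action with one derived from the true resp, so the utility is computed directly
val_df = train.loc[val_idx, ['date', 'weight', 'resp']].copy()

for thresh in threshs:
    val_df['action'] = (oof_weighted_resp > thresh).astype(int)
    print(val_df['action'].value_counts())
    dates_p = val_df.groupby('date').apply(date_weighted_resp)
    t = calculate_t(dates_p)
    ts.append(t)
    us.append(min(max(t, 0), 6) * dates_p.sum())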

fig, axes = plt.subplots(1, 2, figsize=(14, 4))
axes[0].plot(threshs, ts)
axes[0].set_title('Different t scores by threshold')
axes[0].set_xlabel('Threshold')

axes[1].plot(threshs, us)
axes[1].set_title('Different u scores by threshold')
axes[1].set_xlabel('Threshold');

## Evaluating different resp thresholds based on utility score

In [None]:
threshs = np.linspace(-0.5, 0.5, 100)
ts = []
us = []


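# score only the out-of-fold rows; calculate_u would overwrite the predicted
# action with one derived from the true resp, so the utility is computed directly
val_df = train.loc[val_idx, ['date', 'weight', 'resp']].copy()

for thresh in threshs:
    val_df['action'] = (oof_zero > thresh).astype(int)
    print(val_df['action'].value_counts())
    dates_p = val_df.groupby('date').apply(date_weighted_resp)
    t = calculate_t(dates_p)
    ts.append(t)
    us.append(min(max(t, 0), 6) * dates_p.sum())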

fig, axes = plt.subplots(1, 2, figsize=(14, 4))
axes[0].plot(threshs, ts)
axes[0].set_title('Different t scores by threshold')
axes[0].set_xlabel('Threshold')

axes[1].plot(threshs, us)
axes[1].set_title('Different u scores by threshold')
axes[1].set_xlabel('Threshold');

# Submission

In [None]:
import tqdm

env = janestreet.make_env()
env_iter = env.iter_test()
    
for test_df, pred_df in tqdm.tqdm(env_iter):
    test_df = test_df.fillna(-999)
    D_test = xgb.DMatrix(test_df.loc[:, features])
    for i, reg in enumerate(models):
        if i == 0:
            pred = reg.predict(D_test, ntree_limit=0) / len(models)
        else:
            pred += reg.predict(D_test, ntree_limit=0) / len(models)
        
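    # set according to the chosen action strategy: (weight * prediction) > 0, as in train
    # (the submitted action must be 0 or 1, so compare instead of truncating to int)
    pred_action = (test_df['weight'] * pred > 0).astype(int)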

    pred_df.action = pred_action
    
    env.predict(pred_df)
    
    del test_df, pred_df, D_test
    gc.collect()