<center><h2>Jane Street Market Prediction | Numerai Baseline | katsu1110 </h2></center><hr>

Here I try an example model (XGBoost) from the [Numerai tournament](https://numer.ai/tournament). This model [performs well](https://numer.ai/integration_test) in Numerai, but how does it fare in this competition?

This notebook loads the data in feather format from [another notebook of mine](https://www.kaggle.com/code1110/janestreet-save-as-feather?scriptVersionId=47635784), so that we don't have to wait for the slow CSV loading.

In this notebook we treat the task as binary classification.

In [None]:
import janestreet
import numpy as np
import pandas as pd

import os, sys
import gc
import math
import random
import pathlib
from tqdm import tqdm
from typing import List, NoReturn, Union, Tuple, Optional, Text, Generic, Callable, Dict
from sklearn.preprocessing import MinMaxScaler, StandardScaler, QuantileTransformer
from sklearn.decomposition import PCA
from sklearn import linear_model
import operator
import xgboost as xgb
import lightgbm as lgb
from tqdm.notebook import tqdm  # notebook progress bar (tqdm_notebook is deprecated)

# visualize
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from matplotlib_venn import venn2
from matplotlib import pyplot
from matplotlib.ticker import ScalarFormatter
sns.set_context("talk")
style.use('fivethirtyeight')
pd.options.display.max_columns = None

import warnings
warnings.filterwarnings('ignore')

# Config
Some configuration settings.

In [None]:
SEED = 20201225 # Merry Christmas!
# INPUT_DIR = '../input/jane-street-market-prediction/'
INPUT_DIR = '../input/janestreet-save-as-feather/'
TRADING_THRESHOLD = 0.502 # 0 ~ 1: The smaller, the more aggressive
DATE_BEGIN = 86 # 0 ~ 499: set 0 for model training using the complete data 

# Load data
I have already saved the training data in feather format in [another notebook of mine](https://www.kaggle.com/code1110/janestreet-save-as-feather?scriptVersionId=47635784). Loading CSV takes time, but loading feather is really light :)
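
For reference, the CSV-to-feather caching step could be sketched roughly as below. This is a minimal sketch, not the code of the linked notebook: the helper names `downcast_floats` and `csv_to_feather` are hypothetical, the float32 downcast is an optional assumption of mine to save memory, and `to_feather` needs `pyarrow` to be installed.

```python
import numpy as np
import pandas as pd


def downcast_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which float64 columns are stored as float32."""
    float_cols = df.select_dtypes(include=["float64"]).columns
    return df.astype({col: np.float32 for col in float_cols})


def csv_to_feather(csv_path: str, feather_path: str) -> None:
    """Read a CSV once and cache it as a feather file (requires pyarrow)."""
    df = pd.read_csv(csv_path)
    downcast_floats(df).to_feather(feather_path)
```

With this, something like `csv_to_feather('train.csv', 'train.feather')` would be run once, and every later session would call `pd.read_feather` on the cached file instead of parsing the CSV again.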

In [None]:
os.listdir(INPUT_DIR)

In [None]:
%%time

# load the feather files (much faster than CSV)
def load_data(input_dir=INPUT_DIR):
    input_dir = pathlib.Path(input_dir)
    train = pd.read_feather(input_dir / 'train.feather')
    features = pd.read_feather(input_dir / 'features.feather')
    example_test = pd.read_feather(input_dir / 'example_test.feather')
    ss = pd.read_feather(input_dir / 'example_sample_submission.feather')
    return train, features, example_test, ss

train, features, example_test, ss = load_data(INPUT_DIR)

# EDA (Exploratory Data Analysis)
Let's briefly look at data.

## Train
>train.csv - the training set, contains historical data and returns

In [None]:
print(train.shape)
train.head()

In [None]:
print('Date range from {} to {}.'.format(train['date'].min(), train['date'].max()))
print('ts_id range from {} to {}.'.format(train['ts_id'].min(), train['ts_id'].max()))

In [None]:
# histograms for non-features
for f in ['weight', 'resp_1', 'resp_2', 'resp_3', 'resp_4', 'resp']:
    train[f].plot.hist(bins=100, title=f)
    plt.show()

>In the training set, train.csv, you are provided a resp value, as well as several other resp_{1,2,3,4} values that represent returns over different time horizons.

Let's look at the 'resp' columns and how they are correlated with one another.

In [None]:
# resp...correlated with one another?
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
sns.heatmap(train[['resp_1', 'resp_2', 'resp_3', 'resp_4', 'resp']].corr(), 
            annot=True, square=True, ax=ax);
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right');
ax.set_title('resp correlations');

>Each trade has an associated weight and resp, which together represents a return on the trade. The date column is an integer which represents the day of the trade

OK, what do the returns look like?

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 5))
for f in ['resp_1', 'resp_2', 'resp_3', 'resp_4', 'resp']:
    train.iloc[-10000:].plot(x='ts_id', y=f, alpha=0.4, label=f, ax=ax) # last 10000
ax.legend(frameon=False)
ax.set_ylabel('return')

Not always winning, but overall there seem to be many chances to win? :D

In [None]:
# target
train['action'] = train['resp'] * train['weight']
train['action'].describe()

The target 'action' has a median of 0, meaning that we can treat this task as binary classification (positive vs. non-positive weighted return) without class imbalance.

Alternatively, we can treat it as a regression problem.
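
The binarization idea can be sketched on toy data as follows. This is only an illustration: the helper name `make_binary_target` and the toy numbers are made up, not taken from the competition data.

```python
import pandas as pd


def make_binary_target(resp: pd.Series, weight: pd.Series) -> pd.Series:
    """Label a trade 1 when its weighted return is positive, else 0."""
    return (resp * weight > 0).astype(int)


# Toy rows standing in for a few trades.
toy = pd.DataFrame({
    "resp": [0.02, -0.01, 0.005, -0.03],
    "weight": [1.0, 2.0, 0.5, 0.5],
})
toy["action"] = make_binary_target(toy["resp"], toy["weight"])

print(toy["action"].tolist())  # [1, 0, 1, 0]
print(toy["action"].mean())    # 0.5, i.e. balanced labels in this toy case
```

On the real data, the mean of the binary label is a quick check of how balanced the two classes actually are.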

## Features
>features.csv - metadata pertaining to the anonymized features



In [None]:
print(features.shape)
features.head()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16, 10))
sns.heatmap(features[[f for f in features.columns.values.tolist() if f.startswith('tag')]], 
            ax=ax);
ax.set_yticklabels(features['feature'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right');
ax.set_title('features.csv');

Looks beautiful but what does this indicate???
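
One way to start reading the heatmap is to count tags: each row is a feature, each column is a tag, and a cell tells whether the feature carries that tag. The sketch below does this on a made-up toy table, assuming the tag columns are boolean flags as they appear in `features.head()`; the names `toy_features` and `features_sharing` are hypothetical.

```python
import pandas as pd

# Toy stand-in for features.csv: one row per feature, one boolean column per tag.
toy_features = pd.DataFrame({
    "feature": ["feature_0", "feature_1", "feature_2"],
    "tag_0": [False, True, True],
    "tag_1": [False, True, False],
})

tag_cols = [c for c in toy_features.columns if c.startswith("tag")]
tags = toy_features.set_index("feature")[tag_cols].astype(bool)

tags_per_feature = tags.sum(axis=1)  # how many tags each feature carries
features_per_tag = tags.sum(axis=0)  # how many features share each tag

# For each tag, list the features that carry it.
features_sharing = {
    tag: tags.index[tags[tag].to_numpy()].tolist() for tag in tag_cols
}
print(features_sharing)
```

Features that share a tag may belong to the same family (for example, the same kind of signal), though the tags are anonymized, so this stays a guess.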

## Test, submission
Note that this is a code competition, which means the actual test file is hidden (example_test is just an example!). Your score is calculated on the hidden dataset only when you submit.
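
Since the leaderboard score is only available on submission, a local proxy can help. Below is a minimal sketch of the utility score as I understand it from the competition's evaluation page: daily returns are the sum of weight * resp * action, and their annualized Sharpe-like ratio is clipped to [0, 6] before being multiplied by the total return. The function name `utility_score` and the toy numbers are my own, so please check the formula against the official description before relying on it.

```python
import numpy as np
import pandas as pd


def utility_score(date, weight, resp, action) -> float:
    """Local approximation of the competition utility (an assumption, see text)."""
    p = np.asarray(weight) * np.asarray(resp) * np.asarray(action)
    # p_i: total weighted return of the trades taken on each date
    p_i = pd.DataFrame({"date": date, "p": p}).groupby("date")["p"].sum().to_numpy()
    sum_p = p_i.sum()
    denom = np.sqrt((p_i ** 2).sum())
    if denom == 0:
        return 0.0  # no trades taken at all
    t = sum_p / denom * np.sqrt(250 / len(p_i))
    return float(min(max(t, 0.0), 6.0) * sum_p)


# Toy example: two dates, two trades each.
score = utility_score(
    date=[0, 0, 1, 1],
    weight=[1.0, 1.0, 1.0, 1.0],
    resp=[0.01, -0.02, 0.03, 0.01],
    action=[1, 0, 1, 1],
)
print(score)
```

Evaluating this on a held-out range of dates gives a rough feel for how a change in the model or in `TRADING_THRESHOLD` might move the score.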

In [None]:
print(example_test.shape)
example_test.head()

In [None]:
print(ss.shape)
ss.head()

In [None]:
# delete irrelevant files to save memory
del features, example_test, ss
gc.collect()

# Model fitting
For now, let's use a simple XGBoost model, which is also used as an example in the Numerai Tournament.

There are several columns of returns (resp_1 to resp_4 and resp). One could predict each of them as a target and ensemble the results for the final prediction; in this baseline we fit a single model on 'action'.
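
The ensembling idea could be sketched as a weighted average of per-target probabilities. This is a hedged illustration rather than what the cells below do: the function `ensemble_proba` is hypothetical, and each row of `probas` is assumed to come from a separate classifier trained on one return column (for example `resp_1 > 0`).

```python
import numpy as np


def ensemble_proba(probas, weights=None):
    """Weighted average over models; probas has shape (n_models, n_rows)."""
    probas = np.asarray(probas, dtype=float)
    if weights is None:
        weights = np.ones(probas.shape[0])  # plain average by default
    weights = np.asarray(weights, dtype=float)
    return weights @ probas / weights.sum()


# Toy probabilities from three hypothetical models for two trades.
toy_probas = [[0.6, 0.4], [0.5, 0.5], [0.7, 0.3]]
print(ensemble_proba(toy_probas))
print(ensemble_proba(toy_probas, weights=[1, 0, 1]))
```

The averaged probability would then be compared against `TRADING_THRESHOLD` exactly like the output of a single model.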

In [None]:
# drop rows with weight = 0 to save memory
original_size = train.shape[0]
train = train.query('weight > 0').reset_index(drop=True)

# use data later than DATE_BEGIN
train = train.query(f'date >= {DATE_BEGIN}')

print('Train size reduced from {:,} to {:,}.'.format(original_size, train.shape[0]))

In [None]:
train['action'].describe()

In [None]:
train['action'].hist(bins=100)

In [None]:
# features to use
feats = [f for f in train.columns.values.tolist() if f.startswith('feature')]
print('There are {:,} features.'.format(len(feats)))

In [None]:
%%time

# same hyperparameters as in a Numerai example (https://github.com/numerai/example-scripts/blob/master/example_model.py)
params = {
    'colsample_bytree': 0.1,                 
    'learning_rate': 0.01,
    'max_depth': 5,
    'seed': SEED,
    'n_jobs': -1,
    'n_estimators': 2000,
#     'tree_method': 'gpu_hist' # Let's use GPU for a faster experiment
}

# params["objective"] = 'reg:squarederror'
# params["eval_metric"] = 'rmse'
# model = xgb.XGBRegressor(**params)

params["objective"] = 'binary:logistic'
params["eval_metric"] = 'logloss'
    
model = xgb.XGBClassifier(**params)
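# 'action' holds the weighted return, so binarize it (1 if positive) for the classifier
model.fit(train[feats], (train['action'] > 0).astype(int))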

# Feature importance
Let's see feature importance given by the model.

In [None]:
pd.DataFrame(model.feature_importances_, index=feats, columns=['importance']).sort_values(by='importance', ascending=False).style.background_gradient(cmap='viridis')

# Submit

In [None]:
env = janestreet.make_env()
iter_test = env.iter_test()  # create the test iterator once and reuse it

for (test_df, pred_df) in tqdm(iter_test):
    if test_df['weight'].item() > 0:
        # trade only when the predicted probability of a positive return exceeds the threshold
        pred_df.action = (model.predict_proba(test_df[feats])[:, 1] > TRADING_THRESHOLD).astype('int')
    else:
        # skip zero-weight trades
        pred_df.action = 0
    env.predict(pred_df)
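
To make the role of `TRADING_THRESHOLD` concrete, here is a small sketch on made-up probabilities. The helper `to_action` is hypothetical; it only mirrors the comparison used in the loop above.

```python
import numpy as np

TRADING_THRESHOLD = 0.502  # same value as in the Config section


def to_action(proba, threshold=TRADING_THRESHOLD):
    """Trade (1) when the predicted probability exceeds the threshold, else pass (0)."""
    return (np.asarray(proba) > threshold).astype(int)


proba = np.array([0.49, 0.501, 0.503, 0.60])
print(to_action(proba))        # [0 0 1 1]
print(to_action(proba, 0.45))  # [1 1 1 1]
```

A smaller threshold takes more trades (more aggressive), while a larger one trades only when the model is more confident.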