
## <a name="Jane Street Market Prediction">About this Competition</a>

In this competition, if one is able to generate a highly predictive model which selects the right trades to execute, they would also be playing an important role in sending the market signals that push prices closer to “fair” values. That is, a better model will mean the market will be more efficient going forward. However, developing good models will be challenging for many reasons, including a very low signal-to-noise ratio, potential redundancy, strong feature correlation, and difficulty of coming up with a proper mathematical formulation. This is a Code Competition and you need to submit notebooks for evaluation.


## <a name="dataset_description"> Mission of this Kernel </a>: 

This kernel takes a closer look at model interpretability: it uses SHAP values to show how individual features interact with the target variable, which is a useful guide for feature engineering.

## <a name="dataset_description">Dataset Description</a>: 


This dataset contains an anonymized set of features, feature_{0...129}, representing real stock market data. Each row in the dataset represents a trading opportunity, for which you will be predicting an action value: 1 to make the trade and 0 to pass on it. Each trade has an associated weight and resp, which together represent a return on the trade. The date column is an integer which represents the day of the trade, while ts_id represents a time ordering. In addition to anonymized feature values, you are provided with metadata about the features in features.csv.

In the training set, train.csv, you are provided a resp value, as well as several other resp_{1,2,3,4} values that represent returns over different time horizons. These variables are not included in the test set. Trades with weight = 0 were intentionally included in the dataset for completeness, although such trades will not contribute towards the scoring evaluation.

This is a code competition that relies on a time-series API to ensure models do not peek forward in time. To use the API, follow the instructions on the Evaluation page. When you submit your notebook, it will be rerun on an unseen test set:

- During the model training phase of the competition, this unseen test set is comprised of approximately 1 million rows of historical data.
- During the live forecasting phase, the test set will use periodically updated live market data.

Note that during the second (forecasting) phase of the competition, the notebook time limits will scale with the number of trades presented in the test set. Refer to the Code Requirements for details.
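
The "no peeking forward" idea can be illustrated with a small stand-in. The sketch below does not use the real competition module: `MockEnv`, `toy_model` and the toy data are hypothetical names invented here, and the real API (described on the Evaluation page) may differ in detail. It only shows the pattern of receiving one trading opportunity at a time and having to submit a prediction before the next one is served.

```python
import pandas as pd


class MockEnv:
    """Hypothetical stand-in for a time-series environment (NOT the real API)."""

    def __init__(self, test_df):
        self._test_df = test_df.sort_values("ts_id")
        self.submitted = []                 # prediction frames received so far
        self._awaiting_prediction = False   # True between a yield and predict()

    def iter_test(self):
        # Serve the test set one row at a time, in ts_id order.
        for k in range(len(self._test_df)):
            assert not self._awaiting_prediction, "call predict() before the next row"
            self._awaiting_prediction = True
            test_row = self._test_df.iloc[[k]]
            pred_row = pd.DataFrame({"ts_id": test_row["ts_id"].to_numpy(), "action": 0})
            yield test_row, pred_row

    def predict(self, pred_df):
        self.submitted.append(pred_df.copy())
        self._awaiting_prediction = False


def toy_model(test_row):
    # Hypothetical rule: trade only if weight > 0 and feature_0 > 0.
    return int(test_row["weight"].iloc[0] > 0 and test_row["feature_0"].iloc[0] > 0)


toy_test = pd.DataFrame({
    "ts_id": [0, 1, 2, 3],
    "weight": [0.0, 1.5, 2.0, 0.3],
    "feature_0": [1.0, -1.0, 0.5, 2.0],
})

env = MockEnv(toy_test)
for test_row, pred_row in env.iter_test():
    pred_row["action"] = toy_model(test_row)   # only the current row is visible
    env.predict(pred_row)

submission = pd.concat(env.submitted, ignore_index=True)
print(submission)
```

Because the generator refuses to advance until `predict` has been called, a model in this loop cannot use information from later rows.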

Files
1. train.csv - the training set, contains historical data and returns
2. example_test.csv - a mock test set which represents the structure of the unseen test set. You will not be directly using the test set or sample submission in this competition, as the time-series API will get/set the test set and predictions.
3. example_sample_submission.csv - a mock sample submission file in the correct format
4. features.csv - metadata pertaining to the anonymized features

   

# Evaluation metrics 


This competition is evaluated on a utility score. Each row in the test set represents a trading opportunity for which you will be predicting an action value, 1 to make the trade and 0 to pass on it. Each trade j has an associated weight and resp, which represents a return.

For each date $i$, we define:

$$p_i = \sum_j \left(weight_{ij} \cdot resp_{ij} \cdot action_{ij}\right)$$

$$t = \frac{\sum_i p_i}{\sqrt{\sum_i p_i^2}} \cdot \sqrt{\frac{250}{|i|}}$$

where $|i|$ is the number of unique dates in the test set. The utility is then defined as:

$$u = \min(\max(t, 0), 6) \sum_i p_i$$
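
The utility definition can be sketched in a few lines of NumPy and pandas. This is a minimal illustration, not the official scorer: the function name `utility_score`, the toy data, and the choice to return 0 when every daily value is zero are assumptions made here.

```python
import numpy as np
import pandas as pd


def utility_score(date, weight, resp, action):
    """Utility u = min(max(t, 0), 6) * sum_i p_i, following the formula in the text."""
    date = np.asarray(date)
    per_trade = np.asarray(weight) * np.asarray(resp) * np.asarray(action)

    # p_i: sum of weight * resp * action over all trades j on date i
    p = pd.Series(per_trade).groupby(date).sum().to_numpy()

    denom = np.sqrt(np.sum(p ** 2))
    if denom == 0:
        # Assumption: with no traded return at all, define the utility as 0
        return 0.0

    n_dates = len(p)  # |i|, the number of unique dates
    t = p.sum() / denom * np.sqrt(250 / n_dates)
    return float(min(max(t, 0), 6) * p.sum())


# Toy data (hypothetical): two dates, four trading opportunities
toy_date = [0, 0, 1, 1]
toy_weight = [1.0, 2.0, 1.0, 1.0]
toy_resp = [0.01, -0.02, 0.03, 0.01]
toy_action = [1, 0, 1, 1]

u = utility_score(toy_date, toy_weight, toy_resp, toy_action)
print(u)
```

Here $p_0 = 0.01$ and $p_1 = 0.04$, so $t$ exceeds the cap of 6 and the utility is $6 \times 0.05 = 0.3$. Passing on every trade (all actions 0) gives a utility of 0 under the assumption above.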


In [None]:
!pip install ../input/datatable0110/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# standard library
import os
import warnings
import multiprocessing as mp
from functools import partial

# data loading and competition API
import datatable as dt
import janestreet
import torch

# gradient boosting libraries
import xgboost as xgb
import lightgbm as lgbm
from catboost import CatBoostClassifier, Pool

# scikit-learn
from sklearn import decomposition, ensemble, metrics, model_selection, pipeline, preprocessing
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import (
    GridSearchCV, GroupKFold, KFold, RepeatedKFold, StratifiedKFold,
    TimeSeriesSplit, cross_val_score, train_test_split)

# hyperparameter search
from skopt import gp_minimize, space
from hyperopt import hp, fmin, tpe, Trials
from hyperopt.pyll.base import scope

# plotting
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

color = sns.color_palette()

warnings.filterwarnings('ignore')

In [None]:
%%time
train = dt.fread('/kaggle/input/jane-street-market-prediction/train.csv').to_pandas()

### Consider only the most recent records for ease of model building (the slice below starts at row 1,590,491 + 400,000 = 1,990,491); when submitting predictions, consider the full dataset

In [None]:
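# keep rows from position 1,990,491 onward; .copy() makes the slice an independent DataFrame
train = train.iloc[1590491 + 400000:].copy()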

In [None]:
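# target: 1 if the trade's return (resp) is positive, otherwise 0
train['action'] = (train['resp'] > 0).astype('int')
# model inputs: the anonymized feature columns plus the trade weight
features = [c for c in train.columns if 'feature' in c] + ['weight']
X_Train = train.loc[:, features]
y_train = train.loc[:, 'action']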

In [None]:
params = dict(
    objective='binary:logistic',  # binary classification of action
    max_depth=8,
    learning_rate=0.01,
    subsample=0.9,
    colsample_bytree=0.9,
    random_state=42,
    tree_method='gpu_hist')  # GPU histogram algorithm; needs a GPU-enabled session

<font color="red" size=3>Please upvote this kernel if you like it. It motivates me to create kernels with great content  :) </font>

In [None]:
# train a native XGBoost booster on all selected rows
dtrain = xgb.DMatrix(X_Train, label=y_train)
xg_boost_clf = xgb.train(params, dtrain, num_boost_round=500)

In [None]:
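from sklearn.model_selection import train_test_split

# Take a random 20% of the rows as the sample to explain with SHAP.
# Note: the model above was fit on all of X_Train, so X_test is not unseen data.
X_train, X_test, y_train, y_test = train_test_split(X_Train, y_train, random_state=42, test_size=0.2)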

In [None]:
import shap 

# load JS visualization code to notebook
shap.initjs()

# explain the trained booster; SHAP values are computed for every row of X_test
explainer = shap.TreeExplainer(xg_boost_clf)
shap_values = explainer.shap_values(X_test)

In [None]:
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X_test.iloc[0,:])

In [None]:
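# explain a single row (the third row of X_test); .iloc makes the positional slice explicit
shap_values = explainer.shap_values(X_test.iloc[2:3])
shap.force_plot(explainer.expected_value, shap_values, X_test.iloc[2:3])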
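
The force plots above can be read through SHAP's additivity property. As a sketch (the symbols $\phi_0$, $\phi_j$ and $M$ are notation introduced here, and the log-odds interpretation assumes the default `TreeExplainer` settings with the `binary:logistic` objective used in this notebook):

```latex
f(x) = \phi_0 + \sum_{j=1}^{M} \phi_j(x),
\qquad
P(\text{action} = 1 \mid x) = \frac{1}{1 + e^{-f(x)}}
```

Here $\phi_0$ corresponds to `explainer.expected_value`, $\phi_j(x)$ to the entry for feature $j$ in the row of `shap_values`, $M$ to the number of columns in `X_test`, and $f(x)$ to the model's raw (log-odds) output for that row. Under these assumptions, red and blue segments in a force plot push the log-odds above or below the base value $\phi_0$.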

In [None]:
# sort the feature indexes by their importance in the model
# (sum of SHAP value magnitudes over X_test)

# recompute SHAP values for all of X_test, since the previous cell
# replaced shap_values with the values for a single row
explainer = shap.TreeExplainer(xg_boost_clf)
shap_values = explainer.shap_values(X_test)

top_inds = np.argsort(-np.sum(np.abs(shap_values), 0))

# make SHAP dependence plots for every feature, from most to least important
for ind in top_inds:
    shap.dependence_plot(ind, shap_values, X_test)
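
The ranking used for the dependence plots can also be kept as a table, which is often handier for feature engineering than scrolling through plots. The sketch below is an assumption-laden illustration: `rank_features_by_shap` is a hypothetical helper, the toy array stands in for `shap_values`, and it uses the mean absolute SHAP value, which gives the same ordering as the sum used above.

```python
import numpy as np
import pandas as pd


def rank_features_by_shap(shap_values, feature_names):
    """Return features sorted by mean absolute SHAP value (largest first)."""
    shap_values = np.asarray(shap_values)          # shape: (n_rows, n_features)
    mean_abs = np.abs(shap_values).mean(axis=0)    # one importance value per feature
    ranking = pd.DataFrame({
        "feature": list(feature_names),
        "mean_abs_shap": mean_abs,
    })
    return ranking.sort_values("mean_abs_shap", ascending=False).reset_index(drop=True)


# Toy stand-in for shap_values: 2 rows, 3 features (hypothetical numbers)
toy_shap = np.array([
    [0.1, -0.5, 0.0],
    [-0.3, 0.4, 0.05],
])
toy_names = ["feature_0", "feature_1", "weight"]

ranking = rank_features_by_shap(toy_shap, toy_names)
print(ranking)
```

In the notebook, the same helper could be called as `rank_features_by_shap(shap_values, X_test.columns)` once `shap_values` holds the values for all of `X_test`.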
