Adversarial validation is one method to test for differences in distribution between the training and test sets. The idea is to train an auxiliary model and evaluate its predictive power in distinguishing whether a given observation belongs to the training or the test set.

--> If the auxiliary model performs well, it hints that some features differ markedly between the training and test sets, since that difference is what makes the two sets separable.
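
To make the idea concrete, here is a minimal sketch on synthetic data (not the competition data). It assumes scikit-learn is available; the helper name `adversarial_auc` and the choice of logistic regression as the auxiliary model are illustrative assumptions, not what the reference kernel does.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def adversarial_auc(train_X, test_X, seed=0):
    """Hypothetical helper: held-out AUC of a model that tells train rows from test rows."""
    X = np.vstack([train_X, test_X])
    # Label 1 = row comes from the training set, 0 = row comes from the test set
    y = np.concatenate([np.ones(len(train_X)), np.zeros(len(test_X))])
    X_fit, X_val, y_fit, y_val = train_test_split(
        X, y, test_size=0.33, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])


rng = np.random.default_rng(0)
same_a = rng.normal(size=(2000, 5))
same_b = rng.normal(size=(2000, 5))
shifted = rng.normal(size=(2000, 5))
shifted[:, 0] += 3.0  # one feature drifts between the two sets

auc_same = adversarial_auc(same_a, same_b)
auc_shifted = adversarial_auc(same_a, shifted)
print(f"same distribution: AUC = {auc_same:.3f}")
print(f"shifted feature:   AUC = {auc_shifted:.3f}")
```

An AUC near 0.5 means the model cannot tell the two sets apart; an AUC close to 1 means at least one feature separates them almost perfectly.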

For this competition, although the test set is not given to us, we can build a proxy for it by splitting the training set according to date. Let us mimic the private LB by creating a 6-month gap (~125 days) between two groups:
1. first 188 dates
2. last 188 dates

Doing so gives us a clue about how the features/data differ across dates, and thus lets us make some inference about how the test set (private LB) will behave.
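
The size of the gap implied by this split can be checked with a few lines. This sketch assumes the training set contains 500 distinct dates numbered 0 to 499; that count is an assumption made for the illustration, and the filter mirrors the one used on the real data in the cells that follow.

```python
import numpy as np

n_dates = 500  # assumption: dates run from 0 to 499
first_n = last_n = 188

dates = np.arange(n_dates)
early = dates[dates < first_n]              # dates 0..187
late = dates[dates > dates.max() - last_n]  # dates 312..499
gap = late.min() - early.max() - 1          # dates strictly between the two groups

print(len(early), len(late), gap)  # 188 188 124
```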

Reference kernel: 
[Adversarial Rainforest](https://www.kaggle.com/tunguz/adversarial-rainforest)

In [None]:
import numpy as np 
import pandas as pd 
import datatable as dt

import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn import model_selection, preprocessing, metrics

import matplotlib.pyplot as plt
import seaborn as sns
import shap

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
df = dt.fread('../input/jane-street-market-prediction/train.csv')
df = df.to_pandas()
print(df.shape)

In [None]:
## split data by first n dates and last n dates
first_n = 188
last_n = 188

temp_df = df[(df['date'] < first_n) | (df['date'] > df['date'].max() - last_n)].copy()

print(temp_df.shape)
print(temp_df['date'].nunique())

In [None]:
del df

In [None]:
## categorise first n dates and last n dates
temp_df.loc[:, 'target'] = 0
temp_df.loc[temp_df['date'] < first_n, 'target'] = 1

In [None]:
features = [c for c in temp_df.columns if 'feature' in c] + ['resp']

Y = temp_df['target'].values
X = temp_df[features]

In [None]:
# train test split
X_train, X_test, y_train, y_test = model_selection.train_test_split(X,Y,test_size=0.33, random_state=42)

print(X_train.shape)
print(X_test.shape)

# prepare data for lgb
train = lgb.Dataset(X_train, label=y_train)
test = lgb.Dataset(X_test, label=y_test)

In [None]:
param = {'num_leaves': 50,
         # 'min_child_samples' is an alias of 'min_data_in_leaf'; the duplicate entry (20) was dropped
         'min_data_in_leaf': 30,
         'objective': 'binary',
         'max_depth': 5,
         'learning_rate': 0.2,
         'boosting': 'gbdt',
         'feature_fraction': 0.9,
         'bagging_freq': 1,
         'bagging_fraction': 0.9,
         'bagging_seed': 44,
         'metric': 'auc',
         'verbosity': -1}

In [None]:
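## fit an auxiliary model
num_round = 50
# LightGBM >= 4.0 takes logging and early stopping as callbacks (available since 3.3)
clf = lgb.train(param, train, num_round, valid_sets=[train, test],
                callbacks=[lgb.log_evaluation(period=50), lgb.early_stopping(stopping_rounds=50)])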

In [None]:
# shap values to explain impact of features on model's prediction
shap.initjs()
X_shap = X_test[1000:]  # rows to explain; reused below so the plotted data matches shap_values
shap_values = shap.TreeExplainer(clf).shap_values(X_shap)

In [None]:
shap.summary_plot(shap_values, X_shap)

The AUC is relatively high, with features 40-44 as the main contributors. These could be non-stationary features (e.g. computed from historical/lagged data).
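
As a cheap complement to SHAP, each feature can also be scored on its own by how well it separates the two periods. The sketch below uses synthetic data and a hypothetical helper, `single_feature_auc`, which computes a rank-based AUC (the Mann-Whitney U statistic divided by the number of pairs). It is an illustration under those assumptions, not part of the original analysis.

```python
import numpy as np


def single_feature_auc(early_values, late_values):
    """Rank-based AUC for one feature (Mann-Whitney U / number of pairs); assumes no tied values."""
    values = np.concatenate([early_values, late_values])
    ranks = np.empty(len(values))
    ranks[np.argsort(values)] = np.arange(1, len(values) + 1)
    n_early, n_late = len(early_values), len(late_values)
    u_early = ranks[:n_early].sum() - n_early * (n_early + 1) / 2
    auc = u_early / (n_early * n_late)
    # Direction does not matter for drift detection, so fold the score around 0.5
    return max(auc, 1 - auc)


rng = np.random.default_rng(1)
n = 5000
stable_early, stable_late = rng.normal(size=n), rng.normal(size=n)
# Synthetic non-stationary feature: its level moves between the two periods
drift_early, drift_late = rng.normal(loc=0.0, size=n), rng.normal(loc=1.5, size=n)

auc_stable = single_feature_auc(stable_early, stable_late)
auc_drift = single_feature_auc(drift_early, drift_late)
print(f"stable feature:   {auc_stable:.3f}")
print(f"drifting feature: {auc_drift:.3f}")
```

This score only picks up shifts in level; a feature whose variance changes while its median stays put would still score close to 0.5.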

Note that I have included 'resp' as one of the features.

Let's try removing feature_41 to feature_45 (the five columns dropped in the next cell) and see how the AUC changes.

In [None]:
cols_to_remove = ['feature_' + str(x) for x in range(41,46)]

X_train.drop(cols_to_remove, axis=1, inplace = True)
X_test.drop(cols_to_remove, axis=1, inplace = True)

print(X_train.shape)
print(X_test.shape)

# prepare data for lgb
train = lgb.Dataset(X_train, label=y_train)
test = lgb.Dataset(X_test, label=y_test)

In [None]:
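## fit an auxiliary model
num_round = 50
# LightGBM >= 4.0 takes logging and early stopping as callbacks (available since 3.3)
clf = lgb.train(param, train, num_round, valid_sets=[train, test],
                callbacks=[lgb.log_evaluation(period=50), lgb.early_stopping(stopping_rounds=50)])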

The AUC is reduced, but is still relatively high.

In [None]:
# shap values to explain impact of features on model's prediction
shap.initjs()
X_shap = X_test[1000:]  # rows to explain; reused below so the plotted data matches shap_values
shap_values = shap.TreeExplainer(clf).shap_values(X_shap)

In [None]:
shap.summary_plot(shap_values, X_shap)

## My thoughts
* This is just one way to look at differences in distribution between two datasets. Other methods include statistical tests for differences in mean/variance, tests for structural breaks, etc.
* This kernel only compares two distinct date ranges. The time component is just one of many possible reasons for differences; they could also come from different instruments being traded altogether.
* Maybe this method could be used to study feature_0 too
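
The statistical tests mentioned in the first point can be sketched with SciPy. The data below is synthetic and the variable names are made up for the illustration; on the real data one would pass the values of a single feature from the early and late date ranges.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
early = rng.normal(loc=0.0, scale=1.0, size=2000)
late_mean_shift = rng.normal(loc=0.5, scale=1.0, size=2000)  # mean moved
late_var_shift = rng.normal(loc=0.0, scale=2.0, size=2000)   # variance moved

# Welch's t-test: difference in means without assuming equal variances
t_res = stats.ttest_ind(early, late_mean_shift, equal_var=False)
# Levene's test: difference in variances
lev_res = stats.levene(early, late_var_shift)
# Two-sample Kolmogorov-Smirnov test: any difference between the two distributions
ks_res = stats.ks_2samp(early, late_var_shift)

print(f"Welch t-test p-value: {t_res.pvalue:.2e}")
print(f"Levene p-value:       {lev_res.pvalue:.2e}")
print(f"KS test p-value:      {ks_res.pvalue:.2e}")
```

With millions of rows, even tiny differences produce very small p-values, so the test statistic or an effect size is usually more informative than the p-value alone.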