In my previous notebook, [JS Adversarial Validation: Time Consistency](http://www.kaggle.com/gerwynng/js-adversarial-validation-time-consistency), I explored a one-size-fits-all approach to testing for time consistency.

In this notebook, I extend the analysis to look at **how the individual features 'shift' across time.** Again, I mimic the private leaderboard by leaving a 6-month gap (~125 days) between two periods and checking how consistent each feature is across that gap. That is, we split the train data into two subsets:
 1. first 188 days
 2. last 188 days
 
 
Doing so gives us a better idea of how each individual feature shifts across time.
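
To make the arithmetic behind the split concrete, here is a minimal sketch of the two windows and the gap between them. The helper `split_windows` is hypothetical (it is not part of the notebook), and it assumes day indices run from 0 to 499, i.e. 500 days in total.

```python
def split_windows(n_days=500, window=188):
    """Return (first, last, gap) for a first/last-window split of n_days.

    Hypothetical helper for illustration. Assumes day indices run from
    0 to n_days - 1; this is not checked against the data here.
    """
    first = range(0, window)               # days 0 .. window-1
    last = range(n_days - window, n_days)  # days n_days-window .. n_days-1
    gap = last.start - first.stop          # days that fall in neither window
    return first, last, gap


first, last, gap = split_windows()
print(len(first), len(last), gap)  # 188 188 124
```

Under these assumptions the two windows hold 188 days each and are separated by 124 days, which is where the "~125 days" figure comes from.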


[Reference kernel](https://www.kaggle.com/nroman/eda-for-cis-fraud-detection)


In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import gc
import datatable as dt
import math
import lightgbm as lgb

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

import matplotlib.pyplot as plt

In [None]:
df = dt.fread('../input/jane-street-market-prediction/train.csv').to_pandas()
print(df.shape)

features = [c for c in df.columns if 'feature' in c]

In [None]:
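# assuming 'date' runs from 0 to 499, `< 188` keeps exactly the first 188 days
train = df[df['date'] < 188]
# last 188 days: dates 312 to 499
test = df[df['date'] >= (500-188)]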

In [None]:
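# Note: 'min_child_samples' is an alias of 'min_data_in_leaf' in LightGBM,
# so only one of the two is kept here.
# With max_depth = 5 a tree can have at most 2**5 = 32 leaves,
# so num_leaves = 50 is effectively capped by the depth limit.
param = {'num_leaves': 50,
         'min_data_in_leaf': 30,
         'objective': 'binary',
         'max_depth': 5,
         'learning_rate': 0.2,
         'boosting': 'gbdt',
         'feature_fraction': 0.9,
         'bagging_freq': 1,
         'bagging_fraction': 0.9,
         'bagging_seed': 44,
         'metric': 'auc',
         'verbosity': -1}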

In [None]:
def covariate_shift(feature):
    """Return the held-out ROC AUC of a model that predicts, from a single
    feature, whether a row comes from the early (0) or late (1) period."""
    df_train = pd.DataFrame(data = {feature: train[feature], 'isTest': 0})
    df_test = pd.DataFrame(data = {feature: test[feature], 'isTest': 1})
    
    # Creating a single dataframe
    df_merge = pd.concat([df_train, df_test], ignore_index=True)
    
    # Splitting it to a training and testing set (stratified on the period label)
    X_train, X_test, y_train, y_test = train_test_split(df_merge[feature], df_merge['isTest'].values, test_size=0.33,
                                                        random_state=47, stratify=df_merge['isTest'].values)
    # prepare data for lgb: a single feature must be reshaped to (n_samples, 1)
    train_ = lgb.Dataset(np.expand_dims(X_train, axis=-1), label=y_train)
    
    # 50 boosting rounds
    clf = lgb.train(param, train_, 50)
    roc_auc = roc_auc_score(y_test, clf.predict(np.expand_dims(X_test, axis=-1)))

    # free memory before moving on to the next feature
    del X_train, y_train, X_test, y_test
    gc.collect()
    
    return roc_auc

In [None]:
scores = []
for f in features:
    score = covariate_shift(f)
    scores.append(score)
    print('{feature} : {score}'.format(feature = f, score = score))

A ROC AUC score close to 0.5 means the feature shows no detectable shift across the 6-month gap: the auxiliary model cannot distinguish the feature's values in the first 188 days from those in the last 188 days. The further the score rises above 0.5, the stronger the shift.
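
To build intuition for this reading of the score, here is a minimal, dependency-free sketch on synthetic data. It is a simplified stand-in for the LightGBM classifier used in this notebook: it scores each row by the raw feature value and computes the ROC AUC as the probability that a "late" value exceeds an "early" one (ties count half). The function `auc_from_samples` and the Gaussian samples are assumptions made for illustration only.

```python
import random
from bisect import bisect_left, bisect_right


def auc_from_samples(early, late):
    """ROC AUC when the raw value is used as the score for 'late'.

    Equals P(late value > early value) + 0.5 * P(tie), estimated over
    all (early, late) pairs. Hypothetical helper for illustration.
    """
    early_sorted = sorted(early)
    wins = 0.0
    for v in late:
        below = bisect_left(early_sorted, v)           # early values < v
        ties = bisect_right(early_sorted, v) - below   # early values == v
        wins += below + 0.5 * ties
    return wins / (len(early) * len(late))


rng = random.Random(0)
early = [rng.gauss(0, 1) for _ in range(2000)]    # "first period" values
same = [rng.gauss(0, 1) for _ in range(2000)]     # same distribution
shifted = [rng.gauss(1, 1) for _ in range(2000)]  # mean moved by one std

auc_same = auc_from_samples(early, same)
auc_shifted = auc_from_samples(early, shifted)
print('no shift  :', round(auc_same, 3))
print('mean shift:', round(auc_shifted, 3))
```

When both samples come from the same distribution the score lands near 0.5, while a shift in the mean pushes it clearly above 0.5. Unlike this rank-based sketch, a tree model can also pick up shifts that are not monotonic in the feature value, such as a change in variance.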

We will normalise the ROC AUC scores across the features (as the absolute difference from 0.5) and plot them for visual inspection.

In [None]:
# normalise scores by looking at absolute difference from 0.5
norm_scores = [abs(x-0.5) for x in scores]

In [None]:
plt.figure(figsize=(20,10))
plt.plot(range(len(norm_scores)), norm_scores)
plt.ylabel('Norm Scores')
plt.xlabel('Feature');

Let's look at the five features with the highest norm scores.

In [None]:
top_features = sorted(range(len(norm_scores)), key=lambda i: norm_scores[i], reverse=True)[:5]
print('Top Five Features with Highest Covariate Shifts:')
for x in top_features:
    # look up the name in `features` rather than assuming index == feature number
    print('{feature} : {norm_score}'.format(feature = features[x], norm_score = norm_scores[x]))