## Abstract

This notebook checks whether the train and test sets differ significantly. Can we trust our local validation schemes and the public LB? I use adversarial validation and the Kolmogorov-Smirnov test to find out.

### Adversarial Validation

In [None]:
#Load packages
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

In [None]:
#Load data; drop target and IDs
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

train.drop(columns=['ID_code', 'target'], inplace=True)
test.drop(columns=['ID_code'], inplace=True)

In [None]:
#Create label array and complete dataset
y1 = np.array([0]*train.shape[0])
y2 = np.array([1]*test.shape[0])
y = np.concatenate((y1, y2))

X_data = pd.concat([train, test])
X_data.reset_index(drop=True, inplace=True)

In [None]:
#Initialize splits&LGBM
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=13)

lgb_model = lgb.LGBMClassifier(max_depth=-1,
                               n_estimators=100,
                               learning_rate=0.1,
                               objective='binary',
                               n_jobs=-1)
                                   
counter = 1

In [None]:
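#Train 5-fold adversarial validation classifier
for train_index, test_index in skf.split(X_data, y):
    print('\nFold {}'.format(counter))
    #skf.split yields positional indices, so select rows with iloc
    X_fit, X_val = X_data.iloc[train_index], X_data.iloc[test_index]
    y_fit, y_val = y[train_index], y[test_index]
    
    #LightGBM >= 4.0 takes early stopping and logging as callbacks instead of fit() keywords
    lgb_model.fit(X_fit, y_fit, eval_metric='auc',
                  eval_set=[(X_val, y_val)],
                  callbacks=[lgb.early_stopping(stopping_rounds=10),
                             lgb.log_evaluation(period=100)])
    counter += 1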

The validation AUC is stable across folds and stays close to 0.5. This means the classifier can hardly distinguish the train set from the test set.
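To make the reading of that AUC concrete, here is a minimal sketch on synthetic data. It is not the notebook's model: it swaps in scikit-learn's `GradientBoostingClassifier` so that it runs without LightGBM, and the helper `adversarial_auc` and the generated arrays are hypothetical names used only for illustration. It should show an out-of-fold AUC near 0.5 when both samples share a distribution and a much higher AUC when one sample is shifted.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def adversarial_auc(train_X, test_X, seed=13):
    """Out-of-fold AUC of a classifier that separates train rows (0) from test rows (1)."""
    X = np.vstack([train_X, test_X])
    y = np.concatenate([np.zeros(len(train_X), dtype=int),
                        np.ones(len(test_X), dtype=int)])
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    clf = GradientBoostingClassifier(n_estimators=50, random_state=seed)
    # Every row is scored by a model that never saw it during fitting
    proba = cross_val_predict(clf, X, y, cv=skf, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)


# Synthetic stand-ins for train/test (assumption: 5 Gaussian features)
rng = np.random.default_rng(0)
base = rng.normal(size=(2000, 5))
same = rng.normal(size=(2000, 5))               # same distribution as base
shifted = rng.normal(loc=1.0, size=(2000, 5))   # mean shifted by 1 in every feature

auc_same = adversarial_auc(base, same)
auc_shifted = adversarial_auc(base, shifted)
print("same distribution AUC: {:.3f}".format(auc_same))
print("shifted distribution AUC: {:.3f}".format(auc_shifted))
```

Scoring out-of-fold predictions, rather than predictions on the rows used for fitting, is what keeps an overfit classifier from reporting a misleadingly high AUC.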

Now let's extend the investigation and compare the distribution of each feature in the train and test sets using the [Kolmogorov-Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test).

### Kolmogorov-Smirnov Test
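
As a reminder of what the two-sample test measures, the sketch below computes the KS statistic by hand as the largest vertical gap between two empirical CDFs and compares it with `scipy.stats.ks_2samp`. The helper `ks_statistic` and the synthetic samples are hypothetical and exist only for this illustration.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_statistic(a, b):
    """Largest absolute difference between the empirical CDFs of a and b."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # The gap between two step functions can only change at observed values
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))


rng = np.random.default_rng(0)
sample = rng.normal(size=500)
sample_same = rng.normal(size=500)              # same distribution
sample_shift = rng.normal(loc=0.5, size=500)    # mean shifted by 0.5

d_same = ks_statistic(sample, sample_same)
d_shift = ks_statistic(sample, sample_shift)
res_same = ks_2samp(sample, sample_same)
res_shift = ks_2samp(sample, sample_shift)

print("same:    manual D = {:.5f}, scipy D = {:.5f}".format(d_same, res_same.statistic))
print("shifted: manual D = {:.5f}, scipy D = {:.5f}".format(d_shift, res_shift.statistic))
```

The statistic measures how far apart the two distributions are; the p-value is what gets compared with the chosen significance level.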

In [None]:
#Load more packages
from scipy.stats import ks_2samp
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings('ignore')

In [None]:
#Perform the KS test for each feature in train/test, draw both distributions, and sort features by test outcome.
#Plots are hidden. If you'd like to look at them, press the "Output" button.
alpha = 0.05  #assumed significance level (a conventional choice)
hypothesisnotrejected = []
hypothesisrejected = []

for col in train.columns:
    statistic, pvalue = ks_2samp(train[col], test[col])
    #Reject the null hypothesis (same distribution) when the p-value falls below alpha
    if pvalue >= alpha:
        hypothesisnotrejected.append(col)
    else:
        hypothesisrejected.append(col)
        
    plt.figure(figsize=(8,4))
    plt.title("Kolmogorov-Smirnov test for train/test\n"
              "feature: {}, statistic: {:.5f}, pvalue: {:.5f}".format(col, statistic, pvalue))
    #fill=True replaces the deprecated shade=True (seaborn >= 0.11)
    sns.kdeplot(train[col], color='blue', fill=True, label='Train')
    sns.kdeplot(test[col], color='green', fill=True, label='Test')
    plt.legend()
    plt.show()

In [None]:
len(hypothesisnotrejected), len(hypothesisrejected)

In [None]:
print(hypothesisrejected)

In the run reported here, 185 features pass the Kolmogorov-Smirnov test: for them we cannot reject the null hypothesis that the train and test values come from the same distribution. The remaining 15 features fail the test and probably deserve a closer look.
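
When reading such counts, it helps to remember that 200 separate tests produce some rejections by chance alone. The sketch below is an illustration under stated assumptions, not part of the original analysis: it assumes a significance level of 0.05, uses made-up p-values, and applies a Bonferroni correction through a hypothetical helper `bonferroni_reject`.

```python
import numpy as np


def bonferroni_reject(pvalues, alpha=0.05):
    """Flag tests whose p-value is below alpha divided by the number of tests."""
    pvalues = np.asarray(pvalues, dtype=float)
    return pvalues < alpha / len(pvalues)


n_features = 200
alpha = 0.05  # assumed significance level

# If every null hypothesis were true, this many rejections are expected on average
expected_false_rejections = n_features * alpha

# Hypothetical p-values: 198 drawn under the null plus two genuinely tiny ones
rng = np.random.default_rng(0)
pvalues = np.concatenate([rng.uniform(size=198), [1e-6, 1e-8]])

naive = int(np.sum(pvalues < alpha))
corrected = int(np.sum(bonferroni_reject(pvalues, alpha)))

print("expected false rejections under the null:", expected_false_rejections)
print("rejections at alpha:", naive)
print("rejections after Bonferroni:", corrected)
```

Bonferroni is the most conservative common correction; it is used here only because it fits in one line, and other procedures would also work.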

## Conclusion

Adversarial validation gives no evidence that the train and test sets come from different distributions. An AUC around 0.50 indicates that LGBM can hardly distinguish train observations from test observations, so the two datasets are quite similar. Local validation schemes and the public LB should correctly reflect your progress in this competition.

The Kolmogorov-Smirnov test also suggests that the two sets are quite similar. The hypothesis that both samples are drawn from the same distribution is rejected for only 15 of the 200 features. Those 15 features probably deserve more attention.