# Comparing two datasets
A machine learning model generalizes to new data only under the assumption that the trends it learnt on the training data also appear in the new data. In real-world problems this assumption often fails, which leads to generalization error and performance deterioration over time. Specifically, two use cases benefit from knowing what changed between datasets:

* **Pre-implementation** - finding out what differs between train and OOS/OOT (out-of-sample/out-of-time) data, and identifying variables with similar distributions
* **Post-implementation** - finding out what changed between the training and production time periods, to deep-dive into changes in variable distributions and their effect on performance

In this notebook, I will explore 5 ways of doing these comparisons

* Using seaborn violin plots to compare the distributions visually between two datasets
* ANOVA and Tukey's test to establish whether the difference between two datasets is significant or not
* Andrews curves - these map each observation to a curve, so differences between the datasets can be spotted on visual inspection
* KS statistic to check whether each variable in train and test comes from the same distribution
* And finally, my favorite: building an ML classifier that predicts which dataset an observation belongs to. Its AUC gives a quantitative measure of how different the datasets are, and its variable importances indicate which variables drive the difference

### Dataset

I am using the [Default of credit card clients dataset](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset) and splitting it into train and test sets with scikit-learn's `train_test_split`.


## 1. Violin Plots
Violin plots are similar to box-and-whisker plots. They are used here because a violin can be split into two halves to compare the train and test distributions side by side. Visually, the train halves (light green) are fairly close to the test halves (darker blue).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def PrepareData():
    df = pd.read_csv('/kaggle/input/default-of-credit-card-clients-dataset/UCI_Credit_Card.csv')
#     print ('Shape :',df.shape)
#     print ('NULLS : \n', df.isnull().sum())

    df = df.rename(columns={'default.payment.next.month': 'def_pay', 
                            'PAY_0': 'PAY_1'})

    from sklearn.model_selection import train_test_split
    features = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_1', 'PAY_2',
           'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
           'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
           'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
    X = df[features].copy()
    y = df['def_pay']
    
#     print ('Train Test Split')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
#     print ('X_train Shape :', X_train.shape)
#     print ('X_test Shape :', X_test.shape)
#     print ('Event Rate Train : ', y_train.mean())
#     print ('Event Rate Test : ', y_test.mean())
    
    return X_train, X_test, y_train, y_test

# prepare data
X_train, X_test, y_train, y_test = PrepareData()

X_train['dataset'] = 'TRAIN'
X_test['dataset'] = 'TEST'
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames
data = pd.concat([X_train, X_test])

# create subplots
f, axes = plt.subplots(4,6,figsize = (30,15))
ax0 = axes.ravel()[0]

# hide ticks on every subplot (including the unused ones)
for ax in axes.ravel():
    ax.set_xticks([])
    ax.set_yticks([])
    
# iterate over all variables and generate violin plot
for i, col in enumerate(X_train.drop('SEX', axis=1).columns[:-1]):
    ax = axes.ravel()[i]
    g = sns.violinplot(x='SEX', data=data, ax=ax, y=col, hue='dataset', split=True, palette = ['#78e08f', '#0a3d62'])
    ax.set_title(col, fontsize=14, color='#0a3d62', fontfamily='monospace')
    ax.set_ylabel('')
    ax.set_xticks([])
    ax.set_xlabel('')
    ax.get_legend().remove()
    sns.despine(top=True, right=True, left=True, bottom=True)
plt.suptitle('DISTRIBUTIONS BY GENDER FOR EACH VARIABLE IN TRAIN AND TEST', fontsize=20, color='#0a3d62', fontweight='bold', fontfamily='monospace')
plt.show()

## 2. ANOVA
The analysis of variance statistical models were developed by the English statistician Sir R. A. Fisher and are commonly used to determine if there is a significant difference between the means of two or more data sets.

Here we are comparing the train and test datasets. So,
* **Null Hypothesis** - The two datasets are similar (a variable has the same mean in both)
* **Alternate Hypothesis** - The two datasets are dissimilar (the means differ)

One-way ANOVA allows us to do this comparison. Rejecting the null hypothesis supports the alternate, meaning the means of the two datasets differ significantly. We reject the null when $p \leq \alpha$, where $\alpha$ is the significance level, typically 5% (i.e. 95% confidence).
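
As a minimal sketch of the decision rule above, the following applies `scipy.stats.f_oneway` to two synthetic samples. The data, the seed, and the helper name `means_differ` are assumptions made for illustration; they are not part of this notebook's pipeline. With only two groups, one-way ANOVA is equivalent to a two-sample t-test with equal variances.

```python
import numpy as np
from scipy.stats import f_oneway

def means_differ(sample_a, sample_b, alpha=0.05):
    """Hypothetical helper: one-way ANOVA on two samples.

    Returns the p-value and whether the null hypothesis
    (equal means) is rejected at significance level alpha.
    """
    f_stat, p_value = f_oneway(sample_a, sample_b)
    return p_value, bool(p_value <= alpha)

# Synthetic data (assumption): one "train-like" sample and two candidates
rng = np.random.default_rng(0)
train_like = rng.normal(loc=0.0, scale=1.0, size=2000)
same_dist = rng.normal(loc=0.0, scale=1.0, size=2000)   # same mean
shifted = rng.normal(loc=0.5, scale=1.0, size=2000)     # mean shifted by 0.5

p_same, reject_same = means_differ(train_like, same_dist)
p_shift, reject_shift = means_differ(train_like, shifted)

print(f"same mean    : p={p_same:.3g}, reject H0={reject_same}")
print(f"shifted mean : p={p_shift:.3g}, reject H0={reject_shift}")
```

Note that ANOVA only compares means: two samples with equal means but different spreads or shapes would not be flagged by this test.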


In [None]:
X_train['y'] = y_train
X_train['dataset'] = 1
X_test['y'] = y_test
X_test['dataset'] = 2
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames
data2 = pd.concat([X_train, X_test])
# data2.drop('dataset', axis=1, inplace=True)

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison

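from io import StringIO

# one-way ANOVA per variable: collect the result tables, then concatenate once
anova_frames = []
for col in X_train.columns[:-2]:
    model = ols(f'{col} ~ dataset', data = data2).fit()
    anova_result = sm.stats.anova_lm(model, typ=2)
    anova_result['var'] = col
    anova_frames.append(anova_result)
anova_results = pd.concat(anova_frames)

# Tukey's HSD per variable: parse the HTML summary table into a DataFrame
tukeys_frames = []
for col in X_train.columns[:-2]:
    mc = MultiComparison(data2[col], data2['dataset'])
    result = mc.tukeyhsd().summary()
    # recent pandas versions expect a file-like object rather than a raw HTML string
    result_as_df = pd.read_html(StringIO(result.as_html()))[0]
    result_as_df['var'] = col
    tukeys_frames.append(result_as_df)
tukeys_results = pd.concat(tukeys_frames, ignore_index=True)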
    
f, ax = plt.subplots(1,1,figsize=(30,6))
anova_results[['var','PR(>F)']].dropna().set_index('var')['PR(>F)'].plot(kind='bar', ax=ax, color = '#78e08f', width=0.6)
ax.set_xticklabels(ax.get_xticklabels(), rotation=0, color='#0a3d62', fontfamily='monospace', fontsize=12)
ax.set_xlabel('')
ax.set_yticks([])
for s in ['top','right','bottom','left']:
    ax.spines[s].set_visible(False)
plt.axhline(y = 0.05, color = '#0a3d62', linestyle = '--', lw=3)
ax.text(-0.5,1.17,'P VALUES FOR EACH VARIABLE', fontsize=20, fontweight='bold', color='#0a3d62', fontfamily='monospace')
ax.text(-0.5,1.12,'All P values are > 0.05 (horizontal line), i.e. none of the variables differ significantly between train and test', fontsize=15, color='#0a3d62', fontfamily='monospace')
for bar in ax.patches:
    ax.annotate(
        format(bar.get_height(), '.2f'), 
        (bar.get_x() + bar.get_width() / 2, bar.get_height()), 
        ha='center', va='bottom',
        size=15, xytext=(0, 8), color = '#0a3d62',
        textcoords='offset points'
    )

plt.show()

## 3. Andrews Curves
Andrews curves visualize high-dimensional data by mapping each observation onto a function. The mapping preserves means, distances, and variances. It is given by the formula:

$$T(n) = \frac{x_1}{\sqrt{2}} + x_2 \sin(n) + x_3 \cos(n) + x_4 \sin(2n) + x_5 \cos(2n) + \dots$$

The test curves are completely overlaid on the train curves, which means the variable distributions in train and test overlap heavily.
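
To make the formula concrete, here is a small NumPy sketch that evaluates the curve for a single observation. The function name `andrews_curve` and the choice of evaluating over $[-\pi, \pi]$ are assumptions made for illustration; the notebook itself relies on `pandas.plotting.andrews_curves` for the actual plot.

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate T(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ...

    x : one observation (1-D sequence of feature values)
    t : 1-D array of angles at which to evaluate the curve
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    curve = np.full_like(t, x[0] / np.sqrt(2.0))
    for i, coef in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2                    # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            curve += coef * np.sin(harmonic * t)   # x2, x4, ... pair with sine
        else:
            curve += coef * np.cos(harmonic * t)   # x3, x5, ... pair with cosine
    return curve

# One hypothetical observation with five features
t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve([1.0, 2.0, 3.0, 4.0, 5.0], t)
print(curve.shape)
```

Because the coefficients are the raw feature values, features on large scales dominate the curve, so it is common to standardize the variables before plotting.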

In [None]:
from pandas.plotting import andrews_curves
f,ax = plt.subplots(1,1, figsize=(30,10))
data2['dataset'] = data2.dataset.replace({1:'TRAIN',2:'TEST'})
data3 = data2.drop('y', axis=1)
andrews_curves(data3, "dataset", ax=ax, color = ['#0a3d62','#78e08f'])
for s in ['top','right','bottom','left']:
    ax.spines[s].set_visible(False)
plt.title('ANDREWS CURVES BY DATASET', fontsize=20, color='#0a3d62', fontfamily='monospace', fontweight='bold')
plt.show()

## 4. KS Statistic

* `ks_2samp` performs the two-sample Kolmogorov-Smirnov test for goodness of fit.
* The null hypothesis states that both cumulative distributions are the same. Rejecting the null hypothesis means the cumulative distributions are different.
* This test compares the underlying continuous distributions F(x) and G(x) of two independent samples. The `alternative` parameter selects the hypotheses:
    * `two-sided`: The null hypothesis is that the two distributions are identical, F(x)=G(x) for all x; the alternative is that they are not identical.
    * `less`: The null hypothesis is that F(x) >= G(x) for all x; the alternative is that F(x) < G(x) for at least one x.
    * `greater`: The null hypothesis is that F(x) <= G(x) for all x; the alternative is that F(x) > G(x) for at least one x.
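
The three alternatives can be tried side by side on synthetic data. This is only a sketch: the two normal samples, the seed, and the 0.3 shift are assumptions chosen for illustration, with the first sample playing the role of F and the second the role of G.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic data (assumption): a reference feature and a drifted copy
rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, size=3000)   # sample behind F(x)
drifted = rng.normal(0.3, 1.0, size=3000)     # sample behind G(x), shifted right

pvalues = {}
for alternative in ("two-sided", "less", "greater"):
    result = ks_2samp(reference, drifted, alternative=alternative)
    pvalues[alternative] = result.pvalue
    print(f"{alternative:>9}: D={result.statistic:.3f}, p={result.pvalue:.3g}")
```

Since the drifted sample takes larger values, its empirical CDF sits below the reference CDF, so the `greater` alternative (F(x) > G(x) for some x) is the one-sided test expected to pick up this particular shift.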

In [None]:
from scipy.stats import ks_2samp
# collect one row per variable, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
ks_rows = []
alpha = 0.05
for col in X_train.columns[:-2]:
    s, p = ks_2samp(X_train[col], X_test[col])
    ks_rows.append({
        'kstat': s,
        'pval': p,
        'variable': col,
        'reject_null_hypo': p < alpha
    })
ksdf = pd.DataFrame(ks_rows)
    

f, ax = plt.subplots(1,1,figsize=(30,6))
ksdf[['variable','pval']].set_index('variable')['pval'].plot(kind='bar', ax=ax, color = '#78e08f', width=0.6)
ax.set_xticklabels(ax.get_xticklabels(), rotation=0, color='#0a3d62', fontfamily='monospace', fontsize=12)
ax.set_xlabel('')
ax.set_yticks([])
for s in ['top','right','bottom','left']:
    ax.spines[s].set_visible(False)
plt.axhline(y = alpha, color = '#0a3d62', linestyle = '--', lw=3)
ax.text(-0.5,1.17,'P VALUE FOR EACH VARIABLE', fontsize=20, fontweight='bold', color='#0a3d62', fontfamily='monospace')
ax.text(-0.5,1.12,f'All P values are > {alpha} (horizontal line), i.e. none of the variables differ significantly between train and test', fontsize=15, color='#0a3d62', fontfamily='monospace')
for bar in ax.patches:
    ax.annotate(
        format(bar.get_height(), '.2f'), 
        (bar.get_x() + bar.get_width() / 2, bar.get_height()), 
        ha='center', va='bottom',
        size=15, xytext=(0, 8), color = '#0a3d62',
        textcoords='offset points'
    )
plt.show()



## 5. Building Model to predict train/test label
* The objective is to stack train and test, and build a model that predicts whether an observation belongs to train or test
* Performance is measured by AUC: a higher AUC means a larger difference between train and test, while an AUC near 50% means the model cannot tell them apart
* The variable importances suggest which variables drive the differences
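
The idea can be sketched on synthetic data before applying it to the credit card dataset. Note the substitutions: this sketch uses scikit-learn's `LogisticRegression` with cross-validated AUC instead of the XGBoost model used in this notebook, and the helper name `separability_auc`, the synthetic samples, and the shift applied to one feature are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def separability_auc(sample_a, sample_b, cv=5):
    """Hypothetical helper: cross-validated AUC of a classifier that
    tries to tell rows of sample_a (label 0) from rows of sample_b (label 1)."""
    X = np.vstack([sample_a, sample_b])
    y = np.r_[np.zeros(len(sample_a)), np.ones(len(sample_b))]
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()

# Synthetic data (assumption): four features per row
rng = np.random.default_rng(7)
base = rng.normal(size=(1500, 4))
same = rng.normal(size=(1500, 4))       # drawn from the same distribution
drifted = rng.normal(size=(1500, 4))
drifted[:, 2] += 1.0                    # shift the mean of one feature

auc_same = separability_auc(base, same)
auc_drifted = separability_auc(base, drifted)
print(f"AUC, same distribution : {auc_same:.3f}")
print(f"AUC, one feature shifted: {auc_drifted:.3f}")
```

A linear model only detects shifts that are linearly separable, which is one reason to prefer a tree-based model such as XGBoost for the real comparison.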


In [None]:
X_train['label'] = 0
X_test['label'] = 1
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames
data5 = pd.concat([X_train, X_test])
data5.drop('dataset', axis=1, inplace=True)
print(data5.shape)

from sklearn.model_selection import train_test_split
# print ('Train Test Split')
X_train, X_test, y_train, y_test = train_test_split(data5.drop('label',axis=1), data5.label, test_size=0.20, random_state=42)
# print ('X_train Shape :', X_train.shape)
# print ('X_test Shape :', X_test.shape)
# print ('Event Rate Train :', round(y_train.mean(),2))
# print ('Event Rate Test :', round(y_test.mean(),2))

import xgboost as xgb
train_dm = xgb.DMatrix(data = X_train, label = y_train.values)
test_dm = xgb.DMatrix(data = X_test, label = y_test.values)
# the number of boosting rounds is an argument of xgb.train, not a booster parameter
num_boost_round = 500
params = {
    'objective': 'binary:logistic',
    'max_depth' : 5,
    'gamma':10,
    'eta': 0.01,
    'min_child_weight': 10,
    'verbosity': 0
}
model = xgb.train(params, train_dm, num_boost_round = num_boost_round)
train_preds = model.predict(train_dm)
test_preds = model.predict(test_dm)
from sklearn.metrics import roc_auc_score

print('TRAIN AUC :',round((roc_auc_score(y_train.values, train_preds))*100,2), '%')
print('TEST AUC:',round((roc_auc_score(y_test.values, test_preds))*100,2), '%')

The AUCs are around 50%, which means the model cannot differentiate between the two datasets. This is exactly what we want to see in the ideal scenario, because a model trained on one dataset can then be expected to perform on the other. We can also look at the variable importances to see which variables contribute to any differences.

#### Finding out which variables contribute most to the differences between datasets

In [None]:
fi = pd.DataFrame(model.get_score(importance_type='total_gain'), index = range(1)).T.reset_index()
fi.columns = ['variable','total_gain']
fi = fi.sort_values('total_gain', ascending = False)
fi['importance'] = np.sqrt(fi.total_gain)/np.sqrt(fi.total_gain.max()) * 100
fi = fi.reset_index(drop = True)  # assign the result; reset_index does not modify fi in place

f, ax = plt.subplots(1,1,figsize=(30,6))
fi[['variable','importance']].set_index('variable')['importance'].plot(kind='bar', ax=ax, color = '#78e08f', width=0.6)
ax.set_xticklabels(ax.get_xticklabels(), rotation=0, color='#0a3d62', fontfamily='monospace', fontsize=12)
ax.set_xlabel('')
ax.set_yticks([])
for s in ['top','right','bottom','left']:
    ax.spines[s].set_visible(False)
ax.text(-0.35,120,'VARIABLE IMPORTANCES', fontsize=20, fontweight='bold', color='#0a3d62', fontfamily='monospace')
ax.text(-0.35,113,f'Top 5 variables {fi.head().variable.values.tolist()} contribute most to the differences between train and test', fontsize=15, color='#0a3d62', fontfamily='monospace')
for bar in ax.patches:
    ax.annotate(
        format(bar.get_height(), '.2f'), 
        (bar.get_x() + bar.get_width() / 2, bar.get_height()), 
        ha='center', va='bottom',
        size=12, xytext=(0, 8), color = '#0a3d62',
        textcoords='offset points'
    )
plt.show()

## Conclusion

These 5 methods can be used to establish whether the train and OOT (out-of-time)/OOS (out-of-sample) datasets are similar, and hence whether the model is likely to generalize. If they are not similar, these methods show which variables contribute to the deviations, both visually and quantitatively.

## References

* [KS Statistic from scipy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html)
* [One way ANOVA and Tukey's Test](https://manchev.org/2015/07/01/using-one-way-anova-and-tukeys-test-to-compare-data-sets/)
* [Use Many models to compare](https://cran.r-project.org/web/packages/datarobot/vignettes/ComparingSubsets.html)
* [Seaborn Violin Plots](https://seaborn.pydata.org/generated/seaborn.violinplot.html)
* [How to plot Andrews curves using Pandas in Python?](https://www.geeksforgeeks.org/how-to-plot-andrews-curves-using-pandas-in-python/)
