One of the main issues that makes Kaggle (and, for that matter, any other) predictive modeling tricky is the discrepancy between the training and test datasets. One of the more valuable tools for gauging the magnitude of these differences is adversarial validation. With adversarial validation we build an auxiliary model that predicts whether a given data point belongs to the train or the test set. If such a model can make predictions with a high degree of confidence, that usually means the train and test sets are significantly different, and we need to be careful to build a model that takes that into account.
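
To make the idea concrete before turning to the GPU version, here is a minimal CPU sketch on synthetic data. It is an illustration only, not the pipeline used in this notebook: the helper `adversarial_auc`, the choice of logistic regression, and the synthetic Gaussian features are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def adversarial_auc(train_X, test_X):
    """Cross-validated AUC of a classifier that separates train rows from test rows."""
    X = np.vstack([train_X, test_X])
    # Label 1 for rows that come from the train set, 0 for rows from the test set.
    y = np.concatenate([np.ones(len(train_X)), np.zeros(len(test_X))])
    clf = LogisticRegression(max_iter=1000)
    # Out-of-fold probabilities, so the AUC is not inflated by overfitting.
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)


rng = np.random.default_rng(0)
same_a = rng.normal(size=(1000, 5))
same_b = rng.normal(size=(1000, 5))   # drawn from the same distribution as same_a
shifted = rng.normal(size=(1000, 5))
shifted[:, 0] += 2.0                  # shift a single feature

auc_same = adversarial_auc(same_a, same_b)
auc_shifted = adversarial_auc(same_a, shifted)
print(f"same distribution:    AUC = {auc_same:.3f}")
print(f"shifted distribution: AUC = {auc_shifted:.3f}")
```

When both samples come from the same distribution the AUC should land near 0.5, while shifting even one feature should push it well above that.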

We will make this adversarial validation notebook with the Rapids library. [Rapids](https://rapids.ai) is an open-source GPU-accelerated data science and machine learning library, developed and maintained by [Nvidia](https://www.nvidia.com). It is designed to be compatible with many existing CPU tools, such as Pandas, scikit-learn, numpy, etc. It enables **massive** acceleration of many data science and machine learning tasks, oftentimes by a factor of 100X, or even more.

Rapids is still undergoing development, and only recently has it become possible to use Rapids natively in the Kaggle Docker environment. If you are interested in installing and running Rapids locally on your own machine, then you should [refer to the following instructions](https://rapids.ai/start.html).

For the modeling part we'll use the latest version of XGBoost. Starting with version 1.3, XGBoost supports fast GPU-accelerated calculation of Shapley values, and we'll use these "SHAP" values to estimate feature importances.
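
For readers new to Shapley values, the toy sketch below computes them exactly by brute force. It is not how XGBoost computes SHAP values (XGBoost uses a tree-specific algorithm); the function `shapley_values`, the model `toy_model`, and the fixed all-zero baseline are hypothetical choices made only to show what the numbers mean.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x`, relative to `baseline`.

    Averages each feature's marginal contribution over every ordering in which
    features can be switched from their baseline value to their value in x.
    The cost grows factorially with the number of features: toy cases only.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for j in order:
            current[j] = x[j]                  # switch feature j on
            output = model(current)
            phi[j] += output - previous_output  # marginal contribution of j
            previous_output = output
    return [value / len(orderings) for value in phi]


def toy_model(v):
    # Hypothetical model: one additive term plus one interaction term.
    return 2.0 * v[0] + v[1] * v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The first feature receives its full additive effect of 2.0, the two interacting features split their joint effect of 6.0 equally, and the contributions sum to the difference between the prediction at `x` and at the baseline.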

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import cupy as cp # linear algebra
import cudf # data processing, CSV file I/O (e.g. cudf.read_csv)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from cuml.metrics import roc_auc_score
import shap
import gc
from random import shuffle
from sklearn.preprocessing import LabelEncoder

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import xgboost
xgboost.__version__

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-may-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-may-2021/test.csv')

In [None]:
train.head()

In [None]:
columns = test.columns[1:]
columns

In [None]:
target = np.hstack([np.ones(train.shape[0]), np.zeros(test.shape[0])])

In [None]:
train_test = np.vstack([train[columns].values, test[columns].values])

In [None]:
train_test.shape

In [None]:
index = list(range(train_test.shape[0]))
shuffle(index)

In [None]:
train_test = train_test[index, :]
target = target[index]

In [None]:
# np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype
train_test = train_test.astype(np.float64)

In [None]:
train, test, y_train, y_test = train_test_split(train_test, target, test_size=0.33, random_state=42)

In [None]:
del train_test
gc.collect()
gc.collect()

In [None]:
train = xgboost.DMatrix(train, label=y_train)
val = xgboost.DMatrix(test, label=y_test)

In [None]:
%%time
param = {
    'eta': 0.05,
    'max_depth': 10,
    'subsample': 0.8,
    'colsample_bytree': 0.7,
    'objective': 'reg:logistic',
    'eval_metric': 'auc',
    'tree_method': 'gpu_hist', 
    'predictor': 'gpu_predictor'
}
clf = xgboost.train(param, train, 600)

In [None]:
preds = clf.predict(val)

In [None]:
roc_auc_score(y_test, preds)

The AUC of 0.4975 is very close to 0.5, the value expected when the classifier cannot tell the two samples apart, so for all practical purposes there seems to be no discernible difference between the train and test sets. Nonetheless, let's take a look at which features may differ the most between the two sets.
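
As a reminder of why 0.5 is the reference point, the sketch below computes the AUC directly as the fraction of (positive, negative) pairs that the scores rank correctly. The helper `rank_auc` and the synthetic labels and scores are assumptions made for illustration; the notebook itself uses `roc_auc_score` from cuML.

```python
import numpy as np


def rank_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as one half.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
random_scores = rng.random(2000)                                 # unrelated to the labels
informative_scores = labels + rng.normal(scale=0.5, size=2000)   # correlated with the labels

print(f"random scores:      AUC = {rank_auc(labels, random_scores):.3f}")
print(f"informative scores: AUC = {rank_auc(labels, informative_scores):.3f}")
```

Scores that carry no information about the labels should give an AUC close to 0.5, which is the situation the adversarial model above finds itself in.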

In [None]:
%%time
shap_preds = clf.predict(val, pred_contribs=True) 

In [None]:
shap_preds.shape

In [None]:
shap.summary_plot(shap_preds[:,:-1], pd.DataFrame(test, columns=columns))

In [None]:
shap.summary_plot(shap_preds[:,:-1], pd.DataFrame(test, columns=columns), plot_type="bar")

It seems that features 14 and 15 might have slightly different distributions between the train and test sets.
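
The bar plot ranks features by their mean absolute SHAP value, and the same ranking can be obtained as plain numbers. The sketch below is a hedged illustration: `rank_features_by_shap` is a hypothetical helper, and the random stand-in matrix only mimics the layout returned by `predict(..., pred_contribs=True)`, namely one column per feature followed by a bias column.

```python
import numpy as np


def rank_features_by_shap(contribs, feature_names):
    """Rank features by mean absolute contribution, largest first.

    `contribs` is assumed to have shape (n_rows, n_features + 1), with the
    bias term in the last column.
    """
    contribs = np.asarray(contribs)
    importance = np.abs(contribs[:, :-1]).mean(axis=0)  # drop the bias column
    order = np.argsort(importance)[::-1]
    return [(feature_names[i], float(importance[i])) for i in order]


# Stand-in matrix so that the sketch runs without a trained model.
rng = np.random.default_rng(0)
demo_names = ["feature_0", "feature_1", "feature_2"]
demo_contribs = np.column_stack([
    rng.normal(scale=0.01, size=500),
    rng.normal(scale=0.10, size=500),
    rng.normal(scale=0.03, size=500),
    np.full(500, 0.2),  # bias column
])

for name, value in rank_features_by_shap(demo_contribs, demo_names):
    print(f"{name}: {value:.4f}")
```

In this notebook the equivalent call would be `rank_features_by_shap(shap_preds, list(columns))`.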