One of the main issues that makes Kaggle (and, for that matter, any other) predictive modeling tricky is the discrepancy between the training and test datasets. One of the more valuable tools for gauging the magnitude of these differences is adversarial validation. With adversarial validation we build an auxiliary model that predicts whether a given data point belongs to the train or the test set. If such a model can make these predictions with a high degree of confidence, that usually means the train and test sets are significantly different, and we need to be careful to build a model that takes that into account.
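As a minimal CPU-only sketch of the idea, the snippet below labels train rows 1 and test rows 0, fits a classifier on part of the combined data, and scores it with AUC on the rest. It assumes synthetic Gaussian data and a plain logistic regression fitted by gradient descent, neither of which is what this notebook uses. The names `auc_score` and `adversarial_auc` are made up for illustration.

```python
import numpy as np

def auc_score(y_true, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formula; assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def adversarial_auc(train_X, test_X, seed=0, n_steps=500, lr=0.1):
    """Hypothetical helper: AUC of a train-vs-test classifier on held-out rows."""
    rng = np.random.default_rng(seed)
    X = np.vstack([train_X, test_X])
    y = np.concatenate([np.ones(len(train_X)), np.zeros(len(test_X))])
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # standardize features
    perm = rng.permutation(len(X))                      # shuffle rows and labels together
    X, y = X[perm], y[perm]
    n_fit = int(0.67 * len(X))                          # roughly a 67/33 split
    X_fit, y_fit, X_val, y_val = X[:n_fit], y[:n_fit], X[n_fit:], y[n_fit:]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_steps):                            # plain batch gradient descent
        p = 1.0 / (1.0 + np.exp(-(X_fit @ w + b)))
        grad = p - y_fit
        w -= lr * X_fit.T @ grad / len(y_fit)
        b -= lr * grad.mean()
    return auc_score(y_val, X_val @ w + b)

rng = np.random.default_rng(42)
train_same = rng.normal(size=(1000, 5))
test_same = rng.normal(size=(1000, 5))
test_shifted = test_same.copy()
test_shifted[:, 0] += 2.0                               # shift one feature in the "test" set

auc_same = adversarial_auc(train_same, test_same)
auc_shifted = adversarial_auc(train_same, test_shifted)
print(f"same distribution: {auc_same:.3f}, shifted test set: {auc_shifted:.3f}")
```

An AUC close to 0.5 means the classifier cannot tell the two sets apart. The further it rises above 0.5, the more the sets differ.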

In this notebook we'll build an adversarial model on the FFT features that were first used in this competition in [Giba's notebook](https://www.kaggle.com/titericz/0-309-baseline-logisticregression-using-fft). I've created a stand-alone notebook that extracts those features, and it can be found [here](https://www.kaggle.com/tunguz/giba-s-fft-features-only).

We will build this adversarial validation notebook with the Rapids library. [Rapids](https://rapids.ai) is an open-source, GPU-accelerated data science and machine learning library, developed and maintained by [Nvidia](https://www.nvidia.com). It is designed to be compatible with many existing CPU tools, such as Pandas, scikit-learn, numpy, etc. It enables **massive** acceleration of many data science and machine learning tasks, oftentimes by a factor of 100X or even more.
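To illustrate what this compatibility can look like in practice, here is a small sketch that writes the stacking, labeling, and shuffling steps against an array module passed in as `xp`. It is run here with NumPy so that it works without a GPU. The assumption is that passing `cupy` instead would behave the same way, since CuPy provides `vstack`, `hstack`, `ones`, `zeros`, and `random.permutation` under the same names. The function `build_adversarial_set` is hypothetical and not part of Rapids.

```python
import numpy as np

def build_adversarial_set(xp, train, test, seed=0):
    """Stack train and test rows, label them 1 and 0, and shuffle, using array module `xp`."""
    features = xp.vstack([train, test])
    labels = xp.hstack([xp.ones(train.shape[0]), xp.zeros(test.shape[0])])
    xp.random.seed(seed)                          # seed the module's global generator
    order = xp.random.permutation(features.shape[0])
    return features[order], labels[order]         # same permutation keeps rows and labels aligned

# Tiny made-up arrays: 3 "train" rows and 2 "test" rows filled with -1
train_demo = np.arange(6, dtype=float).reshape(3, 2)
test_demo = -np.ones((2, 2))

X, y = build_adversarial_set(np, train_demo, test_demo)
print(X.shape, y)
```

Keeping the array module as a parameter is one way to prototype on a CPU and move to the GPU later; it is a design choice for this sketch only.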

Rapids is still undergoing development, and only recently has it become possible to use Rapids natively in the Kaggle Docker environment. If you are interested in installing and running Rapids locally on your own machine, you should [refer to the following instructions](https://rapids.ai/start.html).

For the modeling part we'll use the latest version of XGBoost, which allows for GPU-accelerated calculation of Shapley values. We'll use these "SHAP" values to calculate more reliable feature importances.
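For intuition about what a Shapley value is, the sketch below computes exact values for a tiny made-up "model" by averaging each feature's marginal contribution over every possible ordering of the features. This brute-force approach is only an illustration: it is not how XGBoost computes SHAP values, and it is impractical beyond a handful of features. The names `shapley_values` and `toy_value` are hypothetical.

```python
from itertools import permutations

def shapley_values(value_fn, n_features):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        present = set()
        for feature in order:
            before = value_fn(frozenset(present))
            present.add(feature)
            after = value_fn(frozenset(present))
            totals[feature] += after - before     # marginal gain from adding this feature
    return [total / len(orderings) for total in totals]

def toy_value(present):
    """Made-up model output given the set of features that are 'switched on'."""
    value = 0.0
    if 0 in present:
        value += 2.0                              # feature 0 contributes on its own
    if 1 in present and 2 in present:
        value += 3.0                              # features 1 and 2 only matter together
    return value

phi = shapley_values(toy_value, 3)
print(phi)
```

The values sum to the difference between the output with all features present and the output with none. The interaction term is split evenly between features 1 and 2.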

In [None]:
!pip install --use-feature=2020-resolver https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2Bdda9e1e4879118738d9f9d5094246692c0f6123c-py3-none-manylinux2010_x86_64.whl

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import cupy as cp # linear algebra
import cudf # data processing, CSV file I/O (e.g. cudf.read_csv)
from sklearn.model_selection import train_test_split
from cuml.metrics import roc_auc_score
import shap
import gc
from random import shuffle

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input/giba-s-fft-features-only/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import xgboost
xgboost.__version__

In [None]:
train = cp.load('../input/giba-s-fft-features-only/TRAIN.npy')
test = cp.load("../input/giba-s-fft-features-only/TEST.npy")

In [None]:
train.shape

In [None]:
test.shape

In [None]:
target = cp.hstack([cp.ones(train.shape[0]), cp.zeros(test.shape[0])])

In [None]:
target.shape

In [None]:
train_test = cp.vstack([train, test])

In [None]:
train_test.shape

In [None]:
# Draw a random permutation of the row indices directly on the GPU
index = cp.random.permutation(train_test.shape[0])

In [None]:
train_test = train_test[index, :]
target = target[index]

In [None]:
train_test.shape

In [None]:
train, test, y_train, y_test = train_test_split(train_test, target, test_size=0.33, random_state=42)

In [None]:
del train_test
gc.collect()
gc.collect()

In [None]:
train = xgboost.DMatrix(train, label=y_train)
test = xgboost.DMatrix(test, label=y_test)

In [None]:
%%time
param = {
    'eta': 0.05,
    'max_depth': 10,
    'subsample': 0.8,
    'colsample_bytree': 0.7,
    'objective': 'reg:logistic',
    'eval_metric': 'auc',
    'tree_method': 'gpu_hist', 
    'predictor': 'gpu_predictor'
}
clf = xgboost.train(param, train, 600)

In [None]:
preds = clf.predict(test)

In [None]:
roc_auc_score(y_test, preds)

An AUC of 0.86 is pretty high. Let's see if we can find which FFT components are most responsible for the discrepancy. To do this, we'll calculate SHAP values, which can be done directly on GPUs with version 1.3 of XGBoost.

In [None]:
%%time
shap_preds = clf.predict(test, pred_contribs=True)

In [None]:
shap_preds.shape

In [None]:
shap_preds[:, :1000].shape

In [None]:
shap.initjs()

In [None]:
shap.summary_plot(shap_preds[:,:1000])

In [None]:
shap.summary_plot(shap_preds[:,:1000], plot_type="bar")
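
The bar plot ranks features by their mean absolute SHAP value. The same ranking can be sketched numerically as shown below, assuming a contribution matrix laid out the way XGBoost returns it for `pred_contribs=True`: one column per feature, followed by a final bias column. The matrix `demo` is made up for illustration, and `rank_features_by_shap` is a hypothetical helper.

```python
import numpy as np

def rank_features_by_shap(contribs, top_k=10):
    """Return (feature_index, mean |SHAP|) pairs, largest first, ignoring the bias column."""
    contribs = np.asarray(contribs)
    feature_contribs = contribs[:, :-1]           # assumption: last column is the bias term
    importance = np.abs(feature_contribs).mean(axis=0)
    order = np.argsort(importance)[::-1][:top_k]  # indices sorted by descending importance
    return [(int(i), float(importance[i])) for i in order]

# Made-up contributions: 4 rows, 3 features, plus a constant bias column
demo = np.array([
    [0.5, -0.1, 0.0, 0.2],
    [-0.7, 0.2, 0.0, 0.2],
    [0.6, -0.3, 0.1, 0.2],
    [-0.4, 0.2, -0.1, 0.2],
])

ranking = rank_features_by_shap(demo, top_k=2)
print(ranking)
```

In this notebook the same call would presumably be made as `rank_features_by_shap(shap_preds)`, which would give the indices of the FFT components that best separate train from test.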