# Adversarial Validation - TPS 6

Adversarial validation is a method for checking how much the feature distributions differ between the training and test data. We train a binary classifier on a new target variable that indicates whether a sample belongs to the test data (1) or the training data (0). If the classifier cannot tell the two apart (AUC near 0.5), the distributions are similar.
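
As a minimal sketch of the idea, the snippet below runs the procedure on synthetic Gaussian data instead of the competition files, with scikit-learn's `LogisticRegression` standing in for the classifier. Both choices are assumptions made for illustration, and the helper name `adversarial_auc` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def adversarial_auc(train_X, test_X, n_splits=5, seed=42):
    """Out-of-fold AUC of a classifier that separates train (0) from test (1)."""
    X = np.vstack([train_X, test_X])
    # New target: 0 for training rows, 1 for test rows.
    y = np.concatenate([np.zeros(len(train_X), dtype=int),
                        np.ones(len(test_X), dtype=int)])
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    clf = LogisticRegression(max_iter=1000)
    # Column 1 of predict_proba is the predicted probability of "test".
    p = cross_val_predict(clf, X, y, cv=cv, method='predict_proba')[:, 1]
    return roc_auc_score(y, p)


# Synthetic stand-ins (assumption): one test set drawn from the same
# distribution as train, and one whose feature means are shifted by 1.
rng = np.random.default_rng(0)
train_X = rng.normal(0.0, 1.0, size=(2000, 5))
same_test_X = rng.normal(0.0, 1.0, size=(2000, 5))
shifted_test_X = rng.normal(1.0, 1.0, size=(2000, 5))

auc_same = adversarial_auc(train_X, same_test_X)
auc_shifted = adversarial_auc(train_X, shifted_test_X)
print(f'same distribution AUC:    {auc_same:.3f}')
print(f'shifted distribution AUC: {auc_shifted:.3f}')
```

When the two sets come from the same distribution the AUC should sit near 0.5, while a shift in the features pushes it toward 1.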

From the [EDA](https://www.kaggle.com/subinium/tps-jun-this-is-original-eda-viz) by @subinium, we already saw that the feature distributions of the training and test data are quite similar. Here, let's use adversarial validation to confirm that finding.

In [None]:
import lightgbm as lgb
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
import seaborn
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from warnings import simplefilter

In [None]:
!pip install kaggler

In [None]:
import kaggler
from kaggler.model import AutoLGB
print(kaggler.__version__)

In [None]:
plt.style.use('fivethirtyeight')
pd.set_option('display.max_columns', 100)  # full option name; the bare 'max_columns' is ambiguous in newer pandas
simplefilter('ignore')

In [None]:
data_dir = Path('../input/tabular-playground-series-jun-2021')
train_file = data_dir / 'train.csv'
test_file = data_dir / 'test.csv'
sample_file = data_dir / 'sample_submission.csv'

id_col = 'id'
target_col = 'target'

n_fold = 5
seed = 42

In [None]:
trn = pd.read_csv(train_file, index_col=id_col)
tst = pd.read_csv(test_file, index_col=id_col)
sub = pd.read_csv(sample_file, index_col=id_col)
print(trn.shape, tst.shape, sub.shape)

In [None]:
n_trn = trn.shape[0]
df = pd.concat([trn.drop(target_col, axis=1), tst], axis=0)
print(df.shape)

In [None]:
cv = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)
X = df
y = pd.Series(np.concatenate([np.zeros(n_trn,), np.ones(df.shape[0] - n_trn,)]))
p = np.zeros_like(y, dtype=float)
for i, (i_trn, i_val) in enumerate(cv.split(X, y)):
    # Tune hyperparameters once, on the first fold, and reuse them for every fold.
    if i == 0:
        clf = AutoLGB(objective='binary', metric='auc', random_state=seed, feature_selection=False)
        clf.tune(X.iloc[i_trn], y[i_trn])
        features = clf.features
        params = clf.params
        n_best = clf.n_best
        print(f'{n_best}')
        print(f'{params}')
        print(f'{features}')
    
    trn_data = lgb.Dataset(X.iloc[i_trn], y[i_trn])
    val_data = lgb.Dataset(X.iloc[i_val], y[i_val])
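    # verbose_eval was removed in LightGBM 4.0; log_evaluation (LightGBM >= 3.3) logs the metric every 100 rounds.
    clf = lgb.train(params, trn_data, num_boost_round=n_best, valid_sets=[val_data],
                    callbacks=[lgb.log_evaluation(100)])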
    p[i_val] = clf.predict(X.iloc[i_val])
    print(f'CV #{i + 1} AUC: {roc_auc_score(y[i_val], p[i_val]):.6f}')

In [None]:
print(f'CV AUC: {roc_auc_score(y, p):.6f}')

# Conclusion

The adversarial validation AUC is close to 0.5 (50%), which means the classifier can barely distinguish test rows from training rows. In other words, it confirms that the feature distributions of the training and test data are similar.
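
To make the 0.5 figure concrete: AUC equals the probability that a randomly chosen test row receives a higher score than a randomly chosen training row, with ties counted as one half. Below is a minimal pure-Python sketch of that definition. The function `pairwise_auc` and the toy scores are illustrative and not part of the notebook's pipeline.

```python
import random


def pairwise_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative), ties counting 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


random.seed(0)
labels = [0] * 500 + [1] * 500  # 0 = training row, 1 = test row

# Scores unrelated to the labels: the classifier cannot tell train from test.
random_scores = [random.random() for _ in labels]
auc_random = pairwise_auc(labels, random_scores)

# Scores that rank every test row above every training row.
separating_scores = [0.1] * 500 + [0.9] * 500
auc_separated = pairwise_auc(labels, separating_scores)

print(f'uninformative scores:        {auc_random:.3f}')
print(f'perfectly separating scores: {auc_separated:.3f}')
```

Uninformative scores land near 0.5 and perfectly separating scores give exactly 1, so an AUC close to 0.5 says the model found little that distinguishes the two sets.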

Let's have some fun. :)