# What is Adversarial Validation?
The objective of any predictive modelling project is to create a model using the training data and then apply it to the test data. For the best results, however, the training data must be a representative sample of the data the model will be applied to (*i.e.* the test data); otherwise our model will at best under-perform and at worst be completely useless.

***Adversarial Validation*** is a very clever and very simple way to let us know if our test data and our training data are similar; we combine our `train` and `test` data, labeling them with say a `0` for the training data and a `1` for the test data, mix them up, then see if we are able to correctly re-identify them using a binary classifier.

If we cannot classify them any better than chance, *i.e.* we obtain an area under the [receiver operating characteristic curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) (AUC ROC) of about 0.5, then they are indistinguishable and we are good to go.

However, if we can classify them (AUC well above 0.5) then we have a problem, either with the whole dataset or, more likely, with some features in particular, which probably come from different distributions in the test and train datasets.
If we have a problem, we can look at the feature that was most out of place. For example, it may contain values that were seen only in the training data and not in the test data, or vice versa. If one feature contributes very heavily to the AUC, it may well be a good idea to remove that feature from the model.
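
As a rough illustration of checking one suspicious feature for values that appear in only one of the two datasets, here is a minimal sketch. The helper name `values_seen_in_only_one` and the toy columns are hypothetical and are not part of the notebook below.

```python
import pandas as pd

def values_seen_in_only_one(train_col, test_col):
    """Return the values that occur in only one of the two columns (NaN ignored)."""
    train_values = set(train_col.dropna().unique())
    test_values = set(test_col.dropna().unique())
    return {
        "train_only": sorted(train_values - test_values),
        "test_only": sorted(test_values - train_values),
    }

# hypothetical toy columns standing in for one feature of `train` and `test`
toy_train = pd.Series([0, 1, 2, 2, 5])
toy_test = pd.Series([0, 1, 1, 2, 8])

report = values_seen_in_only_one(toy_train, toy_test)
print(report)
```

This kind of set comparison is most useful for discrete or low-cardinality features; for continuous features nearly every value is unique, so a distributional comparison is more informative.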


## Adversarial Validation to reduce overfitting
The key to avoiding overfitting is to create a situation where the local cross-validation (CV) score is representative of the competition score. When the AUC is close to 0.5, our local data is representative of the test data, so our local CV score should be representative of the public leaderboard (LB) score.

Procedure:

* drop the training data target column 
* label the `test` and `train` data with `0` and `1` (it doesn't really matter which is which)
* combine the training and test data into one big dataset
* perform the binary classification, for example using XGBoost
* look at our AUC ROC score
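
The steps above can be wrapped into a single function. The sketch below is only an illustration on synthetic data: it swaps in scikit-learn's `RandomForestClassifier` for XGBoost to keep it short, the name `adversarial_auc` is hypothetical, and it assumes the features are numeric with no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def adversarial_auc(train_df, test_df, n_splits=5, seed=0):
    """Mean cross-validated AUC for telling train rows (0) from test rows (1).

    Both frames must already have the target column removed.
    """
    # combine the two datasets and build the train/test labels
    X = pd.concat([train_df, test_df], axis=0, ignore_index=True)
    y = np.r_[np.zeros(len(train_df), dtype=int), np.ones(len(test_df), dtype=int)]
    # random forest used here as a stand-in for XGBoost (an assumption of this sketch)
    clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=seed)
    # shuffled, stratified folds take care of mixing the rows
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(clf, X, y, cv=folds, scoring="roc_auc").mean()

# synthetic data: one test set drawn like the training set, one with a shifted feature
rng = np.random.default_rng(0)
same_train = pd.DataFrame({"a": rng.normal(size=500), "b": rng.normal(size=500)})
same_test = pd.DataFrame({"a": rng.normal(size=500), "b": rng.normal(size=500)})
shifted_test = same_test.assign(a=same_test["a"] + 3.0)

auc_same = adversarial_auc(same_train, same_test)
auc_shifted = adversarial_auc(same_train, shifted_test)
print("same distribution: %.2f" % auc_same)
print("shifted feature:   %.2f" % auc_shifted)
```

The first AUC should land near 0.5 and the second close to 1, mirroring the two situations described earlier.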

We shall look at two examples of adversarial validation. Note: For the purposes of these demonstrations we shall only be using the numeric features.
# Titanic
For our first example we shall look at the [Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic) dataset

In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
from xgboost import cv
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 12})

In [None]:
# read in the data
train = pd.read_csv("../input/titanic/train.csv")
test  = pd.read_csv("../input/titanic/test.csv")

# select only the numerical features
X_test  = test.select_dtypes(include=['number']).copy()
X_train = train.select_dtypes(include=['number']).copy()

# drop the target column from the training data
X_train = X_train.drop(['Survived'], axis=1)

# add the train/test labels
X_train["AV_label"] = 0
X_test["AV_label"]  = 1

# make one big dataset
all_data = pd.concat([X_train, X_test], axis=0, ignore_index=True)

# shuffle
all_data_shuffled = all_data.sample(frac=1)

# create our DMatrix (the XGBoost data structure)
X = all_data_shuffled.drop(['AV_label'], axis=1)
y = all_data_shuffled['AV_label']
XGBdata = xgb.DMatrix(data=X,label=y)

# our XGBoost parameters
params = {"objective":"binary:logistic",
          "eval_metric":"logloss",
          'learning_rate': 0.05,
          'max_depth': 5, }

# perform cross validation with XGBoost
cross_val_results = cv(dtrain=XGBdata, params=params, 
                       nfold=5, metrics="auc", 
                       num_boost_round=200,early_stopping_rounds=20,
                       as_pandas=True)

# print out the final result
print((cross_val_results["test-auc-mean"]).tail(1))

We can see an AUC of 1 which indicates that our classifier is able to perfectly distinguish between the original training and test data. Let us look at the most important features:

In [None]:
classifier = XGBClassifier(eval_metric='logloss',use_label_encoder=False)
classifier.fit(X, y)
fig, ax = plt.subplots(figsize=(12,4))
plot_importance(classifier, ax=ax)
plt.show();

This was actually to be expected since we did not drop the `PassengerId` column. Our classifier has learned that the distribution of values of this feature is very different for the `train` and `test` rows. Let us drop the `PassengerId` column and re-calculate the AUC.
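
To see why an identifier column alone can give an AUC of 1, here is a small NumPy sketch. It assumes that the ids run consecutively with the training rows first; the sizes are made up, and the helper `auc_from_scores` is hypothetical. It computes the AUC directly as the fraction of (train, test) pairs in which the test row receives the higher score.

```python
import numpy as np

def auc_from_scores(scores, labels):
    """AUC as the fraction of (train, test) pairs where the test row scores higher.

    Ties count as half a pair.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]  # test rows
    neg = scores[labels == 0]  # train rows
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# hypothetical sizes: ids 1..8 belong to train, ids 9..12 to test
n_train, n_test = 8, 4
ids = np.arange(1, n_train + n_test + 1)
labels = np.r_[np.zeros(n_train, dtype=int), np.ones(n_test, dtype=int)]

auc_id = auc_from_scores(ids, labels)                        # id used directly as the score
auc_constant = auc_from_scores(np.ones(len(labels)), labels)  # an uninformative score
print(auc_id, auc_constant)
```

Every test id is larger than every training id, so a single threshold on the id separates the two sets, whereas a score that carries no information leaves the AUC at 0.5.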

In [None]:
X = X.drop(['PassengerId'], axis=1)

In [None]:
XGBdata = xgb.DMatrix(data=X,label=y)
cross_val_results = cv(dtrain=XGBdata, params=params, 
                       nfold=5, metrics="auc", 
                       num_boost_round=200,early_stopping_rounds=20,
                       as_pandas=True)

print((cross_val_results["test-auc-mean"]).tail(1))

We now have a much more reasonable value, much closer to our ideal value of 0.5.
# House Prices
For our second example we shall use the [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) dataset

In [None]:
train = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
test  = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")
X_test  = test.select_dtypes(include=['number']).copy()
X_train = train.select_dtypes(include=['number']).copy()
# drop the target column from the training data
X_train = X_train.drop(['SalePrice'], axis=1)
# add the train/test labels
X_train["AV_label"] = 0
X_test["AV_label"]  = 1
all_data = pd.concat([X_train, X_test], axis=0, ignore_index=True)
all_data_shuffled = all_data.sample(frac=1)
X = all_data_shuffled.drop(['AV_label'], axis=1)
y = all_data_shuffled['AV_label']
XGBdata = xgb.DMatrix(data=X,label=y)
cross_val_results = cv(dtrain=XGBdata, params=params, 
                       nfold=5, metrics="auc", 
                       num_boost_round=200,early_stopping_rounds=20,
                       as_pandas=True)

print((cross_val_results["test-auc-mean"]).tail(1))

Again we have a value of 1, this time because we did not drop the `Id` column.

In [None]:
classifier = XGBClassifier(eval_metric='logloss',use_label_encoder=False)
classifier.fit(X, y)
fig, ax = plt.subplots(figsize=(12,4))
plot_importance(classifier, ax=ax)
plt.show();

In [None]:
X = X.drop(['Id'], axis=1)

In [None]:
XGBdata = xgb.DMatrix(data=X,label=y)
cross_val_results = cv(dtrain=XGBdata, params=params, 
                       nfold=5, metrics="auc", 
                       num_boost_round=200,early_stopping_rounds=20,
                       as_pandas=True)

print((cross_val_results["test-auc-mean"]).tail(1))

Again we have a much better result, suggesting that the numeric features in `train` and `test` have similar distributions.

## A little more detailed examination
We shall now compare each feature individually. For continuous distributions one can use the two-sample [Kolmogorov-Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test), here using the SciPy [`kstest`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html) function (for categorical data one should use [Pearson's chi-squared test](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test)). We shall calculate the $p$-values under the null hypothesis that the two samples are drawn from the same distribution.

In [None]:
from scipy import stats
features_list = X_test.columns.values.tolist()
for feature in features_list:
    # drop missing values first: NaNs would otherwise distort (or propagate into) the result
    statistic, pvalue = stats.kstest(X_train[feature].dropna(), X_test[feature].dropna())
    print("p-value %.2f" % pvalue, "for the feature", feature)

We can see very small $p$-values for the features `Id` and `AV_label`. This is reassuring, as these are precisely the features that have completely different distributions in the training and the test datasets.
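
For a categorical feature, the chi-squared test mentioned above could be sketched as follows, using SciPy's `chi2_contingency` on a table of category counts per dataset. The category values and counts here are made up for illustration, and the helper `chi2_pvalue` is hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_pvalue(train_col, test_col):
    """p-value of Pearson's chi-squared test that both columns share one category distribution."""
    values = pd.concat([train_col, test_col], ignore_index=True)
    source = ["train"] * len(train_col) + ["test"] * len(test_col)
    # contingency table: rows are datasets, columns are categories
    table = pd.crosstab(pd.Series(source, name="source"), values.rename("value"))
    chi2, pvalue, dof, expected = chi2_contingency(table)
    return pvalue

# made-up categorical columns
cat_train = pd.Series(["S"] * 60 + ["C"] * 30 + ["Q"] * 10)
cat_test_same = pd.Series(["S"] * 30 + ["C"] * 15 + ["Q"] * 5)      # same proportions
cat_test_shifted = pd.Series(["S"] * 10 + ["C"] * 10 + ["Q"] * 30)  # very different mix

p_same = chi2_pvalue(cat_train, cat_test_same)
p_shifted = chi2_pvalue(cat_train, cat_test_shifted)
print("same proportions: p-value %.3f" % p_same)
print("shifted mix:      p-value %.3g" % p_shifted)
```

As with the Kolmogorov-Smirnov test, a very small $p$-value flags a feature whose distribution differs between the two datasets.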
# Related reading
As far as I can tell Adversarial Validation was first described in the following two blog posts:
* ["Adversarial validation, part one"](http://fastml.com/adversarial-validation-part-one/) by  Zygmunt Zając (2016-05-23)
* ["Adversarial validation, part two"](http://fastml.com/adversarial-validation-part-two/) by Zygmunt Zając (2016-06-08)