# *Adversarial validation*

A way to check whether your training data differs too much from your test data. It can be performed before training the actual model. Since the method ignores the target variable, there is no need to worry about data leakage. Steps:

- Ignore the value of the variable **y** and create a new binary variable: assign class **0** to the training rows and class **1** to the test rows.
- Concatenate the training and test sets and train a classifier to predict the new variable. You can use cross-validation to compute the model metrics.
- Evaluate a quality metric such as the area under the ROC curve (AUROC).

What is the expected result? Suppose your training and test sets are similar (they come from the same population). The classifier will then score poorly, because it cannot find meaningful differences between the "two classes", and the area under the ROC curve will be close to 0.5.

If, however, your test set is very different from your training set, the classifier will easily separate the two classes, and the area under the ROC curve will be close to 1.0.
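Both scenarios can be checked quickly on synthetic data. The sketch below is only an illustration: the helper `adversarial_auc`, the Gaussian features, and the shift of 3 standard deviations are assumptions made for this example, not part of the dataset used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def adversarial_auc(X_train, X_test, cv=5):
    """Mean cross-validated AUROC of a train-vs-test classifier (hypothetical helper)."""
    X_adv = np.vstack([X_train, X_test])
    # Class 0 for training rows, class 1 for test rows; the real target y is never used.
    y_adv = np.concatenate([
        np.zeros(len(X_train), dtype=int),
        np.ones(len(X_test), dtype=int),
    ])
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    return cross_val_score(clf, X_adv, y_adv, cv=cv, scoring="roc_auc").mean()


rng = np.random.default_rng(0)

# Training features: standard normal, 4 columns.
X_train_demo = rng.normal(loc=0.0, scale=1.0, size=(600, 4))

# Scenario 1: test set drawn from the same distribution as the training set.
X_test_similar = rng.normal(loc=0.0, scale=1.0, size=(200, 4))

# Scenario 2: every test feature shifted by 3 standard deviations.
X_test_shifted = rng.normal(loc=3.0, scale=1.0, size=(200, 4))

auc_similar = adversarial_auc(X_train_demo, X_test_similar)
auc_shifted = adversarial_auc(X_train_demo, X_test_shifted)
print(f"similar: {auc_similar:.2f}, shifted: {auc_shifted:.2f}")
```

The first value should land near 0.5 and the second near 1.0, matching the two cases described above.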

Let's implement this method:

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("Delaney_descriptors.csv", sep=";")
df.head()

| Unnamed: 0 | MaxEStateIndex | MinEStateIndex | MaxAbsEStateIndex | MinAbsEStateIndex | qed | MolWt | HeavyAtomMolWt | ExactMolWt | NumValenceElectrons | NumRadicalElectrons | ... | fr_sulfonamd | fr_sulfone | fr_term_acetylene | fr_tetrazole | fr_thiazole | fr_thiocyan | fr_thiophene | fr_unbrch_alkane | fr_urea | Solubilidade_medida |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.253329 | -1.701605 | 10.253329 | 0.486602 | 0.217518 | 457.432 | 430.216 | 457.158411 | 178 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.77 |
| 1 | 11.724911 | -0.14588 | 11.724911 | 0.14588 | 0.811283 | 201.225 | 190.137 | 201.078979 | 76 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -3.3 |
| 2 | 10.020498 | 0.84509 | 10.020498 | 0.84509 | 0.343706 | 152.237 | 136.109 | 152.120115 | 62 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -2.06 |
| 3 | 2.270278 | 1.301055 | 2.270278 | 1.301055 | 0.291526 | 278.354 | 264.242 | 278.10955 | 102 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -7.87 |
| 4 | 2.041667 | 1.712963 | 2.041667 | 1.712963 | 0.448927 | 84.143 | 80.111 | 84.003371 | 26 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | -1.33 |


In [3]:
# Select a subset of the descriptors as features
X = df[["MolWt", "FractionCSP3", "MolLogP", "NumAromaticRings", "NumHAcceptors",
        "NumHDonors", "NumRotatableBonds", "TPSA"]]
# Target: the last column, Solubilidade_medida (measured solubility)
y = df.iloc[:, -1]

In [4]:
# Separate train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [5]:
# Create the adversarial labels; the real target values are ignored
# training rows = class 0, test rows = class 1
y_train_adv = [0]*(len(y_train))
y_test_adv = [1]*(len(y_test))
y_adv = y_train_adv + y_test_adv
len(y_adv)

1128

In [6]:
# Stack the training and test features, in the same order as the labels in y_adv
X_adv = pd.concat([X_train, X_test])
X_adv.shape

(1128, 8)

In [7]:
# Train a classification model and evaluate it using cross-validation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=0)
scores = cross_val_score(clf, X_adv, y_adv, cv=5, scoring='roc_auc')
scores

array([0.41318277, 0.415395  , 0.45785321, 0.48298817, 0.48837701])

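An area under the ROC curve close to 1.0 indicates a good classifier, while values close to 0.5 indicate a classifier that does no better than random guessing. Here the fold scores range from about 0.41 to 0.49, so the classifier cannot tell the two sets apart: adversarial validation indicates that our test data is not significantly different from our training data. This is expected, since `train_test_split` assigned the rows to the two sets at random.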
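Fold-wise scores depend on how the rows are split. With an integer `cv` and a classifier, `cross_val_score` uses stratified folds without shuffling. One possible variant, sketched here under assumptions, shuffles the folds explicitly and pools the out-of-fold predictions into a single AUROC. The arrays `X_demo` and `y_demo` are synthetic stand-ins for `X_adv` and `y_adv`, and the sizes are chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): 8 features, 300 "train" rows and 100 "test" rows
# drawn from the same distribution.
X_demo = rng.normal(size=(400, 8))
y_demo = np.array([0] * 300 + [1] * 100)

# Shuffled stratified folds keep the 0/1 ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# Each row receives a probability from the model that did not see it during fitting.
proba = cross_val_predict(clf, X_demo, y_demo, cv=cv, method="predict_proba")[:, 1]

# One AUROC computed over all out-of-fold predictions.
pooled_auc = roc_auc_score(y_demo, proba)
print(f"pooled out-of-fold AUROC: {pooled_auc:.2f}")
```

A single pooled value is easier to compare against the 0.5 and 1.0 reference points than five separate fold scores, although fold-wise scores still show how much the estimate varies.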

Source:
- https://articles.bnomial.com/adversarial-validation