# Adversarial Validation on MoA Dataset

In this notebook, we'll look at how similar the train and test datasets are, using a technique called [Adversarial Validation](http://fastml.com/adversarial-validation-part-one/).

We're going to assign new labels to the data, TRAIN (1) or TEST (0), then train a classifier that tries to predict whether an example comes from the train or the test dataset. We hope the classifier performs no better than random, which corresponds to a ROC AUC of 0.5, as Zygmunt explains in the post linked above.
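
To make the 0.5 baseline concrete, here is a minimal sketch (not part of the original notebook) that computes ROC AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The helper name `pairwise_auc` and the toy scores are hypothetical and only serve as illustration.

```python
import random


def pairwise_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
labels = [1] * 500 + [0] * 500

# Scores that ignore the label: the classifier is guessing.
random_scores = [rng.random() for _ in labels]
# Scores that equal the label: the classifier separates the classes perfectly.
perfect_scores = [float(l) for l in labels]

auc_random = pairwise_auc(labels, random_scores)
auc_perfect = pairwise_auc(labels, perfect_scores)
print(f"guessing: {auc_random:.3f}, perfect: {auc_perfect:.3f}")
```

A guessing classifier lands near 0.5, and a perfectly separating one reaches 1.0. In adversarial validation, the closer we are to the first case, the more alike train and test look.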

In [None]:
import numpy as np
import pandas as pd

import lightgbm as lgb
from sklearn import model_selection

## Reading the datasets

Let's start by reading the data, assigning the new label, and concatenating the two frames into a single dataset.

In [None]:
train_features = pd.read_csv('/kaggle/input/lish-moa/train_features.csv')
test_features = pd.read_csv('/kaggle/input/lish-moa/test_features.csv')

In [None]:
train_features['TARGET'] = 1
test_features['TARGET'] = 0

In [None]:
data = pd.concat([train_features, test_features])
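
As a small illustration of the labeling step, the sketch below applies the same idea to two toy frames. The frame and column names (`toy_train`, `toy_test`, `feat_a`, `feat_b`) are made up for this example. Note that `pd.concat` keeps each frame's original index by default, so row labels can repeat; passing `ignore_index=True`, as done here, is one optional way to get a clean index.

```python
import pandas as pd

# Hypothetical stand-ins for the real train and test feature tables.
toy_train = pd.DataFrame({"feat_a": [0.1, 0.2, 0.3], "feat_b": [1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({"feat_a": [0.4, 0.5], "feat_b": [4.0, 5.0]})

# Label each row with its origin: 1 for train, 0 for test.
toy_train["TARGET"] = 1
toy_test["TARGET"] = 0

# Stack both frames; ignore_index=True rebuilds a unique 0..n-1 index.
toy_data = pd.concat([toy_train, toy_test], ignore_index=True)
print(toy_data)
```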

## Processing the new dataset

Before we continue, take a look at this:

> cp_type indicates samples treated with a compound (cp_vehicle) or with a control perturbation (ctrl_vehicle); control perturbations have no MoAs

Since control perturbations (8% of our dataset) have no MoAs - and we will not be predicting for them - let's remove them.

In [None]:
data['cp_type'].value_counts(normalize=True)

In [None]:
data = data[data['cp_type'] == 'trt_cp'].copy()
data.head()

Let's also remove the sig_id and cp_type features. The TARGET column will be separated out further down the notebook.

In [None]:
data = data.drop(['sig_id', 'cp_type'], axis=1)

Let's change the data type of the cp_dose to 'category' so our model knows how to interpret it.

In [None]:
data['cp_dose'] = data['cp_dose'].astype('category')
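
To see what the conversion does, here is a minimal sketch on a toy column. The dose values `D1` and `D2` are assumed for illustration. A pandas categorical stores each distinct string once and keeps an integer code per row, and LightGBM picks up pandas `category` columns as categorical features by default.

```python
import pandas as pd

# Toy column; the values "D1" and "D2" are an assumption for this example.
toy = pd.DataFrame({"cp_dose": ["D1", "D2", "D1", "D2"]})
toy["cp_dose"] = toy["cp_dose"].astype("category")

dose_categories = toy["cp_dose"].cat.categories.tolist()  # distinct values
dose_codes = toy["cp_dose"].cat.codes.tolist()            # integer code per row
print(toy["cp_dose"].dtype, dose_categories, dose_codes)
```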

## Train and test split

Now let's separate this new dataset into train and test. Don't forget to shuffle!

In [None]:
X_data = data.drop(['TARGET'], axis=1)
y_data = data['TARGET']

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_data, y_data, train_size=0.33, shuffle=True)
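
Because the TRAIN and TEST labels are not balanced, it can help to keep their ratio equal on both sides of the split and to fix the seed so results are repeatable. The sketch below shows one way to do that with the `stratify` and `random_state` arguments of `train_test_split`, on made-up data. It is a variation for illustration, not what the notebook itself runs.

```python
import numpy as np
from sklearn import model_selection

# Toy data: 100 rows, 80 labeled 1 ("train") and 20 labeled 0 ("test").
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 80 + [0] * 20)

X_tr, X_va, y_tr, y_va = model_selection.train_test_split(
    X_toy,
    y_toy,
    train_size=0.33,   # same fraction as in the notebook
    shuffle=True,
    stratify=y_toy,    # keep the 80/20 label ratio in both parts
    random_state=0,    # reproducible shuffle
)
print(len(y_tr), len(y_va), y_tr.mean(), y_va.mean())
```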

## Creating the model

Let's create our classifier and see how it performs.

In [None]:
train = lgb.Dataset(X_train, label=y_train)
test = lgb.Dataset(X_test, label=y_test)

In [None]:
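# 'min_child_samples' is an alias of 'min_data_in_leaf'; only one of them should be set.
# Note: with max_depth=5 a tree has at most 2**5 = 32 leaves, so num_leaves=50 is never reached.
param = {'num_leaves': 50,
         'min_data_in_leaf': 30,
         'objective': 'binary',
         'max_depth': 5,
         'learning_rate': 0.2,
         'boosting': 'gbdt',
         'feature_fraction': 0.9,
         'bagging_freq': 1,
         'bagging_fraction': 0.9,
         'bagging_seed': 44,
         'metric': 'auc',
         'verbosity': -1}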

In [None]:
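num_round = 50
# LightGBM >= 4.0 removed verbose_eval / early_stopping_rounds; these callbacks replace them.
callbacks = [lgb.log_evaluation(period=50), lgb.early_stopping(stopping_rounds=50)]
clf = lgb.train(param, train, num_round, valid_sets=[train, test], callbacks=callbacks)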

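Great, we're fine! The validation AUC was close to 0.50, so our model was not able to distinguish train from test. That suggests the two sets come from similar distributions, and we can expect local validation scores to carry over well to the test set in this competition. :)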
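
As a sanity check on the method itself, the sketch below runs the same procedure on synthetic data in two situations: "train" and "test" drawn from the same distribution, and "test" drawn from a shifted one. To keep it short it swaps in scikit-learn's `LogisticRegression` with cross-validation instead of LightGBM. The helper `adversarial_auc` and all of the data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def adversarial_auc(train_X, test_X, seed=0):
    """Mean cross-validated ROC AUC of a classifier separating train from test."""
    X = np.vstack([train_X, test_X])
    y = np.concatenate([np.ones(len(train_X)), np.zeros(len(test_X))])
    model = LogisticRegression(max_iter=1000)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()


rng = np.random.default_rng(0)
train_like = rng.normal(size=(1000, 5))
same_dist = rng.normal(size=(1000, 5))           # same distribution as train_like
shifted = rng.normal(loc=1.0, size=(1000, 5))    # every feature shifted by 1.0

auc_same = adversarial_auc(train_like, same_dist)
auc_shifted = adversarial_auc(train_like, shifted)
print(f"same distribution: {auc_same:.3f}, shifted: {auc_shifted:.3f}")
```

When the two samples share a distribution, the AUC stays near 0.5, as it did for the MoA data. When they differ, the AUC rises well above it, which is the signal adversarial validation is meant to catch.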