### Adversarial validation

We will check whether a classification model can distinguish the train set from the test set. If it can, we will look at the feature importances to understand how it managed to do it.

The main idea of this technique is very simple:
- Set a binary target for the train/test set (train 1 / test 0, for example)
- Combine train and test into one dataset
- Run any classification model to see whether there is a significant difference between the train and test sets.

If the ROC AUC is close to 0.5 (0.5-0.6), all is good: the model cannot tell the two sets apart, so there are no significant differences between them. It also means that overfitting will most likely not come from differences in feature values.

If the ROC AUC is above 0.6, it is a sign of a "leaky" feature or of different value distributions in the train and test sets, and you should look closer and do some cleaning.
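
The steps above can be sketched on synthetic data before running them on the real competition files. This is only an illustration under stated assumptions: the helper name `adversarial_auc`, the random data, and the artificial shift are all hypothetical, and scikit-learn's `GradientBoostingClassifier` stands in for the LightGBM model used in the rest of this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def adversarial_auc(train_X, test_X, seed=0):
    """Out-of-fold ROC AUC for telling train rows (1) from test rows (0)."""
    # Combine train and test and assign the new binary target
    X = np.vstack([train_X, test_X])
    y = np.concatenate([np.ones(len(train_X)), np.zeros(len(test_X))])
    # Any classifier works here; this one is a stand-in for LightGBM
    model = GradientBoostingClassifier(n_estimators=50, random_state=seed)
    oof = cross_val_predict(model, X, y, cv=5, method='predict_proba')[:, 1]
    return roc_auc_score(y, oof)

rng = np.random.default_rng(0)
train_X = rng.normal(size=(2000, 5))
test_same = rng.normal(size=(2000, 5))   # same distribution as train
test_shifted = test_same.copy()
test_shifted[:, 0] += 2.0                # hypothetical drift in one column

auc_same = adversarial_auc(train_X, test_same)
auc_shifted = adversarial_auc(train_X, test_shifted)
print('same distribution:', round(auc_same, 3))
print('shifted feature  :', round(auc_shifted, 3))
```

The first score should land near 0.5, while the second should be well above the 0.6 threshold, because the shifted column lets the model separate the two sets.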

In [None]:
import numpy as np
import pandas as pd
import os, sys, gc, warnings, random

## Sklearn utils
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

## LGB
import lightgbm as lgb

## Turn off warnings
warnings.filterwarnings('ignore')

## SEEDer
def seed_everything(seed=0):
    random.seed(seed)
    np.random.seed(seed)

In [None]:
########################### Initial Vars
###########################################################
TARGET    = 'target'   # Our Target
SEED      = 42         # Base SEED
N_SPLITS  = 5          # Number of Kfold Splits
PATH      = '../input/tabular-playground-series-mar-2021/'

cat_cols = ['cat'+str(i) for i in range(19)]  # Categorical Columns
cnt_cols = ['cont'+str(i) for i in range(11)] # Continuous Columns

remove_features = ['id',TARGET] # Features that we will not use for training

In [None]:
########################### Data
###########################################################

# Main data
train_df = pd.read_csv(PATH+'train.csv')
test_df  = pd.read_csv(PATH+'test.csv')

# Combine train and test
# and assign new target
train_df[TARGET] = 1
test_df[TARGET]  = 0
all_df = pd.concat([train_df, test_df]).reset_index(drop=True)

del train_df, test_df

In [None]:
########################### Categorical encoding
###########################################################
# For the Adversarial Validation we will not
# do any "fancy" encoding
for col in cat_cols:   
    all_df[col] = all_df[col].astype('category')

In [None]:
########################### Models params and Features
###########################################################
features_columns = [col for col in list(all_df) if col not in remove_features]

lgb_params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'n_estimators': 200,
                'learning_rate': 0.05,
                'num_leaves': 2**7,
                'min_data_in_leaf': 2**8,
                'feature_fraction': 0.7,
                'subsample': 0.7,
                'subsample_freq': 1,
                'early_stopping_rounds': 100,
                'boost_from_average': True,
                'seed': SEED,
                'verbose': -1
            }

In [None]:
########################### LGB Model
###########################################################

# We have enough data to use normal Kfold split
folds = KFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)

# Separate train features and target
X,y   = all_df[features_columns], all_df[TARGET]

# Create column to store predictions
all_df['preds'] = 0.0  # float column, because it will hold probabilities

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X,y)):
        
    print('Fold:',fold_+1)
        
    # Creating lgb train/valid data
    tr_x, tr_y = X.iloc[trn_idx,:], y[trn_idx] 
    vl_x, vl_y = X.iloc[val_idx,:], y[val_idx] 
        
    train_data = lgb.Dataset(tr_x, label=tr_y)
    valid_data = lgb.Dataset(vl_x, label=vl_y)
        
    # Train Model
    seed_everything(SEED)
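    estimator = lgb.train(
                          lgb_params,
                          train_data,
                          valid_sets = [train_data,valid_data],
                          # verbose_eval was removed in LightGBM 4.0;
                          # log the metric every 100 rounds with a callback
                          callbacks = [lgb.log_evaluation(100)],
                        )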
        
    # Store out-of-fold predictions (the index was reset, so labels == positions)
    all_df.loc[val_idx, 'preds'] = estimator.predict(vl_x)

print(roc_auc_score(all_df[TARGET], all_df['preds']))

In [None]:
# 0.5010550371750001 is a good result:
# it seems that you don't have to do anything special
# with train/test values and their distributions

In [None]:
# If the score is greater than 0.6, you may want to look at the
# importance chart to find the "traitor" feature
# (estimator here is the model from the last fold)
lgb.plot_importance(estimator, figsize=(20,20))
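
The "find the traitor, then clean" loop can also be done without a chart. The sketch below is a hedged illustration on synthetic data: the feature names, the drift in `f2`, and the helper `oof_auc` are hypothetical, and scikit-learn's `GradientBoostingClassifier` with its `feature_importances_` attribute stands in for the LightGBM estimator above. It picks the most important feature, drops it, and checks the score again.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
names = ['f0', 'f1', 'f2', 'f3']          # hypothetical feature names
train_X = rng.normal(size=(1500, 4))
test_X = rng.normal(size=(1500, 4))
test_X[:, 2] += 2.0                       # hypothetical drift in 'f2'

X = np.vstack([train_X, test_X])
y = np.concatenate([np.ones(len(train_X)), np.zeros(len(test_X))])

def oof_auc(X, y):
    """Out-of-fold ROC AUC of the train-vs-test classifier."""
    model = GradientBoostingClassifier(n_estimators=50, random_state=0)
    oof = cross_val_predict(model, X, y, cv=5, method='predict_proba')[:, 1]
    return roc_auc_score(y, oof)

auc_before = oof_auc(X, y)

# Rank features by importance and take the top one as the suspect
model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
traitor = names[int(np.argmax(model.feature_importances_))]

# Drop the suspect and validate again
keep = [i for i, name in enumerate(names) if name != traitor]
auc_after = oof_auc(X[:, keep], y)

print('suspect feature:', traitor)
print('AUC before / after drop:', round(auc_before, 3), round(auc_after, 3))
```

On real data more than one feature may be responsible, so the drop-and-recheck step may need to be repeated until the score falls back towards 0.5.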

In [None]:
########################### Fast example from other competition
###########################################################
TARGET    = 'target'   # Our Target
SEED      = 42         # Base SEED
N_SPLITS  = 2          # Number of Kfold Splits
PATH      = '../input/ieee-fraud-detection/'
train_df = pd.read_csv(PATH+'train_transaction.csv')
test_df  = pd.read_csv(PATH+'test_transaction.csv')
train_df[TARGET] = 1
test_df[TARGET]  = 0
del train_df['isFraud']

all_df = pd.concat([train_df, test_df]).reset_index(drop=True)
del all_df['TransactionDT'], all_df['TransactionID'] # obvious "leakers"

del train_df, test_df

for col in list(all_df):
    if all_df[col].dtype=='O':
        all_df[col] = all_df[col].astype('category')

In [None]:
########################### LGB Model
###########################################################
features_columns = [col for col in list(all_df) if col not in [TARGET]]
folds = KFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)
all_df['preds'] = 0.0  # float column, because it will hold probabilities
for fold_, (trn_idx, val_idx) in enumerate(folds.split(all_df[TARGET],all_df[TARGET])):
    print('Fold:',fold_+1)
    train_data = lgb.Dataset(all_df[features_columns].iloc[trn_idx,:], 
                             label=all_df[TARGET][trn_idx])
    valid_data = lgb.Dataset(all_df[features_columns].iloc[val_idx,:], 
                             label=all_df[TARGET][val_idx])
    # verbose_eval was removed in LightGBM 4.0; use the logging callback instead
    estimator = lgb.train(lgb_params,train_data,
                          valid_sets = [train_data,valid_data],
                          callbacks = [lgb.log_evaluation(100)])
    all_df.loc[val_idx, 'preds'] = estimator.predict(all_df[features_columns].iloc[val_idx,:])
    break # we will run only 1 fold - just for a fast check
    
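# Only one fold was run, so score only the rows that were actually predicted
print(roc_auc_score(all_df[TARGET].iloc[val_idx], all_df['preds'].iloc[val_idx]))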


In [None]:
# A score of 0.88+ is far too high.
# It means that something is wrong with the values
# or distributions in the train/test sets
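
Once a feature is under suspicion, its train and test values can be compared directly. The sketch below is not part of the original workflow: it swaps in a two-sample Kolmogorov-Smirnov test (`scipy.stats.ks_2samp`) as one possible complementary check, and the synthetic columns and the 0.5 shift are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
train_col = rng.normal(size=2000)               # feature values in train
test_same = rng.normal(size=2000)               # test values, same distribution
test_shifted = rng.normal(loc=0.5, size=2000)   # test values with a hypothetical shift

# The KS statistic is the largest gap between the two empirical CDFs
stat_same, p_same = ks_2samp(train_col, test_same)
stat_shifted, p_shifted = ks_2samp(train_col, test_shifted)

print('same   : statistic', round(stat_same, 3))
print('shifted: statistic', round(stat_shifted, 3))
```

A clearly larger statistic (and a tiny p-value) for one column suggests that its distribution differs between the sets. With very large datasets even small gaps become statistically significant, so the statistic itself is usually the more useful number.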

In [None]:
lgb.plot_importance(estimator, figsize=(20,20))