### **Introduction**

**Adversarial Validation**: If you study some of the competition-winning solutions on Kaggle, you might notice references to “adversarial validation”. What is adversarial validation (ad-val)?
In short, we build a classifier to try to predict which data rows are from the training set, and which are from the test set. If the two datasets came from the same distribution, this should be impossible. But if there are systematic differences in the feature values of your training and test datasets, then a classifier will be able to successfully learn to distinguish between them. The better a model you can learn to distinguish them, the bigger the problem you have.
But the good news is that you can analyze the learned model to help you diagnose the problem. And once you understand the problem, you can go about fixing it.
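
The idea can be sketched on synthetic data before applying it to the real dataset. This is a minimal illustration, not the code used later in this notebook: the helper name `adversarial_auc`, the random forest classifier, and the Gaussian toy data are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def adversarial_auc(train_X, test_X, seed=0):
    """Out-of-fold AUC of a classifier that tries to tell train rows from test rows."""
    X = np.vstack([train_X, test_X])
    # Adversarial label: 0 = row comes from train, 1 = row comes from test
    y = np.r_[np.zeros(len(train_X)), np.ones(len(test_X))]
    clf = RandomForestClassifier(n_estimators=50, min_samples_leaf=20, random_state=seed)
    oof = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, oof)


rng = np.random.default_rng(0)
toy_train = rng.normal(size=(1000, 3))
toy_test = rng.normal(size=(1000, 3))   # same distribution as toy_train
shifted_test = toy_test.copy()
shifted_test[:, 0] += 1.0               # systematic shift in one feature

auc_same = adversarial_auc(toy_train, toy_test)
auc_shifted = adversarial_auc(toy_train, shifted_test)
print(f"same distribution: {auc_same:.3f}, shifted feature: {auc_shifted:.3f}")
```

With identical distributions the AUC should stay near 0.5, while the shifted copy should be clearly separable.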

![adval](https://www.thetalkingmachines.com/sites/default/files/styles/widescreen_large/public/2020-04/29_big_data_grid.jpg?itok=GoyDY8Rf)
#  
*Image ref.: [Adversarial Validation Overview](https://www.thetalkingmachines.com/article/adversarial-validation-overview-0)*

**LOFO**: Among feature selection methods, LOFO is a bit different. LOFO (Leave One Feature Out) Importance calculates the importances of a set of features based on a metric of choice, for a model of choice, by iteratively removing each feature from the set and evaluating the performance of the model, with a validation scheme of choice, based on the chosen metric.

LOFO first evaluates the performance of the model with all the input features included, then iteratively removes one feature at a time, retrains the model, and evaluates its performance on a validation set. The mean and standard deviation (across the folds) of the importance of each feature is then reported.

If a model is not passed as an argument to LOFO Importance, it will run LightGBM as a default model.
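
The procedure described above can be sketched in a few lines with scikit-learn. This is a simplified illustration and not the `lofo-importance` implementation: the function name `lofo_sketch`, the logistic regression model, and the synthetic features are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score


def lofo_sketch(model, X, y, cv, scoring="roc_auc"):
    """Per-feature drop in CV score when that feature is left out."""
    base = cross_val_score(model, X, y, cv=cv, scoring=scoring)
    rows = []
    for col in X.columns:
        # Retrain and re-score on the same folds without this feature
        reduced = cross_val_score(model, X.drop(columns=col), y, cv=cv, scoring=scoring)
        diff = base - reduced  # positive: the feature helps, negative: it hurts
        rows.append({"feature": col,
                     "importance_mean": diff.mean(),
                     "importance_std": diff.std()})
    result = pd.DataFrame(rows).sort_values("importance_mean", ascending=False)
    return result.reset_index(drop=True)


rng = np.random.default_rng(0)
n = 1000
X_toy = pd.DataFrame({"signal": rng.normal(size=n),
                      "weak": rng.normal(size=n),
                      "noise": rng.normal(size=n)})
y_toy = (2.0 * X_toy["signal"] + 0.5 * X_toy["weak"] + rng.normal(size=n) > 0).astype(int)

cv_toy = KFold(n_splits=3, shuffle=True, random_state=0)
demo_importance = lofo_sketch(LogisticRegression(), X_toy, y_toy, cv_toy)
print(demo_importance)
```

A feature whose mean importance is negative is one whose removal raised the validation score, which is the signal this notebook looks for later.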

See more: https://github.com/aerdem4/lofo-importance

In this notebook we will walk through how to use ad-val and LOFO to get better results from XGBoost.

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
plt.style.use('fivethirtyeight')
from scipy import stats
from scipy.stats import rankdata, norm


from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split, KFold, GroupKFold, GridSearchCV, StratifiedKFold

from sklearn.metrics import roc_auc_score



from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer
import time, os, warnings, random, string, re, gc, sys

import xgboost as xgb
import lightgbm as lgb


if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
    
from IPython.display import display


def set_seed(seed=2021):
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
set_seed()





train = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
test = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_test.csv')
tr_orig = train.copy()
ts_orig= test.copy()
target = train.pop('target')
del train['enrollee_id']
del test['enrollee_id']



In [None]:
cats = [c for c in train.columns if train[c].dtypes =='object']



for c in cats:
    le=LabelEncoder()
    le.fit(list(train[c].astype('str')) + list(test[c].astype('str')))
    train[c] = le.transform(list(train[c].astype(str))) 
    test[c] = le.transform(list(test[c].astype(str))) 
train.head()

In [None]:
sns.catplot(data=train, orient="h", kind="box", height=4.5, aspect=2, palette='Blues')
sns.catplot(data=test, orient="h", kind="box", height=4.5, aspect=2, palette='Reds')

### **XGBoost base model**

In [None]:


xgb_params = {
    
    'objective':'binary:logistic', 
    'max_depth': 5, 
    'learning_rate': 0.01, 
    'booster':'gbtree', 
    'max_leaves': 15, 
    'eval_metric': 'auc', 
    'colsample_bytree': 0.8, 
    'subsample':0.9, 
    'lambda': 2, 
    'alpha': 1, 
    'scale_pos_weight':5
   
}


xgb_scores = []

oof_xgb = np.zeros(len(train))
pred_xgb = np.zeros(len(test))

importances = pd.DataFrame()


folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=4242)

for fold_, (train_ind, val_ind) in enumerate(folds.split(train, target)):
    print('fold : ----------------------------------------', fold_)
    trn_data = xgb.DMatrix(data=train.iloc[train_ind], label=target.iloc[train_ind])
    val_data = xgb.DMatrix(data= train.iloc[val_ind], label=target.iloc[val_ind])
    
       
    xgb_model = xgb.train(xgb_params, trn_data, num_boost_round=1000, evals=[(trn_data, 'train'), (val_data, 'test')], verbose_eval=100, early_stopping_rounds=100)
    # Predict with the trees up to the best early-stopping iteration
    oof_xgb[val_ind] = xgb_model.predict(xgb.DMatrix(train.iloc[val_ind]), iteration_range=(0, xgb_model.best_iteration + 1))
    
    print(roc_auc_score(target.iloc[val_ind], oof_xgb[val_ind]))
    xgb_scores.append(roc_auc_score(target.iloc[val_ind], oof_xgb[val_ind]))
        
    importance_score = xgb_model.get_score(importance_type='gain')
    importance_frame = pd.DataFrame({'Importance': list(importance_score.values()), 'Feature': list(importance_score.keys())})
    importance_frame['fold'] = fold_ +1
    importances = pd.concat([importances, importance_frame], axis=0, sort=False)
    
    # Average the test predictions over the folds
    pred_xgb += xgb_model.predict(xgb.DMatrix(test), iteration_range=(0, xgb_model.best_iteration + 1))/folds.n_splits
 
print('model auc:', np.mean(xgb_scores))

In [None]:
answer = np.load('../input/job-change-dataset-answer/jobchange_test_target_values.npy')
roc_auc_score(answer, pred_xgb)

### Adversarial Validation with lgb

In [None]:
train = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
test = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_test.csv')

target = train['target']

cats = [c for c in train.columns if train[c].dtypes =='object']



for c in cats:
    le=LabelEncoder()
    le.fit(list(train[c].astype('str')) + list(test[c].astype('str')))
    train[c] = le.transform(list(train[c].astype(str))) 
    test[c] = le.transform(list(test[c].astype(str))) 
train.head()

use_cols  = [c for c in train.columns if c not in ['enrollee_id', 'target']]

features = list(train[use_cols].columns)

In [None]:
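# Copy the column subsets so the label column below is added to independent frames
train = train[use_cols].copy()
test = test[use_cols].copy()

# Adversarial label: 0 = row comes from the training set, 1 = row comes from the test set
train['target'] = 0
test['target'] = 1
train_test = pd.concat([train, test], axis=0)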

target = train_test['target'].values

train_test.head()

In [None]:
param = {'num_leaves': 50,
         'min_data_in_leaf': 30, 
         'objective':'binary',
         'max_depth': 5,
         'learning_rate': 0.001,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 17,
         "metric": 'auc',
         "lambda_l1": 0.1,
         "verbosity": -1}


scores = []
oof = np.zeros(len(train_test))
feature_importances_gain = pd.DataFrame()
feature_importances_gain['feature'] = train_test[features].columns

feature_importances_split = pd.DataFrame()
feature_importances_split['feature'] = train_test[features].columns


folds = KFold(n_splits=5, shuffle=True, random_state=15)



for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_test, target)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(train_test.iloc[trn_idx][features], label=target[trn_idx])
    val_data = lgb.Dataset(train_test.iloc[val_idx][features], label=target[val_idx])

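    num_round = 10000
    # Stop once the validation AUC has not improved for 100 rounds; log every 100 rounds
    clf = lgb.train(param, trn_data, num_round, valid_sets=[trn_data, val_data],
                    callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.log_evaluation(period=100)])
    oof[val_idx] = clf.predict(train_test.iloc[val_idx][features], num_iteration=clf.best_iteration)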
    
    
    feature_importances_gain['fold_{}'.format(fold_ + 1)] = clf.feature_importance(importance_type='gain')
    feature_importances_split['fold_{}'.format(fold_ + 1)] = clf.feature_importance(importance_type='split')

**The out-of-fold (OOF) AUC is very close to 0.5, so the two datasets appear statistically very similar. Still, let's see which features contribute most to the train vs. test distribution difference.**
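
The loop above only logs per-fold validation AUC, so the overall OOF score can also be computed explicitly. Below is a small sketch of that check; the helper name `oof_auc` and the toy arrays are made up for illustration and stand in for the notebook's `target` and `oof`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def oof_auc(labels, oof_pred):
    """AUC of out-of-fold predictions against the 0/1 train-vs-test labels."""
    return roc_auc_score(np.asarray(labels), np.asarray(oof_pred))


# Toy stand-ins (made-up values) for the adversarial labels and OOF predictions
toy_labels = [0, 0, 1, 1]
toy_oof = [0.1, 0.4, 0.35, 0.8]

toy_auc = oof_auc(toy_labels, toy_oof)
print(f"toy adversarial AUC: {toy_auc:.2f}")  # prints "toy adversarial AUC: 0.75"
```

In this notebook the equivalent call would be `oof_auc(target, oof)` after the fold loop has finished.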

In [None]:
feature_importances_gain['average'] = feature_importances_gain[['fold_{}'.format(fold + 1) for fold in range(folds.n_splits)]].mean(axis=1)
feature_importances_gain.to_csv('feature_importances.csv')

plt.figure(figsize=(20, 8))
sns.barplot(data=feature_importances_gain.sort_values(by='average', ascending=False).head(100),  x='average', y='feature');
plt.title('TOP feature importance over {} folds average affects train vs test distribution'.format(folds.n_splits));

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(25, 7))
sns.distplot(train.city,bins=50,  fit=norm,kde=True,kde_kws={"shade": True}, norm_hist=True,  color='darkcyan', ax=axs[0])
sns.distplot(train.training_hours,bins=50,  fit=norm,kde=True,kde_kws={"shade": True}, norm_hist=True,  color='darkcyan', ax=axs[1])
axs[0].set_title('Train Vs Test')
sns.distplot(test.city,bins=50,  fit=norm,kde=True,kde_kws={"shade": True}, norm_hist=True,  color='darkred', ax=axs[0])
sns.distplot(test.training_hours,bins=50,  fit=norm,kde=True,kde_kws={"shade": True},norm_hist=True,  color='darkred', ax=axs[1])
axs[1].set_title('Train Vs Test')

In [None]:
train.training_hours.hist(bins=70, color='darkcyan',figsize=(10, 5))
test.training_hours.hist(bins=70, color='darkred',figsize=(10, 5))

In [None]:
train.city.hist(bins=30, color='darkcyan',figsize=(10, 5))
test.city.hist(bins=30, color='darkred',figsize=(10, 5))

### **LOFO Importances**

In [None]:
!pip install lofo-importance

In [None]:

from lofo import LOFOImportance, Dataset, plot_importance
%matplotlib inline

# import data
train_df = tr_orig.copy()
test_df = ts_orig.copy()



In [None]:
train_df.head()

#### LOFO, default algorithm: LightGBM

In [None]:

# shuffle the rows (frac=1 keeps every row)
sample_df = train_df.sample(frac=1, random_state=0)
# use every column except the row id and the target as a candidate feature
feats = [col for col in train_df.columns if col not in ['enrollee_id', 'target']]
# define the validation scheme
cv = KFold(n_splits=3, shuffle=True, random_state=0)

# define the binary target and the features
dataset = Dataset(df=sample_df, target="target", features=feats)

# define the validation scheme and scorer. The default model is LightGBM
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="roc_auc")

# get the mean and standard deviation of the importances in pandas format
importance_df = lofo_imp.get_importance()

# plot the means and standard deviations of the importances
plot_importance(importance_df, figsize=(20, 12))

**Both "city" and "training_hours" come out as harmful. These features could be repaired with feature engineering, statistical techniques, and so on, but the simplest option is to remove them and check the effect on XGBoost performance.**
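
Rather than reading the harmful features off the plot, they could also be selected programmatically. The sketch below assumes an importance frame with `feature` and `importance_mean` columns, which is how the frame returned by `get_importance()` appears to be laid out; the helper `harmful_features`, its `margin` parameter, and the toy values are hypothetical.

```python
import pandas as pd


def harmful_features(importance_df, margin=0.0):
    """Features whose mean importance is below -margin (removing them raised the score)."""
    mask = importance_df["importance_mean"] < -margin
    return importance_df.loc[mask, "feature"].tolist()


# Made-up importances, for illustration only (not results from this dataset)
toy_importance = pd.DataFrame({
    "feature": ["feat_a", "feat_b", "feat_c", "feat_d"],
    "importance_mean": [0.040, 0.002, -0.001, -0.004],
    "importance_std": [0.005, 0.003, 0.002, 0.001],
})

to_drop = harmful_features(toy_importance)
to_drop_strict = harmful_features(toy_importance, margin=0.002)
print(to_drop, to_drop_strict)
```

A non-zero `margin` is one way to avoid dropping features whose negative importance is within fold-to-fold noise.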

In [None]:
train = tr_orig.copy()
test = ts_orig.copy()

target = train.pop('target')
del train['enrollee_id']
del test['enrollee_id']

cats = [c for c in train.columns if train[c].dtypes =='object']



for c in cats:
    le=LabelEncoder()
    le.fit(list(train[c].astype('str')) + list(test[c].astype('str')))
    train[c] = le.transform(list(train[c].astype(str))) 
    test[c] = le.transform(list(test[c].astype(str))) 


for df in [train, test]:
    del df['city']
    del df['training_hours']

In [None]:


xgb_params = {
    
    'objective':'binary:logistic', 
    'max_depth': 5, 
    'learning_rate': 0.01, 
    'booster':'gbtree', 
    'max_leaves': 15, 
    'eval_metric': 'auc', 
    'colsample_bytree': 0.8, 
    'subsample':0.9, 
    'lambda': 2, 
    'alpha': 1, 
    'scale_pos_weight':5
   
}


xgb_scores = []

oof_xgb = np.zeros(len(train))
pred_xgb = np.zeros(len(test))

importances = pd.DataFrame()


folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=4242)

for fold_, (train_ind, val_ind) in enumerate(folds.split(train, target)):
    print('fold : ----------------------------------------', fold_)
    trn_data = xgb.DMatrix(data=train.iloc[train_ind], label=target.iloc[train_ind])
    val_data = xgb.DMatrix(data= train.iloc[val_ind], label=target.iloc[val_ind])
    
       
    xgb_model = xgb.train(xgb_params, trn_data, num_boost_round=1000, evals=[(trn_data, 'train'), (val_data, 'test')], verbose_eval=100, early_stopping_rounds=100)
    # Predict with the trees up to the best early-stopping iteration
    oof_xgb[val_ind] = xgb_model.predict(xgb.DMatrix(train.iloc[val_ind]), iteration_range=(0, xgb_model.best_iteration + 1))
    
    print(roc_auc_score(target.iloc[val_ind], oof_xgb[val_ind]))
    xgb_scores.append(roc_auc_score(target.iloc[val_ind], oof_xgb[val_ind]))
        
    importance_score = xgb_model.get_score(importance_type='gain')
    importance_frame = pd.DataFrame({'Importance': list(importance_score.values()), 'Feature': list(importance_score.keys())})
    importance_frame['fold'] = fold_ +1
    importances = pd.concat([importances, importance_frame], axis=0, sort=False)
    
    # Average the test predictions over the folds
    pred_xgb += xgb_model.predict(xgb.DMatrix(test), iteration_range=(0, xgb_model.best_iteration + 1))/folds.n_splits
 
print('model auc:', np.mean(xgb_scores))

In [None]:

roc_auc_score(answer, pred_xgb)

> #### *CV AUC: 0.8023 → 0.8057*

> #### *Test ROC AUC: 0.7970 → 0.7986*

**So, for our objective function, ROC AUC as the metric, and the features used, "city" and "training_hours" are harmful: dropping them slightly improves both the CV and the test score.**
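
To put the size of the improvement in perspective, the gains can be computed from the four scores reported above. This is plain arithmetic on those numbers; the variable names are chosen for the example.

```python
# Scores reported above: before and after dropping "city" and "training_hours"
cv_before, cv_after = 0.8023, 0.8057
test_before, test_after = 0.7970, 0.7986

cv_gain = round(cv_after - cv_before, 4)
test_gain = round(test_after - test_before, 4)

print(f"CV AUC gain:   {cv_gain:.4f} ({100 * cv_gain / cv_before:.2f}% relative)")
print(f"test AUC gain: {test_gain:.4f} ({100 * test_gain / test_before:.2f}% relative)")
```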

### XGBoost Feature Importances

In [None]:
#importances['gain_log'] = importances['gain']
mean_gain = importances[['Importance', 'Feature']].groupby('Feature').mean()
#importances['mean_score'] = importances['Feature'].map(mean_gain['Importance'])
mean_gain = mean_gain.reset_index()
plt.figure(figsize=(20, 15))
sns.barplot(x='Importance', y='Feature', data=mean_gain.sort_values('Importance', ascending=False), palette='bone')