Since the train and test distributions appear to differ significantly (see e.g. https://www.kaggle.com/nanomathias/distribution-of-test-vs-training-data, https://www.kaggle.com/c/santander-value-prediction-challenge/discussion/59172 and https://www.kaggle.com/c/santander-value-prediction-challenge/discussion/59139), it is probably a good idea to adjust your validation strategy so that your model generalizes well to the leaderboard.

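Adversarial validation is a technique for selecting the training samples that are most similar to the test samples. A classifier is trained to distinguish train rows from test rows; the training rows it cannot confidently separate from test are the most test-like ones. We can then validate our models on these samples, so that validation scores better reflect leaderboard performance. More info about adversarial validation here: http://fastml.com/adversarial-validation-part-one/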
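
As a rough illustration of the idea on toy data (not the competition data), the sketch below labels train rows 1 and test rows 0, fits a classifier, and measures how well it separates them. The function name `adversarial_auc`, the logistic regression model and the synthetic Gaussian data are all assumptions made for this example; the kernel itself uses LightGBM further down.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def adversarial_auc(train_X, test_X, seed=0):
    """Hypothetical helper: AUC of a train-vs-test classifier.

    Returns the out-of-fold AUC and, for each training row, the
    out-of-fold probability that it belongs to train (low = test-like).
    """
    X = np.vstack([train_X, test_X])
    # Same convention as in this kernel: train = 1, test = 0
    y = np.concatenate([np.ones(len(train_X)), np.zeros(len(test_X))])
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    clf = LogisticRegression(max_iter=1000)
    oof = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
    return roc_auc_score(y, oof), oof[: len(train_X)]


# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
toy_train = rng.normal(0.0, 1.0, size=(500, 5))
toy_test_same = rng.normal(0.0, 1.0, size=(500, 5))     # same distribution
toy_test_shifted = rng.normal(1.0, 1.0, size=(500, 5))  # shifted distribution

auc_same, _ = adversarial_auc(toy_train, toy_test_same)
auc_shifted, train_scores = adversarial_auc(toy_train, toy_test_shifted)
print("AUC, same distribution:   ", round(auc_same, 3))
print("AUC, shifted distribution:", round(auc_shifted, 3))
```

An AUC close to 0.5 means the classifier cannot tell train from test, while an AUC close to 1 means the two sets are easy to separate. In the shifted case, the training rows with the lowest `train_scores` are the ones that look most like test.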

In this kernel we test if the distributions between train and test are actually different, and if we can find rows in our training set that are very similar to our test set.

In [None]:
# Load libraries
import numpy as np
import pandas as pd
import gc
import datetime

from sklearn.model_selection import KFold
import lightgbm as lgb

# Params
NFOLD = 5
DATA_PATH = '../input/'

In [None]:
# Load data
train = pd.read_csv(DATA_PATH + "train.csv")
test = pd.read_csv(DATA_PATH + "test.csv")

# Mark train as 1, test as 0
train['target'] = 1
test['target'] = 0

# Concat dataframes
n_train = train.shape[0]
df = pd.concat([train, test], axis = 0)
del train, test
gc.collect()

In [None]:
# Remove columns with only one value in our training set
predictors = list(df.columns.difference(['ID', 'target']))
df_train = df.iloc[:n_train].copy()
cols_to_remove = [c for c in predictors if df_train[c].nunique() == 1]
df.drop(cols_to_remove, axis=1, inplace=True)

# Update column names
predictors = list(df.columns.difference(['ID', 'target']))

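# Get some basic meta features, computed over non-zero values only
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
# Note: `predictors` is not updated afterwards, so the model below does not use these columns
nonzero = df[predictors].replace(0, np.nan)
df['cols_mean'] = nonzero.mean(axis=1)
df['cols_count'] = nonzero.count(axis=1)
df['cols_sum'] = nonzero.sum(axis=1)
df['cols_std'] = nonzero.std(axis=1)
del nonzero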

In [None]:
# Prepare for training

# Shuffle dataset
df = df.iloc[np.random.permutation(len(df))]
df.reset_index(drop = True, inplace = True)

# Get target column name
target = 'target'

# lgb params
lgb_params = {
        'boosting': 'gbdt',
        'application': 'binary',
        'metric': 'auc', 
        'learning_rate': 0.1,
        'num_leaves': 32,
        'max_depth': 8,
        'bagging_fraction': 0.7,
        'bagging_freq': 5,
        'feature_fraction': 0.7,
}

# Get folds for k-fold CV
folds = KFold(n_splits = NFOLD, shuffle = True, random_state = 0)
fold = folds.split(df)
    
eval_score = 0
n_estimators = 0
eval_preds = np.zeros(df.shape[0])

In [None]:
# Run LightGBM for each fold
for i, (train_index, test_index) in enumerate(fold):
    print( "\n[{}] Fold {} of {}".format(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), i+1, NFOLD))
    train_X, valid_X = df[predictors].values[train_index], df[predictors].values[test_index]
    train_y, valid_y = df[target].values[train_index], df[target].values[test_index]

    dtrain = lgb.Dataset(train_X, label = train_y,
                          feature_name = list(predictors)
                          )
    dvalid = lgb.Dataset(valid_X, label = valid_y,
                          feature_name = list(predictors)
                          )
        
    eval_results = {}
    
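    # LightGBM >= 4.0 removed the evals_result, early_stopping_rounds and
    # verbose_eval arguments of lgb.train; the callbacks below replace them
    bst = lgb.train(lgb_params, 
                         dtrain, 
                         valid_sets = [dtrain, dvalid], 
                         valid_names = ['train', 'valid'], 
                         num_boost_round = 5000,
                         callbacks = [
                             lgb.early_stopping(stopping_rounds = 100),
                             lgb.log_evaluation(period = 100),
                             lgb.record_evaluation(eval_results),
                         ])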
    
    print("\nRounds:", bst.best_iteration)
    print("AUC: ", eval_results['valid']['auc'][bst.best_iteration-1])

    n_estimators += bst.best_iteration
    eval_score += eval_results['valid']['auc'][bst.best_iteration-1]
   
    eval_preds[test_index] += bst.predict(valid_X, num_iteration = bst.best_iteration)
    
n_estimators = int(round(n_estimators/NFOLD,0))
eval_score = round(eval_score/NFOLD,6)

print("\nModel Report")
print("Rounds: ", n_estimators)
print("AUC: ", eval_score)    

In [None]:
# Feature importance (bst is the model trained on the last fold only)
lgb.plot_importance(bst, max_num_features = 20)

In [None]:
# Get training rows that are most similar to test
df_av = df[['ID', 'target']].copy()
df_av['preds'] = eval_preds
df_av_train = df_av[df_av.target == 1]
df_av_train = df_av_train.sort_values(by=['preds']).reset_index(drop=True)

# Check distribution
df_av_train.preds.plot()

# Store to feather
df_av_train[['ID', 'preds']].reset_index(drop=True).to_feather('adversarial_validation.ft')

In [None]:
# Check first 20 rows
df_av_train.head(20)

It seems our model's AUC is around 0.94, which means train and test are highly separable. It also seems that there are around 1000 rows in our training set (about 20% of the total) with a prediction smaller than 0.1, i.e. rows that the classifier considers far more likely to come from test than from train, so their distribution is similar to that of the test set. We can use these rows for validating our model.
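
One possible way to turn these scores into a validation split is sketched below. The helper `split_by_adversarial_score` and the toy IDs and scores are hypothetical; the 0.1 threshold is the one mentioned above, and the `ID` and `preds` column names match the frame stored to `adversarial_validation.ft`.

```python
import pandas as pd


def split_by_adversarial_score(scores, threshold=0.1):
    """Hypothetical helper: split training IDs by adversarial score.

    `scores` needs an 'ID' column and a 'preds' column holding the
    predicted probability that a row belongs to train. Rows below the
    threshold look like test and go to the validation set.
    """
    is_test_like = scores["preds"] < threshold
    valid_ids = scores.loc[is_test_like, "ID"].tolist()
    train_ids = scores.loc[~is_test_like, "ID"].tolist()
    return valid_ids, train_ids


# Toy stand-in for df_av_train[['ID', 'preds']] (made-up values)
toy_scores = pd.DataFrame({
    "ID": ["a", "b", "c", "d", "e"],
    "preds": [0.02, 0.95, 0.08, 0.60, 0.99],
})

valid_ids, train_ids = split_by_adversarial_score(toy_scores)
print(valid_ids)  # ['a', 'c']
print(train_ids)  # ['b', 'd', 'e']
```

A fixed threshold is only one choice. Since the stored frame is sorted by `preds`, taking its first N rows would give a validation set of a chosen size instead.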