# Adversarial Train/Test Similarity - TPS Nov21

In this notebook we:
* visualize a few key features,
* examine their relation to the target,
* and build an **Adversarial Scorecard Model** to distinguish between the train and test sets.
    * *Scorecard = Discretized Severity Levels + Logit Link Function*
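
To make the scorecard definition concrete, here is a minimal sketch on made-up data (the toy feature, the label rule, and the bin count are all assumptions for illustration, not the competition data). Each feature is cut into bins, the bins are one-hot encoded, and a logistic regression assigns one coefficient ("points") per bin.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)

# Hypothetical toy data: one feature; the label is more likely 1 as x grows.
x = rng.normal(size=(2000, 1))
label = (rng.random(2000) < 1 / (1 + np.exp(-2 * x[:, 0]))).astype(int)

# Step 1: discretize the feature into quantile bins (one-hot encoded).
# Step 2: fit a logistic regression (logit link) on the bin indicators.
binner = KBinsDiscretizer(n_bins=5, encode="onehot", strategy="quantile")
scorecard = make_pipeline(binner, LogisticRegression())
scorecard.fit(x, label)

# One coefficient per bin: these act as the "points" of the scorecard.
points = scorecard[-1].coef_.ravel()
for i, p in enumerate(points):
    print(f"bin {i}: {p:+.2f}")
```

Because the model is linear in the bin indicators, the score of a row is just the sum of the points of the bins it falls into, passed through the logistic function.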

## Import Packages

In [None]:
import pandas as pd
import numpy as np
import datatable as dt
import optuna

import gc; gc.enable()

import warnings
warnings.filterwarnings('ignore')

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns; sns.set()
%matplotlib inline

## Down-Casting

In [None]:
def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} MB ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df
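
One caveat with the function above: it casts to `float16` whenever the value range fits, and `float16` keeps only about three significant decimal digits. As a hedged alternative, the sketch below uses `pd.to_numeric` with its `downcast` argument, which never goes below `float32`. The toy frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the competition data.
toy = pd.DataFrame({
    "small_int": np.array([0, 1, 2], dtype=np.int64),
    "big_int": np.array([0, 70_000, -70_000], dtype=np.int64),
    "real": np.array([0.1, 0.2, 0.3], dtype=np.float64),
})

before = toy.memory_usage(deep=True).sum()

for col in toy.columns:
    if pd.api.types.is_integer_dtype(toy[col]):
        # Picks the smallest signed integer dtype that holds every value.
        toy[col] = pd.to_numeric(toy[col], downcast="integer")
    elif pd.api.types.is_float_dtype(toy[col]):
        # pandas stops at float32 here, so there is no float16 precision loss.
        toy[col] = pd.to_numeric(toy[col], downcast="float")

after = toy.memory_usage(deep=True).sum()
print(toy.dtypes)
print(f"{before} bytes -> {after} bytes")
```

This trades a little of the memory saving for safer floats, which matters when the values are later binned or compared.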

## Data Prep

In [None]:
%%time
PATH = '../input/tabular-playground-series-nov-2021/train.csv'
train = dt.fread(PATH).to_pandas().drop('id', axis=1)
train = reduce_memory_usage(train)

PATH = '../input/tabular-playground-series-nov-2021/test.csv'
test = dt.fread(PATH).to_pandas().drop('id', axis=1)
test = reduce_memory_usage(test)

In [None]:
train.head()

In [None]:
train.info()

In [None]:
# Convert any columns that were read in as bool to integers.
# Assigning by column name replaces the column, so the dtype really changes.
for df in (train, test):
    for col in df.select_dtypes(include='bool').columns:
        df[col] = df[col].astype(int)

In [None]:
target = 'target'
X = train.drop(target, axis=1).copy()
y = train[target].copy()

del train; gc.collect()

## Visualizations

In [None]:
SIZE = (13,5)

for c in ['f34', 'f55', 'f91', 'f43', 'f8', 'f27', 'f50', 'f71']:
    plt.figure(figsize=SIZE)
    sns.histplot(X[c], alpha=0.5, label='train')
    sns.histplot(test[c], color='red', alpha=0.5, label='test')
    plt.title(f'{c} - Distributions')
    plt.legend()
    plt.show()

In [None]:
for c in ['f34', 'f55', 'f91', 'f43', 'f8', 'f27', 'f50', 'f71']:
    plt.figure(figsize=SIZE)
    sns.histplot(X.loc[y==0, c], alpha=0.5, label='class 0')
    sns.histplot(X.loc[y==1, c], color='orange', alpha=0.5, label='class 1')
    plt.title(f'{c} - Class Distributions')
    plt.legend()
    plt.show()

## Adversarial Scorecard

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import KBinsDiscretizer

In [None]:
target = 'train/test'

X[target] = 0
test[target] = 1

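# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
X = pd.concat([X, test], ignore_index=True)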
y = X[target]
del X[target]; del test; gc.collect()

X.sample(5)

In [None]:
clf = LogisticRegression(class_weight='balanced', n_jobs=-1, random_state=42)
binner = KBinsDiscretizer(20)
pipe = make_pipeline(binner, clf)

scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
scores.mean(), scores.std()
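
To read the scores above: an ROC AUC near 0.5 means the model cannot tell train rows from test rows, while a value well above 0.5 signals a distribution shift. The sketch below illustrates this on synthetic data; the helper `adversarial_auc`, the sample sizes, and the size of the shift are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(42)

def adversarial_auc(a, b):
    """Mean CV ROC AUC of a scorecard trained to tell rows of `a` from rows of `b`."""
    data = np.vstack([a, b])
    origin = np.r_[np.zeros(len(a)), np.ones(len(b))]  # 0 = from a, 1 = from b
    model = make_pipeline(
        KBinsDiscretizer(n_bins=10),
        LogisticRegression(max_iter=1000),
    )
    return cross_val_score(model, data, origin, cv=5, scoring="roc_auc").mean()

# Both samples drawn from the same distribution: nothing to separate.
same = adversarial_auc(rng.normal(size=(2000, 3)), rng.normal(size=(2000, 3)))

# Second sample shifted by one standard deviation in every feature.
shifted = adversarial_auc(
    rng.normal(size=(2000, 3)), rng.normal(loc=1.0, size=(2000, 3))
)

print(f"same distribution:    AUC = {same:.3f}")
print(f"shifted distribution: AUC = {shifted:.3f}")
```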

## Conclusion

There doesn't seem to be much of a difference between the train and test sets: the adversarial scorecard cannot reliably tell them apart (an ROC AUC near 0.5 is chance level).

*Whew!* One less thing to worry about. I was afraid there might be some drift in a few variables, since the tree-based approaches seem to be under-performing compared to the linear methods.

**Q:** So what could be the reason for that?

Thoughts?