This quick notebook highlights some differences between the provided train and test sets that you might want to consider when building your model.

In [None]:
!pip install tubesml==0.3.1

In [None]:
import numpy as np 
import pandas as pd

import tubesml as tml

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE

import xgboost as xgb

import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

The data look as follows

In [None]:
df_train = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/train.csv')
df_test = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/test.csv')
df_train.head()

Both the train and test sets contain missing values, and the test set has slightly more of them.

In [None]:
print('Train Set')
print('\n')
_ = tml.list_missing(df_train)
print('_'*40)
print('Test Set')
print('\n')
_ = tml.list_missing(df_test)
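
To read the two missing-value listings side by side, something like the following could help. It is a minimal sketch in plain pandas, not part of TubesML: the helper `missing_share` is hypothetical, and the toy frames only stand in for `df_train` and `df_test` with made-up values.

```python
import numpy as np
import pandas as pd


def missing_share(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Fraction of missing values per shared column, train and test side by side."""
    shared = [c for c in train.columns if c in test.columns]
    out = pd.DataFrame({
        "train": train[shared].isna().mean(),
        "test": test[shared].isna().mean(),
    })
    out["diff"] = out["test"] - out["train"]
    return out.sort_values("diff", ascending=False)


# Toy frames standing in for df_train / df_test (values are made up)
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 35.0, 40.0],
                          "Fare": [7.25, 8.05, 71.3, 13.0]})
toy_test = pd.DataFrame({"Age": [np.nan, np.nan, 30.0, 18.0],
                         "Fare": [np.nan, 8.05, 26.0, 10.5]})

missing_table = missing_share(toy_train, toy_test)
print(missing_table)
```

On the real data the call would be `missing_share(df_train, df_test)`; a positive `diff` marks a column with more missing values in the test set.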

If we then focus on the columns with discrete values (either categorical or ordinal), we see quite a few differences between the train and the test set

In [None]:
def plot_frame(ax):
    ax.set_facecolor('#292525')
    ax.spines['bottom'].set_color('w')
    ax.tick_params(axis='x', colors='w')
    ax.xaxis.label.set_color('w')
    ax.spines['left'].set_color('w')
    ax.tick_params(axis='y', colors='w')
    ax.yaxis.label.set_color('w')
    return ax


fig, ax = plt.subplots(5, 2, figsize=(14, 25), facecolor='#292525', sharey=True)

i=0

for col in ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']:
    df_train[col].value_counts(dropna=False, normalize=True).sort_index().plot(kind='bar', ax=ax[i][0], color='#C3C92E')
    df_test[col].value_counts(dropna=False, normalize=True).sort_index().plot(kind='bar', ax=ax[i][1], color='#C93D2E')
    ax[i][0] = plot_frame(ax[i][0])
    ax[i][1] = plot_frame(ax[i][1])
    ax[i][0].set_title(f'Train set - {col}', fontsize=14, color='w')
    ax[i][1].set_title(f'Test set - {col}', fontsize=14, color='w')
    ax[i][0].set_xticklabels(ax[i][0].get_xticklabels(), rotation=0)
    ax[i][1].set_xticklabels(ax[i][1].get_xticklabels(), rotation=0)
    i += 1

That is, the test set has proportionally:

* more males
* more passengers from the third class
* more passengers travelling alone

All three can be very good predictors of the survival of a passenger, and training a model on data that have a different distribution might lead to suboptimal solutions.
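
The bar charts above can also be turned into numbers. The sketch below puts the normalized value counts of one discrete column next to each other; `compare_proportions` is a hypothetical helper and the toy series are made up, so the figures it prints say nothing about the real data.

```python
import pandas as pd


def compare_proportions(train: pd.Series, test: pd.Series) -> pd.DataFrame:
    """Normalized value counts of one discrete column in both sets."""
    out = pd.concat(
        [
            train.value_counts(normalize=True, dropna=False).rename("train"),
            test.value_counts(normalize=True, dropna=False).rename("test"),
        ],
        axis=1,
    ).fillna(0.0)  # a level absent from one set gets a share of 0
    out["test_minus_train"] = out["test"] - out["train"]
    return out.sort_index()


# Toy stand-ins for df_train['Sex'] and df_test['Sex'] (values are made up)
toy_train_sex = pd.Series(["male", "female", "male", "female"])
toy_test_sex = pd.Series(["male", "male", "male", "female"])

sex_table = compare_proportions(toy_train_sex, toy_test_sex)
print(sex_table)
```

On the real data, `compare_proportions(df_train['Sex'], df_test['Sex'])` would give the size of the shift for that column directly.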

Similarly, the continuous variables show the following differences

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(14, 10), facecolor='#292525', sharey=True)

i=0

for col in ['Age', 'Fare']:
    # `fill` replaces the deprecated `shade` argument (seaborn >= 0.11)
    sns.kdeplot(df_train[col], ax=ax[i][0], fill=True, color='#C3C92E')
    sns.kdeplot(df_test[col], ax=ax[i][1], fill=True, color='#C93D2E')
    ax[i][0] = plot_frame(ax[i][0])
    ax[i][1] = plot_frame(ax[i][1])
    ax[i][0].set_title(f'Train set - {col}', fontsize=14, color='w')
    ax[i][1].set_title(f'Test set - {col}', fontsize=14, color='w')
    ax[i][0].set_xlabel('')
    ax[i][1].set_xlabel('')
    i += 1

Thus we see some differences here too, especially in the Age distribution.
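
A visual comparison of two densities can be backed by a two-sample Kolmogorov-Smirnov test, which measures the largest gap between the two empirical distribution functions. This is a swapped-in technique, not something the notebook uses: the sketch relies on `scipy.stats.ks_2samp`, and the synthetic "Age" arrays are invented for illustration only.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic "Age" samples; means and spreads are invented, not taken from the data
toy_train_age = rng.normal(loc=38, scale=15, size=2000)
toy_test_age = rng.normal(loc=33, scale=15, size=2000)

# Two halves of the same sample: the statistic should be small
same = ks_2samp(toy_train_age[:1000], toy_train_age[1000:])
# Two samples with different means: the statistic should be clearly larger
shifted = ks_2samp(toy_train_age, toy_test_age)

print(f"same distribution:    KS={same.statistic:.3f}, p={same.pvalue:.3f}")
print(f"shifted distribution: KS={shifted.statistic:.3f}, p={shifted.pvalue:.3g}")
```

On the real columns the missing values would have to be dropped first, for example `ks_2samp(df_train['Age'].dropna(), df_test['Age'].dropna())`. With samples this large even small shifts produce tiny p-values, so the statistic itself is the more useful number.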

**Something that the two sets do share, however, is the ticket number.**

If we count the unique ticket numbers in the two sets, we find that about 18000 of them are common to the 2 datasets.

In [None]:
print(f'Unique ticket numbers in the Train set: {len(set(df_train.Ticket))}')
print(f'Unique ticket numbers in the Test set: {len(set(df_test.Ticket))}')
print(f'Ticket numbers that are present in both sets: {len(set(df_train.Ticket) & set(df_test.Ticket))}')

This is interesting, as it might help identify clusters of passengers traveling together.
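
One possible way to use the shared tickets is to count how many passengers hold each ticket across both sets. The sketch below is only an assumption about how such a feature could be built: the column name `ticket_group_size` is hypothetical and the toy tickets are made up.

```python
import pandas as pd

# Toy stand-ins for df_train / df_test; ticket values are made up
toy_train = pd.DataFrame({"PassengerId": [0, 1, 2],
                          "Ticket": ["A 100", "B 200", None]})
toy_test = pd.DataFrame({"PassengerId": [3, 4, 5],
                         "Ticket": ["A 100", "A 100", "C 300"]})

combined = pd.concat([toy_train, toy_test], ignore_index=True)

# value_counts ignores missing tickets, so rows without a ticket map to NaN
ticket_counts = combined["Ticket"].value_counts()
combined["ticket_group_size"] = combined["Ticket"].map(ticket_counts)
print(combined)
```

Counting over the concatenated sets is what lets a ticket seen once in train and twice in test show up as a group of three.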

# Why does it matter?

If we build naive models, we can create nice baselines for more sophisticated ones. Here are some results one would expect when validating these models against the training set

In [None]:
preds_none = [0]*len(df_train)
preds_class = np.where(df_train['Pclass'].isin([1, 2]), 1, 0)
preds_gender = np.where(df_train['Sex'] == 'female', 1, 0)
preds_class_gender = np.where((df_train['Sex'] == 'female') & (df_train['Pclass'].isin([1, 2])), 1, 0)

print(f'Accuracy when predicting nobody survived: {accuracy_score(df_train.Survived, preds_none)}')
print(f'Accuracy when predicting only passengers in the first 2 classes survived: {accuracy_score(df_train.Survived, preds_class)}')
print(f'Accuracy when predicting only female passengers survived: {accuracy_score(df_train.Survived, preds_gender)}')
print(f'Accuracy when predicting only female passengers in the first 2 classes survived: {accuracy_score(df_train.Survived, preds_class_gender)}')

We can then make some submissions out of them

In [None]:
preds_none = [0]*len(df_test)
preds_class = np.where(df_test['Pclass'].isin([1, 2]), 1, 0)
preds_gender = np.where(df_test['Sex'] == 'female', 1, 0)
preds_class_gender = np.where((df_test['Sex'] == 'female') & (df_test['Pclass'].isin([1, 2])), 1, 0)

output = pd.DataFrame({'PassengerId': df_test.PassengerId, 'Survived': preds_none})
output.to_csv('submission_none.csv', index=False)
output = pd.DataFrame({'PassengerId': df_test.PassengerId, 'Survived': preds_class})
output.to_csv('submission_class.csv', index=False)
output = pd.DataFrame({'PassengerId': df_test.PassengerId, 'Survived': preds_gender})
output.to_csv('submission_gender.csv', index=False)
output = pd.DataFrame({'PassengerId': df_test.PassengerId, 'Survived': preds_class_gender})
output.to_csv('submission_class_gender.csv', index=False)

These score on the public LB as follows:

* No survivors: 65%
* Only class: 67%
* Only gender: 78%
* Class and gender: 74%

This is in line with what we expect given the different distributions shown above.
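
The "no survivors" baseline is informative on its own: its accuracy equals the share of non-survivors, so a leaderboard score can be turned back into an approximate survival rate. The sketch below assumes the leaderboard metric is accuracy, as the percentages above suggest, and uses made-up labels chosen to mirror the 65% figure.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 35 survivors out of 100, chosen to mirror the 65% above
y_true = np.array([1] * 35 + [0] * 65)
preds_none = np.zeros_like(y_true)  # predict that nobody survived

acc = accuracy_score(y_true, preds_none)
implied_survival_rate = 1 - acc  # share of rows labelled 1

print(f"accuracy: {acc:.2f}, implied survival rate: {implied_survival_rate:.2f}")
```

Comparing this implied rate with `df_train.Survived.mean()` gives a quick check of how far the target distribution itself moves between the two sets.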

# Adversarial validation

Let's see if a model can tell the two datasets apart. We use the following dataset:

In [None]:
train_feat = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
df_train = df_train[train_feat].copy()
df_test = df_test[train_feat].copy()
df_train['target'] = 1
df_test['target'] = 0

adv = pd.concat([df_train, df_test], ignore_index=True)
y_adv = adv['target']
adv = adv.drop('target', axis=1)
adv.head()

Here we labeled all the entries from the train set with 1 and all the ones from the test set with 0. Let's make use of the convenient methods of [**TubesML**](https://pypi.org/project/tubesml/) to quickly get a model and see if we can correctly classify the 2 sets. The ideal situation is when we can't, hence we hope for low accuracy and an AUC around 0.5.

In [None]:
# Full model pipeline to impute the missing values and prepare the data for the model
num_pipe = Pipeline([('fs', tml.DtypeSel('numeric')), 
                     ('imp', tml.DfImputer(strategy='median'))])
cat_pipe = Pipeline([('fs', tml.DtypeSel('category')), 
                     ('imp', tml.DfImputer(strategy='most_frequent')), 
                     ('dum', tml.Dummify(drop_first=True))])
proc_pipe = tml.FeatureUnionDf(transformer_list=[('num', num_pipe), ('cat', cat_pipe)])
full_pipe = Pipeline([('proc', proc_pipe), 
                      ('model', xgb.XGBClassifier(n_estimators=10000, subsample=0.7, random_state=10, n_jobs=-1))])

kfolds = KFold(n_splits=10, shuffle=True, random_state=345)

# generate an out of fold prediction for the full dataset
# TubesML allows to do so with early stopping and it also returns the feature importance by fold
oof, imp = tml.cv_score(data=adv, target=y_adv,
                             cv=kfolds, estimator=full_pipe, 
                             predict_proba=True, early_stopping=100, eval_metric='auc', imp_coef=True)

print(f'AUC score: {roc_auc_score(y_score=oof, y_true=y_adv)}')
print(f'Accuracy: {accuracy_score(y_pred=oof>0.5, y_true=y_adv)}')

# plot the feature importance with some uncertainty bar
tml.plot_feat_imp(imp)

The model can reasonably find some signal to tell train and test sets apart, which complicates things when setting up a cross-validation strategy we can trust.
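
For readers who want the idea without TubesML or XGBoost, adversarial validation can be sketched with scikit-learn alone. Everything here is an assumption made for illustration: `adversarial_auc` is a hypothetical helper, logistic regression stands in for the boosted model, and the data are synthetic. An AUC near 0.5 means the two sets look alike; a clearly higher one means the classifier found a shift.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_predict


def adversarial_auc(train_X: np.ndarray, test_X: np.ndarray, seed: int = 0) -> float:
    """Out-of-fold AUC of a classifier asked to tell train rows from test rows."""
    X = np.vstack([train_X, test_X])
    y = np.r_[np.ones(len(train_X)), np.zeros(len(test_X))]  # 1 = train, 0 = test
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    oof = cross_val_predict(LogisticRegression(), X, y, cv=cv,
                            method="predict_proba")[:, 1]
    return roc_auc_score(y, oof)


rng = np.random.default_rng(0)
base = rng.normal(size=(2000, 2))
same = rng.normal(size=(2000, 2))              # same distribution as `base`
shifted = rng.normal(loc=0.5, size=(2000, 2))  # mean moved by 0.5 in each feature

auc_same = adversarial_auc(base, same)
auc_shifted = adversarial_auc(base, shifted)
print(f"same: {auc_same:.3f}  shifted: {auc_shifted:.3f}")
```

The out-of-fold predictions matter here: scoring the classifier on rows it was trained on would overstate how separable the two sets are.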

On the other hand, t-SNE does not show anything crazy

In [None]:
df_train['target'] = 1
df_test['target'] = 0
train_feat = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
adv = pd.concat([df_train, df_test], ignore_index=True).sample(100000)  # sampling because life is too short

green = adv['target'] == 1
red = adv['target'] == 0

adv = proc_pipe.fit_transform(adv[train_feat])
tsne = TSNE(n_components=2, init='pca', random_state=51, perplexity=100, learning_rate=500, n_jobs=-1)

y_total = tsne.fit_transform(adv)             
                           
fig, ax = plt.subplots(1, figsize=(15,8), facecolor='#292525')

ax.scatter(y_total[red, 0], y_total[red, 1], c='#C93D2E', alpha=0.5, label='Test')
ax.scatter(y_total[green, 0], y_total[green, 1], c='#C3C92E', alpha=0.2, label='Train')
ax = plot_frame(ax)
ax.legend()
plt.show()