It is good news for me that Kaggle has started a new beginner-friendly competition, because I'm learning data analysis with my colleagues. Thank you, Kaggle team!!!

In this notebook, I will try loading, visualizing and preprocessing a tabular dataset, and developing a supervised machine learning model to predict which passengers were transported by the anomaly.

- Revisions
  - 11.0: Publish
  - 12.0: Drop `PassengerNumber` from features.
  - 13.0: Voting prediction is done by SVM, LightGBM and Catboost.
  - 15.0:  
    - `CatBoostClassifier` is replaced with `CatboostVoting`.
    - Weak learners that were removed from voting in version 13.0 are back.
    - Only SVM, LightGBM and CatboostVoting are used when stacking.

## Table of contents

- [Load dataset](#Load-dataset)
- [Understand task and evaluation metrics](#Understand-task-and-evaluation-metrics)
- [EDA](#EDA)
- [Clustering](#Clustering)
- [Make models](#Make-models)

In [None]:
from copy import deepcopy
from dataclasses import dataclass
import gc
from typing import Dict, List, Optional, Tuple

from catboost import CatBoostClassifier
import ipywidgets
from lightgbm import LGBMClassifier
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import FastICA, PCA, KernelPCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, RidgeClassifier, RidgeClassifierCV
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC, SVC
from umap import UMAP
from xgboost import XGBClassifier

In [None]:
pd.options.display.max_rows = 50
pd.options.display.max_columns = 100
pd.options.display.float_format = '{:.5f}'.format
%matplotlib inline

## Load dataset

Load files by `pandas.read_csv`.

In [None]:
train = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
train['Transported'] = train['Transported'].astype('int')
test = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
sample_submission = pd.read_csv('/kaggle/input/spaceship-titanic/sample_submission.csv')

See 20 samples of training/test data.

In [None]:
train.sample(20)

In [None]:
test.sample(20)

## Understand task and evaluation metrics

There is a column named `Transported` in train.csv, but not in test.csv. This competition's task is to predict the `Transported` value for each row of test.csv.

Let's see `Transported`'s values.

In [None]:
# `rename_axis` + `reset_index(name=...)` gives the same column names on old and new pandas versions.
train['Transported'].value_counts(dropna=False) \
                    .rename_axis('Transported') \
                    .reset_index(name='Number of rows')

`Transported` in train.csv has only 2 values: `True` (4378 rows) or `False` (4315 rows). So this is a binary classification task.

Your prediction will be evaluated by its accuracy. You can see an explanation of accuracy [here](https://developers.google.com/machine-learning/crash-course/classification/accuracy). For example, let $y$ be the correct `Transported` values of test.csv and $\hat{y}$ be your predictions:

$$
y = (True, False, False, True, False, True) \\
\hat{y} = (True, False, True, True, False, False).
$$

In this case, your accuracy is $0.666\dots$, because there are __6__ predictions and __4__ of them (1st, 2nd, 4th and 5th) are correct; thus $Accuracy = \frac{4}{6} = 0.666\dots$.
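
The arithmetic above can be reproduced in a few lines of plain Python. This is only a sketch of the metric's definition using the six example values; the variable names are mine and not part of the competition API.

```python
# Ground truth and prediction from the example above.
y_true = [True, False, False, True, False, True]
y_pred = [True, False, True, True, False, False]

# Accuracy = (number of positions where both agree) / (number of predictions).
n_correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = n_correct / len(y_true)
print(f'{n_correct} / {len(y_true)} = {accuracy:.3f}')
```

`sklearn.metrics.accuracy_score`, which is used later in this notebook, computes the same ratio.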

## EDA

EDA stands for [Exploratory Data Analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis). I'll do a quick EDA in order to get more familiar with the dataset.

### Data size

In [None]:
print(f'''Data size:
- Training data: Number of rows = {train.shape[0]}, Number of columns = {train.shape[1]}
- Test data    : Number of rows = {test.shape[0]}, Number of columns = {test.shape[1]}''')

### Data types

In [None]:
train.dtypes

In [None]:
test.dtypes

### Missing values

Count the number of missing values in each column.

In [None]:
missing_values = pd.concat([
    train.drop(columns=['Transported']).isnull().sum(), test.isnull().sum()],
    axis=1)
missing_values.columns = ['Number of missing value (train)', 'Number of missing value (test)']
missing_values['% of missing value (train)'] = 100 * missing_values['Number of missing value (train)'] / train.shape[0]
missing_values['% of missing value (test)'] = 100 * missing_values['Number of missing value (test)'] / test.shape[0]
missing_values

Approximately 1.8\~2.5% of the values are missing in every column except `PassengerId`.

### Add some columns

I will split `PassengerId` into a group part and a number part.

In [None]:
# `PassengerGroup`: Passenger's group
# `PassengerNumber`: Unique within `PassengerGroup`
train['PassengerGroup'] = train['PassengerId'].apply(lambda x: str(x[:4]))
train['PassengerNumber'] = train['PassengerId'].apply(lambda x: int(x[-2:]))
test['PassengerGroup'] = test['PassengerId'].apply(lambda x: str(x[:4]))
test['PassengerNumber'] = test['PassengerId'].apply(lambda x: int(x[-2:]))
train[['PassengerId', 'PassengerGroup', 'PassengerNumber']]

Is there any `PassengerGroup` which appears both in "train.csv" and "test.csv"?

In [None]:
set(train['PassengerGroup'].tolist()) & set(test['PassengerGroup'].tolist())

__No__.

Calculate the number of people in each `PassengerGroup`.

In [None]:
# `PassengerGroupSize`: How many people are in each `PassengerGroup`?
train['PassengerGroupSize'] = train.groupby('PassengerGroup')['PassengerNumber'].transform('max')
test['PassengerGroupSize'] = test.groupby('PassengerGroup')['PassengerNumber'].transform('max')
train[['PassengerGroup', 'PassengerNumber', 'PassengerGroupSize']].sort_values(['PassengerGroup', 'PassengerNumber'])
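
The two steps above (splitting the ID, then taking the largest number per group) can be traced on a handful of made-up IDs. This is a minimal sketch in plain Python; the IDs are hypothetical, and it assumes that numbers within a group run consecutively from 1 and that all members of a group are in the same file, which is when the maximum equals the row count.

```python
from collections import Counter

# Hypothetical IDs in the `gggg_pp` format of `PassengerId`.
passenger_ids = ['0001_01', '0002_01', '0003_01', '0003_02', '0004_01']

groups = [pid[:4] for pid in passenger_ids]         # group part, kept as a string
numbers = [int(pid[-2:]) for pid in passenger_ids]  # number within the group

# Group size as the largest number seen in each group,
# which is what `transform('max')` computes per row.
max_number = {}
for g, n in zip(groups, numbers):
    max_number[g] = max(max_number.get(g, 0), n)

# Group size as a plain row count, for comparison.
row_count = Counter(groups)
print(max_number)
print(dict(row_count))
```

If the two dictionaries ever disagreed on the real data, `transform('size')` would be the safer way to count rows.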

Split `Cabin` which represents deck/number/side.

In [None]:
# Split cabin into deck, number and side.
train['CabinDeck'] = train['Cabin'].apply(lambda x: x.split('/')[0] if isinstance(x, str) else x)
train['CabinNumber'] = train['Cabin'].apply(lambda x: int(x.split('/')[1]) if isinstance(x, str) else x)
train['CabinSide'] = train['Cabin'].apply(lambda x: x.split('/')[2] if isinstance(x, str) else x)

test['CabinDeck'] = test['Cabin'].apply(lambda x: x.split('/')[0] if isinstance(x, str) else x)
test['CabinNumber'] = test['Cabin'].apply(lambda x: int(x.split('/')[1]) if isinstance(x, str) else x)
test['CabinSide'] = test['Cabin'].apply(lambda x: x.split('/')[2] if isinstance(x, str) else x)
train[['Cabin', 'CabinDeck', 'CabinNumber', 'CabinSide']]
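
As an alternative sketch, the same split can be written with the vectorized `Series.str.split`, which passes missing values through without an explicit `isinstance` check. The cabin strings below are made up for illustration, and `parts` is a hypothetical name.

```python
import pandas as pd

# Made-up cabin strings in the deck/number/side format, with one missing value.
cabins = pd.Series(['B/0/P', None, 'F/12/S'])

# `expand=True` returns one column per piece; the missing row stays missing.
parts = cabins.str.split('/', expand=True)
parts.columns = ['CabinDeck', 'CabinNumber', 'CabinSide']

# The number part is still text after splitting, so convert it explicitly.
parts['CabinNumber'] = pd.to_numeric(parts['CabinNumber'])
print(parts)
```

Both versions should give the same three columns; the `apply` version above is kept because it is what the rest of the notebook was run with.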

### Cardinality

How many unique values are in each column?

In [None]:
cardinality = pd.concat([train.drop(columns=['Transported']).nunique(), test.nunique()], axis=1)
cardinality.columns = ['Number of unique values (train)', 'Number of unique values (test)']
cardinality

There are low-cardinality columns such as `HomePlanet`, `CryoSleep`, `Destination`, `VIP`, `CabinDeck` and `CabinSide`. Such features should often be treated as categorical features. `PassengerNumber` and `PassengerGroupSize` also have low cardinality, but they are ordinal, so I will treat them as continuous features.

### Continuous features

In [None]:
continuous_features = [
    c for c in train.select_dtypes('number').columns.tolist()
    if c not in ('Transported', 'PassengerNumber')
]
continuous_features

See summary statistics and distributions.

In [None]:
train.describe()

Check statistics aggregated by `Transported`.

In [None]:
train.groupby('Transported')[continuous_features].describe()

It seems `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa` and `VRDeck` are clues by themselves. `Age` doesn't seem to be a strong clue by itself, but there is a possibility that a combination of `Age` and other features is important for predicting `Transported`.

Plot and see histograms of these features.

In [None]:
fig = plt.figure(figsize=(18., 18.))
for i, c in enumerate(continuous_features):
    ax = plt.subplot(3, 3, 1 + i)
    if c in ('Age', 'PassengerNumber', 'PassengerGroupSize'):
        ax = sns.histplot(data=train, x=c, hue='Transported')
        ax.set_title(c)
    else:
        # The range is very long so it should be log scaled for visibility.
        ax = sns.histplot(x=np.log1p(train[c]), hue=train['Transported'])
        ax.set_title(f'{c} (x-axis is log scaled)')
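
To see why `log1p` is used here rather than a plain logarithm, a short sketch with made-up spending amounts may help: `log1p(x)` is `log(1 + x)`, so a spend of 0 (very common in the spending columns) maps to 0 instead of minus infinity, while large values are compressed.

```python
import math

# Made-up spending amounts spanning several orders of magnitude.
spend = [0.0, 9.0, 99.0, 9999.0]

# log1p(x) = log(1 + x): defined at zero and order-preserving.
logged = [math.log1p(v) for v in spend]
for raw, value in zip(spend, logged):
    print(f'{raw:>8.1f} -> {value:.3f}')
```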

See the Pearson correlation coefficients.

In [None]:
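# Restrict to numeric columns: recent pandas versions raise an error on text columns.
train.select_dtypes('number').corr()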

A heatmap is an easy way to visualize the strength of linear correlation.

In [None]:
fig = plt.figure(figsize=(10., 10.))
ax = sns.heatmap(train.select_dtypes('number').corr(), vmin=-1., vmax=1., annot=True)
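
Note that the coefficient measures linear association only. The sketch below computes it by hand on toy data (the function name `pearson` and the data are mine) and shows that an exactly quadratic relationship can still have a coefficient of zero, so a pale cell in the heatmap does not prove that two features are unrelated.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
linear = [2.0 * v + 1.0 for v in xs]       # exactly linear in xs
quadratic = [(v - 3.0) ** 2 for v in xs]   # depends on xs, but symmetrically

r_linear = pearson(xs, linear)
r_quadratic = pearson(xs, quadratic)
print(r_linear, r_quadratic)
```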

Plot scatter plot matrix.

In [None]:
# I prefer seaborn.scatterplot to seaborn.pairplot
fig = plt.figure(figsize=(24.5, 24.5))
pairs = [
    (x, y) for x in continuous_features
    for y in continuous_features
    if x > y
]
for i, (x, y) in enumerate(pairs):
    # Draw each pair in its own cell of a 6 x 6 grid
    ax = plt.subplot(6, 6, 1 + i)
    
    # Some features should be log scaled for visualization
    if x in ('Age', 'PassengerNumber', 'PassengerGroupSize') and y in ('Age', 'PassengerNumber', 'PassengerGroupSize'):
        ax = sns.scatterplot(data=train, x=train[x], y=train[y], hue='Transported')
    elif x in ('Age', 'PassengerNumber', 'PassengerGroupSize'):
        ax = sns.scatterplot(data=train, x=train[x], y=np.log1p(train[y]), hue='Transported')
    elif y in ('Age', 'PassengerNumber', 'PassengerGroupSize'):
        ax = sns.scatterplot(data=train, x=np.log1p(train[x]), y=train[y], hue='Transported')
    else:
        ax = sns.scatterplot(data=train, x=np.log1p(train[x]), y=np.log1p(train[y]), hue='Transported')


### Categorical features

In [None]:
categorical_features = [c for c in train.select_dtypes(exclude='number').columns if c not in ('Transported', 'Cabin', 'Name', 'PassengerId', 'PassengerGroup')]
categorical_features

Show all categories' size.

In [None]:
data_tabs = ipywidgets.Tab()
data_tabs.children = list([ipywidgets.Output() for _ in categorical_features])

for i, c in enumerate(categorical_features):
    data_tabs.set_title(i, c)
    
    # Display corresponding table output for this tab name
    with data_tabs.children[i]:
        # `rename_axis` + `reset_index(name=...)` gives the same column names on old and new pandas versions.
        display(
            pd.merge(
                train[c].value_counts().rename_axis(c).reset_index(name='train'),
                test[c].value_counts().rename_axis(c).reset_index(name='test'),
                how='outer'
            )
        )

display(data_tabs)

Visualize each category size by `Transported`.

In [None]:
fig = plt.figure(figsize=(15., 12.))
for i, c in enumerate(categorical_features):
    ax = plt.subplot(2, 3, 1 + i)
    ax = sns.countplot(data=train, x=c, hue='Transported')

## Dimension reduction

Try reducing the dimension to 2 with the following methods:
- PCA
- Kernel PCA
- T-SNE
- UMAP
- FastICA
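
As a rough illustration of the first method in the list, PCA can be written directly with NumPy: center the data, take the singular value decomposition, and project onto the first two right singular vectors. The data here is random and purely hypothetical; the notebook itself uses `sklearn.decomposition.PCA`, which also handles details such as component sign conventions.

```python
import numpy as np

rng = np.random.default_rng(2022)

# Hypothetical feature matrix: 200 rows, 5 columns, two of them strongly correlated.
X_demo = rng.normal(size=(200, 5))
X_demo[:, 1] = 0.9 * X_demo[:, 0] + 0.1 * X_demo[:, 1]

# PCA = SVD of the centered data; rows of Vt are the principal directions.
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

X_2d = X_centered @ Vt[:2].T               # coordinates on the first two components
explained = (S ** 2) / np.sum(S ** 2)      # share of variance per component
print(X_2d.shape, explained.round(3))
```

The other methods in the list are non-linear or use a different objective, so they have no comparably short closed form.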


In [None]:
features = continuous_features + categorical_features
X = train[features].copy()  # Apply dimension reduction to this
X = pd.get_dummies(X, columns=categorical_features, drop_first=True, dummy_na=True)
X[continuous_features] = StandardScaler().fit_transform(X[continuous_features])
X[continuous_features] = SimpleImputer(strategy='median').fit_transform(X[continuous_features])
X

In [None]:
fig = plt.figure(figsize=(16., 16.))
dimension_reduce_models = {
    'PCA': PCA(n_components=2, random_state=2022),
    'Kernel PCA (rbf)': KernelPCA(n_components=2, kernel='rbf', n_jobs=-1, random_state=2022),
    'Kernel PCA (polynomial)': KernelPCA(n_components=2, kernel='poly', n_jobs=-1, random_state=2022),
    'Kernel PCA (sigmoid)': KernelPCA(n_components=2, kernel='sigmoid', n_jobs=-1, random_state=2022),
    'Kernel PCA (cosine)': KernelPCA(n_components=2, kernel='cosine', n_jobs=-1, random_state=2022),
    'T-SNE': TSNE(n_components=2, random_state=2022),
    'UMAP': UMAP(n_components=2, random_state=2022, n_jobs=-1),
    'Fast ICA': FastICA(n_components=2, random_state=2022)
}
for i, (name, model) in enumerate(dimension_reduce_models.items()):
    ax = plt.subplot(3, 3, 1 + i)
    X_reduced = model.fit_transform(X.copy())
    X_reduced = pd.DataFrame(data=X_reduced, columns=['comp1', 'comp2'])
    X_reduced['Transported'] = train['Transported']
    ax = sns.scatterplot(data=X_reduced, x='comp1', y='comp2', hue='Transported')
    ax.set_title(name)
    if i == 7:  # last
        ax.legend(loc='upper left', bbox_to_anchor=[1., 1.])
    else:
        ax.legend_ = None
    sns.despine()

The results of UMAP and T-SNE look fine. Increase UMAP's `n_components` from 2 to 4, then re-visualize.

In [None]:
n_components = 4
fig = plt.figure(figsize=(16., 16.))
X_reduced_umap = UMAP(n_components=n_components, random_state=2022, n_jobs=-1).fit_transform(X.copy())
X_reduced_umap = pd.DataFrame(X_reduced_umap, columns=[f'comp{i + 1}' for i in range(n_components)])
X_reduced_umap['Transported'] = train['Transported']
sns.pairplot(data=X_reduced_umap, corner=True, hue='Transported')

In [None]:
del X_reduced_umap, dimension_reduce_models
gc.collect();

## Clustering

First, try the K-Means clustering algorithm. I will choose the number of clusters by the elbow method.

In [None]:
%%time
sse = []
kmeans = {}
for n in range(5, 17):
    kmeans[str(n)] = KMeans(n_clusters=n, random_state=2022).fit(X)
    sse.append([n, kmeans[str(n)].inertia_])
    
sse = pd.DataFrame(sse, columns=['n_clusters', 'sse'])
ax = sns.lineplot(data=sse, x='n_clusters', y='sse')

It seems the number of clusters should be 9.
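
The elbow heuristic is easier to see on data where the right answer is known. The sketch below uses three synthetic, well-separated blobs (not the competition data): the sum of squared errors drops sharply up to 3 clusters and only marginally afterwards, and that bend is the "elbow". On real data such as this dataset, the bend is usually much less distinct, so the choice is partly a judgment call.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2022)

# Three synthetic blobs of 50 points each, far apart from one another.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_blobs = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Sum of squared distances to the nearest centroid for each candidate k.
inertia = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=2022).fit(X_blobs)
    inertia[k] = model.inertia_

# Improvement gained by adding the k-th cluster.
drops = {k: inertia[k - 1] - inertia[k] for k in range(2, 7)}
print({k: round(v, 1) for k, v in inertia.items()})
```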

In [None]:
cluster_df = train[['Transported']].copy()
cluster_df['kmeans'] = kmeans['9'].predict(X)
pd.crosstab(cluster_df['kmeans'], cluster_df['Transported'])

Clusters 6 and 8 consist almost entirely of rows where `Transported` is False!

I will additionally try Ward clustering and a Gaussian mixture. The number of clusters is 9, the same as for K-Means.

In [None]:
%%time
cluster_df['ward'] = AgglomerativeClustering(n_clusters=9, linkage='ward').fit_predict(X)
pd.crosstab(cluster_df['ward'], cluster_df['Transported'])

Clusters 3 and 5 consist almost entirely of rows where `Transported` is False!

In [None]:
%%time
cluster_df['gaussian'] = GaussianMixture(n_components=9, random_state=2022).fit_predict(X)
pd.crosstab(cluster_df['gaussian'], cluster_df['Transported'])

In [None]:
del X, kmeans
gc.collect();

## Make models

In [None]:
continuous_features

In [None]:
categorical_features

### Common preprocessing

For categorical features, convert the values into integers.

In [None]:
for c in categorical_features:
    values = [v for v in train[c].unique() if not (isinstance(v, float) and np.isnan(v))]
    mapping = {v: i for i, v in enumerate(values)}
    print(f'''{c};
    - Mapping: {mapping}
    - Values (training set): {train[c].value_counts().sort_index().index.tolist()}
    - Values (test set)    : {test[c].value_counts().sort_index().index.tolist()}''')
    train[c] = train[c].map(mapping)
    test[c] = test[c].map(mapping)
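
One property of this mapping is worth keeping in mind: it is built from the training set only, so a category that appears only in the test set becomes `NaN` after `map`, exactly like a missing value. The sketch below shows this on made-up values ('Pluto' is hypothetical); the printed value lists in the cell above are what let you check that this does not happen on the real data.

```python
import pandas as pd

train_col = pd.Series(['Earth', 'Mars', None, 'Earth'])
test_col = pd.Series(['Mars', 'Pluto', None])   # 'Pluto' is a made-up unseen category

# Same construction as above: enumerate the non-missing training values.
demo_values = [v for v in train_col.unique() if isinstance(v, str)]
demo_mapping = {v: i for i, v in enumerate(demo_values)}

train_codes = train_col.map(demo_mapping)
test_codes = test_col.map(demo_mapping)   # unseen and missing values both become NaN
print(demo_mapping)
print(test_codes.tolist())
```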

### Split dataset into training/validation set

I will do cross validation to estimate the classifiers' performance.

- Method: Stratified cross validation
- Number of splits: 5

Stratified cross validation is a good choice if you want to keep the original positive/negative proportion after splitting the whole dataset into training/validation sets.

I'll estimate the classifiers' performance by the average accuracy over the cv loops. In particular, I'll estimate the generalization performance (a.k.a. local cv) by the average accuracy on the validation sets.

The submission value is determined by voting: if the majority of the cv classifiers predict that a certain `PassengerId`'s `Transported` is True (False), then the final prediction for that `PassengerId` will be True (False).

In [None]:
# [(<index of training set>, <index of validation set>)], the list length = <number of cv splits>
type_fold_indice = List[Tuple[np.ndarray, np.ndarray]]

fold_indice: type_fold_indice = []
n_splits = 5
splitter = StratifiedKFold(n_splits, random_state=2022, shuffle=True)
for idx_train, idx_valid in splitter.split(X=train, y=train['Transported']):
    fold_indice.append((idx_train, idx_valid))
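
The claim that stratification keeps the class proportion can be checked on a small synthetic label vector. This sketch uses made-up labels (60% positive) rather than the real `Transported` column; the names `y_demo` and `splitter_demo` are mine.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_demo = np.array([1] * 60 + [0] * 40)   # made-up labels, 60% positive
X_dummy = np.zeros((len(y_demo), 1))     # the splitter only looks at the labels

splitter_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)

# Share of positives in each validation fold.
positive_rates = [
    y_demo[idx_valid].mean()
    for _, idx_valid in splitter_demo.split(X_dummy, y_demo)
]
print(positive_rates)
```

Every fold ends up with the same 60% share, whereas a plain `KFold` would let it fluctuate from fold to fold.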

In [None]:
@dataclass
class CrossValidationResult:
    '''Cross validation result.
    
    Attributes
    ----------
    prediction_train, prediction_valid, prediction_test: pd.DataFrame
        Prediction for training/validation/test set, having following 3 columns;
        - `PassengerId`
        - `Fold`
        - `Prediction`.
        `Fold` is identifier of number of cross validation loop, starting from 1.
        `Prediction` is predicted value of `Transported`.
    scores: pd.DataFrame
        Accuracy scores of each cv loop having following 2 columns;
        - `Training`
        - `Validation`.
    classifiers: list
        Fitted classifiers.
    '''
    
    prediction_train: pd.DataFrame
    prediction_valid: pd.DataFrame
    prediction_test: pd.DataFrame
    scores: pd.DataFrame
    classifiers: list
        
    @property
    def local_cv(self) -> float:
        return self.scores['Validation'].mean()


def run_cross_validation_v1(
    train: pd.DataFrame,
    test: pd.DataFrame,
    fold_indice: type_fold_indice,
    continuous_features: List[str],
    categorical_features: List[str],
    base_classifier,
    base_preprocessor: Optional['estimator'] = None) -> CrossValidationResult:
    '''Run cross validation.
    
    Parameters
    ----------
    train, test: pd.DataFrame
        Training set and test set.
    fold_indice: type_fold_indice
        List of indices which split given training data into training/validation set in each fold.
    continuous_features, categorical_features: List[str]
        List of features' name.
    base_classifier
        Scikit-learn like object.
    base_preprocessor
        Scikit-learn like object. If given, it is fitted on and applied to continuous_features.

    Returns
    -------
    cv_result: CrossValidationResult
    
    '''
    prediction_train = []
    prediction_valid = []
    prediction_test = []
    scores = []
    classifiers = []

    for i, (idx_train, idx_valid) in enumerate(fold_indice):
        X_train = train.iloc[idx_train][continuous_features + categorical_features].copy()
        y_train = train.iloc[idx_train]['Transported'].copy()
        X_valid = train.iloc[idx_valid][continuous_features + categorical_features].copy()
        y_valid = train.iloc[idx_valid]['Transported'].copy()
        X_test = test[continuous_features + categorical_features].copy()
    
        # Preprocess continuous features
        if base_preprocessor:
            preprocessor_continuous = deepcopy(base_preprocessor)
            preprocessor_continuous.fit(X_train[continuous_features])
            X_train[continuous_features] = preprocessor_continuous.transform(X_train[continuous_features])
            X_valid[continuous_features] = preprocessor_continuous.transform(X_valid[continuous_features])
            X_test[continuous_features] = preprocessor_continuous.transform(X_test[continuous_features])

        # Preprocess categorical features
        X_train['fold'] = 0
        X_valid['fold'] = 1
        X_test['fold'] = 2
        X = pd.concat([X_train, X_valid, X_test], axis=0)
        X = pd.get_dummies(X, columns=categorical_features, dummy_na=True, drop_first=True)
        X_train = X.query('fold == 0').drop(columns=['fold'])
        X_valid = X.query('fold == 1').drop(columns=['fold'])
        X_test = X.query('fold == 2').drop(columns=['fold'])
        del X
    
        # Train classifier
        classifier = deepcopy(base_classifier)
        classifier = classifier.fit(X_train, y_train)
        classifier.feature_names__ = X_train.columns.tolist()
        classifiers.append(classifier)
    
        # Make prediction
        #- Training set
        pred_train = classifier.predict(X_train)
        prediction_train_ = pd.DataFrame()
        prediction_train_['PassengerId'] = train.iloc[idx_train]['PassengerId']
        prediction_train_['Prediction'] = pred_train
        prediction_train_['Fold'] = i + 1
        prediction_train.append(prediction_train_)
        #- Validation set
        pred_valid = classifier.predict(X_valid)
        prediction_valid_ = pd.DataFrame()
        prediction_valid_['PassengerId'] = train.iloc[idx_valid]['PassengerId']
        prediction_valid_['Prediction'] = pred_valid
        prediction_valid_['Fold'] = i + 1
        prediction_valid.append(prediction_valid_)
        #- Test set
        prediction_test_ = pd.DataFrame()
        prediction_test_['PassengerId'] = test['PassengerId']
        prediction_test_['Prediction'] = classifier.predict(X_test)
        prediction_test_['Fold'] = i + 1
        prediction_test.append(prediction_test_)
        del prediction_train_, prediction_valid_, prediction_test_
        # Scoring
        score_train = accuracy_score(y_train, pred_train)
        score_valid = accuracy_score(y_valid, pred_valid)
        print(f'Fold {i + 1}...Training score: {score_train}, Validation score: {score_valid}')
        scores.append([score_train, score_valid])

    scores = pd.DataFrame(scores, columns=['Training', 'Validation'])
    prediction_train = pd.concat(prediction_train)
    prediction_valid = pd.concat(prediction_valid)
    prediction_test = pd.concat(prediction_test)
    cv_result = CrossValidationResult(
        prediction_train,
        prediction_valid,
        prediction_test,
        scores,
        classifiers
    )
    return cv_result

def run_cross_validation_v2(
    train: pd.DataFrame,
    test: pd.DataFrame,
    fold_indice: type_fold_indice,
    continuous_features: List[str],
    categorical_features: List[str],
    base_classifier,
    fit_params: Optional[dict] = None) -> CrossValidationResult:
    '''Run cross validation.
    
    Parameters
    ----------
    train, test: pd.DataFrame
        Training set and test set.
    fold_indice: type_fold_indice
        List of indices which split given training data into training/validation set in each fold.
    continuous_features, categorical_features: List[str]
        List of features' name.
    base_classifier
        Scikit-learn like object.
    fit_params: Optional[dict]
        If given, it will be used when fitting classifier like `classifier.fit(X, y, **fit_params)`.

    Returns
    -------
    cv_result: CrossValidationResult
    
    '''
    prediction_train = []
    prediction_valid = []
    prediction_test = []
    scores = []
    classifiers = []

    for i, (idx_train, idx_valid) in enumerate(fold_indice):
        X_train = train.iloc[idx_train][continuous_features + categorical_features].copy()
        y_train = train.iloc[idx_train]['Transported'].copy()
        X_valid = train.iloc[idx_valid][continuous_features + categorical_features].copy()
        y_valid = train.iloc[idx_valid]['Transported'].copy()
        X_test = test[continuous_features + categorical_features].copy()
    
        # Train classifier
        classifier = deepcopy(base_classifier)
        if fit_params:
            classifier.fit(X_train, y_train, **fit_params)
        else:
            classifier.fit(X_train, y_train)
        classifier.feature_names__ = X_train.columns.tolist()
        classifiers.append(classifier)
    
        # Make prediction
        #- Training set
        pred_train = classifier.predict(X_train)
        prediction_train_ = pd.DataFrame()
        prediction_train_['PassengerId'] = train.iloc[idx_train]['PassengerId']
        prediction_train_['Prediction'] = pred_train
        prediction_train_['Fold'] = i + 1
        prediction_train.append(prediction_train_)
        #- Validation set
        pred_valid = classifier.predict(X_valid)
        prediction_valid_ = pd.DataFrame()
        prediction_valid_['PassengerId'] = train.iloc[idx_valid]['PassengerId']
        prediction_valid_['Prediction'] = pred_valid
        prediction_valid_['Fold'] = i + 1
        prediction_valid.append(prediction_valid_)
        #- Test set
        prediction_test_ = pd.DataFrame()
        prediction_test_['PassengerId'] = test['PassengerId']
        prediction_test_['Prediction'] = classifier.predict(X_test)
        prediction_test_['Fold'] = i + 1
        prediction_test.append(prediction_test_)
        del prediction_train_, prediction_valid_, prediction_test_
        # Scoring
        score_train = accuracy_score(y_train, pred_train)
        score_valid = accuracy_score(y_valid, pred_valid)
        print(f'Fold {i + 1}...Training score: {score_train}, Validation score: {score_valid}')
        scores.append([score_train, score_valid])

    scores = pd.DataFrame(scores, columns=['Training', 'Validation'])
    prediction_train = pd.concat(prediction_train)
    prediction_valid = pd.concat(prediction_valid)
    prediction_test = pd.concat(prediction_test)
    cv_result = CrossValidationResult(
        prediction_train,
        prediction_valid,
        prediction_test,
        scores,
        classifiers
    )
    return cv_result


def to_final_prediction(df: pd.DataFrame, n_splits: int = 5) -> pd.DataFrame:
    final_prediction = df \
                      .groupby('PassengerId')['Prediction'] \
                      .sum() \
                      .reset_index()
    final_prediction['Transported'] = final_prediction['Prediction'].apply(lambda x: x > n_splits / 2).astype('bool')
    return final_prediction[['PassengerId', 'Transported']]

def make_my_submission(
    sample_submission: pd.DataFrame,
    cv_result: CrossValidationResult,
    n_splits: int = 5
    ) -> pd.DataFrame:
    final_prediction = to_final_prediction(cv_result.prediction_test, n_splits)
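    # A left join keeps every `PassengerId` of the sample submission, so `_merge` can flag missing predictions.
    my_submission = pd.merge(sample_submission[['PassengerId']], final_prediction, how='left', indicator=True)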
#     display(final_prediction)
#     display(my_submission)
    
    # Ensure the set of `PassengerId` in my submission is correct.
    if my_submission['_merge'].nunique() != 1:  # expected only "both"
        display(my_submission.query('_merge != "both"'))
        raise ValueError
    else:
        return my_submission[['PassengerId', 'Transported']]
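
The voting rule in `to_final_prediction` can be traced by hand. In this sketch the IDs and the per-fold predictions are made up: each passenger gets one 0/1 prediction per fold, the predictions are summed, and the final answer is True when more than half of the folds voted 1.

```python
n_splits_demo = 5

# Made-up per-fold predictions (1 = transported) for two hypothetical passengers.
fold_predictions = {
    '0013_01': [1, 1, 0, 1, 0],
    '0018_01': [0, 0, 1, 0, 0],
}

# Sum the votes, then compare with half the number of folds.
votes = {pid: sum(preds) for pid, preds in fold_predictions.items()}
final = {pid: total > n_splits_demo / 2 for pid, total in votes.items()}
print(votes)
print(final)
```

With an odd number of folds a tie is impossible; with an even number, the strict `>` would resolve ties to False.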

### Logistic regression

Preprocessing:  
- Continuous features
  - Impute missing value with median
  - Z-score normalization
- Categorical features
  - Impute missing value with constant value
  - One hot encoding

In [None]:
base_preprocessor = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
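
Inside `run_cross_validation_v1` this pipeline is fitted on the training fold only and then applied to the validation and test sets. The sketch below spells out the same two steps with NumPy on made-up numbers, to show which statistics come from where; it assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

x_train = np.array([1.0, 2.0, np.nan, 4.0, 100.0])   # made-up training fold
x_valid = np.array([np.nan, 3.0])                    # made-up validation fold

# Step 1: the median is learned from the training fold only.
median = np.nanmedian(x_train)
x_train_imputed = np.where(np.isnan(x_train), median, x_train)

# Step 2: mean and standard deviation are learned from the imputed training fold.
mean, std = x_train_imputed.mean(), x_train_imputed.std()
x_train_z = (x_train_imputed - mean) / std

# The validation fold reuses the training statistics; nothing is re-fitted.
x_valid_z = (np.where(np.isnan(x_valid), median, x_valid) - mean) / std
print(median, x_train_z.round(3), x_valid_z.round(3))
```

Fitting on the training fold only keeps information from the validation rows out of the preprocessing, so the validation score stays an honest estimate.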

In [None]:
%%time
# The hyperparameters are not carefully tuned
cv_result_logreg = run_cross_validation_v1(train, test, fold_indice,
                                           [c for c in continuous_features if c != 'Age'],
                                           categorical_features,
                                           LogisticRegression(
                                               n_jobs=-1, random_state=2022,
                                               max_iter=3000, C=10., penalty='l1', solver='saga'),
                                           base_preprocessor)
print(f'Local cv: {cv_result_logreg.local_cv:.5f}')

Scikit-learn's logistic regression object saves the coefficient of each feature after fitting.

In [None]:
coef = []
for clf in (cv_result_logreg.classifiers):
    coef_ = pd.DataFrame({
        'Feature': clf.feature_names__,
        'Coefficient': clf.coef_[0]
    })
    coef.append(coef_)
coef = pd.concat(coef)
coef_stat = coef.groupby('Feature')['Coefficient'].describe()
coef_stat

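The bigger the absolute value of a coefficient is, the more important that feature is for the logistic regression model when predicting `Transported`. This comparison is meaningful because the continuous features are standardized to the same scale.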

In [None]:
fig = plt.figure(figsize=(9., 7.))
ax = sns.barplot(y=coef_stat.index, x=coef_stat['mean'])
ax.axvline(0., color='black')
ax.set(title='Average feature coefficients of LogisticRegression',
       xlabel='Average coefficient');

In [None]:
submission_logreg = make_my_submission(sample_submission, cv_result_logreg)
display(submission_logreg['Transported'].value_counts(normalize=True).sort_index())
submission_logreg.to_csv('submission_logistic_regression.csv', index=False)

### Ridge

Preprocessing is the same as for logistic regression.

In [None]:
%%time
# The hyperparameters are not carefully tuned
cv_result_ridge = run_cross_validation_v1(train, test, fold_indice,
#                                           [c for c in continuous_features if c != 'Age'],
                                          continuous_features,
                                          categorical_features,
                                          RidgeClassifier(random_state=2022, alpha=1000.),
                                          base_preprocessor)
print(f'Local cv: {cv_result_ridge.local_cv:.5f}')

See the features' coefficients.

In [None]:
coef = []
for clf in (cv_result_ridge.classifiers):
    coef_ = pd.DataFrame({
        'Feature': clf.feature_names__,
        'Coefficient': clf.coef_[0]
    })
    coef.append(coef_)
coef = pd.concat(coef)
coef_stat = coef.groupby('Feature')['Coefficient'].describe()
coef_stat

In [None]:
fig = plt.figure(figsize=(9., 7.))
ax = sns.barplot(y=coef_stat.index, x=coef_stat['mean'])
ax.axvline(0., color='black')
ax.set(title='Average feature coefficients of Ridge classifier',
       xlabel='Average coefficient');

In [None]:
submission_ridge = make_my_submission(sample_submission, cv_result_ridge)
display(submission_ridge['Transported'].value_counts(normalize=True).sort_index())
submission_ridge.to_csv('submission_ridge.csv', index=False)


### SVM

Preprocessing is the same as for logistic regression.

In [None]:
%%time
# The hyperparameters are not carefully tuned
cv_result_svm = run_cross_validation_v1(train, test, fold_indice,
                                        [c for c in continuous_features if c != 'Age'],  # 0.80513
#                                         continuous_features,
                                        categorical_features,
                                        SVC(random_state=2022, C=10.),
                                        base_preprocessor)
print(f'Local cv: {cv_result_svm.local_cv:.5f}')

SVM is a non-linear model if `kernel` is not "linear"; therefore, feature coefficients are not available.
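
This can be checked directly on scikit-learn's `SVC`: the `coef_` attribute exists for a linear kernel and is unavailable otherwise. The sketch below uses random toy data, not the competition features.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2022)

# Toy binary problem with 3 features; the label depends on the first one.
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

linear_svm = SVC(kernel='linear').fit(X_demo, y_demo)
rbf_svm = SVC(kernel='rbf').fit(X_demo, y_demo)   # 'rbf' is the default kernel

print(linear_svm.coef_.shape)      # one weight per feature
print(hasattr(rbf_svm, 'coef_'))   # no per-feature weights for a non-linear kernel
```

The `SVC(random_state=2022, C=10.)` used above keeps the default `rbf` kernel, which is why no coefficient plot follows for this model.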

In [None]:
submission_svm = make_my_submission(sample_submission, cv_result_svm)
display(submission_svm['Transported'].value_counts(normalize=True).sort_index())
submission_svm.to_csv('submission_svm.csv', index=False)

### KNN

Preprocessing is the same as for logistic regression.

In [None]:
# def cosine_distance(x, y) -> float:
#     if (x == y).all():
#         return 0.
#     dot = np.dot(x, y)
#     norm = np.linalg.norm(x) * np.linalg.norm(y)
#     similarity = dot / norm
#     return 1. - similarity

In [None]:
%%time
# The hyperparameters are not carefully tuned
cv_result_knn = run_cross_validation_v1(train, test, fold_indice,
                                        [c for c in continuous_features if c != 'Age'],
#                                         continuous_features,
                                        categorical_features,
                                        Pipeline(
                                            steps=[
                                                ('umap', UMAP(n_components=6, n_jobs=-1, random_state=2022)),
                                                ('estimator', KNeighborsClassifier(n_jobs=-1, weights='distance'))
                                            ]
                                        ),
                                        base_preprocessor)
print(f'Local cv: {cv_result_knn.local_cv:.5f}')

In [None]:
submission_knn = make_my_submission(sample_submission, cv_result_knn)
display(submission_knn['Transported'].value_counts(normalize=True).sort_index())
submission_knn.to_csv('submission_knn.csv', index=False)

### Gaussian naive Bayes

In [None]:
%%time
# The hyperparameters are not carefully tuned
cv_result_gnn = run_cross_validation_v1(train, test, fold_indice,
#                                         [c for c in continuous_features if c != 'Age'],  # 0.78247
                                        continuous_features,
                                        categorical_features,
#                                         GaussianNB(),
                                        Pipeline(
                                            steps=[
                                                ('umap', UMAP(n_components=6, n_jobs=-1, random_state=2022)),
                                                ('estimator', GaussianNB())
                                            ]
                                        ),
                                        base_preprocessor)
print(f'Local cv: {cv_result_gnn.local_cv:.5f}')

In [None]:
submission_gnn = make_my_submission(sample_submission, cv_result_gnn)
display(submission_gnn['Transported'].value_counts(normalize=True).sort_index())
submission_gnn.to_csv('submission_gaussian_naive_bayes.csv', index=False)

### Bernoulli naive Bayes

In [None]:
%%time
cv_result_bernoulli = run_cross_validation_v1(train, test, fold_indice,
#                                         [c for c in continuous_features if c != 'Age'],  # 0.78247
                                        continuous_features,
                                        categorical_features,
#                                         BernoulliNB(),
                                        Pipeline(
                                            steps=[
                                                ('umap', UMAP(n_components=6, n_jobs=-1, random_state=2022)),
                                                ('estimator', BernoulliNB())
                                            ]
                                        ),
#                                         SimpleImputer(strategy='median'))
                                       base_preprocessor)
print(f'Local cv: {cv_result_bernoulli.local_cv:.5f}')

In [None]:
submission_bernoulli = make_my_submission(sample_submission, cv_result_bernoulli)
display(submission_bernoulli['Transported'].value_counts(normalize=True).sort_index())
submission_bernoulli.to_csv('submission_bernoulli_naive_bayes.csv', index=False)

### CatBoost

In [None]:
class CatBoostVoting(BaseEstimator, TransformerMixin):
    '''Voting ensemble of 4 CatBoost classifiers.
    '''
    
    def __init__(self,
                 learning_rate: float,
                 min_votes: int,
                 random_state: int = 2022):
        '''Initializer.
        
        Parameters
        ----------
        learning_rate, random_state:
            Passed directly to each `CatBoostClassifier`.
        min_votes: int
            If the number of weak learners predicting the positive label is greater than or equal to
            `min_votes`, the final prediction is 1 (positive); otherwise it is 0 (negative).
        '''
        self.learning_rate = learning_rate
        self.min_votes = min_votes
        self.random_state = random_state

    def fit(self, X: pd.DataFrame, y: pd.Series) -> object:

        #- Version 6.
        classifier1 = CatBoostClassifier(**{
            'logging_level': 'Silent',
            'learning_rate': self.learning_rate,
            'random_state': self.random_state,
            'iterations': 485,
            'depth': 9,
            'l2_leaf_reg': 37.78633552228524,
            'random_strength': 0.5860739898501952,
            'bagging_temperature': 5.135564748789393,
            'grow_policy': 'Lossguide',
            'min_data_in_leaf': 27}
        )
        #- Version 5.
        classifier2 = CatBoostClassifier(**{
            'logging_level': 'Silent',
            'learning_rate': self.learning_rate,
            'random_state': self.random_state,
            'iterations': 488,
            'depth': 9,
            'l2_leaf_reg': 9.078186461867324,
            'random_strength': 0.41661942028590215,
            'bagging_temperature': 5.022878573438741,
            'grow_policy': 'Lossguide',
            'min_data_in_leaf': 32}
        )
        #- Version 8.
        classifier3 = CatBoostClassifier(**{
            'logging_level': 'Silent',
            'learning_rate': self.learning_rate,
            'random_state': self.random_state,
            'iterations': 483,
            'depth': 9,
            'l2_leaf_reg': 15.56684547250753,
            'random_strength': 0.8564880812992988,
            'bagging_temperature': 6.269361412534385,
            'grow_policy': 'Depthwise',
            'min_data_in_leaf': 5}
        )
        #- Version 9.
        classifier4 = CatBoostClassifier(**{
            'logging_level': 'Silent',
            'learning_rate': self.learning_rate,
            'random_state': self.random_state,
            'iterations': 480,
            'depth': 11,
            'l2_leaf_reg': 8.931378652814573,
            'random_strength': 1.3059544347194127,
            'bagging_temperature': 2.133405869044489,
            'grow_policy': 'Lossguide',
            'min_data_in_leaf': 32,}
        )

        classifiers = [classifier1, classifier2, classifier3, classifier4]
        for classifier in classifiers:
            classifier.fit(X, y)
        self.classifiers_ = classifiers

        return self
    
    def get_predictions(self, X: pd.DataFrame) -> np.ndarray:
        predictions = np.zeros((X.shape[0], self.num_classifiers))
        for i, classifier in enumerate(self.classifiers):
            predictions[:, i] = classifier.predict(X)

        return predictions
    
    def get_probabilities(self, X: pd.DataFrame) -> np.ndarray:
        probability = np.zeros((X.shape[0], self.num_classifiers))
        for i, classifier in enumerate(self.classifiers):
            probability[:, i] = classifier.predict_proba(X)[:, 1]

        return probability

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        predictions = self.get_predictions(X)

        return (np.sum(predictions, axis=1) >= self.min_votes).astype('int')
    
    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        probability = np.zeros((X.shape[0], 2))
        probability1 = np.mean(self.get_probabilities(X), axis=1)  # positive class
        probability[:, 0] = 1. - probability1
        probability[:, 1] = probability1        

        return probability
        
    @property
    def num_classifiers(self) -> int:

        return len(self.classifiers_)

    @property
    def classifiers(self):
        for classifier in self.classifiers_:
            yield classifier
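
The voting rule described in the docstring can be sketched with plain NumPy. The arrays below are hypothetical stand-ins for the outputs of `get_predictions` and `get_probabilities` (3 passengers, 4 classifiers). They are not results from this notebook.

```python
import numpy as np

# Hypothetical hard predictions: 3 passengers (rows) x 4 classifiers (columns).
predictions = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
])
# Hypothetical positive-class probabilities with the same layout.
probabilities = np.array([
    [0.9, 0.6, 0.4, 0.3],
    [0.7, 0.4, 0.2, 0.1],
    [0.9, 0.8, 0.7, 0.6],
])

min_votes = 2
# Hard voting: positive when at least `min_votes` classifiers predict 1.
hard = (predictions.sum(axis=1) >= min_votes).astype('int')

# Soft output: average the positive-class probability, then take the complement.
positive = probabilities.mean(axis=1)
soft = np.column_stack([1. - positive, positive])

print(hard)  # [1 0 1]
print(soft)
```

With `min_votes=2` out of 4 classifiers, a 2-2 tie counts as positive. Because `predict` counts votes while `predict_proba` averages probabilities, the two outputs can disagree for some rows.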

In [None]:
%%time
# The hyperparameters have not been tuned much
cv_result_cat = run_cross_validation_v2(train, test, fold_indice,
                                        continuous_features,
                                        categorical_features,
                                        CatBoostVoting(min_votes=2, random_state=2022, learning_rate=0.01))
print(f'Local cv: {cv_result_cat.local_cv:.5f}')

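A trained `CatBoostClassifier` exposes a `feature_importances_` attribute. `CatBoostVoting` as defined above does not forward this attribute, and the cells below are commented out.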
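
One possible way to get importances for a voting wrapper is to average `feature_importances_` over its fitted members. This is a minimal sketch under assumptions: `mean_member_importance` is a hypothetical helper that is not part of this notebook, and `SimpleNamespace` stand-ins with made-up feature names and values replace fitted `CatBoostClassifier` objects so that the example runs without training.

```python
from types import SimpleNamespace

import pandas as pd


def mean_member_importance(voting_model) -> pd.Series:
    """Average `feature_importances_` over the members in `classifiers_`."""
    frames = [
        pd.DataFrame({'Feature': member.feature_names_,
                      'Importance': member.feature_importances_})
        for member in voting_model.classifiers_
    ]
    return (pd.concat(frames)
              .groupby('Feature')['Importance']
              .mean()
              .sort_values(ascending=False))


# Stand-ins that mimic the two attributes read above (values are made up).
member_a = SimpleNamespace(feature_names_=['Age', 'Spa'],
                           feature_importances_=[40., 60.])
member_b = SimpleNamespace(feature_names_=['Age', 'Spa'],
                           feature_importances_=[20., 80.])
voting = SimpleNamespace(classifiers_=[member_a, member_b])

importance = mean_member_importance(voting)
print(importance)
```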

In [None]:
# feature_importance = []
# for model in cv_result_cat.classifiers:
#     feature_importance_ = pd.DataFrame(
#         {'Feature': model.feature_names_,
#          'Importance': model.feature_importances_}
#     )
#     feature_importance.append(feature_importance_)
# feature_importance = pd.concat(feature_importance)
# feature_importance_stat = feature_importance.groupby('Feature')['Importance'].describe()
# feature_importance_stat

In [None]:
# fig = plt.figure(figsize=(9., 7.))
# ax = sns.barplot(y=feature_importance_stat.index, x=feature_importance_stat['mean'])
# ax.set(title='Average feature importance of CatBoost',
#        xlabel='Average importance');

In [None]:
submission_cat = make_my_submission(sample_submission, cv_result_cat)
display(submission_cat['Transported'].value_counts(normalize=True).sort_index())
submission_cat.to_csv('submission_catboost.csv', index=False)

### LightGBM

In [None]:
%%time
# The hyperparameters have not been tuned much
cv_result_lgbm = run_cross_validation_v2(train, test, fold_indice,
                                         continuous_features,
                                         categorical_features,
                                         LGBMClassifier(random_state=2022, n_jobs=-1, n_estimators=200, learning_rate=0.05,
                                                        importance_type='gain',
                                                        subsample=0.9, subsample_freq=3, colsample_bytree=0.6))
print(f'Local cv: {cv_result_lgbm.local_cv:.5f}')

See feature importances.

In [None]:
feature_importance = []
for model in cv_result_lgbm.classifiers:
    feature_importance_ = pd.DataFrame(
        {'Feature': model.feature_name_,
         'Importance': model.feature_importances_}
    )
    feature_importance.append(feature_importance_)
feature_importance = pd.concat(feature_importance)
feature_importance_stat = feature_importance.groupby('Feature')['Importance'].describe()
feature_importance_stat

In [None]:
fig = plt.figure(figsize=(9., 7.))
ax = sns.barplot(y=feature_importance_stat.index, x=feature_importance_stat['mean'])
ax.set(title='Average feature importance of LightGBM',
       xlabel='Average importance');

In [None]:
submission_lgbm = make_my_submission(sample_submission, cv_result_lgbm)
display(submission_lgbm['Transported'].value_counts(normalize=True).sort_index())
submission_lgbm.to_csv('submission_lightgbm.csv', index=False)

### XGBoost

In [None]:
%%time
# The hyperparameters have not been tuned much
cv_result_xgb = run_cross_validation_v2(train, test, fold_indice,
                                        continuous_features,
                                        categorical_features,
                                        XGBClassifier(random_state=2022, n_jobs=-1, objective='binary:logistic'))
print(f'Local cv: {cv_result_xgb.local_cv:.5f}')

See feature importances.

In [None]:
feature_importance = []
for model in cv_result_xgb.classifiers:
    feature_importance_ = pd.DataFrame(
        {'Feature': model.feature_names__,
         'Importance': model.feature_importances_}
    )
    feature_importance.append(feature_importance_)
feature_importance = pd.concat(feature_importance)
feature_importance_stat = feature_importance.groupby('Feature')['Importance'].describe()
feature_importance_stat

In [None]:
fig = plt.figure(figsize=(9., 7.))
ax = sns.barplot(y=feature_importance_stat.index, x=feature_importance_stat['mean'])
ax.set(title='Average feature importance of XGBoost',
       xlabel='Average importance');

In [None]:
submission_xgb = make_my_submission(sample_submission, cv_result_xgb)
display(submission_xgb['Transported'].value_counts(normalize=True).sort_index())
submission_xgb.to_csv('submission_xgboost.csv', index=False)

### Voting ensemble

I've trained 9 classifiers. The final prediction is determined by majority vote.
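
A minimal sketch of majority voting with pandas. The model names, `PassengerId` values and predictions are all hypothetical and only illustrate the rule.

```python
import pandas as pd

# Hypothetical boolean predictions of three models for four passengers.
votes = pd.DataFrame(
    {'ModelA': [True, True, False, False],
     'ModelB': [True, False, False, True],
     'ModelC': [True, True, False, False]},
    index=pd.Index(['0001_01', '0002_01', '0003_01', '0004_01'],
                   name='PassengerId'),
)

# Count the positive votes per passenger and require a strict majority.
num_models = votes.shape[1]
positive_votes = votes.sum(axis=1)
majority = positive_votes > num_models / 2

print(majority.tolist())  # [True, True, False, False]
```

With an odd number of models, such as the 9 used here, a strict majority can never end in a tie.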

In [None]:
plt.figure(figsize=(9., 9.))
submissions = pd.concat([
    submission_logreg.rename(columns={'Transported': 'Logistic'}).set_index('PassengerId'),
    submission_ridge.rename(columns={'Transported': 'Ridge'}).set_index('PassengerId'),
    submission_svm.rename(columns={'Transported': 'SVM'}).set_index('PassengerId'),
    submission_knn.rename(columns={'Transported': 'KNN'}).set_index('PassengerId'),
    submission_gnn.rename(columns={'Transported': 'GaussianNB'}).set_index('PassengerId'),
    submission_bernoulli.rename(columns={'Transported': 'BernoulliNB'}).set_index('PassengerId'),
    submission_cat.rename(columns={'Transported': 'CatBoost'}).set_index('PassengerId'),
    submission_lgbm.rename(columns={'Transported': 'LightGBM'}).set_index('PassengerId'),
    submission_xgb.rename(columns={'Transported': 'XGBoost'}).set_index('PassengerId'),
], axis=1)
models = submissions.columns.tolist()
ax = sns.heatmap(submissions.corr(), vmin=0., vmax=1., annot=True)
ax.set_title('Pearson correlation coefficient');

In [None]:
# Majority vote over the trained classifiers
num_models = len(models)
submissions['Prediction'] = submissions[models].sum(axis=1)
submissions['Transported'] = submissions['Prediction'].apply(lambda x: x > num_models / 2)
submissions

In [None]:
display(submissions['Transported'].value_counts(normalize=True).sort_index())
submissions[['Transported']].to_csv('submission_voting.csv')

## Stacking

Use the predictions of the base classifiers (SVM, CatBoost and LightGBM in this version) as features for a final classifier. This ensemble technique is called stacking.
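
The idea can be sketched with scikit-learn alone. This is a simplified variant under assumptions, not the exact procedure of `run_stacking` below: it uses synthetic data, two arbitrary base models, and a single `LogisticRegression` as the final classifier, fitted on out-of-fold probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the competition dataset).
X, y = make_classification(n_samples=400, n_features=8, random_state=2022)

base_models = {
    'KNN': KNeighborsClassifier(),
    'GaussianNB': GaussianNB(),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)

# Level 1: out-of-fold probabilities, so no row is predicted by a model that saw it.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=cv, method='predict_proba')[:, 1]
    for model in base_models.values()
])

# Level 2: the stacking classifier learns from the base models' outputs.
stacker = LogisticRegression().fit(meta_features, y)
stacked_accuracy = stacker.score(meta_features, y)
print(meta_features.shape, round(stacked_accuracy, 3))
```

The accuracy printed here is measured on the rows the final classifier was fitted on, so it is optimistic. A separate validation split gives a fairer estimate.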

In [None]:
def run_stacking(train: pd.DataFrame,
                 cv_results: Dict[str, CrossValidationResult],
                 stacking_classifier: Optional[object] = None,
                 n_splits: int = 5) -> CrossValidationResult:
    y = train.set_index('PassengerId').sort_index()[['Transported']]
    prediction_train = []
    prediction_valid = []
    prediction_test = []
    scores = []
    classifiers = []

    for i in range(n_splits):

        # Use the base learners' predictions as features for the stacking classifier
        X_train, X_valid, X_test = [], [], []
        for name, cv_result in cv_results.items():
            X_train.append(cv_result.prediction_train
                                    .query(f'Fold == {i + 1}')
                                    .rename(columns={'Prediction': name})
                                    .set_index('PassengerId')
                                    .drop(columns=['Fold']))
            X_valid.append(cv_result.prediction_valid
                                    .query(f'Fold == {i + 1}')
                                    .rename(columns={'Prediction': name})
                                    .set_index('PassengerId')
                                    .drop(columns=['Fold']))
            X_test.append(cv_result.prediction_test
                                   .query(f'Fold == {i + 1}')
                                   .rename(columns={'Prediction': name})
                                   .set_index('PassengerId')
                                   .drop(columns=['Fold']))
        X_train = pd.concat(X_train, axis=1)
        X_valid = pd.concat(X_valid, axis=1)
        X_test = pd.concat(X_test, axis=1)

        # Ground truth
        y_train = y.loc[X_train.index, 'Transported']
        y_valid = y.loc[X_valid.index, 'Transported']
        
        # Run stacking
        if stacking_classifier is None:
            classifier = RidgeClassifierCV()
        else:
            classifier = deepcopy(stacking_classifier)
        classifier = classifier.fit(X_train, y_train) 
        classifier.feature_names__ = X_train.columns.tolist()
        classifiers.append(classifier)
    
        # Make prediction
        #- Training set
        pred_train = classifier.predict(X_train)
        prediction_train_ = pd.DataFrame()
        prediction_train_['PassengerId'] = X_train.index
        prediction_train_['Prediction'] = pred_train
        prediction_train_['Fold'] = i + 1
        prediction_train.append(prediction_train_)
        #- Validation set
        pred_valid = classifier.predict(X_valid)
        prediction_valid_ = pd.DataFrame()
        prediction_valid_['PassengerId'] = X_valid.index
        prediction_valid_['Prediction'] = pred_valid
        prediction_valid_['Fold'] = i + 1
        prediction_valid.append(prediction_valid_)
        #- Test set
        prediction_test_ = pd.DataFrame()
        prediction_test_['PassengerId'] = X_test.index
        prediction_test_['Prediction'] = classifier.predict(X_test)
        prediction_test_['Fold'] = i + 1
        prediction_test.append(prediction_test_)
        del prediction_train_, prediction_valid_, prediction_test_
        # Scoring
        score_train = accuracy_score(y_train, pred_train)
        score_valid = accuracy_score(y_valid, pred_valid)
        print(f'Fold {i + 1}...Training score: {score_train}, Validation score: {score_valid}')
        scores.append([score_train, score_valid])

    scores = pd.DataFrame(scores, columns=['Training', 'Validation'])
    prediction_train = pd.concat(prediction_train)
    prediction_valid = pd.concat(prediction_valid)
    prediction_test = pd.concat(prediction_test)
    cv_result = CrossValidationResult(
        prediction_train,
        prediction_valid,
        prediction_test,
        scores,
        classifiers
    )
    return cv_result

In [None]:
%%time
cv_results = {
#     'Logistic Regression': cv_result_logreg,
#     'Ridge': cv_result_ridge,
    'SVM': cv_result_svm,
#     'KNN': cv_result_knn,
#     'GaussianNB': cv_result_gnn,
#     'BernoulliNB': cv_result_bernoulli,
    'CatBoost': cv_result_cat,
    'LightGBM': cv_result_lgbm,
#     'XGBoost': cv_result_xgb
}
cv_result_stacking = run_stacking(train,
                                  cv_results,
                                  n_splits=n_splits,
                                  stacking_classifier=CatBoostClassifier(random_state=2022, logging_level='Silent'))
print(f'Local cv: {cv_result_stacking.local_cv:.5f}')

In [None]:
feature_importance = []
for model in cv_result_stacking.classifiers:
    feature_importance_ = pd.DataFrame(
        {'Feature': model.feature_names_,
         'Importance': model.feature_importances_}
    )
    feature_importance.append(feature_importance_)
feature_importance = pd.concat(feature_importance)
feature_importance_stat = feature_importance.groupby('Feature')['Importance'].describe()
feature_importance_stat

In [None]:
fig = plt.figure(figsize=(9., 7.))
ax = sns.barplot(y=feature_importance_stat.index, x=feature_importance_stat['mean'])
ax.set(title='Average feature importance of the stacking classifier',
       xlabel='Average importance');

In [None]:
submission_stacking = make_my_submission(sample_submission, cv_result_stacking)
display(submission_stacking['Transported'].value_counts(normalize=True).sort_index())
submission_stacking.to_csv('submission_stacking.csv', index=False)