# Table of Contents

1. &nbsp; [Introduction](#1.-Introduction)
2. &nbsp; [Preamble](#2.-Preamble)
3. &nbsp; [Helpers](#3.-Helpers)
4. &nbsp; [Leaderboard](#4.-Leaderboard)
5. &nbsp; [Feature Engineering](#5.-Feature-Engineering)
6. &nbsp; [Pipeline Preprocessing](#6.-Pipeline-Preprocessing)
7. &nbsp; [Holdout + CV](#7.-Holdout-+-CV)
8. &nbsp; [Final Words](#8.-Final-Words)

# 1. Introduction

This notebook is an XGBoost starter for the Titanic dataset, featuring no missing data imputation and no data binning.

There's no EDA here, since plenty of excellent EDA notebooks already exist for this dataset.

Questions and feedback are welcome!

## Credit

Moral of the story:
> Generally, grouping passengers is a good way to improve your score. Try searching for groups.

-- Konstantin

I learned a lot from various kernels and discussions.  I want to especially credit:

- [How am I doing with my score](https://www.kaggle.com/pliptor/how-am-i-doing-with-my-score) by [Oscar Takeshita](https://www.kaggle.com/pliptor)
- [Titanic [0.82] - [0.83]](https://www.kaggle.com/konstantinmasich/titanic-0-82-0-83) by [Konstantin](https://www.kaggle.com/konstantinmasich)

I also recommend checking out:

### sklearn pipelines + pandas
- [Deploying Machine Learning using sklearn pipelines](https://www.youtube.com/watch?v=URdnFlZnlaE) (YouTube) by Kevin Goetsch
- [Mind the Gap! Bridging the pandas – scikit learn dtype divide](https://www.youtube.com/watch?v=KLPtEBokqQ0) (YouTube) by Tom Augspurger
- Kevin Goetsch's github repo: https://github.com/Kgoetsch/sklearn_pipeline_enhancements
- Julie Michelman's github repo: https://github.com/jem1031/pandas-pipelines-custom-transformers

### XGBoost
- [Walkthrough](https://www.youtube.com/watch?v=ufHo8vbk6g4) (YouTube) by [Tong He](https://www.kaggle.com/hetong007)
- [Open Source Tools and Data Science Competitions](https://www.youtube.com/watch?v=7YnVZrabTA8) (YouTube) by [Owen Zhang](https://www.kaggle.com/owenzhang1)
- [Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md) (github)
- [Python API](http://xgboost.readthedocs.io/en/latest/python/python_api.html) (readthedocs)

### Titanic
- https://www.encyclopedia-titanica.org/
- [Titanic Cutaway Diagram](https://commons.wikimedia.org/wiki/File:Titanic_cutaway_diagram.png) (Wikimedia)

## License

My work is licensed under CC0:

- Overview: https://creativecommons.org/publicdomain/zero/1.0/
- Legal code: https://creativecommons.org/publicdomain/zero/1.0/legalcode.txt

All other rights remain with their respective owners.

# 2. Preamble

The usual suspects.

## 2.1 Jupyter Magic

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

## 2.2 Imports

In [3]:
from functools import partial

import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from matplotlib import pyplot as plt
from sklearn.pipeline import make_pipeline

## 2.3 Library Settings

In [None]:
plt.rcParams['figure.figsize'] = (13,4)
sns.set(
    style='whitegrid',
    color_codes=True,
    font_scale=1.5)
np.set_printoptions(
    suppress=True,
    linewidth=200)
pd.set_option(
    'display.max_rows', 1000,
    'display.max_columns', None,
)

## 2.4 Globals

In [None]:
SEED = 0
SEED_LIST = 2 ** np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
VAL_SIZE = 0.3

## 2.5 File Paths

In [None]:
train_csv       = '../input/titanic/train.csv'
test_csv        = '../input/titanic/test.csv'
submit_csv      = '../input/titanic/gender_submission.csv'
leaderboard_csv = '../input/titanic-public-leaderboard/titanic-publicleaderboard.csv'

# 3. Helpers

The true carry.

## 3.1 XGBoost

### Training

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold

def cv(params, n=100, n_cv=5, k=5):
    cv_results = xgb.cv(
        params,
        dfull,
        num_boost_round=n,
        folds=RepeatedStratifiedKFold(n_splits=k, n_repeats=n_cv, random_state=SEED),
        seed=SEED,
    )
    plot_cv(cv_results)
    return cv_results

def holdout(params, n=100, early_stopping_rounds=None):
    evals = {}
    m = xgb.train(
        params,
        dtrain,
        num_boost_round=n,
        evals=[(dtrain, 'train'), (dval, 'val')],
        evals_result=evals,
        early_stopping_rounds=early_stopping_rounds,
        verbose_eval=None,
    )
    plot_evals(evals)
    return evals

def train(params, n):
    return xgb.train(
        params,
        dfull,
        num_boost_round=n,
        verbose_eval=None,
    )

### Plotting

In [None]:
def roll(ls, w=5):
    return pd.Series(ls).rolling(window=w).mean()

def plot(a, b, c, d):
    # Left panel: log loss curves.
    plt.subplot(1, 2, 1)
    plt.plot(a)
    plt.plot(b)
    plt.ylim(0, 0.7)

    # Right panel: error curves.
    plt.subplot(1, 2, 2)
    plt.plot(c)
    plt.plot(d)
    plt.ylim(0, 0.2)

def plot_cv(cv_dict, start=0, stop=None):
    keys = [
        'train-logloss-mean',
        'test-logloss-mean',
        'train-error-mean',
        'test-error-mean'
    ]
    plot(*[roll(cv_dict[k][start:stop]) for k in keys])

def plot_evals(evals, start=0, stop=None):
    eval_list = [
        roll(evals[a][b][start:stop])
        for b in ['logloss', 'error']
        for a in ['train', 'val']
    ]
    plot(*eval_list)

def plot_cv_error(cv_results, start=0, stop=None):
    plt.plot(cv_results[['train-error-mean', 'test-error-mean']][start:stop])

def plot_holdout_error(h, start=0, stop=None):
    plt.plot(
        pd.DataFrame(
            [h['train']['error'], h['val']['error']],
            index=['train', 'val'])
        .T
        [start:stop]
    )

### Submit

In [None]:
import datetime

def ensemble(params, n):
    def d(x): return dict(params, seed=int(x))
    # One model per seed; np.vstack needs a sequence, not a generator.
    return (
        np.vstack([train(d(x), n).predict(dtest) for x in SEED_LIST])
        .T
        .mean(axis=1)
    )

def submit(y_hat, name):
    df = pd.read_csv(submit_csv).assign(Survived=y_hat)
    timestamp = datetime.datetime.now().strftime('%d-%m-%Y_%H-%M')
    path = f'./{timestamp}_{name}.csv'
    df.to_csv(path, index=False)

def threshold(y_hat, pr=0.5):
    return (y_hat > pr) * 1

## 3.2 Scripts

In [None]:
import datetime

def dtype_info(X):
    # Summarise the frame that was passed in, not the global `traintest`.
    return pd.concat([
        X.dtypes.rename('dtypes'),
        X.min().astype('object').rename('min'),
        X.max().astype('object').rename('max'),],
        axis=1
    )

def find(col, s, df):
    # Accept a single pattern or an iterable of alternatives.
    if not isinstance(s, str):
        s = '|'.join(str(x) for x in s)
    return df[(
        df
        [col]
        .str.lower()
        .str.contains(s)
    )]

def na(X):
    count = X.isna().sum()
    if len(X.shape) < 2:
        return count
    else:
        return count[lambda x: x > 0]

def perc(x):
    return np.round(x * 100, 2)

def vc(df):
    return df.value_counts(dropna=False).sort_index()

## 3.3 seq

In [None]:
import math
from typing import Union

Numeric = Union[int, float, np.number]

def seq(
        start: Numeric,
        stop: Numeric,
        step: Numeric = None) \
        -> np.ndarray:
    """Inclusive sequence."""

    if step is None:
        if start < stop:
            step = 1
        else:
            step = -1

    if is_int(start) and is_int(step):
        dtype = 'int'
    else:
        dtype = None

    d = max(n_dec(step), n_dec(start))
    n_step = math.floor(round(round(stop - start, d + 1) / step, d + 1)) + 1
    delta = np.arange(n_step) * step
    return np.round(start + delta, decimals=d).astype(dtype)

def is_int(
        x: Numeric) \
        -> bool:
    """Whether `x` is int."""
    return isinstance(x, (int, np.integer))

def n_dec(
        x: Numeric) \
        -> int:
    """No of decimal places, using `str` conversion."""
    if x == 0:
        return 0
    _, _, dec = str(x).partition('.')
    return len(dec)

## 3.4 Misc

In [None]:
def bin_interp(X, bins, interp=None):
    """Interpolate bin values."""

    idx = X.apply(lambda x: bin_val(x, bins))

    if interp == 'median':
        v = X.groupby(idx).median()
    elif interp == 'mean':
        v = X.groupby(idx).mean()
    elif interp == 'min':
        v = X.groupby(idx).min()
    elif interp == 'max':
        v = X.groupby(idx).max()
    else:
        return seq(0, len(bins))

    v = list(v)
    bin_vals = [v[0]] + v + [v[-1]]

    return bin_vals

def bin_val(x, bins, vals=None):
    """Map `x` to bin value."""

    if vals is None:
        vals = seq(0, len(bins))

    assert len(vals) == len(bins) + 1, 'len(vals) must equal len(bins) + 1'

    if np.isnan(x):
        return np.nan
    elif x < bins[0]:
        index = 0
    elif x == bins[0]:
        index = 1
    elif x == bins[-1]:
        index = -2
    elif x > bins[-1]:
        index = -1
    else:
        index = np.searchsorted(bins, x, side='right')

    return vals[index]

def count(col, traintest):
    """Map value counts."""

    def f(x):
        if pd.notna(x) and x in vc.index:
            return vc.loc[x]
        else:
            return np.nan

    vc = traintest.value_counts()

    return (
        col
        .apply(lambda x: f(x))
        .rename(traintest.name + '_count')
    )

def eq_attr(one, attr, *rest):
    return all(all(getattr(one, attr) == getattr(x, attr)) for x in rest)

def match(X, col, with_df):
    """Yes/no inner join."""

    return (
        X[col]
        .isin(with_df[col])
        .astype(np.uint8)
        .rename(with_df.index.name)
    )

def reorder(df, order=None):
    """Sort `df` columns by dtype and name."""

    def sort(df):
        return df.dtypes.reset_index().sort_values([0, 'index'])['index']
    if order is None:
        order = [np.floating, np.integer, 'category', 'object']
    names = [sort(df.select_dtypes(s)) for s in order]
    return df[[x for ls in names for x in ls]]

## 3.5 Preprocessing

In [2]:
from sklearn.model_selection import train_test_split

def load(csv):
    ycol = 'target'

    col_names = {
        'Survived': ycol,
        'Pclass': 'ticket_class',
        'Name': 'name',
        'Sex': 'sex',
        'Age': 'age',
        'SibSp': 'n_sib_sp',
        'Parch': 'n_par_ch',
        'Ticket': 'ticket',
        'Fare': 'fare',
        'Cabin': 'cabin',
        'Embarked': 'port',
    }

    exclude = [
        'PassengerId'
    ]

    dtype = {
        'Pclass': np.uint8,
        'Age': np.float32,
        'SibSp': np.uint8,
        'Parch': np.uint8,
        'Fare': np.float32,
    }

    df = reorder(
        pd.read_csv(
            csv,
            dtype=dtype,
            usecols=lambda x: x not in exclude,
        )
        .rename(columns=col_names)
    )

    if ycol in df.columns:
        return df.drop(columns=ycol), df[ycol]
    else:
        return df

def load_titanic():
    X, y = load(train_csv)
    test = load(test_csv)
    traintest = pd.concat([X, test])
    return X, y, test, traintest

def preprocess(pip):
    full_X, full_y, todo_test, todo_traintest = load_titanic()

    todo_X, todo_val_X, y, val_y \
        = train_test_split(
            full_X,
            full_y,
            test_size=VAL_SIZE,
            stratify=full_y,
            random_state=SEED
        )

    tr_y = full_y
    tr_X = pip.fit_transform(full_X, full_y)
    traintest = pip.transform(todo_traintest)

    X = pip.fit_transform(todo_X, y)
    val_X = pip.transform(todo_val_X)
    test = pip.transform(todo_test)

    return (
        reorder(X), y,
        reorder(val_X), val_y,
        reorder(tr_X), tr_y,
        reorder(test), reorder(traintest)
    )

## 3.6 Transformers

In [None]:
from sklearn.base import TransformerMixin


class Apply(TransformerMixin):
    def __init__(self, fn):
        self.fn = fn

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.apply(self.fn)


class AsType(TransformerMixin):
    def __init__(self, t):
        self.t = t

    def fit(self, X, y=None):
        if self.t == 'category':
            # Fix the categories seen at fit time so transform is consistent.
            self.dtype = pd.Categorical(X.unique()).dtype
        else:
            self.dtype = self.t
        return self

    def transform(self, X):
        return X.astype(self.dtype)


class ColMap(TransformerMixin):
    def __init__(self, trf):
        self.trf = trf

    def fit(self, X, y=None):
        self.trf_list = [self.trf().fit(col) for _, col in X.items()]
        return self
    
    def transform(self, X):
        cols = [t.transform(X.iloc[:, i]) for i, t in enumerate(self.trf_list)]
        return pd.concat(cols, axis=1)


class ColProduct(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.product(axis=1)


class ColQuot(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.iloc[:, 0] / X.iloc[:, 1]


class ColSum(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.sum(axis=1)


class Cut(TransformerMixin):
    def __init__(self, bins, interp=None):
        self.bins = bins
        self.interp = interp

    def fit(self, X, y=None):
        self.name = X.name
        self.vals = bin_interp(X, self.bins, self.interp)
        return self

    def transform(self, X):
        n = len(self.vals) - 2
        return (
            X
            .apply(lambda x: bin_val(x, self.bins, self.vals))
            .rename(f'{self.name}_cut{n}')
        )


class DataFrameUnion(TransformerMixin):
    def __init__(self, trf_list):
        self.trf_list = trf_list

    def fit(self, X, y=None):
        for t in self.trf_list:
            t.fit(X, y)
        return self

    def transform(self, X):
        return pd.concat([t.transform(X) for t in self.trf_list], axis=1)


class FillNA(TransformerMixin):
    def __init__(self, val):
        self.val = val

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.fillna(self.val)


class GetDummies(TransformerMixin):
    def __init__(self, drop_first=False):
        self.drop = drop_first

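    def fit(self, X, y=None):
        self.name = X.name
        # Remember the categories seen at fit time so that every transform
        # yields the same dummy columns.
        self.cat = pd.Categorical(X.unique()).dtype
        return self

    def transform(self, X):
        return pd.get_dummies(
            X.astype(self.cat),
            prefix=self.name,
            drop_first=self.drop,
            dtype=np.uint8,  # the older pandas default; newer versions return bool
        )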


class Identity(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class Map(TransformerMixin):
    def __init__(self, d):
        self.d = d

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.map(self.d)


class MeanEncode(TransformerMixin):
    def __init__(self, y):
        self.y = y

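    def fit(self, X, y=None):
        # Rank categories by mean target: the category with the lowest mean
        # is relabelled as the smallest category value, and so on.
        m = self.y.groupby(X).mean()
        keys = m.sort_values().index.values
        vals = m.index.values
        self.encode = {k: v for (k, v) in zip(keys, vals)}
        return self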

    def transform(self, X):
        return X.replace(self.encode)


class NADummies(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.isna().astype(np.uint8).rename(X.name + '_na')


class PdFunction(TransformerMixin):
    def __init__(self, fn):
        self.fn = fn

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return self.fn(X)


class QCut(TransformerMixin):
    def __init__(self, q, interp=None):
        self.q = q
        self.interp = interp

    def fit(self, X, y=None):
        _, self.bins = pd.qcut(X, self.q, retbins=True)
        self.bin_vals = bin_interp(X, self.bins, self.interp)
        return self

    def transform(self, X):
        return (
            X
            .apply(lambda x: bin_val(x, self.bins, self.bin_vals))
            .rename(f'{X.name}_qcut{self.q}')
        )


class Rename(TransformerMixin):
    def __init__(self, name):
        self.name = name

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.rename(self.name)


class SelectColumns(TransformerMixin):
    def __init__(self, include=None, exclude=None):
        self.include = include
        self.exclude = exclude

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.include:
            X = X[self.include]
        if self.exclude:
            return X.drop(columns=self.exclude)
        return X


class SelectDtypes(TransformerMixin):
    def __init__(self, include=None, exclude=None):
        self.include = include
        self.exclude = exclude

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.select_dtypes(include=self.include, exclude=self.exclude)


class StandardScaler(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        self.mean = X.mean()
        self.std = X.std(ddof=0)
        return self

    def transform(self, X):
        return (X - self.mean) / self.std

## 3.7 Leaderboard

In [None]:
def read_leaderboard():
    return (
        pd
        .read_csv(leaderboard_csv)
        .groupby('TeamId')
        .Score.max()
    )

def leaderboard_info():
    df = read_leaderboard()

    n = len(df)
    m = len(pd.read_csv(leaderboard_csv))
    print(f'{n} Teams, {m} submissions')

    mean = perc(df.mean())
    print(f'Mean: {mean}')

    std = perc(df.std())
    print(f'Stdev: {std}')

def leaderboard_percentiles(p=None):
    df = read_leaderboard()

    if p is None:
        p = seq(90, 10, step=-10)

    return pd.DataFrame({
        'Percentile': p,
        'Score': perc(np.percentile(df, p)),
    })

def plot_leaderboard(x=None):
    df = read_leaderboard()
    
    if x is None:
        x = seq(0, 100, step=0.1)
    y = np.percentile(df, q=x)

    plt.title('Leaderboard')
    plt.ylabel('Score (% Accuracy)')
    plt.xlabel('Percentile (%)')
    
    plt.plot(x, y*100)

# 4. Leaderboard

Raw leaderboard data from 10 May 2018.

A quick overview of the public leaderboard to get a feel for the competition.

In [None]:
leaderboard_info()

The raw data has multiple scores per team, while the public leaderboard shows best submits only.  We'll be looking at best submits.
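
A minimal sketch of that reduction in plain Python, using made-up rows rather than the real leaderboard file. `read_leaderboard` in section 3.7 does the same job with a pandas `groupby('TeamId').Score.max()`.

```python
# Hypothetical (TeamId, Score) rows; the real file has more columns and rows.
rows = [
    (1, 0.7655),
    (1, 0.7847),
    (2, 0.7703),
    (3, 0.6220),
    (3, 0.7655),
]

# Keep only the best score per team, like the public leaderboard does.
best = {}
for team_id, score in rows:
    best[team_id] = max(score, best.get(team_id, score))

print(len(best), 'teams,', len(rows), 'submissions')
```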

In [None]:
leaderboard_percentiles()

### Key Takeaways

- The gender baseline (76.55% acc) sits at the 30th percentile.
- This is a very small dataset, and the test set is especially small.
- The public leaderboard is calculated from 50% of 418 rows: that's 209 predictions.
- So, the difference between 30th percentile and 90th percentile is 8 people.
- The leaderboard metric is accuracy, but we'll be minimizing log loss: XGBoost needs a gradient + hessian, and accuracy isn't differentiable.
- Accuracy is a very chunky metric:
    - The minimum resolution of the public leaderboard is roughly 0.48% acc (1 person of 209).
    - Unlike log loss, (Bayesian) confidence isn't taken into account.
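
The arithmetic behind those numbers, as a quick sketch. The helper `people` is made up for illustration; it just converts an accuracy gap into a head count.

```python
n_test = 418                # rows in test.csv
n_public = n_test // 2      # 50% of the test set feeds the public leaderboard

# Smallest possible change in accuracy (% points): one prediction flips.
resolution = 100 / n_public

def people(acc_diff_pct):
    """Convert an accuracy gap (in % points) to a number of predictions."""
    return round(acc_diff_pct / resolution)

print(n_public, round(resolution, 2))   # 209 0.48
```

So a gap of about 3.8% acc on the public leaderboard is `people(3.8) == 8` passengers.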

Next, let's take a quick look at the full distribution of scores.

In [None]:
plot_leaderboard()

- Submitting floating point predictions instead of `int` will score 0.
- There's a big jump near the top.  Scores around 100% acc are probably using at least some hand labeling.
- Most scores are around 78% +/- 4 people.
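
Since float predictions score 0, probabilities have to be cut to `int` labels before submitting. A minimal sketch in plain Python that mirrors the `threshold` helper from section 3.1; the probabilities are made up.

```python
def to_labels(probs, pr=0.5):
    """Turn predicted survival probabilities into 0/1 integer labels."""
    return [int(p > pr) for p in probs]

probs = [0.91, 0.08, 0.50, 0.73]   # made-up model outputs
labels = to_labels(probs)
print(labels)
```

Note the strict `>`: a prediction of exactly `pr` maps to 0.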

# 5. Feature Engineering

Lead into gold.

## 5.1 Glossary
Features have been renamed as follows:
```
Survived  ->  target
Pclass    ->  ticket_class
Name      ->  name
Sex       ->  sex
Age       ->  age
SibSp     ->  n_sib_sp
Parch     ->  n_par_ch
Ticket    ->  ticket
Fare      ->  fare
Cabin     ->  cabin
Embarked  ->  port
```

## 5.2 Features

In order of importance (using `seed=0`), excluding complementary dummies (the full feature importance plot is in &sect; 7.4):

### surv (84)
- At least 1 person with the same ticket or surname survived.
- Restricted to groups that appear in both `train` and `test`.
- Combination of `tk_surv` and `sn_surv`.
- `tk_surv` (max): at least 1 person with the same ticket survived.
- `sn_surv` (max): at least 1 person with the same surname survived.
- `ticket` and `surname` groups don't completely overlap -> `na` values.
- 6 levels:
```
4  ->   1   1  ->  both tk_surv and sn_surv
3  ->   1  na  ->  at least 1
2  ->   1   0  ->  exactly 1
1  ->   0  na  ->  maybe 1
0  ->   0   0  ->  exactly 0
na ->  na  na  ->  unknown
```
- Credit to all the top kernels.  This is probably *the* key feature, and perhaps the only important feature.
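
The level table can be read as a small function of the two group flags. This is a standalone sketch in plain Python (the name `surv_level` is made up); the notebook's own version lives in `surv` in section 5.4 and works on pandas rows.

```python
import math

def surv_level(tk, sn):
    """Combine two group flags (1, 0 or NaN) into a surv level."""
    known = [v for v in (tk, sn) if not math.isnan(v)]
    if not known:
        return math.nan                      # na  na  ->  unknown
    if len(known) == 2:
        # both flags known: 1 1 -> 4, 1 0 -> 2, 0 0 -> 0
        return {2: 4, 1: 2, 0: 0}[int(sum(known))]
    # only one flag known: 1 na -> 3, 0 na -> 1
    return 3 if known[0] == 1 else 1

nan = math.nan
levels = [surv_level(a, b) for a, b in [(1, 1), (1, nan), (1, 0), (0, nan), (0, 0)]]
print(levels)
```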

### cabin_encode_h (60)
- Horizontal major cabin encoding (vertical slices); "major" as in "row major" vs "col major" matrices.
- `cabin_encode_h = cabin_no_encode + deck_encode / 10`
- `cabin_no_encode` is a hand labeled feature representing how close/far the cabin is from the front/back of the ship (explained below).
- `deck_encode` is a simple label encoding of deck A to G and T (explained below).
- Counterpart to `cabin_encode_v`.

### fare_quot (58)
- `fare_quot = fare / ticket_count`

### ticket_count (47)
- Number of people with the same ticket, across `train` + `test`.
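
A toy version of `ticket_count` and `fare_quot` in plain Python, with made-up rows standing in for `train` + `test`. It rests on the assumption that `fare` is recorded per ticket rather than per person, which is what makes the quotient a per-person price.

```python
from collections import Counter

# Made-up (ticket, fare) rows standing in for train + test combined.
passengers = [
    ('PC 17599', 71.28),
    ('113803', 53.10),
    ('113803', 53.10),
    ('373450', 8.05),
]

# Number of passengers sharing each ticket.
ticket_count = Counter(ticket for ticket, _ in passengers)

# Per-person fare: ticket fare divided by the size of the ticket group.
fare_quot = [fare / ticket_count[ticket] for ticket, fare in passengers]
print(fare_quot)
```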

### title_mr (42)
- Extracted from name.
- Includes rare titles such as `capt`, `col`, `don`.

### fare (41)
- As is.

### age_tc3_sex1 (32)
- 3rd class, female `age`
- Uses `age_mask`: filter `age` by `ticket_class` and `sex`; 0 or `na` otherwise.
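
A scalar sketch of the masking idea, in plain Python rather than pandas. The argument list is an assumption for illustration; the notebook's `age_mask` takes a DataFrame instead.

```python
import math

def age_mask(age, ticket_class, sex, tc, sx):
    """Age if the passenger is in class `tc` with sex code `sx`, else 0.

    A missing age (NaN) stays NaN either way, because NaN * 0 is NaN.
    """
    match = (ticket_class == tc) and (sex == sx)
    return age * int(match)

print(age_mask(26.0, 3, 1, tc=3, sx=1))   # 3rd class female: age kept
print(age_mask(26.0, 1, 1, tc=3, sx=1))   # 1st class female: masked to 0
```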

### age (32)
- As is.

### tk_age_mean (26)
- Average age of people with same ticket, across `train` + `test`.
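
A plain Python sketch of a per-ticket mean that skips missing ages, which is how a pandas `groupby(...).mean()` behaves by default. The rows are made up.

```python
import math
from collections import defaultdict
from statistics import mean

# Made-up (ticket, age) rows standing in for train + test combined.
rows = [
    ('113803', 35.0),
    ('113803', 37.0),
    ('373450', math.nan),
    ('PC 17599', 38.0),
]

# Collect the known ages per ticket.
ages = defaultdict(list)
for ticket, age in rows:
    if not math.isnan(age):
        ages[ticket].append(age)

def tk_age_mean(ticket):
    """Mean known age on `ticket`, or NaN if no age is known."""
    known = ages.get(ticket)
    return mean(known) if known else math.nan

print(tk_age_mean('113803'))
```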

### ticket_class_3 (25)
- `ticket_class` dummy

### sex (25)
- Label encoding:
    - `female -> 1`
    - `male   -> 0`

### tk_n_sib_sp_mean (25)
- Average `n_sib_sp` of people with the same ticket, across `train` + `test`.

### cabin_no_encode (22)
- Horizontal encoding of cabin number: how close/far from the front/back of the ship.
- Hand labeled feature using deckplans at Encyclopedia Titanica.
```
          /----------------\
Back   | V  IV  III  II  I >   Front
          \----------------/
```
- Diagram of the Titanic collision: https://commons.wikimedia.org/wiki/File:Titanic_porting_around_English.svg
- Part of `cabin_encode_v` and `cabin_encode_h`.

### tk_sex (21)
- Mean `sex` of people with the same ticket, across `train` and `test`.

### cabin_encode_v (21)
- Deck major cabin encoding (horizontal slices); "major" as in "row major" vs "col major" matrices.
- `cabin_encode_v = deck_encode + cabin_no_encode / 10`
- Counterpart to `cabin_encode_h`.
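
To see how the two encodings differ, here is a scalar sketch with a hypothetical cabin on deck C (`deck_encode = 5`) in zone 3 (`cabin_no_encode = 3`). The integer part carries the "major" axis, the first decimal carries the other one.

```python
def cabin_encode_v(deck_encode, cabin_no_encode):
    """Deck major: the deck is the integer part."""
    return deck_encode + cabin_no_encode / 10

def cabin_encode_h(deck_encode, cabin_no_encode):
    """Horizontal major: the front/back zone is the integer part."""
    return cabin_no_encode + deck_encode / 10

v = cabin_encode_v(5, 3)   # roughly 5.3
h = cabin_encode_h(5, 3)   # 3.5
print(round(v, 1), round(h, 1))
```

A tree split on `cabin_encode_v` therefore separates decks first, while a split on `cabin_encode_h` separates front/back zones first.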

### n_fam (16)
- `n_fam = n_par_ch + n_sib_sp`

### tc3_sex1 (16)
- Dummy: 3rd class, female.
- `tc_sex` are dummies, indicating `ticket_class` and `sex`
- No missing values, unlike `age_mask` features.

### tk_n_par_ch_mean (13)
- Mean `n_par_ch` of people with the same ticket, across `train` and `test`.

### port (11)
- `Embarked` -> rename to `port` -> label encode:
```
S -> 1
Q -> 2
C -> 3
```

### title_master (8)
- Extracted from `name`.

### deck_encode (6)
- Label encoding of `deck`, which is extracted from `cabin`.
```
T -> 8  (the top)
A -> 7
B -> 6
C -> 5
D -> 4
E -> 3
F -> 2
G -> 1  (the bottom)
```
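
A scalar sketch of the extraction plus mapping, using the standard `re` module instead of the pandas `.str.extract` call in section 5.4. Like that call, it takes the first capital letter it finds.

```python
import re

DECK_CODE = {'T': 8, 'A': 7, 'B': 6, 'C': 5, 'D': 4, 'E': 3, 'F': 2, 'G': 1}

def deck_encode(cabin):
    """First capital letter of `cabin` mapped to its deck code, or None."""
    if not isinstance(cabin, str):
        return None                      # missing cabin
    m = re.search(r'[A-Z]', cabin)
    return DECK_CODE.get(m.group()) if m else None

print(deck_encode('C85'), deck_encode('F G73'), deck_encode(None))
```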

### n_fam_2 (2)
- Polynomial feature: `n_fam_2 = (n_sib_sp + 1) * (n_par_ch + 1)`
- The idea is to treat `n_sib_sp` as a horizontal feature, and `n_par_ch` as a vertical feature, producing a sort of area feature.
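
A scalar sketch of the two family features, following the `n_fam` and `n_fam_2` functions in section 5.4, where each count is offset by 1 before multiplying.

```python
def n_fam(n_sib_sp, n_par_ch):
    """Family size on board, not counting the passenger."""
    return n_sib_sp + n_par_ch

def n_fam_2(n_sib_sp, n_par_ch):
    # The +1 offsets keep the product from collapsing to 0
    # whenever either count is 0.
    return (n_sib_sp + 1) * (n_par_ch + 1)

print(n_fam(1, 2), n_fam_2(1, 2))
print(n_fam(3, 0), n_fam_2(3, 0))
```

Two passengers with the same `n_fam` can still differ in `n_fam_2`: `(3, 0)` gives 4 while `(1, 2)` gives 6.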

### title_mrs (1)
- Extracted from `name`.
- Includes: `mme`, `the` (`Countess`), `dona`, `lady`.

### n_sib_sp (1)
- As is.

### title_miss (1)
- Extracted from `name`.
- Includes: `ms`, `mlle`.
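
A scalar sketch of the title logic in plain Python, using the same regex and groupings as `title` and `title_fill` in section 5.4. One simplification: any unknown title with a non-male `sex` falls back to `mrs`, whereas the notebook version raises an error if `sex` is missing.

```python
import re

RARE = {
    'ms': 'miss', 'mlle': 'miss',
    'mme': 'mrs', 'dona': 'mrs', 'lady': 'mrs', 'the': 'mrs',
    'capt': 'mr', 'col': 'mr', 'don': 'mr', 'jonkheer': 'mr',
    'major': 'mr', 'rev': 'mr', 'sir': 'mr',
}

def title(name, sex):
    """Title from 'Surname, Title. Given names', folded into 4 groups."""
    m = re.search(r', (\w+)', name.lower())
    t = m.group(1) if m else None
    if t in ('mr', 'mrs', 'miss', 'master'):
        return t
    if t in RARE:
        return RARE[t]
    return 'mr' if sex == 'male' else 'mrs'   # fallback, e.g. for 'dr'

print(title('Braund, Mr. Owen Harris', 'male'))
print(title('Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)', 'female'))
```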

## 5.3 Unused Features & Ideas

Excluded features aren't necessarily flawed; implementation matters.

### Included, but not used by XGBoost (seed = 0)
- `age_tc1_sex1`
- `n_par_ch`
- `ticket_class_1`
- `ticket_class_2`

### Excluded
- `ticket_no`: `uint` extracted from `ticket`.
    - Various binning strategies including hand labeling.
    - The deck plans suggest that ticket number is *not* correlated with cabin position.
    - Can be used to augment ticket/surname groups: e.g., extended family members have nearby ticket numbers.
- `ticket_prefix`: `str` extracted from `ticket`.
    - Some tickets have a prefix such as `PC` or `STON/O2`.
    - Some, such as `STON`, seem to correspond to the port of embarkation (Southampton).
- `age_cut` + `fare_cut`:
    - Binning by hand or by quantile (`qcut`).
- `sn_surv` + `tk_surv` (alone):
    - Variations such as `mean` and `min`.
    - Only a combined `max` is included.
- `tk_`: ticket group `min`, `max`, `count` for features such as `n_sib_sp`.  Only `mean` is included.
- `mother`, `father`, `child`:
    - Family position, and variations such as `tk_child` (ticket has child).
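
For anyone who wants to try `ticket_no` and `ticket_prefix`, here is one possible way to split a ticket string. This is a hypothetical sketch, not the code that was tried for this notebook; the function name and regex are made up.

```python
import re

def split_ticket(ticket):
    """Split a ticket string into (prefix, number); either may be None."""
    ticket = ticket.strip()
    # Optional prefix ending in a non-digit, whitespace, then trailing digits.
    m = re.fullmatch(r'(?:(.*\D)\s+)?(\d+)', ticket)
    if m is None:
        return ticket, None              # no trailing number at all
    return m.group(1), int(m.group(2))

print(split_ticket('PC 17599'))
print(split_ticket('STON/O2. 3101282'))
print(split_ticket('113803'))
```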

## 5.4 Functions

Implementation details.

Derived from:

1. `n_sib_sp` + `n_par_ch`
1. `cabin`
1. `name`
1. `sex`
1. `ticket`
1. interaction: multi column features

### SibSp + ParCh

In [None]:
def n_fam(X):
    return (
        (X.n_sib_sp + X.n_par_ch)
        .astype(np.uint8)
        .rename('n_fam')
    )

def n_fam_2(X):
    return (
        ((X.n_sib_sp+1) * (X.n_par_ch+1))
        .astype(np.uint8)
        .rename('n_fam_2')
    )

### Cabin

In [None]:
def cabin_encode_v(X):
    return (
        (deck_encode(X) + cabin_no_encode(X) / 10)
        .astype(np.float32)
        .rename('cabin_encode_v')
    )

def cabin_encode_h(X):
    return (
        (cabin_no_encode(X) + deck_encode(X) / 10)
        .astype(np.float32)
        .rename('cabin_encode_h')
    )

def cabin_no(X):
    return (
        X
        .cabin
        .str.extract(r'(\d+)', expand=False)
        .astype(np.float32)
        .rename('cabin_no')
    )

def cabin_no_encode(X):
    def encode(x):
        if x.deck == 'T':
            return 2
        elif np.isnan(x.cabin_no):
            return np.nan
        elif x.deck == 'A':
            if x.cabin_no >= 35:
                return 4
            else:
                return 2
        elif x.deck == 'B':
            if x.cabin_no >= 51:
                return 3
            else:
                return 2
        elif x.deck == 'C':
            if x.cabin_no % 2 == 0:
                if 92 <= x.cabin_no <= 102 or 142 <= x.cabin_no <= 148:
                    return 4
                elif 62 <= x.cabin_no <= 90 or 104 <= x.cabin_no <= 140:
                    return 3
                else:
                    return 2
            else:
                if 85 <= x.cabin_no <= 93 or 123 <= x.cabin_no <= 127:
                    return 4
                elif 55 <= x.cabin_no <= 83 or 95 <= x.cabin_no <= 121:
                    return 3
                else:
                    return 2
        elif x.deck == 'D':
            if x.cabin_no >= 51:
                return 5
            else:
                return 2
        elif x.deck == 'E':
            if x.cabin_no >= 91:
                return 5
            elif x.cabin_no >= 70:
                return 4
            elif x.cabin_no >= 26:
                return 3
            else:
                return 2
        elif x.deck == 'F':
            if x.cabin_no >= 46:
                return 1
            elif x.cabin_no >= 20:
                return 5
            else:
                return 4
        elif x.deck == 'G':
            return 5
    
    df = pd.concat([X.cabin, deck(X), cabin_no(X)], axis=1)
    return (
        df
        .apply(encode, axis=1)
        .astype(np.float32)
        .rename('cabin_no_encode')
    )

def deck(X):
    return (
        X
        .cabin
        .str.extract(r'([A-Z])', expand=False)
        .rename('deck')
    )

def deck_encode(X):
    return (
        deck(X)
        .map({
            'T': 8,
            'A': 7,
            'B': 6,
            'C': 5,
            'D': 4,
            'E': 3,
            'F': 2,
            'G': 1,
        })
        .astype(np.float32)
        .rename('deck_encode')
    )

def starboard(X):
    return (
        (np.round(cabin_no(X)) % 2 == 0)
        .astype(np.uint8)
        .rename('starboard')
    )

### Name

In [None]:
def surname(X):
    return (
        X
        .name
        .str.lower()
        .str.extract(r'([a-z]+),', expand=False)
    )

def title(X):
    return (
        X
        .name
        .str.lower()
        .str.extract(r', (\w+)', expand=False)
        .rename('title')
    )

def title_fill(X):
    def rare(row):
        if row.title in ['miss', 'mrs', 'master', 'mr']:
            return row.title
        elif row.title in d:
            return d[row.title]
        elif row.sex == 'male':
            return 'mr'
        elif row.sex == 'female':
            return 'mrs'
        else:
            raise ValueError('row.sex is missing / not in [`male`, `female`]')

    miss = ['ms', 'mlle']
    mrs = ['mme', 'dona', 'lady', 'the']
    mr = [
        'capt',
        'col',
        'don',
        'jonkheer',
        'major',
        'rev',
        'sir',
    ]

    d = {
        **{k: 'mr' for k in mr},
        **{k: 'mrs' for k in mrs},
        **{k: 'miss' for k in miss}
    }

    return (
        X
        .assign(title=title)
        .apply(rare, axis=1)
        .rename('title')
    )

### Sex

In [None]:
def sex(X):
    return (
        X
        .sex
        .map({'female': 1, 'male': 0})
        .astype(np.uint8)
    )

### Ticket

In [None]:
def ticket_count(X):
    _, _, _, traintest = load_titanic()
    return count(X.ticket, traintest.ticket).astype(np.uint8)

### Interaction

In [None]:
def age_mask(X, tc, sx):
    nm = f'age_tc{tc}_sex{sx}'
    return (X.age * (X.ticket_class == tc) * (sex(X) == sx)).rename(nm)

def fare_quot(X):
    return (
        (X.fare / ticket_count(X))
        .astype(np.float32)
        .rename('fare_quot')
    )

def tc_sex(X, tc, sx):
    return (
        ((X.ticket_class == tc) & (sex(X) == sx))
        .astype(np.uint8)
        .rename(f'tc{tc}_sex{sx}')
    )

def tk_fn(X, col, fn='mean'):
    _, _, _, traintest = load_titanic()
    vc = getattr(traintest[col].groupby(traintest.ticket), fn)()
    return (
        X
        .ticket
        .apply(lambda x: vc.loc[x])
        .astype(np.float32)
        .rename(f'tk_{col}_{fn}')
    )

def tk_sex(X):
    _, _, _, traintest = load_titanic()
    vc = sex(traintest).groupby(traintest.ticket).mean()
    return (
        X
        .ticket
        .apply(lambda x: vc.loc[x])
        .astype(np.float32)
        .rename('tk_sex')
    )

Finally, the all-important `surv` group of functions:

In [None]:
def surv(X):
    def encode(x):
        a = x.tk_surv_max
        b = x.sn_surv_max
        if a == 1 and b == 1:
            return 4
        elif a == 1 or b == 1:
            if a == 0 or b == 0:
                return 2
            else:
                return 3
        elif a == 0 or b == 0:
            if a == 0 and b == 0:
                return 0
            else:
                return 1
        else:
            return np.nan
    return (
        pd.concat([tk_surv(X), sn_surv(X)], axis=1)
        .apply(encode, axis=1)
        .astype(np.float32)
        .rename('surv')
    )

def sn_surv(X, fn='max'):
    tr, y, te, _ = load_titanic()
    v = getattr(y.groupby(surname(tr)), fn)()[lambda x: x.index.isin(surname(te))]
    return (
        surname(X)
        .map(v)
        .astype(np.float32)
        .rename(f'sn_surv_{fn}')
    )

def tk_surv(X, fn='max'):
    tr, y, te, _ = load_titanic()
    v = getattr(y.groupby(tr.ticket), fn)()[lambda x: x.index.isin(te.ticket)]
    return (
        X
        .ticket
        .map(v)
        .astype(np.float32)
        .rename(f'tk_surv_{fn}')
    )

# 6. Pipeline Preprocessing

<figure>
  <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/2010_mavericks_competition.jpg/640px-2010_mavericks_competition.jpg">
  <figcaption style="text-align: center;">
      Andrew Davis at Mavericks. Photograph by Shalom Jacobovitz.
      <br>via
       <a href="https://commons.wikimedia.org/wiki/File:2010_mavericks_competition.jpg">Wikimedia</a>
       (<a href="https://creativecommons.org/licenses/by-sa/2.0">CC BY-SA 2.0</a>)
  </figcaption>
</figure>

## 6.1 The Pipeline

Credit to Kevin Goetsch, Julie Michelman, and Tom Augspurger.

In [None]:
X_pipeline = DataFrameUnion([
    # age
    SelectColumns('age'),

    # fare
    SelectColumns('fare'),

    # n_par_ch + n_sib_sp
    SelectColumns('n_par_ch'),
    SelectColumns('n_sib_sp'),
    PdFunction(n_fam),
    PdFunction(n_fam_2),

    # ticket_class
    make_pipeline(
        SelectColumns('ticket_class'),
        GetDummies(),
    ),

    # cabin
    PdFunction(cabin_encode_v),
    PdFunction(cabin_encode_h),
    PdFunction(cabin_no_encode),
    PdFunction(deck_encode),

    # name -> title -> dummies
    make_pipeline(
        PdFunction(title_fill),
        GetDummies(),
    ),

    # port -> 1/2/3
    make_pipeline(
        SelectColumns('port'),
        Map({'S': 1, 'Q': 2, 'C': 3}),
        AsType(np.float32)
    ),

    # sex -> 0/1
    PdFunction(sex),

    # ticket -> count
    PdFunction(ticket_count),

    #
    # interaction #
    
    # fare / ticket_count -> fare_quot
    PdFunction(fare_quot),

    # age by sex/ticket_class
    PdFunction(partial(age_mask, tc=1, sx=1)),
    PdFunction(partial(age_mask, tc=2, sx=1)),
    PdFunction(partial(age_mask, tc=3, sx=1)),
    PdFunction(partial(age_mask, tc=1, sx=0)),
    PdFunction(partial(age_mask, tc=2, sx=0)),
    PdFunction(partial(age_mask, tc=3, sx=0)),

    # 0/1 by sex/ticket_class
    PdFunction(partial(tc_sex, tc=1, sx=1)),
    PdFunction(partial(tc_sex, tc=2, sx=1)),
    PdFunction(partial(tc_sex, tc=3, sx=1)),
    PdFunction(partial(tc_sex, tc=1, sx=0)),
    PdFunction(partial(tc_sex, tc=2, sx=0)),
    PdFunction(partial(tc_sex, tc=3, sx=0)),

    # ticket grouping
    PdFunction(surv),
    PdFunction(tk_sex),
    PdFunction(partial(tk_fn, col='age')),
    PdFunction(partial(tk_fn, col='n_par_ch')),
    PdFunction(partial(tk_fn, col='n_sib_sp')),
])

## 6.2 Execute

- Split `train`:
    - `train/val`: validation set `val_X, val_y`, using `VAL_SIZE`
    - `train/train`: proper train set `X, y`
- Full `train`: `tr_X, tr_y`
- Combined `train/test`: `traintest`
- `test`: as is

In [None]:
X, y, val_X, val_y, tr_X, tr_y, test, traintest = preprocess(X_pipeline)

## 6.3 Diagnostics

### Shape

In [None]:
X.shape

### Dtypes
Check for:
- Overflow
- Column names
- Floating point error
- Anything that looks funny

In [None]:
dtype_info(X)

Use function `vc` (value counts) to check individual columns.

### Train/Val/Test Parity

Check that each dataframe has the same dtypes and same columns, in the same order.

In [None]:
eq_attr(X, 'columns', val_X, tr_X, test, traintest) \
    and eq_attr(X, 'dtypes', val_X, tr_X, test, traintest)

### DMatrix
XGBoost's custom data format.

In [None]:
dtrain = xgb.DMatrix(X, y)
dval = xgb.DMatrix(val_X, val_y)
dfull = xgb.DMatrix(tr_X, tr_y)
dtest = xgb.DMatrix(test)

# 7. Holdout + CV

Stirring in an ad hoc fashion.

## 7.1 Parameters

Done by hand.  Here's a rough outline of what I tried:
```
eta:           0.1 -> 0.01 -> 0.005
gamma:              0 -> 1 -> 2   <- 3/5/10/20
max_depth:          3 -> 4 -> 5   <- 6/7/8/16/32
min_child_weight:        1 -> 1.6
subsample:               1 -> 0.9 <- 0.7/0.5/0.3
colsample_bytree:        1 -> 0.5 <- 0.9/0.3
lambda:  0 -> 1 -> 2 -> 32 -> 16
```

The narrative:

- `eta`: `0.01 -> 0.005`
    - My previous best model had fewer than 50 trees (`n=35`) -> `0.5x` learning rate -> `2x` trees -> `+1` on the public leaderboard.
    - Otherwise, most of my training was done at `eta=0.01`.
    - `eta=0.1` seems to train too quickly -> overfit too quickly.
- `gamma`: `1 -> 2`
    - Most of my training was done at `gamma=1`.
    - Used to combat overfitting.
- `max_depth`: `3 -> 4 -> 5`
    - Using such a high `max_depth` is probably suboptimal, given the overwhelming concern of overfitting.
    - Other `xgboost` kernels have had success with `max_depth=3`.
    - I suspect there's a small subset of this model that performs better; I think Konstantin's kernel is a pretty good indication of this.
    - My intuition was that a single tree requires 2 splits to isolate a single level of a label encoded column, such as `deck_encode`.  And, an interaction across 5 or so columns doesn't seem unreasonable.
- `min_child_weight`: `1 -> 1.6`
    - Using Owen Zhang's rule of thumb: `mcw = 1/sqrt(event_rate)`; with a survival rate of about 0.38, that gives roughly 1.6.
    - I didn't really deviate from `1.6`.
- `subsample`: `1 -> 0.9`
    - Owen Zhang recommends just `1`, but I thought a small amount (`0.9`) might help with overfitting.
- `colsample_bytree`: `1 -> 0.5`
    - Again following Owen Zhang; did not deviate from `0.5` very much.
    - `colsample=1` seems to cause `surv` to overfit; the model will refuse to use other (apparently) suboptimal columns.
- `lambda`: `1 -> 16`
    - I wanted a lot of regularization, and `gamma` seemed too heavy handed.
    - `lambda` seems to slow but not stop overfitting.
    - I used powers of 2: `1, 2, 4, 16, 32, 64` and values halfway between: `3, 10, 24, 48`.
- I also tried adjusting:
    - `scale_pos_weight`: `0.5` to `3.0` by `0.1`
    - `base_score`: `0.5 -> 0.4, 0.45, 0.49, 0.51, 0.55, 0.6`
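
As a quick check on the `min_child_weight` value in the list above: `1.6` is what `1/sqrt(event_rate)` gives at the training survival rate. The sketch assumes the usual counts from Kaggle's `train.csv` (342 survivors out of 891 passengers); the variable names are just for illustration.

```python
import math

# Assumption: counts from Kaggle's Titanic train.csv.
n_survived, n_total = 342, 891
event_rate = n_survived / n_total

# Rule-of-thumb starting point for min_child_weight.
mcw = 1 / math.sqrt(event_rate)
print(f"event_rate={event_rate:.3f}  mcw={mcw:.2f}")
```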

## Prototyping Examples

Here's a quick look at some parameter combinations:

### Defaults
Different implementations have different defaults.

In [None]:
_params = {
    'eta': 0.1,
    'gamma': 0,
    'max_depth': 3,
    'min_child_weight': 1,
    'subsample': 1,
    'colsample_bytree': 1,
    'lambda': 0,
    'eval_metric': ['error', 'logloss'],
    'objective': 'binary:logistic',
    'silent': 1,
    'seed': SEED,
}

In [None]:
_h = holdout(_params, n=200)

In [None]:
_cv = cv(_params, n=200)

### Zoom In

In [None]:
_params = {
    'eta': 0.025,
    'gamma': 0,
    'max_depth': 3,
    'min_child_weight': 1,
    'subsample': 1,
    'colsample_bytree': 1,
    'lambda': 0,
    'eval_metric': ['error', 'logloss'],
    'objective': 'binary:logistic',
    'silent': 1,
    'seed': SEED,
}

In [None]:
_h = holdout(_params, n=200)

In [None]:
_cv = cv(_params, n=200)

### Add some regularization

In [None]:
_params = {
    'eta': 0.025,
    'gamma': 1,
    'max_depth': 3,
    'min_child_weight': 1.6,
    'subsample': 1,
    'colsample_bytree': 0.5,
    'lambda': 1,
    'eval_metric': ['error', 'logloss'],
    'objective': 'binary:logistic',
    'silent': 1,
    'seed': SEED,
}

In [None]:
_h = holdout(_params, n=200)

In [None]:
_cv = cv(_params, n=200)

- At this point, I would look for a number of trees `n` where both holdout and CV error are low.
- For a long time, I wanted a model with low log loss, but I haven't been able to figure it out.
- I eventually spent most of my training with small variations of `eta=0.01`, `gamma=1`, `max_depth=5`, `mcw=1.6`, `subsample=0.9`, `colsample=0.5`, `lambda=16`.
- At various points, I removed features that were unused or almost unused (`f score=1`) by my then best models.  I tried not to do too much feature selection.

## The Final Model

Without further ado.

In [None]:
params = {
    'eta': 0.005,
    'gamma': 2,
    'max_depth': 5,
    'min_child_weight': 1.6,
    'subsample': 0.9,
    'colsample_bytree': 0.5,
    'lambda': 16,
    'eval_metric': ['error', 'logloss'],
    'objective': 'binary:logistic',
    'silent': 1,
    'seed': SEED,
}

## 7.2 Holdout

Train on `dtrain` and measure log loss and error on `dval`.

In [None]:
%%time
h = holdout(params, n=200)

In [None]:
plot_holdout_error(h, 0, 200)

## 7.3 CV

Train on `dfull` with `5x5` repeated stratified *k*-fold cross validation.

I also tried `StratifiedShuffleSplit` at test sizes ranging from 0.1 to 0.95, as well as 10-fold (*k*=10) cross validation.
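
For readers unfamiliar with the `5x5` scheme, here is a standard-library sketch of repeated stratified *k*-fold splitting. It is not the `cv` helper used in this notebook; the function name, the round-robin dealing, and the toy labels are assumptions for illustration. scikit-learn's `RepeatedStratifiedKFold` is the ready-made equivalent.

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(y, n_splits=5, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs, n_splits * n_repeats in total.

    Within each repeat, indices are shuffled per class and dealt round-robin
    into folds, so every fold keeps roughly the overall class balance.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for label in sorted(by_class):
            idx = by_class[label][:]
            rng.shuffle(idx)
            for j, i in enumerate(idx):
                folds[j % n_splits].append(i)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for f, fold in enumerate(folds) if f != k for i in fold)
            yield train, test

# Toy labels with a Titanic-like imbalance (assumed: 38 positives out of 100).
y = [1] * 38 + [0] * 62
splits = list(repeated_stratified_kfold(y))
print(len(splits), "train/test splits")
print("positives per test fold, first repeat:",
      [sum(y[i] for i in test) for _, test in splits[:5]])
```

Each repeat reshuffles before splitting, so the 25 error curves come from 5 different partitions of the same data rather than from one partition evaluated 5 times.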

In [None]:
%%time
cv_results = cv(params, n=200)

In [None]:
plot_cv_error(cv_results, 0, 200)

Candidates for early stopping include: `n = 65, 96, 105`.
- 96 and 105 predict the same values.
- 65 is 1 off on the public leaderboard.

Chasing the leaderboard, my choice is `n=96` trees.

## 7.4 Feature Importance

A quick look at single seed feature importance.  The final model averages across several random seeds.

In [None]:
z = train(params, n=96)

Unused columns:

In [None]:
X.columns[~X.columns.isin(z.get_fscore().keys())]

Built-in plotting:

In [None]:
_, ax = plt.subplots(1, 1, figsize=(13, 16))
xgb.plot_importance(z, ax=ax);

## 7.5 Trees

We can look at individual trees.

Kaggle's notebook display width is a bit narrow; use browser zoom-in for a more readable view.

### First 5 trees

In [None]:
xgb.to_graphviz(z, rankdir='LR', num_trees=0)

In [None]:
xgb.to_graphviz(z, rankdir='LR', num_trees=1)

In [None]:
xgb.to_graphviz(z, rankdir='LR', num_trees=2)

In [None]:
xgb.to_graphviz(z, rankdir='LR', num_trees=3)

In [None]:
xgb.to_graphviz(z, rankdir='LR', num_trees=4)

### Last 5 trees

In [None]:
xgb.to_graphviz(z, rankdir='LR', num_trees=z.best_ntree_limit-1)

In [None]:
xgb.to_graphviz(z, rankdir='LR', num_trees=z.best_ntree_limit-2)

In [None]:
xgb.to_graphviz(z, rankdir='LR', num_trees=z.best_ntree_limit-3)

In [None]:
xgb.to_graphviz(z, rankdir='LR', num_trees=z.best_ntree_limit-4)

In [None]:
xgb.to_graphviz(z, rankdir='LR', num_trees=z.best_ntree_limit-5)

## 7.6 Ensemble

- `ensemble` trains on several seeds and averages their probabilities (arithmetic mean).
    - using `SEED_LIST`, which does not include `SEED`.
    - both `subsample` and `colsample` are random.
    - sum log odds is an alternative to arithmetic mean (see [here](https://arbital.com/p/bayes_log_odds/)).
- `threshold` converts probabilities to `0/1` (`int`) for submission, using a strict greater than: `y_hat > pr -> 1, else 0`
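
To illustrate the two averaging options and the strict threshold, here is a small standalone sketch. It is not the notebook's `ensemble` or `threshold` code; `avg_prob`, `avg_log_odds`, `to_label`, and the per-seed probabilities are made up for the example. The log-odds version averages rather than sums, which gives the same `0/1` decision at a cutoff of `0.5`.

```python
import math

def avg_prob(ps):
    """Arithmetic mean of per-seed probabilities."""
    return sum(ps) / len(ps)

def avg_log_odds(ps):
    """Average in log-odds space, then map back with the logistic function.

    Requires every probability to be strictly between 0 and 1.
    """
    z = sum(math.log(p / (1 - p)) for p in ps) / len(ps)
    return 1 / (1 + math.exp(-z))

def to_label(p, pr=0.5):
    """Strict greater-than: p > pr -> 1, else 0."""
    return int(p > pr)

# Hypothetical per-seed probabilities for one passenger:
# one very confident seed and two mildly negative ones.
ps = [0.999, 0.2, 0.2]

print(avg_prob(ps), to_label(avg_prob(ps)))
print(avg_log_odds(ps), to_label(avg_log_odds(ps)))
```

On this input the two averages land on opposite sides of `0.5`, because log odds give much more weight to the one very confident seed.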

In [None]:
%%time
p = ensemble(params, n=96)
y_hat = threshold(p)

One sanity check is the predicted number of survivors.  Basic leaderboard probing shows that there are 156 survivors on the public leaderboard.

In [None]:
sum(y_hat)

- The few top kernels I checked are all biased toward 0.  This is probably an artifact of using the accuracy metric on an unbalanced (and fairly noisy) dataset, as opposed to using F1 or log loss.

- Thresholding to increase or decrease # of predicted survivors sometimes helps; whether it can be done in a principled and robust manner is a different matter.

- For reference, Konstantin's 0.83253 kernel predicts 134 survivors.

## 7.7 Submit

### Public LB: 0.82775

In [1]:
submit(y_hat, 'xgb')
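
For scale, a public score can be converted into a passenger count. The sketch below assumes the public leaderboard is scored on 209 passengers, as the `173-209` in this kernel's URL suggests; `N_PUBLIC` and `n_correct` are illustrative names.

```python
# Assumption: the public leaderboard covers 209 test passengers.
N_PUBLIC = 209

def n_correct(score, n=N_PUBLIC):
    """Number of correct predictions implied by an accuracy score."""
    return round(score * n)

print(n_correct(0.82775))      # this kernel
print(n_correct(0.83253))      # Konstantin's kernel
print(round(1 / N_PUBLIC, 5))  # score change per passenger
```

Under that assumption, one passenger is worth a little under 0.005, so a 0.01 step in score is about 2 passengers.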

# 8. Final Words

This is an interesting dataset with a lot of noise, but not so much noise that it's easy to luck into a good score, in my opinion.  I had a hard time crossing 0.78, then 0.79, 0.80, etc., even though each step is only a difference of 2 people.  My models were surprisingly stable in terms of peak score; I'm not sure whether that's a testament to XGBoost, or just an artifact of my approach, or just plain luck/false pattern matching.

I actually spent a lot of time trying to build out a robust and principled cross validation workflow. [Version 55](https://www.kaggle.com/numbersareuseful/titanic-starter-with-xgboost-173-209-top-2-lb?scriptVersionId=3452087) is an example of my attempt, using `RandomizedSearchCV`.  It was a complete failure.  Ultimately, I restarted from scratch, simplified my workflow, and changed tactics: focus on feature engineering + learn from other top kernels + switch to hand tuning.

At the end of the day, I'm not actually sure whether my model is underfitting or overfitting, and I don't have confidence in my model because the parameters were hand tuned in an ad hoc fashion.  I think parameter search is probably a necessary ingredient for robust and interpretable models, and I think there's a lot of room to build a better model or at least better justify (for or against) the choice of parameters that I'm using.  Model justification (i.e., confidence) is just as important as model performance, because generalization is the holy grail.

But, this is all I got.  Good luck!

**Questions, comments, criticism, tips & tricks all welcome!**