**LOAD LIBRARIES & DATA FILES**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from sklearn.metrics import roc_auc_score, accuracy_score, log_loss, confusion_matrix
from sklearn.cluster import KMeans
from scipy.stats import rankdata, distributions
from sklearn.model_selection import StratifiedKFold

import gc
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 20)
np.random.seed(1729)

TRAIN = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/train.csv')
TEST = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/test.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
DATA = pd.concat([TRAIN, TEST], ignore_index=True, sort=False)
features = [f for f in TRAIN.columns if f not in ['PassengerId','Survived']]
cat_features = []
del(TRAIN, TEST)
gc.collect()

Thanks for checking out our notebook!!  We built it as part of the Kaggle mentor program as a set of tips, tricks and methods for use on tabular data in a Kaggle competition.  We examine the data, do some feature engineering including four different types of categorical encoding and some clustering, do some feature selection and build five different model types on three different sets of data.  Then we ensemble the 15 models for a submission file.  We hope you find it useful!

# **1.0 FEATURE EXPLORATION**
In this section we'll go through some of the features of the data, fill in missing values, and create some new features.  All of the data we have here is categorical with the exception of the **Fare** feature.  Some of the features are integers where order matters (i.e. ordinal features), such as **Age** or **Pclass**, but we'll view them as categorical because each value is discrete from other values.  We'll also build some charts or graphs to visualize the feature data.

The goal is to explore each of the features to see which ones (or combinations of more than one) expose different survival rates.  More precisely, we'll be looking for statistically *significant* differences.  Of the 100,000 passengers in the train data, 42.774% of them survive.  We'll call this overall survival rate *S<sub>r</sub>*.  And,
$$S_r = .42774$$

**Confidence Interval**

Viewing the train dataset as a sample of the population of all 200,000 passengers (train & test), we can use a simple confidence interval on a sample survival rate to flag feature values whose survival rate differs significantly from the overall survival rate.  The confidence interval for a proportion $p'$ observed on $N$ rows is given as:

$$(p' - z_\alpha * \sqrt{\frac{p' * (1-p')}{N}} , p' + z_\alpha * \sqrt{\frac{p' * (1-p')}{N}})$$

With $z_\alpha = 3.0$, a 99.7% confidence interval on our overall survival rate would be:

$$S_r \pm 3.0 * \sqrt{\frac{S_r*(1-S_r)}{100,000}} = (0.42305, 0.43243)$$

The narrower 95% interval ($z_\alpha = 1.96$) is roughly (0.42467, 0.43081).  So to start with we will be especially interested in features or feature values that show us portions of the data where the survival rates are lower than about 42.47% or higher than about 43.08%.  This is just a starting point; we will also have to take into account how much data we have for each feature value.  But we'll do that later ...
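
As a quick check, the interval arithmetic above can be reproduced in a few lines.  This is only a sketch: the helper name `proportion_ci` is ours (it is not used elsewhere in the notebook), and it assumes the normal approximation from the formula above.

```python
import math

def proportion_ci(p_hat, n, z):
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

S_r = 0.42774   # overall train survival rate quoted above
N = 100_000     # number of train rows

lo95, hi95 = proportion_ci(S_r, N, z=1.96)    # roughly a 95% interval
lo997, hi997 = proportion_ci(S_r, N, z=3.0)   # roughly a 99.7% interval
print(f"95%:   ({lo95:.5f}, {hi95:.5f})")
print(f"99.7%: ({lo997:.5f}, {hi997:.5f})")
```

The same helper can be called with a feature value's own survival rate and row count, which is where the "how much data we have" caveat comes in: fewer rows means a wider interval.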

**High-Cardinality Categorical Variables**

This data contains a few high-cardinality categorical features.  That is, features that have a large number of values in relation to the size of the dataset.  For example, there are 174,854 unique **Name** values in the 200,000 rows of data.  Similarly, there are 132,613 **Ticket** values and 45,442 **Cabin** values.  Simply label encoding these can lead to overfitting when models predict for the unique combinations of these types of features.  To try and control this we'll "bin" these features' values into smaller blocks or clusters.
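
One way to spot such features is to compare the number of unique values with the number of rows.  The sketch below uses a tiny made-up frame; `cardinality_report` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def cardinality_report(df, cols):
    """Unique-value count and unique/rows ratio for each column in `cols`."""
    report = pd.DataFrame({"unique": df[cols].nunique()})
    report["ratio"] = report["unique"] / len(df)
    return report.sort_values("ratio", ascending=False)

# Toy data for illustration only (not the competition data)
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Sex": ["m", "f", "m", "f"],
})
report = cardinality_report(toy, ["Name", "Sex"])
print(report)
```

A ratio close to 1 means nearly every row has its own value, which is the situation described above for **Name**.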

## **1.1 NaNs & Sex**
One of the first features we'll look at is a simple count of the number of missing values in a row of the data.  We'll also take a quick look at the Sex of the passenger, which is a dominant indicator of survival.  According to Wikipedia, the phrase "women and children first" is a code of conduct dating from 1852, whereby the lives of women and children were to be saved first in a life-threatening situation, typically abandoning ship, when survival resources such as lifeboats were limited.  The phrase is, however, most famously associated with the sinking of RMS Titanic in 1912.  It certainly seems to apply here, although not as much with the children...

In [None]:
DATA['NanCount'] = DATA[features].isnull().sum(axis=1)
cat_features += ['NanCount','Sex']

# -------------------------------------------
fig = plt.figure(figsize=(16, 8), facecolor='white')
gs = fig.add_gridspec(1, 2)

# -------------------------------------------
X = 'NanCount'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = pd.concat([trn, tst], ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax0 = fig.add_subplot(gs[0, 0])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax0.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax0.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax0.set_ylabel('Rows', fontsize=12, weight='bold')
ax0.set_title(X+' & Survival Rates', fontsize=14, weight='bold')
ax0.set_xticks(x)
ax0.set_xticklabels(data[X], fontsize=12, weight='bold')
ax0.legend(fontsize=12)
ax0.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

# -------------------------------------------
X = 'Sex'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = pd.concat([trn, tst], ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax1 = fig.add_subplot(gs[0, 1])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax1.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax1.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax1.set_ylabel('Rows', fontsize=12, weight='bold')
ax1.set_title(X+' & Survival Rates', fontsize=14, weight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(data[X], fontsize=12, weight='bold')
ax1.legend(fontsize=12)
ax1.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)
fig.tight_layout()
plt.show()

## **Things to Note:**
* NanCounts 0 & 1 look like good indicators of survivability
* There are similar amounts of data for each NanCount in the train and test datasets
* Sex looks like a VERY good indicator of survivability
* There is some imbalance of sexes between train and test: there are noticeably more males in the test dataset than in the train dataset
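
The train/test summary that feeds each bar chart is the same for every feature, so it can be wrapped in a small helper.  This is a sketch on toy data; `survival_summary` is a name we made up, and it assumes (as the notebook does) that test rows are the ones with a missing `Survived` value.

```python
import numpy as np
import pandas as pd

def survival_summary(df, col, target="Survived"):
    """Train survival rate plus train/test row counts for each value of `col`."""
    is_train = df[target].notnull()
    train = df[is_train].groupby(col)[target].agg(SurvivalRate="mean", TrainRows="size")
    test = df[~is_train].groupby(col).size().rename("TestRows")
    out = train.join(test, how="outer")
    # Values seen in only one of the two datasets get a row count of 0
    out[["TrainRows", "TestRows"]] = out[["TrainRows", "TestRows"]].fillna(0).astype(int)
    return out.reset_index()

# Toy data for illustration only: the last two rows play the role of test rows
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, np.nan, np.nan],
})
summary = survival_summary(toy, "Sex")
print(summary)
```

The returned frame has the `SurvivalRate`, `TrainRows` and `TestRows` columns that the plotting code reads.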

## **1.2 Age**
We have a number of missing values in the Age column, so we'll fill the missing values with the overall mean Age.  There are also a number of "babies on board."  These are listed with ages by the tenth of a year (e.g., 0.4 or 0.6), which we will replace with zeros.  Age is an ordinal variable for everyone over the age of one.  This means someone who is 22.8 years old is recorded as being 22.  To make the age differences consistent we should make the babies all 0 years old.

We'll also do some binning based on Age Groups.  The Age groups are a bit arbitrary.  They are based on the usual definitions of infant, toddler, teenager, etc.  We also break the "Adults" into 18-39, 40-70 and 70+ groups based on looking at the changes in the survival rates of various ages.  The bins are named "AgeGrpX" so that when we label encode them they stay in order from youngest to oldest. The Groups are as follows:
* AgeGrp0 = 0-3 years
* AgeGrp1 = 4-10 years
* AgeGrp2 = 11-17 years
* AgeGrp3 = 18-39 years
* AgeGrp4 = 40-70 years
* AgeGrp5 = >70 years
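
The same bins can also be expressed with `pd.cut`, which avoids a row-by-row `apply`.  This is a sketch on made-up ages rather than the notebook's own approach; the labels follow the `AgeGrpN` strings that the notebook's function returns.

```python
import pandas as pd

# Ages chosen to sit on each bin boundary (illustrative values only)
ages = pd.Series([0, 3, 4, 10, 11, 17, 18, 39, 40, 70, 71])

# Upper edges are inclusive because pd.cut uses right-closed intervals by default
edges = [-float("inf"), 3, 10, 17, 39, 70, float("inf")]
labels = [f"AgeGrp{i}" for i in range(6)]

groups = pd.cut(ages, bins=edges, labels=labels)
print(pd.DataFrame({"Age": ages, "AgeGroup": groups}))
```

Because the labels sort alphabetically in the same order as the ages, label encoding them keeps the youngest-to-oldest order.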

In [None]:
def age_group(A):
    # Upper bounds are inclusive; chained checks also cover non-integer ages
    if A <= 3:
        return 'AgeGrp0'
    elif A <= 10:
        return 'AgeGrp1'
    elif A <= 17:
        return 'AgeGrp2'
    elif A <= 39:
        return 'AgeGrp3'
    elif A <= 70:
        return 'AgeGrp4'
    else:
        return 'AgeGrp5'

DATA['Age'] = DATA['Age'].fillna(DATA['Age'].mean())
DATA.loc[DATA['Age']<1, 'Age'] = 0
DATA['AgeGroup'] = DATA['Age'].apply(age_group)
cat_features += ['AgeGroup']

# -------------------------------------------
fig = plt.figure(figsize=(16, 8), facecolor='white')
gs = fig.add_gridspec(1, 2)
# -------------------------------------------
data = DATA.groupby(['Age'], as_index=False)['Survived'].mean()

ax0 = fig.add_subplot(gs[0, 0])
ax0.scatter(data['Age'],data['Survived'], s=25, color='navy')
ax0.set_xlabel('Age', fontsize=10, weight='bold')
ax0.set_ylabel('Survival Rate', fontsize=10, weight='bold')
ax0.set_title('Survival Rate by Age', fontsize=12, weight='bold')

# -------------------------------------------
X = 'AgeGroup'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = pd.concat([trn, tst], ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax1 = fig.add_subplot(gs[0, 1])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax1.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax1.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax1.set_ylabel('Rows', fontsize=10, weight='bold')
ax1.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(data[X], fontsize=10, weight='bold')
ax1.legend(fontsize=12)
ax1.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

fig.tight_layout()
plt.show()

## **Things to Note:**
* Age Groups 3 & 4 look useful for predicting the survivability
* But there are very different amounts of Group3 and Group4 between train and test

## **1.3 Family Sizes (SibSp and Parch)**
There is a lot to look at with the SibSp and Parch features and their combination into "Family Sizes."  In particular, it looks like those passengers travelling without any other family members (FamilySize==1) survive somewhat less frequently.

In [None]:
DATA['SibSp'] = np.clip(DATA['SibSp'],0,2)
DATA['Parch'] = np.clip(DATA['Parch'],0,2)
DATA['FamilySize'] = DATA['SibSp']+DATA['Parch']+1
cat_features += ['FamilySize','SibSp','Parch']

# -------------------------------------------
fig = plt.figure(figsize=(16, 16), facecolor='white')
gs = fig.add_gridspec(2, 2)

# -------------------------------------------
data = DATA.groupby(['SibSp','Parch'], as_index=False)['Survived'].mean()
cm = plt.get_cmap('RdYlBu')  # plt.cm.get_cmap was removed in matplotlib 3.9
ax0 = fig.add_subplot(gs[0, 0])
sc0 = ax0.scatter(data['SibSp'],data['Parch'], s=50, c=data['Survived'], cmap=cm)
ax0.set_xlabel('SibSp', fontsize=10, weight='bold')
ax0.set_ylabel('Parch', fontsize=10, weight='bold')
ax0.set_title('Survival Rate by Family', fontsize=12, weight='bold')
plt.colorbar(sc0) #, cax=ax0)

# -------------------------------------------
X = 'FamilySize'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = pd.concat([trn, tst], ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax1 = fig.add_subplot(gs[0, 1])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax1.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax1.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax1.set_ylabel('Rows', fontsize=10, weight='bold')
ax1.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(data[X], fontsize=10, weight='bold')
ax1.legend(fontsize=12)
ax1.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

# -------------------------------------------
X = 'SibSp'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = pd.concat([trn, tst], ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax2 = fig.add_subplot(gs[1, 0])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax2.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax2.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax2.set_ylabel('Rows', fontsize=10, weight='bold')
ax2.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(data[X], fontsize=10, weight='bold')
ax2.legend(fontsize=12)
ax2.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

# -------------------------------------------
X = 'Parch'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = pd.concat([trn, tst], ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax3 = fig.add_subplot(gs[1, 1])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax3.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax3.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax3.set_ylabel('Rows', fontsize=10, weight='bold')
ax3.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax3.set_xticks(x)
ax3.set_xticklabels(data[X], fontsize=10, weight='bold')
ax3.legend(fontsize=12)
ax3.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

fig.tight_layout()
plt.show()

## **Things to Note:**
* Travelling without family may indicate a lower survival rate
* There appear to be more passengers travelling with family in the test dataset than in the train dataset

## **1.4 Class & Embarked**
It certainly appears that class has a big impact on survivability.  Also, there appear to be many more third-class passengers in the test data than there are in the train data.  This could mean that the test data will have a lower overall survivability rate than the train data.

There are a few missing values in the Embarked feature and we'll fill those with S since that is where the vast majority of passengers embarked from.  It also appears that leaving from Southampton (Embarked==S) has a much lower survival rate than leaving from other locations.

In [None]:
DATA['Embarked'] = DATA['Embarked'].fillna('S')
cat_features += ['Embarked','Pclass']

# -------------------------------------------
fig = plt.figure(figsize=(16, 8), facecolor='white')
gs = fig.add_gridspec(1, 2)

# -------------------------------------------
X = 'Pclass'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = pd.concat([trn, tst], ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax1 = fig.add_subplot(gs[0, 0])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax1.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax1.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax1.set_ylabel('Rows', fontsize=10, weight='bold')
ax1.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(data[X], fontsize=10, weight='bold')
ax1.legend(fontsize=12)
ax1.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

# -------------------------------------------
X = 'Embarked'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = pd.concat([trn, tst], ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax2 = fig.add_subplot(gs[0, 1])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax2.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax2.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax2.set_ylabel('Rows', fontsize=10, weight='bold')
ax2.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(data[X], fontsize=10, weight='bold')
ax2.legend(fontsize=12)
ax2.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

fig.tight_layout()
plt.show()

## **Things to Note:**
* These both look like good features for indicating survival
* But again, there are very different amounts of third-class passengers between train and test

## **1.5 Cabin Numbers**
We can do a couple of things with the Cabin Number, even though only about 30% of all passengers have a cabin (and it is certainly advantageous to have a cabin).  61,303 out of the 200,000 passengers have a cabin number.  There are 45,442 distinct cabin numbers and each one begins with a letter.  For the Titanic data this letter indicated the Deck that the cabin was on, and there are 8 decks (A-G and T) in the Synthanic data.  We'll pull the cabin number apart a bit to provide some "bins" of cabins and see what that shows.

We'll break the cabin number into three pieces: Deck (first letter), Section (thousands portion of the numerical part), and Room (last three numerical digits).  We'll call the full numeric part of the cabin the CabinNumber.  We'll also create a binary feature called "HasCabin."
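
The Deck / Section / Room split can also be sketched with a regular expression and integer arithmetic.  The cabin strings below are made up for illustration, and this is an alternative to string slicing rather than the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values; NaN stands for a passenger without a cabin
cabins = pd.Series(["C19828", "A9127", np.nan, "D17243"])

# One leading letter (Deck) followed by the numeric part (CabinNumber)
parts = cabins.str.extract(r"^(?P<Deck>[A-Za-z])(?P<CabinNumber>\d+)$")

deck = parts["Deck"].fillna("Z")              # 'Z' marks "no cabin"
number = pd.to_numeric(parts["CabinNumber"])  # float column, NaN where missing
section = number // 1000                      # thousands portion
room = number % 1000                          # last three digits

print(pd.DataFrame({"Deck": deck, "CabinNumber": number,
                    "CabinSection": section, "CabinRoom": room}))
```

Using `// 1000` and `% 1000` also gives a Section of 0 for a number with fewer than four digits, a case where slicing the string would leave an empty piece.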

In [None]:
DATA['Deck'] = DATA['Cabin'].apply(lambda x: 'Z' if pd.isnull(x) else x[0])
DATA['CabinNumber'] = DATA['Cabin'].apply(lambda x: np.nan if pd.isnull(x) else int(x[1:]))
DATA['CabinSection'] = DATA['Cabin'].apply(lambda x: np.nan if pd.isnull(x) else int(x[1:-3]))
DATA['CabinRoom'] = DATA['Cabin'].apply(lambda x: np.nan if pd.isnull(x) else int(x[-3:]))
DATA['HasCabin'] = DATA['Deck'].apply(lambda x: 'Cabin' if x!='Z' else 'NoCabin')
cat_features += ['Deck','CabinSection','CabinRoom','HasCabin']

# -------------------------------------------
fig = plt.figure(figsize=(16, 16), facecolor='white')
gs = fig.add_gridspec(2, 2)

# -------------------------------------------
X = 'HasCabin'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = pd.concat([trn, tst], ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax0 = fig.add_subplot(gs[0, 0])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax0.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax0.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax0.set_ylabel('Rows', fontsize=10, weight='bold')
ax0.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax0.set_xticks(x)
ax0.set_xticklabels(data[X], fontsize=10, weight='bold')
ax0.legend(fontsize=12)
ax0.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

# -------------------------------------------
X = 'Deck'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = pd.concat([trn, tst], ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax1 = fig.add_subplot(gs[0, 1])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax1.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax1.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax1.set_ylabel('Rows', fontsize=10, weight='bold')
ax1.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(data[X], fontsize=10, weight='bold')
ax1.legend(fontsize=12)
ax1.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

# -------------------------------------------
X = 'CabinSection'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = pd.concat([trn, tst], ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax2 = fig.add_subplot(gs[1, 0])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax2.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax2.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax2.set_ylabel('Rows', fontsize=10, weight='bold')
ax2.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(data[X], fontsize=10, weight='bold')
ax2.legend(fontsize=12)
ax2.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

# -------------------------------------------
X = 'CabinRoom'
data = DATA.groupby([X], as_index=False)['Survived'].mean()
ax3 = fig.add_subplot(gs[1, 1])
ax3.scatter(data[X],data['Survived'], s=25, color='navy')
ax3.set_xlabel(X, fontsize=10, weight='bold')
ax3.set_ylabel('Survival Rate', fontsize=10, weight='bold')
ax3.set_title('Survival Rate by Cabin', fontsize=12, weight='bold')

fig.tight_layout()
plt.show()

## **Things to Note:**
* Having a Cabin looks like it's good for survival
* The actual numeric part of the cabin number doesn't look too valuable at the moment but the Deck may be of use

## **1.6 Fare**

Fare is really the only continuous feature in the data and the values range from 0.00 to 744.60 with a few missing values.  The first thing we will do is try to fill in the missing fare values.  Let's look at the distribution of fare values based on where the passenger embarked from, the class of ticket they have and whether or not they have a cabin, all things we would assume to impact the fare prices.

In [None]:
fig = plt.figure(figsize=(16, 16), facecolor='white')
gs = fig.add_gridspec(2, 2)
ax0 = fig.add_subplot(gs[0, 0])
ax0.hist(DATA['Fare'], bins=100, color='navy')
ax0.set_xlabel('Fare', fontsize=10, weight='bold')
ax0.set_ylabel('Rows', fontsize=10, weight='bold')
ax1 = fig.add_subplot(gs[0, 1])
sns.violinplot(data=DATA, x='HasCabin', y='Fare', hue='Survived', split=True, inner="quart", linewidth=1, palette={0: 'grey', 1: 'navy'})
ax2 = fig.add_subplot(gs[1, 0])
sns.violinplot(data=DATA, x='Embarked', y='Fare', hue='Survived', split=True, inner="quart", linewidth=1, palette={0: 'grey', 1: 'navy'})
ax3 = fig.add_subplot(gs[1, 1])
sns.violinplot(data=DATA, x='Pclass', y='Fare', hue='Survived', split=True, inner="quart", linewidth=1, palette={0: 'grey', 1: 'navy'})
fig.tight_layout()
plt.show()


These charts are a little hard to read since there are some extreme fare values.  A good way to adjust a distribution with some extreme values like this is to take the Log of all the values.  Let's see how this changes the visualizations.

In [None]:
DATA['LNFare'] = DATA['Fare'].map(lambda x: np.log(x) if x > 0 else 0)
fig = plt.figure(figsize=(16, 16), facecolor='white')
gs = fig.add_gridspec(2, 2)
ax0 = fig.add_subplot(gs[0, 0])
ax0.hist(DATA['LNFare'], bins=100, color='navy')
ax0.set_xlabel('Log(Fare)', fontsize=10, weight='bold')
ax0.set_ylabel('Rows', fontsize=10, weight='bold')
ax1 = fig.add_subplot(gs[0, 1])
sns.violinplot(data=DATA, x='HasCabin', y='LNFare', hue='Survived', split=True, inner="quart", linewidth=1, palette={0: 'grey', 1: 'navy'})
ax2 = fig.add_subplot(gs[1, 0])
sns.violinplot(data=DATA, x='Embarked', y='LNFare', hue='Survived', split=True, inner="quart", linewidth=1, palette={0: 'grey', 1: 'navy'})
ax3 = fig.add_subplot(gs[1, 1])
sns.violinplot(data=DATA, x='Pclass', y='LNFare', hue='Survived', split=True, inner="quart", linewidth=1, palette={0: 'grey', 1: 'navy'})
fig.tight_layout()
plt.show()


OK, it looks like there are some differences in the Fare prices based on Embarkation, Cabin and Class.  So we'll use those three variables to fill in the missing fares.

The raw fare values look like they have 4 main cluster points (the peaks in the histogram).  So we'll use the log value (so the extreme values don't skew the clusters towards the high end) and do a quick k-means clustering to build 4 bins of fare values then look at survival rates in each bin.
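
The two steps just described (group-mean fill, then clustering the log fares and numbering the clusters from cheapest to most expensive) can be sketched on toy data.  The values below are made up, and the sketch uses 2 clusters and a fixed `random_state` only to keep the example small and repeatable.

```python
import numpy as np
import pandas as pd
from scipy.stats import rankdata
from sklearn.cluster import KMeans

# Toy data for illustration only (not the competition data)
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "C", "C", "C"],
    "Pclass":   [3, 3, 3, 1, 1, 1],
    "HasCabin": ["NoCabin"] * 3 + ["Cabin"] * 3,
    "Fare":     [8.0, 12.0, np.nan, 90.0, np.nan, 110.0],
})

# Fill each missing fare with the mean fare of its (Embarked, Pclass, HasCabin) group
group_mean = toy.groupby(["Embarked", "Pclass", "HasCabin"])["Fare"].transform("mean")
toy["Fare"] = toy["Fare"].fillna(group_mean)

# Log transform, mapping non-positive fares to 0 as the notebook does
toy["LNFare"] = toy["Fare"].map(lambda x: np.log(x) if x > 0 else 0)

# Cluster the log fares, then relabel clusters by the rank of their centers
km = KMeans(n_clusters=2, n_init=10, random_state=0)
raw_labels = km.fit_predict(toy[["LNFare"]])
center_rank = rankdata(km.cluster_centers_.ravel())  # 1 = lowest center
toy["FareCluster"] = center_rank[raw_labels].astype(int)
print(toy)
```

Relabelling by center rank matters because k-means assigns cluster ids in no particular order; after the relabel, a higher id always means a higher typical fare.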

In [None]:
faremeans = DATA.groupby(['Embarked','Pclass','HasCabin'], as_index=False)['Fare'].mean()
farecats = list(map(lambda x,y,z: (x,y,z), faremeans['Embarked'],faremeans['Pclass'],faremeans['HasCabin']))
faredict = dict(zip(farecats, faremeans['Fare'].tolist()))
DATA['Fare'] = list(map(lambda w,x,y,z: z if pd.notnull(z) else faredict[(w,x,y)], DATA['Embarked'],DATA['Pclass'],DATA['HasCabin'],DATA['Fare']))
del(faremeans, farecats)
DATA['LNFare'] = DATA['Fare'].map(lambda x: np.log(x) if x > 0 else 0)

model = KMeans(n_clusters=4)
clusters = model.fit_predict(np.array(DATA['LNFare']).reshape(-1,1))
cluster_dict = dict(zip([0,1,2,3],rankdata([x[0] for x in model.cluster_centers_])))
DATA['FareCluster'] = list(map(lambda x: cluster_dict[x], clusters))
cat_features += ['FareCluster']

# -------------------------------------------

fig = plt.figure(figsize=(16, 8), facecolor='white')
gs = fig.add_gridspec(1, 2)

ax0 = fig.add_subplot(gs[0, 0])
bin_list = np.linspace(DATA['Fare'].min(),DATA['Fare'].max(),100)
ax0.hist(DATA.loc[DATA['FareCluster']==1,'Fare'], bins=bin_list, color='navy')
ax0.hist(DATA.loc[DATA['FareCluster']==2,'Fare'], bins=bin_list, color='blue')
ax0.hist(DATA.loc[DATA['FareCluster']==3,'Fare'], bins=bin_list, color='lightsteelblue')
ax0.hist(DATA.loc[DATA['FareCluster']==4,'Fare'], bins=bin_list, color='slategrey')
ax0.set_xlabel('Fare', fontsize=10, weight='bold')
ax0.set_ylabel('Rows', fontsize=10, weight='bold')

# -------------------------------------------
X = 'FareCluster'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = pd.concat([trn, tst], ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax1 = fig.add_subplot(gs[0, 1])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax1.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax1.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax1.set_ylabel('Rows', fontsize=10, weight='bold')
ax1.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(data[X], fontsize=10, weight='bold')
ax1.legend(fontsize=12)
ax1.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

fig.tight_layout()
plt.show()

## **Things to Note:**
* Fare Clusters look to have some value in predicting survival
* There is some imbalance between train and test datasets for the fare

## **1.7 Ticket**
The ticket numbers are interesting.  A few have only a text value, most have only a numeric value, and some have both a text and numeric value.  We also have 9,804 missing values across the train and test datasets.  We will break the ticket value into its text portion, which we'll call "TicketType", and its numeric portion, which we'll call "TicketNumber".  Then we'll bin the TicketType a bit more by just looking at the first letter of the "Type" and call that the "TicketCat."  We'll also bin the TicketNumbers based on their length (from 4-9 digits) and call it "TicketLen."
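
The text/number split can be sketched in a single function.  `split_ticket` is a hypothetical helper and the ticket strings are made-up examples; unlike the notebook's functions it does not merge spelling variants of the same prefix.

```python
import math

def split_ticket(ticket):
    """Split a raw ticket string into (prefix, number); either may be None."""
    if ticket is None or (isinstance(ticket, float) and math.isnan(ticket)):
        return None, None
    tokens = str(ticket).upper().split()
    if not tokens:
        return None, None
    # A leading token that starts with a letter is treated as the text prefix
    prefix = tokens[0].replace(".", "") if tokens[0][0].isalpha() else None
    # A trailing all-digit token is treated as the ticket number
    number = int(tokens[-1]) if tokens[-1].isdigit() else None
    return prefix, number

examples = ["A/5 21171", "12345", "LINE", "STON/O2. 3101282", float("nan")]
parsed = [split_ticket(t) for t in examples]
for raw, (prefix, number) in zip(examples, parsed):
    print(repr(raw), "->", prefix, number)
```

Keeping the number as an integer means `len(str(number))` counts digits only; on a float column the string form also includes a trailing `.0`.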

In [None]:
def ticket_type(t):
    T = str(t).upper().strip()
    T = T.split()
    T = [str(x).strip() for x in T]
    if T[0]=='NAN':
        return np.nan
    elif T[0][0].isalpha():
        T = T[0].replace('.','')
        T = T.replace('SOTON','STON')
        T = T.replace('A/4','A4')
        T = T.replace('A/5','A5')
        T = T.replace('A/S','A5')
        T = T.replace('CA5TON','CA/STON')
        return T
    else:
        return np.nan

def ticket_number(t):
    T = str(t).upper().strip()
    T = T.split()
    T = [str(x).strip() for x in T]
    if T[0].isnumeric():
        return int(T[0])
    elif (T[0][0].isalpha())&(len(T)==1):
        return np.nan
    else:
        return int(T[1])

DATA['TicketType'] = DATA['Ticket'].apply(ticket_type)
DATA['TicketNumber'] = DATA['Ticket'].apply(ticket_number)
DATA['TicketType'] = DATA['TicketType'].fillna('NoType')
DATA['TicketNumber'] = DATA['TicketNumber'].fillna(-1)
DATA['TicketCat'] = DATA['TicketType'].apply(lambda x: x[0])
DATA['TicketLen'] = DATA['TicketNumber'].apply(lambda x: len(str(x)))
cat_features += ['TicketType','TicketCat','TicketLen']

# -------------------------------------------

fig = plt.figure(figsize=(16, 16), facecolor='white')
gs = fig.add_gridspec(2, 2)

# -------------------------------------------
X = 'TicketType'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = trn.append(tst, ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax0 = fig.add_subplot(gs[0, 0])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax0.barh(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax0.barh(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax0.set_xlabel('Rows', fontsize=10, weight='bold')
ax0.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax0.set_yticks(x)
ax0.set_yticklabels(data[X], fontsize=10, weight='bold')
ax0.legend(fontsize=12)
ax0.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

# -------------------------------------------
X = 'TicketCat'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = trn.append(tst, ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax0 = fig.add_subplot(gs[0, 1])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax0.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax0.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax0.set_ylabel('Rows', fontsize=10, weight='bold')
ax0.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax0.set_xticks(x)
ax0.set_xticklabels(data[X], fontsize=10, weight='bold')
ax0.legend(fontsize=12)
ax0.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

# -------------------------------------------
X = 'TicketLen'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = trn.append(tst, ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax0 = fig.add_subplot(gs[1, 0])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax0.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax0.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax0.set_ylabel('Rows', fontsize=10, weight='bold')
ax0.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax0.set_xticks(x)
ax0.set_xticklabels(data[X], fontsize=10, weight='bold')
ax0.legend(fontsize=12)
ax0.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)
# -------------------------------------------

fig.tight_layout()
plt.show()




## **Things to Note:**
* Passengers without a TicketType look like they may survive more often than those with one
* Low (4-digit) TicketNumbers and high (9-digit) TicketNumbers may be useful for modeling

## **1.8 Name**
Because there are so many distinct **Name** values, we'll first break the name into first and last names and then *frequency* encode them by replacing each name with the number of rows of data that have that first or last name.
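
As a minimal sketch of frequency encoding, assuming a toy DataFrame with invented names in the same "Last, First" format (not the competition data), each name part can be replaced by its row count with `value_counts` and `map`:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({'Name': ['Smith, Anna', 'Smith, John', 'Jones, Anna', 'Lee, Anna']})

# Split "Last, First" into its two parts.
parts = toy['Name'].str.split(',', n=1, expand=True)
toy['LastName'] = parts[0].str.strip()
toy['FirstName'] = parts[1].str.strip()

# Frequency encoding: replace each value with the number of rows that share it.
toy['FirstNameFreq'] = toy['FirstName'].map(toy['FirstName'].value_counts())
toy['LastNameFreq'] = toy['LastName'].map(toy['LastName'].value_counts())

print(toy[['FirstName', 'FirstNameFreq', 'LastName', 'LastNameFreq']])
```

This should give the same counts as a `groupby(...).size()` lookup, just without a row-by-row `apply`.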

In [None]:
DATA['FirstName'] = DATA['Name'].apply(lambda x: x.split(',')[1].strip())
DATA['LastName'] = DATA['Name'].apply(lambda x: x.split(',')[0].strip())

firsts = DATA.groupby(['FirstName']).size()
DATA['FirstNameFreq'] = DATA['FirstName'].apply(lambda x: firsts[x])
lasts = DATA.groupby(['LastName']).size()
DATA['LastNameFreq'] = DATA['LastName'].apply(lambda x: lasts[x])
cat_features += ['FirstNameFreq','LastNameFreq']

# -------------------------------------------

fig = plt.figure(figsize=(16, 8), facecolor='white')
gs = fig.add_gridspec(1, 2)

# -------------------------------------------
X = 'FirstNameFreq'
data = DATA.groupby([X], as_index=False)['Survived'].mean()
ax0 = fig.add_subplot(gs[0, 0])
ax0.scatter(data[X],data['Survived'], s=25, color='navy')
ax0.set_xlabel(X, fontsize=10, weight='bold')
ax0.set_ylabel('Survival Rate', fontsize=10, weight='bold')
ax0.set_title('Survival Rate by FirstNameFreq', fontsize=12, weight='bold')

# -------------------------------------------
X = 'LastNameFreq'
data = DATA.groupby([X], as_index=False)['Survived'].mean()
ax1 = fig.add_subplot(gs[0, 1])
ax1.scatter(data[X],data['Survived'], s=25, color='navy')
ax1.set_xlabel(X, fontsize=10, weight='bold')
ax1.set_ylabel('Survival Rate', fontsize=10, weight='bold')
ax1.set_title('Survival Rate by LastNameFreq', fontsize=12, weight='bold')

# -------------------------------------------

fig.tight_layout()
plt.show()



It looks like these name features don't get us much, except that very common (high-frequency) first names have a low survival rate.  So let's bin the first name frequencies and see if we can use them.

In [None]:
model = KMeans(n_clusters=2)
clusters = model.fit_predict(np.array(DATA['FirstNameFreq']).reshape(-1,1))
cluster_dict = dict(zip([0,1],rankdata([x[0] for x in model.cluster_centers_])))
DATA['FirstNameCluster'] = list(map(lambda x: cluster_dict[x], clusters))
cat_features += ['FirstNameCluster']

# -------------------------------------------

fig = plt.figure(figsize=(16, 8), facecolor='white')
gs = fig.add_gridspec(1, 2)

# -------------------------------------------
X = 'FirstNameCluster'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = trn.append(tst, ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax0 = fig.add_subplot(gs[0, 0])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax0.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax0.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax0.set_ylabel('Rows', fontsize=10, weight='bold')
ax0.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax0.set_xticks(x)
ax0.set_xticklabels(data[X], fontsize=10, weight='bold')
ax0.legend(fontsize=12)
ax0.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)
# -------------------------------------------

fig.tight_layout()
plt.show()

## **Things to Note:**
* Names don't seem to have much value except for the FirstNameCluster

# **2.0 FEATURE COMBINATIONS**


## **2.1 Cabin Number & Ticket Number**
A significant number of the passengers have no Cabin number.  Let's see if there's a relationship between Ticket Numbers and Cabin Numbers.  Maybe we can use Ticket to fill in Cabin.

In [None]:
data = DATA[(DATA['CabinNumber'].notnull())&(DATA['TicketNumber']!=-1)].copy()

# -------------------------------------------

fig = plt.figure(figsize=(16, 16), facecolor='white')
gs = fig.add_gridspec(2, 2)

# -------------------------------------------
ax0 = fig.add_subplot(gs[0, 0])
ax0.ticklabel_format(style='plain')
ax0.scatter(data['TicketNumber'],data['CabinNumber'], s=5, color='navy')
ax0.set_xlabel('TicketNumber', fontsize=10, weight='bold')
ax0.set_ylabel('CabinNumber', fontsize=10, weight='bold')
ax0.set_title('TicketNumber vs. CabinNumber', fontsize=12, weight='bold')

# -------------------------------------------
ax1 = fig.add_subplot(gs[0, 1])
ax1.ticklabel_format(style='plain')
ax1.scatter(data['TicketNumber'],data['Deck'], s=5, color='navy')
ax1.set_xlabel('TicketNumber', fontsize=10, weight='bold')
ax1.set_ylabel('Deck', fontsize=10, weight='bold')
ax1.set_title('TicketNumber vs. Deck', fontsize=12, weight='bold')
# -------------------------------------------

fig.tight_layout()
plt.show()


Well, there doesn't seem to be much of a relationship between the two.

## **2.2 Cabin Population by Class**


In [None]:
cabinpops = pd.pivot_table(data=DATA, index='Cabin', columns='Pclass', values='PassengerId', aggfunc='count')
cabinpops['CabinPop'] = cabinpops.sum(axis=1)
cabinpops.reset_index(drop=False, inplace=True)
cabinpops.fillna(0, inplace=True)
cabinpops.rename(columns={1:'CabinPop1',2:'CabinPop2',3:'CabinPop3'}, inplace=True)
DATA = DATA.merge(cabinpops, on=['Cabin'], how='left')
DATA['CabinClassCount'] = (DATA[['CabinPop1','CabinPop2','CabinPop3']]>0).sum(axis=1)
DATA['MeanCabinClass'] = np.round((DATA['CabinPop1'] + (DATA['CabinPop2']*2) + (DATA['CabinPop3']*3))/DATA['CabinPop'],1)
for c in ['CabinPop1','CabinPop2','CabinPop3','CabinPop','MeanCabinClass']:
    DATA[c].fillna(-1, inplace=True)
cat_features += ['CabinPop','CabinClassCount','MeanCabinClass']

# -------------------------------------------

fig = plt.figure(figsize=(16, 16), facecolor='white')
gs = fig.add_gridspec(2, 2)

# -------------------------------------------
X = 'CabinClassCount'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = trn.append(tst, ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax0 = fig.add_subplot(gs[0, 0])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax0.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax0.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax0.set_ylabel('Rows', fontsize=10, weight='bold')
ax0.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax0.set_xticks(x)
ax0.set_xticklabels(data[X], fontsize=10, weight='bold')
ax0.legend(fontsize=12)
ax0.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

# -------------------------------------------
X = 'CabinPop'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = trn.append(tst, ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax1 = fig.add_subplot(gs[0, 1])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax1.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax1.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax1.set_ylabel('Rows', fontsize=10, weight='bold')
ax1.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(data[X], fontsize=10, weight='bold')
ax1.legend(fontsize=12)
ax1.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

# -------------------------------------------
X = 'MeanCabinClass'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = trn.append(tst, ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax2 = fig.add_subplot(gs[1, 0])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax2.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax2.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax2.set_ylabel('Rows', fontsize=10, weight='bold')
ax2.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(data[X], fontsize=10, weight='bold')
ax2.legend(fontsize=12)
ax2.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)
# -------------------------------------------

fig.tight_layout()
plt.show()


## **2.3 Multi-Sex Cabins**

In [None]:
def cabin_sex(p):
    if p==0:
        return 'male'
    elif p==1:
        return 'female'
    else:
        return 'mixed'


DATA['female'] = (DATA['Sex']=='female').astype(int)
femper = DATA.groupby(['Cabin'], as_index=False)['female'].mean()
femper.rename(columns={'female':'CabinFemalePerc'}, inplace=True)
DATA = DATA.merge(femper, on=['Cabin'], how='left')
DATA['CabinSex'] = DATA['CabinFemalePerc'].apply(cabin_sex)
cat_features += ['CabinSex']

# -------------------------------------------
fig = plt.figure(figsize=(16, 8), facecolor='white')
gs = fig.add_gridspec(1, 2)

# -------------------------------------------
X = 'CabinSex'
trn = DATA[DATA['Survived'].notnull()].groupby([X])['Survived'].agg(['mean','size'])
trn.columns = ['SurvivalRate','TrainRows']
trn.reset_index(drop=False, inplace=True)
tst = DATA[DATA['Survived'].isnull()].groupby([X])['Survived'].agg(['mean','size'])
tst.columns = ['SurvivalRate','TestRows']
tst.reset_index(drop=False, inplace=True)
data = trn.append(tst, ignore_index=True)
data = data.groupby(X, as_index=False).sum()
ax0 = fig.add_subplot(gs[0, 0])
x = np.arange(len(data[X]))  # the label locations
width = 0.35  # the width of the bars
trbars = ax0.bar(x - width/2, data['TrainRows'], width, label='Train Data', color='navy', edgecolor='grey')
tebars = ax0.bar(x + width/2, data['TestRows'], width, label='Test Data', color='lightgrey',edgecolor='black')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax0.set_ylabel('Rows', fontsize=10, weight='bold')
ax0.set_title(X+' & Survival Rates', fontsize=12, weight='bold')
ax0.set_xticks(x)
ax0.set_xticklabels(data[X], fontsize=10, weight='bold')
ax0.legend(fontsize=12)
ax0.bar_label(trbars, labels=np.round(data['SurvivalRate'],2), padding=3)

# -------------------------------------------

fig.tight_layout()
plt.show()

## **2.4 Cabin Prices**


In [None]:
cabinfare = pd.pivot_table(data=DATA, index='Cabin', columns='Pclass', values='Fare', aggfunc='sum')
cabinfare.rename(columns={1:'CabinPrice1',2:'CabinPrice2',3:'CabinPrice3'}, inplace=True)
cabinfare['CabinPrice'] = cabinfare.sum(axis=1)
cabinfare['CabinFareRange'] = cabinfare[['CabinPrice1','CabinPrice2','CabinPrice3']].max(axis=1)-cabinfare[['CabinPrice1','CabinPrice2','CabinPrice3']].min(axis=1)
cabinfare.reset_index(drop=False, inplace=True)
cabinfare.fillna(0, inplace=True)
DATA = DATA.merge(cabinfare, on=['Cabin'], how='left')
DATA['MeanCabinFare'] = DATA['CabinPrice']/DATA['CabinPop']

# -------------------------------------------
fig = plt.figure(figsize=(16, 16), facecolor='white')
gs = fig.add_gridspec(2, 2)

# -------------------------------------------
ax0 = fig.add_subplot(gs[0, 0])
ax0.hist(DATA['CabinPrice'], bins=100, color='navy')
ax0.set_xlabel('CabinPrice', fontsize=10, weight='bold')
ax0.set_ylabel('Rows', fontsize=10, weight='bold')
# -------------------------------------------
ax1 = fig.add_subplot(gs[0, 1])
ax1.hist(DATA['MeanCabinFare'], bins=100, color='navy')
ax1.set_xlabel('MeanCabinFare', fontsize=10, weight='bold')
ax1.set_ylabel('Rows', fontsize=10, weight='bold')
# -------------------------------------------
ax2 = fig.add_subplot(gs[1, 0])
ax2.hist(DATA['CabinFareRange'], bins=100, color='navy')
ax2.set_xlabel('CabinFareRange', fontsize=10, weight='bold')
ax2.set_ylabel('Rows', fontsize=10, weight='bold')
# -------------------------------------------
fig.tight_layout()
plt.show()

# **3.0 FEATURE SETS & CATEGORICAL ENCODING OPTIONS**
We've done a bit of feature engineering while we explored each feature, mainly through binning & clustering the values.  Now we'll build three datasets that we can use for modeling.  The first dataset will consist of binary features only and the second will be a smaller set of the most impactful features, label-encoded instead of binary.  The third set will use the same features as the label-encoded set but we will target-encode them instead of label encoding.


## **3.1 Binary Features**
For this set of data we'll make use of the power of confidence intervals and see which ones will really get us the most bang-for-the-buck.  What we want to do here is build binary features for each of the feature values that are the most indicative of survival.  The thought being that a limited number of specific feature values will be best at predicting without overfitting.

To do this we will be VERY conservative, and abandon statistical rigor a bit.  We want to be careful with feature values that are rare in the dataset but have extreme survivability rates.  For example, Deck==E has a survival rate of 62% but it's less prevalent than some of the other Deck values and may not be predictive because it doesn't occur as frequently.  To handle this we'll adjust the survival rate for each feature value by using a weighted average with the overall survival rate, *S<sub>r* = 0.42774.  The more rows of data we have for a feature value, the more weight we give to that value's actual survival rate.  We'll base the weights on the following sigmoid curve, which says that we only fully trust the feature-value survival rate if there are at least 5,000 rows of train data with that feature value.

In [None]:
sig = pd.DataFrame({'Rows':range(5000), 'Weight':np.zeros(5000)})
sig['Weight'] = sig['Rows'].apply(lambda n: 1.0/(1 + np.exp( (2500-n)/500 )))

fig = plt.figure(figsize=(16, 8), facecolor='white')
X = 'FirstNameFreq'
plt.plot(sig['Rows'],sig['Weight'], color='red', linewidth=2.0)
plt.xlabel('Rows', fontsize=14, weight='bold')
plt.ylabel('Weight for Feature Value Rate', fontsize=14, weight='bold')
plt.title('Survival Rate Weighting', fontsize=16, weight='bold')
plt.tight_layout()
plt.show()

Continuing with the Deck==E example, we see that in the train dataset Deck==E occurs 1,749 times with a mean survival rate of 61.7496%.  The above curve says we weight Deck E's survival rate at 0.1821 and weight the overall rate, *S<sub>r*, with 1-0.1821 = 0.8179.  This means we are estimating the actual survival rate for Deck==E as (0.1821 * 0.617496)+(0.8179 * 0.42774) = 0.46229.

Now we check to see if a confidence interval around the new estimated survival rate for Deck==E overlaps the 99.7% confidence interval we calculated for *S<sub>r* , (0.42305, 0.43243).  This is where we abandon statistical rigor a bit since this isn't really how one calculates a confidence interval on an adjusted estimate (i.e., the 0.46229 survival rate).  And again, we will be VERY conservative by using a 99.7% confidence interval on the adjusted estimate.  The formulas remain the same as those in section 1.0.

For Deck==E this gives an interval of (0.42653, 0.49806).  Because this interval overlaps with the *S<sub>r* interval, we cannot say that the survival rate for Deck==E is significantly different from the overall survival rate, so we do not treat Deck==E as an indicator of a greater chance of survival.
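
The Deck==E arithmetic above can be reproduced with a few lines of plain Python.  This is only a sketch: the helper names `shrunk_rate` and `interval` are made up for illustration, and the inputs are the figures quoted in the text.

```python
import math

S_R = 0.42774        # overall survival rate quoted in the text
N_TRAIN = 100_000    # rows in the train data
Z = 3.0              # roughly a 99.7% interval, as used in this notebook

def shrunk_rate(rate, n, prior=S_R, midpoint=2500, scale=500):
    """Blend a level's raw survival rate with the prior using the sigmoid weight."""
    weight = 1.0 / (1.0 + math.exp((midpoint - n) / scale))
    return weight, weight * rate + (1.0 - weight) * prior

def interval(p, n, z=Z):
    """Normal-approximation interval around a proportion."""
    half = z * math.sqrt(p * (1.0 - p) / n)
    return p - half, p + half

weight, p = shrunk_rate(rate=0.617496, n=1749)   # Deck==E figures from the text
lower, upper = interval(p, 1749)
lower_y, upper_y = interval(S_R, N_TRAIN)

# A level becomes a binary feature only if the two intervals do NOT overlap.
significant = (upper_y < lower) or (lower_y > upper)
print(round(weight, 4), round(p, 5), (round(lower, 5), round(upper, 5)), significant)
```

Because the sketch keeps the weight unrounded, the last digit of the estimate can differ slightly from the hand calculation, which uses the rounded weight 0.1821.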

So let's do this for all of the features we've identified and see which ones are the most significant.

In [None]:
train = DATA[DATA['Survived'].notnull()].copy()
train.reset_index(drop=True, inplace=True)

bin_features = []
mean_y = train['Survived'].mean()
lower_y = mean_y - 3.0*np.sqrt(mean_y*(1-mean_y)/100000)
upper_y = mean_y + 3.0*np.sqrt(mean_y*(1-mean_y)/100000)
print('MEAN TARGET: ',mean_y)
print('CONFIDENCE INTERVAL FOR MEAN: (',np.round(lower_y,5),' to ',np.round(upper_y,5),')')
for C in cat_features:
    lvls = sorted(train[C].unique().tolist())
    for lvl in lvls:
        n  = train[train[C]==lvl].shape[0]
        if n==0:
            continue
        s  = train.loc[train[C]==lvl, 'Survived'].mean()
        wt = 1.0/(1 + np.exp( (2500-n)/500 ))
        p = wt*s + (1-wt)*mean_y
        upper = p + 3*np.sqrt(p*(1-p)/n)
        lower = p - 3*np.sqrt(p*(1-p)/n)
        if (upper_y < lower)|(lower_y > upper):
            bin_col = C+'_'+str(lvl)
            DATA[bin_col] = (DATA[C]==lvl).astype(int)
            bin_features += [bin_col]
            print(C+' = '+str(lvl)+' has estimated Survival Rate of:', np.round(p,5))
del(train)
gc.collect()

## **3.2 Label-Encoded Features**
For this dataset we'll be conservative again, and a bit subjective, and pull out the basic features that show up the most in the set of binary features we identified above.  Looking at the list of binary features we can see the following show up repeatedly:
* **Sex** shows up on its own and as part of CabinSex.  It is also the single most predictive feature for survival.
* **Pclass** shows up on its own and as part of MeanCabinClass and CabinClassCount, so we'll keep Pclass but not the Cabin-Class features.
* **Embarked** shows up on its own so we'll keep that.
* **Cabin** shows up as part of several binary features but usually in regard to whether or not a passenger has a cabin.  For example the HasCabin feature is the same as Deck==Z.  So for this dataset we'll keep the **Deck** feature.
* **Parch & SibSp** Parch shows up on its own while SibSp shows up as part of the FamilySize feature.  So for this dataset we'll keep the **FamilySize** feature.
* **Name** The only part of the name that we found usable in the binary features is the **FirstNameCluster** so we will keep that.
* **Ticket**  TicketType and TicketCat show up, but it's apparent that having a TicketType is really the driver of these.  So we'll keep the **TicketCat**, where TicketCat==N means no TicketType.
* **Age**  A couple of the AgeGroup levels show up, so we'll keep **AgeGroup**.
* **Fare**  The **FareCluster** shows up so we'll keep that.

Now let's label encode the features we have chosen, but in a way that gives them a more "ordinal" encoding.  To do this we'll sort the feature levels by the survival rate of each level and encode them in that order.  For example, if we were to apply the sklearn LabelEncoder to the Sex feature, "female" would be encoded as 0 while "male" would be encoded as 1.  This is because the sklearn LabelEncoder sorts the feature values and encodes them in (in this case) alphabetical order.  But females have a much higher survival rate than males, and we want the level with the lowest survival rate to get the lowest code and progress up from there.  So male should get the lower code and female the higher one.  (The code below uses dense ranks, which start at 1 rather than 0, so male becomes 1 and female becomes 2.)

For tree-based models this sorted encoding is probably not necessary since the relative sizes of the encoded value matter less to the tree-splitting.  But for other model types the order & size of the encoded values do matter.
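
To make the difference concrete, here is a small sketch on invented toy data (not the competition data) that compares the survival-rate ordering with a plain alphabetical encoding.  The column names `Sex_L` and `Sex_alpha` are only illustrative.

```python
import pandas as pd

# Toy data, invented for illustration: females survive more often than males.
toy = pd.DataFrame({
    'Sex':      ['male', 'male', 'male', 'female', 'female', 'female'],
    'Survived': [0,      0,      1,      1,        1,        0],
})

# Mean target per level, then dense-rank the means so the lowest rate gets 1.
means = toy.groupby('Sex', as_index=False)['Survived'].mean()
means['Sex_L'] = means['Survived'].rank(method='dense').astype(int)
toy = toy.merge(means[['Sex', 'Sex_L']], on='Sex', how='left')

# Plain alphabetical codes for comparison (what a sorted label encoding gives).
toy['Sex_alpha'] = toy['Sex'].astype('category').cat.codes

print(toy)
```

In this toy frame the rank-based column puts male below female, while the alphabetical codes put female below male.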

In [None]:
best_features = ['Sex','Pclass','Embarked','Deck','FamilySize','FirstNameCluster','TicketCat','AgeGroup','FareCluster']
mean_y = DATA['Survived'].mean()

for f in best_features:
    means = DATA.groupby(f, as_index=False)['Survived'].mean()
    means.fillna(mean_y, inplace=True)
    means[f+'_L'] = means['Survived'].rank(method='dense')
    DATA = DATA.merge(means[[f, f+'_L']], on=f, how='left')

label_features = [f+'_L' for f in best_features]

## **3.3 Target-Encoded Features**
For target encoding you replace the values of the categorical features with the mean of the target feature for each categorical value.  For example, the mean of the target feature (Survived) is 0.2058 for all males (i.e. Sex==male), so we would replace "male" with 0.2058 in the "Sex" feature.  We need to be careful when doing this as it can easily lead to overfitting, especially for rare categorical values.  We're going to do one thing that could increase overfitting, but then also do a few things to mitigate any overfitting.

Increasing the overfitting risk, we'll separate the males from the females since their survival rates differ so much.  This essentially doubles the number of categorical values we need to replace with the target mean.

To be conservative against overfitting we'll use the same sigmoid-weighted average method we used in section 3.1 for identifying binary features, here pulling each estimate toward the mean survival rate of its sex (the code below uses a midpoint of 1,250 rows and a scale of 250).  In short, we'll use the adjusted survival estimates as the numeric value for each categorical value.  And we'll use an out-of-fold method to do the encoding.

For these features we'll use the LNFare feature instead of the FareCluster because after target-encoding we'll have data that looks more continuous than categorical, and the Fare feature is the only real continuous variable in the data.
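
The out-of-fold idea can be sketched on synthetic data as follows.  This is a simplified illustration, not the notebook's exact procedure: it uses a single overall prior instead of per-sex priors, random fold assignment instead of StratifiedKFold, and hypothetical helper names (`shrunk_means`, `oof_target_encode`).

```python
import numpy as np
import pandas as pd

def shrunk_means(df, col, target, prior, midpoint=1250, scale=250):
    """Per-level target mean, pulled toward `prior` when the level has few rows."""
    stats = df.groupby(col)[target].agg(['mean', 'count'])
    weight = 1.0 / (1.0 + np.exp((midpoint - stats['count']) / scale))
    return weight * stats['mean'] + (1.0 - weight) * prior

def oof_target_encode(train, col, target, n_folds=5, seed=0):
    """Encode each fold using statistics computed from the other folds only."""
    fold_rng = np.random.default_rng(seed)
    fold = fold_rng.integers(n_folds, size=len(train))
    prior = train[target].mean()
    encoded = pd.Series(np.nan, index=train.index)
    for k in range(n_folds):
        in_fold = fold == k
        lookup = shrunk_means(train[~in_fold], col, target, prior)
        encoded.loc[in_fold] = train.loc[in_fold, col].map(lookup).to_numpy()
    # Levels unseen in the other folds fall back to the prior.
    return encoded.fillna(prior)

# Synthetic data, for illustration only.
toy_rng = np.random.default_rng(1)
toy = pd.DataFrame({'Deck': toy_rng.choice(list('ABC'), size=3000)})
true_rate = toy['Deck'].map({'A': 0.7, 'B': 0.4, 'C': 0.2})
toy['Survived'] = (toy_rng.random(3000) < true_rate).astype(int)

toy['Deck_T'] = oof_target_encode(toy, 'Deck', 'Survived')
print(toy.groupby('Deck')['Deck_T'].mean())
```

Each row's encoded value never sees that row's own target, which is the point of doing the encoding out-of-fold.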


In [None]:
best_features.remove('FareCluster')
best_features.remove('Sex')
train = DATA[DATA['Survived'].notnull()].copy()
train.reset_index(drop=True, inplace=True)
test  = DATA[DATA['Survived'].isnull()].copy()
test.reset_index(drop=True, inplace=True)
mean_male = train.loc[train['Sex']=='male','Survived'].mean()
mean_female = train.loc[train['Sex']=='female','Survived'].mean()
priors = {'male':mean_male, 'female':mean_female}

for f in best_features:
    # Encode the test data based on all of training data
    means = train.groupby(['Sex',f])['Survived'].agg(['mean','count'])
    means.reset_index(drop=False, inplace=True)
    # Weight by each level's own row count (not a leftover scalar n)
    means['weight'] = 1.0/(1 + np.exp( (1250-means['count'])/250 ))
    means['prior_mean'] = means['Sex'].map(priors)
    means[f+'_T'] = means['weight']*means['mean'] + (1-means['weight'])*means['prior_mean']
    test = test.merge(means[['Sex',f,f+'_T']], on=['Sex',f], how='left')
    
    # Encode the training data
    encoded = pd.DataFrame()
    skf = StratifiedKFold(n_splits=10)
    for train_idx, valid_idx in skf.split(train, train['Sex_female']):
        trn = train.loc[train_idx,['Sex',f,'Survived']]
        val = train.loc[valid_idx,['PassengerId','Sex',f]]
        
        means = trn.groupby(['Sex',f])['Survived'].agg(['mean','count'])
        means.reset_index(drop=False, inplace=True)
        # Weight by each level's own row count within the training folds
        means['weight'] = 1.0/(1 + np.exp( (1250-means['count'])/250 ))
        means['prior_mean'] = means['Sex'].map(priors)
        means[f+'_T'] = means['weight']*means['mean'] + (1-means['weight'])*means['prior_mean']
        val = val.merge(means[['Sex', f, f+'_T']], on=['Sex',f], how='left')
        encoded = encoded.append(val, ignore_index=True, sort=False)
    
    encoded = encoded.append(test[['PassengerId','Sex',f,f+'_T']], ignore_index=True, sort=False)
    DATA = DATA.merge(encoded[['PassengerId',f+'_T']], on=['PassengerId'], how='left')
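    # fillna returns a new object, so assign the result back
    is_male = DATA['Sex']=='male'
    is_female = DATA['Sex']=='female'
    DATA.loc[is_male, f+'_T'] = DATA.loc[is_male, f+'_T'].fillna(mean_male)
    DATA.loc[is_female, f+'_T'] = DATA.loc[is_female, f+'_T'].fillna(mean_female)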

del(means,train, test, encoded)
gc.collect()

tgtenc_features = [f+'_T' for f in best_features]+['LNFare']


# **4.0 MODEL FUNCTIONS**
In this section we're going to create a couple of models and build them as functions that we can call with different data and feature sets.

In [None]:
import lightgbm as lgb
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

#------------------------------------------------------------------------------
def add_folds(trn, numfolds):
    # .copy() so each fold column is added to an independent frame, not a slice of trn
    temp0 = trn[(trn['Survived']==0)&(trn['Sex']=='female')].copy()
    temp0['fold'] = np.random.randint(numfolds, size=temp0.shape[0])
    temp1 = trn[(trn['Survived']==0)&(trn['Sex']=='male')].copy()
    temp1['fold'] = np.random.randint(numfolds, size=temp1.shape[0])
    temp2 = trn[(trn['Survived']==1)&(trn['Sex']=='female')].copy()
    temp2['fold'] = np.random.randint(numfolds, size=temp2.shape[0])
    temp3 = trn[(trn['Survived']==1)&(trn['Sex']=='male')].copy()
    temp3['fold'] = np.random.randint(numfolds, size=temp3.shape[0])
    trn = temp0.append(temp1, ignore_index=True)
    trn = trn.append(temp2, ignore_index=True)
    trn = trn.append(temp3, ignore_index=True)
    trn.sort_values(['PassengerId'], inplace=True)
    trn.reset_index(drop=True, inplace=True)
    return trn

#------------------------------------------------------------------------------
def lgbm_model(train, test, features, param_dict, subpreds):
    MAX_ROUNDS = 10000
    STOP_ROUNDS = 50
    VERBOSE_EVAL = 500
    
    mean_score = 0.0
    test_preds = np.zeros(test.shape[0])
    oof_preds  = np.zeros(train.shape[0])
    import_scores = np.zeros(len(features))
    for fold in range(5):
        trn_X = np.array(train.loc[train['fold']!=fold, features])
        trn_y = np.array(train.loc[train['fold']!=fold, 'Survived'])
        val_X = np.array(train.loc[train['fold']==fold, features])
        val_y = np.array(train.loc[train['fold']==fold, 'Survived'])
        val_idx = train[train['fold']==fold].index.tolist()
        
        model = lgb.LGBMClassifier(**param_dict, n_estimators=MAX_ROUNDS, n_jobs=-1)
        model.fit(trn_X, trn_y, eval_set=(val_X, val_y), verbose=VERBOSE_EVAL, early_stopping_rounds=STOP_ROUNDS)
        
        val_preds = model.predict_proba(val_X)[:,1]
        oof_preds[val_idx] = val_preds
        score = roc_auc_score(val_y, val_preds)
        acc = accuracy_score(val_y, np.round(val_preds))
        mean_score += acc/5
        import_scores += model.feature_importances_/5
        print('Fold AUC Score: ', score)
        print('Fold ACC Score: ', acc)
        print('--------------------------------')
        
        if subpreds:
            test_preds += model.predict_proba(np.array(test[features]))[:,1]/5
    
    print('Accuracy CV Score: ', mean_score)
    imps = pd.DataFrame({'Feature':features, 'Importance':import_scores})
    imps.sort_values(['Importance'], ascending=True, inplace=True)
    return oof_preds, test_preds, mean_score, imps

#------------------------------------------------------------------------------
def lgbm_imp_chart(imp_df, title):
    fig = plt.figure(figsize=(16, 12), facecolor='white')
    labels = imp_df['Feature'].tolist()
    widths = imp_df['Importance'].tolist()
    plt.barh(labels, widths, height=0.5, color='navy', edgecolor='grey')
    plt.xlabel('Importance', fontsize=10, weight='bold')
    plt.title(title, fontsize=12, weight='bold')
    plt.ylabel('Feature', fontsize=10, weight='bold')
    plt.tight_layout()
    plt.show()
    return

#------------------------------------------------------------------------------
def sklearn_model(train, test, features, model, subpreds):
    mean_score = 0.0
    test_preds = np.zeros(test.shape[0])
    oof_preds  = np.zeros(train.shape[0])
    for fold in range(5):
        trn_X = np.array(train.loc[train['fold']!=fold, features])
        trn_y = np.array(train.loc[train['fold']!=fold, 'Survived'])
        val_X = np.array(train.loc[train['fold']==fold, features])
        val_y = np.array(train.loc[train['fold']==fold, 'Survived'])
        val_idx = train[train['fold']==fold].index.tolist()
    
        model.fit(trn_X, trn_y)
    
        val_preds = model.predict_proba(val_X)[:,1]
        oof_preds[val_idx] = val_preds
        score = roc_auc_score(val_y, val_preds)
        acc = accuracy_score(val_y, np.round(val_preds))
        mean_score += acc/5
        print('Fold AUC Score: ', score)
        print('Fold ACC Score: ', acc)
        print('--------------------------------')
        
        if subpreds:
            test_preds += model.predict_proba(np.array(test[features]))[:,1]/5
    
    print('Accuracy CV Score: ', mean_score)
    return oof_preds, test_preds, mean_score

#------------------------------------------------------------------------------


# **4.0 FEATURE SELECTION**
So now we have a whole bunch of features that we think will be useful for modeling.  But let's check them and see how they do.  We'll do two checks here: the first is a RANDOM feature trap using feature importances and the second is a correlation trap to remove highly correlated features.


## **4.1 RANDOM Feature Trap**
The idea here is to identify the features that don't really add anything to the model because they are no more useful than a column of random numbers.  What we'll do is add a column of random numbers and then run a LightGBM model and look at the feature importances.  Any feature that is less important than the RANDOM column is suspect.



In [None]:
TRAIN = DATA.loc[DATA['Survived'].notnull()].copy()
TRAIN.reset_index(drop=True, inplace=True)
TEST  = DATA.loc[DATA['Survived'].isnull()].copy()
TEST.reset_index(drop=True, inplace=True)
TRAIN = add_folds(TRAIN, 5)
#------------------------------------------------------------------------------
params = {}
params['boosting_type']    = 'gbdt'
params['objective']        = 'binary'
params['metric']           = 'auc'
params['num_leaves']       = 51
params['learning_rate']    = 0.01
params['colsample_bytree'] = 0.8
params['subsample']        = 0.9
params['max_depth']        = 21
params['subsample_freq']   = 1
params['bagging_seed']     = 351
params['verbosity']        = -1

TRAIN['RANDOM'] = np.random.randint(2, size=TRAIN.shape[0])
bin_features += ['RANDOM']
oof, tst, score, importancesB = lgbm_model(TRAIN, TEST, bin_features, params, False)
lgbm_imp_chart(importancesB, 'Binary Feature Importances')
bin_features.remove('RANDOM')
importancesB = importancesB[importancesB['Feature']!='RANDOM']

params = {}
params['boosting_type']    = 'gbdt'
params['objective']        = 'binary'
params['metric']           = 'auc'
params['num_leaves']       = 31
params['learning_rate']    = 0.01
params['colsample_bytree'] = 0.8
params['subsample']        = 0.9
#params['max_depth']        = 5
params['subsample_freq']   = 1
params['bagging_seed']     = 351
params['verbosity']        = -1

TRAIN['RANDOM'] = np.random.randint(10, size=TRAIN.shape[0])
label_features += ['RANDOM']
oof, tst, score, importancesL = lgbm_model(TRAIN, TEST, label_features, params, False)
lgbm_imp_chart(importancesL, 'Label Encoded Feature Importances')
label_features.remove('RANDOM')
importancesL = importancesL[importancesL['Feature']!='RANDOM']

TRAIN['RANDOM'] = np.random.uniform(size=TRAIN.shape[0])
tgtenc_features += ['RANDOM']
oof, tst, score, importancesT = lgbm_model(TRAIN, TEST, tgtenc_features, params, False)
lgbm_imp_chart(importancesT, 'Target Encoded Feature Importances')
tgtenc_features.remove('RANDOM')
importancesT = importancesT[importancesT['Feature']!='RANDOM']

Wow.  The RANDOM feature finishes pretty high!!  That means there is a lot of randomness in the data and we should consider simpler models.  Simpler models generalize better - that is, they may not have as high a CV Score (or Public LB score) but they are more likely to have consistent performance when applied to more unseen data (i.e., the Private Leaderboard).  Another way of saying this is that more complex models may identify patterns that only exist in the randomness of the training dataset.

So we need to look at simplifying a bit.  We'll do this two ways:
* Whittle down the number of features by removing some that are redundant by examining the similarity of the features
* Take a look at less powerful models (LightGBM is pretty powerful)

Normally we could remove any features that are less important than the RANDOM feature and, with some tuning, the LGBM model performance could improve.  But for what we're doing here we'll leave all the features in place (since most finish with less importance than RANDOM) and let the correlation trap whittle down the number of features.
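
For reference, here is a minimal sketch of the "remove anything below RANDOM" filter that we are choosing not to apply.  The helper name `features_above_random` and the toy importance values are made up for illustration; the only thing assumed from the notebook is an importance table with `Feature` and `Importance` columns that still contains the RANDOM row.

```python
import pandas as pd

def features_above_random(imp_df, random_name='RANDOM'):
    """Return the features whose importance is strictly greater than the random column's."""
    threshold = imp_df.loc[imp_df['Feature'] == random_name, 'Importance'].iloc[0]
    keep = imp_df[(imp_df['Importance'] > threshold) & (imp_df['Feature'] != random_name)]
    return keep.sort_values('Importance', ascending=False)['Feature'].tolist()

# Toy importance table (hypothetical feature names and made-up numbers)
toy_imps = pd.DataFrame({
    'Feature':    ['Sex_male', 'Pclass_3', 'RANDOM', 'Embarked_Q', 'Parch_0'],
    'Importance': [120.0, 85.0, 40.0, 12.0, 39.5],
})
kept = features_above_random(toy_imps)
print(kept)
```

Note that in the cell above the RANDOM row is dropped from each importance table right after charting, so a filter like this would have to run before that step.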

Also, note that the feature importances really only apply to tree-based models.  Other model types may stress some of the features that have low feature importances here for the LightGBM model.

## **4.2 Feature Similarity/Correlation**
What we are looking for here are different features that provide the model the same information.  If we have two features that are the "same" we don't need both and including both can make the model look better than it is.

For the binary features we'll look at similarity, not correlation, because correlation doesn't tell us much when the features are binary.  We'll use the accuracy_score function to measure the similarity between features.  All this does is tell us what percentage of the rows in the two columns contain the same value.  While we want to see which feature pairs are the "same" (accuracy_score==1.0, meaning the two features are identical) we also want to check which features are mirror-images of each other (accuracy_score==0.0, meaning the two features are the inverse of each other).  Binary features that are mirror-images of each other provide the same data to a model, making one of them redundant.  For example, Sex==male is the same as Sex!=female, so we don't need both columns.

For the label-encoded data and the target-encoded data we'll stick with regular correlations.
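
To make the binary similarity measure concrete, here is a small sketch on toy data.  It computes the same pairwise agreement rate that accuracy_score gives, but for all pairs at once with two matrix products; the function name `binary_similarity` and the toy array are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def binary_similarity(X):
    """Fraction of rows on which each pair of 0/1 columns agrees."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    both_one = X.T @ X               # count of rows where both columns are 1
    both_zero = (1 - X).T @ (1 - X)  # count of rows where both columns are 0
    return (both_one + both_zero) / n

# Toy data: column 1 is the mirror image of column 0, column 2 is unrelated
toy = np.array([[1, 0, 1],
                [0, 1, 1],
                [1, 0, 0],
                [0, 1, 0]])
sim = binary_similarity(toy)
print(sim)
```

The mirror-image pair scores 0.0 and the unrelated pair lands near 0.5, which is why both ends of the scale (near 1.0 and near 0.0) signal redundancy.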

In [None]:
F = len(bin_features)
sim_matrix = np.zeros((F,F))
for i in range(F):
    for j in range(F):
        sim_matrix[i,j] = accuracy_score(DATA[bin_features[i]], DATA[bin_features[j]])

fig, ax = plt.subplots(figsize=(16,16), facecolor='white')  # a single call avoids creating an extra empty figure
cax = ax.matshow(sim_matrix)
ax.grid(True)
plt.title('Binary Feature Similarity')
plt.xticks(range(F), bin_features, rotation=90);
plt.yticks(range(F), bin_features);
fig.colorbar(cax, ticks=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
plt.show()

# -------------------------------------------
F = len(label_features)
corL_matrix = DATA[label_features].corr()

fig, ax = plt.subplots(figsize=(16,16), facecolor='white')  # a single call avoids creating an extra empty figure
cax = ax.matshow(corL_matrix)
ax.grid(True)
plt.title('Label Encoded Feature Correlations')
plt.xticks(range(F), label_features, rotation=90);
plt.yticks(range(F), label_features);
fig.colorbar(cax)
plt.show()

# -------------------------------------------
F = len(tgtenc_features)
corT_matrix = DATA[tgtenc_features].corr()

fig, ax = plt.subplots(figsize=(16,16), facecolor='white')  # a single call avoids creating an extra empty figure
cax = ax.matshow(corT_matrix)
ax.grid(True)
plt.title('Target Encoded Feature Correlations')
plt.xticks(range(F), tgtenc_features, rotation=90);
plt.yticks(range(F), tgtenc_features);
fig.colorbar(cax)
plt.show()

It looks like we have some highly-similar features in the binary data.  We'll fudge a bit here and say that any two binary features that have more than 95% in common or less than 5% in common are "similar."  Then we'll only keep the feature that has the higher importance in our LightGBM model.

Now, it's not just pairs of features we need to consider because there may be groups of features that are mutually similar.  In order to see these groups we're going to use a graph.  A [graph](http://en.wikipedia.org/wiki/Graph_theory#:~:text=In%20mathematics%2C%20graph%20theory%20is,also%20called%20links%20or%20lines) is a set of nodes (or vertices), connected by edges.  We'll say each feature is a node and two features are connected by an edge if they are "similar", or in the case of label-encoded and target-encoded data, correlated.

We'll then look at "Connected Components" of the graph.  These are the sets of mutually connected nodes/features, and from each of these connected components we will keep only the feature with the highest importance to the LightGBM model.
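
Here is a small, dependency-free sketch of that grouping step on toy data.  The notebook itself uses networkx for this; the depth-first search below, the feature names `A` through `E`, and the importance values are all assumptions made up for illustration.

```python
import numpy as np

def connected_components(adj):
    """Group node indices into connected components with a simple depth-first search."""
    adj = np.asarray(adj)
    seen, components = set(), []
    for start in range(adj.shape[0]):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nbr in np.flatnonzero(adj[node]):
                nbr = int(nbr)
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        components.append(sorted(comp))
    return components

# Hypothetical features and importances, in the same order as the adjacency matrix
names = ['A', 'B', 'C', 'D', 'E']
importance = {'A': 5.0, 'B': 9.0, 'C': 2.0, 'D': 7.0, 'E': 1.0}
# A~B and B~C are "similar", so A, B and C form one group; D and E stand alone
adj = np.array([[1, 1, 0, 0, 0],
                [1, 1, 1, 0, 0],
                [0, 1, 1, 0, 0],
                [0, 0, 0, 1, 0],
                [0, 0, 0, 0, 1]])
groups = connected_components(adj)
best = [max((names[i] for i in comp), key=importance.get) for comp in groups]
print(groups)
print(best)
```

Notice that A and C are not directly similar, but they still end up in the same component because both are similar to B.  That chaining is worth keeping in mind: a long chain of pairwise-similar features collapses to a single survivor.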

In [None]:
def feature_similarity(adj_matrix, imp_df):
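    # The adjacency matrix follows the original feature-list order, but imp_df arrives sorted by importance.
    # Its index still holds the original positions, so sort on it to line node n up with the right feature name.
    features = imp_df.sort_index()['Feature'].tolist()
    G = nx.from_numpy_array(np.asarray(adj_matrix))  #  Create a graph from the similarity adjacency matrix (from_numpy_matrix was removed in NetworkX 3.0)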
    components = sorted(nx.connected_components(G), key = len, reverse=True)  # separate out the connected components
    print('Components: ',len(components))
    i=0
    out=[]
    while i<len(components):
        comp = list(components[i])
        for n in comp:
            if len(comp)>1:
                f = features[n]
                out = out + [(f,i)]
        i=i+1
    C = pd.DataFrame(out,columns=['Feature','SimComponent'])
    imp_df = imp_df.merge(C, on='Feature', how='left')

    uncorr = imp_df[imp_df['SimComponent'].isnull()].copy()
    corrgp = imp_df[imp_df['SimComponent'].notnull()].copy()
    corrgp.sort_values(['SimComponent','Importance'], ascending=[True,False], inplace=True)
    corrgp.reset_index(drop=True, inplace=True)
    corrgp.drop_duplicates('SimComponent', keep='first', inplace=True)

    features = uncorr['Feature'].tolist() + corrgp['Feature'].tolist()
    print('Best Features:')
    print('------------------------')
    print('Features that are not similar to any of the other features:')
    print('-------------')
    print(uncorr['Feature'].tolist())
    print('-------------')
    print('Features with the highest LGBM importance out of their similarity component:')
    print('-------------')
    print(corrgp['Feature'].tolist())
    features = [f for f in features if f!='RANDOM']
    return features
print('----------------------------------------------')
print('BINARY FEATURES')
print('----------------------------------------------')
Bmatrix = np.where((sim_matrix>0.95)|(sim_matrix<0.05), 1, 0)
bin_features = feature_similarity(Bmatrix, importancesB)
print('----------------------------------------------')
print('LABEL-ENCODED FEATURES')
print('----------------------------------------------')
Lmatrix = np.where(np.absolute(corL_matrix)>0.95, 1, 0)
label_features = feature_similarity(Lmatrix, importancesL)
print('----------------------------------------------')
print('TARGET-ENCODED FEATURES')
print('----------------------------------------------')
Tmatrix = np.where(np.absolute(corT_matrix)>0.95, 1, 0)
tgtenc_features = feature_similarity(Tmatrix, importancesT)

# **5.0 FINAL MODELS**

OK, we've gone through some modeling and some feature selection and arrived at a robust set of "important" features.  Let's use them in a few different types of models and then do some ensembling (also called blending).  In order to explore the ensembling opportunities later, we'll keep track of the out-of-fold predictions and the test predictions.  We'll make all of the predictions as probabilities rather than binary outputs and save them in dataframes OOF_PRED and TESTPRED.

In [None]:
OOF_PRED = TRAIN[['PassengerId','fold','Survived']].copy()
TESTPRED = pd.DataFrame({'PassengerId':TEST['PassengerId'],'LGBM':np.zeros(TEST.shape[0])})
CV_Scores = {}

## **5.1 Binary Data Models**

In [None]:
params = {}
params['boosting_type']    = 'gbdt'
params['objective']        = 'binary'
params['metric']           = 'auc'
params['num_leaves']       = 51
params['learning_rate']    = 0.01
params['colsample_bytree'] = 0.8
params['subsample']        = 0.9
params['max_depth']        = 21
params['subsample_freq']   = 1
params['bagging_seed']     = 351
params['verbosity']        = -1
print('--------------------------------------------------')
print('LightGBM Model')
print('--------------------------------------------------')
oof, tst, score, importances = lgbm_model(TRAIN, TEST, bin_features, params, True)
OOF_PRED['LGBM_B'] = oof
TESTPRED['LGBM_B'] = tst
CV_Scores['LGBM_B'] = score
print('--------------------------------------------------')
print('Multi-Layer Perceptron Model')
print('--------------------------------------------------')
oof, tst, score = sklearn_model(TRAIN, TEST, bin_features, MLPClassifier(), True)
OOF_PRED['MLP_B'] = oof
TESTPRED['MLP_B'] = tst
CV_Scores['MLP_B'] = score
print('--------------------------------------------------')
print('Logistic Regression Model')
print('--------------------------------------------------')
oof, tst, score = sklearn_model(TRAIN, TEST, bin_features, LogisticRegression(), True)
OOF_PRED['LOG_B'] = oof
TESTPRED['LOG_B'] = tst
CV_Scores['LOG_B'] = score
print('--------------------------------------------------')
print('KNeighbors Classifier Model')
print('--------------------------------------------------')
oof, tst, score = sklearn_model(TRAIN, TEST, bin_features, KNeighborsClassifier(n_neighbors=101, weights='distance'), True)
OOF_PRED['KNN_B'] = oof
TESTPRED['KNN_B'] = tst
CV_Scores['KNN_B'] = score
print('--------------------------------------------------')
print('Quadratic Discriminant Analysis Model')
print('--------------------------------------------------')
oof, tst, score = sklearn_model(TRAIN, TEST, bin_features, QuadraticDiscriminantAnalysis(reg_param=0.2), True)
OOF_PRED['QDA_B'] = oof
TESTPRED['QDA_B'] = tst
CV_Scores['QDA_B'] = score


## **5.2 Label-Encoded Data Models**

In [None]:
params = {}
params['boosting_type']    = 'gbdt'
params['objective']        = 'binary'
params['metric']           = 'auc'
params['num_leaves']       = 31
params['learning_rate']    = 0.01
params['colsample_bytree'] = 0.8
params['subsample']        = 0.9
#params['max_depth']        = 5
params['subsample_freq']   = 1
params['bagging_seed']     = 351
params['verbosity']        = -1

print('--------------------------------------------------')
print('LightGBM Model')
print('--------------------------------------------------')
oof, tst, score, importances = lgbm_model(TRAIN, TEST, label_features, params, True)
OOF_PRED['LGBM_L'] = oof
TESTPRED['LGBM_L'] = tst
CV_Scores['LGBM_L'] = score
print('--------------------------------------------------')
print('Multi-Layer Perceptron Model')
print('--------------------------------------------------')
oof, tst, score = sklearn_model(TRAIN, TEST, label_features, MLPClassifier(), True)
OOF_PRED['MLP_L'] = oof
TESTPRED['MLP_L'] = tst
CV_Scores['MLP_L'] = score
print('--------------------------------------------------')
print('Logistic Regression Model')
print('--------------------------------------------------')
oof, tst, score = sklearn_model(TRAIN, TEST, label_features, LogisticRegression(), True)
OOF_PRED['LOG_L'] = oof
TESTPRED['LOG_L'] = tst
CV_Scores['LOG_L'] = score
print('--------------------------------------------------')
print('KNeighbors Classifier Model')
print('--------------------------------------------------')
oof, tst, score = sklearn_model(TRAIN, TEST, label_features, KNeighborsClassifier(n_neighbors=101, weights='distance'), True)
OOF_PRED['KNN_L'] = oof
TESTPRED['KNN_L'] = tst
CV_Scores['KNN_L'] = score
print('--------------------------------------------------')
print('Quadratic Discriminant Analysis Model')
print('--------------------------------------------------')
oof, tst, score = sklearn_model(TRAIN, TEST, label_features, QuadraticDiscriminantAnalysis(reg_param=0.2), True)
OOF_PRED['QDA_L'] = oof
TESTPRED['QDA_L'] = tst
CV_Scores['QDA_L'] = score

## **5.3 Target-Encoded Data Models**

In [None]:
print('--------------------------------------------------')
print('LightGBM Model')
print('--------------------------------------------------')
oof, tst, score, importances = lgbm_model(TRAIN, TEST, tgtenc_features, params, True)
OOF_PRED['LGBM_T'] = oof
TESTPRED['LGBM_T'] = tst
CV_Scores['LGBM_T'] = score
print('--------------------------------------------------')
print('Multi-Layer Perceptron Model')
print('--------------------------------------------------')
oof, tst, score = sklearn_model(TRAIN, TEST, tgtenc_features, MLPClassifier(), True)
OOF_PRED['MLP_T'] = oof
TESTPRED['MLP_T'] = tst
CV_Scores['MLP_T'] = score
print('--------------------------------------------------')
print('Logistic Regression Model')
print('--------------------------------------------------')
oof, tst, score = sklearn_model(TRAIN, TEST, tgtenc_features, LogisticRegression(), True)
OOF_PRED['LOG_T'] = oof
TESTPRED['LOG_T'] = tst
CV_Scores['LOG_T'] = score
print('--------------------------------------------------')
print('KNeighbors Classifier Model')
print('--------------------------------------------------')
oof, tst, score = sklearn_model(TRAIN, TEST, tgtenc_features, KNeighborsClassifier(n_neighbors=101, weights='distance'), True)
OOF_PRED['KNN_T'] = oof
TESTPRED['KNN_T'] = tst
CV_Scores['KNN_T'] = score
print('--------------------------------------------------')
print('Quadratic Discriminant Analysis Model')
print('--------------------------------------------------')
oof, tst, score = sklearn_model(TRAIN, TEST, tgtenc_features, QuadraticDiscriminantAnalysis(reg_param=0.2), True)
OOF_PRED['QDA_T'] = oof
TESTPRED['QDA_T'] = tst
CV_Scores['QDA_T'] = score

# **6.0 ENSEMBLING**

## **6.1 Model Prediction Correlations**
We want to look at the correlations of the different model predictions to see if they are highly correlated.  If two models are essentially predicting the same survival probabilities we don't need to include both in the ensemble (also called a blend of models).  The cutoff for what is "highly" correlated is a bit subjective.  You need to consider the CV scores and the model types when deciding what to keep.  A good blend of models includes models of different types (tree-based, neural networks, nearest neighbors, etc.).  We'll keep all 5 model types here since they are all different (even though a couple have some high correlations).
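
If reading a 15x15 correlation matrix by eye gets tedious, a small helper can list the pairs above a chosen cutoff.  This is only a sketch: the function name `correlated_pairs`, the 0.95 threshold and the toy predictions are assumptions for illustration, and the right cutoff remains a judgment call.

```python
import numpy as np
import pandas as pd

def correlated_pairs(pred_df, threshold=0.95):
    """List (model_a, model_b, correlation) for every pair whose correlation exceeds the threshold."""
    corr = pred_df.corr()
    cols = corr.columns.tolist()
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

# Toy predictions: M2 is M1 plus a little noise, M3 is independent
rng = np.random.default_rng(0)
base = rng.uniform(size=1000)
toy_preds = pd.DataFrame({
    'M1': base,
    'M2': np.clip(base + rng.normal(0, 0.01, size=1000), 0, 1),
    'M3': rng.uniform(size=1000),
})
pairs = correlated_pairs(toy_preds)
print(pairs)
```

In the notebook this could be pointed at `OOF_PRED[model_list]` once the model columns exist.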

In [None]:
models = ['LGBM','MLP','LOG','KNN','QDA']
model_list = [f+'_B' for f in models]+[f+'_L' for f in models]+[f+'_T' for f in models]
F = len(model_list)
oof_corr_matrix = OOF_PRED[model_list].corr()
test_corr_matrix = TESTPRED[model_list].corr()
# -------------------------------------------
fig = plt.figure(figsize=(16, 8), facecolor='white')
gs = fig.add_gridspec(1, 2)

# -------------------------------------------
ax0 = fig.add_subplot(gs[0, 0])
cax = ax0.matshow(oof_corr_matrix)
ax0.grid(True)
ax0.set_title('OOF Prediction Correlations', fontsize=12, weight='bold')
ax0.set_xticks(range(F))
ax0.set_xticklabels(model_list, fontsize=10, weight='bold')
ax0.set_yticks(range(F))
ax0.set_yticklabels(model_list, fontsize=10, weight='bold')

# -------------------------------------------
ax1 = fig.add_subplot(gs[0, 1])
cax = ax1.matshow(test_corr_matrix)
ax1.grid(True)
ax1.set_title('Test Prediction Correlations', fontsize=12, weight='bold')
ax1.set_xticks(range(F))
ax1.set_xticklabels(model_list, fontsize=10, weight='bold')
ax1.set_yticks(range(F))
ax1.set_yticklabels(model_list, fontsize=10, weight='bold')

# -------------------------------------------
fig.colorbar(cax, ticks=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
plt.tight_layout()
plt.show()
# -------------------------------------------
print('Correlations --------')
print(oof_corr_matrix)
print('CV Scores -----------')
print(CV_Scores)

## **6.2 Linear Optimization to Weight the Models**

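To blend the models we'll use a weighted average.  To find the weights we'll solve an optimization problem: choose the weights of a linear combination of the out-of-fold predictions that minimizes the RMSE against the actual outcomes, then rescale the weights so they sum to 1.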
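
As a cross-check on this idea, note that minimizing RMSE with unconstrained weights is the same as an ordinary least-squares fit, which NumPy can solve directly.  The sketch below uses made-up toy predictions (the outcome plus noise) rather than the notebook's models, and it is an alternative illustration, not the method used in the next cell.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
y = rng.integers(0, 2, size=n).astype(float)   # toy 0/1 outcomes
# Three toy "models": the outcome plus increasing amounts of noise
preds = np.column_stack([y + rng.normal(0, s, size=n) for s in (0.3, 0.5, 0.9)])

# Least squares minimizes the squared error of preds @ w against y (and therefore the RMSE)
weights, *_ = np.linalg.lstsq(preds, y, rcond=None)
weights = weights / weights.sum()               # rescale so the weights sum to 1
blend = preds @ weights

def rmse(p):
    return np.sqrt(np.mean((p - y) ** 2))

print('Weights:', weights)
print('Blend RMSE:', rmse(blend))
print('Single-model RMSEs:', [rmse(preds[:, k]) for k in range(3)])
```

The least noisy toy model gets the largest weight, and the blend scores better than any single model.  Keep in mind that nothing here (or in an unconstrained `minimize` call) stops a weight from going negative.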

In [None]:
from scipy.optimize import minimize

def RMSE(x,a,s):
    return np.sqrt(np.mean((np.matmul(a,x)-s)**2))

oof = np.array(OOF_PRED[model_list].copy())
S = np.array(OOF_PRED['Survived'])
result = minimize(RMSE, x0=np.zeros(len(model_list)), args=(oof,S))        
out = pd.DataFrame({'Model':model_list, 'Weight':result.x})
out['Weight'] = out['Weight']/out['Weight'].sum()
print(out)

## **6.3 Ensemble & Output**

In [None]:
OOF_PRED['EnsPred'] = np.matmul(oof, np.array(out['Weight']))
score = accuracy_score(OOF_PRED['Survived'], np.round(OOF_PRED['EnsPred']))
print('Ensemble Accuracy Score: ', score)

tst = np.array(TESTPRED[model_list])
TESTPRED['Survived'] = np.round(np.matmul(tst, np.array(out['Weight']))).astype(int)
SUB = TESTPRED[['PassengerId','Survived']]
SUB.to_csv('EnsembleSubmission.csv', index=False)

preds = np.round(oof).astype(int)
OOF_PRED['VotePred'] = np.round(preds.mean(axis=1)).astype(int)
score = accuracy_score(OOF_PRED['Survived'], OOF_PRED['VotePred'])
print('Voting Accuracy Score: ', score)


## **6.4 How good are our predictions?**
Let's take a quick look at the confusion matrix for our predictions:

In [None]:
outcomes = ['Died','Survived']
F = len(outcomes)
confusion = confusion_matrix(OOF_PRED['Survived'], np.round(OOF_PRED['EnsPred']))
fig = plt.figure(figsize=(16, 8), facecolor='white')
gs = fig.add_gridspec(1, 2)

# -------------------------------------------
ax0 = fig.add_subplot(gs[0, 0])
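cax = ax0.matshow(confusion.T)  # confusion_matrix returns rows=actual, cols=predicted; transpose so the colors match the axis labels and annotations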
ax0.set_title('Ensemble Prediction Confusion Matrix', fontsize=12, weight='bold')
ax0.set_xlabel('Actual Outcomes', fontsize=10, weight='bold')
ax0.set_xticks(range(F))
ax0.set_xticklabels(outcomes, fontsize=10, weight='bold')
ax0.set_ylabel('Model Predictions', fontsize=10, weight='bold')
ax0.set_yticks(range(F))
ax0.set_yticklabels(outcomes, fontsize=10, weight='bold')
ax0.annotate(str(confusion[0,0]),(0,0), fontsize=10, weight='bold')
ax0.annotate(str(confusion[0,1]),(0,1), fontsize=10, weight='bold')
ax0.annotate(str(confusion[1,0]),(1,0), fontsize=10, weight='bold')
ax0.annotate(str(confusion[1,1]),(1,1), fontsize=10, weight='bold')
#fig.colorbar(cax)
plt.tight_layout()
plt.show()

We noticed some imbalances between the train and test datasets earlier which we'll want to look at and see what impacts they could have.  So let's take a look at how our predictions score on certain slices of the data.  In particular we noticed that:

**Sex**
* 56% of the train data set has Sex==male, however, 70% of the test data set is male.

**Pclass**
* 41% of the train data set has Pclass==3, however, 64% of the test data set falls in 3rd class.
* 29% of the train data set has Pclass==2, however, only 9% of the test data set falls in 2nd class.

**Age**
* 41% of the train data set has AgeGroup==AgeGrp3, however, 65% of the test data set falls in AgeGrp3.
* 45% of the train data set has AgeGroup==AgeGrp4, however, only 24% of the test data set falls in AgeGrp4.

**FareCluster**
* 42% of the train data set has FareCluster==1, however, 55% of the test data set falls in Cluster1.
* 36% of the train data set has FareCluster==2, however, 21% of the test data set falls in Cluster2.
*  8% of the train data set has FareCluster==4, however, 12% of the test data set falls in Cluster4.

**Family Sizes**
* 62% of the train data set has FamilySize==1, however, only 55% of the test data set has FamilySize==1.
* 73% of the train data set has SibSp==0, however, only 62% of the test data set has SibSp==0.
* 20% of the train data set has SibSp==1, however, 31% of the test data set has SibSp==1.
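
Shares like the ones listed above can be tabulated with a few lines of pandas.  This is a sketch on toy columns built to mirror the 56% / 70% male split quoted above; the helper name `compare_shares` is made up for illustration.

```python
import pandas as pd

def compare_shares(train_col, test_col):
    """Side-by-side share of each category in train and test, plus the difference."""
    out = pd.DataFrame({
        'train': train_col.value_counts(normalize=True),
        'test': test_col.value_counts(normalize=True),
    }).fillna(0.0)
    out['diff'] = out['test'] - out['train']
    return out.sort_index()

# Toy columns, not the competition data
toy_train = pd.Series(['male'] * 56 + ['female'] * 44)
toy_test = pd.Series(['male'] * 70 + ['female'] * 30)
shares = compare_shares(toy_train, toy_test)
print(shares)
```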


In [None]:
TRAIN['EnsPred'] = np.round(OOF_PRED['EnsPred'])
for F in ['Sex','Pclass','AgeGroup','FareCluster','FamilySize']:
    vals = sorted(TRAIN[F].unique().tolist())
    for v in vals:
        temp = TRAIN[TRAIN[F]==v]
        acc = accuracy_score(temp['Survived'], temp['EnsPred'])
        print(F+' == '+str(v)+' Accuracy:', np.round(acc,3))

# **7.0 Next Steps**
So we built a whole bunch of stuff here and introduced a lot of tricks and techniques.  To improve more I'd suggest exploring:
1. Feature Engineering.  There is almost no end to the possibilities of feature engineering.  For example, we did not explore PCA or ICA methods of dimensionality reduction.
2. Model Tuning.  Almost none of the models used here are "tuned" for optimal performance.  Tuning the various parameters of each of them could improve scores further.
3. Ensembling.  Ensembling can be done in a variety of ways and we only use one here.  Other methods could use Logistic Regression, neural networks or various types of means (like a geometric mean).  Also, you may note that we used the same fold-structure in our Cross-Validation scheme for all of the models.  This can be changed up, but it also provides a way to see if different models perform differently on different folds, which can give some additional insight into the data.
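
As one example of the alternatives mentioned in item 3, here is a minimal sketch of a geometric-mean blend.  The function name `geometric_mean_blend`, the `eps` clipping value and the toy probabilities are assumptions for illustration; we did not run this on the competition data.

```python
import numpy as np

def geometric_mean_blend(pred_matrix, eps=1e-9):
    """Geometric mean across models (columns), computed in log space for numerical stability."""
    p = np.clip(np.asarray(pred_matrix, dtype=float), eps, 1.0)
    return np.exp(np.log(p).mean(axis=1))

# Two toy passengers, three toy models
toy = np.array([[0.9, 0.8, 0.1],
                [0.5, 0.5, 0.5]])
geo = geometric_mean_blend(toy)
print('Geometric mean: ', geo)
print('Arithmetic mean:', toy.mean(axis=1))
```

The geometric mean is never larger than the arithmetic mean, so a single low prediction pulls the blend down harder than it would in a plain average.  Whether that helps is something to check against the CV scores.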
