## Titanic: Machine Learning from Disaster

In this example, we want to build a predictive model that classifies whether a passenger survived the sinking of the Titanic, based on individual and trip characteristics. This is a toy example, but it is a useful pedagogical tool for walking through the main steps of modeling. The final model should take passenger information and predict whether that passenger would have survived.


What type of ML problem is this?

In [5]:
# Imports
import matplotlib.pyplot as plt
import warnings
%matplotlib inline
warnings.filterwarnings("ignore")
import pandas as pd
from sklearn import (
    ensemble,
    model_selection,    
    preprocessing,
    tree,
)
from sklearn.metrics import (
    auc,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
)

In [6]:
# Reading .xls files requires the optional xlrd package (pip install xlrd)
df = pd.read_excel("titanic.xls")
# Keep an untouched copy so later cells can restart from the raw data
orig_df = df.copy()
df.head()

#### Columns description

- pclass - Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
- survived - Survival (0 = No, 1 = Yes)
- name - Name
- sex - Sex
- age - Age
- sibsp - Number of siblings/spouses aboard
- parch - Number of parents/children aboard
- ticket - Ticket number
- fare - Passenger fare
- cabin - Cabin
- embarked - Point of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
- boat - Lifeboat
- body - Body identification number

## Data Cleaning

Once we have the data, we need to ensure that it is in a format that we can use to create a model.

Most scikit-learn models require that the features be numeric (integer or float). Some models perform better if the data is standardized (given a mean value of 0 and a standard deviation of 1).

In [None]:
df.dtypes

Use the .shape attribute of the DataFrame to inspect the number of rows and columns:

In [None]:
df.shape

Use the .describe method to get summary statistics and the count of non-null values for each numeric column.

In [None]:
df.describe()

### Outlier detection

Box plot

Below is an example of detecting outliers in "fare" using a box plot.

In [None]:
import seaborn as sns

sns.boxplot(x=df['fare'])
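
A box plot draws as individual points the values that fall beyond its whiskers, which by default extend 1.5 times the interquartile range (IQR) past the quartiles. The sketch below applies the same rule numerically. The fares are made up for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Made-up fares for illustration only (not from the Titanic dataset)
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33], name="fare")

# Quartiles and interquartile range
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1

# Whisker limits under the usual 1.5 * IQR rule
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the limits are the candidate outliers
outliers = fares[(fares < lower) | (fares > upper)]
print(outliers)
```

Whether a flagged value should be removed is a separate decision: an extreme fare may be a legitimate first-class ticket rather than an error.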

Example of multivariate outlier detection using a scatter plot:

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(df['age'], df['sibsp'])
ax.set_xlabel('Age')
ax.set_ylabel('Number of siblings/spouses aboard')
plt.show()

Use the .isnull method to find columns or rows with missing values.

In [None]:
df.isnull().sum()

Count the missing values in each row (shown here for the rows labeled 0 through 10):

In [None]:
df.isnull().sum(axis=1).loc[:10]
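
To pull out the rows that contain at least one missing value, rather than only counting them, a boolean mask can be used. This is a minimal sketch on a made-up frame; the column names mirror the dataset but the values are invented.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0],
        "fare": [7.25, 71.28, np.nan],
        "pclass": [3, 1, 2],
    }
)

# Number of missing values in each row
missing_per_row = toy.isnull().sum(axis=1)

# Keep only the rows where any column is missing
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```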

Use the .value_counts method to examine the count of each value.

Pandas typically ignores null or NaN values. To include them, pass dropna=False, which also shows the count of NaN.

In [None]:
df.sex.value_counts(dropna=False)

In [None]:
df.embarked.value_counts(dropna=False)
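
The difference dropna makes is easiest to see on a small Series. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up embarkation codes with one missing entry
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S"])

# Default behavior: NaN is silently excluded from the counts
default_counts = ports.value_counts()

# dropna=False adds a row for NaN, so the counts sum to the Series length
all_counts = ports.value_counts(dropna=False)

print(default_counts.sum(), all_counts.sum())
```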

A model cannot take advantage of a text column in which every value is different, unless we use NLP or otherwise extract data from the text. The name column is an example of this.

In [None]:
name = df.name
name.head(3)

We also want to drop columns that leak information. Both boat and body columns leak whether a passenger survived.

In [None]:
df = df.drop(
    columns=[
        "name",
        "ticket",
        "home.dest",
        "boat",
        "body",
        "cabin",
    ]
)
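
One rough way to spot a leaky column is to check how closely the mere presence of a value tracks the label. The sketch below does this for a made-up frame in which a lifeboat is recorded only for survivors. The data are invented for illustration, and in real data the agreement need not be perfect.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame(
    {
        "survived": [1, 0, 1, 0, 0],
        "boat": ["2", np.nan, "11", np.nan, np.nan],
    }
)

# 1 when a lifeboat was recorded, 0 otherwise
has_boat = toy["boat"].notnull().astype(int)

# Fraction of rows where "has a boat" equals the label
agreement = (has_boat == toy["survived"]).mean()
print(agreement)
```

A feature that reproduces the label this closely is only known after the outcome, so it would not be available when predicting for a new passenger.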

We need to create dummy columns from string columns. This will create new columns for sex and embarked.

In [None]:
df = pd.get_dummies(df)

In [None]:
df.columns

In [None]:
df.head()

At this point the sex_male and sex_female columns are perfectly inversely correlated. We could either drop one of the two columns or rerun get_dummies with drop_first=True, which drops the first level of each encoded column (here sex_female and embarked_C).

In [None]:
df = orig_df

In [None]:
df = df.drop(
    columns=[
        "name",
        "ticket",
        "home.dest",
        "boat",
        "body",
        "cabin",
    ]
)

In [None]:
df = pd.get_dummies(df, drop_first=True)

In [None]:
df.columns

In [None]:
df.head()
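
The effect of drop_first can be checked on a small frame. This is a sketch on made-up rows that reuse the sex and embarked column names.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame(
    {
        "sex": ["male", "female", "female"],
        "embarked": ["S", "C", "Q"],
    }
)

# One indicator column per category level
full = pd.get_dummies(toy)

# drop_first=True removes the first level of each encoded column
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The dropped level is still recoverable: a row with sex_male equal to 0 is female, and a row with both embarked_Q and embarked_S equal to 0 embarked at C.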

Create a DataFrame (X) with the features and a Series (y) with the labels.


In [7]:
X = df.drop(columns="survived")
y = df.survived


### Split train/test set

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=42)

In [None]:
X.columns

### Missing Data

There are various ways to handle missing data:
    
- Remove any row with missing data
- Remove any column with missing data
- Impute missing values
- Create an indicator column to signify data was missing
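
The last option in the list above, an indicator column, can be sketched as follows. The column name age_missing is a hypothetical choice and the values are made up. The indicator has to be created before imputing, because the information is lost once the gaps are filled.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only
toy = pd.DataFrame({"age": [22.0, np.nan, 35.0, np.nan]})

# Record which rows were missing before filling them in
# ("age_missing" is a hypothetical column name)
toy["age_missing"] = toy["age"].isnull().astype(int)

# Then impute, here with the median of the observed values
toy["age"] = toy["age"].fillna(toy["age"].median())

print(toy)
```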


Percentage of missing values

In [None]:
df.isnull().mean() * 100

Example of imputing values with median

In [None]:
X_train.isnull().sum()

In [None]:
# Compute medians on the training set only, then reuse them for the test set
meds = X_train.median()
X_train = X_train.fillna(meds)
X_test = X_test.fillna(meds)

In [None]:
X_train.isnull().sum()
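
The same median imputation can also be expressed with scikit-learn's SimpleImputer, which stores the training medians so they can be reapplied to the test set. This is a minimal sketch on made-up frames, not a replacement for the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up train and test frames for illustration only
train = pd.DataFrame({"age": [20.0, np.nan, 40.0], "fare": [10.0, 30.0, np.nan]})
test = pd.DataFrame({"age": [np.nan], "fare": [15.0]})

imputer = SimpleImputer(strategy="median")

# fit_transform learns the medians from the training data
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)

# transform reuses the training medians on the test data
test_filled = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

print(test_filled)
```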

Example of imputing missing values using the **experimental** IterativeImputer class in scikit-learn:

https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=42)

In [None]:
# explicitly require this experimental feature
from sklearn.experimental import enable_iterative_imputer
# now we can import normally from sklearn.impute
from sklearn.impute import IterativeImputer

num_cols = [
    "pclass",
    "age",
    "sibsp",
    "parch",
    "fare",
    "sex_male",
]

In [None]:
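# Work on copies so the assignments below do not trigger chained-assignment warnings
X_train = X_train.copy()
X_test = X_test.copy()
imputer = IterativeImputer()
# Fit on the training set only, then apply the same imputer to the test set
X_train[num_cols] = imputer.fit_transform(X_train[num_cols])
X_test[num_cols] = imputer.transform(X_test[num_cols])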

In [None]:
X_train.isnull().sum()

### Data normalization & standardization



Normalizing or otherwise preprocessing the data often helps models perform better.

Standardizing rescales each feature so that it has a mean of zero and a standard deviation of one.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
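
Concretely, each value x in a column is replaced by z = (x - mean) / std, where the mean and standard deviation are computed for that column. The sketch below checks this by hand against StandardScaler on made-up numbers. Note that StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single made-up feature column
x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Manual standardization: subtract the column mean, divide by the column std
manual = (x - x.mean(axis=0)) / x.std(axis=0)  # np.std defaults to ddof=0

# The same transformation with scikit-learn
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```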

In [None]:
X_train.head()

In [None]:
# Remember the column names and index: the scaler returns a bare NumPy array
cols = X_train.columns
train_idx, test_idx = X_train.index, X_test.index
scaler = preprocessing.StandardScaler()
# Fit on the training set only, then apply the same scaling to the test set
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=cols, index=train_idx)
X_test = pd.DataFrame(scaler.transform(X_test), columns=cols, index=test_idx)

In [None]:
X_train.head()

### Code refactoring

In [None]:
def gen_dummies(df):
    df = df.drop(
        columns=[
            "name",
            "ticket",
            "home.dest",
            "boat",
            "body",
            "cabin",
        ]
    ).pipe(pd.get_dummies, drop_first=True)
    return df

In [None]:
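def get_train_test(df, y_col, size=0.3, std_cols=None):
    """Split df into train/test sets, impute the numeric columns, and
    optionally standardize std_cols (everything is fitted on the training set only)."""
    y = df[y_col]
    X = df.drop(columns=y_col)
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=size, random_state=42
    )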
    num_cols = [
        "pclass",
        "age",
        "sibsp",
        "parch",
        "fare",
    ]
    impute = IterativeImputer()
    train_fit = impute.fit_transform(X_train[num_cols])
    X_train = X_train.assign(**{c:train_fit[:,i] for i, c in enumerate(num_cols)})
    test_fit = impute.transform(X_test[num_cols])
    X_test = X_test.assign(**{c:test_fit[:,i] for i, c in enumerate(num_cols)})
    if std_cols:
        std = preprocessing.StandardScaler()
        train_fit = std.fit_transform(X_train[std_cols])
        X_train = X_train.assign(**{c:train_fit[:,i] for i, c in enumerate(std_cols)})
        test_fit = std.transform(X_test[std_cols])
        X_test = X_test.assign(**{c:test_fit[:,i] for i, c in enumerate(std_cols)})

    return X_train, X_test, y_train, y_test

In [None]:
gen_dummies_df = gen_dummies(orig_df)
std_cols = "pclass,age,sibsp,fare".split(",")
X_train, X_test, y_train, y_test = get_train_test(
    gen_dummies_df, "survived", std_cols=std_cols
)

In [None]:
X_train.head()