## Predicting Survival on the Titanic

### History
Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on few attributes like gender, age, and social status, we can make very accurate predictions on which passengers would survive. Some groups of people were more likely to survive than others, such as women, children, and the upper-class. Therefore, we can learn about the society priorities and privileges at the time.

### Assignment:

Build a Machine Learning Pipeline, to engineer the features in the data set and predict who is more likely to Survive the catastrophe.

Follow the Jupyter notebook below, and complete the missing bits of code, to achieve each one of the pipeline steps.

In [1]:
import re

# to handle datasets
import pandas as pd
import numpy as np

# for visualization
import matplotlib.pyplot as plt

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import StandardScaler

# to build the models
from sklearn.linear_model import LogisticRegression

# to evaluate the models
from sklearn.metrics import accuracy_score, roc_auc_score

# to persist the model and the scaler
import joblib

# to visualise al the columns in the dataframe
pd.pandas.set_option('display.max_columns', None)

## Prepare the data set

In [2]:
# load the data - it is available open source and online

data = pd.read_csv('phpMYEkMl.csv')

# display data
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"


In [3]:
# replace interrogation marks by NaN values

data = data.replace('?', np.nan)

In [4]:
# retain only the first cabin if more than
# 1 are available per passenger

def get_first_cabin(row):
    try:
        return row.split()[0]
    except:
        return np.nan
    
data['cabin'] = data['cabin'].apply(get_first_cabin)

In [5]:
# extracts the title (Mr, Ms, etc) from the name variable

def get_title(passenger):
    line = passenger
    if re.search('Mrs', line):
        return 'Mrs'
    elif re.search('Mr', line):
        return 'Mr'
    elif re.search('Miss', line):
        return 'Miss'
    elif re.search('Master', line):
        return 'Master'
    else:
        return 'Other'
    
data['title'] = data['name'].apply(get_title)

In [6]:
# cast numerical variables as floats

data['fare'] = data['fare'].astype('float')
data['age'] = data['age'].astype('float')

In [7]:
# drop unnecessary variables

data.drop(labels=['name','ticket', 'boat', 'body','home.dest'], axis=1, inplace=True)

# display data
data.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,cabin,embarked,title
0,1,1,female,29.0,0,0,211.3375,B5,S,Miss
1,1,1,male,0.9167,1,2,151.55,C22,S,Master
2,1,0,female,2.0,1,2,151.55,C22,S,Miss
3,1,0,male,30.0,1,2,151.55,C22,S,Mr
4,1,0,female,25.0,1,2,151.55,C22,S,Mrs


In [8]:
# save the data set

data.to_csv('titanic.csv', index=False)

## Data Exploration

### Find numerical and categorical variables

In [9]:
target = 'survived'

In [10]:
vars_num = [var for var in data.columns if data[var].dtypes != 'O']

vars_cat = [var for var in data.columns if data[var].dtypes == 'O']

print('Number of numerical variables: {}'.format(len(vars_num)))
print('Number of categorical variables: {}'.format(len(vars_cat)))

Number of numerical variables: 6
Number of categorical variables: 4


### Find missing values in variables

In [11]:
# first in numerical variables
vars_num_with_na = [var for var in vars_num if data[var].isnull().sum() > 0]

# determine percentage of missing values
data[vars_num_with_na].isnull().mean()


age     0.200917
fare    0.000764
dtype: float64

In [12]:
# now in categorical variables
vars_cat_with_na = [var for var in vars_cat if data[var].isnull().sum() > 0]

# determine percentage of missing values
data[vars_cat_with_na].isnull().mean()


cabin       0.774637
embarked    0.001528
dtype: float64

### Determine cardinality of categorical variables

In [13]:
data[vars_cat].nunique()

sex           2
cabin       181
embarked      3
title         5
dtype: int64

### Determine the distribution of numerical variables

In [14]:
data[vars_num].nunique()

pclass        3
survived      2
age          98
sibsp         7
parch         8
fare        281
dtype: int64

## Separate data into train and test

Use the code below for reproducibility. Don't change it.

In [15]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('survived', axis=1),  # predictors
    data['survived'],  # target
    test_size=0.2,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((1047, 9), (262, 9))

## Feature Engineering

### Extract only the letter (and drop the number) from the variable Cabin

In [17]:
import re

X_train['cabin'] =  X_train['cabin'].str.extract(r"([A-Za-z]+)")
X_train['cabin'].head
X_train.shape

(1047, 9)

### Fill in Missing data in numerical variables:

- Add a binary missing indicator
- Fill NA in original variable with the median

In [18]:
# replace engineer missing values as we described above

for var in vars_num_with_na:

    # calculate the mode using the train set
    mode_val = X_train[var].mode()[0]

    # add binary missing indicator (in train and test)
    X_train[var+'_na'] = np.where(X_train[var].isnull(), 1, 0)
    X_test[var+'_na'] = np.where(X_test[var].isnull(), 1, 0)

    # replace missing values by the mode
    # (in train and test)
    X_train[var] = X_train[var].fillna(mode_val)
    X_test[var] = X_test[var].fillna(mode_val)

# check that we have no more missing values in the engineered variables
X_train[vars_num_with_na].isnull().sum()
X_train.head

<bound method NDFrame.head of       pclass     sex      age  sibsp  parch      fare cabin embarked title  \
1118       3    male  25.0000      0      0    7.9250   NaN        S    Mr   
44         1  female  41.0000      0      0  134.5000     E        C  Miss   
1072       3    male  24.0000      0      0    7.7333   NaN        Q    Mr   
1130       3  female  18.0000      0      0    7.7750   NaN        S  Miss   
574        2    male  29.0000      1      0   21.0000   NaN        S    Mr   
...      ...     ...      ...    ...    ...       ...   ...      ...   ...   
763        3  female   0.1667      1      2   20.5750   NaN        S  Miss   
835        3    male  24.0000      0      0    8.0500   NaN        S    Mr   
1216       3  female  24.0000      0      0    7.7333   NaN        Q  Miss   
559        2  female  20.0000      0      0   36.7500   NaN        S  Miss   
684        3  female  32.0000      1      1   15.5000   NaN        Q   Mrs   

      age_na  fare_na  
1118     

In [19]:
# check the binary missing indicator variables

X_train[vars_num_with_na].head()


Unnamed: 0,age,fare
1118,25.0,7.925
44,41.0,134.5
1072,24.0,7.7333
1130,18.0,7.775
574,29.0,21.0


### Replace Missing data in categorical variables with the string **Missing**

In [20]:
# replace engineer missing values in categorical variable

for var in vars_cat_with_na:

    # calculate the mode using the train set
    mode_val = X_train[var].mode()[0]

    # add binary missing indicator (in train and test)
    X_train[var+'_na'] = np.where(X_train[var].isnull(), 1, 0)
    X_test[var+'_na'] = np.where(X_test[var].isnull(), 1, 0)

    # replace missing values by the mode
    # (in train and test)
    X_train[var] = X_train[var].fillna(mode_val)
    X_test[var] = X_test[var].fillna(mode_val)

# check that we have no more missing values in the engineered variables
X_train[vars_cat_with_na].isnull().sum()

cabin       0
embarked    0
dtype: int64

In [21]:
X_train['cabin']
X_train.shape

(1047, 13)

In [22]:
X_train['cabin'].head()
X_train_with_target = X_train.copy();
X_train_with_target[target] = y_train;
X_train.head

<bound method NDFrame.head of       pclass     sex      age  sibsp  parch      fare cabin embarked title  \
1118       3    male  25.0000      0      0    7.9250     C        S    Mr   
44         1  female  41.0000      0      0  134.5000     E        C  Miss   
1072       3    male  24.0000      0      0    7.7333     C        Q    Mr   
1130       3  female  18.0000      0      0    7.7750     C        S  Miss   
574        2    male  29.0000      1      0   21.0000     C        S    Mr   
...      ...     ...      ...    ...    ...       ...   ...      ...   ...   
763        3  female   0.1667      1      2   20.5750     C        S  Miss   
835        3    male  24.0000      0      0    8.0500     C        S    Mr   
1216       3  female  24.0000      0      0    7.7333     C        Q  Miss   
559        2  female  20.0000      0      0   36.7500     C        S  Miss   
684        3  female  32.0000      1      1   15.5000     C        Q   Mrs   

      age_na  fare_na  cabin_na  

### Remove rare labels in categorical variables

- remove labels present in less than 5 % of the passengers

In [23]:
def find_frequent_labels(df, var, rare_perc):
    
    # function finds the labels that are shared by more than
    # a certain % of the houses in the dataset

    df = df.copy()

    tmp = df.groupby(var)['survived'].count() / len(df)

    return tmp[tmp > rare_perc].index


for var in vars_cat:
    
    # find the frequent categories
    X_train_with_target = X_train.copy();
    X_train_with_target[target] = y_train;
    frequent_ls = find_frequent_labels(X_train_with_target, var, 0.05)
    
    # replace rare categories by the string "Rare"
    X_train[var] = np.where(X_train[var].isin(
        frequent_ls), X_train[var], 'Rare')
    
    X_test[var] = np.where(X_test[var].isin(
        frequent_ls), X_test[var], 'Rare')

In [24]:
rows, cols = np.where(X_train == 'Rare')


  result = method(y)


In [25]:
r = np.where(X_train['cabin']!='Rare')

X_train.iloc[r]


Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin,embarked,title,age_na,fare_na,cabin_na,embarked_na
1118,3,male,25.0000,0,0,7.9250,C,S,Mr,0,0,1,0
1072,3,male,24.0000,0,0,7.7333,C,Q,Mr,1,0,1,0
1130,3,female,18.0000,0,0,7.7750,C,S,Miss,0,0,1,0
574,2,male,29.0000,1,0,21.0000,C,S,Mr,0,0,1,0
500,2,male,46.0000,0,0,26.0000,C,S,Mr,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
763,3,female,0.1667,1,2,20.5750,C,S,Miss,0,0,1,0
835,3,male,24.0000,0,0,8.0500,C,S,Mr,1,0,1,0
1216,3,female,24.0000,0,0,7.7333,C,Q,Miss,1,0,1,0
559,2,female,20.0000,0,0,36.7500,C,S,Miss,0,0,1,0


### Perform one hot encoding of categorical variables into k-1 binary variables

- k-1, means that if the variable contains 9 different categories, we create 8 different binary variables
- Remember to drop the original categorical variable (the one with the strings) after the encoding

In [28]:
# this function will assign discrete values to the strings of the variables,
# so that the smaller value corresponds to the category that shows the smaller
# mean house sale price


def replace_categories(train, test, var, target):

    # order the categories in a variable from that with the lowest
    # house sale price, to that with the highest
    train_temp = train.copy()
    train_temp[target] = y_train;
    ordered_labels = train_temp.groupby([var])[target].mean().sort_values().index
    display(ordered_labels)

    # create a dictionary of ordered categories to integer values
    ordinal_label = {k: i for i, k in enumerate(ordered_labels, 0)}

    # use the dictionary to replace the categorical strings by integers
    train[var] = train[var].map(ordinal_label)
    test[var] = test[var].map(ordinal_label)
    

In [29]:
for var in vars_cat:
    replace_categories(X_train, X_test, var, 'survived')
    display(var)

Index(['male', 'female'], dtype='object', name='sex')

'sex'

Index(['C', 'Rare'], dtype='object', name='cabin')

'cabin'

Index(['S', 'Q', 'C'], dtype='object', name='embarked')

'embarked'

Index(['Mr', 'Rare', 'Miss', 'Mrs'], dtype='object', name='title')

'title'

In [32]:
X_train['embarked'].head()

1118    0
44      2
1072    1
1130    0
574     0
Name: embarked, dtype: int64

In [33]:
# check absence of na in the train set
[var for var in X_train.columns if X_train[var].isnull().sum() > 0]

[]

In [35]:
# check absence of na in the test set
[var for var in X_test.columns if X_test[var].isnull().sum() > 0]


[]

### Scale the variables

- Use the standard scaler from Scikit-learn

In [36]:
import warnings
warnings.simplefilter(action='ignore')

# feature scaling
from sklearn.preprocessing import MinMaxScaler

train_vars = X_train.columns

# create scaler
scaler = MinMaxScaler()

#  fit  the scaler to the train set
scaler.fit(X_train[train_vars]) 

# transform the train and test set
X_train[train_vars] = scaler.transform(X_train[train_vars])

X_test[train_vars] = scaler.transform(X_test[train_vars])



## Train the Logistic Regression model

- Set the regularization parameter to 0.0005
- Set the seed to 0

In [37]:
# to build the model
from sklearn.linear_model import Lasso
# set up the model
# remember to set the random_state / seed

lin_model = Lasso(alpha=0.0005, random_state=0)

# train the model

lin_model.fit(X_train, y_train)

Lasso(alpha=0.0005, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=0,
      selection='cyclic', tol=0.0001, warm_start=False)

## Make predictions and evaluate model performance

Determine:
- roc-auc
- accuracy

**Important, remember that to determine the accuracy, you need the outcome 0, 1, referring to survived or not. But to determine the roc-auc you need the probability of survival.**

In [42]:
# to evaluate the models
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

# to persist the model and the scaler
import joblib

# make predictions for train set
pred_train = lin_model.predict(X_train)

# determine mse and rmse
print('train mse: {}'.format(int(
    mean_squared_error(y_train, pred_train))))
print('train rmse: {}'.format(int(
    sqrt(mean_squared_error(y_train, pred_train)))))
print('train r2: {}'.format(
    r2_score(y_train, pred_train)))
print()

# make predictions for test set
pred_test = lin_model.predict(X_test)

# determine mse and rmse
print('test mse: {}'.format(int(
    mean_squared_error(y_test, pred_test))))
print('test rmse: {}'.format(int(
    sqrt(mean_squared_error(y_test, pred_test)))))
print('test r2: {}'.format(
    r2_score(y_test, pred_test)))
print()


train mse: 0
train rmse: 0
train r2: 0.39444532815389866

test mse: 0
test rmse: 0
test r2: 0.4013770254759793



That's it! Well done

**Keep this code safe, as we will use this notebook later on, to build production code, in our next assignement!!**

In [64]:
res = np.append(y_train.to_numpy(), pred_train, axis=0)
res.shape

(2094,)

In [80]:
p_train = 1*(pred_train > 0.5)
error_train = sum(abs(p_train - y_train)) / len(p_train) * 100
p_test = 1*(pred_test > 0.5)
error_test = sum(abs(p_test - y_test)) / len(p_test) * 100

In [81]:
print('error train: ', error_train)
print('error test: ', error_test)

error train:  19.579751671442217
error test:  20.610687022900763
