NB Titanic Tutorial
Converted to Python by Matthew Pecsok from Dr. Olivia Sheng's original tutorial in R
June 12, 2021

1 Data description

2 Library Setup

3 Overall data inspection

4 NB model building using sklearn package

5 Explanatory data exploration

6 Generate performance metrics

7 Simple hold-out evaluation


# 1 Data description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others.

VARIABLE DESCRIPTIONS:

PassengerID - Unique passenger identifier

Survived - Survival (0 = No; 1 = Yes)

Pclass - Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd). Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Name

Sex

Age - Age in years; fractional if less than one. Estimated ages take the form xx.5

SibSp - Number of siblings/spouses aboard

Parch - Number of parents/children aboard

Ticket - Ticket number

Fare - Passenger fare

Cabin

Embarked - Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)


# 2 Library Setup

Reference: the scikit-learn Naive Bayes user guide, https://scikit-learn.org/stable/modules/naive_bayes.html

In [None]:
# Standard library
import os

# Data handling
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt

# Modeling and evaluation
from sklearn import metrics
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.naive_bayes import CategoricalNB, GaussianNB


# 3 Overall data inspection

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
titanic = pd.read_csv("/content/drive/MyDrive/data_sets/titanic_cleaned.csv")

In [None]:
type(titanic)

pandas.core.frame.DataFrame

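Unlike scikit-learn's built-in datasets, which load as a Bunch (https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html), this CSV loads directly as a pandas DataFrame.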

In [None]:
titanic.shape

(714, 9)

In [None]:
titanic.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin',
       'Embarked'],
      dtype='object')

The data is already a pandas DataFrame, so it can be inspected directly for exploratory data analysis.

In [None]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714 entries, 0 to 713
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  714 non-null    int64  
 1   Pclass    714 non-null    int64  
 2   Sex       714 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     714 non-null    int64  
 5   Parch     714 non-null    int64  
 6   Fare      714 non-null    float64
 7   Cabin     714 non-null    object 
 8   Embarked  714 non-null    object 
dtypes: float64(2), int64(4), object(3)
memory usage: 50.3+ KB


In [None]:
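# keep only the target (Survived) and the categorical columns
# Cabin is also dropped because it currently causes issues when splitting
# .copy() avoids a SettingWithCopyWarning when Pclass is reassigned later
titanic = titanic[['Survived','Sex','Embarked','Pclass']].copy()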

In [None]:
titanic

Unnamed: 0,Survived,Sex,Embarked,Pclass
0,0,male,S,3
1,1,female,C,1
2,1,female,S,3
3,1,female,S,1
4,0,male,S,3
...,...,...,...,...
709,0,female,Q,3
710,0,male,S,2
711,1,female,S,1
712,1,male,C,1


In [None]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714 entries, 0 to 713
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Survived  714 non-null    int64 
 1   Sex       714 non-null    object
 2   Embarked  714 non-null    object
 3   Pclass    714 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 22.4+ KB


In [None]:
titanic.describe(include='all')

Unnamed: 0,Survived,Sex,Embarked,Pclass
count,714.0,714,714,714.0
unique,,2,4,
top,,male,S,
freq,,453,554,
mean,0.406162,,,2.236695
std,0.49146,,,0.83825
min,0.0,,,1.0
25%,0.0,,,1.0
50%,0.0,,,2.0
75%,1.0,,,3.0
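The summary reports how many unique values each categorical column has, but not which ones. A minimal sketch of two follow-up checks, using a small hypothetical frame (`toy`) rather than the real data: `value_counts` lists every category, and a `groupby` gives the survival rate per group.

```python
import pandas as pd

# Hypothetical six-row frame that reuses the tutorial's column names.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Embarked": ["S", "C", "S", "missing", "Q", "S"],
})

# Which categories exist in Embarked, and how often does each occur?
counts = toy["Embarked"].value_counts()

# Mean of a 0/1 target per group is the survival rate for that group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

print(counts)
print(rate_by_sex)
```

The same two calls can be run on `titanic` to see the four `Embarked` categories counted by `describe`.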


## 3.1 Dummy encoding the dataframe

In [None]:
titanic.head(2)

Unnamed: 0,Survived,Sex,Embarked,Pclass
0,0,male,S,3
1,1,female,C,1


## 3.2 Encode the data

In [None]:
# convert Pclass to string so that get_dummies treats it as a categorical column
titanic['Pclass'] = titanic['Pclass'].astype(str)
titanic.dtypes


Survived     int64
Sex         object
Embarked    object
Pclass      object
dtype: object

In [None]:
titanic_enc = pd.get_dummies(titanic)

In [None]:
titanic_enc.dtypes

Survived            int64
Sex_female          uint8
Sex_male            uint8
Embarked_C          uint8
Embarked_Q          uint8
Embarked_S          uint8
Embarked_missing    uint8
Pclass_1            uint8
Pclass_2            uint8
Pclass_3            uint8
dtype: object

In [None]:
titanic_enc.head(2)

Unnamed: 0,Survived,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Embarked_missing,Pclass_1,Pclass_2,Pclass_3
0,0,0,1,0,0,1,0,0,0,1
1,1,1,0,1,0,0,0,1,0,0
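The cast of `Pclass` to string matters because `pd.get_dummies`, called without a `columns` argument, only expands object, string, and category columns and passes numeric columns through unchanged. A minimal sketch on a hypothetical three-row frame (`toy` is not part of the tutorial data):

```python
import pandas as pd

# Hypothetical frame that reuses the tutorial's column names.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 3],
})

# Integer Pclass is left as a single numeric column; only Sex is expanded.
as_int = pd.get_dummies(toy)

# After casting to str, Pclass is expanded into one column per class.
as_str = pd.get_dummies(toy.assign(Pclass=toy["Pclass"].astype(str)))

print(list(as_int.columns))
print(list(as_str.columns))
```

Depending on the pandas version, the dummy columns may be reported as `bool` rather than `uint8`.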


# 4 Build an NB model and use cross-validation to see how it performs across folds

In [None]:
y = titanic_enc.pop('Survived')

In [None]:
cnb = CategoricalNB() # create a categorical NB model
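To make the model less of a black box, here is a minimal sketch of what `CategoricalNB` estimates once fitted. The data is hypothetical: one binary feature (think of a dummy column such as `Sex_female`) and a binary target. With the default `alpha=1.0`, each conditional probability is the smoothed ratio (count + alpha) / (class total + alpha * number of categories).

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical data: one binary feature and a binary target.
X = np.array([[0], [0], [1], [1], [1]])
y = np.array([0, 0, 0, 1, 1])

model = CategoricalNB()  # default alpha=1.0 (Laplace smoothing)
model.fit(X, y)

# Raw counts for feature 0: rows are classes, columns are categories.
counts = model.category_count_[0]

# Smoothed estimate computed by hand for comparison.
alpha = 1.0
manual = (counts + alpha) / (
    counts.sum(axis=1, keepdims=True) + alpha * counts.shape[1]
)

# The fitted model stores the same quantity on the log scale.
fitted = np.exp(model.feature_log_prob_[0])

print(manual)
print(fitted)
```

The smoothing is what keeps a category that never appears with a class (here, value 0 with class 1) from receiving a probability of exactly zero.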

In [None]:
cross_validate(
    cnb, titanic_enc, y, cv=5, scoring=['f1','accuracy','recall','precision'],return_train_score=False)

{'fit_time': array([0.00794125, 0.00314331, 0.00313854, 0.00303268, 0.00302672]),
 'score_time': array([0.00393271, 0.00348783, 0.0033462 , 0.00332475, 0.00330782]),
 'test_accuracy': array([0.74825175, 0.81118881, 0.74825175, 0.76223776, 0.8028169 ]),
 'test_f1': array([0.70967742, 0.76521739, 0.7       , 0.70175439, 0.75438596]),
 'test_precision': array([0.66666667, 0.77192982, 0.67741935, 0.71428571, 0.76785714]),
 'test_recall': array([0.75862069, 0.75862069, 0.72413793, 0.68965517, 0.74137931])}

In [None]:
scores = cross_validate(
    cnb, titanic_enc, y, cv=5, scoring=['f1','accuracy','recall','precision'],return_train_score=False)
pd.DataFrame(scores)

Unnamed: 0,fit_time,score_time,test_f1,test_accuracy,test_recall,test_precision
0,0.009446,0.007418,0.709677,0.748252,0.758621,0.666667
1,0.003784,0.003838,0.765217,0.811189,0.758621,0.77193
2,0.003398,0.003712,0.7,0.748252,0.724138,0.677419
3,0.00352,0.003814,0.701754,0.762238,0.689655,0.714286
4,0.003314,0.003671,0.754386,0.802817,0.741379,0.767857
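For a classifier and an integer `cv`, `cross_validate` splits the data with an unshuffled `StratifiedKFold`, fits a fresh copy of the model on each training part, and scores it on the held-out part. The sketch below reproduces that loop by hand on synthetic data (the arrays `X` and `y` are made up for illustration, not the Titanic data) and checks that the per-fold accuracy matches.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import CategoricalNB

# Synthetic data: three binary features; the target follows the first
# feature, with roughly 20% of the labels flipped.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3))
flip = rng.random(200) < 0.2
y = np.where(flip, 1 - X[:, 0], X[:, 0])

# Manual 5-fold loop: fit on four folds, score on the remaining one.
manual = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    model = clone(CategoricalNB()).fit(X[train_idx], y[train_idx])
    manual.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# The same evaluation through cross_validate.
auto = cross_validate(CategoricalNB(), X, y, cv=5, scoring=["accuracy"])

print(np.round(manual, 4))
print(np.round(auto["test_accuracy"], 4))
```

Because the splitter does not shuffle by default, repeated calls give the same folds, which is why the scores in consecutive cells agree and only the timings differ.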


In [None]:
five_fold = pd.DataFrame(scores)


In [None]:
print("mean\n\n",five_fold.mean(axis=0))
print("\n\nstd\n\n",five_fold.std(axis=0))

mean

 fit_time          0.004692
score_time        0.004490
test_f1           0.726207
test_accuracy     0.774549
test_recall       0.734483
test_precision    0.719632
dtype: float64


std

 fit_time          0.002663
score_time        0.001638
test_f1           0.031120
test_accuracy     0.030316
test_recall       0.028850
test_precision    0.049185
dtype: float64
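The two print statements can also be collapsed into a single table with `DataFrame.agg`. The sketch below is an optional alternative, not the tutorial's method; it re-enters the per-fold F1 and accuracy values from the 5-fold table so that it runs on its own. Note that pandas `std` is the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Per-fold scores copied from the 5-fold results table (rounded values).
five_fold_scores = pd.DataFrame({
    "test_f1": [0.709677, 0.765217, 0.7, 0.701754, 0.754386],
    "test_accuracy": [0.748252, 0.811189, 0.748252, 0.762238, 0.802817],
})

# One row per statistic, one column per metric.
summary = five_fold_scores.agg(["mean", "std"])
print(summary.round(4))
```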


In [None]:
scores = cross_validate(
    cnb, titanic_enc, y, cv=10, scoring=['f1','accuracy','recall','precision'],return_train_score=False)
pd.DataFrame(scores)

Unnamed: 0,fit_time,score_time,test_f1,test_accuracy,test_recall,test_precision
0,0.00993,0.008464,0.75,0.777778,0.827586,0.685714
1,0.00754,0.004149,0.666667,0.722222,0.689655,0.645161
2,0.003653,0.003589,0.716981,0.791667,0.655172,0.791667
3,0.004659,0.003572,0.806452,0.833333,0.862069,0.757576
4,0.00327,0.003544,0.644068,0.704225,0.655172,0.633333
5,0.003222,0.003388,0.741935,0.774648,0.793103,0.69697
6,0.004312,0.003587,0.724138,0.774648,0.724138,0.724138
7,0.003194,0.003349,0.690909,0.760563,0.655172,0.730769
8,0.00317,0.003374,0.758621,0.802817,0.758621,0.758621
9,0.003345,0.003405,0.75,0.802817,0.724138,0.777778


In [None]:
scores = cross_validate(
    cnb, titanic_enc, y, cv=10, scoring=['f1','accuracy','recall','precision'],return_train_score=False)
ten_fold = pd.DataFrame(scores)

print("mean\n\n",ten_fold.mean(axis=0))
print("std\n\n",ten_fold.std(axis=0))

        

mean

 fit_time          0.003680
score_time        0.003729
test_f1           0.724977
test_accuracy     0.774472
test_recall       0.734483
test_precision    0.720173
dtype: float64
std

 fit_time          0.001171
score_time        0.000971
test_f1           0.047705
test_accuracy     0.038350
test_recall       0.074580
test_precision    0.054087
dtype: float64
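The means are nearly identical for 5 and 10 folds, while the standard deviations are larger for 10 folds. One plausible contributor is that each 10-fold test set is about half the size, so every per-fold score rests on fewer passengers. The sketch below only illustrates the fold sizes; the labels are hypothetical, built to match the dataset's 714 rows and its reported survival rate (about 290 survivors).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 290 survivors and 424 non-survivors (714 rows).
y = np.array([1] * 290 + [0] * 424)
X = np.zeros((len(y), 1))  # placeholder features; only the split matters


def fold_sizes(n_splits):
    """Return the number of test rows in each stratified fold."""
    splitter = StratifiedKFold(n_splits=n_splits)
    return [len(test_idx) for _, test_idx in splitter.split(X, y)]


sizes_5 = fold_sizes(5)
sizes_10 = fold_sizes(10)
print(sizes_5)
print(sizes_10)
```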


