# Titanic Bayes

Using the Titanic dataset:
- Clean the data: handle missing values by removing or filling them, and convert non-numerical columns to numeric values.
- Build Gaussian and Bernoulli Naive Bayes models to predict each passenger's survival status (1 = survived, 0 = did not survive).
- Compare the two models. Did one perform better than the other? How do they compare with the other classification algorithms, logistic regression and decision trees?

In [1]:
import pandas as pd
import numpy as np

In [3]:
filename = "titanic-1.xls"
df = pd.read_excel(filename)

df.head() #first 5 rows

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [4]:
df.describe()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881135,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.4135,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.1667,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0


In [6]:
len(df)

1309

In [5]:
df.isnull().sum() 

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [11]:
df.dtypes

pclass         int64
survived       int64
name          object
sex           object
age          float64
sibsp          int64
parch          int64
ticket        object
fare         float64
cabin         object
embarked      object
boat          object
body         float64
home.dest     object
dtype: object

In [7]:
df.dropna(subset=["age"], inplace = True)

In [8]:
df.dropna(subset=["fare"], inplace = True)

In [49]:
df.dropna(subset=["embarked"], inplace = True)
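
The three `dropna` calls above take the removal route, which discards the 263 rows with a missing age. Filling is the other option the task mentions. A minimal sketch on a hypothetical toy frame (the names `toy` and `filled` and the values are illustrative, not taken from the dataset), using the median for numeric columns and the most frequent value for `embarked`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same three incomplete columns.
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "fare": [211.3375, 151.55, np.nan, 151.55],
    "embarked": ["S", "S", None, "C"],
})

filled = toy.copy()
# Numeric columns: fill gaps with the column median.
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["fare"] = filled["fare"].fillna(filled["fare"].median())
# Categorical column: fill gaps with the most frequent value.
filled["embarked"] = filled["embarked"].fillna(filled["embarked"].mode()[0])

print(filled)
```

Whether filling beats dropping here is an empirical question; the sketch only shows the mechanics.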

In [55]:
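# Preview only: the result is not assigned, so df['age'] stays float64 (ages under 1 would truncate to 0)
df['age'].astype(int)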

0       29
1        0
2        2
3       30
4       25
        ..
1301    45
1304    14
1306    26
1307    27
1308    29
Name: age, Length: 1043, dtype: int32

## Gaussian Naïve Bayes

In [56]:
from sklearn.naive_bayes import GaussianNB   # Gaussian Naive Bayes classifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [64]:
modeldf = df[['pclass', 'sex', 'age', 'fare', 'survived']]
modeldf.head()

Unnamed: 0,pclass,sex,age,fare,survived
0,1,female,29.0,211.3375,1
1,1,male,0.9167,151.55,1
2,1,female,2.0,151.55,0
3,1,male,30.0,151.55,0
4,1,female,25.0,151.55,0


In [65]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning raised by writing into a slice of df
modeldf = modeldf.assign(sex=modeldf['sex'].map({'female': 0, 'male': 1}))
modeldf.head()


Unnamed: 0,pclass,sex,age,fare,survived
0,1,0,29.0,211.3375,1
1,1,1,0.9167,151.55,1
2,1,0,2.0,151.55,0
3,1,1,30.0,151.55,0
4,1,0,25.0,151.55,0
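
The mapping above works for `sex` because it has two values. For a column with more than two categories, such as `embarked` (cleaned earlier but not used in the model), one-hot encoding is a common alternative. A hedged sketch on a hypothetical three-row frame; the `toy` and `encoded` names and the rows are illustrative only:

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "sex": ["female", "male", "female"],
    "embarked": ["S", "C", "Q"],
})

# One indicator column per category, stored as 0/1 integers.
encoded = pd.get_dummies(toy, columns=["embarked"], dtype=int)
# Binary column: the same mapping used in the notebook.
encoded["sex"] = encoded["sex"].map({"female": 0, "male": 1})

print(encoded)
```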


In [66]:
modeldf.corr()

Unnamed: 0,pclass,sex,age,fare,survived
pclass,1.0,0.141032,-0.409082,-0.564558,-0.317737
sex,0.141032,1.0,0.066007,-0.1864,-0.536332
age,-0.409082,0.066007,1.0,0.177205,-0.057416
fare,-0.564558,-0.1864,0.177205,1.0,0.247858
survived,-0.317737,-0.536332,-0.057416,0.247858,1.0


In [67]:
# feature matrix: every column except the target
X = modeldf.drop('survived', axis=1)

# target column: survival status (1 = survived, 0 = did not survive)
y = modeldf['survived']

In [68]:
# test_size defaults to 0.25, which holds out 261 of the 1043 rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=50)

In [69]:
gnb = GaussianNB()

In [70]:
gnb.fit(X_train, y_train)

GaussianNB()

In [71]:
gnb.score(X_train, y_train)

0.7647058823529411

In [72]:
y_pred = gnb.predict(X_test)

In [73]:
cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Died', 'Predicted Survived'],
    index=['True Died', 'True Survived']
)

cm

Unnamed: 0,Predicted Died,Predicted Survived
True Died,123,24
True Survived,31,83


In [74]:
y_test.value_counts()

0    147
1    114
Name: survived, dtype: int64

In [75]:
gnb.score(X_test, y_test)

0.789272030651341

In [76]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.84      0.82       147
           1       0.78      0.73      0.75       114

    accuracy                           0.79       261
   macro avg       0.79      0.78      0.78       261
weighted avg       0.79      0.79      0.79       261
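
The figures in this report can be recomputed by hand from the four counts in the Gaussian confusion matrix above (rows are true classes 0 and 1, columns are predicted classes 0 and 1). The variable names below are illustrative; the counts are copied from the matrix:

```python
# Counts from the Gaussian confusion matrix.
tn, fp = 123, 24   # true class 0: predicted 0, predicted 1
fn, tp = 31, 83    # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Class 1 (survived)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (did not survive)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy={accuracy:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
```

The rounded values agree with the report: accuracy 0.79, class 1 at 0.78 / 0.73 / 0.75, and class 0 at 0.80 / 0.84.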



## Bernoulli Naïve Bayes

In [77]:
from sklearn.naive_bayes import BernoulliNB

In [78]:
bnb = BernoulliNB()

In [79]:
bnb.fit(X_train, y_train)

BernoulliNB()

In [80]:
bnb.score(X_train, y_train)

0.7762148337595908
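
One point worth checking about this fit: `BernoulliNB` binarizes its inputs, and with the default `binarize=0.0` every value above zero becomes 1. According to `df.describe()`, `pclass` and `age` are always positive and `fare` has a minimum of 0, so after binarization those columns are constant or nearly so, and `sex` carries most of the signal. A small sketch using the five feature rows shown earlier (`X_head` and `X_bin` are illustrative names, and the comparison is a manual stand-in for what the estimator does internally):

```python
import pandas as pd

# First five rows of the feature table shown earlier (sex already mapped to 0/1).
X_head = pd.DataFrame({
    "pclass": [1, 1, 1, 1, 1],
    "sex": [0, 1, 0, 1, 0],
    "age": [29.0, 0.9167, 2.0, 30.0, 25.0],
    "fare": [211.3375, 151.55, 151.55, 151.55, 151.55],
})

# Manual equivalent of binarize=0.0: values above 0 become 1, the rest 0.
X_bin = (X_head > 0.0).astype(int)

# Columns with a single value after binarization carry no information.
constant_cols = [c for c in X_bin.columns if X_bin[c].nunique() == 1]

print(X_bin)
print("constant after binarization:", constant_cols)
```

If the Bernoulli model is meant to use age or fare, one option is to bin or threshold those columns explicitly before fitting rather than rely on the default.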

In [81]:
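# Predict with the Bernoulli model. The cell as originally run called gnb.predict and gnb.score,
# so the outputs recorded below repeat the Gaussian results; re-run to get the Bernoulli figures.
y_pred = bnb.predict(X_test)

In [82]:
cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Died', 'Predicted Survived'],
    index=['True Died', 'True Survived']
)

cm

Unnamed: 0,Predicted Died,Predicted Survived
True Died,123,24
True Survived,31,83


In [83]:
bnb.score(X_test, y_test)

0.789272030651341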
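
The task also asks how the two Naive Bayes models compare with logistic regression and decision trees, which the notebook does not yet do. One possible way to set up that comparison is 5-fold cross-validation over all four classifiers. The sketch below runs on synthetic stand-in data from `make_classification` (an assumption made so the example is self-contained), so its scores say nothing about the Titanic data; substitute the notebook's `X` and `y` to get meaningful numbers. The names `X_demo`, `y_demo`, `models`, and `scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); replace with the notebook's X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, random_state=50
)

models = {
    "GaussianNB": GaussianNB(),
    "BernoulliNB": BernoulliNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=50),
}

# Mean accuracy over 5 folds for each classifier.
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Cross-validation is used here instead of a single split because one 261-row test set can favor a model by chance.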