## Machine Learning Tutorial 15: Naive Bayes Tutorial Part 1

#### Predicting survival on the Titanic

This is part 1 of a machine learning tutorial on the Naive Bayes classifier. Naive Bayes applies Bayes' theorem for conditional probability under the "naive" assumption that the features are conditionally independent of each other given the class, and uses it to estimate the probability of the target variable given the observed feature values. We will use the Titanic survival dataset and a Naive Bayes classifier to estimate the survival probability of Titanic passengers. This beginner tutorial uses Python and the `sklearn` library. `GaussianNB` is the classifier we train here. There are other classifiers such as `MultinomialNB`, which we will use in part 2 of the tutorial.
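
For reference, the idea in the paragraph above can be written out in standard textbook form. The symbols are notation chosen for this sketch, not taken from the notebook: $y$ is the target (survived or not), $x_1, \dots, x_n$ are the features, and $\mu_{y,i}$, $\sigma_{y,i}^2$ are the mean and variance of feature $i$ within class $y$. The first expression follows from Bayes' theorem plus the independence assumption; the second is the Gaussian form that a Gaussian Naive Bayes model assumes for each continuous feature.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}}
\exp\!\left(-\frac{(x_i - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}}\right)
```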

#### Topics covered:
* Introduction
* Basics of probability
* Conditional probability
* Bayes Theorem
* Titanic survival prediction
* GaussianNB classifier

In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

In [2]:
df = pd.read_csv("titanic.csv")
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


In [3]:
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis='columns', inplace=True)
df.head()

|   | Survived | Pclass | Sex | Age | Fare |
|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 7.25 |
| 1 | 1 | 1 | female | 38.0 | 71.2833 |
| 2 | 1 | 3 | female | 26.0 | 7.925 |
| 3 | 1 | 1 | female | 35.0 | 53.1 |
| 4 | 0 | 3 | male | 35.0 | 8.05 |


In [4]:
inputs = df.drop('Survived', axis='columns')
target = df.Survived

In [5]:
dummies = pd.get_dummies(inputs.Sex)
dummies.head(3)

|   | female | male |
|---|---|---|
| 0 | False | True |
| 1 | True | False |
| 2 | True | False |


In [6]:
inputs = pd.concat([inputs, dummies], axis='columns')
inputs.head(3)

|   | Pclass | Sex | Age | Fare | female | male |
|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 7.25 | False | True |
| 1 | 1 | female | 38.0 | 71.2833 | True | False |
| 2 | 3 | female | 26.0 | 7.925 | True | False |


**Here we also drop the `male` column to avoid the dummy variable trap: `male` is always the opposite of `female`, so one column is enough to represent male vs. female.**

In [7]:
inputs.drop(['Sex','male'], axis='columns', inplace=True)
inputs.head(3)

|   | Pclass | Age | Fare | female |
|---|---|---|---|---|
| 0 | 3 | 22.0 | 7.25 | False |
| 1 | 1 | 38.0 | 71.2833 | True |
| 2 | 3 | 26.0 | 7.925 | True |
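
The encode-then-drop steps above can also be done in one call. The sketch below uses the `drop_first` parameter of `pd.get_dummies` on a small made-up frame (the `toy` data is hypothetical, not rows from `titanic.csv`). Note that `drop_first=True` drops the first category in sorted order, so it keeps `male` rather than `female`, the mirror image of the choice made in this notebook.

```python
import pandas as pd

# Toy data for illustration only; not rows from titanic.csv.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# drop_first=True keeps k-1 indicator columns for a k-level category.
# Categories are sorted, so 'female' is dropped and 'male' is kept.
encoded = pd.get_dummies(toy["Sex"], drop_first=True)

print(list(encoded.columns))
print(encoded)
```

Either column carries the same information, so which one is kept should not matter for the model.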


In [9]:
inputs.columns[inputs.isna().any()]

Index(['Age'], dtype='object')

In [10]:
inputs.Age[:10]

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

In [11]:
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head()

|   | Pclass | Age | Fare | female |
|---|---|---|---|---|
| 0 | 3 | 22.0 | 7.25 | False |
| 1 | 1 | 38.0 | 71.2833 | True |
| 2 | 3 | 26.0 | 7.925 | True |
| 3 | 1 | 35.0 | 53.1 | True |
| 4 | 3 | 35.0 | 8.05 | False |
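
The cell above fills missing ages with the mean over all rows, which is computed before the data is split. A common alternative, sketched here with made-up numbers, is to compute the fill value from the training rows only, so that nothing from the test rows influences preprocessing. The names `train_age`, `test_age` and `train_mean` are hypothetical and this is not what the notebook does.

```python
import pandas as pd

# Made-up ages for illustration; None marks a missing value.
train_age = pd.Series([22.0, 38.0, None, 35.0])
test_age = pd.Series([None, 54.0])

# Estimate the fill value from the training rows only.
# Series.mean() skips missing values by default.
train_mean = train_age.mean()

# Apply the same training-derived value to both parts.
train_filled = train_age.fillna(train_mean)
test_filled = test_age.fillna(train_mean)

print(train_mean)
print(test_filled.tolist())
```

On a dataset of this size the difference is likely to be small, but the habit carries over to cases where it matters.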


In [13]:
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.3)
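
Because the split above is random, the score and the sampled rows change from run to run. A minimal sketch of a reproducible split follows, using the `random_state` and `stratify` parameters of `train_test_split`. The toy `X` and `y` stand in for `inputs` and `target`, and the seed value 42 is an arbitrary choice.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for `inputs` and `target`; not the Titanic data.
X = [[i] for i in range(10)]
y = [0, 1] * 5

# random_state fixes the shuffle (42 is arbitrary);
# stratify=y keeps the class ratio similar in both parts.
split_a = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
split_b = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

X_train, X_test, y_train, y_test = split_a
print(len(X_train), len(X_test))
print(split_a == split_b)  # same seed, same split
```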

In [14]:
model = GaussianNB()

In [15]:
model.fit(X_train, y_train)
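
To make the fitting step less of a black box, here is a from-scratch sketch of what a Gaussian Naive Bayes model estimates: a prior for each class, plus a mean and variance of each feature within that class. It handles a single feature, uses made-up fares, and leaves out the variance smoothing that scikit-learn applies, so it illustrates the idea rather than reproducing `GaussianNB` exactly. The function names `fit_gaussian_nb` and `posterior` are hypothetical.

```python
import math
from statistics import fmean, pvariance


def fit_gaussian_nb(xs, ys):
    """Per class: (prior, mean, population variance) of one feature."""
    params = {}
    for c in sorted(set(ys)):
        vals = [x for x, y in zip(xs, ys) if y == c]
        params[c] = (len(vals) / len(xs), fmean(vals), pvariance(vals))
    return params


def posterior(params, x):
    """Normalised P(class | x), computed as prior * Gaussian likelihood."""
    scores = {}
    for c, (prior, mu, var) in params.items():
        likelihood = math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
        scores[c] = prior * likelihood
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Made-up fares for illustration: class 1 paid more on average.
fares = [5.0, 10.0, 15.0, 30.0, 50.0, 70.0]
survived = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(fares, survived)
probs = posterior(params, 25.0)
print(params)
print(probs)
```

With several features, the per-feature likelihoods are multiplied together (or their logarithms are summed), which is where the independence assumption enters.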

In [16]:
model.score(X_test, y_test)

0.7723880597014925

In [17]:
X_test[0:10]

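|   | Pclass | Age | Fare | female |
|---|---|---|---|---|
| 744 | 3 | 31.0 | 7.925 | False |
| 430 | 1 | 28.0 | 26.55 | False |
| 77 | 3 | 29.699118 | 8.05 | False |
| 841 | 2 | 16.0 | 10.5 | False |
| 281 | 3 | 28.0 | 7.8542 | False |
| 656 | 3 | 29.699118 | 7.8958 | False |
| 682 | 3 | 20.0 | 9.225 | False |
| 126 | 3 | 29.699118 | 7.75 | False |
| 318 | 1 | 31.0 | 164.8667 | True |
| 534 | 3 | 30.0 | 8.6625 | True |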


In [18]:
y_test[0:10]

744    1
430    1
77     0
841    0
281    0
656    0
682    0
126    0
318    1
534    0
Name: Survived, dtype: int64

In [19]:
model.predict(X_test[0:10])

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=int64)

In [20]:
model.predict_proba(X_test[:10])

array([[9.66459536e-01, 3.35404643e-02],
       [7.73348435e-01, 2.26651565e-01],
       [9.65909576e-01, 3.40904237e-02],
       [9.06556859e-01, 9.34431414e-02],
       [9.65045380e-01, 3.49546202e-02],
       [9.65886166e-01, 3.41138336e-02],
       [9.59736931e-01, 4.02630693e-02],
       [9.65863473e-01, 3.41365275e-02],
       [1.94974568e-05, 9.99980503e-01],
       [4.42982674e-01, 5.57017326e-01]])
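
Each row of the `predict_proba` output holds the probability of class 0 (did not survive) followed by class 1 (survived), and `predict` returns the class with the larger probability. The check below reuses the numbers printed above, rounded to four decimals, together with the labels from `y_test[0:10]`. It needs no model, only the printed values.

```python
# Rounded copies of the predict_proba rows and y_test labels shown above.
proba = [
    [0.9665, 0.0335], [0.7733, 0.2267], [0.9659, 0.0341], [0.9066, 0.0934],
    [0.9650, 0.0350], [0.9659, 0.0341], [0.9597, 0.0403], [0.9659, 0.0341],
    [0.0000, 1.0000], [0.4430, 0.5570],
]
actual = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# Pick the column index with the larger probability in each row.
predicted = [row.index(max(row)) for row in proba]

# Share of the ten rows where the prediction matches the label.
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)

print(predicted)  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(accuracy)   # 0.7
```

The predicted list matches the `model.predict` output above, and 7 of these 10 passengers are classified correctly.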

**Calculate the score using cross validation**

In [22]:
cross_val_score(GaussianNB(), X_train, y_train, cv=5)

array([0.744     , 0.784     , 0.776     , 0.82258065, 0.77419355])
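
The five fold scores are usually summarised by their mean and spread. A small sketch using the values printed above:

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score output above.
scores = [0.744, 0.784, 0.776, 0.82258065, 0.77419355]

avg = mean(scores)
spread = stdev(scores)  # sample standard deviation across the 5 folds

print(f"{avg:.3f} +/- {spread:.3f}")
```

The mean comes out at roughly 0.78, close to the single-split score of about 0.77 reported earlier.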