# Gaussian Naive Bayes Classifier

Link to the YouTube video tutorial: https://www.youtube.com/watch?v=PPeaRc-r1OI&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=15

Link to a YouTube video on the basics of how the Naive Bayes classifier (Multinomial Naive Bayes classifier) works: https://www.youtube.com/watch?v=O2L2Uv9pdDA

Link to a YouTube video on the basics of how the Gaussian Naive Bayes classifier works: https://www.youtube.com/watch?v=H3EjCKtlVog

Link to a YouTube video on the normal distribution (Gaussian distribution): https://www.youtube.com/watch?v=rzFX5NWojp0

1) The Gaussian Naive Bayes classifier is used when the features (the independent variables) are continuous, e.g.:
    1) In the Iris dataset, the features are sepal width, sepal length, petal width, and petal length, all of which are continuous measurements.
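
For a continuous feature, the Gaussian Naive Bayes model describes the feature's values within each class with a normal density, then combines the per-feature densities with the class prior. In the notation below (the symbols are chosen here for illustration and are not taken from the tutorial), $x_i$ is the $i$-th feature, $c$ is a class, and $\mu_{c,i}$ and $\sigma_{c,i}^{2}$ are the mean and variance of feature $i$ among the training samples of class $c$:

```latex
P(x_i \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}
  \exp\!\left(-\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}}\right),
\qquad
\hat{y} = \arg\max_{c}\; P(y = c)\prod_{i} P(x_i \mid y = c)
```

The product over features is the "naive" part: the features are treated as independent of each other once the class is known.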


In [1]:
import pandas as pd

df = pd.read_csv("titanic.csv")
df.head()

Unnamed: 0,PassengerId,Name,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,"Braund, Mr. Owen Harris",3,male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,"Heikkinen, Miss. Laina",3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,female,35.0,1,0,113803,53.1,C123,S,1
4,5,"Allen, Mr. William Henry",3,male,35.0,0,0,373450,8.05,,S,0


# Data exploration

In [2]:
# Survived is the dependent/target variable of this dataset
# drop the features that are assumed to have no impact on the target variable
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis='columns',inplace=True)
df.head()

Unnamed: 0,Pclass,Sex,Age,Fare,Survived
0,3,male,22.0,7.25,0
1,1,female,38.0,71.2833,1
2,3,female,26.0,7.925,1
3,1,female,35.0,53.1,1
4,3,male,35.0,8.05,0


# Data preprocessing

In [3]:
# Load the dependent variable data of the dataset to the variable called target
target = df.Survived

# Load the independent variables data of the dataset to the variable called inputs
inputs = df.drop('Survived', axis = 'columns')

In [4]:
# The Sex feature is a categorical variable stored as text labels, so it must be
# converted to numbers before a machine learning model can learn from it.
# Sex is not an ordinal category, so one-hot encoding (get_dummies) is a better
# fit than a label encoder.

# create dummy columns (female and male) from the Sex feature of the inputs dataframe using get_dummies, and save the result to the variable dummies
dummies = pd.get_dummies(inputs.Sex,dtype='int')
dummies.head()

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1
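
The female and male columns carry the same information, since each one is 1 minus the other. As an optional variation, not what this notebook does, `get_dummies` accepts `drop_first=True` to keep a single 0/1 column. A minimal sketch, assuming a small made-up series in place of `inputs.Sex`:

```python
import pandas as pd

# Toy column standing in for inputs.Sex (values are illustrative only)
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# drop_first=True drops the first category in sorted order ("female"),
# leaving one 0/1 column named "male"
dummies_single = pd.get_dummies(sex, dtype="int", drop_first=True)

print(dummies_single)
```

Whether one or two columns works better for Naive Bayes is a judgment call: keeping both columns counts the same evidence twice under the independence assumption.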


In [5]:
# concatenate the dummy columns to the inputs dataframe
inputs = pd.concat([inputs,dummies],axis='columns')
inputs.head()

Unnamed: 0,Pclass,Sex,Age,Fare,female,male
0,3,male,22.0,7.25,0,1
1,1,female,38.0,71.2833,1,0
2,3,female,26.0,7.925,1,0
3,1,female,35.0,53.1,1,0
4,3,male,35.0,8.05,0,1


In [6]:
# drop the Sex column (it contains text labels, which a machine learning model cannot learn from, and its information now lives in the dummy columns)
inputs.drop('Sex',axis='columns',inplace=True)
inputs.head()

Unnamed: 0,Pclass,Age,Fare,female,male
0,3,22.0,7.25,0,1
1,1,38.0,71.2833,1,0
2,3,26.0,7.925,1,0
3,1,35.0,53.1,1,0
4,3,35.0,8.05,0,1


In [7]:
# determine which columns of the inputs dataframe contain missing values (e.g. NA / NaN)
inputs.columns[inputs.isna().any()]

# show the first 10 rows of the Age column in the inputs dataframe
inputs.Age[:10]

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64
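
A notebook cell displays only its last expression, so the result of the column check in the cell above is not shown. The sketch below prints both the affected column names and a per-column count of missing values. It assumes a small made-up frame named `toy` in place of `inputs`:

```python
import pandas as pd

# Toy frame standing in for inputs; values are illustrative only
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, float("nan"), 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

# Names of the columns that contain at least one missing value
cols_with_na = list(toy.columns[toy.isna().any()])

# Number of missing values in each column
na_counts = toy.isna().sum()

print(cols_with_na)
print(na_counts)
```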

In [8]:
## Handle missing values (NA/NaN) in the dataset

# The Age column is the one that contains missing values. Compute the mean of the
# column, fill its missing entries with that mean, and assign the filled column
# back to the Age column of the inputs dataframe.
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.Age[:10]


0    22.000000
1    38.000000
2    26.000000
3    35.000000
4    35.000000
5    29.699118
6    54.000000
7     2.000000
8    27.000000
9    14.000000
Name: Age, dtype: float64
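
The cell above computes the mean over the whole dataset before it is split. A common alternative, shown here only as a hedged sketch, is to compute the mean from the training rows and reuse it for the test rows, so that no test-set information enters preprocessing. The frames `train_df` and `test_df` and their values are hypothetical:

```python
import pandas as pd

# Toy train/test frames; the names and values are hypothetical
train_df = pd.DataFrame({"Age": [20.0, float("nan"), 40.0]})
test_df = pd.DataFrame({"Age": [float("nan"), 50.0]})

# Compute the mean from the training rows only (NaN is skipped by default)
train_age_mean = train_df["Age"].mean()

# Fill both sets with that single training-set mean
train_df["Age"] = train_df["Age"].fillna(train_age_mean)
test_df["Age"] = test_df["Age"].fillna(train_age_mean)

print(train_df["Age"].tolist())
print(test_df["Age"].tolist())
```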

In [9]:
# Split the dataset into train and test set

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(inputs, target, test_size=0.2)

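# number of training samples (not displayed, because a cell shows only its last expression)
len(X_train)
# total number of samples in the dataset
len(inputs)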

891
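
The split above is random, so the rows in the test set, and therefore the score, can change from run to run. A minimal sketch of a reproducible split, assuming toy data in place of `inputs` and `target`; the seed value 42 is arbitrary:

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for inputs and target: 10 samples, balanced classes
X = [[i] for i in range(10)]
y = [0, 1] * 5

# random_state fixes the shuffle; stratify keeps the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_tr), len(X_te))
```

With `test_size=0.2`, 8 of the 10 toy samples go to training and 2 to testing, and rerunning the call with the same `random_state` returns the same rows.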

# Develop machine learning model

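1) There are several Naive Bayes models; Gaussian Naive Bayes is one of them.
2) Gaussian Naive Bayes assumes that, within each class, every continuous feature follows a normal (Gaussian) distribution.
3) In statistics, the Gaussian distribution is also called the normal distribution, and its density curve is known as the bell curve.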
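
To show what this assumption means in practice, here is a simplified from-scratch sketch: it estimates a prior and a per-feature mean and variance for each class, then picks the class with the highest log score. It is not scikit-learn's implementation. The function names, the `1e-9` variance floor, and the toy data are all assumptions made for illustration:

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Population variance, plus a small constant to avoid division by zero
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log-density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_one(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    scores = {
        label: math.log(prior)
        + sum(log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
        for label, (prior, means, variances) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical two-feature toy data (age, fare) with made-up labels
X_toy = [[22.0, 7.0], [25.0, 8.0], [30.0, 9.0],
         [35.0, 70.0], [40.0, 80.0], [38.0, 75.0]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_model = fit_gaussian_nb(X_toy, y_toy)
print(predict_one(toy_model, [24.0, 8.5]), predict_one(toy_model, [37.0, 72.0]))
```

Working with log probabilities turns the product of many small densities into a sum, which avoids numerical underflow.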

In [10]:
from sklearn.naive_bayes import GaussianNB

# create the gaussian naive bayes model (machine learning model)
model = GaussianNB() 

# train the model
model.fit(X_train,Y_train)

# show the accuracy of the trained model on the test set (score returns the mean accuracy)
model.score(X_test,Y_test)

0.770949720670391

Observe how the trained model makes predictions for the samples in X_test, and compare the predictions with their corresponding ground truths in Y_test.

In [11]:
# show the first 10 rows of Y_test, which are the ground truths of the first 10 samples/rows in X_test
print(Y_test[:10])

156    1
332    0
849    1
707    1
340    1
821    1
260    0
231    0
238    0
744    1
Name: Survived, dtype: int64


In [12]:
# show the predictions (0 means the person did not survive, 1 means the person survived) made by the model for the first 10 samples/rows in X_test
model.predict(X_test[:10])

array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
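
The comparison can be made explicit by counting how many of these ten predictions agree with the ground truths printed earlier. The two lists below are copied from the outputs above, and the variable names are chosen here for illustration:

```python
# Ground truths and predictions copied from the two outputs shown above
y_true_first10 = [1, 0, 1, 1, 1, 1, 0, 0, 0, 1]
y_pred_first10 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Count the positions where prediction and ground truth agree
n_correct = sum(t == p for t, p in zip(y_true_first10, y_pred_first10))
accuracy_first10 = n_correct / len(y_true_first10)

print(n_correct, accuracy_first10)
```

Ten rows are a very small sample, so this figure need not match the score computed on the full test set.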

In [13]:
# show the predicted probabilities (first column: probability of 0 [the person did not survive], second column: probability of 1 [the person survived]) for the first 10 samples/rows in X_test
model.predict_proba(X_test[:10])

array([[6.75470904e-02, 9.32452910e-01],
       [1.54400086e-03, 9.98455999e-01],
       [7.29482339e-04, 9.99270518e-01],
       [9.26786178e-01, 7.32138222e-02],
       [9.60667778e-01, 3.93322224e-02],
       [9.91704023e-01, 8.29597714e-03],
       [9.91915966e-01, 8.08403438e-03],
       [9.91856944e-01, 8.14305554e-03],
       [9.77966673e-01, 2.20333270e-02],
       [9.92024982e-01, 7.97501776e-03]])
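
The class labels and the probabilities are linked: each row sums to approximately 1, and the predicted label is the column with the larger probability. A short check using the probabilities copied from the output above (the variable names are illustrative, and the column index equals the label only because the classes here are 0 and 1):

```python
import numpy as np

# Probabilities copied from the predict_proba output above
proba = np.array([
    [6.75470904e-02, 9.32452910e-01],
    [1.54400086e-03, 9.98455999e-01],
    [7.29482339e-04, 9.99270518e-01],
    [9.26786178e-01, 7.32138222e-02],
    [9.60667778e-01, 3.93322224e-02],
    [9.91704023e-01, 8.29597714e-03],
    [9.91915966e-01, 8.08403438e-03],
    [9.91856944e-01, 8.14305554e-03],
    [9.77966673e-01, 2.20333270e-02],
    [9.92024982e-01, 7.97501776e-03],
])

# Each row holds P(class 0) and P(class 1), so it sums to about 1
row_sums = proba.sum(axis=1)

# The predicted class is the index of the larger probability in each row
labels_from_proba = proba.argmax(axis=1)

print(labels_from_proba)
```

The resulting labels agree with the array returned by `model.predict` for the same ten rows.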