## Naive Bayesian Classifier

The Naive Bayes algorithm is an intuitive method that uses the probability of each attribute value occurring within each class to make a prediction. It simplifies the calculation by assuming that, given the class, each attribute is independent of all the others. This is a strong assumption, but it yields a fast and often effective method.
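
To make the independence assumption concrete, here is a minimal sketch that multiplies a class prior by one likelihood per attribute and then normalises. The toy records and the helper name `naive_bayes_posterior` are hypothetical and are not taken from the Titanic data.

```python
from collections import Counter

# Hypothetical toy records: (sex, ticket class, survived). Not real Titanic rows.
rows = [
    ("female", 1, 1), ("female", 3, 1), ("female", 3, 0), ("male", 1, 1),
    ("male", 3, 0), ("male", 3, 0), ("male", 2, 0), ("female", 2, 1),
]

def naive_bayes_posterior(rows, features):
    """Return P(class | features), treating features as independent given the class."""
    class_counts = Counter(label for *_, label in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(class)
        for position, value in enumerate(features):
            matches = sum(1 for r in rows if r[-1] == label and r[position] == value)
            score *= matches / n_label  # likelihood P(feature value | class)
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

# Posterior for a hypothetical male passenger in 3rd class
posterior = naive_bayes_posterior(rows, ("male", 3))
print(posterior)
```

On this toy data the class "did not survive" (0) receives a posterior of about 0.9. Note that a feature value never seen within a class drives that class's score to zero, which is why practical implementations usually add smoothing.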

## Data Dictionary
Dictionary is in the form of (variable): (definition)<br>
1) survival: Survival (0 = No, 1 = Yes)<br>
2) pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)<br>
3) sex: Sex<br>
4) age: Age in years<br>
5) sibsp: # of siblings / spouses aboard the Titanic<br>
6) parch: # of parents / children aboard the Titanic<br>
7) ticket: Ticket number<br>
8) fare: Passenger fare<br>
9) cabin: Cabin number<br>
10) embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

## Variable Notes
1) pclass: A proxy for socio-economic status (SES)<br>
<li>1st = Upper</li>
<li>2nd = Middle</li>
<li>3rd = Lower</li>

2) age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

3) sibsp: The dataset defines family relations in this way...
<li>Sibling = brother, sister, stepbrother, stepsister</li>
<li>Spouse = husband, wife (mistresses and fiancés were ignored)</li>

4) parch: The dataset defines family relations in this way...<br>
<li>Parent = mother, father</li>
<li>Child = daughter, son, stepdaughter, stepson</li>
<li>Some children travelled only with a nanny, therefore parch=0 for them.</li>

In [60]:
# Required libraries
import pandas as pd
import numpy as np

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Importing the datasets we need
df_train = pd.read_csv('train.csv')
print(df_train)

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
5              6         0       3   
6              7         0       1   
7              8         0       3   
8              9         1       3   
9             10         1       2   
10            11         1       3   
11            12         1       1   
12            13         0       3   
13            14         0       3   
14            15         0       3   
15            16         1       2   
16            17         0       3   
17            18         1       2   
18            19         0       3   
19            20         1       3   
20            21         0       2   
21            22         1       2   
22            23         1       3   
23            24         1       1   
24            25         0       3   
25          

## Cleaning our dataset
Before we begin building our Naive Bayes model, let us go through the dataset and decide which attributes to emphasize. Attributes such as ticket number and port of embarkation, which we assume have little to no relevance to a passenger's survival, are removed from the dataset.

Also, we can consider removing rows with missing values.

In [61]:
# Deleting name, ticket and port of embarkation columns from dataset
df_train = df_train.drop(columns=['Name', 'Ticket', 'Embarked'])

# Find number of missing values in each column
print("Missing values per column")
print(df_train.isnull().sum())

# Remove cabin column as well since 687 out of 891 rows have no values
df_train = df_train.drop(columns=['Cabin'])

# Encode Sex as male=1 and female=0, since the Naive Bayes model needs numeric features
df_train['Sex'] = df_train['Sex'].map({'male': 1, 'female': 0})

# Print every row with a missing 'Age' to see if there is any pattern
for _, passenger in df_train[df_train['Age'].isnull()].iterrows():
    print(passenger)

del df_train['PassengerId']

Missing values per column
PassengerId      0
Survived         0
Pclass           0
Sex              0
Age            177
SibSp            0
Parch            0
Fare             0
Cabin          687
dtype: int64
PassengerId    6.0000
Survived       0.0000
Pclass         3.0000
Sex            1.0000
Age               NaN
SibSp          0.0000
Parch          0.0000
Fare           8.4583
Name: 5, dtype: float64
PassengerId    18.0
Survived        1.0
Pclass          2.0
Sex             1.0
Age             NaN
SibSp           0.0
Parch           0.0
Fare           13.0
Name: 17, dtype: float64
PassengerId    20.000
Survived        1.000
Pclass          3.000
Sex             0.000
Age               NaN
SibSp           0.000
Parch           0.000
Fare            7.225
Name: 19, dtype: float64
PassengerId    27.000
Survived        0.000
Pclass          3.000
Sex             1.000
Age               NaN
SibSp           0.000
Parch           0.000
Fare            7.225
Name: 26, dtype: float64
Pas

Since the rows with a missing age show no obvious pattern in the other variables, we shall delete them. The training set has 891 rows, so dropping the 177 rows with a missing age still leaves 714 rows, which we consider enough to fit the model without a substantial loss of statistical power.
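
One way to check for such a pattern more systematically than reading printed rows is to compare the other columns between rows with and without an age. The sketch below uses a small hypothetical frame named `toy`, not the Titanic data, and assumes pandas and NumPy are available.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing ages. Not real Titanic rows.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "Age":      [22.0, np.nan, 26.0, np.nan, 35.0, 54.0],
})

# Mean of the remaining columns, grouped by whether Age is missing.
# Large gaps between the two groups would hint at a pattern.
summary = toy.drop(columns="Age").groupby(toy["Age"].isnull()).mean()
print(summary)

# Drop only the rows whose Age is missing
toy_complete = toy.dropna(subset=["Age"])
print(len(toy), "->", len(toy_complete))
```

Restricting `dropna` to `subset=["Age"]` makes the intent explicit and avoids discarding rows because of a missing value in some unrelated column.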

In [62]:
# Removal of rows with missing age values
df_train = df_train.dropna(axis=0)

## Deciding which Naive Bayes model to build with the scikit-learn library
<strong>1) Gaussian:</strong><br>
It is used for classification with continuous features and assumes that, within each class, every feature follows a normal distribution.

<strong>2) Multinomial:</strong><br>
It is used for discrete counts. We can think of it as “number of times outcome number x_i is observed over the n trials”.

<strong>3) Bernoulli:</strong><br>
It is useful if our feature vectors are binary (i.e. zeros and ones). An example would be text classification, where 1 represents 'word occurs in the document' and 0 represents 'word does not occur in the document'.

We build a Gaussian model below, which is a natural fit for continuous features such as Age and Fare.
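
To illustrate what the Gaussian variant computes, the following is a from-scratch sketch using only the standard library. It is not scikit-learn's implementation. The helper names `fit_gaussian_nb` and `predict_gaussian_nb`, the smoothing constant and the (age, fare) rows are all hypothetical.

```python
import math
from statistics import mean, pvariance

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    params = {}
    for label in set(y):
        rows = [x for x, target in zip(X, y) if target == label]
        columns = list(zip(*rows))
        params[label] = {
            "prior": len(rows) / len(X),
            "means": [mean(c) for c in columns],
            # Small constant (an assumption) avoids division by zero for constant features
            "vars": [pvariance(c) + 1e-9 for c in columns],
        }
    return params

def predict_gaussian_nb(params, x):
    """Return the class with the highest log prior plus summed Gaussian log-likelihoods."""
    best_label, best_score = None, -math.inf
    for label, p in params.items():
        score = math.log(p["prior"])
        for value, mu, var in zip(x, p["means"], p["vars"]):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical (age, fare) rows with survival labels. Not real Titanic rows.
X = [[22.0, 7.25], [35.0, 8.05], [28.0, 7.9], [38.0, 71.3], [29.0, 80.0], [45.0, 65.0]]
y = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(X, y)
low_fare_prediction = predict_gaussian_nb(params, [30.0, 9.0])
high_fare_prediction = predict_gaussian_nb(params, [40.0, 75.0])
print(low_fare_prediction, high_fare_prediction)
```

Summing log-likelihoods instead of multiplying raw probabilities avoids numerical underflow when there are many features.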

In [63]:
# Implementation of a Gaussian model
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

# Retrieving test data
df_test = pd.read_csv('test.csv')

# Deleting irrelevant columns from test data
df_test = df_test.drop(columns=['Name', 'Ticket', 'Embarked', 'Cabin'])

# Keep only rows whose Age and Fare are neither missing nor 0
# (note that this also drops passengers with a genuine fare of 0)
keep = (df_test['Age'].fillna(0) != 0) & (df_test['Fare'].fillna(0) != 0)
df_test = df_test[keep].copy()

# Encode Sex as male=1 and female=0, since the Naive Bayes model needs numeric features
df_test['Sex'] = df_test['Sex'].map({'male': 1, 'female': 0})

# Retrieve target values for training data
df_target = df_train['Survived']
df_train = df_train.drop(columns=['Survived'])

# Retrieving test prediction data and dropping the same rows as in the test data
# (assumes both files list passengers in the same order, so the mask is applied by position)
df_target_pred = pd.read_csv('gender_submission.csv')
df_target_pred = df_target_pred[keep.to_numpy()]

del df_test['PassengerId']
del df_target_pred['PassengerId']

# Fitting a Naive Bayes model to the data
model = GaussianNB()
model.fit(df_train, df_target)
target_pred = model.predict(df_test)

# Confusion matrix: rows are the reference labels, columns are our predictions
print(metrics.confusion_matrix(df_target_pred['Survived'], target_pred))
accuracy = metrics.accuracy_score(df_target_pred['Survived'], target_pred) * 100
print(f"We attain an accuracy of {accuracy:.2f}% for our Naive Bayes Model.")

[[180  23]
 [  3 124]]
We attain an accuracy of 92.12% for our Naive Bayes Model.
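
The accuracy figure can be recovered by hand from the confusion matrix printed above: the diagonal holds the rows where prediction and reference label agree. The short sketch below assumes the matrix follows scikit-learn's convention of reference labels in rows and predictions in columns.

```python
# Confusion matrix copied from the output above
confusion = [[180, 23],
             [3, 124]]

# Diagonal entries are the rows where prediction and reference label agree
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = round(correct / total * 100, 2)

print(f"{correct} of {total} rows agree: {accuracy}%")
```

This gives 304 agreeing rows out of 330, which matches the 92.12% printed by the model cell.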


## Conclusion
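Our model agrees with the labels in gender_submission.csv on 92.12% of the retained test rows. Note, however, that this file is Kaggle's sample submission, which predicts survival from sex alone, and not the true outcomes. The figure therefore measures agreement with that baseline and not true accuracy, so it should not be read as confidence in the survival prediction for each passenger. More in-depth feature engineering could improve the prediction model further, and we should be wary that the Naive Bayes model assumes that features are independent of each other given the class.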