# Naive Bayes Classifier

## Model Summary

The Naive Bayes Classifier is an algorithm based on Bayes' Theorem, which can be represented as:

![Title](extras/bayes_rule.png)

where:
 - P(c|x) is the posterior probability of class (c, or outcome variable) given predictor (x, or attributes or features).
 - P(x|c) is the likelihood: the probability of the predictor occurring given the class.
 - P(c) is the prior probability of class (overall probability that it occurs).
 - P(x) is the prior probability of predictor (also known as 'evidence' in Bayesian probability terminology).
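
For readers who cannot see the image, the rule that the four definitions above describe can be written in LaTeX as follows. This is the standard statement of Bayes' Theorem, using the same symbols as the list.

```latex
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}
```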

For each class, the prior is multiplied by the conditional probabilities of all the features, and the class with the highest resulting probability is chosen. A key assumption of this model is that the features are independent of one another given the class, which is where the _naive_ name comes from.
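
The multiply-and-pick-the-largest step can be sketched in a few lines of plain Python. The toy weather data and the `score` and `predict` helpers are made up for illustration, and no smoothing is applied, so treat this as a rough sketch rather than a reference implementation.

```python
from collections import Counter

# Hypothetical toy data: (outlook, windy) -> play
rows = [
    (("sunny", "no"), "yes"),
    (("sunny", "yes"), "no"),
    (("rainy", "yes"), "no"),
    (("rainy", "no"), "yes"),
    (("sunny", "no"), "yes"),
    (("rainy", "yes"), "no"),
]

# How often each class occurs, and how often each feature value occurs per class.
class_counts = Counter(label for _, label in rows)
feature_counts = Counter()
for features, label in rows:
    for i, value in enumerate(features):
        feature_counts[(i, value, label)] += 1

def score(features, label):
    """Unnormalised posterior: P(c) times the product of P(x_i | c)."""
    p = class_counts[label] / len(rows)
    for i, value in enumerate(features):
        p *= feature_counts[(i, value, label)] / class_counts[label]
    return p

def predict(features):
    """Return the class with the highest unnormalised posterior."""
    return max(class_counts, key=lambda label: score(features, label))

scores = {label: score(("sunny", "no"), label) for label in class_counts}
prediction = predict(("sunny", "no"))
print(scores, prediction)
```

P(x) is the same for every class, so it can be left out when only the winning class is needed.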

## Theoretical Example

Let us estimate the probability that a phone will explode given that it's a Samsung Note, P(c|x). If 1% of all phones explode (P(c) = 0.01), 25% of all phones are Samsung Note (P(x) = 0.25), and 75% of all exploding phones are Samsung Note (P(x|c) = 0.75), then P(c|x) = (0.75)(0.01)/(0.25) = 0.03, or 3%.
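
As a quick check of the arithmetic, the three figures can be plugged into Bayes' rule in Python, with P(x|c) = 0.75, P(c) = 0.01 and P(x) = 0.25. The variable names are illustrative.

```python
# Figures taken from the example above.
p_explode = 0.01             # P(c): prior probability that a phone explodes
p_note = 0.25                # P(x): share of all phones that are Samsung Note
p_note_given_explode = 0.75  # P(x|c): share of exploding phones that are Samsung Note

# Bayes' rule: P(c|x) = P(x|c) * P(c) / P(x)
p_explode_given_note = p_note_given_explode * p_explode / p_note
print(f"P(explode | Samsung Note) = {p_explode_given_note:.2%}")
```

The posterior comes out at about 3%, three times the 1% prior, because Samsung Notes are over-represented among exploding phones.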

## Pros & Cons

Pros:
 - It is a very fast classifier.
 - It performs well with categorical input variables compared to numerical ones.

Cons:
 - If a categorical variable takes a value in the test data set that never appeared in the training data set, the model assigns that value zero probability and cannot make a sensible prediction. This can be fixed with Laplace (additive) smoothing.
 - Although Naive Bayes is a decent classifier, it is known to be a poor probability estimator, so the outputs from predict_proba should not be taken too seriously.
 - The assumptions of independent features, and of normally distributed features in the Gaussian NB model, are very strong.
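
The zero-probability problem from the first con, and the smoothing fix, can be sketched with plain arithmetic. The `smoothed_likelihood` helper and the counts are hypothetical. In scikit-learn the smoothing strength is the `alpha` parameter of the Naive Bayes classes.

```python
def smoothed_likelihood(count, class_total, n_values, alpha=1.0):
    """Estimate P(x = value | c) with additive (Laplace) smoothing.

    alpha=0 gives the raw frequency; alpha=1 is classic add-one smoothing.
    """
    return (count + alpha) / (class_total + alpha * n_values)

# Hypothetical counts: a category value never seen among 50 training rows
# of class c, for a feature with 3 possible values.
raw = smoothed_likelihood(0, 50, 3, alpha=0.0)
smoothed = smoothed_likelihood(0, 50, 3, alpha=1.0)
print(raw, smoothed)
```

Without smoothing the unseen value gets probability 0, which zeroes out the whole product for that class. With add-one smoothing it gets a small non-zero probability (1/53 here).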
  
## Applications

 - Real time prediction: NB is a very fast classifier. 
 - Multi-class prediction
 - Text Classification
 - Recommender Systems: NB is often used with Collaborative Filtering to build good recommender systems.

## Model Types

There are three primary Naive Bayes models, each determined by the kind of feature variables we're working with: 
 - Gaussian NB: assumes that the features are numerical and follow a normal distribution.
 - Multinomial NB: assumes that the feature variables are discrete counts, and is often used in text classification that analyzes word counts.
 - Bernoulli NB: assumes that the features are binary variables, and is often used in text classification that looks for the presence of a given word.
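
As a rough illustration of how the three variants map onto scikit-learn, the sketch below fits each one to a tiny made-up data set of the matching kind. The arrays are invented for the example and are far too small to say anything about real performance.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Made-up toy features, one array per kind of input.
X_numeric = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 4.2]])  # continuous values
X_counts = np.array([[3, 0], [2, 0], [0, 4], [0, 3]])                   # word counts
X_binary = (X_counts > 0).astype(int)                                   # word present or not

datasets = {
    "gaussian": (GaussianNB(), X_numeric),
    "multinomial": (MultinomialNB(), X_counts),
    "bernoulli": (BernoulliNB(), X_binary),
}

predictions = {}
for name, (model, X) in datasets.items():
    model.fit(X, y)
    predictions[name] = model.predict(X)
    print(name, predictions[name])
```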


## Bernoulli Classification Coding Example

In this example, we will use the Bernoulli classification model to predict whether a Titanic passenger _Survived_ using the _Pclass_, _Sex_, and _Embarked_ fields.

In [46]:
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

First we load the data (available on Kaggle):

In [41]:
data = pd.read_csv('Data/titanic_train.csv')

In [6]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Judging by a quick inspection of the columns, we can take Pclass, Sex, and Embarked as our categorical predictors, which we will turn into binary (dummy) variables.

In [43]:
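# One-hot encode the predictors; the prefix gives the Pclass dummies string column names
X = pd.concat([pd.get_dummies(data[['Sex', 'Embarked']]), pd.get_dummies(data['Pclass'], prefix='Pclass')], axis=1)
y = data['Survived']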
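
What `pd.get_dummies` produces is easier to see on a tiny frame. The three rows below are made up and only reuse the Titanic column names. Passing `columns=` explicitly is one way (an assumption of this sketch, not the approach used in the cell above) to make pandas treat the numeric `Pclass` column as categorical.

```python
import pandas as pd

# Made-up three-row sample with the same column names as the Titanic data.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Pclass": [3, 1, 3],
})

# Every listed column is replaced by one 0/1 column per distinct value.
dummies = pd.get_dummies(sample, columns=["Sex", "Embarked", "Pclass"], dtype=int)
print(dummies)
```

Each original column becomes one binary column per observed value, which is the kind of input the Bernoulli model expects.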

Split the data into train and test sets:

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .33, random_state = 101)

Run the model using a Bernoulli Naive Bayes:

In [45]:
nb = BernoulliNB()
nb.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

Evaluate the model on the test dataset:

In [49]:
yhat = nb.predict(X_test)

print(confusion_matrix(y_test, yhat))
print('\n', classification_report(y_test, yhat))

[[138  31]
 [ 35  91]]

             precision    recall  f1-score   support

          0       0.80      0.82      0.81       169
          1       0.75      0.72      0.73       126

avg / total       0.78      0.78      0.78       295
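
The figures in the report can be recomputed directly from the confusion matrix, which helps when reading the two outputs side by side. The sketch below hard-codes the matrix printed above and follows scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[138, 31],
               [35, 91]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions over all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class, divided by column totals
recall = np.diag(cm) / cm.sum(axis=1)     # per class, divided by row totals

print(f"accuracy  = {accuracy:.2f}")
print("precision =", precision.round(2))
print("recall    =", recall.round(2))
```

The overall accuracy works out to 229 correct out of 295, roughly 0.78, in line with the averages in the report.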



## Additional Resources

 - http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes
 - https://en.wikipedia.org/wiki/Naive_Bayes_classifier
 - http://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/

## Future Improvements

 - Add examples of Gaussian and Multinomial models
 - Answer question: what happens if the assumption of independent features is violated?
 - Answer question: what happens if features are both numerical _and_ categorical?
 - Answer question (not necessarily just NB): in which modeling techniques do we _have to_ drop a category value when creating dummy variables?