# Naïve Bayes Classifier

Probability measures how likely something is to happen. When all outcomes are equally likely, it is calculated by dividing the number of ways the event can happen by the total number of possible outcomes. For example, a coin flip has 2 possible outcomes and 1 of them is heads, so the probability of getting heads is 1/2, or 50%. The formula looks like:

### \begin{align} \text{probability} = \frac{\text{number of chances}}{\text{total outcomes}} \end{align}

The Naïve Bayes classification model is an algorithm based on Bayes' Theorem, which gives the probability of one event given that another event is already known to have occurred. It is represented by the following formula:

### \begin{align} P(B|A) = \frac{P(B)\times P(A|B)}{P(A)} \end{align}

Here, the probability of B given that A happened is equal to the probability of B times the probability of A given that B happened, divided by the probability of A. For example, in a bag of 2 blue marbles and 3 red marbles, if a blue marble is pulled from the bag and not put back, the probability of drawing another blue marble changes, because only 1 of the 4 marbles left in the bag is blue.

<center>![Marbles Probability](https://notebooks.azure.com/priesterkc/projects/testdb/raw/marbles.png "Probability using marbles")</center>
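To see the formula at work on the marble example, here is a small sketch that lists every way to draw two marbles without replacement and compares Bayes' Theorem with a direct count. The marble labels and the helper `prob` are made up for this illustration; A is "the first marble is blue" and B is "the second marble is blue".

```python
from fractions import Fraction
from itertools import permutations

# Bag from the example: 2 blue and 3 red marbles (labels are illustrative only)
bag = ["blue1", "blue2", "red1", "red2", "red3"]

# Every equally likely ordered pair (first draw, second draw) without replacement
draws = list(permutations(bag, 2))

def prob(event, outcomes):
    """Fraction of the given outcomes for which event(outcome) is True."""
    hits = [o for o in outcomes if event(o)]
    return Fraction(len(hits), len(outcomes))

def first_blue(d):      # event A
    return d[0].startswith("blue")

def second_blue(d):     # event B
    return d[1].startswith("blue")

p_a = prob(first_blue, draws)
p_b = prob(second_blue, draws)
# P(A|B): only look at the draws where B happened
p_a_given_b = prob(first_blue, [d for d in draws if second_blue(d)])

# Bayes' Theorem: P(B|A) = P(B) * P(A|B) / P(A)
p_b_given_a_bayes = p_b * p_a_given_b / p_a
# Direct count: only look at the draws where A happened
p_b_given_a_direct = prob(second_blue, [d for d in draws if first_blue(d)])

print(p_b_given_a_bayes, p_b_given_a_direct)
```

Both routes give 1/4, which matches the intuition that 1 blue marble remains among the 4 left in the bag.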

## Naïve Bayes Probability Calculation

In the following dataset, let's find the probability of a student passing a test (60% or higher) given that they studied 5 hours or less. Here are the things we'll need to know:

- the total number of students
- the number of students that passed the test
- the number of students that studied 5 hours or less
- the number of students that studied 5 hours or less, among those that passed

Using those values, we can then calculate:

- the probability of passing the test
- the probability of studying 5 hours or less
- the probability of studying 5 hours or less, given already passing the test

In [None]:
import pandas as pd
import numpy as np

In [None]:
#load data
filename = "datasets/gradedata.csv"
df = pd.read_csv(filename)

df.head() #first 5 rows

In [None]:
#descriptive statistics
df.describe()

In [None]:
#total number of students
total = len(df)

In [None]:
#rows of students that passed the test
df_pass = df[df['grade'] >= 60]

#number of students that passed
numpass = len(df_pass)

In [None]:
#rows of students that studied 5 hours or less
df_less5hr = df[df['hours'] <= 5]

#number of students that studied 5 hours or less
num_less5hr = len(df_less5hr)

In [None]:
#rows of students that studied 5 hours or less and passed
df_5less_pass = df_pass[df_pass['hours'] <= 5]

#number of students that studied 5 hours or less and passed
num_5less_pass = len(df_5less_pass)

In [None]:
#probability of passing the test
#number of students that passed divided by total number of students
P_pass = numpass/total
P_pass

In [None]:
#probability of studying 5 hours or less
#number of students that studied 5 hours or less divided by total number of students
P_less5hr = num_less5hr/total
P_less5hr

In [None]:
#probability of studying 5 hours or less given that you passed
#number of students that studied 5 hours or less given they passed, divided by total students that passed
P_5hr_pass = num_5less_pass/numpass
P_5hr_pass

In [None]:
#SOLUTION: probability of passing given that you studied 5 hours or less

#probability of passing times probability of studying 5 hours or less given that you passed
#divided by probability of studying 5 hours or less
P_pass_less5hr = (P_pass * P_5hr_pass)/(P_less5hr)
P_pass_less5hr

#### The probability of passing the test, given that a student studied 5 hours or less, is about 93.5%. So such a student has only about a 6.5% chance of failing. That's not too bad; maybe the test is fairly easy.
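As a sanity check, Bayes' Theorem should agree with simply counting the passing students among those that studied 5 hours or less. The sketch below shows this on a tiny DataFrame; the values in `toy` are invented for illustration and are not taken from gradedata.csv.

```python
import pandas as pd

# Made-up data for illustration only; not taken from gradedata.csv
toy = pd.DataFrame({
    "hours": [2, 4, 5, 6, 8, 3, 10, 1],
    "grade": [55, 70, 62, 90, 85, 40, 95, 65],
})

passed = toy["grade"] >= 60          # True for students that passed
short_study = toy["hours"] <= 5      # True for students that studied 5 hours or less

# The three pieces that go into Bayes' Theorem
p_pass = passed.mean()
p_short = short_study.mean()
p_short_given_pass = (passed & short_study).sum() / passed.sum()

# P(pass | studied 5 hours or less), computed two ways
bayes = p_pass * p_short_given_pass / p_short
direct = (passed & short_study).sum() / short_study.sum()

print(bayes, direct)
```

In this toy data, 3 of the 5 students that studied 5 hours or less passed, so both calculations give 0.6.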

***

## Naïve Bayes using Scikit-Learn

Let's use the same dataset above and build a Naïve Bayes classification model to predict student grades.

### Gaussian Naïve Bayes

There are different types of Naïve Bayes classifiers; in the examples below, we will use Gaussian Naïve Bayes to build the predictive model. Gaussian Naïve Bayes applies conditional probability under the assumption that each numeric feature is normally distributed within each class.
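To make the Gaussian part concrete, here is a minimal sketch of the calculation for a single feature, written with NumPy only. The study hours and the helper names `gaussian_pdf` and `posterior_passed` are made up for illustration; scikit-learn's `GaussianNB` follows the same idea but also adds a small smoothing term to the variances and handles many features at once.

```python
import numpy as np

# Made-up study hours for illustration (not from gradedata.csv)
hours_failed = np.array([1.0, 2.0, 2.5, 3.0])
hours_passed = np.array([4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

def gaussian_pdf(x, mean, var):
    """Normal density with the given mean and variance, evaluated at x."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def posterior_passed(x):
    """P(passed | hours = x) for a one-feature Gaussian Naive Bayes."""
    n_fail, n_pass = len(hours_failed), len(hours_passed)
    prior_fail = n_fail / (n_fail + n_pass)
    prior_pass = n_pass / (n_fail + n_pass)
    # Likelihood of x under the normal distribution fitted to each class
    like_fail = gaussian_pdf(x, hours_failed.mean(), hours_failed.var())
    like_pass = gaussian_pdf(x, hours_passed.mean(), hours_passed.var())
    # Bayes' Theorem; the denominator sums over both classes
    evidence = prior_fail * like_fail + prior_pass * like_pass
    return prior_pass * like_pass / evidence

print(posterior_passed(2.0), posterior_passed(7.0))
```

A student with 2 hours of study sits close to the "failed" distribution and gets a low probability of passing, while 7 hours sits close to the "passed" distribution and gets a high one.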

In [None]:
from sklearn.naive_bayes import GaussianNB   #import Gaussian Bayes modeling function
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [None]:
#count non-null values per column; a count lower than the number of rows means missing values
df.count()

In [None]:
df.dtypes

In [None]:
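#create a dataframe with columns to use in the model
#copy so that later column assignments change modeldf only and don't raise SettingWithCopyWarning
modeldf = df[['gender', 'age', 'exercise', 'hours', 'grade']].copy()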
modeldf.head()

In [None]:
#transform gender column to binary values (0,1)
modeldf['gender'] = modeldf['gender'].map({'female': 0, 'male': 1})
modeldf.head()

In [None]:
#see which features are correlated to each other
modeldf.corr()

In [None]:
#create a column to label if a student passed or failed a test
modeldf['passed'] = np.where(df['grade']>= 60, 1, 0)

#drop grade column
modeldf.drop('grade', axis=1, inplace=True)

In [None]:
#dataframe with predicting features
X = modeldf.drop('passed', axis=1)

#column of predictive target values
y = modeldf['passed']

In [None]:
#create training and test data
#will leave test size at default (25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=109)

In [None]:
#initialize Gaussian Bayes classifier
gnb = GaussianNB()

In [None]:
#train the model to learn trends
gnb.fit(X_train, y_train)

In [None]:
#accuracy of the model on the training data
gnb.score(X_train, y_train)

In [None]:
#test the model on unseen data
#store predicted values in a variable
y_pred = gnb.predict(X_test)

In [None]:
#Confusion matrix shows which values model predicted correctly vs incorrectly

cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Failed', 'Predicted Passed'],
    index=['True Failed', 'True Passed']
)

cm

In [None]:
#frequency of passed students to failed students in the test dataset
y_test.value_counts()

In [None]:
#accuracy of the model on the test data
gnb.score(X_test, y_test)

In [None]:
#precision, recall and F1 score of the model for each class (failed = 0, passed = 1)
print(classification_report(y_test, y_pred))
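The numbers in the classification report come straight from the confusion matrix. As a rough sketch, here is how precision and recall can be worked out by hand; the labels `y_true` and `y_hat` are made up for illustration and are not the notebook's results.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = passed, 0 = failed
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
y_hat = [1, 1, 1, 1, 1, 0, 0, 1, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_passed = tp / (tp + fp)   # of the predicted passes, how many really passed
recall_passed = tp / (tp + fn)      # of the real passes, how many the model found
recall_failed = tn / (tn + fp)      # of the real fails, how many the model found

print(precision_passed, recall_passed, recall_failed)

# The same values from scikit-learn's helper functions
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```

In this toy example the recall for failed students is much lower than the recall for passed students, which is the kind of gap to look for when one class is rare.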

### Bernoulli Naïve Bayes

The Bernoulli Naïve Bayes classifier works best when the predicting features are binary (Boolean; True/False (1,0) values); scikit-learn's `BernoulliNB` converts non-binary features to 0/1 before fitting. Let's try this method on the dataset from the previous example.
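One detail worth knowing before fitting: `BernoulliNB` has a `binarize` parameter (0.0 by default) and turns every feature value above that threshold into 1 and everything else into 0. The sketch below shows the effect on a single made-up "hours" feature; the data and the cutoff of 5 hours are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up study hours (one feature) and pass/fail labels, for illustration only
X_toy = np.array([[1], [2], [3], [4], [6], [7], [8], [9]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Default binarize=0.0: every positive value becomes 1, so all rows look identical
default_model = BernoulliNB().fit(X_toy, y_toy)

# binarize=5.0: the feature becomes "studied more than 5 hours" (hypothetical cutoff)
cutoff_model = BernoulliNB(binarize=5.0).fit(X_toy, y_toy)

print(default_model.predict(X_toy))
print(cutoff_model.predict(X_toy))
print(default_model.score(X_toy, y_toy), cutoff_model.score(X_toy, y_toy))
```

With the default threshold the model cannot tell the rows apart and predicts the same class for all of them, so for columns such as age or hours it may be worth choosing a meaningful threshold or comparing against Gaussian Naïve Bayes.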

In [None]:
#import Bernoulli Naïve Bayes function from scikit-learn library
from sklearn.naive_bayes import BernoulliNB

In [None]:
#initialize Bernoulli Naïve Bayes function to a variable
bnb = BernoulliNB()

In [None]:
#build the model with training data
bnb.fit(X_train, y_train)

In [None]:
#model's accuracy on the training data
bnb.score(X_train, y_train)

In [None]:
#test the model on unseen data
#store predicted values in a variable
y_pred = bnb.predict(X_test)

In [None]:
#Confusion matrix shows which values model predicted correctly vs incorrectly

cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Failed', 'Predicted Passed'],
    index=['True Failed', 'True Passed']
)

cm

In [None]:
#accuracy of the Bernoulli model on the test data
bnb.score(X_test, y_test)

Overall, the model is really good at finding students that passed, but the dataset doesn't contain enough failing students for it to learn which feature values predict failing the test. One way to improve the results would be to rebalance the training data, for example by using fewer of the passing students, so that data points for failing students carry more weight. This dataset is also small, so new data with more students that failed could help the model see the trends for failing students. Lastly, it could just be that Naïve Bayes isn't the best model to use for the data, and we should compare its results to other predictive classification models.
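As a starting point for that comparison, here is a hedged sketch that scores Gaussian Naïve Bayes against one other classifier using cross-validation. Everything in it is an assumption for illustration: the data is synthetic (generated with `make_classification` so that about 10% of the rows belong to the rare class), and logistic regression is just one arbitrary choice of model to compare against.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced data standing in for the grade dataset (about 10% in class 0)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.1, 0.9], random_state=0,
)

models = {
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Balanced accuracy averages the recall of both classes,
# so the rare class counts as much as the common one
results = {
    name: cross_val_score(model, X_syn, y_syn, cv=5, scoring="balanced_accuracy").mean()
    for name, model in models.items()
}

print(results)
```

Plain accuracy can look high on imbalanced data even when the rare class is mostly missed, which is why balanced accuracy is used here; the same loop could be run on `X` and `y` from the notebook.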