## Introduction

This tutorial introduces the Gaussian Naive Bayes algorithm, one of the simplest and most commonly used classification algorithms in machine learning. Although traditional naive Bayes is covered in the 15-688 lectures, we focus here on the Gaussian variant. We introduce the rationale and theoretical framework behind the Gaussian Naive Bayes classifier, discuss its advantages and disadvantages, and finally walk through an implementation of the classifier on an example dataset.

## Tutorial content
In this tutorial, we will show how the Gaussian Naive Bayes algorithm works and how to implement a Gaussian Naive Bayes classifier with an example dataset.

We will cover the following topics in this tutorial:
- Theoretical framework
- Pros and Cons of Gaussian Naive Bayes classifier
- Example application
- References and further resources

## Theoretical framework
In this section, we introduce the theoretical framework of the Gaussian Naive Bayes classifier:
1. Assumption of Independent and Identically Distributed Data
2. Bayes Rule
3. Conditional Independence Assumption
4. Gaussian Distribution Assumption
5. Summary

### 1. Assumption of Independent and Identically Distributed Data
We assume our data are drawn independently from the same joint probability distribution over feature vectors X and labels Y; that is, the samples are independent and identically distributed (i.i.d.).

### 2. Bayes Rule
$$P(Y\mid X) = \frac{P(X\mid Y)P(Y)}{P(X)}$$

Equivalently: $$P(Y\mid X) = \frac{P(X\mid Y)P(Y)}{\sum_k P(X\mid Y_{k})P(Y_k)}$$
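
As a quick numeric illustration of the rule above, the sketch below computes a posterior for two classes. The priors and likelihoods are made-up numbers, and the helper name `posterior` is hypothetical; the only point is to show how the denominator normalizes the result.

```python
def posterior(priors, likelihoods):
    """Return P(Y=k | X) for each class k via Bayes' rule.

    priors[k] is P(Y=k); likelihoods[k] is P(X | Y=k) for one fixed observation X.
    """
    # Numerator of Bayes' rule for each class: P(X | Y=k) * P(Y=k)
    joint = {k: likelihoods[k] * priors[k] for k in priors}
    # Denominator: P(X) is the sum of the numerators over all classes
    evidence = sum(joint.values())
    return {k: joint[k] / evidence for k in joint}


# Hypothetical numbers, for illustration only
priors = {0: 0.75, 1: 0.25}
likelihoods = {0: 0.1, 1: 0.3}
post = posterior(priors, likelihoods)
print(post)
```

Here the rarer class has the larger likelihood, and the two effects cancel: both posteriors come out at about 0.5.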


### 3. Conditional Independence Assumption
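Gaussian Naive Bayes assumes that $X_i$ and $X_j$ are conditionally independent given Y, for all i≠j. (This is why we call it "Naive".) Under this assumption, the joint class-conditional distribution factorizes into a product of per-attribute terms: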

$$P(X_1...X_n\mid Y) =\prod_i P(X_i\mid Y)$$

### 4. Gaussian Distribution Assumption
Gaussian Naive Bayes assumes that, within each class $Y_k$, the values of each numerical attribute $X_i$ are normally distributed. (This is why we call it "Gaussian".)

$$X_i\mid Y_k \sim {\mathcal {N}}(\mu_{ik} ,\sigma_{ik} ^{2})$$
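
To make the Gaussian assumption concrete, here is a minimal sketch of the normal density as a plain Python function. The name `gaussian_pdf` is a hypothetical helper, not part of the implementation later in this tutorial.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std**2) evaluated at x."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * std ** 2)
    exponent = -((x - mean) ** 2) / (2.0 * std ** 2)
    return coeff * math.exp(exponent)


# At the mean of a standard normal, the density equals 1 / sqrt(2 * pi)
peak = gaussian_pdf(0.0, 0.0, 1.0)
# With a small standard deviation the density can exceed 1
narrow_peak = gaussian_pdf(0.0, 0.0, 0.1)
print(peak, narrow_peak)
```

Note that a density is not a probability: with a small standard deviation it can be larger than 1, as `narrow_peak` shows.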
 

### 5. Summary
To predict the label, we choose the most probable class label given the observed features $X = (X_1, \dots, X_n)$.

$$\DeclareMathOperator*{\argmax}{argmax} \widehat{y} = \argmax_y P(Y = y | X = x)\ $$

Using the Bayes Rule, we can rewrite the formula as follows:

$$\DeclareMathOperator*{\argmax}{argmax} \widehat{y} = \argmax_y \frac{P(X_1...X_n| Y=y)P(Y = y)}{P(X_1...X_n)} \\=\argmax_y  P(X_1...X_n| Y=y)P(Y = y)$$

Using the Conditional Independence Assumption, we can rewrite the formula as follows:

$$\DeclareMathOperator*{\argmax}{argmax} \widehat{y} = \argmax_y P(Y = y)\prod_i P(X_i| Y=y)$$

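Using the Gaussian Distribution Assumption, we can compute $P(X_i| Y=y_k)$ for an observed value $x_i$ as follows:

$$ P(X_i = x_i|Y=y_k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^{2}}}\, e^{-\frac{(x_i-\mu_{ik})^{2}}{2\sigma_{ik}^{2}}},$$
where $\mu_{ik}$ and $\sigma_{ik}$ are the mean and standard deviation of attribute $X_i$ within class $y_k$. Strictly speaking, this is a probability density rather than a probability, but it plays the same role in the argmax.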
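
The whole decision rule fits in a few lines. The sketch below is a compact illustration under assumed inputs: the per-class means, standard deviations, and priors are hypothetical, as are the helper names `gaussian_pdf` and `predict`.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std**2) evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * std ** 2)) / math.sqrt(2.0 * math.pi * std ** 2)


def predict(x, params, priors):
    """Return the class y maximizing P(Y=y) * prod_i P(X_i = x_i | Y=y)."""
    scores = {}
    for y, attr_params in params.items():
        score = priors[y]                       # start from the prior P(Y=y)
        for xi, (mean, std) in zip(x, attr_params):
            score *= gaussian_pdf(xi, mean, std)  # one factor per attribute
        scores[y] = score
    return max(scores, key=scores.get), scores


# Hypothetical (mean, std) pairs for two attributes in each of two classes
params = {0: [(3.0, 1.5), (13.0, 1.5)], 1: [(2.0, 1.5), (12.0, 1.5)]}
priors = {0: 0.5, 1: 0.5}
label, scores = predict((5.0, 15.0), params, priors)
print(label)
```

With equal priors and equal standard deviations, the point (5, 15) is assigned to class 0 because it lies closer to that class's means.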

## Pros and Cons of Gaussian Naive Bayes algorithm
This section will briefly discuss the advantages and disadvantages of the Gaussian Naive Bayes classifier.

### Advantages
The Gaussian Naive Bayes classifier is a simple yet powerful algorithm for making predictions, especially when the dataset is small. It has the following advantages:
1. __Gaussian Naive Bayes can handle continuous numeric attributes.__ <br>
> When a numerical attribute is continuous, it is impossible to count the frequency of each value. Instead, we calculate probability densities using a probability density function. Gaussian Naive Bayes assumes that each numerical attribute is normally distributed within each class, so the densities follow directly from the normal distribution.

2. __Gaussian Naive Bayes can handle missing data.__ <br>
> If a data instance has a missing value for an attribute, that attribute can be ignored, because Gaussian Naive Bayes handles attributes separately when calculating probabilities.

3. __Gaussian Naive Bayes performs well even with small datasets.__ <br>
> Gaussian Naive Bayes does not need much data to learn the probabilistic relationship between a given attribute and the predicted attribute. Furthermore, because it has few parameters, it is less likely to overfit a small training set.

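The missing-data advantage can be sketched as follows. This is only an illustration: it assumes missing values are encoded as NaN, and `likelihood_skip_missing` is a hypothetical helper. The implementation later in this tutorial does not handle missing values.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std**2) evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * std ** 2)) / math.sqrt(2.0 * math.pi * std ** 2)


def likelihood_skip_missing(x, attr_params):
    """Product of per-attribute densities, skipping attributes whose value is NaN."""
    prob = 1.0
    for xi, (mean, std) in zip(x, attr_params):
        if math.isnan(xi):
            continue  # a missing attribute simply contributes no factor
        prob *= gaussian_pdf(xi, mean, std)
    return prob


# Hypothetical (mean, std) pairs for two attributes of one class
attr_params = [(3.0, 1.5), (13.0, 1.5)]
full = likelihood_skip_missing((5.0, 15.0), attr_params)
partial = likelihood_skip_missing((5.0, float("nan")), attr_params)
print(full, partial)
```

Because the likelihood is a product of independent per-attribute factors, dropping one factor leaves the remaining ones unchanged.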



### Disadvantages
Although the Gaussian Naive Bayes algorithm usually performs well, it can perform poorly when used inappropriately. It has the following disadvantages:
1. __Gaussian Naive Bayes makes very strong assumptions.__ <br>
> Because of the Conditional Independence assumption and the Gaussian Distribution assumption, Gaussian Naive Bayes may not work well when two attributes are highly correlated given a class value or when an attribute's distribution is highly skewed. In practice, however, its predictions are sometimes surprisingly robust even when the assumptions are somewhat violated.

2. __Gaussian Naive Bayes is not good at handling large datasets.__ <br>
> Underfitting might occur when Gaussian Naive Bayes is trained on a large dataset or when the data distribution is uneven.






## Example application
In this section, we walk through the Gaussian Naive Bayes algorithm using the Blood Transfusion Service Center Data Set, taken from the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. The dataset contains 748 donor records, each with 4 attributes: R (Recency: months since last donation), F (Frequency: total number of donations), M (Monetary: total blood donated in c.c.), and T (Time: months since first donation). The predicted attribute is a binary indicator of whether the donor donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood).

Our implementation is broken into the following steps:

1. Import data
2. Split dataset
3. Separate data by class
4. Summarize attribute distributions for each class
5. Calculate probabilities for each class
6. Classify
7. Evaluate the accuracy

### 1. Import data
First, we load the data from a CSV file using the reader function in the csv module. We skip the header row and convert the remaining values to floating-point numbers.

In [1]:
import csv
import numpy as np
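def loadDataset(filename):
    # Open in text mode; newline='' is what the csv module expects in Python 3
    with open(filename, newline='') as csvfile:
        lines = csv.reader(csvfile)
        next(lines)  # skip the header row
        dataset = list(lines)
    # Convert the list of string rows into a 2-D float array
    return np.array(dataset).astype('float')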

Now we can use the loadDataset function to import the Blood Transfusion Service Center dataset. We can test it by printing the number of instances and the number of attributes.

In [2]:
dataset = loadDataset('transfusion.csv')
print ('Number of Instances: ' + repr(len(dataset)))
print ('Number of Attributes: ' + repr(len(dataset[0])))

Number of Instances: 748
Number of Attributes: 5


### 2. Split dataset
Next, we need to split the data into a training dataset and a test dataset randomly with a given split ratio, by using the random module.

In [3]:
import random

def splitDataset(dataset, split):
    trainingSet = []
    testSet = list(dataset)
    trainingSize = int(len(dataset) * split)
    while len(trainingSet) < trainingSize:
        i = random.randrange(len(testSet))
        trainingSet.append(testSet.pop(i))
    trainingSet = np.array(trainingSet)
    testSet = np.array(testSet)
    return trainingSet, testSet

Now we can use the splitDataset function to split the loaded data into a training dataset and a test dataset, with a ratio of 67% training and 33% test. We can test it by printing the size of each resulting dataset.

In [4]:
train, test = splitDataset(dataset, 0.67)
print('Train: ' + repr(len(train)))
print('Test: ' + repr(len(test))) 

Train: 501
Test: 247
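
The split above draws from the global random state, so the set sizes are fixed but the rows in each set change from run to run. If repeatable results are needed, one option is a seeded split. The sketch below is a hypothetical alternative, not the function used in this tutorial; `split_with_seed` and its `seed` parameter are assumptions, and a list of integers stands in for the dataset rows.

```python
import random


def split_with_seed(rows, split, seed):
    """Shuffle a copy of rows with a private RNG and cut it at the split ratio."""
    rng = random.Random(seed)   # private generator; the global random state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * split)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(748))         # stand-in for the 748 dataset rows
train_a, test_a = split_with_seed(rows, 0.67, seed=0)
train_b, test_b = split_with_seed(rows, 0.67, seed=0)
print(len(train_a), len(test_a))
```

Calling the function twice with the same seed returns the same partition, and the sizes match the 501/247 split shown above.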


### 3. Separate data by class
Next, in order to summarize statistics for each class, we first need to separate the training instances by class value.

In [5]:
def seperate(dataset):
    yes = []
    no = []
    for row in dataset:
        if row[-1] == 1:
            yes.append(row)
        else:
            no.append(row)
    yes = np.array(yes)
    no = np.array(no)
    return (no, yes)

You can test the function with some sample data as follows:

In [6]:
sampleData = np.array([[1,11,1],[2,12,0],[3,13,1],[4,14,0]])
no, yes = seperate(sampleData)
print('No: ' + repr(no))
print('Yes: ' + repr(yes)) 

No: array([[ 2, 12,  0],
       [ 4, 14,  0]])
Yes: array([[ 1, 11,  1],
       [ 3, 13,  1]])


### 4. Summarize attribute distributions for each class
Now we need to summarize statistics for each class by calculating the mean and standard deviation of each attribute within the class, because the Gaussian Naive Bayes model assumes that each attribute in each class follows a Gaussian distribution, also known as a normal distribution. We use NumPy's built-in array methods to calculate the mean and the sample standard deviation (`ddof = 1`).

In [7]:
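def summarize(dataset):
    # seperate() returns (no, yes), so key 0 is class 0 and key 1 is class 1
    instances = seperate(dataset)
    summary = {}
    for k, rows in enumerate(instances):
        summary_k = []
        # Transpose to iterate over columns; drop the last column (the label)
        for atr in rows.T[:-1]:
            mean = atr.mean()
            std = atr.std(ddof=1)  # sample standard deviation
            summary_k.append((mean, std))
        summary[k] = summary_k
    return summary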

You can test the function with some sample data as follows:

In [8]:
sampleData = np.array([[1,11,1],[2,12,0],[3,13,1],[4,14,0]])
summary = summarize(sampleData)
print('Summary: ' + repr(summary)) 

Summary: {0: [(3.0, 1.4142135623730951), (13.0, 1.4142135623730951)], 1: [(2.0, 1.4142135623730951), (12.0, 1.4142135623730951)]}
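
The value 1.4142... in the output above comes from the `ddof = 1` setting, which divides by n - 1 (sample standard deviation) rather than n (population standard deviation). As a small cross-check using only the standard library, the sketch below computes both for the first attribute of the class-0 rows in the sample data; the variable names are hypothetical.

```python
import statistics

class0_attr = [2, 4]   # first attribute of the class-0 rows in the sample data

sample_std = statistics.stdev(class0_attr)       # divides by n - 1, like ddof=1
population_std = statistics.pstdev(class0_attr)  # divides by n, like ddof=0
print(sample_std, population_std)
```

With only two values the difference is large (about 1.414 versus 1.0); it shrinks as the number of instances per class grows.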


You can also take a look at the summary statistics for the training dataset.

In [9]:
summary = summarize(train)
print('Summary: ' + repr(summary)) 

Summary: {0: [(10.771653543307087, 8.791014230790875), (4.6955380577427821, 4.4893682636981813), (1173.8845144356956, 1122.3420659245455), (33.490813648293965, 24.094726320120291)], 1: [(5.625, 5.094176111217676), (7.6083333333333334, 7.8497399553709277), (1902.0833333333333, 1962.4349888427319), (32.333333333333336, 23.650229033422345)]}


### 5. Calculate probabilities for each class
Now with the summary statistics for each class, we can calculate the probability of a class and the probability of a data instance belonging to a given class.

The probability of a class (the prior) is simply the fraction of training instances that belong to it.

In [10]:
def probY(trainingSet):
    prob_yes = float(sum(trainingSet[:,-1]))/len(trainingSet)
    prob_no = 1 - prob_yes
    return {0: prob_no, 1: prob_yes}

Next, we calculate the likelihood of a data instance given each class by multiplying together the Gaussian densities of all of its attribute values. Therefore, for each class value, we obtain a likelihood of the data instance under that class.

In [11]:
def probXGivenY(testSet, summary):
    prob_xgiveny = []
    for x in testSet:
        prob = {}
        for k in summary:
            prob_k = 1.0
            # Multiply the Gaussian densities of all attributes (last column is the label)
            for i, xi in enumerate(x[:-1]):
                meani, stdi = summary[k][i]
                prob_ki = np.exp(-(xi - meani)**2 / (2 * stdi**2)) / (np.sqrt(2 * np.pi) * stdi)
                prob_k *= prob_ki
            prob[k] = prob_k
        prob_xgiveny.append(prob)
    return prob_xgiveny

You can test the function with some sample data as follows:

In [12]:
sampleTrain = np.array([[1,11,1],[2,12,0],[3,13,1],[4,14,0]])
summary = summarize(sampleTrain)
x = [[5,15,1]]
prob_Y = probY(sampleTrain)
prob_XGivenY = probXGivenY(x, summary)
print('Probability of Y: ' + repr(prob_Y))
print('Probability of X given Y: ' + repr(prob_XGivenY))

Probability of Y: {0: 0.5, 1: 0.5}
Probability of X given Y: [{0: 0.010769639650924319, 1: 0.00088402585592600919}]
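
With only a few attributes the product of densities is well behaved, but with many attributes it can underflow to zero in floating point. A common remedy is to sum log-densities instead of multiplying densities. The sketch below is a hypothetical variant, not the implementation used in this tutorial; `log_gaussian_pdf` and `log_likelihood` are assumed names, and the class-0 statistics are taken from the sample data above.

```python
import math


def log_gaussian_pdf(x, mean, std):
    """Natural log of the N(mean, std**2) density at x."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - ((x - mean) ** 2) / (2.0 * std ** 2)


def log_likelihood(x, attr_params):
    """Sum of per-attribute log-densities (log of the product of densities)."""
    return sum(log_gaussian_pdf(xi, mean, std) for xi, (mean, std) in zip(x, attr_params))


std = math.sqrt(2.0)
attr_params = [(3.0, std), (13.0, std)]   # class-0 statistics from the sample data
ll = log_likelihood((5.0, 15.0), attr_params)
print(math.exp(ll))                       # agrees with the class-0 value printed above

# 400 factors of 0.01 underflow to 0.0 as a product, but not as a sum of logs
product = 1.0
for _ in range(400):
    product *= 0.01
log_sum = 400 * math.log(0.01)
print(product, log_sum)
```

Since the logarithm is monotonic, comparing log-scores selects the same class as comparing the original products.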


### 6. Classify
Now that we can calculate the likelihood of a data instance under each class, we multiply it by the class prior, choose the larger product, and take the corresponding class value as the prediction.

In [13]:
def classify(probXGivenY, probY):
    ypred = []
    for pxy in probXGivenY:
        prob_no = pxy[0] * probY[0]
        prob_yes = pxy[1] * probY[1]
        if prob_no > prob_yes:
            ypred.append(0.0)
        else:
            ypred.append(1.0)
        
    ypred = np.array(ypred)
    return ypred

You can test the function with some sample data as follows:

In [14]:
y_pred = classify(prob_XGivenY,prob_Y)
print ('Prediction: '+ repr(y_pred))

Prediction: array([ 0.])


### 7. Evaluate the accuracy
Now that we have the predictions, we can compare them with the class values in the test dataset. We evaluate the classification accuracy as the ratio of the number of correct predictions to the total number of predictions.

In [15]:
def getAccuracy(ypred, testSet):
    count = 0.0
    for i in range(len(ypred)):
        if ypred[i] == testSet[i,-1]:
            count += 1.0
    return count/len(ypred)

You can test the function with some sample data as follows:

In [16]:
testSet = np.array([[1,1], [2,0], [3,1]])
predictions = np.array([1, 1, 1])
accuracy = getAccuracy(predictions, testSet)
print(accuracy)

0.666666666667
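
Accuracy alone can be misleading when one class is much more common than the other, so it can help to also count each kind of correct and incorrect prediction. The sketch below is an optional supplement rather than part of the tutorial's pipeline; `confusion_counts` is a hypothetical helper, applied to the same sample labels and predictions as above.

```python
def confusion_counts(y_pred, y_true):
    """Count true/false positives and negatives for binary labels 0 and 1."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for pred, true in zip(y_pred, y_true):
        if pred == 1:
            counts["tp" if true == 1 else "fp"] += 1
        else:
            counts["fn" if true == 1 else "tn"] += 1
    return counts


y_true = [1, 0, 1]   # labels from the sample testSet above
y_pred = [1, 1, 1]   # sample predictions above
counts = confusion_counts(y_pred, y_true)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
print(counts, accuracy)
```

In this sample the single error is a false positive: the classifier predicted 1 for an instance whose true label is 0.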


## Putting it together
We now have all the functions, and we need to tie them together. Below is the complete example application of the Gaussian Naive Bayes algorithm implemented in Python.

In [17]:
#prepare data
split = 0.67
dataset = loadDataset('transfusion.csv')
train, test = splitDataset(dataset, split)
print('Train: ' + repr(len(train)))
print('Test: ' + repr(len(test))) 

#train data
summary = summarize(train)
prob_Y = probY(train)

#generate predictions
prob_XGivenY = probXGivenY(test, summary)
ypred = classify(prob_XGivenY, prob_Y)
accuracy = getAccuracy(ypred, test)

print('Accuracy: ' + repr(accuracy))

#example outputs
print(ypred[0:18])
print(test[0:18,-1])

Train: 501
Test: 247
Accuracy: 0.7692307692307693
[ 1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  1.  1.  0.]
[ 1.  1.  1.  1.  0.  1.  1.  1.  0.  1.  0.  1.  1.  0.  1.  1.  1.  0.]


## References and further resources
1. Blood Transfusion Service Center Data Set. https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center
2. Wikipedia: Naive Bayes classifier. https://en.wikipedia.org/wiki/Naive_Bayes_classifier
3. Wikipedia: Normal distribution. https://en.wikipedia.org/wiki/Normal_distribution
4. Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm. https://machinelearningmastery.com/better-naive-bayes/
5. Tom M. Mitchell (1997). Machine Learning. Chapter 6.