This is an overview of the Naive Bayes Classifier.

Broadly, the goal of Naive Bayes is to assign each problem instance a class label drawn from a finite set. These instances are represented as vectors of features. There are many different flavors of this algorithm, but what they have in common is that they assume that the value of each feature is independent of every other feature given the class, hence the "naive" title.

Naive Bayes is a conditional probability model, so to start we will review some basics of probability that will be needed later.

# Basics of Probability

Consider an experiment that can produce a number of results. We call the collection of all possible outcomes the sample space $Ω$, and an event is a subset of this sample space.

Given an event $X$ in this sample space, we will denote the **probability** of $X$ occurring as $p(X)$. Given a second event $Y$, we can write the **joint probability** of both events occurring as $p(X,Y)$.

The **conditional probability** is the probability of some event $X$ given that $Y$ has occurred, and is denoted $p(X|Y)$.

We can relate the joint probability and the conditional probability by the **chain rule** $p(X,Y)=p(Y|X)p(X)$, which is also called the product rule.



**Bayes' theorem** relates the two conditional probabilities: $p(Y|X)=\frac{p(X|Y)p(Y)}{p(X)}$. Using the sum rule we can express the denominator as $p(X)=\sum_Y p(X|Y)p(Y)$, so it can be looked at as a normalization constant that ensures the conditional probabilities $p(Y|X)$ on the left-hand side sum to 1 over all values of $Y$.

In plain English, if we think of $Y$ as an event we are interested in and $X$ as some event that we observe, $p(Y|X)$ is the probability of $Y$ now that $X$ has occurred (the posterior) and $p(Y)$ is the probability of $Y$ prior to $X$ occurring (the prior). The probability $p(X|Y)$ is called the likelihood, so the relation between the conditional probabilities can be expressed as:

$$\text{posterior } \propto \text{ likelihood} \times \text{prior}$$
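
As a quick numerical check of the rules above, the following sketch builds a small joint distribution over two binary events and confirms that Bayes' theorem reproduces the conditional probability obtained directly from the joint. The probabilities in `joint` are made-up values chosen only for illustration.

```python
# Hypothetical joint distribution p(X, Y) over two binary events.
# Keys are (x_occurred, y_occurred); the four entries sum to 1.
joint = {
    (True, True): 0.20,
    (True, False): 0.10,
    (False, True): 0.20,
    (False, False): 0.50,
}

# Marginals via the sum rule.
p_x = sum(p for (x, _), p in joint.items() if x)
p_y = sum(p for (_, y), p in joint.items() if y)

# Conditionals via the chain rule p(X,Y) = p(Y|X) p(X) = p(X|Y) p(Y).
p_y_given_x = joint[(True, True)] / p_x
p_x_given_y = joint[(True, True)] / p_y

# Bayes' theorem: posterior = likelihood * prior / normalization.
posterior = p_x_given_y * p_y / p_x

print(p_y_given_x, posterior)
```

Both printed values agree, which is exactly what Bayes' theorem states.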



# Naive Bayes

We mentioned earlier that Naive Bayes is a conditional probability model, meaning that given a problem instance $\vec{x} = (x_1,...,x_n)$ with $n$ features, it assigns a conditional probability $p(C_k|x_1,...,x_n)$ to each of the $K$ possible classes $C_k$.

From Bayes' theorem we have $p(C_k|\vec{x})=\frac{p(\vec{x}|C_k)p(C_k)}{p(\vec{x})}$. 

The denominator can be written as a normalization factor using the sum rule: $p(\vec{x})=\sum_k p(C_k)p(\vec{x}|C_k)$.

Since the denominator does not depend on the class $C_k$, it is the same constant for every class, and we are only interested in computing the numerator of this equation.

Notice that by the chain rule we can write the numerator as a joint probability $p(\vec{x}|C_k)p(C_k) =p(C_k,\vec{x}) =p(C_k,x_1,...,x_n)$. Applying the chain rule repeatedly, we can expand this as:

$$\begin{align}
p(C_k,x_1,...,x_n)&=p(x_1,...,x_n,C_k)\\
&=p(x_1|x_2,...,x_n,C_k)p(x_2,...,x_n,C_k)\\
&=p(x_1|x_2,...,x_n,C_k)p(x_2|x_3,...,x_n,C_k)p(x_3,...,x_n,C_k)\\
&=...\\
&=p(x_1|x_2,...,x_n,C_k)p(x_2|x_3,...,x_n,C_k)...p(x_{n-1}|x_n,C_k)p(x_n|C_k)p(C_k)
\end{align}$$

Now using our "naive" assumption that the features are mutually independent conditional on the category $C_k$, we get $$p(x_i|x_{i+1},...,x_n,C_k)=p(x_i|C_k), \forall i$$

Therefore our model can be expressed as
$$\begin{align}
p(C_k|x_1,...,x_n) &= \frac{1}{Z}p(C_k,x_1,...,x_n)\\
&= \frac{1}{Z}p(C_k)p(x_1|C_k)p(x_2|C_k)...p(x_n|C_k)\\
&= \frac{1}{Z}p(C_k) \prod_{i=1}^{n}p(x_i|C_k)
\end{align}
$$

where $Z=p(\vec{x})=\sum_k p(C_k)p(\vec{x}|C_k)$ from before.
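
The factorized posterior can be sketched in a few lines of Python. The function name `naive_bayes_posterior` and the numbers below are illustrative assumptions, not part of any library; the sketch simply multiplies each prior by its per-feature likelihoods and divides by $Z$.

```python
import math

def naive_bayes_posterior(priors, likelihoods):
    """Return p(C_k | x) for every class under the naive factorization.

    priors:      dict mapping class label -> p(C_k)
    likelihoods: dict mapping class label -> list of p(x_i | C_k)
    """
    # Numerator p(C_k) * prod_i p(x_i | C_k) for each class.
    unnormalized = {
        label: prior * math.prod(likelihoods[label])
        for label, prior in priors.items()
    }
    # Z is the sum of the numerators over all classes.
    z = sum(unnormalized.values())
    return {label: score / z for label, score in unnormalized.items()}

# Hypothetical values for two classes and three features.
priors = {"A": 0.5, "B": 0.5}
likelihoods = {"A": [0.2, 0.5, 0.1], "B": [0.4, 0.5, 0.3]}

posterior = naive_bayes_posterior(priors, likelihoods)
print(posterior)
```

With these numbers the numerators are 0.005 and 0.03, so class `B` receives a posterior of 6/7.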


### What do we do with this?

To turn this calculation of a conditional probability into a classifier, all we have to do is combine it with a decision rule. The easiest rule to implement is to choose the most probable class, which minimizes the probability of misclassification. This is known as the maximum a posteriori (MAP) decision rule. Because $Z$ is the same for every class, it can be dropped from the maximization, and the resulting Bayes classifier can be written as:

$$\hat{y} = \text{argmax}_{k\in\{1,...,K\}} p(C_k)\prod_{i=1}^n p(x_i|C_k)$$
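
One practical variant, not used in the rest of this notebook, is to evaluate the decision rule in log space. The logarithm is monotonic, so the argmax is unchanged, but sums of logs avoid the floating-point underflow that a long product of small probabilities can cause. The sketch below assumes all priors and likelihoods are strictly positive; the name `predict` is hypothetical.

```python
import math

def predict(priors, likelihoods):
    """Return the label maximizing log p(C_k) + sum_i log p(x_i | C_k).

    priors:      dict mapping class label -> p(C_k), all > 0
    likelihoods: dict mapping class label -> list of p(x_i | C_k), all > 0
    """
    def log_score(label):
        return math.log(priors[label]) + sum(
            math.log(p) for p in likelihoods[label]
        )
    return max(priors, key=log_score)

# Hypothetical values for two classes and three features.
priors = {"A": 0.5, "B": 0.5}
likelihoods = {"A": [0.2, 0.5, 0.1], "B": [0.4, 0.5, 0.3]}
print(predict(priors, likelihoods))
```

For instance, with 100 features that each have likelihood `1e-5`, the plain product underflows to `0.0` in double precision, while the sum of logs stays finite.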



# Implementation

To use the formulation above, we need to estimate the class priors and the parameters of each feature's distribution from training data.

The first thing we need is a way to calculate the class prior $p(C_k)$. We can either assume that the classes are all equally likely or, to do even better, estimate the class probability from the training set: $p(C_k)=\frac{\text{number of samples in the class}}{\text{total number of samples}}$.
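
The frequency-based estimate of the prior can be sketched as follows. The helper name `estimate_priors` and the example labels are hypothetical.

```python
from collections import Counter

def estimate_priors(labels):
    """Estimate p(C_k) as (samples in class k) / (total samples)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

# Hypothetical training labels: 2 of 5 samples are "spam".
labels = ["spam", "ham", "ham", "spam", "ham"]
priors = estimate_priors(labels)
print(priors)
```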

To use Naive Bayes on a real-world problem we need to make an assumption about the probability distribution of our features. For discrete features, such as word counts in document classification, we can use a multinomial or Bernoulli distribution. For continuous features a popular assumption is a normal distribution. The chosen distribution is what we use to calculate the likelihoods $p(x_i|C_k)$; below we use the normal distribution, which gives Gaussian Naive Bayes.


# Gaussian Naive Bayes

When our data is continuous we might assume that the values associated with each class are distributed according to a normal distribution. 

If we do this, we can split the training data by class and then, for each feature, calculate the mean $\mu_k$ and variance $\sigma^2_k$ of its values within each class $C_k$. Then, given an observation $v$, we can calculate the probability density under a given class by plugging these values into a normal distribution:
$$p(x=v|C_k) = \frac{1}{\sqrt{2\pi \sigma^2_k}} e^{-\frac{(v-\mu_k)^2}{2\sigma^2_k}}$$

We then combine these likelihoods with our priors to calculate the unnormalized posterior for each class and choose the class with the largest value.
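
As a small sketch of this step for a single feature, the code below fits the mean and variance to the four male heights from the training table used later in this notebook and evaluates the density at two points. It uses the sample variance (dividing by $n-1$), which is an assumption made here to match the default of pandas' `.var()`; the helper name `gaussian_pdf` is hypothetical.

```python
import math
from statistics import mean, variance

def gaussian_pdf(v, mu, var):
    """Normal density with mean mu and variance var, evaluated at v."""
    return math.exp(-(v - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Male heights from the training table used later in this notebook.
male_heights = [6.00, 5.90, 5.61, 5.74]

mu = mean(male_heights)       # per-class mean of this feature
var = variance(male_heights)  # sample variance (divides by n - 1)

density_at_mean = gaussian_pdf(mu, mu, var)
density_at_test = gaussian_pdf(5.1, mu, var)
print(mu, var, density_at_mean, density_at_test)
```

The density at the mean comes out greater than 1 here. That is fine: it is a density rather than a probability, and only its relative size across classes matters for the decision rule.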

Below we will implement a simple example of Gaussian Naive Bayes.

We will consider the problem of classifying a person as male or female given the continuous features height, weight and foot size, using the small training dataset shown below.


In [None]:
#Our training dataset
import pandas as pd

train_ = pd.read_csv('/content/drive/MyDrive/NaiveBayes/Train.csv')
print(train_)

   Person  Height  Weight  Foot Size
0    male    6.00     190         12
1    male    5.90     175         11
2    male    5.61     165         10
3    male    5.74     180         10
4  female    5.10     100          6
5  female    5.00     120          5
6  female    5.50     131          8
7  female    5.23     124          9


In [None]:
#Gaussian Naive Bayes
import math
import pandas as pd

#Creates the numerator of Naive Bayes: the prior times the product of the likelihoods
def NBayes(likelihood, prior):
  out = prior
  for value in likelihood:
    out = out * value
  return out

#Calculates the likelihoods
def Gaussian(x, mean, var):
  #Evaluates the Gaussian probability density with the given mean and variance at x
  return (1/math.sqrt(2*math.pi*var))*math.exp(-(x-mean)*(x-mean)/(2*var))

#Our test values
#Test.csv is loaded here, but the single observation classified below is hard-coded in x
test_ = pd.read_csv('/content/drive/MyDrive/NaiveBayes/Test.csv')
#x[0] is the label slot and is not used in the computation;
#it keeps x[i] aligned with column i of the training dataframe
x = ['female', 5.1, 120, 7]

train_ = pd.read_csv('/content/drive/MyDrive/NaiveBayes/Train.csv')

#Parameters for training dataframe
#The class counts are not used below because the priors are fixed at 0.5
n_cols = len(train_.columns)
n_male = len(train_.loc[train_.Person == "male"])
n_female = len(train_.loc[train_.Person == "female"])


#Assume prior distribution of 50% males 50% females
prior_male = 0.5
prior_female = 0.5

#Separate males and females in dataframe
males = train_.loc[train_.Person == "male"]
females = train_.loc[train_.Person == "female"]

#Calculating the likelihoods for each attribute
#Column 0 holds the label, so the features start at column 1
#Note that pandas .var() returns the sample variance (divides by n-1)
likelihood_male = []
likelihood_female = []

for i in range(1, n_cols):
  likelihood_male.append(Gaussian(x[i], males.iloc[:,i].mean(), males.iloc[:,i].var()))
  likelihood_female.append(Gaussian(x[i], females.iloc[:,i].mean(), females.iloc[:,i].var()))


#Using NBayes to get the unnormalized posterior for each class
prob_male = NBayes(likelihood_male, prior_male)
prob_female = NBayes(likelihood_female, prior_female)

#Decision rule: pick the class with the larger unnormalized posterior
if prob_male >= prob_female:
  print("Prediction is Male")
else:
  print("Prediction is Female")



Prediction is Female
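
For readers without access to the CSV files, the same steps can be sketched with only the standard library by typing the training table in directly. This is an illustrative variant rather than the notebook's own code: the helper names `fit` and `posterior` are hypothetical, the priors are estimated from class counts (which gives 0.5 each for this table), and the scores are divided by $Z$ so that they can be read as posterior probabilities.

```python
import math
from statistics import mean, variance

# Training table from above: (label, height, weight, foot size).
train = [
    ("male", 6.00, 190, 12),
    ("male", 5.90, 175, 11),
    ("male", 5.61, 165, 10),
    ("male", 5.74, 180, 10),
    ("female", 5.10, 100, 6),
    ("female", 5.00, 120, 5),
    ("female", 5.50, 131, 8),
    ("female", 5.23, 124, 9),
]

def gaussian_pdf(v, mu, var):
    """Normal density with mean mu and variance var, evaluated at v."""
    return math.exp(-(v - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(rows):
    """Return {label: (prior, [(mean, variance) per feature])}."""
    by_label = {}
    for label, *features in rows:
        by_label.setdefault(label, []).append(features)
    model = {}
    for label, samples in by_label.items():
        columns = zip(*samples)  # one tuple of values per feature
        params = [(mean(col), variance(col)) for col in columns]
        model[label] = (len(samples) / len(rows), params)
    return model

def posterior(model, x):
    """Return the normalized posterior p(C_k | x) for each class."""
    scores = {}
    for label, (prior, params) in model.items():
        likelihoods = [gaussian_pdf(v, mu, var) for v, (mu, var) in zip(x, params)]
        scores[label] = prior * math.prod(likelihoods)
    z = sum(scores.values())
    return {label: s / z for label, s in scores.items()}

model = fit(train)
post = posterior(model, (5.1, 120, 7))
prediction = max(post, key=post.get)
print(prediction, post)
```

The prediction agrees with the output above, and the normalized posterior shows that the decision is not close for this test point.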
