## Naive Bayes

- a classification algorithm
- models are probability distributions
- general assumption
    - data follows some *unknown* probability distribution **D** over input/output pairs (x,y)
- suppose we _know_ **D**
    - we have a function _computeD(x,y)_ that returns the probability of pair (x,y) under **D**
    - then, classification is simple
        - **Bayes optimal classifier** $f^{BO}$ returns, for input $x'$, the $y'$ for which _computeD(x',y')_ is maximal
        - that is, it returns the $y'$ (e.g., the class) with the highest probability given $x'$
    - optimal: smallest error of all possible classifiers
- we try to _estimate_ **D** with some **D'**, based on a training set
    - we hope **D** and **D'** are similar
    - we use **D'** for classification
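
The Bayes optimal classifier described above can be sketched in a few lines, assuming **D** is known and small enough to store as a table. The weather/activity pairs and their probabilities below are hypothetical and chosen only for illustration; `compute_d` plays the role of _computeD_.

```python
# Hypothetical joint distribution D over (x, y) pairs.
# The numbers are illustrative only; they sum to 1.
D = {
    ("sunny", "play"): 0.35,
    ("sunny", "stay"): 0.05,
    ("rainy", "play"): 0.10,
    ("rainy", "stay"): 0.50,
}


def compute_d(x, y):
    """Probability of the pair (x, y) under D (0 if the pair never occurs)."""
    return D.get((x, y), 0.0)


def bayes_optimal(x, labels):
    """Return the label y that maximizes compute_d(x, y)."""
    return max(labels, key=lambda y: compute_d(x, y))


labels = ["play", "stay"]
print(bayes_optimal("sunny", labels))  # play
print(bayes_optimal("rainy", labels))  # stay
```

In practice **D** is unknown, so the table would be replaced by an estimate **D'** learned from training data.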
    
    

## Bayes’ Theorem 

- provides a way to calculate the probability of some data belonging to a given class, given prior knowledge
- <span style="color:blue">P(class|data) = (P(data|class) * P(class)) / P(data)</span>

Smoke and fire example: <span style="color:red">**Fire => Smoke**</span>

- What is the probability that there is fire given that there is smoke? 

Where <span style="color:blue">P(Fire|Smoke)</span> is the **Posterior probability**, <span style="color:blue">P(Fire)</span> is the **Prior probability**, <span style="color:blue">P(Smoke|Fire)</span> is the **Likelihood**, and <span style="color:blue">P(Smoke)</span> is the **Evidence**:

- <span style="color:blue">P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)</span>
- <span style="color:green">**Posterior = Likelihood * Prior / Evidence**</span>
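
As a minimal sketch, the formula translates directly into a one-line function. The notes give no numbers for the smoke and fire example, so the three probabilities below are assumptions made up for illustration.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("evidence P(B) must be positive")
    return likelihood * prior / evidence


# Hypothetical values, chosen only for illustration:
p_smoke_given_fire = 0.9  # likelihood P(Smoke|Fire)
p_fire = 0.01             # prior P(Fire)
p_smoke = 0.1             # evidence P(Smoke)

p_fire_given_smoke = posterior(p_smoke_given_fire, p_fire, p_smoke)
print(round(p_fire_given_smoke, 4))  # 0.09
```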




## A side example: diagnostic test

Consider a population in which each person either has or does not have the coronavirus (<span style="color:blue">Corona is T/F</span>), and a medical test that returns positive or negative for corona (<span style="color:blue">Test is T/F</span>).

__Problem__: If a randomly selected patient takes the test and it comes back positive, what is the probability that the patient has the virus?

<span style="color:red">P(Test=T | Corona=T) = 0.85</span> 

[Test: go to http://etc.ch/36Hx](https://directpoll.com/r?XDbzPBd3ixYqg8gh4tg26y59cvoG6LjOoN3TeIlKs)

## Some calculations

The test's sensitivity alone ignores the base rate: the probability that a randomly selected person has corona, regardless of the result of any diagnostic test.

- <span style="color:red">P(Test=T | Corona=T) = 0.85</span>

- <span style="color:red">P(Corona=T) = 0.0002</span>

Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)

- P(Corona=T | Test=T) = P(Test=T|Corona=T) * P(Corona=T) / P(Test=T)

<span style="color:green">**P(Corona=T | Test=T) = 0.85 * 0.0002 /** *P(Test=T)*</span>


### More calculations

*P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)*

- P(Test=T) = P(Test=T|Corona=T) * P(Corona=T) + P(Test=T|Corona=F) * P(Corona=F)

Firstly, we can calculate P(Corona=F) as the complement of P(Corona=T), which we already know

- P(Corona=F) = 1 – P(Corona=T)
= 1 – 0.0002
= 0.9998

Let’s plug in what we have:


- <span style="color:green">P(Test=T) = 0.85 * 0.0002 + *P(Test=T|Corona=F)* * 0.9998</span>
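
The law of total probability used here can be sketched as a small function. Since P(Test=T|Corona=F) is still unknown at this point, the loop tries a few hypothetical false positive rates (assumptions, not values from the test) to show how strongly P(Test=T) depends on that one number.

```python
def total_probability(p_b_given_a, p_a, p_b_given_not_a):
    """P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)."""
    return p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)


sensitivity = 0.85   # P(Test=T | Corona=T), from the text
base_rate = 0.0002   # P(Corona=T), from the text

# Hypothetical candidates for the unknown P(Test=T | Corona=F).
for false_positive_rate in (0.01, 0.05, 0.10):
    p_test = total_probability(sensitivity, base_rate, false_positive_rate)
    print(f"P(Test=T|Corona=F) = {false_positive_rate:.2f} -> P(Test=T) = {p_test:.5f}")
```

Because almost everyone in the population is healthy, P(Test=T) ends up very close to the false positive rate itself.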


### More

We need to know how good the test is at correctly identifying people that do not have corona. 

That is, testing negative (Test=F) when the patient does not have corona (Corona=F).

We will use a contrived specificity value of 95%.

P(Test=F | Corona=F) = 0.95

P(Test=T|Corona=F) = 1 – P(Test=F | Corona=F)
= 1 – 0.95
= 0.05

## So

P(Test=T) = 0.85 * 0.0002 + 0.05 * 0.9998
= 0.00017 + 0.04999
= 0.05016

<span style="color:green">**P(Corona=T | Test=T) = 0.85 * 0.0002 /** *P(Test=T)*</span>

= 0.85 * 0.0002 / 0.05016

P(Corona = T | Test=T) = 0.00017 / 0.05016

P(Corona = T | Test=T) = 0.003389154704944

Hence, if a patient tests positive with this test, there is only about a 0.34% chance that they actually have corona.

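It is a terrible diagnostic test for this population: because the base rate is so low, false positives (0.04999) vastly outnumber true positives (0.00017).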
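
The whole calculation can be reproduced in a few lines of Python as a check on the arithmetic. The three inputs are the values used in the text; the variable names are my own.

```python
sensitivity = 0.85   # P(Test=T | Corona=T)
specificity = 0.95   # P(Test=F | Corona=F)
base_rate = 0.0002   # P(Corona=T)

# P(Test=T | Corona=F) is the complement of the specificity.
false_positive_rate = 1.0 - specificity

# Evidence: P(Test=T) via the law of total probability.
p_test_positive = (sensitivity * base_rate
                   + false_positive_rate * (1.0 - base_rate))

# Posterior: P(Corona=T | Test=T) via Bayes' theorem.
p_corona_given_positive = sensitivity * base_rate / p_test_positive

print(f"P(Test=T) = {p_test_positive:.5f}")                      # 0.05016
print(f"P(Corona=T | Test=T) = {p_corona_given_positive:.6f}")  # 0.003389
```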



## Careful when dealing with these probabilities

- Sensitivity: 85% of people with corona will get a positive test result.
- Base Rate: 0.02% of people have corona.
- Specificity: 95% of people without corona will get a negative test result.

The remaining quantities, P(Corona=F), P(Test=T|Corona=F) and P(Test=T), had to be calculated.

Bayes' Theorem also lets us be more precise about a given scenario. For example, if we had more information about the patient (e.g., their age) and about the domain (e.g., corona rates by age range), we could offer an even more accurate probability estimate.
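
To illustrate how extra information changes the estimate, the sketch below recomputes the posterior for several base rates. Only 0.0002 comes from the example above; the other group names and rates are hypothetical and stand in for, say, age groups with a higher prevalence.

```python
def posterior_positive(base_rate, sensitivity=0.85, specificity=0.95):
    """P(Corona=T | Test=T) for a given base rate P(Corona=T)."""
    true_positives = sensitivity * base_rate
    false_positives = (1.0 - specificity) * (1.0 - base_rate)
    return true_positives / (true_positives + false_positives)


# Hypothetical base rates per group; only 0.0002 is taken from the text.
groups = {"whole population": 0.0002, "group A": 0.01, "group B": 0.10}

for name, rate in groups.items():
    print(f"{name}: base rate {rate:.4f} -> posterior {posterior_positive(rate):.4f}")
```

With the same test, the posterior rises from well under 1% to roughly 65% as the base rate grows from 0.02% to 10%, which is why the prior matters so much.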

So, how do we use this theorem for classification?

In [19]:
# Naive Bayes Classification

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import pandas as pd

In [20]:
# Importing the dataset
dataset = pd.read_csv("iris.csv")

In [21]:
# Looking at the first 5 rows of the dataset
dataset.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [22]:
# Splitting the dataset into independent (feature) and dependent (target) variables
X = dataset.iloc[:,:4].values
y = dataset["species"].values

In [9]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 82)

In [10]:
# Feature scaling to bring all variables onto a common scale
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [11]:
# Fitting a Gaussian Naive Bayes classifier to the Training set
from sklearn.naive_bayes import GaussianNB
nvclassifier = GaussianNB()
nvclassifier.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [12]:
# Predicting the Test set results
y_pred = nvclassifier.predict(X_test)
print(y_pred)

['Iris-virginica' 'Iris-virginica' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-virginica'
 'Iris-versicolor' 'Iris-setosa' 'Iris-versicolor' 'Iris-setosa'
 'Iris-virginica' 'Iris-setosa' 'Iris-virginica' 'Iris-virginica'
 'Iris-versicolor' 'Iris-virginica' 'Iris-setosa' 'Iris-virginica'
 'Iris-versicolor']


In [13]:
# Let's see the actual and predicted values side by side
y_compare = np.vstack((y_test,y_pred)).T
# actual value in the left column, predicted value in the right column
# showing the first 5 rows
y_compare[:5,:]

array([['Iris-virginica', 'Iris-virginica'],
       ['Iris-virginica', 'Iris-virginica'],
       ['Iris-setosa', 'Iris-setosa'],
       ['Iris-setosa', 'Iris-setosa'],
       ['Iris-setosa', 'Iris-setosa']], dtype=object)

In [14]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[11  0  0]
 [ 0  8  1]
 [ 0  1  9]]


In [18]:
#finding accuracy from the confusion matrix.
a = cm.shape
corrPred = 0
falsePred = 0

for row in range(a[0]):
    for c in range(a[1]):
        if row == c:
            corrPred +=cm[row,c]
        else:
            falsePred += cm[row,c]
print('Correct predictions: ', corrPred)
print('False predictions', falsePred)
print('\n\nAccuracy of the Naive Bayes Classification is: ', corrPred/(cm.sum()))

Correct predictions:  28
False predictions 2


Accuracy of the Naive Bayes Classification is:  0.9333333333333333
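
The same accuracy can be read off the confusion matrix more directly: correct predictions sit on the diagonal. The sketch below hard-codes the matrix printed above so that it runs on its own, and additionally reports per-class recall, which is not computed in the notebook.

```python
# Confusion matrix copied from the output above (rows: actual, columns: predicted).
cm = [
    [11, 0, 0],
    [0, 8, 1],
    [0, 1, 9],
]

correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal entries
total = sum(sum(row) for row in cm)
accuracy = correct / total

# Recall per class: correct predictions divided by the actual class count.
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

print("Correct:", correct, "False:", total - correct)
print("Accuracy:", accuracy)
print("Recall per class:", recall)
```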


## Naive Bayes

Probability for a single data point:
- p(y,x) = p(y, x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>D</sub>)

It's a distribution over a LOT of variables:
- p(x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>D</sub>, y) = p(y) * p(x<sub>1</sub>|y) * p(x<sub>2</sub>|y,x<sub>1</sub>) * ... * p(x<sub>D</sub>|y,x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>D-1</sub>) = p(y) * $\prod_{d=1}^{D} p(x_d|y,x_1, x_2, ..., x_{d-1})$

Naive Bayes assumption
- features are independent, conditioned on the label
- $p(x_d|y,x_{d'})=p(x_d|y), \forall d\ne d'$
- p(x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>D</sub>, y) = p(y) * p(x<sub>1</sub>|y) * p(x<sub>2</sub>|y,x<sub>1</sub>) * ... * p(x<sub>D</sub>|y,x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>D-1</sub>) = p(y) * $\prod_{d=1}^{D} p(x_d|y)$
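
The factorization above is all a Naive Bayes classifier needs: estimate p(y) and each p(x<sub>d</sub>|y) separately, then pick the label with the largest product. The following is a minimal from-scratch sketch, assuming each p(x<sub>d</sub>|y) is Gaussian. It is not scikit-learn's implementation; the function names and the tiny training set are made up for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate p(y) and a per-class mean and variance for every feature."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)

    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(model, x):
    """Return argmax over y of log p(y) + sum_d log p(x_d | y)."""
    def log_joint(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(xd, m, v) for xd, m, v in zip(x, means, variances))
    return max(model, key=log_joint)


# Tiny made-up training set: two features, two classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.7], [2.9, 0.4]]
y = ["a", "a", "a", "b", "b", "b"]

model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 2.0]))  # a
print(predict(model, [3.1, 0.6]))  # b
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when the number of features D is large.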