# Naive Bayes

Bayes' theorem is the premier method for understanding the probability of some event A given some new information B, written P(A|B). It combines the likelihood of that information if the event is true, P(B|A), with a prior belief in the probability of the event, P(A).

P(A|B) = [ P(B|A) * P(A) ]  /  P(B)
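
As a quick numeric illustration of the formula above, the sketch below plugs in made-up values and computes the posterior. The 1% prior, the 90% likelihood and the 5% false-positive rate are assumptions chosen for the example, not figures from this notebook.

```python
# Hypothetical numbers chosen only to illustrate Bayes' theorem.
p_a = 0.01              # P(A): prior belief that event A is true
p_b_given_a = 0.90      # P(B|A): probability of seeing evidence B when A is true
p_b_given_not_a = 0.05  # P(B|not A): probability of seeing B when A is false

# P(B): marginal probability of the evidence (law of total probability)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(f"P(B)   = {p_b:.4f}")
print(f"P(A|B) = {p_a_given_b:.4f}")
```

Even with strong evidence, the posterior stays modest here because the prior is small.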

Qualities of Naive Bayes Classifiers:

- Intuitive approach
- The ability to work with small data
- Low computation cost for training and prediction
- Often solid results in a variety of settings



---

**FORMULA : P(y|x1, ..., xj) = [ P(x1, ..., xj | y) * P(y) ] / P(x1, ..., xj)**

P(y|x1, ..., xj) : POSTERIOR. Probability that an observation is class y given the observation's values for the j features x1, ..., xj

P(x1, ..., xj | y) : LIKELIHOOD. Likelihood of an observation's values for features x1, ..., xj given its class y

P(y) : PRIOR. Our belief for the probability of class y before looking at the data

P(x1, ..., xj) : MARGINAL PROBABILITY. Probability of the observation's feature values across all classes




In naive Bayes we compare an observation's posterior values for each possible class.

The marginal probability is constant across these comparisons, so we only need to calculate the numerator of the posterior for each class.

For each observation, the class with the greatest posterior numerator becomes the predicted class ŷ.
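
The comparison described above can be sketched with made-up numbers. The likelihoods and priors below are hypothetical values for a single observation and three classes; the point is only that dividing by the shared marginal probability does not change which class wins.

```python
# Hypothetical likelihoods P(x1, ..., xj | y) and priors P(y) for one observation.
likelihoods = {"class_0": 0.020, "class_1": 0.050, "class_2": 0.001}
priors = {"class_0": 0.5, "class_1": 0.3, "class_2": 0.2}

# Numerator of the posterior for each class: likelihood * prior
numerators = {c: likelihoods[c] * priors[c] for c in likelihoods}

# The marginal probability is the same for every class, so it only rescales.
marginal = sum(numerators.values())
posteriors = {c: numerators[c] / marginal for c in numerators}

# The predicted class is the one with the greatest numerator (or posterior).
predicted_from_numerators = max(numerators, key=numerators.get)
predicted_from_posteriors = max(posteriors, key=posteriors.get)

print(numerators)
print(predicted_from_numerators, predicted_from_posteriors)
```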

---

1. In naive Bayes, for each feature in the data we have to assume the statistical distribution of the likelihood P(xj|y).

The common distributions are: Normal (Gaussian), Multinomial and Bernoulli.

The distribution chosen is often determined by the nature of the features (continuous, binary, etc.).

2. Naive Bayes gets its name because we assume that each feature and its resulting likelihood is independent.

This "naive" assumption is frequently wrong, yet in practice it does little to prevent building high-quality classifiers.

# Training a Classifier for Continuous features

You have only continuous features and you want to train a Gaussian naive Bayes classifier.

It assumes that the likelihood of the feature values x, given that an observation is of class y, follows a normal distribution: x | y ~ N(μ_y, σ_y²), where μ_y and σ_y² are the mean and variance of the feature for class y.
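
To make the assumption concrete, here is a minimal sketch of the Gaussian likelihood computed by hand on a tiny made-up dataset. The data, the helper names `gaussian_log_pdf`, `log_numerator` and `predict`, and the use of plain population variance are assumptions for illustration; this is not scikit-learn's implementation, which also adds a small variance smoothing term.

```python
import math
from statistics import fmean, pvariance

# Hypothetical training data: two continuous features, two classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.2], [3.2, 3.8], [2.8, 4.0]]
y = [0, 0, 0, 1, 1, 1]

def gaussian_log_pdf(x, mean, var):
    """Log density of x under a Normal(mean, var) distribution."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Estimate the prior and the per-feature mean and variance for each class.
params = {}
for c in sorted(set(y)):
    rows = [row for row, label in zip(X, y) if label == c]
    columns = list(zip(*rows))
    params[c] = {
        "prior": len(rows) / len(X),
        "means": [fmean(col) for col in columns],
        "vars": [pvariance(col) for col in columns],
    }

def log_numerator(observation, c):
    """log P(y) plus the sum of log P(x_i | y), using the naive independence assumption."""
    p = params[c]
    return math.log(p["prior"]) + sum(
        gaussian_log_pdf(x, m, v)
        for x, m, v in zip(observation, p["means"], p["vars"])
    )

def predict(observation):
    """Return the class with the greatest (log) posterior numerator."""
    return max(params, key=lambda c: log_numerator(observation, c))

print(predict([1.1, 2.0]))
print(predict([2.9, 4.1]))
```

Working in log space turns the product of per-feature likelihoods into a sum, which avoids numerical underflow when there are many features.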


In [1]:
# Load libraries

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

In [2]:
# Load data

iris = load_iris()

features = iris.data
target = iris.target

In [3]:
# Create gaussian naive Bayes Object

classifier = GaussianNB()

In [4]:
# Train model

model = classifier.fit(features, target)

In [5]:
# Create new observation 

observation = [[4, 4, 4, 0.4]]

In [6]:
# Predict observation class

model.predict(observation)

array([1])

In [7]:
# Create Gaussian naive Bayes object with prior probabilities of each class

clf = GaussianNB(priors = [0.25, 0.25, 0.5])

In [8]:
# Train model using the classifier with the specified priors

model = clf.fit(features, target)

The raw predicted probabilities from GaussianNB (returned by predict_proba) are not well calibrated. If we want them to be interpretable, we need to calibrate them using isotonic regression or a related method.

# Training a Classifier for Discrete and Count Features

Given discrete or count data, you need to train a naive Bayes classifier.

### Multinomial Naive Bayes Classifier

It works similarly to Gaussian naive Bayes, but the features are assumed to be multinomially distributed.

In practice, this means the classifier is commonly used when we have discrete data, such as text represented with bag-of-words or tf-idf approaches.

We create a toy text dataset of three observations and convert the text strings to a bag-of-words feature matrix with an accompanying target vector.

We then use MultinomialNB to train a model while defining the prior probabilities for the two classes (pro Granada vs. pro Zanzibar).

If class_prior is not specified, the prior probabilities are learned from the data. However, if we want a uniform distribution to be used as the prior, we can set fit_prior = False.

alpha is the additive smoothing parameter and should be tuned. alpha = 0 means no smoothing; the default is 1.0.
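
As a rough illustration of what alpha does, the sketch below applies additive smoothing by hand to the word counts of one class. The counts are the summed bag-of-words rows of the two class-0 (Granada) documents from the toy dataset built in the cells below; the helper name `smoothed_probabilities` is hypothetical.

```python
# Vocabulary and summed word counts for class 0 (the two Granada documents)
vocabulary = ["also", "awesome", "best", "granada", "great", "is", "love", "the", "zanzibar"]
class_0_counts = [0, 1, 1, 3, 0, 2, 1, 1, 0]

def smoothed_probabilities(counts, alpha):
    """Estimate P(word | class) as (count + alpha) / (total + alpha * n_words)."""
    total = sum(counts)
    n_words = len(counts)
    return [(count + alpha) / (total + alpha * n_words) for count in counts]

no_smoothing = smoothed_probabilities(class_0_counts, alpha=0.0)
laplace = smoothed_probabilities(class_0_counts, alpha=1.0)

for word, p0, p1 in zip(vocabulary, no_smoothing, laplace):
    print(f"{word:10s} alpha=0: {p0:.3f}   alpha=1: {p1:.3f}")
```

With alpha = 0, words never seen in the class (such as "zanzibar") get probability 0, which zeroes out the whole likelihood product for any document containing them. Any alpha > 0 avoids this.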

In [9]:
# Load libraries

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
# Create text

text_data = np.array([ " I love Granada, Granada is awesome!", "Granada is the best", "Zanzibar is also great"])

In [11]:
# Create bag-of-words

count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)


In [12]:
# Create feature matrix

features = bag_of_words.toarray()
print(count.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(features)

['also', 'awesome', 'best', 'granada', 'great', 'is', 'love', 'the', 'zanzibar']
[[0 1 0 2 0 1 1 0 0]
 [0 0 1 1 0 1 0 1 0]
 [1 0 0 0 1 1 0 0 1]]


In [13]:
# Create target vector 

target = np.array([0,0,1])

In [14]:
# Create multinomial naive Bayes object with prior probabilities of each class

classifier = MultinomialNB(class_prior = [0.25, 0.5]) 

In [15]:
# Train Model

model = classifier.fit(features,target)

In [16]:
# Create new observation

new_observation = [[0, 0, 0, 1, 0, 1, 0, 1,0]]

In [17]:
# Predict new observation's class

model.predict(new_observation)

array([0])

# Training a Naive Bayes Classifier for Binary Features

The Bernoulli naive Bayes classifier assumes that all features are binary, meaning they take only two values.

BernoulliNB:

It is often used in text classification, when our feature matrix is simply the presence or absence of a word in a document.

class_prior : a list of prior probabilities for each class. If it is not specified, the priors are learned from the data; setting fit_prior = False uses a uniform prior instead.
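
A minimal sketch of the idea, in plain Python rather than scikit-learn: it turns the word counts of the toy Granada/Zanzibar documents into presence/absence features and evaluates the Bernoulli likelihood by hand. The helper names and the alpha = 1 smoothing are assumptions made for the illustration, and class priors are left out to keep it short.

```python
import math

# Bag-of-words counts from the toy text example (rows = documents)
counts = [
    [0, 1, 0, 2, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 1, 0, 0, 1],
]
target = [0, 0, 1]

# Presence/absence: 1 if the word occurs in the document, else 0
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

def feature_probabilities(rows, alpha=1.0):
    """Smoothed estimate of P(word present | class) for each word."""
    n_docs = len(rows)
    return [(sum(col) + alpha) / (n_docs + 2 * alpha) for col in zip(*rows)]

def bernoulli_log_likelihood(x, probs):
    """Sum of log P(x_i | y); absent words contribute log(1 - p)."""
    return sum(math.log(p if xi == 1 else 1 - p) for xi, p in zip(x, probs))

# Per-class word-presence probabilities (priors are ignored in this sketch)
probs = {
    c: feature_probabilities([row for row, t in zip(binary, target) if t == c])
    for c in set(target)
}

new_document = [0, 0, 0, 1, 0, 1, 0, 1, 0]  # contains "granada", "is", "the"
scores = {c: bernoulli_log_likelihood(new_document, probs[c]) for c in probs}

print(binary[0])
print(max(scores, key=scores.get))
```

Unlike the multinomial model, the Bernoulli likelihood also penalizes a class for words that are absent from the document but common in that class.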

In [18]:
# Load libraries 

import numpy as np
from sklearn.naive_bayes import BernoulliNB

In [19]:
# Create three binary features 

features = np.random.randint(2, size = (100,3)) # values are 0 or 1, 100 rows, 3 features
features[2]

array([0, 0, 1])

In [20]:
# Create binary target vector 

target = np.random.randint(2, size = (100,1)).ravel()
target

array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1])

In [21]:
# Create Bernoulli NB object with prior probabilities of each  class 

classifier = BernoulliNB(class_prior = [0.25, 0.5])

In [22]:
# Train model

model = classifier.fit(features, target)
model

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=[0.25, 0.5], fit_prior=True)

# Calibrating Predicted Probabilities

Calibrate the predicted probabilities from a naive Bayes classifier so that they are interpretable.

In [32]:
# Load Libraries

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV

In [33]:
# Load data

iris = load_iris()
features=iris.data
target = iris.target

In [34]:
# Create Gaussian Naive Bayes object

classifier_gaussian = GaussianNB()
classifier_gaussian

GaussianNB(priors=None, var_smoothing=1e-09)

In [35]:
# Create calibrated cross validation with sigmoid calibration

classifier_sigmoid = CalibratedClassifierCV(classifier_gaussian, cv= 2, method = "sigmoid")

# method can be "sigmoid" or "isotonic"; isotonic regression is nonparametric and tends to overfit when the sample size is very small (e.g. 100 observations)

In [36]:
# Create model

model_sigmoid = classifier_sigmoid.fit(features, target)
model_sigmoid

CalibratedClassifierCV(base_estimator=GaussianNB(priors=None,
                                                 var_smoothing=1e-09),
                       cv=2, method='sigmoid')

In [37]:
# Create new observation 

observation_new = [[2.6, 2.6, 2.6, 0.4]]

In [38]:
# View calibrated probabilities

model_sigmoid.predict_proba(observation_new)

array([[0.31859969, 0.63663466, 0.04476565]])

predict_proba is especially useful when we only want to predict a certain class if the model estimates the probability of that class to be high, for example over 90%.
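
One way to act on this idea is sketched below: predict a class only when its probability clears a threshold, and abstain otherwise. The 0.9 threshold, the helper name `predict_with_threshold` and the second probability row are assumptions for illustration; the first row is the calibrated output shown above.

```python
def predict_with_threshold(probabilities, threshold=0.9):
    """Return the most probable class index, or None if it is below the threshold."""
    best_class = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best_class if probabilities[best_class] >= threshold else None

calibrated_row = [0.31859969, 0.63663466, 0.04476565]  # output of predict_proba above
confident_row = [0.02, 0.95, 0.03]                      # hypothetical confident case

print(predict_with_threshold(calibrated_row))  # None: 0.64 is below 0.9
print(predict_with_threshold(confident_row))   # 1
```

A rule like this only makes sense when the probabilities are trustworthy, which is why calibration matters.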

Naive Bayes often outputs probabilities that do not reflect the real world.

While the ranking of predicted probabilities for the different target classes is valid, the raw predicted probabilities tend to take on extreme values close to 0 or 1.

To obtain meaningful predicted probabilities, we need to conduct what is called calibration.
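
To give a feel for what sigmoid (Platt-style) calibration does, the sketch below passes raw scores through a logistic mapping of the form 1 / (1 + exp(a * score + b)). The parameters `a` and `b` are hypothetical values picked for the example; in practice they are fitted on held-out data, so treat this as an illustration of the shape of the mapping rather than of any library's fitting procedure.

```python
import math

def sigmoid_calibrate(score, a, b):
    """Platt-style mapping from a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(a * score + b))

# Hypothetical parameters; in practice a and b are fitted on held-out data.
a, b = -2.0, 1.0

raw_scores = [0.001, 0.5, 0.999]  # raw, overconfident scores for one class
calibrated = [sigmoid_calibrate(s, a, b) for s in raw_scores]

for s, p in zip(raw_scores, calibrated):
    print(f"raw={s:.3f} -> calibrated={p:.3f}")
```

The mapping preserves the ordering of the scores while pulling the extreme values away from 0 and 1.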

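### CalibratedClassifierCV

CalibratedClassifierCV uses cross-validation: in each fold, the training set is used to fit the model and the test set is used to calibrate the predicted probabilities.

We can see the difference between the raw and the well-calibrated predicted probabilities.

Using a GaussianNB classifier alone, we get very extreme probability estimates: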

In [39]:
# Train a GaussianNB, then predict probabilities

classifier_gaussian.fit(features,target).predict_proba(observation_new)

array([[2.31548432e-04, 9.99768128e-01, 3.23532277e-07]])

In [40]:
# View the calibrated probabilities for comparison

model_sigmoid.predict_proba(observation_new)

array([[0.31859969, 0.63663466, 0.04476565]])