
# **Naive Bayes Classifier** 

A Naive Bayes classifier is a supervised machine learning algorithm that uses Bayes’ Theorem to make predictions and classifications:

$P(A \mid B) = \Large \frac{P(B \mid A) \cdot P(A)}{P(B)}$

This equation finds the probability of A given B. It can be turned into a classifier if we replace B with a data point and A with a class. For example, let’s say we’re trying to classify an email as either spam or not spam. We could calculate P(spam | email) and P(not spam | email). Whichever probability is higher will be the classifier’s prediction. Naive Bayes classifiers are often used for text classification.
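Substituting the spam example into Bayes’ Theorem, with A as the class and B as the email, gives one expression per class:

$P(\text{spam | email}) = \Large \frac{P(\text{email | spam}) \cdot P(\text{spam})}{P(\text{email})}$

$P(\text{not spam | email}) = \Large \frac{P(\text{email | not spam}) \cdot P(\text{not spam})}{P(\text{email})}$

Both expressions share the denominator P(email), so comparing the two numerators is enough to decide which class is more probable.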

So why is this a supervised machine learning algorithm? In order to compute the probabilities used in Bayes’ theorem, we need previous data points. For example, in the spam example, we’ll need to compute P(spam). This can be found by looking at a tagged dataset of emails and finding the fraction of all emails that are tagged as spam.
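A minimal sketch of that counting step, assuming a small made-up list of tags (the `labels` list and its values are hypothetical and only stand in for a real tagged dataset):

```python
from collections import Counter

# Hypothetical tagged dataset: one label per email (illustrative values only).
labels = ["spam", "not spam", "spam", "not spam", "not spam"]

label_counts = Counter(labels)
total = len(labels)

# P(class) is estimated as the fraction of tagged emails carrying that label.
priors = {label: count / total for label, count in label_counts.items()}
print(priors)
```

With these five labels, two are spam, so the estimate of P(spam) is 2/5 and the estimate of P(not spam) is 3/5.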


## **Smoothing**

But what happens if “crib” never appeared in any of the positive reviews in our dataset? The fraction P("crib" | positive) would then be 0, and since the per-word probabilities are all multiplied together, the entire probability P(review | positive) would become 0.

This is especially problematic if there are typos in the review we are trying to classify. If the unclassified review has a typo in it, it is very unlikely that that same exact typo will be in the dataset, and the entire probability will be 0. To solve this problem, we will use a technique called smoothing.
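The effect of a single unseen word can be sketched as follows. The word counts are made up for illustration, and `unsmoothed_likelihood` is a hypothetical helper, not part of any library:

```python
from collections import Counter
from math import prod

# Hypothetical word counts from positive reviews (illustrative only).
positive_counts = Counter({"great": 30, "crib": 12, "love": 18})
total_positive_words = sum(positive_counts.values())

def unsmoothed_likelihood(review):
    """Multiply the unsmoothed P(word | positive) for every word in the review."""
    return prod(
        positive_counts[word] / total_positive_words
        for word in review.lower().split()
    )

print(unsmoothed_likelihood("great crib"))  # both words were seen: non-zero
print(unsmoothed_likelihood("graet crib"))  # the typo was never seen: 0.0
```

A `Counter` returns 0 for a missing key, so the misspelled word contributes a factor of 0 and wipes out the contribution of every other word.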

In this case, we smooth by adding 1 to the numerator of each probability and N to the denominator of each probability. N is the number of unique words in our review dataset. For example, P("crib" | positive) goes from this:

$P(\text{"crib" | positive}) = \Large \frac{ \text{\# of "crib" in positive}} {\text{\# of words in positive}}$

To this:

$P(\text{"crib" | positive}) = \Large \frac{\text{\# of "crib" in positive} + 1}{\text{\# of words in positive} + N}$
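The smoothed formula above can be sketched in a few lines. The word list and vocabulary here are hypothetical, and `smoothed_probability` is an illustrative helper rather than a library function:

```python
from collections import Counter

# Hypothetical training data: the words of the positive reviews, plus the
# set of unique words across the whole review dataset (illustrative only).
positive_words = "great crib love this crib".split()
vocabulary = {"great", "crib", "love", "this", "terrible", "broke"}

positive_counts = Counter(positive_words)
total_positive = len(positive_words)  # number of words in positive reviews
N = len(vocabulary)                   # number of unique words in the dataset

def smoothed_probability(word):
    """Add-one estimate of P(word | positive)."""
    return (positive_counts[word] + 1) / (total_positive + N)

print(smoothed_probability("crib"))   # (2 + 1) / (5 + 6)
print(smoothed_probability("cribb"))  # (0 + 1) / (5 + 6): small but non-zero
```

Because every numerator is at least 1, an unseen word lowers the product without forcing it to 0.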

## **Model Fit**

### **sklearn.feature_extraction.text.CountVectorizer**

In order to use scikit-learn’s Naive Bayes classifier, we need to first transform our data into a format that scikit-learn can use. To do so, we’re going to use scikit-learn’s `CountVectorizer` object. To begin, we need to create a CountVectorizer and teach it the vocabulary of the training set. This is done by calling the `.fit()` method.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

vectorizer.fit(["Training review one", "Second review"])
```

After fitting the vectorizer, we can now call its `.transform()` method. The `.transform()` method takes a list of strings and will transform those strings into counts of the trained words. Take a look at the code below.

```python
counts = vectorizer.transform(["one review two review"])
```

`counts` now stores a sparse matrix; calling `counts.toarray()` gives the array `[[1 2 0 0]]`. The word `"review"` appeared twice, the word `"one"` appeared once, and neither `"training"` nor `"second"` appeared at all (`CountVectorizer` lowercases text by default). But how did we know that the 2 corresponded to `"review"`? You can print `vectorizer.vocabulary_` to see the index that each word corresponds to. It might look something like this:

```python
print(vectorizer.vocabulary_)
# {'training': 3, 'review': 1, 'one': 0, 'second': 2}
```

Finally, notice that even though the word `"two"` was in our new review, there wasn’t an index for it in the vocabulary. This is because `"two"` wasn’t in any of the strings used in the `.fit()` method.

We can now use `counts` as input to our Naive Bayes Classifier.
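To make the fit and transform steps concrete, here is a rough pure-Python sketch of the idea. It is not scikit-learn’s implementation: `fit_vocabulary` and `transform` are hypothetical helpers, and the token pattern (lowercased runs of two or more word characters) is an assumption chosen to mimic the default behavior.

```python
import re

TOKEN_PATTERN = r"\b\w\w+\b"  # assumed: tokens of two or more word characters

def fit_vocabulary(documents):
    """Map each lowercased token seen in the documents to a column index."""
    tokens = {
        token
        for doc in documents
        for token in re.findall(TOKEN_PATTERN, doc.lower())
    }
    return {token: index for index, token in enumerate(sorted(tokens))}

def transform(documents, vocabulary):
    """Count known tokens per document; tokens outside the vocabulary are skipped."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for token in re.findall(TOKEN_PATTERN, doc.lower()):
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows

vocabulary = fit_vocabulary(["Training review one", "Second review"])
print(vocabulary)
print(transform(["one review two review"], vocabulary))
```

The word `"two"` is silently dropped in `transform` because it was never assigned an index during fitting, which mirrors the behavior described above.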

### **sklearn.naive_bayes.MultinomialNB**

Now that we’ve formatted our data correctly, we can pass it to scikit-learn’s `MultinomialNB` classifier.

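```python
# `reviews` is a local helper module (not part of scikit-learn). `counter` is
# assumed to be an already-fitted CountVectorizer and `training_counts` the
# transformed training reviews.
from reviews import counter, training_counts
from sklearn.naive_bayes import MultinomialNB

# Convert the new review into word counts using the fitted vectorizer
review = "This crib was great amazing and wonderful"
review_counts = counter.transform([review])

classifier = MultinomialNB()

# One label per training review: 1000 reviews of class 0, then 1000 of class 1
training_labels = [0] * 1000 + [1] * 1000

classifier.fit(training_counts, training_labels)

# Predicted class for the review
print(classifier.predict(review_counts))
# Probability of each class, in the order given by classifier.classes_
print(classifier.predict_proba(review_counts))
```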
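The snippet above depends on the `reviews` helper module. As a self-contained sketch of the same workflow, the example below swaps in a tiny made-up dataset; the review strings and the meaning of the labels (0 for negative, 1 for positive) are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up dataset (illustrative only): 0 = negative, 1 = positive.
training_reviews = [
    "terrible crib broke after a week",
    "awful quality would not recommend",
    "great crib my baby loves it",
    "amazing and wonderful product",
]
training_labels = [0, 0, 1, 1]

# Learn the vocabulary and convert the training reviews to counts in one step
counter = CountVectorizer()
training_counts = counter.fit_transform(training_reviews)

# alpha defaults to 1.0, which corresponds to add-one smoothing
classifier = MultinomialNB()
classifier.fit(training_counts, training_labels)

review_counts = counter.transform(["This crib was great amazing and wonderful"])
prediction = classifier.predict(review_counts)
probabilities = classifier.predict_proba(review_counts)
print(prediction)
print(probabilities)
```

With this toy data the review shares more words with the positive examples, so the classifier should predict class 1. The columns of `probabilities` follow the order of `classifier.classes_`.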