# 13.1 Lesson: Naive Bayes

### Naive Bayes
The Naive Bayes algorithm is a generative model: It assumes that we can describe how some data was generated. Then it calculates the likelihood of the data under each of two hypotheses: 
- $X$ is true, 
- $X$ is false. 

The relative likelihoods of the data under these two hypotheses, combined with the prior probability that $X$ is true or false, give us the probability that $X$ is true or false in our particular case. Naive Bayes is an effective algorithm for classifying data, especially in text-based applications like detecting spam (unwanted messages, such as unsolicited advertisements) and analyzing sentiment. A common application of Naive Bayes is to detect whether an email is spam or not, given its subject line. 

For an applied example: Suppose that $X$ is true with probability 0.3 and false with probability 0.7. Maybe $X$ is true if the moon is visible and false if it’s invisible. Suppose that someone says: 

“The cow jumped over the moon.” 

Imagine that we can calculate that if the moon is visible ($X$ is true), then a person utters this sentence with probability 0.04%. And if the moon is invisible ($X$ is false), then they utter this sentence with probability 0.01%. What is the probability that the moon is visible, given that the sentence was uttered? 

- The joint probability that the moon is visible and the sentence was uttered is $0.3 \times 0.0004 = 0.00012$. 
- The joint probability that the moon is invisible and the sentence was uttered is $0.7 \times 0.0001 = 0.00007$. 


Then, the probability that the moon is visible comes from the relative probability of these two events. 

So, the moon is visible with probability $\frac{0.00012}{0.00012 + 0.00007} = \frac{12}{19} \approx 63\%$.
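The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses only the numbers from the moon example; the variable names are chosen for illustration.

```python
# Priors and likelihoods taken from the moon example in the text.
prior_visible = 0.3
prior_invisible = 0.7
likelihood_visible = 0.0004    # P(sentence | moon visible) = 0.04%
likelihood_invisible = 0.0001  # P(sentence | moon invisible) = 0.01%

# Joint probability of each hypothesis together with the sentence.
joint_visible = prior_visible * likelihood_visible
joint_invisible = prior_invisible * likelihood_invisible

# Normalize so that the two posteriors sum to 1.
posterior_visible = joint_visible / (joint_visible + joint_invisible)
print(f"P(visible | sentence) = {posterior_visible:.4f}")  # 0.6316
```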

Now, here’s the problem: How can we ever hope to calculate the probability of the sentence “the cow jumped over the moon” under the two conditions? How likely is it that our data contains even one utterance of this exact sentence, let alone the many instances we’d need for a reliable estimate? The answer is that we estimate the probability of each word separately and simply multiply those probabilities together. Does that sound naive, like it’s a huge guess we’re making? Yes! That’s why it’s called Naive Bayes.

So, if the probability of uttering the word “cow” when the moon is visible is 10% and “jumped” is 20% (which we can ascertain from our dataset), then the naive estimate of the probability of uttering “cow jumped” is $10\% \times 20\% = 2\%$ or $0.02$. 


If, on the other hand, when the moon is invisible, the probability of uttering the word “cow” is 20% and “jumped” is 30%, then the probability of uttering “cow jumped” is $20\% \times 30\% = 6\%$ or $0.06$. We can go on this way to compute the likelihood of the full sentence under each condition.
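The word-by-word multiplication can be sketched in Python as follows. The per-word probabilities are the ones given in the text; the `word_probs` dictionary and the `naive_likelihood` helper are names made up for this illustration.

```python
from math import prod

# Per-word probabilities from the text; in practice these would be
# estimated from word counts in a labeled dataset.
word_probs = {
    "visible": {"cow": 0.10, "jumped": 0.20},
    "invisible": {"cow": 0.20, "jumped": 0.30},
}

def naive_likelihood(words, condition):
    """Multiply per-word probabilities as if the words were independent."""
    return prod(word_probs[condition][w] for w in words)

phrase = ["cow", "jumped"]
for condition in word_probs:
    # Prints 0.02 for "visible" and 0.06 for "invisible".
    print(condition, round(naive_likelihood(phrase, condition), 4))
```

Extending the phrase to the full sentence would only require adding an entry for each remaining word under both conditions.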

This naive approach is unlikely to be valid all the time. For example, consider the words “ice” and “cream” and (on the other hand) “rice” and “cream.” The phrase “ice cream” will be quite common — more common than the words’ individual probabilities would suggest. But the phrase “rice cream” will be relatively uncommon. That’s why multiplying the probabilities, as if the words are independent from each other, is “naive.” 
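To see the independence assumption fail on a small scale, the sketch below compares the naive product of word probabilities with the observed frequency of each two-word phrase. The corpus is a handful of sentences invented for this illustration, not real data, and the helper names are hypothetical.

```python
# Toy corpus invented for illustration; the counts are not real data.
corpus = (
    "i like ice cream . "
    "ice cream is cold . "
    "rice is warm . "
    "cream goes in coffee ."
).split()

total = len(corpus)
bigrams = list(zip(corpus, corpus[1:]))  # all adjacent word pairs

def p_word(w):
    """Fraction of tokens in the corpus equal to w."""
    return corpus.count(w) / total

def p_bigram(a, b):
    """Fraction of adjacent pairs in the corpus equal to (a, b)."""
    return bigrams.count((a, b)) / len(bigrams)

for a, b in [("ice", "cream"), ("rice", "cream")]:
    naive = p_word(a) * p_word(b)   # what the independence assumption predicts
    observed = p_bigram(a, b)       # what the corpus actually contains
    print(f"{a} {b}: naive={naive:.4f} observed={observed:.4f}")
```

In this toy corpus, “ice cream” occurs more often than the naive product predicts, while “rice cream” never occurs at all even though the naive product gives it a nonzero probability.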

We then use Bayes’ rule to compare the probabilities of two joint events:
1. The moon is visible, AND this sentence was spoken. 
2. The moon is invisible, AND this sentence was spoken. 


Under the naive assumption, these are the same as the probabilities that: 
1. The moon is visible, AND the word “cow” was spoken, AND the word “jumped” was spoken, etc. 
2. The moon is invisible, AND the word “cow” was spoken, AND the word “jumped” was spoken, etc. 

Writing $P_1$ and $P_2$ for the probabilities of events 1 and 2 above, the ratio $\frac{P_1}{P_1 + P_2}$ gives the probability that the moon is visible, given that the sentence was spoken. 

The ratio $\frac{P_2}{P_1 + P_2}$ gives the probability that the moon is invisible, given that the sentence was spoken. 
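Putting the pieces together, here is a minimal sketch of the whole calculation. It makes two assumptions for illustration: the sentence is shortened to the two words “cow jumped,” for which the text supplies per-word probabilities, and the priors 0.3 and 0.7 from the moon example are reused. The sketch sums logarithms instead of multiplying raw probabilities, a common precaution against numerical underflow on long sentences; the `posterior` function is a hypothetical name.

```python
from math import exp, log

# Priors from the moon example; per-word probabilities from the text.
priors = {"visible": 0.3, "invisible": 0.7}
word_probs = {
    "visible": {"cow": 0.10, "jumped": 0.20},
    "invisible": {"cow": 0.20, "jumped": 0.30},
}

def posterior(words):
    """Return P(condition | words) for each condition, naively."""
    # log P(condition) + sum of log P(word | condition)
    log_joint = {
        c: log(priors[c]) + sum(log(word_probs[c][w]) for w in words)
        for c in priors
    }
    # Subtract the largest log value before exponentiating for stability.
    largest = max(log_joint.values())
    unnormalized = {c: exp(v - largest) for c, v in log_joint.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}

result = posterior(["cow", "jumped"])
print({c: round(p, 3) for c, p in result.items()})
```

Here $P_1 = 0.3 \times 0.02 = 0.006$ and $P_2 = 0.7 \times 0.06 = 0.042$, so the probability that the moon is visible comes out to $0.006 / 0.048 = 0.125$.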

This is an example of Bernoulli Naive Bayes: each feature (such as the word “cow”) is either present or not, and the per-feature probabilities are multiplied together. (A full Bernoulli model also multiplies in the probability of absence, $1 - p$, for each vocabulary word that does not appear.) This model is useful when dealing with binary/boolean features, like the presence or absence of words in text classification tasks. In Multinomial Naive Bayes, each feature is a count (how many instances of the word “cow”?). This variant is effective in natural language processing tasks where word frequency matters, such as sentiment analysis. Finally, in Gaussian Naive Bayes, each feature is a continuous number. Perhaps the “sentence” is a sequence of test scores drawn from a Gaussian distribution, and the likelihood of the sequence equals the product of the probability densities of the individual scores. Gaussian Naive Bayes is useful for problems involving continuous data. 
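The three variants differ only in how the likelihood of a single class is computed. The sketch below writes each one out in plain Python under simplifying assumptions: the function names, the two-word vocabulary, and all numeric values are hypothetical, and the multinomial version omits the multinomial coefficient, which is the same for every class and so cancels when classes are compared.

```python
from math import exp, pi, sqrt

def bernoulli_likelihood(present, p):
    """present: {word: bool}; p: {word: P(word appears | class)}."""
    result = 1.0
    for word, prob in p.items():
        # Present words contribute p; absent words contribute 1 - p.
        result *= prob if present.get(word, False) else (1.0 - prob)
    return result

def multinomial_likelihood(counts, p):
    """counts: {word: int}; p: {word: P(word | class)}."""
    result = 1.0
    for word, n in counts.items():
        result *= p[word] ** n  # each occurrence contributes one factor
    return result

def gaussian_likelihood(values, mean, std):
    """Product of normal densities, one per continuous feature value."""
    result = 1.0
    for x in values:
        result *= exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * sqrt(2 * pi))
    return result

p = {"cow": 0.10, "jumped": 0.20}  # hypothetical per-word probabilities
print(bernoulli_likelihood({"cow": True, "jumped": False}, p))
print(multinomial_likelihood({"cow": 2, "jumped": 1}, p))
print(gaussian_likelihood([72.0, 65.0], mean=70.0, std=10.0))
```

In each case the class likelihood would then be multiplied by the class prior and normalized across classes, exactly as in the moon example.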

### Think About It
- Why might a model based on simple, independent features still perform well in complex real-world situations? 
- How does the example of calculating the probability that the moon is visible — based on someone saying a specific sentence — illustrate how Naive Bayes updates prior beliefs using new evidence? 
- Based on the example of “ice cream” versus “rice cream,” what kinds of patterns in data might cause the Naive Bayes assumption of feature independence to break down? 