## Lighthouse Labs
### W05D04 Naive Bayes
Instructor: Socorro Dominguez  
October 15, 2020

**Second Half Agenda:**
* Conditional Probability Review
    * Bayes theorem
    
    
* Naive Bayes
    * Multinomial
    * Gaussian

## Review: Conditional probabilities

**Brain teaser 1**: 

1. A couple has two children. The older child is a girl. 
2. A couple has two children. At least one of them is a girl. 

In each case, what's the probability that the other child is also a girl? (The two cases are not the same!)  
Warning: This exercise assumes `Gender` as a binary.

### Definitions

- Conditional probabilities are a way of using information we have about random variables.
- For example, take a fair 6-sided die. Suppose we already know the roll is odd, because someone told us. What are the 6 _conditional_ probabilities of each outcome?
  - They are: ...
- We write conditional probabilities with a vertical bar, `|`. The information we're conditioning on goes after the bar.
  - E.g., $P(X=i \mid \text{X is odd})$ is a **conditional probability**
  - The set of these values forms the **conditional distribution**
    - Conditional distributions are still probability distributions - they must sum up to 1.

- So, what's the pattern here?
  - The conditioning _eliminates some possible outcomes_.
  - For the remaining outcomes, we _renormalized_ the distribution.
  - That is, we took the proportion of the allowed outcomes that satisfy the event description.
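
The eliminate-then-renormalize pattern can be sketched in a few lines of Python. This is only an illustration: the helper name `condition` and the dictionary representation of a distribution are assumptions made for this example, not something defined in the lecture.

```python
from fractions import Fraction

def condition(dist, event):
    """Return the conditional distribution of `dist` given `event`.

    dist:  dict mapping outcome -> probability
    event: function outcome -> bool (the information we condition on)
    """
    # Total probability of the outcomes that are still allowed.
    total = sum(p for x, p in dist.items() if event(x))
    # Eliminated outcomes get probability 0; the rest are renormalized.
    return {x: (p / total if event(x) else Fraction(0)) for x, p in dist.items()}

# Fair 6-sided die: every face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Condition on "the roll is odd".
given_odd = condition(die, lambda face: face % 2 == 1)
for face, p in given_odd.items():
    print(face, p)

# A conditional distribution is still a distribution: it sums to 1.
print(sum(given_odd.values()))
```

Using `Fraction` keeps the results exact, which makes it easy to compare them with a hand calculation.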

**Brain teaser Solution:**

The 4 equally likely possibilities are (with the first one being older): `GB`, `GG`, `BG`, `BB`.


1. Since we're conditioning on the older child being a girl, we eliminate `BG` and `BB`. Thus the only possibilities are `GB` and `GG`, equally likely. So the probability is $1/2$.
2. Here, we're conditioning on the fact that at least one of them is a girl, so we only eliminate `BB`. Thus the remaining possibilities are `GB`, `GG`, `BG`, all equally likely, so the conditional probability of the other being a girl (i.e. 2 girls) is $1/3$.
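
As a quick check, both cases can be verified by brute-force enumeration. The sketch below assumes, as the solution does, that the four (older, younger) combinations are equally likely. The function name `prob_two_girls` is made up for this example.

```python
from fractions import Fraction
from itertools import product

# Sample space: (older, younger), each "G" or "B", all 4 equally likely.
families = list(product("GB", repeat=2))

def prob_two_girls(given):
    """P(both children are girls | given), by counting allowed outcomes."""
    allowed = [f for f in families if given(f)]
    both_girls = [f for f in allowed if f == ("G", "G")]
    return Fraction(len(both_girls), len(allowed))

case_1 = prob_two_girls(lambda f: f[0] == "G")  # the older child is a girl
case_2 = prob_two_girls(lambda f: "G" in f)     # at least one child is a girl

print(case_1, case_2)  # 1/2 1/3
```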

### Conditional probabilities - formalizing things

The key equation with conditional probabilities is

$$P(A\mid B)=\frac{P(A \cap B)}{P(B)}$$

The "renormalizing" trick is a consequence of this. 

Consider, what's the probability of rolling a 6 given that the roll is not 1?

- Let $A$ be the roll is a 6
- Let $B$ be the roll is not a 1

$$P(A\mid B)=\frac{P(A \cap B)}{P(B)}=\frac{P(A)}{P(B)}=\frac{1/6}{5/6}=\frac{1}{5}$$

In this case, we had the simplification that $P(A\cap B)=P(A)$. This is often not the case.
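
The formula can also be applied directly to sets of die faces. The sketch below reproduces the example above and then adds a second pair of events (an even roll, given a roll greater than 3) chosen for this illustration only, to show a case where $P(A \cap B) \neq P(A)$.

```python
from fractions import Fraction

faces = set(range(1, 7))  # fair 6-sided die

def prob(event):
    """Probability of an event (a set of faces) under a fair die."""
    return Fraction(len(event), len(faces))

def conditional(a, b):
    """P(A | B) = P(A and B) / P(B)."""
    return prob(a & b) / prob(b)

# Example from the text: A = roll is a 6, B = roll is not a 1.
A = {6}
B = faces - {1}
print(conditional(A, B))    # 1/5

# Illustrative extra example: A2 = roll is even, B2 = roll is greater than 3.
A2 = {2, 4, 6}
B2 = {4, 5, 6}
print(prob(A2 & B2) == prob(A2))  # False: the simplification does not apply
print(conditional(A2, B2))        # 2/3
```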

### Bayes' Theorem

**Brain teaser:** A heritable disease occurs randomly in 10% of the population. If someone has the disease, it is passed on to their children with probability 50%. A mother has 1 healthy child. Given this, what's the conditional probability that the mother has the disease? 

- Is the answer 10%? Less? More? How do we quantify it?
  - Let $M$ be the event that the mother has the disease.
  - Let $C$ be the event that the child has the disease.
  - We want $P(M\mid \textrm{not } C)$. We have $P(M)=0.1$ and $P(\textrm{not }C\mid M)=0.5$.

Solution:

$$P(M \mid \textrm{not } C) = \frac{P(\textrm{not } C \mid M)P(M)}{P(\textrm{not } C)}$$

So we still need $P(\textrm{not } C)$. This can happen in 2 ways (the mother either has the disease or does not), so we use the law of total probability:

$$P(\textrm{not } C)=P(\textrm{not } C \mid M)P(M) + P(\textrm{not } C \mid \textrm{not } M)P(\textrm{not } M)$$

We know $P(\textrm{not } M)=1-P(M)=0.9$.   
We assume $P( C \mid \textrm{not } M)=0.1$ because the child can randomly get the disease like anyone else,   
so then $P(\textrm{not } C \mid \textrm{not } M)=1-P( C \mid \textrm{not } M)=0.9$. 

Finally, then, we're left with:

$$P(M \mid \textrm{not } C) = \frac{0.5 \times 0.1}{0.5\times 0.1 + 0.9 \times 0.9} = 0.058$$
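
The same arithmetic, step by step, as a small Python sketch. The variable names are made up for this example. The value $P(C \mid \textrm{not } M)=0.1$ is the assumption stated in the solution above.

```python
# Known quantities from the problem statement.
p_m = 0.1               # P(M): mother has the disease
p_notc_given_m = 0.5    # P(not C | M): child is healthy given mother has it

# Assumption from the solution: a child of a healthy mother can still
# get the disease at the population rate.
p_c_given_notm = 0.1    # P(C | not M)

# Complements.
p_notm = 1 - p_m                        # P(not M)
p_notc_given_notm = 1 - p_c_given_notm  # P(not C | not M)

# Law of total probability for the denominator.
p_notc = p_notc_given_m * p_m + p_notc_given_notm * p_notm

# Bayes' theorem.
posterior = p_notc_given_m * p_m / p_notc
print(round(posterior, 3))  # 0.058
```

The posterior is lower than the 10% prior: seeing a healthy child is mild evidence that the mother does not have the disease.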


- The first step of the solution above is an application of **Bayes' Theorem**.
- We've seen above that, for events $A$ and $B$, $P(A,B)=P(A\mid B)P(B)$, where $P(A,B)$ is another way of writing $P(A \cap B)$. 
- We can also write this as $P(A,B)=P(B\mid A)P(A)$. 
- Since these are equal, we get the famous Bayes' theorem:

$$P(A\mid B)=\frac{P(B\mid A)P(A)}{P(B)}$$

If curious, you should also review:
- Law of Total Probability  $P(X=x)=\sum_y P(X=x\mid Y=y)P(Y=y)$
- Conditional Expectations

## Naive Bayes Algorithm

Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. 

A Naive Bayes classifier assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature. In other words, the features are conditionally independent given the class.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other, the model treats each of them as contributing independently to the probability that the fruit is an apple. That is why it is known as ‘Naive’.

A Naive Bayes model is easy to build and scales well to very large data sets. Despite its simplicity, Naive Bayes is often competitive with more sophisticated classification methods, and it works well with text data.

**Multinomial Naive Bayes**

You are working for Netflix, and you receive both `positive` and `negative` reviews for movies.

You want to pick out the `negative` reviews so that you can move those movies off the platform (and keep people binge-watching).

What would you do?

Do you remember that for Logistic Regression we used `CountVectorizer` from scikit-learn? `CountVectorizer` counts how many times each word appears in each review, so we can total the word counts over the positive (and the negative) reviews.

In Naive Bayes, we calculate the probability of seeing each word GIVEN that it is in a `Positive` review, and likewise for `Negative` reviews.

Let's say you have the phrase "Awesome movie", and let's focus on a total of 5 positive reviews and 5 negative reviews.

The 5 positive reviews contain 17 words in total. The 5 negative reviews contain 18 words in total.

We calculate probabilities for each word. 

(I am not writing the exact reviews, so use your imagination for that)

For example,
assume the words in the positive reviews are:

| Positive Reviews | Count | $P(\text{word} \mid \text{Pos})$ |
| ----------- | -------- | ---------|
| awesome | 4 | 0.24 |
| movie | 2 | 0.12 |
| popcorn | 3 | 0.18 |
| exciting | 4 | 0.24 |
| terrific | 4 | 0.24 |
| trash | 0 | 0 |
| film | 0 | 0 |

We also need a PRIOR probability that a review is positive, before looking at any of its words. This is usually estimated from the class proportions in the training set. Since we have 50% positive reviews and 50% negative reviews, the prior probability for positive reviews is 50%.

We follow the same process for the negative reviews:

| Negative Reviews | Count | $P(\text{word} \mid \text{Neg})$ |
| ----------- | -------- | ---------|
| awesome | 1 | 0.06 |
| movie | 4 | 0.22 |
| popcorn | 3 | 0.17 |
| exciting | 1 | 0.06 |
| terrific | 1 | 0.06 |
| trash | 7 | 0.39 |
| film | 1 | 0.06 |

Prior for Negative reviews is also 50%

We get a new review:

"film was awesome!"

How would you classify this review?

How does NB classify it?

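What do we need?   
> $P(pos) = 0.5$  
> $P(awesome \mid pos) = 4/17 \approx 0.24$  
> $P(film \mid pos) = 0/17 = 0$  

The word "was" is not in our vocabulary, so we ignore it. The positive score is the product of these three values:

$P(pos) \times P(awesome \mid pos) \times P(film \mid pos) = 0$

Using the same algorithm, we also get a negative score:

$P(neg) \times P(awesome \mid neg) \times P(film \mid neg) = 0.5 \times \frac{1}{18} \times \frac{1}{18} \approx 0.0015$ (or $0.0018$ if you multiply the rounded table values $0.06 \times 0.06$)  

The negative score is larger, so the review is classified as **NEGATIVE**. A single zero count wiped out the whole positive score.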
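
A minimal sketch of this calculation in plain Python, using the word counts from the two tables. The function name `nb_score` and the simple regex tokenizer are assumptions for this example; words outside the vocabulary (such as "was") are skipped. Because the code uses the unrounded fractions, the negative score comes out near 0.0015 rather than the 0.0018 obtained from rounded table entries.

```python
import re

# Word counts taken from the two tables (17 and 18 words in total).
pos_counts = {"awesome": 4, "movie": 2, "popcorn": 3, "exciting": 4,
              "terrific": 4, "trash": 0, "film": 0}
neg_counts = {"awesome": 1, "movie": 4, "popcorn": 3, "exciting": 1,
              "terrific": 1, "trash": 7, "film": 1}

def nb_score(review, counts, prior):
    """Prior times the product of P(word | class), with no smoothing."""
    total = sum(counts.values())
    score = prior
    for word in re.findall(r"[a-z]+", review.lower()):
        if word in counts:  # ignore words outside the vocabulary
            score *= counts[word] / total
    return score

review = "film was awesome!"
pos_score = nb_score(review, pos_counts, prior=0.5)
neg_score = nb_score(review, neg_counts, prior=0.5)

print(pos_score)            # 0.0, because "film" never appears in positive reviews
print(round(neg_score, 4))  # 0.0015
print("NEGATIVE" if neg_score > pos_score else "POSITIVE")
```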

How do we fix this?

**Laplace Smoothing**

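To keep a single zero count from wiping out the whole product, we add a small extra count (1 suffices most of the time) to every word in every class. 

This parameter is called `alpha` ($\alpha$) in sklearn.

This *does not change* the prior probabilities.

Let's try again with $\alpha = 1$. Every count goes up by 1, and there are 7 words in the vocabulary, so the totals become $17 + 7 = 24$ for positive reviews and $18 + 7 = 25$ for negative reviews.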

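| Positive Reviews | Count | $P(\text{word} \mid \text{Pos})$ |
| ----------- | -------- | ---------|
| awesome | 5 | 0.21 |
| movie | 3 | 0.13 |
| popcorn | 4 | 0.17 |
| exciting | 5 | 0.21 |
| terrific | 5 | 0.21 |
| trash | 1 | 0.04 |
| film | 1 | 0.04 |


$P(pos) \times P(awesome \mid pos) \times P(film \mid pos) = 0.5 \times \frac{5}{24} \times \frac{1}{24} \approx 0.0043$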


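| Negative Reviews | Count | $P(\text{word} \mid \text{Neg})$ |
| ----------- | -------- | ---------|
| awesome | 2 | 0.08 |
| movie | 5 | 0.2 |
| popcorn | 4 | 0.16 |
| exciting | 2 | 0.08 |
| terrific | 2 | 0.08 |
| trash | 8 | 0.32 |
| film | 2 | 0.08 |

$P(neg) \times P(awesome \mid neg) \times P(film \mid neg) = 0.5 \times \frac{2}{25} \times \frac{2}{25} = 0.0032$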

We can now classify the review as **positive**.
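
The smoothed calculation can be sketched as follows, starting from the original (unsmoothed) word counts of 17 positive and 18 negative words. The helper names `smoothed_probs` and `score` are invented for this example. The smoothing rule used here, `(count + alpha) / (total + alpha * vocabulary_size)`, is the standard additive (Laplace) form and is assumed to be what the tables above apply.

```python
import re

VOCAB = ["awesome", "movie", "popcorn", "exciting", "terrific", "trash", "film"]
# Original word counts per class, before smoothing.
pos_counts = dict(zip(VOCAB, [4, 2, 3, 4, 4, 0, 0]))
neg_counts = dict(zip(VOCAB, [1, 4, 3, 1, 1, 7, 1]))

def smoothed_probs(counts, alpha=1.0):
    """P(word | class) with additive smoothing: add alpha to every count."""
    total = sum(counts.values()) + alpha * len(counts)
    return {word: (n + alpha) / total for word, n in counts.items()}

def score(review, counts, prior=0.5, alpha=1.0):
    """Prior times the product of smoothed P(word | class)."""
    probs = smoothed_probs(counts, alpha)
    result = prior
    for word in re.findall(r"[a-z]+", review.lower()):
        if word in probs:  # ignore words outside the vocabulary
            result *= probs[word]
    return result

review = "film was awesome!"
scores = {"positive": score(review, pos_counts),
          "negative": score(review, neg_counts)}
label = max(scores, key=scores.get)

print({k: round(v, 4) for k, v in scores.items()})
print(label)  # positive
```

Calling `score(review, pos_counts, alpha=0)` reproduces the zero score from the unsmoothed case.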

**Downsides:** NB ignores word order, and with it all grammar rules and colloquial expressions. 

\***** in a review can mean 5 stars, or it can be the reviewer censoring strong language because the movie was boring.

Another limitation of Naive Bayes is the assumption of independent predictors. In real life, we almost never get a set of predictors that are completely independent.

**Gaussian Naive Bayes**

Gaussian Naive Bayes is used when the features are continuous and we assume that, within each class, each feature follows a Gaussian (normal) distribution. 

We want to predict whether someone likes sushi or not. 

We collect data from people who claim to like sushi and from people who claim not to like sushi.

We record how much fish, rice, and beef they eat each day (in grams).

For people who like sushi:  
Mean(grams of eaten fish daily) = 120, SD = 20  
Mean(grams of eaten rice) = 100, SD = 40  
Mean(grams of eaten beef) = 30, SD = 5  

For people who don't like sushi:  
Mean(grams of eaten fish daily) = 60, SD = 10  
Mean(grams of eaten rice) = 90, SD = 40  
Mean(grams of eaten beef) = 120, SD = 30  

We assume that the data is represented by Gaussian distributions.

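A new individual reports eating 61 g of fish, 100 g of rice, and 120 g of beef every day.

Does this person like sushi?

We start from a prior probability that a person likes sushi.  
A common choice is the class proportion in the training data.   

Then we compute the score $\log\big(P(\text{likes sushi}) \times L(\text{fish} \mid \text{likes sushi}) \times L(\text{rice} \mid \text{likes sushi}) \times L(\text{beef} \mid \text{likes sushi})\big)$, where each $L$ is the Gaussian density of the reported value under that class. The log of the product is the sum of the individual logs.

We compute the same score for "does not like sushi".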

We compare the scores and make a decision based on that.
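
The comparison can be sketched in plain Python using the means and standard deviations listed above. The names `log_gaussian_pdf` and `log_score` are made up for this example, and a prior of 0.5 for each class is assumed here.

```python
import math

# (mean, SD) of daily grams eaten, per class, from the summary above.
stats = {
    "likes sushi":         {"fish": (120, 20), "rice": (100, 40), "beef": (30, 5)},
    "does not like sushi": {"fish": (60, 10),  "rice": (90, 40),  "beef": (120, 30)},
}
# Assumed priors: both classes equally likely before seeing the data.
priors = {"likes sushi": 0.5, "does not like sushi": 0.5}

def log_gaussian_pdf(x, mean, sd):
    """Log of the normal density N(mean, sd^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (x - mean) ** 2 / (2 * sd ** 2)

def log_score(person, label):
    """log(prior) plus the sum of per-feature log-likelihoods."""
    total = math.log(priors[label])
    for feature, value in person.items():
        mean, sd = stats[label][feature]
        total += log_gaussian_pdf(value, mean, sd)
    return total

person = {"fish": 61, "rice": 100, "beef": 120}
scores = {label: log_score(person, label) for label in stats}
prediction = max(scores, key=scores.get)

for label, s in scores.items():
    print(f"{label}: {s:.2f}")
print("Prediction:", prediction)  # Prediction: does not like sushi
```

The beef value dominates the result: 120 g is 18 standard deviations above the mean for people who like sushi, so that class gets a very low score.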

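We use logs to avoid numerical **UNDERFLOW**: a product of many small likelihoods can round to zero in floating point, while a sum of logs stays representable.  
The initial guess (prior) for "does not like sushi" is 0.5.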
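
A small demonstration of why the log helps. The list of 100 likelihoods, each equal to `1e-5`, is a made-up example rather than data from the lecture. Multiplying them directly underflows to zero in double precision, while the sum of their logs is an ordinary number.

```python
import math

# Hypothetical example: 100 features, each with a small likelihood.
likelihoods = [1e-5] * 100

# Direct product: the true value is 1e-500, far below the smallest float.
product = 1.0
for p in likelihoods:
    product *= p

# Sum of logs: the same information, kept in a representable range.
log_sum = sum(math.log(p) for p in likelihoods)

print(product)            # 0.0
print(round(log_sum, 2))  # -1151.29
```

Since the log is monotonic, comparing log scores gives the same winner as comparing the original products would.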