## Conditional Probability and Total Probability

### Exercise 1

We can find the word "secret" in many spam emails. However, some emails are not spam even though they contain the word "secret." Let's say we know the following probabilities:

The probability of getting a spam email is 23.88%. That is,
$P(Spam) = 0.2388$

The probability of an email containing the word "secret" given that the email is spam is 48.02%. That is,
$P(\text{"secret"}| Spam) = 0.4802$

The probability of an email containing the word "secret" given that the email is not spam is 12.84%. That is,
$P(\text{"secret"}| Spam^C) = 0.1284$

Calculate:
- $P(Spam^C)$. Assign the result to p_non_spam.
- $P(Spam \cap \text{"secret"})$. Assign the result to p_spam_and_secret.
- $P(Spam^C \cap \text{"secret"})$. Assign the result to p_non_spam_and_secret.
- $P(\text{"secret"})$. Assign the result to p_secret.

```python
p_spam = 0.2388
p_secret_given_spam = 0.4802
p_secret_given_non_spam = 0.1284

p_non_spam = 1 - p_spam
p_spam_and_secret = p_spam * p_secret_given_spam
p_non_spam_and_secret = p_non_spam * p_secret_given_non_spam
p_secret = p_spam_and_secret + p_non_spam_and_secret
```

### Exercise 2

An airline transports passengers using two types of planes: a Boeing 737 and an Airbus A320.

- The Boeing operates 73% of the flights. Out of these flights, 3% arrive at the destination with a delay.
- The Airbus operates the remaining 27% of the flights. Out of these flights, 8% arrive with a delay.

Convert the percentages above to probabilities:

1. Assign the probability of flying with a Boeing to p_boeing (to better understand what this probability means, imagine a passenger having bought a ticket with this airline — what's the probability that this passenger will be assigned to fly to her destination with a Boeing?).
2. Assign the probability of flying with an Airbus to p_airbus.
3. Assign the probability of arriving at the destination with a delay given that the passenger flies with a Boeing to p_delay_given_boeing.
4. Assign the probability of arriving at the destination with a delay given that the passenger flies with an Airbus to p_delay_given_airbus.

Calculate:
5. The probability that a passenger will arrive at her destination with a delay. Assign your answer to p_delay. Check the hint if you get stuck.

```python
p_boeing = .73
p_airbus = .27
p_delay_given_boeing = .03
p_delay_given_airbus = .08

p_delay = p_boeing * p_delay_given_boeing + p_airbus * p_delay_given_airbus
```

The formula extends to any number of plane types, as long as each flight is operated by exactly one of them. For example, if the airline also flew a third plane type, such as an ERJ, the formula would gain one more term:

$\begin{equation}
\overbrace{P(A)}^{P(Delay)} = \overbrace{P(B_1)}^{P(Boeing)} \cdot P(A|B_1) + \overbrace{P(B_2)}^{P(Airbus)} \cdot P(A|B_2) + \overbrace{P(B_3)}^{P(ERJ)} \cdot P(A|B_3)
\end{equation}$

An airline transports passengers using three types of planes: a Boeing 737, an Airbus A320, and an ERJ 145.

- The Boeing operates 62% of the flights. Out of these flights, 6% arrive at the destination with a delay.
- The Airbus operates 35% of the flights. Out of these flights, 9% arrive with a delay.
- The ERJ operates the remaining 3% of the flights. Out of these flights, 1% arrive with a delay.

1. Calculate the probability of delay and assign your result to p_delay. See the hint if you get stuck.

```python
p_boeing = 0.62
p_airbus = 0.35
p_erj = 0.03
p_delay_boeing = 0.06 
p_delay_airbus = 0.09
p_delay_erj = 0.01

p_delay = p_boeing*p_delay_boeing + p_airbus*p_delay_airbus + p_erj*p_delay_erj
```

Using the same reasoning as we used above, the formula for n events is:

$\begin{equation}
P(A) = P(B_1) \cdot P(A|B_1) + P(B_2) \cdot P(A|B_2) + \dots + P(B_n) \cdot P(A|B_n)
\end{equation}$

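This formula is called the law of total probability. It holds when the events $B_1, B_2, \dots, B_n$ are mutually exclusive and together cover all possible outcomes. In summation notation: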

$\begin{equation}
P(A) = \sum_{i = 1}^{n} P(B_i) \cdot P(A|B_i)
\end{equation}$
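
As a minimal sketch, the sum can be written as a small Python helper. The function name `total_probability` and its argument names are illustrative assumptions, not part of the exercises; the helper assumes the two lists are ordered the same way and that the $P(B_i)$ values sum to 1.

```python
import math

def total_probability(p_b, p_a_given_b):
    """Return P(A) as the sum of P(B_i) * P(A|B_i) over all events B_i."""
    if len(p_b) != len(p_a_given_b):
        raise ValueError("both lists must have the same length")
    # The B_i must cover every outcome, so their probabilities sum to 1.
    if not math.isclose(sum(p_b), 1.0):
        raise ValueError("the P(B_i) values must sum to 1")
    return sum(pb * pa for pb, pa in zip(p_b, p_a_given_b))

# Three-plane example: Boeing, Airbus, ERJ (values from the exercise above)
p_delay = total_probability([0.62, 0.35, 0.03], [0.06, 0.09, 0.01])
```

The same helper handles the two-plane case by passing lists of length two.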

## Bayes' Theorem

On previous screens, we discussed a few examples around plane delays and used the law of total probability to find P(Delay), the probability that a passenger will arrive at her destination with a delay. Once a plane has arrived with a delay, however, we might want to calculate the probability that it's a Boeing. In other words, what's the probability that the plane is a Boeing given that it arrived with a delay?

Let's bring back a concrete example we've used earlier. An airline transports passengers using two types of planes: a Boeing 737 and an Airbus A320.

- The Boeing operates 73% of the flights. Out of these flights, 3% arrive at the destination with a delay.
- The Airbus operates the remaining 27% of the flights. Out of these flights, 8% arrive with a delay.

Let's say a plane did arrive with a delay and we want to find the probability that the plane is a Boeing. In other words, we want to find P(Boeing|Delay). Let's begin by expanding P(Boeing|Delay) using the conditional probability formula:
$\begin{equation}
P(Boeing|Delay) = \frac{P(Boeing \cap Delay)}{P(Delay)} = \frac{P(Boeing) \cdot P(Delay|Boeing)}{P(Delay)}
\end{equation}$

We already know from the problem statement that $P(Boeing) = 0.73$ and $P(Delay|Boeing) = 0.03$. We don't know the value of P(Delay), but we can find it using the law of total probability:
$\begin{aligned}
P(Delay) &= P(Boeing) \cdot P(Delay|Boeing) + P(Airbus) \cdot P(Delay|Airbus) \\
&= 0.73 \cdot 0.03 + 0.27 \cdot 0.08 = 0.0435
\end{aligned}$

Now we can plug in the values in our initial conditional probability formula and find P(Boeing|Delay):
$\begin{equation}
P(Boeing|Delay) = \frac{0.73 \cdot 0.03}{0.0435} = 0.5034
\end{equation}$

This is an instance where we applied Bayes' theorem to solve a probability problem. Mathematically, Bayes' theorem can be defined as:
$\begin{equation}
P(B|A) = \frac{P(B) \cdot P(A|B)}{\displaystyle \sum_{i = 1}^{n} P(B_i) \cdot P(A|B_i)}
\end{equation}$
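
The formula can be sketched in Python as follows. The helper name `bayes_posterior` and the index argument `k` (which selects the event $B_k$ we condition toward) are assumptions made for illustration; the values are the two-plane figures used above.

```python
def bayes_posterior(p_b, p_a_given_b, k):
    """Return P(B_k|A) given lists of P(B_i) and P(A|B_i)."""
    numerator = p_b[k] * p_a_given_b[k]
    # The denominator is P(A), expanded with the law of total probability.
    denominator = sum(pb * pa for pb, pa in zip(p_b, p_a_given_b))
    return numerator / denominator

# Index 0 is Boeing, index 1 is Airbus
p_boeing_given_delay = bayes_posterior([0.73, 0.27], [0.03, 0.08], 0)
print(round(p_boeing_given_delay, 4))  # 0.5034
```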

Note that we arrived at Bayes' theorem by substituting the law of total probability into the conditional probability formula and expanding the numerator P(B ∩ A) using the multiplication rule:
$\begin{aligned}
\text{Conditional Probability} &\implies P(B|A) = \frac{P(B \cap A)}{P(A)} \\
\text{The Law of Total Probability} &\implies P(A) = \sum_{i = 1}^{n} P(B_i) \cdot P(A|B_i) \\
\text{Bayes' Theorem} &\implies P(B|A) = \frac{P(B) \cdot P(A|B)}{\displaystyle \sum_{i = 1}^{n} P(B_i) \cdot P(A|B_i)}
\end{aligned}$

Above, we defined the formulas for P(B|A), but we can also define them for P(A|B):
$\begin{aligned}
\text{Conditional Probability} &\implies P(A|B) = \frac{P(A \cap B)}{P(B)} \\
\text{The Law of Total Probability} &\implies P(B) = \sum_{i = 1}^{n} P(A_i) \cdot P(B|A_i) \\
\text{Bayes' Theorem} &\implies P(A|B) = \frac{P(A) \cdot P(B|A)}{\displaystyle \sum_{i = 1}^{n} P(A_i) \cdot P(B|A_i)}
\end{aligned}$

Now let's use Bayes' theorem to find P(Airbus|Delay). On the next screen, we'll learn more about Bayes' theorem.

### Exercise 3

An airline transports passengers using two types of planes: a Boeing 737 and an Airbus A320.

- The Boeing operates 73% of the flights. Out of these flights, 3% arrive at the destination with a delay.
- The Airbus operates the remaining 27% of the flights. Out of these flights, 8% arrive with a delay.

1. Use Bayes' theorem to find P(Airbus|Delay). Assign your answer to p_airbus_delay. Don't forget you can check the hint if you get stuck.

```python
p_boeing = 0.73
p_airbus = 0.27
p_delay_given_boeing = 0.03
p_delay_given_airbus = 0.08

numerator = p_airbus * p_delay_given_airbus
denominator = p_airbus*p_delay_given_airbus + p_boeing*p_delay_given_boeing
p_airbus_delay = numerator / denominator
```

### Exercise 4

The probability of getting a positive test result given that a patient is infected with HIV is 99.78%. That is, $P(T^+ | HIV) = 0.9978$

The probability of getting a positive test result given that a patient is not infected with HIV is 1.05%. That is, $P(T^+ | HIV^C) = 0.0105$

The probability of being infected with HIV is 0.14%. That is, $P(HIV) = 0.0014$

The probability of not being infected with HIV is 99.86%. That is, $P(HIV^C) = 0.9986$

Since $P(T^+ | HIV) = 0.9978$, it means that 99.78% of the people infected with HIV get a correct diagnosis — they test positive for a virus they actually have.

The value of $P(T^+ | HIV^C) = 0.0105$ means 1.05% of the people who are not infected with HIV get a wrong diagnosis: they test positive for a virus they don't have. All in all, this suggests the test is quite efficient.

Now let's say a person comes in for a test and we don't know beforehand whether they have HIV or not. The patient tests positive. One important question we may have now is: Given the positive test result, what's the probability of being infected with HIV? In other words, what is P(HIV|T+)?

We can find the answer using Bayes' theorem. Let's begin by expanding P(HIV|T+) using the conditional probability formula:
$\begin{equation}
P(HIV | T^+) = \frac{P(HIV \cap T^+)}{P(T^+)} = \frac{P(HIV) \cdot P(T^+|HIV)}{P(T^+)}
\end{equation}$

From the problem statement, we know that $P(HIV) = 0.0014$ and $P(T^+|HIV) = 0.9978$. We don't know the value of $P(T^+)$, but we can find it using the law of total probability:
$\begin{aligned}
P(T^+) &= P(HIV) \cdot P(T^+|HIV) + P(HIV^C) \cdot P(T^+|HIV^C) \\
&= 0.0014 \cdot 0.9978 + 0.9986 \cdot 0.0105 = 0.0119
\end{aligned}$

We now have all the values we need to calculate P(HIV|T+):
$\begin{equation}
P(HIV | T^+) = \frac{0.0014 \cdot 0.9978}{0.0119} = 0.1174
\end{equation}$

We see that if a person tests positive, the probability of being infected with HIV is still pretty low: 11.74%. This low value may be a bit counter-intuitive given the high efficiency of the test. However, the probability is low because P(HIV) — the probability of having HIV — is very low in the first place: 0.14%.

Notice, however, that if a person tests positive, the probability of being infected with HIV actually increases a lot. A random person in the population has a 0.14% chance of being infected with HIV, since $P(HIV) = 0.0014$. But if a person tests positive, the probability of HIV infection increases to 11.74%, which is about 84 times the initial probability:
$\begin{equation}
\frac{P(HIV|T^+)}{P(HIV)} = \frac{0.1174}{0.0014} = 83.86
\end{equation}$
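
The worked example can be reproduced with a few lines of Python. This is a sketch with variable names chosen for illustration. Because the code divides by the unrounded value of $P(T^+)$, while the text rounds $P(T^+)$ to 0.0119 first, the results differ slightly in the last digits (0.1176 rather than 0.1174).

```python
p_hiv = 0.0014
p_pos_given_hiv = 0.9978
p_pos_given_no_hiv = 0.0105

p_no_hiv = 1 - p_hiv

# Law of total probability for P(T+)
p_pos = p_hiv * p_pos_given_hiv + p_no_hiv * p_pos_given_no_hiv

# Bayes' theorem for P(HIV|T+)
p_hiv_given_pos = p_hiv * p_pos_given_hiv / p_pos

# How many times larger the probability is after a positive test
ratio = p_hiv_given_pos / p_hiv

print(round(p_pos, 4))            # 0.0119
print(round(p_hiv_given_pos, 4))  # 0.1176
print(round(ratio, 2))            # 83.97
```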

In the above example, we've considered the probability of being infected with HIV in two scenarios:

1. Before doing any test: P(HIV)
2. After testing positive: P(HIV|T+)

The probability of being infected with HIV before doing any test is called the prior probability ("prior" means "before"). The probability of being infected with HIV after testing positive is called the posterior probability ("posterior" means "after"). So, in this case, the prior probability is 0.14%, and the posterior probability is 11.74%.

Now let's look at an exercise about spam emails and wrap up this mission on the next screen.

### Exercise 5

Many spam emails contain the word "secret". However, some emails are not spam even though they contain the word "secret". Let's say we know the following probabilities:

The probability of getting a spam email is 23.88%. That is, $P(Spam) = 0.2388$

The probability of an email containing the word "secret" given that the email is spam is 48.02%. That is, $P(\text{"secret"}| Spam) = 0.4802$

The probability of an email containing the word "secret" given that the email is not spam is 12.84%. That is, $P(\text{"secret"}| Spam^C) = 0.1284$

1. Use Bayes' theorem to find P(Spam|"secret"). Assign your answer to p_spam_given_secret.

2. Assign the prior probability of getting a spam email to prior.

3. Assign the posterior probability of getting a spam email (after we see the email contains the word "secret") to posterior.

4. Calculate the ratio between the posterior and the prior probability — you'll need to divide the posterior probability by the prior probability. Assign your answer to ratio.

```python
p_spam = 0.2388
p_secret_given_spam = 0.4802
p_secret_given_non_spam = 0.1284

p_non_spam = 1 - p_spam
numerator = p_spam * p_secret_given_spam
denominator = p_spam * p_secret_given_spam + p_non_spam * p_secret_given_non_spam
p_spam_given_secret = numerator / denominator

prior = p_spam
posterior = p_spam_given_secret
ratio = posterior / prior
```

## Naive Bayes Algorithm

Steps to classify a message as spam or non-spam:

1. The computer learns how humans classify messages.
2. Then it uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
3. Finally, the computer classifies a new message based on the probability values it calculated in step 2 — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may want a human to classify the message — we'll come back to this issue in the guided project).

When a new message comes in, the algorithm requires the computer to calculate the following probabilities:

$\begin{equation}
P(Spam | New\ message) =\ ? \\
P(Spam^C |New\ message) =\ ?
\end{equation}$

Let's take the first equation and expand it using Bayes' theorem:
$\begin{equation}
P(Spam | New\ message) = \frac{P(Spam) \cdot P(New\ Message | Spam)}{P(New\ message)}
\end{equation}$

Now let's do the same for the second equation:
$\begin{equation}
P(Spam^C | New\ message) = \frac{P(Spam^C) \cdot P(New\ Message | Spam^C)}{P(New\ message)}
\end{equation}$

For the sake of example, let's assume the following probabilities are already known:

$\begin{aligned}
&P(Spam) = 0.5 \\
&P(Spam^C) = 0.5 \\
&P(New\ message) = 0.4167 \\
&P(New\ Message | Spam) = 0.5 \\
&P(New\ Message | Spam^C) = 0.3334
\end{aligned}$

If the computer knows these values, then it can calculate the probabilities it needs to classify a new message:
$\begin{equation}
P(Spam | New\ message) = \frac{0.5 \cdot 0.5}{0.4167} = 0.6 \\
P(Spam^C | New\ message) = \frac{0.5 \cdot 0.3334}{0.4167} = 0.4
\end{equation}$

Since $P(Spam | New\ message) > P(Spam^C | New\ message)$, the computer will classify the new message as spam.

### Exercise 6

A new mobile message has been received: "URGENT!! You have one day left to claim your \$873 prize." The following probabilities are known:

$\begin{aligned}
&P(Spam) = 0.5 \\
&P(Spam^C) = 0.5 \\
&P(New\ message) = 0.5417 \\
&P(New\ Message | Spam) = 0.75 \\
&P(New\ Message | Spam^C) = 0.3334
\end{aligned}$

Classify this new message as spam or non-spam:

1. Calculate $P(Spam | New\ message)$. Assign your answer to p_spam_given_new_message.
2. Calculate $P(Spam^C | New\ message)$. Assign your answer to p_non_spam_given_new_message.
3. Classify the message by comparing the probability values. If the message is spam, then assign the string 'spam' to the variable classification. Otherwise, assign the string 'non-spam'.

```python
p_spam = 0.5
p_non_spam = 0.5
p_new_message = 0.5417
p_new_message_given_spam = 0.75
p_new_message_given_non_spam = 0.3334

p_spam_given_new_message = (p_spam * p_new_message_given_spam) / p_new_message
p_non_spam_given_new_message = (p_non_spam * p_new_message_given_non_spam) / p_new_message

classification = 'spam' if p_spam_given_new_message > p_non_spam_given_new_message else 'non-spam'
```

On the last screen, we saw the computer can use these two equations to calculate the probabilities it needs to classify new messages:

$\begin{equation}
P(Spam | New\ message) = \frac{P(Spam) \cdot P(New\ Message | Spam)}{P(New\ message)} \\
P(Spam^C | New\ message) = \frac{P(Spam^C) \cdot P(New\ Message | Spam^C)}{P(New\ message)}
\end{equation}$

Although we've taken a great first step so far, the actual equations of the Naive Bayes algorithm are a bit different — we'll gradually develop the equations throughout this mission. Let's start by pointing out that both equations above have the same denominator: P(New message).

When a new message comes in, P(New message) has the same value for both equations. Since we only need to compare the results of the two equations to classify a new message, we can ignore the division:

$\begin{equation}
\frac{P(Spam) \cdot P(New\ Message | Spam)}{P(New\ message)}\ \ \ \  \xrightarrow[]{becomes}\ \ \ \  P(Spam) \cdot P(New\ Message | Spam) \\
\frac{P(Spam^C) \cdot P(New\ Message | Spam^C)}{P(New\ message)}\ \ \  \xrightarrow[]{becomes}\ \ \ \  P(Spam^C) \cdot P(New\ Message | Spam^C)
\end{equation}$

This means our two equations reduce to:

$\begin{equation}
P(Spam | New\ message) = P(Spam) \cdot P(New\ Message | Spam) \\
P(Spam^C | New\ message) = P(Spam^C) \cdot P(New\ Message | Spam^C)
\end{equation}$

Ignoring the division doesn't affect the algorithm's ability to classify new messages. For instance, let's repeat the classification we did on the previous screen using the new equations above. Recall that we assumed we already know these values:

$\begin{aligned}
&P(Spam) = 0.5 \\
&P(Spam^C) = 0.5 \\
&P(New\ message) = 0.4167 \\
&P(New\ Message | Spam) = 0.5 \\
&P(New\ Message | Spam^C) = 0.3334
\end{aligned}$

Previously, the algorithm classified the new message as spam. Using the new equations, we see the conclusion is identical: the new message is spam because $P(Spam | New\ message) > P(Spam^C | New\ message)$:

$\begin{aligned}
P(Spam | New\ message) &= P(Spam) \cdot P(New\ Message | Spam) \\
&= 0.5 \cdot 0.5 = 0.25 \\
P(Spam^C | New\ message) &= P(Spam^C) \cdot P(New\ Message | Spam^C) \\
&= 0.5 \cdot 0.3334 = 0.1667
\end{aligned}$

The classification works fine, but ignoring the division changes the probability values, and some probability rules also begin to break. For instance, let's take this conditional probability rule that we've learned about in a previous mission:

$\begin{equation}
P(A|B) + P(A^C|B) = 1
\end{equation}$

On the previous screen, we saw $P(Spam | New\ message) = 0.6$ and $P(Spam^C | New\ message) = 0.4$, and the rule holds with these values:

$\begin{equation}
P(Spam | New\ message) + P(Spam^C | New\ message) = 0.6 + 0.4 = 1
\end{equation}$

With the values we got from the new equations, however, the rule breaks:

$\begin{equation}
P(Spam | New\ message) + P(Spam^C | New\ message) = 0.25 + 0.1667 = 0.4167 \neq 1
\end{equation}$

Even though probability rules break, the Naive Bayes algorithm still requires us to ignore the division by P(New message). This might not make a lot of sense, but there's actually a very good reason we do that.

The main goal of the algorithm is to classify new messages, not to calculate probabilities; calculating probabilities is just a means to an end. Ignoring the division by P(New message) means fewer calculations, which can make a lot of difference when we use the algorithm to classify 500,000 new messages.

It's true the probability values are not accurate anymore. However, this doesn't matter for the goal of the algorithm, which is to classify new messages correctly, not to estimate probabilities accurately.

The classification itself remains completely unaffected because we ignore the division in both equations (not just in one). The values change, but both are scaled by the same positive factor, so the result of the comparison doesn't change.

For instance, $\frac{8}{4} > \frac{4}{4}$. If we ignore the division, both values are multiplied by the same factor of 4, so the result of the comparison stays the same: $8 > 4$.

The symbol for directly proportional is $\propto$, and it's more accurate to replace the equality sign with $\propto$ in our two equations:

$\begin{equation}
P(Spam | New\ message) \propto P(Spam) \cdot P(New\ Message | Spam) \\
P(Spam^C | New\ message) \propto P(Spam^C) \cdot P(New\ Message | Spam^C)
\end{equation}$
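
To make the proportionality concrete, here is a short sketch using the example values from this screen (the names `score_spam` and `score_non_spam` are illustrative assumptions). It classifies with the unnormalized values and then shows that dividing each one by their sum recovers probabilities that add up to 1. The sum equals P(New message) by the law of total probability.

```python
p_spam = 0.5
p_non_spam = 0.5
p_new_message_given_spam = 0.5
p_new_message_given_non_spam = 0.3334

# Unnormalized values: only the numerators of Bayes' theorem
score_spam = p_spam * p_new_message_given_spam              # 0.25
score_non_spam = p_non_spam * p_new_message_given_non_spam  # 0.1667

# The comparison alone is enough to classify the message
classification = 'spam' if score_spam > score_non_spam else 'non-spam'

# Optional: divide by the sum of the scores to get values that add up to 1
total = score_spam + score_non_spam
p_spam_given_new_message = score_spam / total
p_non_spam_given_new_message = score_non_spam / total
```

The normalization step is only needed when the actual probabilities matter; the classification never uses it.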

### Exercise 7

A new mobile message has been received: "URGENT!! You have one day left to claim your \$873 prize." The following probabilities are known:

$\begin{aligned}
&P(Spam) = 0.5 \\
&P(Spam^C) = 0.5 \\
&P(New\ Message | Spam) = 0.75 \\
&P(New\ Message | Spam^C) = 0.3334
\end{aligned}$

Use the new equations we learned on this screen, and classify the new message as spam or non-spam:

1. Calculate P(Spam|New Message). Assign your answer to p_spam_given_new_message.
2. Calculate $P(Spam^C | New\ message)$. Assign your answer to p_non_spam_given_new_message.
3. Classify the message by comparing the probability values — if the message is spam, then assign the string 'spam' to the variable classification. Otherwise, assign the string 'non-spam'.

```python
p_spam = 0.5
p_non_spam = 0.5
p_new_message_given_spam = 0.75
p_new_message_given_non_spam = 0.3334

p_spam_given_new_message = p_spam * p_new_message_given_spam
p_non_spam_given_new_message = p_non_spam * p_new_message_given_non_spam

classification = 'spam' if p_spam_given_new_message > p_non_spam_given_new_message else 'non-spam'
```

On the previous screen, we optimized the algorithm and concluded that we can use these two optimized equations if all we're interested in is classifying messages (and not calculating accurate probabilities):

$\begin{equation}
P(Spam | New\ message) \propto P(Spam) \cdot P(New\ Message | Spam) \\
P(Spam^C | New\ message) \propto P(Spam^C) \cdot P(New\ Message | Spam^C)
\end{equation}$

We'll now look at how the algorithm can use messages that are already classified by humans to calculate the values it needs for:

- $P(Spam)$ and $P(Spam^C)$
- $P(New\ message | Spam)$ and $P(New\ message | Spam^C)$

We'll start with some examples that may look a bit too simplistic and unrealistic, but they will make it easier to understand the mathematics behind the algorithm.

Let's say we have three messages that are already classified:

|ix| Label   | SMS   |
|--|---------|-------|
|0 | spam    | secret money secret secret |
|1 | spam    | money secret place |
|2 | non-spam| you know the secret |

Now let's say the one-word message "secret" comes in and we want to use the Naive Bayes algorithm to classify it — to tell whether it's spam or non-spam.

|ix| Label   | SMS   |
|--|---------|-------|
|0 | spam    | secret money secret secret |
|1 | spam    | money secret place |
|2 | non-spam| you know the secret |
|__3__|__?__|__secret__ |

As we learned, we first need to answer these two probability questions (note that we changed New Message to "secret" inside the notation below) and then compare the values (recall that the $\propto$ symbol replaces the equal sign):

$\begin{equation}
P(Spam | \text{"secret"}) \propto P(Spam) \cdot P(\text{"secret"} | Spam) \\
P(Spam^C | \text{"secret"}) \propto P(Spam^C) \cdot P(\text{"secret"} | Spam^C)
\end{equation}$

Let's begin with the first equation, for which we need to find the values of P(Spam) and P("secret"|Spam). To find P(Spam), we use the messages that are already classified and divide the number of spam messages by the total number of messages:

$\begin{equation}
P(Spam) = \frac{\text{number of spam messages}}{\text{total number of messages}} = \frac{2}{3}
\end{equation}$

To calculate P("secret"|Spam), we only look at the spam messages and divide the number of times the word "secret" occurs in all the spam messages by the total number of words in all the spam messages:

$\begin{equation}
P(\text{"secret"}| Spam) = \frac{\text{number of times the word "secret" occurs}}{\text{total number of words in all spam messages}}
\end{equation}$

Notice that "secret" occurs four times in the spam messages:

![image.png](attachment:image.png)

We have two spam messages and there's a total of seven words in all of them, so P("secret"|Spam) is:

$\begin{equation}
P(\text{"secret"}| Spam) = \frac{\text{number of times the word "secret" occurs}}{\text{total number of words in all spam messages}} = \frac{4}{7}
\end{equation}$

Now that we know the values for P(Spam) and P("secret"|Spam), we have all we need to calculate P(Spam|"secret"):

$\begin{aligned}
P(Spam | \text{"secret"}) &\propto P(Spam) \cdot P(\text{"secret"} | Spam) \\
&= \frac{2}{3} \cdot \frac{4}{7} = \frac{8}{21}
\end{aligned}$
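
The counting steps above can be sketched in Python. The helper names `prior` and `word_given_label` and the `training` list layout are assumptions made for illustration; `fractions.Fraction` keeps the results exact so they can be compared with the fractions in the text.

```python
from fractions import Fraction

# The three labeled messages from the table above
training = [
    ('spam', 'secret money secret secret'),
    ('spam', 'money secret place'),
    ('non-spam', 'you know the secret'),
]

def prior(label):
    """Share of training messages that carry this label."""
    n_label = sum(1 for lab, _ in training if lab == label)
    return Fraction(n_label, len(training))

def word_given_label(word, label):
    """Occurrences of the word divided by the total word count for one label."""
    words = [w for lab, sms in training if lab == label for w in sms.split()]
    return Fraction(words.count(word), len(words))

p_spam = prior('spam')                                    # 2/3
p_secret_given_spam = word_given_label('secret', 'spam')  # 4/7
p_spam_given_secret = p_spam * p_secret_given_spam        # 8/21
```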

For the exercise below, we'll take the same steps as above to calculate $P(Spam^C | \text{"secret"})$. Then, we can compare the values of $P(Spam^C | \text{"secret"})$ and $P(Spam | \text{"secret"})$ to classify the message "secret" as spam or non-spam.

### Exercise 8

- Calculate $P(Spam^C)$ and assign the answer to p_non_spam.
- Calculate $P(\text{"secret"} | Spam^C)$ and assign the answer to p_secret_given_non_spam.
- Calculate $P(Spam^C | \text{"secret"})$ and assign the answer to p_non_spam_given_secret.
- Compare $P(Spam^C | \text{"secret"})$ with $P(Spam | \text{"secret"})$ and classify the message "secret": if the message is spam, then assign the string 'spam' to the variable classification, otherwise assign the string 'non-spam'.

```python
p_spam_given_secret = 8/21

p_non_spam = 1/3
p_secret_given_non_spam = 1/4
p_non_spam_given_secret = p_non_spam*p_secret_given_non_spam

classification = 'spam' if p_spam_given_secret > p_non_spam_given_secret else 'non-spam'
```

On the previous screen, we used our algorithm to classify the message "secret", and we concluded it's spam. The message "secret" has only one word, but what about the situation where we have to classify messages that have more words?

Let's say we want to classify the message "secret place secret secret" based on four messages that are already classified (the four messages below are different from what we saw on the previous screen):

![](image1.png)

To calculate the probabilities we need, we'll treat each word in our new message separately. This means that the word "secret" at the beginning is different and separate from the word "secret" at the end. There are four words in the message "secret place secret secret", and we're going to abbreviate them "w1", "w2", "w3" and "w4" (the "w" comes from "word").

![](image2.png)

Since we treat each word separately, these are the two equations we can use to calculate the probabilities:

$\begin{equation}
P(Spam | w_1,w_2,w_3,w_4) \propto P(Spam) \cdot P(w_1|Spam) \cdot P(w_2|Spam) \cdot P(w_3|Spam) \cdot P(w_4|Spam) \\
P(Spam^C | w_1,w_2,w_3,w_4) \propto P(Spam^C) \cdot P(w_1|Spam^C) \cdot P(w_2|Spam^C) \cdot P(w_3|Spam^C) \cdot P(w_4|Spam^C) \\
\end{equation}$

Let's begin with calculating P(Spam|w1, w2, w3, w4). To calculate the probabilities we need, we'll look at the four messages that are already classified. We have four messages and two of them are spam, so:

$\begin{equation}
P(Spam) = \frac{2}{4} = \frac{1}{2}
\end{equation}$

The first word, w1, is "secret", and we see that "secret" occurs four times in all spam messages. There's a total of seven words in all the spam messages, so:

$\begin{equation}
P(w_1|Spam) = \frac{4}{7} 
\end{equation}$

Applying a similar reasoning, we have:

$\begin{equation}
P(w_2|Spam) = \frac{1}{7} \\
P(w_3|Spam) = \frac{4}{7} \\
P(w_4|Spam) = \frac{4}{7}
\end{equation}$

We now have all the probabilities we need to calculate P(Spam|w1, w2, w3, w4):

$\begin{aligned}
P(Spam | w_1,w_2,w_3,w_4) &\propto P(Spam) \cdot P(w_1|Spam) \cdot P(w_2|Spam) \cdot P(w_3|Spam) \cdot P(w_4|Spam) \\
&= \frac{1}{2} \cdot \frac{4}{7} \cdot \frac{1}{7} \cdot \frac{4}{7} \cdot \frac{4}{7} = \frac{64}{4802} = 0.01333
\end{aligned}$
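
The product can also be computed programmatically. The sketch below takes the per-word probabilities as given in the text rather than counting them from the training messages; the dictionary `p_word_given_spam` and the variable `score` are illustrative names.

```python
from fractions import Fraction
from math import prod

p_spam = Fraction(2, 4)

# Per-word probabilities for the spam class, as stated in the text above
p_word_given_spam = {'secret': Fraction(4, 7), 'place': Fraction(1, 7)}

new_message = 'secret place secret secret'

# Multiply the prior by one factor per word, repeated words included
score = p_spam * prod(p_word_given_spam[w] for w in new_message.split())
print(score)  # 32/2401, the reduced form of 64/4802
```

Each occurrence of a word contributes its own factor, which is why "secret" appears three times in the product.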

Let's now take similar steps to calculate $P(Spam^C | w_1,w_2,w_3,w_4)$, and then classify the message "secret place secret secret" as spam or non-spam.

### Exercise 9

- Calculate $P(Spam^C | w_1,w_2,w_3,w_4)$. Assign the answer to p_non_spam_given_w1_w2_w3_w4. Check the hint if you get stuck.
- Compare $P(Spam^C | w_1,w_2,w_3,w_4)$ with $P(Spam | w_1,w_2,w_3,w_4)$ and classify the message "secret place secret secret": if the message is spam, then assign the string 'spam' to the variable classification. Otherwise, assign the string 'non-spam'.

```python
p_spam_given_w1_w2_w3_w4 = 64/4802

p_non_spam = 2/4
p_w1_given_non_spam = 2/9
p_w2_given_non_spam = 1/9
p_w3_given_non_spam = 2/9
p_w4_given_non_spam = 2/9

p_non_spam_given_w1_w2_w3_w4 = (p_non_spam *
                                p_w1_given_non_spam * p_w2_given_non_spam *
                                p_w3_given_non_spam * p_w4_given_non_spam
                               )

classification = 'spam' if p_spam_given_w1_w2_w3_w4 > p_non_spam_given_w1_w2_w3_w4 else 'non-spam'
```

On the previous screen, we introduced these two equations without much explanation:

$\begin{equation}
P(Spam | w_1,w_2,w_3,w_4) \propto P(Spam) \cdot P(w_1|Spam) \cdot P(w_2|Spam) \cdot P(w_3|Spam) \cdot P(w_4|Spam) \\
P(Spam^C | w_1,w_2,w_3,w_4) \propto P(Spam^C) \cdot P(w_1|Spam^C) \cdot P(w_2|Spam^C) \cdot P(w_3|Spam^C) \cdot P(w_4|Spam^C) \\
\end{equation}$

To explain the mathematics behind these equations, let's start by looking at P(Spam|w1, w2, w3, w4). Using the conditional probability formula, we can expand P(Spam|w1, w2, w3, w4) like this (below, make sure you notice the $\cap$ symbol in the numerator):

$\begin{equation}
P(Spam | w_1,w_2,w_3,w_4) = \frac{P(Spam \cap (w_1, w_2, w_3, w_4))}{P(w_1, w_2, w_3, w_4)}
\end{equation}$

Recall that we learned in a previous screen that we can ignore the division, which means we can drop P(w1, w2, w3, w4) to avoid redundant calculations (when we ignore the division, we also replace the equals sign with $\propto$, which means directly proportional):

$\begin{equation}
P(Spam | w_1,w_2,w_3,w_4) \propto P(Spam \cap (w_1, w_2, w_3, w_4))
\end{equation}$

Note that (w1, w2, w3, w4) can be modeled as an intersection of four events:

$\begin{equation}
w_1,w_2,w_3,w_4 = w_1 \cap w_2 \cap w_3 \cap w_4
\end{equation}$

For instance, we could think of a message like "thanks for your help" as the intersection of four words inside a single message: "thanks", "for", "your", and "help". In probability jargon, finding the value of $P(w_1 \cap w_2 \cap w_3 \cap w_4)$ means finding the probability that the four words w1, w2, w3, w4 occur together in a single message. This is similar to $P(A \cap B \cap C \cap D)$, which is the probability that events A, B, C, and D occur together.

With this in mind, our equation above transforms to:

$\begin{equation}
P(Spam | w_1,w_2,w_3,w_4) \propto P(Spam \cap \underbrace{(w_1 \cap w_2 \cap w_3 \cap w_4)}_{\displaystyle (w_1,w_2,w_3,w_4)})
\end{equation}$

From set theory, we know that $A \cap (B \cap C) = A \cap B \cap C = C \cap B \cap A$, which means we can transform $P(Spam \cap (w_1 \cap w_2 \cap w_3 \cap w_4))$ in our equation above to make it suitable for further expansion:

$\begin{aligned}
P(Spam \cap (w_1 \cap w_2 \cap w_3 \cap w_4)) &= P(Spam \cap w_1 \cap w_2 \cap w_3 \cap w_4) \\
&= P(w_1 \cap w_2 \cap w_3 \cap w_4 \cap Spam)
\end{aligned}$

Now let's use the multiplication rule to expand $P(w_1 \cap w_2 \cap w_3 \cap w_4 \cap Spam)$:

$\begin{equation}
P(w_1 \cap w_2 \cap w_3 \cap w_4 \cap Spam) = P(w_1 | w_2 \cap w_3 \cap w_4 \cap Spam) \cdot P(w_2 \cap w_3 \cap w_4 \cap Spam)
\end{equation}$

We can use the multiplication rule again to expand P(w2 ∩ w3 ∩ w4 ∩ Spam):

$\begin{equation}
P(w_1 \cap w_2 \cap w_3 \cap w_4 \cap Spam) = P(w_1 | w_2 \cap w_3 \cap w_4 \cap Spam) \cdot \underbrace{P(w_2 | w_3 \cap w_4 \cap Spam) \cdot P(w_3 \cap w_4 \cap Spam)}_{\displaystyle P(w_2 \cap w_3 \cap w_4 \cap Spam)}
\end{equation}$

We can use the multiplication rule successively, until there's nothing more left to expand:

$\begin{aligned}
P(w_1 \cap w_2 \cap w_3 \cap w_4 \cap Spam) &= P(w_1 | w_2 \cap w_3 \cap w_4 \cap Spam) \cdot P(w_2 \cap w_3 \cap w_4 \cap Spam) \\
&= P(w_1 | w_2 \cap w_3 \cap w_4 \cap Spam) \cdot P(w_2 | w_3 \cap w_4 \cap Spam) \cdot P(w_3 \cap w_4 \cap Spam) \\
&= P(w_1 | w_2 \cap w_3 \cap w_4 \cap Spam) \cdot P(w_2 | w_3 \cap w_4 \cap Spam) \cdot P(w_3 | w_4 \cap Spam) \cdot P(w_4 \cap Spam) \\
&= P(w_1 | w_2 \cap w_3 \cap w_4 \cap Spam) \cdot P(w_2 | w_3 \cap w_4 \cap Spam) \cdot P(w_3 | w_4 \cap Spam) \cdot P(w_4|Spam) \cdot P(Spam) \\
\end{aligned}$

In theory, the last equation you see above is what we'd have to use if we wanted to calculate P(Spam|w1, w2, w3, w4). However, the equation is pretty long for just four words. Imagine how the equation would look for a 50-word message, and how many calculations we'd have to perform.

To make the calculations tractable for messages of all kinds of lengths, we can assume conditional independence between w1, w2, w3, and w4. This implies that:

$\begin{aligned}
&P(w_1 | w_2 \cap w_3 \cap w_4 \cap Spam) = P(w_1|Spam) \\
&P(w_2 | w_3 \cap w_4 \cap Spam) = P(w_2|Spam) \\ 
&P(w_3 | w_4 \cap Spam) = P(w_3|Spam) \\
&P(w_4|Spam) = P(w_4|Spam) \\
\end{aligned}$

Under the assumption of conditional independence, our lengthy equation above reduces to:

$\begin{equation}
P(w_1 \cap w_2 \cap w_3 \cap w_4 \cap Spam) = P(w_1|Spam) \cdot P(w_2|Spam) \cdot P(w_3|Spam) \cdot P(w_4|Spam) \cdot P(Spam)
\end{equation}$
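The effect of the assumption can be checked numerically. The sketch below is only an illustration: it uses made-up probabilities, treats each word as a present/absent event, and builds a small joint distribution over (spam, $w_1$, $w_2$) that is conditionally independent by construction. It then confirms that $P(w_1 | w_2 \cap Spam) = P(w_1|Spam)$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical probabilities, chosen only for illustration.
p_spam = Fraction(1, 2)
p_w1_given = {True: Fraction(4, 7), False: Fraction(2, 9)}  # P(w1 present | class); True = spam
p_w2_given = {True: Fraction(2, 7), False: Fraction(1, 9)}  # P(w2 present | class)

# Joint distribution over (is_spam, w1_present, w2_present), built so that
# w1 and w2 are conditionally independent given the class.
joint = {}
for is_spam, w1, w2 in product([True, False], repeat=3):
    p_class = p_spam if is_spam else 1 - p_spam
    p1 = p_w1_given[is_spam] if w1 else 1 - p_w1_given[is_spam]
    p2 = p_w2_given[is_spam] if w2 else 1 - p_w2_given[is_spam]
    joint[(is_spam, w1, w2)] = p_class * p1 * p2

def prob(predicate):
    """Sum the probability of every outcome that satisfies the predicate."""
    return sum(p for outcome, p in joint.items() if predicate(*outcome))

# P(w1 | w2 ∩ Spam), computed directly from the joint distribution.
p_w1_given_w2_and_spam = (prob(lambda s, a, b: s and a and b) /
                          prob(lambda s, a, b: s and b))
# P(w1 | Spam), computed directly from the joint distribution.
p_w1_given_spam = prob(lambda s, a, b: s and a) / prob(lambda s, a, b: s)

print(p_w1_given_w2_and_spam == p_w1_given_spam)  # True
```

Knowing that $w_2$ occurred does not change the probability of $w_1$ once we already know the message is spam, which is exactly what lets us drop the extra conditions.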

The assumption of conditional independence is unrealistic in practice because words often depend on one another. For instance, if you see the word "WINNER" in a message, the probability of seeing the word "money" is very likely to increase, so "WINNER" and "money" are most likely dependent. The assumption of conditional independence between words is thus naive since it rarely holds in practice, and this is why the algorithm is called Naive Bayes (also called simple Bayes or independence Bayes).

Despite this simplifying assumption, the algorithm works quite well in many real-world situations, and we'll see that ourselves in the guided project.

That being said, on the previous screen we assumed conditional independence when we introduced these two equations:

$\begin{aligned}
P(Spam | w_1,w_2,w_3,w_4) &\propto P(Spam) \cdot P(w_1|Spam) \cdot P(w_2|Spam) \cdot P(w_3|Spam) \cdot P(w_4|Spam) \\
P(Spam^C | w_1,w_2,w_3,w_4) &\propto P(Spam^C) \cdot P(w_1|Spam^C) \cdot P(w_2|Spam^C) \cdot P(w_3|Spam^C) \cdot P(w_4|Spam^C)
\end{aligned}$

On the next screen, we'll make the equations more general to account for any number of words.

#### Generalized equation

On the previous screen, we learned about the conditional independence assumption, which is central to the Naive Bayes algorithm. As a result of the assumption, we saw we can use these simplified equations:

$\begin{aligned}
P(Spam | w_1,w_2,w_3,w_4) &\propto P(Spam) \cdot P(w_1|Spam) \cdot P(w_2|Spam) \cdot P(w_3|Spam) \cdot P(w_4|Spam) \\
P(Spam^C | w_1,w_2,w_3,w_4) &\propto P(Spam^C) \cdot P(w_1|Spam^C) \cdot P(w_2|Spam^C) \cdot P(w_3|Spam^C) \cdot P(w_4|Spam^C)
\end{aligned}$

The equations above work for messages that have four words, but we need a more general form to use with messages of various word lengths.

Say a new message has $n$ words, where $n$ can be any positive integer (1, 2, 3, ..., 50, 51, 52, ...). If we wanted to find $P(Spam|w_1, w_2, \ldots, w_n)$, this is an equation we could use:

$\begin{equation}
P(Spam | w_1,w_2, \ldots, w_n) \propto P(Spam) \cdot P(w_1|Spam) \cdot P(w_2|Spam) \cdot \ldots \cdot P(w_n|Spam)
\end{equation}$

Notice that there's a certain pattern in the equation above — after P(Spam), the only thing that changes is the word number.

Whenever we have a product that follows a pattern like that, it's common to use the $\prod$ symbol (this is the uppercase Greek letter "pi"). So the equation above simplifies to:

$\begin{equation}
P(Spam | w_1,w_2, \ldots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}$

The equation above is the same as:

$\begin{equation}
P(Spam | w_1,w_2, \ldots, w_n) \propto P(Spam) \cdot \overbrace{P(w_1|Spam) \cdot P(w_2|Spam) \cdot \ldots \cdot P(w_n|Spam)}^{\displaystyle \prod_{i=1}^{n}P(w_i|Spam)}
\end{equation}$

Applying the same reasoning to $P(Spam^C|w_1, w_2, \ldots, w_n)$, we have:

$\begin{equation}
P(Spam^C | w_1,w_2, \ldots, w_n) \propto P(Spam^C) \cdot \prod_{i=1}^{n}P(w_i|Spam^C)
\end{equation}$
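The general equations translate almost directly into code. The sketch below is a minimal illustration, not the guided project's implementation: the function name `naive_bayes_score` and the probability values are made up, and `math.prod` requires Python 3.8 or later.

```python
from math import prod

def naive_bayes_score(prior, word_probabilities):
    """Return prior * product of the per-word conditional probabilities.

    The result is proportional to P(class | w_1, ..., w_n), not equal to it.
    """
    return prior * prod(word_probabilities)

# Hypothetical values for a three-word message.
p_spam = 0.5
p_words_given_spam = [0.3, 0.1, 0.2]  # P(w_1|Spam), P(w_2|Spam), P(w_3|Spam)

score = naive_bayes_score(p_spam, p_words_given_spam)
print(score)
```

Because the function takes a list, the same code works for a message of any length $n$.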

Now that we have these general equations, we're going to discuss a few edge cases on the next few screens. After this, we'll be ready to start working on the guided project, where we'll work with over 5,000 real messages.

#### Edge cases

On a previous screen, we looked at a few messages that were already classified:

![image3.png](image3.png)

Above, we have four messages and nine unique words: "secret", "party", "at", "my", "place", "money", "you", "know", "the". We call the set of unique words a vocabulary.

Now, what if we receive a new message that contains words that are not part of the vocabulary? How do we calculate probabilities for these kinds of words?

For instance, say we received the message "secret code to unlock the money".

![image4.png](image4.png)

Notice that for this new message:

- The words "code", "to", and "unlock" are not part of the vocabulary.
- The word "secret" is part of both spam and non-spam messages.
- The word "money" is only part of the spam messages and is missing from the non-spam messages.
- The word "the" is missing from the spam messages and is only part of the non-spam messages.

Whenever we have to deal with words that are not part of the vocabulary, one solution is to ignore them when we're calculating probabilities. If we wanted to calculate P(Spam|"secret code to unlock the money"), we could skip calculating P("code"|Spam), P("to"|Spam), and P("unlock"|Spam) because "code", "to", and "unlock" are not part of the vocabulary:

$\begin{equation}
P(Spam|\text{"secret code to unlock the money"}) \propto P(Spam) \cdot {P(\text{"secret"}|Spam) \cdot P(\text{"the"}|Spam) \cdot P(\text{"money"}|Spam)}
\end{equation}$

We can also apply the same reasoning for calculating $P(Spam^C|\text{"secret code to unlock the money"})$:

$\begin{equation}
P(Spam^C|\text{"secret code to unlock the money"}) \propto P(Spam^C) \cdot P(\text{"secret"}|Spam^C) \cdot P(\text{"the"}|Spam^C) \cdot P(\text{"money"}|Spam^C)
\end{equation}$
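The step of ignoring out-of-vocabulary words can be sketched in a few lines. The vocabulary is the nine-word set listed above; splitting on whitespace and lowercasing are simplifying assumptions about how a message is turned into words.

```python
# The nine-word vocabulary from the four classified messages.
vocabulary = {"secret", "party", "at", "my", "place",
              "money", "you", "know", "the"}

message = "secret code to unlock the money"

# Keep only the words we have seen before; "code", "to", and "unlock" are dropped.
known_words = [word for word in message.lower().split() if word in vocabulary]

print(known_words)  # ['secret', 'the', 'money']
```

Only the words in `known_words` contribute a factor to the two products above.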

Let's now calculate $P(Spam|\text{"secret code to unlock the money"})$ and $P(Spam^C|\text{"secret code to unlock the money"})$, and see what we get.

### Exercise 10

$P(Spam|\text{"secret code to unlock the money"})$ is already calculated for you. Use the table below (the same as above) to calculate $P(Spam^C|\text{"secret code to unlock the money"})$.

![](image5.png)

Calculate $P(Spam^C|\text{"secret code to unlock the money"})$. Assign your answer to p_non_spam_given_message.
Print p_spam_given_message and p_non_spam_given_message. Why do you think we got these values? We'll discuss this further on the next screen.

```py
p_spam = 2/4
p_secret_given_spam = 4/7
p_the_given_spam = 0/7
p_money_given_spam = 2/7
p_spam_given_message = (p_spam * p_secret_given_spam *
                        p_the_given_spam * p_money_given_spam)

p_non_spam = 2/4
p_secret_given_non_spam = 2/9
p_the_given_non_spam = 1/9
p_money_given_non_spam = 0/9
p_non_spam_given_message = (p_non_spam * p_secret_given_non_spam *
                            p_the_given_non_spam * p_money_given_non_spam)

print(p_spam_given_message)
print(p_non_spam_given_message)
```

#### Additive Smoothing

In the previous exercise, we saw that both $P(Spam|\text{"secret code to unlock the money"})$ and $P(Spam^C|\text{"secret code to unlock the money"})$ were equal to 0. This will always happen when a message contains words that occur in only one category: "money" occurs only in spam messages, while "the" occurs only in non-spam messages.

![](image6.png)

When we calculate P(Spam|"secret code to unlock the money"), we can see that P("the"|Spam) is equal to 0 because "the" is not part of the spam messages. Unfortunately, that single value of 0 has the drawback of turning the result of the entire equation to 0:

$\begin{aligned}
P(Spam|\text{"secret code to unlock the money"}) &\propto P(Spam) \cdot P(\text{"secret"}|Spam) \cdot P(\text{"the"}|Spam) \cdot P(\text{"money"}|Spam) \\
&= \frac{2}{4} \cdot \frac{4}{7} \cdot \frac{0}{7} \cdot \frac{2}{7} = 0
\end{aligned}$

To fix this problem, we need to find a way to avoid these cases where we get probabilities of 0. Let's start by laying out the equation we're using to calculate P("the"|Spam):

$\begin{equation}
P(\text{"the"}|Spam) = \frac{\text{total number of times "the" occurs in spam messages}}{\text{total number of words in spam messages}} = \frac{0}{7}
\end{equation}$

We're going to add some notation and rewrite the equation above as:

$\begin{equation}
P(\text{"the"}|Spam) = \frac{N_{\text{"the"}|Spam}}{N_{Spam}} = \frac{0}{7}
\end{equation}$

To fix the problem, we're going to use a technique called additive smoothing, where we add a smoothing parameter $\alpha$. In the equation below, we'll use $\alpha = 1$, and $N_{Vocabulary}$ represents the number of unique words in all the messages (both spam and non-spam).

$\begin{equation}
P(\text{"the"}|Spam) = \frac{N_{\text{"the"}|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} = \frac{0 + 1}{7 + 1 \cdot 9} = \frac{1}{16}
\end{equation}$

The additive smoothing technique solves the issue and gets us a non-zero result, but it introduces another problem. We're now calculating probabilities differently depending on the word — take P("the"|Spam) and P("secret"|Spam) for instance:

$\begin{aligned}
P(\text{"the"}|Spam) &= \frac{N_{\text{"the"}|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(\text{"secret"}|Spam) &= \frac{N_{\text{"secret"}|Spam}}{N_{Spam}}
\end{aligned}$

Words like "the" are thus given special treatment and their probabilities are increased artificially to avoid values of 0, while words like "secret" are treated normally. To keep the probability values proportional across all words, we're going to use additive smoothing for every word:

$\begin{aligned}
P(\text{"the"}|Spam) &= \frac{N_{\text{"the"}|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(\text{"secret"}|Spam) &= \frac{N_{\text{"secret"}|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{aligned}$

In more general terms, this is the equation that we'll need to use for every word:

$\begin{equation}
P(word|Spam) = \frac{N_{\text{word}|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}$
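The general equation can be wrapped in a small helper. This is a sketch under stated assumptions: the function name `smoothed_probability` is made up for illustration, and `Fraction` is used only so the results print as exact fractions that can be compared with the worked example (7 words in spam messages, 9 words in the vocabulary).

```python
from fractions import Fraction

def smoothed_probability(word_count, total_words, vocabulary_size, alpha=1):
    """Additive smoothing: (N_word + alpha) / (N_class + alpha * N_vocabulary)."""
    alpha = Fraction(alpha)  # keeps the arithmetic exact for integer or fractional alpha
    return (word_count + alpha) / (total_words + alpha * vocabulary_size)

# "the" occurs 0 times and "secret" occurs 4 times in the spam messages.
p_the_given_spam = smoothed_probability(0, 7, 9)
p_secret_given_spam = smoothed_probability(4, 7, 9)

print(p_the_given_spam)     # 1/16
print(p_secret_given_spam)  # 5/16
```

Passing `alpha=0` recovers the unsmoothed estimate, which is 0 for "the".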

As a side note, when $\alpha = 1$, the additive smoothing technique is most commonly known as Laplace smoothing (or add-one smoothing). However, it is also possible to use $\alpha < 1$, in which case the technique is called Lidstone smoothing. If you want to learn more about additive smoothing, you can start [here](https://en.wikipedia.org/wiki/Additive_smoothing).

Let's now recalculate the probabilities for the message "secret code to unlock the money" and try to classify the message.

$P(Spam|\text{"secret code to unlock the money"})$ is already calculated for you. Use the table below (the same as above) to calculate $P(Spam^C|\text{"secret code to unlock the money"})$.

![](image7.png)

- Using the additive smoothing technique, calculate $P(Spam^C|\text{"secret code to unlock the money"})$. Assign your answer to p_non_spam_given_message.
- Compare p_spam_given_message and p_non_spam_given_message to classify the message as spam or non-spam. If you think it's spam, then assign the string 'spam' to classification. Otherwise, assign 'non-spam'.

```py
p_spam = 2/4
p_secret_given_spam = (4 + 1) / (7 + 9)
p_the_given_spam = (0 + 1) / (7 + 9)
p_money_given_spam = (2 + 1) / (7 + 9)
p_spam_given_message = (p_spam * p_secret_given_spam *
                        p_the_given_spam * p_money_given_spam)

p_non_spam = 2/4
p_secret_given_non_spam = (2 + 1)/(9 + 9)
p_the_given_non_spam = (1 + 1)/(9 + 9)
p_money_given_non_spam = (0 + 1)/(9 + 9)
p_non_spam_given_message = (p_non_spam * p_secret_given_non_spam *
                            p_the_given_non_spam * p_money_given_non_spam)

classification = 'spam' if p_spam_given_message > p_non_spam_given_message else 'non-spam'
```
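The two values compared above are only proportional to the posterior probabilities, which is enough for classification. As an optional sketch, assuming spam and non-spam are the only two classes, dividing each value by their sum turns them into probabilities that add up to 1. The variable names below are illustrative.

```python
# Smoothed scores for "secret code to unlock the money" (alpha = 1).
spam_score = (2/4) * (5/16) * (1/16) * (3/16)
non_spam_score = (2/4) * (3/18) * (2/18) * (1/18)

# Normalize so the two posteriors sum to 1.
total = spam_score + non_spam_score
p_spam_normalized = spam_score / total
p_non_spam_normalized = non_spam_score / total

print(p_spam_normalized, p_non_spam_normalized)
```

Normalizing never changes which class has the larger value, so the classification stays the same.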

The Naive Bayes algorithm can be used for more than just building spam filters. For instance, we could use it to perform sentiment analysis for Twitter messages — the input is a Twitter message, and the output is the sentiment type (positive or negative). This follows the same pattern we saw with our spam filter, where the input is a new SMS message and the output is the message type (spam or non-spam).

Depending on the math and the assumptions used, the Naive Bayes algorithm has a few variations. The three most popular Naive Bayes algorithms are:

- Multinomial Naive Bayes
- Gaussian Naive Bayes
- Bernoulli Naive Bayes

In this mission, we learned the multinomial Naive Bayes version of the algorithm. Explaining the mathematical differences between the various versions is out of the scope of this course, but it's important to keep in mind that all the Naive Bayes algorithms build on the (naive) conditional independence assumption we learned about earlier in this mission.

On the next screen, we'll summarize everything we've done so far, and then we'll wrap up this mission!

To summarize everything we've done so far, these are the two equations we can use for our spam filtering problem moving forward:

$\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}$

$\begin{equation}
P(Spam^C | w_1,w_2, ..., w_n) \propto P(Spam^C) \cdot \prod_{i=1}^{n}P(w_i|Spam^C)
\end{equation}$

To calculate $P(w_i|Spam)$ and $P(w_i|Spam^C)$, we need to use the additive smoothing technique:

$\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}$

$\begin{equation}
P(w_i|Spam^C) = \frac{N_{w_i|Spam^C} + \alpha}{N_{Spam^C} + \alpha \cdot N_{Vocabulary}}
\end{equation}$

Let's also summarize what the terms in the equations above mean:

$\begin{aligned}
&N_{w_i|Spam} = \text{the number of times the word } w_i \text{ occurs in spam messages} \\
&N_{w_i|Spam^C} = \text{the number of times the word } w_i \text{ occurs in non-spam messages} \\
\\
&N_{Spam} = \text{total number of words in spam messages} \\
&N_{Spam^C} = \text{total number of words in non-spam messages} \\
\\
&N_{Vocabulary} = \text{total number of unique words in the vocabulary} \\
&\alpha = 1 \ \ \ \ (\alpha \text{ is a smoothing parameter})
\end{aligned}$

It's worth emphasizing that:

- $N_{Spam}$ is equal to the number of words in all the spam messages. It's not equal to the number of spam messages, and it's not equal to the total number of unique words in spam messages.
- $N_{Spam^C}$ is equal to the number of words in all the non-spam messages. It's not equal to the number of non-spam messages, and it's not equal to the total number of unique words in non-spam messages.
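To tie the summary together, here is a minimal end-to-end sketch of the equations above. It is not the guided project's implementation. The four training messages are an assumption: the original table is an image, so they were chosen to match the counts quoted in the text (7 spam words, 9 non-spam words, 9 vocabulary words). The names `score` and `classify` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from math import prod

# Assumed training data, consistent with the counts used in the text.
training = [
    ("secret party at my place", "non-spam"),
    ("secret money secret secret", "spam"),
    ("money secret place", "spam"),
    ("you know the secret", "non-spam"),
]
ALPHA = 1  # smoothing parameter

# Count word occurrences per class (N_{w_i|class}) and messages per class.
word_counts = {"spam": Counter(), "non-spam": Counter()}
message_counts = Counter()
for text, label in training:
    word_counts[label].update(text.split())
    message_counts[label] += 1

vocabulary = set(word_counts["spam"]) | set(word_counts["non-spam"])

def score(message, label):
    """Return prior * product of smoothed P(w_i | label), ignoring unknown words."""
    prior = Fraction(message_counts[label], len(training))
    n_label = sum(word_counts[label].values())  # total words in this class
    known = [w for w in message.split() if w in vocabulary]
    likelihoods = [
        Fraction(word_counts[label][w] + ALPHA, n_label + ALPHA * len(vocabulary))
        for w in known
    ]
    return prior * prod(likelihoods)

def classify(message):
    """Pick the class with the larger score."""
    if score(message, "spam") > score(message, "non-spam"):
        return "spam"
    return "non-spam"

print(classify("secret code to unlock the money"))  # spam
```

Note that `n_label` counts every word occurrence in the class, repeats included, which matches the definition of $N_{Spam}$ and $N_{Spam^C}$ emphasized above.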

In the next mission, which is a guided project, we'll use the multinomial Naive Bayes algorithm to create a spam filter, and we'll use a dataset of over 5,000 SMS messages.