# Bernoulli Naive Bayes

## Probability Basics

Probability tells us how likely an event is to happen. It is always a number between 0 (impossible) and 1 (certain).

In this section, we’ll learn 3 essential concepts:

1. Marginal Probability  
2. Joint Probability  
3. The Product Rule (with Conditional Probability)

We’ll use **coin flips** for all examples.

### 1. Marginal Probability

This is the probability of a single event happening.

**Example — Flipping a coin:**

- The probability of getting Heads:

$$
P(\text{Heads}) = \frac{1}{2}
$$

- The probability of getting Tails:

$$
P(\text{Tails}) = \frac{1}{2}
$$

These are called **marginal probabilities** — they only involve one event.

### 2. Joint Probability

This is the probability of **two events happening together**:

$$
P(A \cap B)
$$

**Example — Flipping two coins:**

- What’s the probability of getting Heads on the first **and** Heads on the second?

If the flips are **independent**:

$$
P(\text{Heads on both}) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}
$$

So for **independent events only**, the joint probability is:

$$
P(A \cap B) = P(A) \cdot P(B)
$$
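
The independence rule can be checked by brute force. The sketch below enumerates the four equally likely outcomes of two fair coin flips and compares the counted joint probability with the product of the marginals. The helper `prob` and the variable names are illustrative, not part of any library.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two fair coin flips: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event as (matching outcomes) / (all outcomes)."""
    matching = [o for o in outcomes if event(o)]
    return Fraction(len(matching), len(outcomes))

p_first_heads = prob(lambda o: o[0] == "H")      # marginal P(A)
p_second_heads = prob(lambda o: o[1] == "H")     # marginal P(B)
p_both_heads = prob(lambda o: o == ("H", "H"))   # joint P(A and B)

print(p_first_heads, p_second_heads, p_both_heads)  # 1/2 1/2 1/4
```

Using `Fraction` keeps the arithmetic exact, so the equality $P(A \cap B) = P(A) \cdot P(B)$ can be tested without rounding concerns.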

If the events are **not independent**, we need the **product rule**.

### 3. The Product Rule (with Conditional Probability)

The **product rule** handles both independent and dependent events:

$$
P(A \cap B) = P(A \mid B) \cdot P(B)
$$

This introduces **conditional probability**:

- $P(A \mid B)$ means "the probability of $A$, given that $B$ has occurred."

**Example — Dependent coin flips:**

- $P(\text{First = Heads}) = \frac{1}{2}$
- If the first flip is Heads, we use a **special biased coin** for the second flip:  
  $P(\text{Second = Heads} \mid \text{First = Heads}) = 0.75$

Then:

$$
P(\text{Both Heads}) = \frac{1}{2} \cdot 0.75 = 0.375
$$

Compared with the 0.25 we got for two independent fair flips, the biased second coin raises the joint probability to 0.375.
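
One way to sanity-check the product rule is a quick simulation. The sketch below assumes that the second coin is fair when the first flip is Tails; the example above does not specify that case, and it does not affect the probability of two Heads. The seed and trial count are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

P_SECOND_HEADS_GIVEN_FIRST_HEADS = 0.75  # from the example above
P_SECOND_HEADS_GIVEN_FIRST_TAILS = 0.5   # assumption: a fair coin otherwise

trials = 100_000
both_heads = 0
for _ in range(trials):
    first_heads = random.random() < 0.5
    p_second = (P_SECOND_HEADS_GIVEN_FIRST_HEADS if first_heads
                else P_SECOND_HEADS_GIVEN_FIRST_TAILS)
    second_heads = random.random() < p_second
    both_heads += first_heads and second_heads

estimate = both_heads / trials
# Product rule: P(First = Heads) * P(Second = Heads | First = Heads)
exact = 0.5 * P_SECOND_HEADS_GIVEN_FIRST_HEADS
print(f"simulated: {estimate:.3f}, product rule: {exact:.3f}")
```

The simulated frequency should land close to the product-rule value of 0.375, with small random variation from run to run if the seed is changed.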

### From Product Rule to Bayes’ Theorem

We can also write the product rule the other way:

$$
P(B \cap A) = P(B \mid A) \cdot P(A)
$$

Since $P(A \cap B) = P(B \cap A)$, we can set them equal:

$$
P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)
$$

Now divide both sides by $P(B)$ (assuming $P(B) > 0$):

$$
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
$$

This is **Bayes’ Theorem** — a way to reverse the direction of conditional probability.
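
As a small numeric illustration, Bayes' Theorem can reverse the dependent coin example: given that the second flip was Heads, how likely is it that the first was Heads? The sketch assumes a fair second coin after Tails (not stated in the example) and computes $P(B)$ by summing over both cases of the first flip, a step the text does not cover.

```python
p_first_heads = 0.5
p_second_heads_given_first_heads = 0.75   # from the dependent coin example
p_second_heads_given_first_tails = 0.5    # assumption: fair coin after Tails

# P(B): total probability of Heads on the second flip,
# summed over both possible results of the first flip.
p_second_heads = (p_second_heads_given_first_heads * p_first_heads
                  + p_second_heads_given_first_tails * (1 - p_first_heads))

# Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_first_heads_given_second_heads = (
    p_second_heads_given_first_heads * p_first_heads / p_second_heads
)
print(p_first_heads_given_second_heads)  # 0.6
```

Seeing Heads on the second flip makes Heads on the first flip more likely (0.6 instead of 0.5), because the biased coin is only used after Heads.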

## Bayes' Theorem for Classification

Now let:
- $A = \text{Spam}$  
- $B = \text{Words in the email}$

$$
P(\text{Spam} \mid \text{Words}) = \frac{P(\text{Words} \mid \text{Spam}) \cdot P(\text{Spam})}{P(\text{Words})}
$$

Both terms in the numerator can be estimated from training data:

- How often certain words appear in spam:  
  → $P(\text{Words} \mid \text{Spam})$

- The overall proportion of spam:  
  → $P(\text{Spam})$

Bayes’ Theorem helps us compute $P(\text{Spam} \mid \text{Words})$ — the chance an email is spam **given** the words it contains.

We usually compare two classes:

$$
\frac{P(\text{Words} \mid \text{Spam}) \cdot P(\text{Spam})}{P(\text{Words})}
\quad \text{vs} \quad
\frac{P(\text{Words} \mid \text{Not Spam}) \cdot P(\text{Not Spam})}{P(\text{Words})}
$$

Since the denominator $P(\text{Words})$ is the same for both classes, it does not affect the comparison and we can drop it:

$$
P(\text{Spam} \mid \text{Words}) \propto P(\text{Words} \mid \text{Spam}) \cdot P(\text{Spam})
$$

> The symbol $\propto$ means "proportional to" — we're only interested in which class has the higher score, so we can safely ignore the shared denominator.
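
The effect of dropping the denominator can be seen with a few made-up numbers. In the sketch below, every probability is hypothetical and chosen only for illustration; it also assumes the two classes cover all emails, so that $P(\text{Words})$ is the sum of the two numerators.

```python
# Hypothetical numbers, chosen only for illustration.
likelihood = {"spam": 0.02, "not_spam": 0.005}   # P(Words | class)
prior = {"spam": 0.4, "not_spam": 0.6}           # P(class)

# Unnormalized scores: the numerator of Bayes' Theorem for each class.
score = {c: likelihood[c] * prior[c] for c in prior}

# Shared denominator P(Words), assuming the two classes are exhaustive.
p_words = sum(score.values())
posterior = {c: score[c] / p_words for c in score}

best_by_score = max(score, key=score.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_score, best_by_posterior)  # spam spam
```

Dividing every score by the same positive number changes the values but not their order, so the predicted class is identical either way.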

## The Naive Assumption

Just as assuming independent coin flips makes a joint probability easy to compute, we assume that all words are **conditionally independent** given the class. Writing $x_1, \dots, x_n$ for the individual word features of the email, the overall likelihood is then the product of the per-word probabilities:


$$
P(\text{Words} \mid \text{Spam}) = P(x_1 \mid \text{Spam}) \cdot P(x_2 \mid \text{Spam}) \cdot \cdots \cdot P(x_n \mid \text{Spam}) = \prod_{i=1}^{n} P(x_i \mid \text{Spam})
$$

> The symbol $\prod$ is the product symbol — like $\sum$ but for multiplication.

It's called *Naive* because this independence assumption is rarely true in real language — but it works surprisingly well in practice.

## Calculating Word Probabilities in Spam Emails

Under the naive assumption, the words are treated as conditionally independent given the class, so we only need a way to estimate how likely each individual word is to appear in a spam email.

Here’s how it works:

### 1. Learn Word Probabilities from Training Data

From the training data, we count how many spam emails contain each word.

Let’s define:

- $x_i = 1$ means the $i$-th word is **present** in the email  
- $x_i = 0$ means the $i$-th word is **absent**

We want to estimate:

$$
\theta_i = P(x_i = 1 \mid \text{Spam}) = \frac{\text{Number of spam emails containing word } i}{\text{Total number of spam emails}}
$$

This gives us the probability that a spam email **contains** word $i$ at least once.

#### To Avoid Zero Probability: Laplace Smoothing

Sometimes a word might never appear in spam emails, which would give us a probability of 0 and cause the whole product to collapse.

To fix that, we use **Laplace smoothing**:

$$
\theta_i = \frac{\text{Number of spam emails containing word } i + \alpha}{\text{Total number of spam emails} + 2\alpha}
$$

Where:
- $\alpha$ is a small constant (usually 1)
- $2\alpha$ accounts for both word **present** and **absent** cases
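
The smoothed estimate can be sketched in a few lines. The tiny training set, the vocabulary, and the function name `estimate_theta` below are all hypothetical; each spam email is represented as the set of words it contains, which matches the present/absent view of the model.

```python
# Hypothetical training set: each spam email as the set of words it contains.
spam_emails = [
    {"free", "money", "now"},
    {"free", "offer"},
    {"meeting", "money"},
    {"free", "click", "now"},
]
vocabulary = ["free", "money", "meeting", "lottery"]

def estimate_theta(emails, vocabulary, alpha=1):
    """Smoothed estimate of P(word present | class) for each vocabulary word."""
    n_emails = len(emails)
    theta = {}
    for word in vocabulary:
        # Number of emails that contain the word at least once.
        n_containing = sum(word in email for email in emails)
        theta[word] = (n_containing + alpha) / (n_emails + 2 * alpha)
    return theta

theta_spam = estimate_theta(spam_emails, vocabulary, alpha=1)
print(theta_spam)
```

The word "lottery" never appears in these spam emails, yet its estimate is 1/6 rather than 0, so it can no longer zero out the whole product.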

### 2. Check the Email for Each Word

For a new email, we check each vocabulary word $i$ and record whether it is present ($x_i = 1$) or absent ($x_i = 0$):

- If the word **is present**: we use $\theta_i$  
- If the word **is absent**: we use $1 - \theta_i$

We can express both in a single formula:

$$
P(x_i \mid \text{Spam}) = \theta_i^{x_i} \cdot (1 - \theta_i)^{1 - x_i}
$$

Why it works:

- If $x_i = 1$:  
  $P(x_i \mid \text{Spam}) = \theta_i$

- If $x_i = 0$:  
  $P(x_i \mid \text{Spam}) = 1 - \theta_i$
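
The single formula translates directly into code. This is a minimal sketch; the function name `bernoulli_prob` and the value 0.7 are illustrative assumptions.

```python
def bernoulli_prob(x_i, theta_i):
    """P(x_i | class) = theta_i**x_i * (1 - theta_i)**(1 - x_i), for x_i in {0, 1}."""
    return theta_i ** x_i * (1 - theta_i) ** (1 - x_i)

theta_i = 0.7  # hypothetical value of P(x_i = 1 | Spam)

print(bernoulli_prob(1, theta_i))  # word present: theta_i
print(bernoulli_prob(0, theta_i))  # word absent: 1 - theta_i, up to float rounding
```

The exponent acts as a switch: whichever factor has exponent 0 becomes 1 and drops out of the product.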

### 3. Multiply All Word Probabilities Together

Since we assumed all words are conditionally independent:

$$
P(\text{Words} \mid \text{Spam}) = \prod_{i=1}^{n} \theta_i^{x_i} \cdot (1 - \theta_i)^{1 - x_i}
$$

So we multiply each word’s probability — whether it’s present or not — to get the full likelihood for the email.

This is how Bernoulli Naive Bayes models text classification.

### 4. Combine with Prior

We also multiply by the prior probability of spam:

$$
P(\text{Spam}) = \frac{\text{Number of spam emails}}{\text{Total number of emails}}
$$

So the full expression becomes:

$$
P(\text{Spam} \mid \text{Words}) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} \theta_i^{x_i} \cdot (1 - \theta_i)^{1 - x_i}
$$
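
Putting the pieces together, a score of this form can be computed per class and compared. The sketch below uses hypothetical parameters for a three-word vocabulary, and it assumes a second set of $\theta_i$ values estimated the same way from non-spam emails, which the text implies but does not spell out.

```python
import math

def score(x, theta, prior):
    """Unnormalized posterior: prior times the product of Bernoulli terms."""
    likelihood = math.prod(t if x_i else 1 - t for x_i, t in zip(x, theta))
    return prior * likelihood

# Hypothetical parameters; theta[c][i] = P(word i present | class c).
theta = {"spam": [0.8, 0.5, 0.1], "not_spam": [0.1, 0.3, 0.6]}
prior = {"spam": 0.4, "not_spam": 0.6}

x = [1, 1, 0]  # words 1 and 2 present, word 3 absent
scores = {c: score(x, theta[c], prior[c]) for c in prior}
prediction = max(scores, key=scores.get)
print(prediction)  # spam
```

Note that absent words contribute too: word 3 is rare in spam but common in non-spam here, so its absence counts as evidence for spam.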

## But There’s a Problem: Underflow

Multiplying many small probabilities can make the result **extremely tiny** — too small for the computer to represent accurately.

**Example:**

$$
0.9 \cdot 0.6 \cdot 0.1 \cdot 0.02 = 0.00108
$$

With hundreds of terms, the value might become so small it gets rounded to **0**. This is called **underflow**, and it breaks our calculation.
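
Underflow is easy to reproduce. The sketch below multiplies 400 hypothetical probabilities of 0.1 each; the exact answer is $10^{-400}$, which is far below what a 64-bit float can hold.

```python
import math
import sys

# 400 hypothetical per-word probabilities of 0.1 each.
probs = [0.1] * 400

product = math.prod(probs)

print(sys.float_info.min)  # smallest positive normalized double, about 2.2e-308
print(product)             # 0.0
```

Once the running product hits 0.0 it stays there, so every class ends up with the same score and the comparison becomes meaningless.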

### The Fix: Use Logarithms

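Instead of multiplying probabilities, we add their **logarithms**. The log is a monotonically increasing function, so the class with the highest score still has the highest log-score, and a sum of moderately sized negative numbers does not underflow.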

### Log Rules We’ll Use

We’ll use these two key log rules:

- $\log(a \cdot b) = \log a + \log b$  
- $\log(a^b) = b \cdot \log a$  

### Why It Works: A Simple Example

Let’s take the log of the earlier multiplication:

$$
\log(0.9 \cdot 0.6 \cdot 0.1 \cdot 0.02) = \log(0.9) + \log(0.6) + \log(0.1) + \log(0.02)
$$

Approximate values (using base-10 logs):

$$
= -0.046 + (-0.222) + (-1.0) + (-1.699) = -2.967
$$

Instead of a tiny number like **0.00108**, we now get a manageable value: **-2.967**
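
The identity can be verified numerically. This sketch uses base-10 logarithms via `math.log10`; any base works as long as it is used consistently. The 400-term list at the end is a hypothetical stress case.

```python
import math

probs = [0.9, 0.6, 0.1, 0.02]

log_of_product = math.log10(math.prod(probs))
sum_of_logs = sum(math.log10(p) for p in probs)

print(round(log_of_product, 3), round(sum_of_logs, 3))  # -2.967 -2.967

# With many terms the raw product would underflow to 0.0,
# but the sum of logs stays an ordinary number (about -400 here).
many_probs = [0.1] * 400
many_sum_of_logs = sum(math.log10(p) for p in many_probs)
print(many_sum_of_logs)
```

The two quantities agree up to floating-point rounding, and the log-sum remains usable where the product does not.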

### Apply to Bernoulli Naive Bayes

We start from:

$$
P(\text{Spam} \mid \text{Words}) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} \theta_i^{x_i} \cdot (1 - \theta_i)^{1 - x_i}
$$

Take the log of both sides. A proportionality becomes an equality up to an additive constant $C = -\log P(\text{Words})$, which is the same for every class and can be ignored when comparing classes:

$$
\log P(\text{Spam} \mid \text{Words}) = \log P(\text{Spam}) + \log \prod_{i=1}^{n} \theta_i^{x_i} \cdot (1 - \theta_i)^{1 - x_i} + C
$$

Convert the product into a sum:

$$
\log P(\text{Spam} \mid \text{Words}) = \log P(\text{Spam}) + \sum_{i=1}^{n} \log\left( \theta_i^{x_i} \cdot (1 - \theta_i)^{1 - x_i} \right) + C
$$

Break inside the log:

$$
\log P(\text{Spam} \mid \text{Words}) = \log P(\text{Spam}) + \sum_{i=1}^{n} \left[ \log(\theta_i^{x_i}) + \log((1 - \theta_i)^{1 - x_i}) \right] + C
$$

Apply the power rule:

$$
\log P(\text{Spam} \mid \text{Words}) = \log P(\text{Spam}) + \sum_{i=1}^{n} \left[ x_i \log \theta_i + (1 - x_i) \log(1 - \theta_i) \right] + C
$$
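
The final expression maps almost line for line onto code. The following is a minimal sketch, not a reference implementation: the function name `log_score` and all parameter values are hypothetical, and it assumes every $\theta_i$ lies strictly between 0 and 1 (which smoothing with $\alpha > 0$ ensures) so that both logarithms are defined.

```python
import math

def log_score(x, theta, prior):
    """log P(class) + sum_i [x_i log(theta_i) + (1 - x_i) log(1 - theta_i)]."""
    total = math.log(prior)
    for x_i, theta_i in zip(x, theta):
        total += x_i * math.log(theta_i) + (1 - x_i) * math.log(1 - theta_i)
    return total

# Hypothetical smoothed parameters for a three-word vocabulary;
# theta[c][i] = P(word i present | class c).
theta = {"spam": [0.8, 0.5, 0.1], "not_spam": [0.1, 0.3, 0.6]}
prior = {"spam": 0.4, "not_spam": 0.6}

x = [1, 1, 0]  # words 1 and 2 present, word 3 absent
log_scores = {c: log_score(x, theta[c], prior[c]) for c in prior}
prediction = max(log_scores, key=log_scores.get)
print(prediction)  # spam
```

The class with the largest log-score is the prediction. The scores themselves are not log-probabilities, since the shared term involving $P(\text{Words})$ has been left out.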