# Bernoulli Naive Bayes

## Probability Basics

Probability measures how likely an event is. It is always a number between 0 and 1, where 0 means the event never happens and 1 means it always happens.

In this section, we’ll learn 3 essential concepts:

1. Marginal Probability  
2. Joint Probability  
3. The Product Rule (with Conditional Probability)

We’ll use **coin flips** for all examples.

### 1. Marginal Probability

This is the probability of a single event happening.

**Example — Flipping a coin:**

- The probability of getting Heads:

$$
P(\text{Heads}) = \frac{1}{2}
$$

- The probability of getting Tails:

$$
P(\text{Tails}) = \frac{1}{2}
$$

These are called **marginal probabilities** — they only involve one event.

### 2. Joint Probability

This is the probability of **two events happening together**:

$$
P(A \cap B)
$$

**Example — Flipping two coins:**

- What’s the probability of getting Heads on the first **and** Heads on the second?

If the flips are **independent**:

$$
P(\text{Heads on both}) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}
$$

So for **independent events only**, the joint probability is:

$$
P(A \cap B) = P(A) \cdot P(B)
$$
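
As a quick check, the sketch below enumerates the four equally likely outcomes of two fair, independent flips and counts the ones of interest. The variable names (`outcomes`, `p_first_h`, `joint_hh`) are illustrative only.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two fair, independent coin flips.
outcomes = list(product(["H", "T"], repeat=2))

# Marginal probability: Heads on the first flip, ignoring the second.
p_first_h = Fraction(sum(1 for first, _ in outcomes if first == "H"), len(outcomes))

# Joint probability: Heads on the first flip and Heads on the second.
joint_hh = Fraction(sum(1 for o in outcomes if o == ("H", "H")), len(outcomes))

print(p_first_h)  # 1/2
print(joint_hh)   # 1/4
```

Counting outcomes gives the same $\frac{1}{4}$ as multiplying the two marginal probabilities, which is what independence predicts.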

If the events are **not independent**, we need the **product rule**.

### 3. The Product Rule (with Conditional Probability)

The **product rule** handles both independent and dependent events:

$$
P(A \cap B) = P(A \mid B) \cdot P(B)
$$

This introduces **conditional probability**:

- $P(A \mid B)$ means "the probability of $A$, given that $B$ has occurred."

**Example — Dependent coin flips:**

- If the first flip is Heads, the second flip is Heads **75%** of the time.
- $P(\text{First = Heads}) = \frac{1}{2}$
- $P(\text{Second = Heads} \mid \text{First = Heads}) = 0.75$

Then:

$$
P(\text{Both Heads}) = 0.75 \cdot \frac{1}{2} = 0.375
$$
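
The same arithmetic can be written as a small helper. This is only a sketch; the function name `joint_probability` is made up for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def joint_probability(p_a_given_b, p_b):
    """Product rule: P(A and B) = P(A | B) * P(B)."""
    return p_a_given_b * p_b


p_first_heads = Fraction(1, 2)         # P(First = Heads)
p_second_given_first = Fraction(3, 4)  # P(Second = Heads | First = Heads)

p_both_heads = joint_probability(p_second_given_first, p_first_heads)
print(p_both_heads, float(p_both_heads))  # 3/8 0.375
```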

### From Product Rule to Bayes’ Theorem

We can also write the product rule the other way:

$$
P(B \cap A) = P(B \mid A) \cdot P(A)
$$

Since $P(A \cap B) = P(B \cap A)$, we can set them equal:

$$
P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)
$$

Now divide both sides by $P(B)$ (this requires $P(B) > 0$):

$$
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
$$

This is **Bayes’ Theorem** — a way to reverse the direction of conditional probability.
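
A minimal sketch of the reversal, reusing the dependent coin flips from the product rule example. The value of $P(\text{Second = Heads} \mid \text{First = Tails})$ is not given in the text, so the code assumes 0.25 purely for illustration; it is needed to obtain $P(\text{Second = Heads})$ by summing over both cases of the first flip. The function name `bayes` is also hypothetical.

```python
from fractions import Fraction


def bayes(p_b_given_a, p_a, p_b):
    """Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be greater than zero")
    return p_b_given_a * p_a / p_b


p_first_h = Fraction(1, 2)
p_second_h_given_first_h = Fraction(3, 4)
# Assumed for illustration only; the text does not specify this value.
p_second_h_given_first_t = Fraction(1, 4)

# P(Second = Heads), summed over both possibilities for the first flip.
p_second_h = (
    p_second_h_given_first_h * p_first_h
    + p_second_h_given_first_t * (1 - p_first_h)
)

# Reverse the condition: P(First = Heads | Second = Heads).
p_first_h_given_second_h = bayes(p_second_h_given_first_h, p_first_h, p_second_h)
print(p_first_h_given_second_h)  # 3/4
```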

## Bayes' Theorem for Classification

Bayes’ Theorem helps us answer questions like:

> "Given some evidence, how likely is a hypothesis?"

The formula:

$$
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
$$

Now let:
- $A = \text{Spam}$  
- $B = \text{Words in the email}$

Then:

$$
P(\text{Spam} \mid \text{Words}) = \frac{P(\text{Words} \mid \text{Spam}) \cdot P(\text{Spam})}{P(\text{Words})}
$$

### Why This Is Useful

In email spam detection:

- We can estimate from labeled training emails how often certain words appear in spam:  
  → $P(\text{Words} \mid \text{Spam})$

- We can estimate the overall proportion of spam the same way:  
  → $P(\text{Spam})$

But we want:

- The probability an email is spam **given** its words:  
  → $P(\text{Spam} \mid \text{Words})$

Bayes’ Theorem reverses the direction — from "given spam, what's likely to appear" to "given what appears, how likely is spam?"

### Why It's Called "Naive" Bayes

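With a vocabulary of just 10 words, computing:

$$
P(\text{Words} \mid \text{Spam})
$$

...directly requires a probability for every combination of present and absent words. That is $2^{10} = 1024$ combinations per class, and the count doubles with every additional word.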

So we make a **naive assumption**:

> All words are conditionally independent given the class.

The assumption is unrealistic, since words in real emails are correlated, and that is where the name **Naive Bayes** comes from. Even so, the classifier often works well in practice.

### Why We Can Ignore $P(\text{Words})$

We usually compare two classes:

$$
\frac{P(\text{Words} \mid \text{Spam}) \cdot P(\text{Spam})}{P(\text{Words})}
\quad \text{vs} \quad
\frac{P(\text{Words} \mid \text{Not Spam}) \cdot P(\text{Not Spam})}{P(\text{Words})}
$$

Since the denominator is the same, it cancels out:

$$
P(\text{Spam} \mid \text{Words}) \propto P(\text{Words} \mid \text{Spam}) \cdot P(\text{Spam})
$$
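
The sketch below illustrates why dropping the denominator does not change the decision. All numbers are hypothetical and chosen only to make the comparison visible.

```python
# Hypothetical values, not taken from any real dataset.
prior = {"spam": 0.4, "not_spam": 0.6}          # P(class)
likelihood = {"spam": 0.05, "not_spam": 0.01}   # P(Words | class)

# Unnormalized scores: P(Words | class) * P(class).
score = {c: likelihood[c] * prior[c] for c in prior}

# P(Words) is the sum of the scores over both classes.
evidence = sum(score.values())
posterior = {c: score[c] / evidence for c in score}

best_by_score = max(score, key=score.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_score, best_by_posterior)  # spam spam
```

Dividing every score by the same positive number rescales them but cannot change which one is largest, so the predicted class is identical either way.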

## The Naive Assumption

We assume all words are **conditionally independent** given the class:

$$
P(\text{Words} \mid \text{Spam}) = \prod_{i=1}^{n} P(x_i \mid \text{Spam})
$$

Where $n$ is the number of words in the vocabulary and:
- $x_i = 1$ if word $i$ is present
- $x_i = 0$ otherwise

### What is $\prod$?

$\prod$ is the product symbol — like $\sum$ but for multiplication:

$$
\prod_{i=1}^{n} a_i = a_1 \cdot a_2 \cdot \cdots \cdot a_n
$$
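
In Python, the standard library offers a direct counterpart to each symbol: `sum` for $\sum$ and `math.prod` (available since Python 3.8) for $\prod$. The list `values` is just an example.

```python
import math

values = [2, 3, 4]

print(sum(values))        # 9  (2 + 3 + 4)
print(math.prod(values))  # 24 (2 * 3 * 4)
```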

## Estimating $P(x_i \mid \text{Spam})$

We define:

$$
\theta_i = P(x_i = 1 \mid \text{Spam})
$$

Using Laplace smoothing:

$$
\theta_i = \frac{\text{number of spam emails containing word } i + \alpha}{\text{total number of spam emails} + 2\alpha}
$$

Where:
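- $\alpha$ is a small positive constant (typically 1) that keeps $\theta_i$ from being exactly 0 or 1, so a word never seen in spam cannot zero out the whole product
- the $2\alpha$ in the denominator reflects the two possible outcomes for each word (present or absent)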
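
One way to compute these estimates is sketched below. It assumes the spam emails are already encoded as equal-length lists of 0s and 1s, one entry per vocabulary word; the function name `estimate_theta` and the toy data are hypothetical.

```python
def estimate_theta(spam_emails, alpha=1.0):
    """Smoothed estimate of P(x_i = 1 | Spam) for every word i.

    spam_emails: non-empty list of binary feature vectors (lists of 0/1).
    """
    n_emails = len(spam_emails)
    n_words = len(spam_emails[0])
    theta = []
    for i in range(n_words):
        # Number of spam emails that contain word i.
        count = sum(email[i] for email in spam_emails)
        theta.append((count + alpha) / (n_emails + 2 * alpha))
    return theta


# Toy data: 3 spam emails over a 3-word vocabulary.
spam_emails = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
]

theta = estimate_theta(spam_emails)
print(theta)  # [0.8, 0.2, 0.6]
```

Note that the second word never appears in the toy spam emails, yet its estimate is 0.2 rather than 0, and the first word appears in all of them, yet its estimate is 0.8 rather than 1.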

### Why Use $\theta_i$ and $1 - \theta_i$

We want to handle two cases:

- If $x_i = 1$, use $\theta_i$
- If $x_i = 0$, use $1 - \theta_i$

So we write both in one formula:

$$
P(x_i \mid \text{Spam}) = \theta_i^{x_i} \cdot (1 - \theta_i)^{1 - x_i}
$$
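
A short sketch can confirm that the single formula reproduces both cases. The function name `bernoulli_likelihood` is illustrative.

```python
def bernoulli_likelihood(x_i, theta_i):
    """P(x_i | class) for a binary feature with P(x_i = 1 | class) = theta_i."""
    if x_i not in (0, 1):
        raise ValueError("x_i must be 0 or 1")
    # One of the two exponents is always 0, so that factor becomes 1.
    return theta_i ** x_i * (1 - theta_i) ** (1 - x_i)


theta_i = 0.8
present = bernoulli_likelihood(1, theta_i)  # picks theta_i
absent = bernoulli_likelihood(0, theta_i)   # picks 1 - theta_i
print(present, absent)
```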

## Final Bernoulli Naive Bayes Formula

Putting it all together:

$$
P(\text{Spam} \mid \text{Words}) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} \left[ \theta_i^{x_i} \cdot (1 - \theta_i)^{1 - x_i} \right]
$$

**Example:**

- $x = [1, 0, 1]$  
- $\theta = [0.8, 0.4, 0.6]$

Then:

$$
P(\text{Spam} \mid x) \propto P(\text{Spam}) \cdot 0.8 \cdot (1 - 0.4) \cdot 0.6 = P(\text{Spam}) \cdot 0.8 \cdot 0.6 \cdot 0.6 = 0.288 \cdot P(\text{Spam})
$$
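
The likelihood part of this example can be checked with a few lines of Python. The prior $P(\text{Spam})$ is left out because the example does not give a value for it.

```python
import math

x = [1, 0, 1]
theta = [0.8, 0.4, 0.6]

# Use theta_i when the word is present, 1 - theta_i when it is absent.
factors = [t if x_i == 1 else 1 - t for x_i, t in zip(x, theta)]
likelihood = math.prod(factors)

print(round(likelihood, 3))  # 0.288
```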

## Use Log to Prevent Underflow

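Multiplying many probabilities produces very small numbers:

$$
0.9 \cdot 0.6 \cdot 0.1 \cdot 0.02 = 0.00108
$$

With hundreds or thousands of features, the product becomes smaller than the smallest positive floating-point number and is rounded to zero. This is called underflow.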
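
To see the effect concretely, the sketch below multiplies 200 hypothetical probabilities of 0.01 each. The true product is $10^{-400}$, which a 64-bit float cannot represent, while the sum of the logarithms is an ordinary number.

```python
import math

# Hypothetical per-feature probabilities, chosen to force underflow.
probs = [0.01] * 200

product = math.prod(probs)
log_sum = sum(math.log(p) for p in probs)

print(product)            # 0.0
print(round(log_sum, 2))  # -921.03
```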

### Logarithmic Form

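Taking the logarithm converts the product into a sum:

$$
\log P(\text{Spam} \mid \text{Words}) = \log P(\text{Spam}) + \sum_{i=1}^{n} \left[ x_i \log \theta_i + (1 - x_i) \log (1 - \theta_i) \right] - \log P(\text{Words})
$$

The term $\log P(\text{Words})$ is the same for every class, so it can be dropped when comparing classes. Because the logarithm is an increasing function, the class with the largest log score is also the class with the largest posterior probability.

Working with sums of logs is numerically stable, which is why practical implementations use this form.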
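
Putting the pieces together, a two-class decision in log space might look like the sketch below. The spam $\theta$ values match the earlier example; the priors and the not-spam $\theta$ values are hypothetical, and the names `log_score` and `params` are made up for illustration. The code assumes every $\theta_i$ lies strictly between 0 and 1, which smoothing guarantees.

```python
import math


def log_score(x, theta, prior):
    """Log of P(class) * prod_i P(x_i | class), computed as a sum of logs."""
    total = math.log(prior)
    for x_i, theta_i in zip(x, theta):
        total += x_i * math.log(theta_i) + (1 - x_i) * math.log(1 - theta_i)
    return total


x = [1, 0, 1]

params = {
    "spam": {"prior": 0.5, "theta": [0.8, 0.4, 0.6]},
    # Hypothetical values for the second class.
    "not_spam": {"prior": 0.5, "theta": [0.2, 0.5, 0.3]},
}

scores = {c: log_score(x, p["theta"], p["prior"]) for c, p in params.items()}
prediction = max(scores, key=scores.get)
print(prediction)  # spam
```

The scores are negative because they are logs of numbers below 1; only their order matters for the prediction.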