```{contents}
```

# Background

## Why do we need probability in ML?

Machine learning is about **making predictions under uncertainty**. For example:

* Email classification: “Is this spam or not?”
* Medical diagnosis: “Does the patient have disease X?”
* Sentiment analysis: “Is the review positive, neutral, or negative?”

We want to estimate the **probability of each class** given the observed features.

Formally:

$$
P(y \mid X) \quad \text{where } X = (x_1, x_2, \dots, x_n)
$$

---

## Independent vs Dependent events

* **Independent events**: rolling a die twice. The outcome of roll 1 does not affect roll 2.

  $$
  P(A \cap B) = P(A) \cdot P(B)
  $$

* **Dependent events**: drawing marbles without replacement. The probability of the second draw **changes** depending on the first.

  $$
  P(A \cap B) = P(A) \cdot P(B \mid A)
  $$

👉 This dependency idea is the foundation for **conditional probability**.
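To make the two product rules concrete, here is a minimal sketch using exact fractions. The bag contents (3 red and 2 blue marbles) are a hypothetical choice for illustration, not something given above.

```python
from fractions import Fraction

# Independent: two rolls of a fair six-sided die.
p_six = Fraction(1, 6)
p_two_sixes = p_six * p_six  # P(A ∩ B) = P(A) · P(B)

# Dependent: draw two marbles without replacement from a
# hypothetical bag holding 3 red and 2 blue marbles.
red, blue = 3, 2
p_first_red = Fraction(red, red + blue)
# One red marble is gone, so both the red count and the total shrink by 1.
p_second_red_given_first_red = Fraction(red - 1, red + blue - 1)
p_two_reds = p_first_red * p_second_red_given_first_red  # P(A) · P(B | A)

print(p_two_sixes)  # 1/36
print(p_two_reds)   # 3/10
```

Treating the marble draws as independent would give $\tfrac{3}{5} \cdot \tfrac{3}{5} = \tfrac{9}{25}$ instead of $\tfrac{3}{10}$, which shows why the conditional term matters.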

---

## Conditional Probability

Definition (for $P(B) > 0$):

$$
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
$$

Intuition: “What is the probability of $A$ happening if I already know that $B$ happened?”
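The definition can be checked by brute-force enumeration. The sketch below assumes two fair dice and picks two illustrative events (the names `a` and `b` and the events themselves are hypothetical choices): $A$ is "the dice sum to 8" and $B$ is "the first die is even".

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    hits = sum(1 for o in outcomes if event(o))
    return Fraction(hits, len(outcomes))

def a(o):
    """Event A: the two dice sum to 8."""
    return o[0] + o[1] == 8

def b(o):
    """Event B: the first die shows an even number."""
    return o[0] % 2 == 0

p_a = prob(a)
p_b = prob(b)
p_a_and_b = prob(lambda o: a(o) and b(o))
p_a_given_b = p_a_and_b / p_b  # P(A | B) = P(A ∩ B) / P(B)

print(p_a)          # 5/36
print(p_a_given_b)  # 1/6
```

Knowing that $B$ happened raises the probability of $A$ from $\tfrac{5}{36}$ to $\tfrac{1}{6}$, so in this setup the two events are dependent.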

---

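## Bayes' Theorem

Writing $P(A \cap B)$ with conditional probability both ways gives $P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)$. Dividing both sides by $P(B)$: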

$$
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
$$

Where:

* $P(A)$ = prior (belief about $A$ before seeing data)
* $P(B \mid A)$ = likelihood (how compatible evidence $B$ is with $A$)
* $P(B)$ = marginal probability (normalizing constant)
* $P(A \mid B)$ = posterior (updated belief after seeing evidence)

This is the **Bayesian update rule**.
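A small numeric sketch of the update rule, using a hypothetical diagnostic test. The 1% prior, 95% sensitivity, and 5% false-positive rate are made-up numbers, and the helper `bayes_posterior` is an illustrative name. It assumes $P(B)$ is expanded with the law of total probability over $A$ and not-$A$.

```python
from fractions import Fraction

def bayes_posterior(prior, likelihood, likelihood_if_not):
    """Return P(A | B) from P(A), P(B | A), and P(B | not A)."""
    # Marginal P(B) via the law of total probability.
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: A = "patient has disease X", B = "test is positive".
prior = Fraction(1, 100)               # P(A)
likelihood = Fraction(95, 100)         # P(B | A)
likelihood_if_not = Fraction(5, 100)   # P(B | not A)

posterior = bayes_posterior(prior, likelihood, likelihood_if_not)
print(posterior)                    # 19/118
print(round(float(posterior), 3))   # 0.161
```

Under these assumed numbers, a positive test moves the belief from 1% to about 16%: the evidence is strong, but the low prior still dominates.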

---

## Applying Bayes to ML

We want:

$$
P(y \mid x_1, x_2, \dots, x_n)
$$

By Bayes' theorem:

$$
P(y \mid X) = \frac{P(y) \cdot P(x_1, x_2, \dots, x_n \mid y)}{P(x_1, x_2, \dots, x_n)}
$$

The denominator is **the same for all classes**, so we only care about the numerator.
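A quick sketch of why the denominator can be dropped: dividing every class score by the same positive constant does not change which class ranks first. The three scores below are hypothetical values of $P(y) \cdot P(X \mid y)$, and the classes are assumed to be mutually exclusive and exhaustive so that the scores sum to $P(X)$.

```python
# Hypothetical unnormalized scores P(y) * P(X | y) for three classes.
scores = {"positive": 0.012, "neutral": 0.003, "negative": 0.005}

# Summing over all classes plays the role of the denominator P(X).
evidence = sum(scores.values())
posteriors = {y: s / evidence for y, s in scores.items()}

best_unnormalized = max(scores, key=scores.get)
best_normalized = max(posteriors, key=posteriors.get)

print(best_unnormalized, best_normalized)  # positive positive
```

Normalization is still needed when the actual probability values matter, not only the predicted label.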

---

## The Naïve Assumption

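Problem: estimating the joint likelihood $P(x_1, x_2, \dots, x_n \mid y)$ directly is hard because features may depend on each other. For discrete features, the number of value combinations to estimate grows exponentially with $n$.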

**Naïve Bayes assumes conditional independence:**

$$
P(x_1, x_2, \dots, x_n \mid y) \approx \prod_{i=1}^n P(x_i \mid y)
$$

This gives:

$$
\hat{y} = \arg\max_y \; P(y) \prod_{i=1}^n P(x_i \mid y)
$$

That’s the **Naïve Bayes classifier**.
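The decision rule above can be sketched for categorical features in a few lines. This is a minimal illustration, not a reference implementation: the function names `fit` and `predict` and the tiny spam dataset are hypothetical, probabilities are plain relative frequencies, and a feature value never seen with a class gets probability 0.

```python
from collections import Counter, defaultdict

def fit(rows, labels):
    """Estimate P(y) and P(x_i | y) by relative frequency (no smoothing)."""
    class_counts = Counter(labels)
    # feature_counts[(i, y)][value] = rows of class y whose feature i equals value
    feature_counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, y)][value] += 1
    return class_counts, feature_counts

def predict(model, row):
    """Return argmax_y P(y) * prod_i P(x_i | y) and the per-class scores."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for y, n_y in class_counts.items():
        score = n_y / total  # prior P(y)
        for i, value in enumerate(row):
            score *= feature_counts[(i, y)][value] / n_y  # P(x_i | y)
        scores[y] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy data: (first word, contains a link) -> label.
rows = [("free", "yes"), ("free", "no"), ("meeting", "no"),
        ("meeting", "no"), ("free", "yes")]
labels = ["spam", "spam", "ham", "ham", "spam"]

model = fit(rows, labels)
label, scores = predict(model, ("free", "yes"))
print(label)  # spam
```

Here the "ham" score is exactly 0 because "free" never appears in a ham row, which shows how a single unseen value can zero out the whole product when raw counts are used.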

---

## Interpreting the Formula

* $P(y)$: Prior probability of the class
* $P(x_i \mid y)$: Likelihood of feature $x_i$ under class $y$
* $\prod_i$: Combine all feature evidence
* $\arg\max_y$: Choose the class with the highest posterior probability
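One practical note on the product term, offered as an implementation sketch and not as part of the derivation: multiplying many small likelihoods can underflow to zero in floating point. Because the logarithm is monotonic, comparing $\log P(y) + \sum_i \log P(x_i \mid y)$ picks the same class. The class names and likelihood values below are hypothetical.

```python
import math

def log_score(prior, likelihoods):
    """log P(y) + sum_i log P(x_i | y); assumes all probabilities are > 0."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# Hypothetical setup: 1000 features, each with the same likelihood per class.
classes = {
    "a": (0.5, [0.01] * 1000),
    "b": (0.5, [0.02] * 1000),
}

# The direct products underflow, so both classes look identical.
direct = {y: prior * math.prod(ls) for y, (prior, ls) in classes.items()}
# The log-space scores stay finite and remain comparable.
logs = {y: log_score(prior, ls) for y, (prior, ls) in classes.items()}

print(direct)                  # {'a': 0.0, 'b': 0.0}
print(max(logs, key=logs.get)) # b
```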

---

## Worked Example (Play Tennis 🌤️🎾)

Dataset (simplified):

* Features: **Outlook, Temperature**
* Target: **Play = Yes/No**

Say the test instance is:
`Outlook = Sunny, Temperature = Hot`

We compute:

$$
P(\text{Yes} \mid \text{Sunny}, \text{Hot}) \propto P(\text{Yes}) \cdot P(\text{Sunny} \mid \text{Yes}) \cdot P(\text{Hot} \mid \text{Yes})
$$

$$
P(\text{No} \mid \text{Sunny}, \text{Hot}) \propto P(\text{No}) \cdot P(\text{Sunny} \mid \text{No}) \cdot P(\text{Hot} \mid \text{No})
$$

* From counts in the dataset (14 rows: 9 Yes, 5 No):

  * $P(\text{Yes}) = \tfrac{9}{14}$, $P(\text{No}) = \tfrac{5}{14}$
  * $P(\text{Sunny} \mid \text{Yes}) = \tfrac{2}{9}$, $P(\text{Hot} \mid \text{Yes}) = \tfrac{2}{9}$
  * $P(\text{Sunny} \mid \text{No}) = \tfrac{3}{5}$, $P(\text{Hot} \mid \text{No}) = \tfrac{2}{5}$

Plug in:

$$
P(\text{Yes} \mid \text{Sunny}, \text{Hot}) \propto \tfrac{9}{14} \cdot \tfrac{2}{9} \cdot \tfrac{2}{9} \approx 0.0317
$$

$$
P(\text{No} \mid \text{Sunny}, \text{Hot}) \propto \tfrac{5}{14} \cdot \tfrac{3}{5} \cdot \tfrac{2}{5} \approx 0.0857
$$

Normalize so the two scores sum to 1 (divide each by $0.0317 + 0.0857$):

* $P(\text{Yes} \mid \text{Sunny}, \text{Hot}) \approx 0.27$
* $P(\text{No} \mid \text{Sunny}, \text{Hot}) \approx 0.73$

✅ Prediction: **No (won’t play tennis)**
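The arithmetic above can be reproduced with exact fractions. This sketch hard-codes the priors and likelihoods listed in the example instead of deriving them from the raw table, and the variable names are illustrative.

```python
from fractions import Fraction

# Probabilities taken from the counts in the worked example.
priors = {"Yes": Fraction(9, 14), "No": Fraction(5, 14)}
likelihoods = {
    "Yes": {"Sunny": Fraction(2, 9), "Hot": Fraction(2, 9)},
    "No": {"Sunny": Fraction(3, 5), "Hot": Fraction(2, 5)},
}
instance = ["Sunny", "Hot"]

# Unnormalized score per class: P(y) * prod_i P(x_i | y).
scores = {}
for label, prior in priors.items():
    score = prior
    for value in instance:
        score *= likelihoods[label][value]
    scores[label] = score

# Normalize so the class probabilities sum to 1.
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}
prediction = max(posteriors, key=posteriors.get)

print(scores["Yes"], scores["No"])          # 2/63 3/35
print(posteriors["Yes"], posteriors["No"])  # 10/37 27/37
print(prediction)                           # No
```

The exact posteriors $\tfrac{10}{37} \approx 0.27$ and $\tfrac{27}{37} \approx 0.73$ match the rounded values in the example.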

---

## Key Takeaways

* **Naïve Bayes = Bayes' theorem + conditional independence assumption**
* Works best when features are close to conditionally independent given the class (only weakly correlated within each class)
* Very fast, good for text classification (spam filtering, sentiment analysis)
* Outputs **probabilities**, not just class labels