```{contents}
```

## Assumptions

---

### 1. Feature Independence

* **Assumption:** Features are independent given the class.
* **Reality:** Features are often correlated.

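  * Example: In spam emails, the words *“lottery”* and *“win”* frequently appear together, so seeing one makes the other more likely and the independence assumption does not hold.
* **Why it still works:** Multiplying the per-feature probabilities often still yields a **reasonable ranking of classes**, even when the absolute probabilities are wrong. Correlated features are effectively counted more than once, which pushes posteriors toward 0 or 1 but rarely changes which class wins. Classification only needs the **highest posterior**, not exact values.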
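
The effect can be sketched with a two-class toy model. The likelihood values below are made up for illustration, and `naive_posterior` is a hypothetical helper, not part of any library. Counting the same evidence twice (as happens with a perfectly correlated word) leaves the predicted class unchanged but inflates the posterior.

```python
import math

def naive_posterior(prior_spam, likelihoods):
    """Posterior P(spam | features) under the naive independence assumption.

    likelihoods is a list of (P(feature | spam), P(feature | ham)) pairs.
    """
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for p_spam, p_ham in likelihoods:
        log_spam += math.log(p_spam)
        log_ham += math.log(p_ham)
    # Normalise the two joint scores into a posterior for the spam class.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

# Hypothetical numbers: "lottery" is 4x more likely in spam than in ham.
lottery = (0.20, 0.05)

# The evidence counted once.
once = naive_posterior(0.5, [lottery])

# Pretend "win" is perfectly correlated with "lottery": it adds no new
# information, but Naive Bayes multiplies it in as if it were independent.
twice = naive_posterior(0.5, [lottery, lottery])

print(f"counted once:  {once:.3f}")
print(f"counted twice: {twice:.3f}")
```

Both posteriors are above 0.5, so the predicted class is spam either way; only the confidence changes.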

---

### 2. Equal Contribution of Features

* **Assumption:** Each feature contributes to the prediction on an equal footing: it enters the product once, with no learned weight that lets one feature count more than another.
* **Reality:** Some features carry far more evidence than others.

  * Example: In medical diagnosis, *“tumor detected in MRI”* is far stronger evidence than *“slight fever”*.
* **Why it still works:** In high-dimensional settings (like text classification), many weak, roughly independent signals combine to give strong results.
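
A minimal sketch of how weak signals add up, assuming two classes and treating each feature as a log-likelihood ratio. The ratios 1.2 and 50 are invented for illustration, and `posterior_from_log_ratios` is a hypothetical helper.

```python
import math

def posterior_from_log_ratios(prior, log_ratios):
    """Combine per-feature log-likelihood ratios into a posterior.

    Each entry is log(P(feature | class A) / P(feature | class B)).
    Under the naive assumption the ratios simply add in log space.
    """
    log_odds = math.log(prior / (1.0 - prior)) + sum(log_ratios)
    return 1.0 / (1.0 + math.exp(-log_odds))

weak = math.log(1.2)     # a feature that barely favours class A
strong = math.log(50.0)  # a single highly informative feature

one_weak = posterior_from_log_ratios(0.5, [weak])
many_weak = posterior_from_log_ratios(0.5, [weak] * 30)
one_strong = posterior_from_log_ratios(0.5, [strong])

print(f"one weak feature:   {one_weak:.3f}")
print(f"30 weak features:   {many_weak:.3f}")
print(f"one strong feature: {one_strong:.3f}")
```

In this sketch, thirty weak features together outweigh the single strong one, which is the situation text classification typically provides.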

---

### 3. Distribution of Features

* **Assumption:**

  * Gaussian NB → features are normally distributed within each class.
  * Multinomial NB → word counts follow a multinomial distribution.
  * Bernoulli NB → features are binary indicators.
* **Reality:** Data distributions often deviate.

  * Example: Continuous features may be skewed, not Gaussian.
* **Why it still works:** As long as the assumed distribution is a rough approximation, the decision boundary can still separate classes effectively.
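
For reference, the three variants correspond to the following class-conditional likelihoods. The notation is an assumption introduced here: $x_i$ is the value of feature $i$, $c$ is the class, $\mu_{ic}$ and $\sigma_{ic}^2$ are the per-class mean and variance, $\theta_{ic}$ is the probability of word $i$ in class $c$, and $p_{ic}$ is the probability that binary feature $i$ is present in class $c$.

$$
\begin{aligned}
\text{Gaussian:} \quad & P(x_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_{ic}^2}} \exp\left(-\frac{(x_i - \mu_{ic})^2}{2\sigma_{ic}^2}\right) \\
\text{Multinomial:} \quad & P(\mathbf{x} \mid c) \propto \prod_i \theta_{ic}^{x_i} \\
\text{Bernoulli:} \quad & P(\mathbf{x} \mid c) = \prod_i p_{ic}^{x_i} (1 - p_{ic})^{1 - x_i}
\end{aligned}
$$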

---

### 4. No Zero Probability

* **Assumption:** Every feature-class combination has a nonzero probability.
* **Reality:** Some words/values may not appear in training.

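  * Example: If “Bitcoin” never appeared in spam training data, the unsmoothed estimate is $P(\text{Bitcoin} \mid \text{spam}) = 0$, and that single zero factor forces the whole spam posterior to zero no matter what the other words say.
* **Why it still works:** With **Laplace (add-one) smoothing**, one is added to every count before the probabilities are estimated, so no likelihood is exactly zero and predictions stay stable.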
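
A minimal sketch of add-one smoothing for word likelihoods, using the common estimate $(\text{count} + \alpha) / (\text{total} + \alpha \cdot |V|)$ with $\alpha = 1$. The word counts and vocabulary are invented for illustration, and `smoothed_likelihood` is a hypothetical helper.

```python
from collections import Counter

def smoothed_likelihood(word, class_counts, vocab_size, alpha=1.0):
    """Estimate P(word | class) with additive (Laplace) smoothing.

    class_counts maps each word to its count in the class's training data;
    vocab_size is the number of distinct words across all classes.
    """
    total = sum(class_counts.values())
    return (class_counts[word] + alpha) / (total + alpha * vocab_size)

# Hypothetical spam training counts; "bitcoin" never appears in spam.
spam_counts = Counter({"lottery": 3, "win": 2, "free": 5})
vocab = {"lottery", "win", "free", "bitcoin", "meeting"}

# Without smoothing the estimate is exactly zero.
unsmoothed = spam_counts["bitcoin"] / sum(spam_counts.values())

# With add-one smoothing the unseen word gets a small nonzero probability.
smoothed = smoothed_likelihood("bitcoin", spam_counts, len(vocab))

# The smoothed estimates still form a valid distribution over the vocabulary.
total_mass = sum(smoothed_likelihood(w, spam_counts, len(vocab)) for w in vocab)

print(unsmoothed, smoothed, total_mass)
```

Passing a smaller `alpha` (for example 0.1) gives unseen words less probability mass; the right value depends on the data.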

---

**Key Insight:**
Even though the independence and distribution assumptions are rarely satisfied in practice, Naive Bayes still works well when:

* Features provide **enough weak evidence**.
* The goal is classification, not perfect probability estimation.
* Data is **high-dimensional and sparse** (like text).

❌ It fails when:

* Strongly correlated features drive the decision, so the same evidence is counted several times.
* Precise probability estimates are required (not just classification).
