### 1. **Introduction**

* Naïve Bayes is a machine learning algorithm for **classification** (binary and multi-class).
* It is based on **Bayes’ Theorem** from probability.
* Understanding **probability concepts** (independent vs dependent events) is necessary before deriving Bayes’ Theorem.

---

### 2. **Probability Basics**

* **Independent events**: One outcome does not affect another.

  * Example: rolling a die. Each face (1–6) has probability **1/6** on every roll, regardless of what earlier rolls showed.
* **Dependent events**: One event changes the probability of another.

  * Example: bag with 3 orange and 2 yellow marbles.

    * Probability(orange first) = 3/5.
    * After removing 1 orange, probability(yellow next) = 2/4 = 1/2.
  * Joint probability = P(O and Y) = P(O) × P(Y|O) = 3/5 × 1/2 = 3/10.
* This introduces **conditional probability**:

  * P(B|A) = probability of event B given event A has occurred.
  * General form: P(A and B) = P(A) × P(B|A).
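
The marble calculation above can be checked with a few lines of Python. This is a minimal sketch: the variable names are illustrative, and the enumeration step is only an added cross-check on the chain-rule result.

```python
from fractions import Fraction
from itertools import permutations

# Bag from the example above: 3 orange (O) and 2 yellow (Y) marbles.
bag = ["O", "O", "O", "Y", "Y"]

# Chain rule: P(O and Y) = P(O) * P(Y | O)
p_orange_first = Fraction(bag.count("O"), len(bag))              # 3/5
p_yellow_given_orange = Fraction(bag.count("Y"), len(bag) - 1)   # 2/4 after one orange is removed
p_joint = p_orange_first * p_yellow_given_orange

# Cross-check: enumerate every ordered draw of two distinct marbles.
draws = list(permutations(range(len(bag)), 2))
favourable = sum(1 for i, j in draws if bag[i] == "O" and bag[j] == "Y")
p_joint_enumerated = Fraction(favourable, len(draws))

print(p_joint, p_joint_enumerated)  # 3/10 3/10
```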

---

### 3. **Bayes’ Theorem**

* Derived from conditional probability symmetry:

  * P(A and B) = P(A)P(B|A) = P(B)P(A|B).
* Rearranging gives Bayes’ theorem:

  $$
  P(A|B) = \frac{P(A) \cdot P(B|A)}{P(B)}
  $$
* Components:

  * P(A|B): Posterior (probability of A given evidence B).
  * P(A): Prior (probability of A before observing B).
  * P(B|A): Likelihood (probability of the evidence B given that A holds).
  * P(B): Evidence (normalization constant).
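
As a small illustration, the theorem can be applied to the marble bag from Section 2 to ask the reverse question: given that the second marble is yellow, how likely is it that the first was orange? The helper name `posterior` is hypothetical, and computing P(B) with the law of total probability is a step added here that the notes do not cover.

```python
from fractions import Fraction

def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)."""
    if evidence == 0:
        raise ValueError("P(B) must be non-zero")
    return prior * likelihood / evidence

# A = "first marble is orange", B = "second marble is yellow"
# (3 orange, 2 yellow, drawn without replacement).
p_a = Fraction(3, 5)
p_b_given_a = Fraction(2, 4)
p_b_given_not_a = Fraction(1, 4)  # first was yellow, so 1 yellow is left among 4

# Law of total probability (assumed step): P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = posterior(p_a, p_b_given_a, p_b)
print(p_a_given_b)  # 3/4
```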

---

### 4. **Naïve Bayes in Machine Learning**

* Aim: Predict class $y$ given input features $X_1, X_2, ..., X_n$.
* Formula:

  $$
  P(y | X_1, X_2, ..., X_n) = \frac{P(y) \cdot P(X_1, X_2, ..., X_n | y)}{P(X_1, X_2, ..., X_n)}
  $$
* With the **Naïve assumption** (features are conditionally independent given class):

  $$
  P(y | X_1, ..., X_n) \propto P(y) \cdot \prod_{i=1}^n P(X_i|y)
  $$
* Denominator $P(X_1,...,X_n)$ is constant for all classes → ignored in comparison.
* Classification rule: Choose the class with **maximum posterior probability**.
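
The classification rule above might be sketched as follows for categorical features. The function names, the nested-dictionary layout, and the toy probabilities are all assumptions made for illustration; they are not taken from any library.

```python
from math import prod

def naive_bayes_scores(priors, likelihoods, observation):
    """Unnormalized posteriors P(y) * prod_i P(X_i | y) for every class y.

    priors:      {class: P(y)}
    likelihoods: {class: {feature: {value: P(value | class)}}}
    observation: {feature: value}
    """
    return {
        y: priors[y] * prod(likelihoods[y][f][v] for f, v in observation.items())
        for y in priors
    }

def predict(priors, likelihoods, observation):
    """Pick the class with the maximum (unnormalized) posterior."""
    scores = naive_bayes_scores(priors, likelihoods, observation)
    return max(scores, key=scores.get)

# Hypothetical two-class, two-feature model with made-up probabilities.
priors = {"a": 0.4, "b": 0.6}
likelihoods = {
    "a": {"x1": {"hi": 0.7, "lo": 0.3}, "x2": {"hi": 0.5, "lo": 0.5}},
    "b": {"x1": {"hi": 0.2, "lo": 0.8}, "x2": {"hi": 0.1, "lo": 0.9}},
}
observation = {"x1": "hi", "x2": "lo"}

scores = naive_bayes_scores(priors, likelihoods, observation)
print(scores)
print(predict(priors, likelihoods, observation))
```

Because the denominator is shared by all classes, the sketch never computes it. With many features, summing log-probabilities instead of multiplying raw probabilities is a common way to avoid numerical underflow.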

---

### 5. **Worked Example (Tennis dataset)**

Dataset features: Outlook, Temperature, Humidity, Wind → Target: Play Tennis (Yes/No).

**Step 1: Prior Probabilities**

* Yes = 9/14.
* No = 5/14.

**Step 2: Conditional Probabilities** (example for Outlook):

* P(Sunny|Yes) = 2/9, P(Overcast|Yes) = 4/9, P(Rain|Yes) = 3/9.
* P(Sunny|No) = 3/5, P(Overcast|No) = 0, P(Rain|No) = 2/5.
  (Similarly calculated for Temperature values: Hot, Mild, Cool.)

**Step 3: Prediction for new test case**
Test data: Outlook = Sunny, Temperature = Hot.

* Compute posterior for **Yes**:

  $$
  P(Yes|Sunny,Hot) \propto P(Yes) \cdot P(Sunny|Yes) \cdot P(Hot|Yes)
  $$

  = (9/14) × (2/9) × (2/9) = 2/63 ≈ 0.032.

* Compute posterior for **No**:

  $$
  P(No|Sunny,Hot) \propto P(No) \cdot P(Sunny|No) \cdot P(Hot|No)
  $$

  = (5/14) × (3/5) × (2/5) = 3/35 ≈ 0.086.

* Normalize:

  * Yes = 0.032 / (0.032 + 0.086) ≈ **27%**.
  * No = 0.086 / (0.032 + 0.086) ≈ **73%**.

**Result:** Prediction = **No (will not play tennis)**, since No has the higher posterior probability.
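
The arithmetic in Steps 1–3 can be reproduced with exact fractions. This sketch hard-codes only the probabilities quoted above (it does not rebuild them from the 14-row dataset), and the variable names are illustrative.

```python
from fractions import Fraction as F

# Step 1: priors.
priors = {"Yes": F(9, 14), "No": F(5, 14)}
# Step 2: the conditional probabilities needed for the test case.
p_sunny = {"Yes": F(2, 9), "No": F(3, 5)}
p_hot = {"Yes": F(2, 9), "No": F(2, 5)}

# Step 3: unnormalized posteriors, then normalize so they sum to 1.
scores = {c: priors[c] * p_sunny[c] * p_hot[c] for c in priors}
total = sum(scores.values())
posteriors = {c: scores[c] / total for c in scores}
prediction = max(posteriors, key=posteriors.get)

for c in scores:
    print(c, scores[c], round(float(posteriors[c]), 2))
print("Prediction:", prediction)
```

Working with `Fraction` keeps the intermediate values exact (2/63 and 3/35), so the normalized posteriors come out as 10/37 and 27/37, matching the 27% / 73% split.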

---

### 6. **Conclusion**

* Naïve Bayes uses **Bayes’ theorem** with the assumption that features are **conditionally independent given the class**.
* For each class, compute **prior × product of likelihoods**, then compare probabilities.
* It works for both **binary and multi-class classification**.
* Implementation in code is simple, but understanding the derivation and probability mechanics is essential.



# 🔹 1. Hard Margin SVC

* **Assumption**: The data is **perfectly linearly separable** (no overlap, no noise).
* The goal is to find a hyperplane that **separates the two classes with no misclassification**.
* It maximizes the margin (width $2/\|w\|$, which is equivalent to minimizing $\frac{1}{2}\|w\|^2$) subject to:

$$
y_i (w^T x_i + b) \geq 1 \quad \forall i
$$

* Meaning: every point must be correctly classified and lie **on or outside the margin boundaries**.

✅ **Pros**:

* Simple, clean, and works when data is truly separable.

❌ **Cons**:

* Very sensitive to **outliers** and **noise**.

  * Even one overlapping point makes the constraints impossible to satisfy, and a single outlier near the boundary can shift the hyperplane drastically.
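
To make the constraint concrete, the sketch below checks $y_i (w^T x_i + b) \geq 1$ for a hand-picked hyperplane on hypothetical 2-D toy data. The function names, the data, and the values of `w` and `b` are all assumptions for illustration; nothing here solves the optimization problem.

```python
def functional_margins(X, y, w, b):
    """Return y_i * (w . x_i + b) for every sample."""
    return [
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        for xi, yi in zip(X, y)
    ]

def satisfies_hard_margin(X, y, w, b):
    """True if every sample meets the hard-margin constraint y_i (w . x_i + b) >= 1."""
    return all(m >= 1 for m in functional_margins(X, y, w, b))

# Hypothetical separable data with labels in {-1, +1}.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]
w, b = (0.5, 0.5), -1.0  # hand-picked hyperplane, not an optimizer's output

print(satisfies_hard_margin(X, y, w, b))  # True

# Add one negative point that sits between the two positive points.
X_noisy = X + [(2.5, 2.5)]
y_noisy = y + [-1]
print(satisfies_hard_margin(X_noisy, y_noisy, w, b))  # False
```

The added point lies on the segment between two positive samples, so no hyperplane can separate it from them. For this toy set the hard-margin problem has no feasible solution at all.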

---

# 🔹 2. Soft Margin SVC

* **Reality**: Most real-world data is **not perfectly separable** (there’s noise, overlap, outliers).
* Soft margin allows some **violations of the margin rule** using **slack variables $\xi_i$**.
* Optimization problem:

$$
\min_{w,\, b,\, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i
$$

subject to:

$$
y_i (w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0
$$

* Here:

  * $\xi_i$ measures how much point $i$ violates the margin (or is misclassified).
  * $C$ controls the penalty for margin violations (including misclassifications):

    * Large $C$ → less tolerance (tries to classify every point correctly).
    * Small $C$ → more tolerance, allows wider margin with some misclassifications.
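
The role of $\xi_i$ and $C$ can be seen by evaluating the objective for a fixed hyperplane. In this sketch each slack is set to the smallest value that satisfies both constraints, $\xi_i = \max(0,\, 1 - y_i (w^T x_i + b))$. The data, `w`, `b`, and the function names are hypothetical, and no optimization is performed.

```python
def slack_variables(X, y, w, b):
    """Smallest feasible slacks for fixed (w, b): xi_i = max(0, 1 - y_i (w . x_i + b))."""
    return [
        max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
        for xi, yi in zip(X, y)
    ]

def soft_margin_objective(X, y, w, b, C):
    """0.5 * ||w||^2 + C * sum_i xi_i, evaluated at a fixed (w, b)."""
    slacks = slack_variables(X, y, w, b)
    return 0.5 * sum(wj * wj for wj in w) + C * sum(slacks)

# Hypothetical toy data: the last point is a negative sample inside the positive region.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0), (2.5, 2.5)]
y = [1, 1, -1, -1, -1]
w, b = (0.5, 0.5), -1.0  # hand-picked hyperplane

print(slack_variables(X, y, w, b))                 # [0.0, 0.0, 0.0, 0.0, 2.5]
print(soft_margin_objective(X, y, w, b, C=1.0))    # 2.75
print(soft_margin_objective(X, y, w, b, C=10.0))   # 25.25
```

Only the overlapping point has non-zero slack. Raising $C$ from 1 to 10 multiplies its contribution to the objective by ten, which is why a large $C$ pushes the optimizer toward classifying every point correctly.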

✅ **Pros**:

* Works better on noisy, real-world data.
* Balances **margin maximization** and **classification errors**.

❌ **Cons**:

* Needs tuning of **C parameter**.

---

# 🔹 3. Quick Analogy

* **Hard Margin** = "Strict teacher" → *no mistakes allowed*. Even one wrong answer = fail.
* **Soft Margin** = "Practical teacher" → *a few mistakes are allowed* if the overall understanding is strong.

---

✅ **Summary**:

* **Hard Margin** → perfect separation, no misclassification, sensitive to outliers.
* **Soft Margin** → allows some errors (controlled by C), more robust and practical.
