<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Naive%20Bayes/Naive%20Bayes%20Hands-On%20Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive Bayes Hands-On Lab

In this hands-on lab, you will implement the Naive Bayes classifier from scratch, gaining a deep understanding of probabilistic classification. You'll work with both **Gaussian Naive Bayes** for continuous features and **Multinomial Naive Bayes** for text classification.

## Learning Objectives

By the end of this lab, you will be able to:

1. **Understand Bayes Theorem**: Apply the fundamental formula for probabilistic inference
2. **Implement Gaussian Naive Bayes**: Build a classifier for continuous features from scratch
3. **Implement Multinomial Naive Bayes**: Create a text classifier using bag-of-words representation
4. **Handle numerical stability**: Use log probabilities to avoid numerical underflow
5. **Apply Laplace smoothing**: Prevent zero probability issues in classification
6. **Visualize decision boundaries**: Understand how Naive Bayes separates classes
7. **Compare with scikit-learn**: Validate your implementation against the library version

## Algorithm Overview

### Bayes Theorem

Naive Bayes is a probabilistic classifier based on **Bayes Theorem**:

$$P(y|X) = \frac{P(X|y) \cdot P(y)}{P(X)}$$

Where:
- $P(y|X)$ is the **posterior probability** - probability of class $y$ given features $X$
- $P(X|y)$ is the **likelihood** - probability of features $X$ given class $y$
- $P(y)$ is the **prior probability** - probability of class $y$ before seeing the data
- $P(X)$ is the **evidence** - probability of the features (normalizing constant)
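
As a quick numeric illustration of the formula above, the sketch below computes a posterior for a single binary feature. The probabilities are made-up values chosen for illustration, not estimates from any dataset.

```python
# Hypothetical numbers: 30% of emails are spam; the word "free"
# appears in 80% of spam emails and 10% of non-spam emails.
p_spam = 0.3                 # prior P(y = spam)
p_free_given_spam = 0.8      # likelihood P(X | y = spam)
p_free_given_ham = 0.1       # likelihood P(X | y = not spam)

# Evidence P(X): total probability of seeing "free" in any email
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior P(y = spam | X) from Bayes' theorem
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(f"P(free) = {p_free:.3f}")
print(f"P(spam | free) = {p_spam_given_free:.3f}")
```

Seeing the word raises the spam probability from the prior of 0.3 to roughly 0.77 under these assumed numbers.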

### The Naive Assumption

The "naive" in Naive Bayes comes from the **conditional independence assumption**: given the class label, all features are assumed to be independent of each other.

$$P(X|y) = P(x_1|y) \cdot P(x_2|y) \cdot ... \cdot P(x_n|y) = \prod_{i=1}^{n} P(x_i|y)$$

This simplification makes the algorithm computationally efficient and surprisingly effective in practice.

### Classification Decision

To classify a new sample, we compute:

$$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i|y)$$

Since $P(X)$ is constant for all classes, we can ignore it for classification purposes.
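
The decision rule can be sketched in a few lines of NumPy. The priors and likelihoods below are hypothetical values for one sample with two features; the point is only that dividing by $P(X)$ does not change which class wins.

```python
import numpy as np

# Hypothetical priors and per-feature likelihoods for a single sample
classes = np.array(["not_spam", "spam"])
priors = np.array([0.7, 0.3])                # P(y)
likelihoods = np.array([[0.1, 0.2],          # P(x_1|y), P(x_2|y) for not_spam
                        [0.8, 0.6]])         # P(x_1|y), P(x_2|y) for spam

# Unnormalized posterior: P(y) * prod_i P(x_i | y)
scores = priors * likelihoods.prod(axis=1)

# Dividing by P(X) = scores.sum() rescales every class by the same factor,
# so the argmax is unchanged.
posteriors = scores / scores.sum()

y_hat = classes[np.argmax(scores)]
print(scores, posteriors, y_hat)
```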

### Gaussian Naive Bayes

For **continuous features**, we assume each feature follows a Gaussian (normal) distribution within each class:

$$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right)$$

Where:
- $\mu_{y,i}$ is the mean of feature $i$ for class $y$
- $\sigma_{y,i}^2$ is the variance of feature $i$ for class $y$
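
To make the formula concrete, here is a minimal sketch that evaluates the Gaussian density directly. The helper name `gaussian_pdf` is an illustrative choice, not part of the lab's class.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x (illustrative helper)."""
    return np.exp(-((x - mu) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# At the mean of a standard normal the density equals 1 / sqrt(2*pi)
peak_standard = gaussian_pdf(0.0, mu=0.0, var=1.0)

# With a small variance the value exceeds 1: it is a density,
# not a probability.
peak_narrow = gaussian_pdf(0.0, mu=0.0, var=0.01)

print(peak_standard, peak_narrow)
```

Because these are densities, individual values of $P(x_i|y)$ may be larger than 1 for features with small variance; only their relative sizes across classes matter for classification.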

### Multinomial Naive Bayes

For **count data** (like word frequencies in text), we use the multinomial distribution:

$$P(x_i|y) = \frac{N_{y,i} + \alpha}{N_y + \alpha \cdot n}$$

Where:
- $N_{y,i}$ is the count of feature $i$ in class $y$
- $N_y$ is the total count of all features in class $y$
- $\alpha$ is the Laplace smoothing parameter
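- $n$ is the number of features (the vocabulary size for text)

A feature that occurs $x_i$ times in a sample contributes $x_i \log P(x_i|y)$ to the log-likelihood, which is why prediction weights each log probability by the corresponding count.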
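
The effect of the smoothing term is easiest to see on a tiny example. The counts below are hypothetical: one class, a four-word vocabulary, and one word that never appears in that class.

```python
import numpy as np

# Hypothetical word counts for one class over a 4-word vocabulary;
# the last word never appears in this class.
feature_counts = np.array([3, 1, 2, 0])    # N_{y,i}
alpha = 1.0                                 # Laplace smoothing parameter
n = feature_counts.size                     # number of features

unsmoothed = feature_counts / feature_counts.sum()
smoothed = (feature_counts + alpha) / (feature_counts.sum() + alpha * n)

print(unsmoothed)   # contains an exact zero
print(smoothed)     # every entry is strictly positive
```

Without smoothing, a single unseen word would force the whole product of likelihoods to zero for that class; with $\alpha = 1$ the estimates remain a valid distribution that sums to 1.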

### Numerical Stability: Log Probabilities

Multiplying many small probabilities can lead to **numerical underflow**. To avoid this, we work with **log probabilities**:

$$\log P(y|X) \propto \log P(y) + \sum_{i=1}^{n} \log P(x_i|y)$$

For Gaussian likelihood, the log probability becomes:

$$\log P(x_i|y) = -\frac{1}{2}\log(2\pi\sigma_{y,i}^2) - \frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}$$
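
The underflow problem is easy to reproduce. The sketch below multiplies 200 hypothetical likelihoods of 0.01 each and compares the result with the equivalent sum of logs.

```python
import numpy as np

# 200 hypothetical per-feature likelihoods, each equal to 0.01
probs = np.full(200, 0.01)

# The true product is 1e-400, which is below the smallest positive float64
raw_product = np.prod(probs)

# The same quantity in log space: 200 * log(0.01)
log_sum = np.sum(np.log(probs))

print(raw_product)
print(log_sum)
```

The raw product collapses to zero, so every class would tie at 0, while the log-space score stays a finite number that can still be compared across classes.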

## When to Use Naive Bayes

Naive Bayes is a fast, probabilistic classifier that excels in specific scenarios. Understanding when to use it is crucial for effective model selection.

### ✅ Use Naive Bayes When:

**1. Text Classification Problems**
- Spam detection, sentiment analysis, document categorization
- High-dimensional sparse features (thousands of words)
- Example: Email with 10,000 vocabulary words, most zero → Perfect for Multinomial NB

**2. Need Fast Training and Prediction**
- Training is O(N × d) - just counting and computing means
- Prediction is O(K × d) per sample for K classes - no iterative optimization
- Example: Real-time spam filtering processing millions of emails daily

**3. Limited Training Data**
- Naive Bayes often approaches its best achievable accuracy with fewer samples than discriminative models such as logistic regression
- Works well with hundreds of samples
- Example: Medical diagnosis with only 200 patient records

**4. Features Are (Approximately) Conditionally Independent**
- Works surprisingly well even when independence is violated
- Best when features provide complementary information
- Example: Different medical tests measuring different aspects of health

**5. Need Probabilistic Outputs**
- Provides P(y|x) directly from Bayes theorem
- Useful for ranking or threshold tuning
- Example: Prioritize emails by spam probability, not just spam/not-spam

**6. Baseline Model for Comparison**
- Quick to implement and train
- Establishes performance floor for complex models
- Example: Before trying deep learning, check if Naive Bayes gets 90% accuracy

### ❌ Don't Use Naive Bayes When:

**1. Features Are Highly Correlated**
- Independence assumption severely violated
- Correlated features get "double-counted"
- **Better alternatives**: Logistic Regression, Random Forest, or PCA first
- Example: Using both "temperature in Celsius" and "temperature in Fahrenheit"
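
The "double-counting" effect can be shown with a small calculation. This sketch assumes equal priors and a single feature that is N(0, 1) under class 0 and N(1, 1) under class 1; these parameters and the helper names are hypothetical. It then feeds the same measurement to Naive Bayes twice, as if it were two independent features.

```python
import numpy as np

def gaussian_log_pdf(x, mu, var):
    """Log-density of N(mu, var) at x (illustrative helper)."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def posterior_class1(x_values):
    """P(y=1 | x) under Naive Bayes with equal priors.

    Every feature is assumed N(0, 1) for class 0 and N(1, 1) for class 1.
    """
    log_joint_0 = np.log(0.5) + sum(gaussian_log_pdf(x, 0.0, 1.0) for x in x_values)
    log_joint_1 = np.log(0.5) + sum(gaussian_log_pdf(x, 1.0, 1.0) for x in x_values)
    return 1.0 / (1.0 + np.exp(log_joint_0 - log_joint_1))

x = 1.5
p_single = posterior_class1([x])           # feature used once
p_duplicated = posterior_class1([x, x])    # same measurement counted twice

print(p_single, p_duplicated)
```

The duplicated copy carries no new information, yet the posterior moves further from 0.5 because the same evidence is multiplied in twice.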

**2. Need to Capture Feature Interactions**
- Cannot learn "A AND B" patterns
- Each feature contributes independently
- **Better alternatives**: Decision Trees, Neural Networks
- Example: XOR problem - (0,0)→0, (1,1)→0, (0,1)→1, (1,0)→1
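
The XOR case can be checked directly with scikit-learn's `GaussianNB`. In this minimal sketch both classes end up with identical per-feature means and variances, so the model has nothing to distinguish them.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# The four XOR points from the example above
X_xor = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y_xor = np.array([0, 0, 1, 1])

model = GaussianNB().fit(X_xor, y_xor)

# Each feature has mean 0.5 and the same variance in both classes,
# so the per-feature likelihoods are identical across classes.
proba = model.predict_proba(X_xor)

print(model.theta_)
print(proba)
```

Every point receives the same probability for both classes: the class signal lives entirely in the interaction between the two features, which the per-feature factorization cannot represent.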

**3. Continuous Features Don't Follow Gaussian Distribution**
- Gaussian NB assumes normal distribution per class
- Multimodal or heavily skewed data violates this
- **Better alternatives**: Transform features, use kernel density estimation, or different model
- Example: Income data (highly right-skewed)

**4. Need Well-Calibrated Probabilities**
- Naive Bayes probabilities are often overconfident
- Pushes probabilities toward 0 or 1
- **Better alternatives**: Logistic Regression, or apply Platt scaling
- Example: When 0.7 predicted probability should actually mean 70% success rate
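
One way to apply Platt scaling in scikit-learn is `CalibratedClassifierCV` with `method="sigmoid"`. The sketch below uses a synthetic dataset with redundant features; the dataset settings and the 0.05/0.95 thresholds are illustrative choices, and no particular outcome is claimed for other data.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant (correlated) features
X_cal, y_cal = make_classification(
    n_samples=1000, n_features=10, n_informative=3, n_redundant=5,
    random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_cal, y_cal, test_size=0.3, random_state=0
)

raw_nb = GaussianNB().fit(X_tr, y_tr)

# method="sigmoid" is Platt scaling; cv=5 fits the calibrator on held-out folds
calibrated_nb = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated_nb.fit(X_tr, y_tr)

raw_proba = raw_nb.predict_proba(X_te)[:, 1]
cal_proba = calibrated_nb.predict_proba(X_te)[:, 1]

def extreme_fraction(p):
    """Share of predictions below 0.05 or above 0.95."""
    return np.mean((p < 0.05) | (p > 0.95))

print(f"Extreme predictions (raw):        {extreme_fraction(raw_proba):.2f}")
print(f"Extreme predictions (calibrated): {extreme_fraction(cal_proba):.2f}")
```

Comparing the two fractions gives a rough sense of how strongly the raw model pushes probabilities toward 0 or 1; a reliability curve would be the more thorough check.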

**5. Complex Decision Boundaries Required**
- Multinomial and Bernoulli NB produce linear decision boundaries; Gaussian NB produces at most quadratic ones
- Cannot capture more intricate non-linear patterns
- **Better alternatives**: SVM with RBF kernel, Neural Networks
- Example: Two interleaved spirals

### Quick Decision Tree

```
Is it a text classification problem?
├─ Yes → Multinomial NB (excellent choice!)
└─ No
    ├─ Are features continuous and roughly Gaussian?
    │   ├─ Yes → Gaussian NB (good choice)
    │   └─ No → Consider other models
    └─ Are features binary (0/1)?
        └─ Yes → Bernoulli NB
```

### Comparison: Naive Bayes vs Other Classifiers

| Criterion | Naive Bayes | Logistic Regression | Decision Trees | SVM |
|-----------|-------------|---------------------|----------------|-----|
| **Training speed** | ✅ Very fast | ✅ Fast | ✅ Fast | ⚠️ Slow |
| **Prediction speed** | ✅ Very fast | ✅ Very fast | ✅ Fast | ⚠️ Slow |
| **Text classification** | ✅ Excellent | ✅ Good | ⚠️ Poor | ✅ Good |
| **Small datasets** | ✅ Excellent | ⚠️ Moderate | ⚠️ Overfits | ✅ Good |
| **Correlated features** | ❌ Poor | ✅ Good | ✅ Good | ✅ Good |
| **Interpretability** | ✅ Good | ✅ Excellent | ✅ Excellent | ❌ Poor |
| **Probability calibration** | ❌ Poor | ✅ Good | ⚠️ Moderate | ❌ Poor |
| **Feature interactions** | ❌ Cannot learn | ❌ Manual only | ✅ Automatic | ⚠️ Kernel only |

### Choosing the Right Naive Bayes Variant

| Variant | Feature Type | Use Case | Example |
|---------|--------------|----------|---------|
| **Gaussian NB** | Continuous (real numbers) | General classification | Iris flowers, medical measurements |
| **Multinomial NB** | Counts/frequencies | Text classification | Word counts, TF-IDF |
| **Bernoulli NB** | Binary (0/1) | Binary features | Word presence (not count) |
| **Complement NB** | Counts (imbalanced) | Imbalanced text data | Rare category detection |

### Real-World Applications Where Naive Bayes Excels:

1. **Spam Detection**: High-dimensional word features, need fast prediction, works great!
2. **Sentiment Analysis**: Positive/negative classification from text reviews
3. **Document Categorization**: News articles into topics (sports, politics, tech)
4. **Medical Diagnosis**: Symptoms as features, diseases as classes
5. **Recommendation Systems**: "Users who liked X also liked Y" patterns
6. **Real-time Classification**: When latency matters (milliseconds prediction time)

### The Bottom Line:

**Choose Naive Bayes when:**
- Text classification or high-dimensional sparse data
- Need fast training and prediction
- Have limited training data
- Want a simple, interpretable baseline

**Consider alternatives when:**
- Features are highly correlated
- Need to capture feature interactions
- Require well-calibrated probabilities
- Decision boundary is highly non-linear

## Pseudocode: Gaussian Naive Bayes

```
TRAINING:
1. For each class y in classes:
   a. Calculate prior: P(y) = count(y) / total_samples
   b. For each feature i:
      - Calculate mean: μ_yi = mean of feature i where class = y
      - Calculate variance: σ²_yi = variance of feature i where class = y

PREDICTION:
1. For each class y:
   a. Start with log_prob = log(P(y))  # log prior
   b. For each feature i:
      - Add log(P(x_i|y)) using Gaussian PDF
   c. Store total log_prob for class y
2. Return class with highest log probability
```

## Pseudocode: Multinomial Naive Bayes

```
# Multinomial Naive Bayes — For Text/Count Data
# Inputs
# X ← document-term matrix (N documents × V vocabulary)
# y ← class labels
# α ← Laplace smoothing parameter (default: 1)
# X_query ← documents to classify

# ----- fit -----
classes ← unique(y)
V ← number_of_columns(X)        # vocabulary size

FOR each class c in classes DO
    # Prior probability
    prior[c] ← count(y == c) / N
    
    # Get all documents of class c
    X_c ← X[y == c]
    
    # Count total words per feature in class c
    feature_counts[c] ← sum(X_c, axis=0)    # shape: (V,)
    total_count[c] ← sum(feature_counts[c])
    
    # Apply Laplace smoothing
    # P(word_i | class c) = (count_i + α) / (total + α × V)
    log_prob[c] ← log((feature_counts[c] + α) / (total_count[c] + α × V))
END FOR

# ----- predict -----
FOR each document d in X_query DO
    FOR each class c in classes DO
        # Log posterior = log prior + sum of (word_count × log_prob)
        score[c] ← log(prior[c]) + dot(d, log_prob[c])
    END FOR
    prediction[d] ← argmax(score)
END FOR

RETURN predictions
```
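
The pseudocode above translates almost line for line into NumPy. The following is a minimal sketch rather than a reference implementation; the function names and the toy document-term matrix are hypothetical.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Return classes, log priors, and smoothed log feature probabilities."""
    classes = np.unique(y)
    V = X.shape[1]                                  # vocabulary size
    log_prior = np.zeros(len(classes))
    log_prob = np.zeros((len(classes), V))
    for k, c in enumerate(classes):
        X_c = X[y == c]                             # documents of class c
        log_prior[k] = np.log(X_c.shape[0] / X.shape[0])
        feature_counts = X_c.sum(axis=0)            # shape: (V,)
        log_prob[k] = np.log(
            (feature_counts + alpha) / (feature_counts.sum() + alpha * V)
        )
    return classes, log_prior, log_prob

def predict_multinomial_nb(X_query, classes, log_prior, log_prob):
    # score[d, c] = log prior[c] + dot(d, log_prob[c])
    scores = X_query @ log_prob.T + log_prior
    return classes[np.argmax(scores, axis=1)]

# Toy data; hypothetical vocabulary: ["free", "money", "meeting", "report"]
X_toy = np.array([[2, 1, 0, 0],
                  [1, 2, 0, 0],
                  [0, 0, 2, 1],
                  [0, 0, 1, 2]])
y_toy = np.array([1, 1, 0, 0])                      # 1 = spam, 0 = not spam

params = fit_multinomial_nb(X_toy, y_toy)
X_new = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1]])
pred = predict_multinomial_nb(X_new, *params)
print(pred)
```

The matrix product `X_query @ log_prob.T` computes the dot product for every document and class at once, replacing the two nested loops of the pseudocode.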

**Key Differences from Gaussian NB:**

| Aspect | Gaussian NB | Multinomial NB |
|--------|-------------|----------------|
| **Feature type** | Continuous | Counts/frequencies |
| **Distribution** | Gaussian (μ, σ²) | Multinomial |
| **Parameters stored** | Mean, variance per feature | Log probability per feature |
| **Smoothing** | var_smoothing (numerical stability) | α (Laplace, prevents zero prob) |
| **Likelihood** | Gaussian PDF | Word count × log probability |

---

## Checkpoint 1: Test Your Understanding

### Question 1

What does the "naive" assumption in Naive Bayes refer to?

A) Features are assumed to be independent given the class label  
B) Features are assumed to be identically distributed across all classes  
C) Each feature contributes equally to the classification decision  
D) The prior probabilities are assumed to be equal for all classes

<details>
<summary style="cursor: pointer; font-weight: bold;">Click here for Answer</summary>

**Answer: A) Features are assumed to be independent given the class label**

The "naive" assumption refers to **conditional independence**: given the class label, all features are assumed to be independent of each other. Mathematically: P(X|y) = ∏P(xᵢ|y). This allows us to compute P(X|y) by simply multiplying individual feature probabilities.

**Example with numbers:**

For spam classification with features [contains "free", contains "money"]:
- **With independence assumption**: P("free", "money" | spam) = P("free" | spam) × P("money" | spam) = 0.8 × 0.6 = 0.48
- **Without assumption**: Would need P("free", "money" | spam) directly from data, requiring exponentially more samples

This simplification reduces parameters from O(2ⁿ) to O(n) for n binary features!
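
A short sketch of that parameter count, per class, for n binary features. The function names are illustrative; the joint table needs one probability per feature combination (minus one, since they sum to 1), while the naive model needs one per feature.

```python
def joint_param_count(n):
    """Free parameters per class for a full joint table over n binary features."""
    return 2 ** n - 1

def naive_param_count(n):
    """Free parameters per class when each binary feature is modeled separately."""
    return n

for n in (2, 10, 30):
    print(n, joint_param_count(n), naive_param_count(n))

# The two-word example above under the independence assumption
p_joint = 0.8 * 0.6
print(p_joint)
```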

**Why other answers are incorrect:**

- **B) Features are assumed to be identically distributed across all classes**: This is incorrect. Naive Bayes explicitly models *different* distributions for each class - that's the whole point. Example: P("free" | spam) = 0.8 but P("free" | not_spam) = 0.1. The mean and variance (Gaussian NB) or word frequencies (Multinomial NB) are computed separately for each class.

- **C) Each feature contributes equally to the classification decision**: This is incorrect. Features have very different contributions. Example: If P("free" | spam) = 0.8 and P("free" | not_spam) = 0.1, the word "free" strongly indicates spam (ratio 8:1). But if P("the" | spam) = 0.9 and P("the" | not_spam) = 0.85, "the" barely helps (ratio ~1:1).

- **D) The prior probabilities are assumed to be equal for all classes**: This is incorrect. Naive Bayes explicitly computes priors from training data. Example: If 30% of training emails are spam, P(spam) = 0.3 and P(not_spam) = 0.7. These priors directly influence predictions.

</details>

---

# Part 1: Gaussian Naive Bayes Implementation

Let's implement Gaussian Naive Bayes step by step.

## Import Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB as SklearnGaussianNB
from sklearn.naive_bayes import MultinomialNB as SklearnMultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

## Generate Sample Data

In [None]:
# Generate a 2D classification dataset for visualization
X, y = make_classification(
    n_samples=300,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=2.0,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"Number of features: {X_train.shape[1]}")
print(f"Classes: {np.unique(y_train)}")

In [None]:
# Visualize the data
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.scatter(X_train[y_train == 0][:, 0], X_train[y_train == 0][:, 1], 
            c='blue', label='Class 0', alpha=0.6, edgecolors='k')
plt.scatter(X_train[y_train == 1][:, 0], X_train[y_train == 1][:, 1], 
            c='red', label='Class 1', alpha=0.6, edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Training Data')
plt.legend()

plt.subplot(1, 2, 2)
plt.scatter(X_test[y_test == 0][:, 0], X_test[y_test == 0][:, 1], 
            c='blue', label='Class 0', alpha=0.6, edgecolors='k')
plt.scatter(X_test[y_test == 1][:, 0], X_test[y_test == 1][:, 1], 
            c='red', label='Class 1', alpha=0.6, edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Test Data')
plt.legend()

plt.tight_layout()
plt.show()

---

## Exercise 1: Calculate Class Statistics

In this exercise, you'll implement methods to calculate the **prior probabilities** and **class statistics** (mean and variance) needed for Gaussian Naive Bayes.

**Your tasks:**
1. Calculate the prior probability for each class
2. Calculate the mean of each feature for each class
3. Calculate the variance of each feature for each class

In [None]:
class GaussianNaiveBayes:
    """
    Gaussian Naive Bayes classifier implementation from scratch.
    
    Parameters
    ----------
    var_smoothing : float, default=1e-9
        Small constant added to every per-class variance for numerical
        stability. (scikit-learn instead adds this fraction of the largest
        feature variance.)
    """
    
    def __init__(self, var_smoothing=1e-9):
        self.var_smoothing = var_smoothing
        self.classes_ = None
        self.priors_ = None      # Prior probabilities for each class
        self.theta_ = None       # Mean of each feature per class
        self.var_ = None         # Variance of each feature per class
    
    def _calculate_priors(self, y):
        """
        Calculate the prior probability of each class.
        
        Prior P(y) = count(y) / total_samples
        
        Parameters
        ----------
        y : array-like of shape (n_samples,)
            Target values.
            
        Returns
        -------
        priors : array of shape (n_classes,)
            Prior probability for each class.
        """
        # TODO: Calculate the prior probability for each class
        # Hint: For each class, divide the count of samples in that class
        # by the total number of samples
        
        priors = None  # Replace with your implementation
        
        return priors
    
    def _calculate_class_statistics(self, X, y):
        """
        Calculate mean and variance of each feature for each class.
        
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data.
        y : array-like of shape (n_samples,)
            Target values.
            
        Returns
        -------
        theta : array of shape (n_classes, n_features)
            Mean of each feature per class.
        var : array of shape (n_classes, n_features)
            Variance of each feature per class.
        """
        n_features = X.shape[1]
        n_classes = len(self.classes_)
        
        theta = np.zeros((n_classes, n_features))
        var = np.zeros((n_classes, n_features))
        
        # TODO: For each class, calculate the mean and variance of each feature
        # Hint: Filter X to only include samples of each class, then compute statistics
        
        for idx, c in enumerate(self.classes_):
            # Get samples belonging to class c
            X_c = None  # TODO: Filter X for samples where y == c
            
            # Calculate mean for each feature
            theta[idx, :] = None  # TODO: Calculate mean along axis 0
            
            # Calculate variance for each feature
            var[idx, :] = None  # TODO: Calculate variance along axis 0
        
        return theta, var

### Verification Cell for Exercise 1

Run this cell to verify your implementation of priors and class statistics calculation.

In [None]:
# Test the priors and class statistics calculation
gnb_test = GaussianNaiveBayes()
gnb_test.classes_ = np.unique(y_train)

# Test priors
priors = gnb_test._calculate_priors(y_train)
print("Prior Probabilities:")
if priors is not None:
    for i, c in enumerate(gnb_test.classes_):
        print(f"  P(y={c}) = {priors[i]:.4f}")
    
    # Verify priors sum to 1
    assert np.isclose(priors.sum(), 1.0), "Priors should sum to 1!"
    print("\n✓ Priors sum to 1.0")
else:
    print("  Not implemented yet")

print("\n" + "="*50 + "\n")

# Test class statistics
theta, var = gnb_test._calculate_class_statistics(X_train, y_train)
print("Class Statistics:")
if theta is not None and var is not None:
    for i, c in enumerate(gnb_test.classes_):
        print(f"\nClass {c}:")
        print(f"  Mean (θ): {theta[i]}")
        print(f"  Variance (σ²): {var[i]}")
    
    # Verify shape
    assert theta.shape == (len(gnb_test.classes_), X_train.shape[1]), "Theta shape incorrect!"
    assert var.shape == (len(gnb_test.classes_), X_train.shape[1]), "Variance shape incorrect!"
    print("\n✓ Class statistics shapes are correct")
else:
    print("  Not implemented yet")

<details>
<summary style="cursor: pointer; font-weight: bold;">💡 Click here for Exercise 1 Solution</summary>

```python
def _calculate_priors(self, y):
    priors = np.array([np.sum(y == c) / len(y) for c in self.classes_])
    return priors

def _calculate_class_statistics(self, X, y):
    n_features = X.shape[1]
    n_classes = len(self.classes_)
    
    theta = np.zeros((n_classes, n_features))
    var = np.zeros((n_classes, n_features))
    
    for idx, c in enumerate(self.classes_):
        # Get samples belonging to class c
        X_c = X[y == c]
        
        # Calculate mean for each feature
        theta[idx, :] = X_c.mean(axis=0)
        
        # Calculate variance for each feature
        var[idx, :] = X_c.var(axis=0)
    
    return theta, var
```

**Explanation:**
- **Priors**: For each class, we count how many samples belong to that class and divide by total samples
- **Mean (θ)**: Average value of each feature for samples in each class
- **Variance (σ²)**: Spread of each feature for samples in each class
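
Tracing the same computations on a tiny hand-checkable array can help confirm the logic. The three-sample dataset below is made up for illustration.

```python
import numpy as np

X_small = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [10.0, 20.0]])
y_small = np.array([0, 0, 1])

classes = np.unique(y_small)

# Prior: fraction of samples in each class
priors = np.array([np.mean(y_small == c) for c in classes])

# Per-class mean and variance of each feature
theta = np.array([X_small[y_small == c].mean(axis=0) for c in classes])
var = np.array([X_small[y_small == c].var(axis=0) for c in classes])

print(priors)
print(theta)
print(var)
```

Two details are worth noticing: `np.var` uses the population formula (`ddof=0`) by default, and class 1 has a single sample, so its variance is exactly zero. That zero is the reason a small smoothing term is added to the variances before they are used in the Gaussian formula.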

</details>

---

## Exercise 2: Calculate Gaussian Log-Likelihood

Now implement the method to calculate the **log-likelihood** of observing features given a class, using the Gaussian probability density function.

**Formula:**
$$\log P(x_i|y) = -\frac{1}{2}\log(2\pi\sigma_{y,i}^2) - \frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}$$
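
Before implementing the method, it can help to check that the formula agrees with taking the log of the density, and to see why the formula is preferred. The values below are arbitrary illustrative numbers.

```python
import numpy as np

x, mu, var = 1.3, 0.5, 2.0   # arbitrary illustrative values

# Route 1: evaluate the density, then take its log
pdf = np.exp(-((x - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
log_pdf_direct = np.log(pdf)

# Route 2: the closed-form log-density from the formula above
log_pdf_formula = -0.5 * np.log(2 * np.pi * var) - ((x - mu) ** 2) / (2 * var)

# Far from the mean, route 1 breaks down: the density underflows to 0
x_far = 100.0
with np.errstate(divide="ignore"):
    log_direct_far = np.log(np.exp(-(x_far ** 2) / 2) / np.sqrt(2 * np.pi))
log_formula_far = -0.5 * np.log(2 * np.pi) - (x_far ** 2) / 2

print(log_pdf_direct, log_pdf_formula)
print(log_direct_far, log_formula_far)
```

Both routes agree for moderate values, but for a point far in the tail the direct route returns negative infinity while the closed form still returns a finite, comparable score.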

In [None]:
class GaussianNaiveBayes:
    """
    Gaussian Naive Bayes classifier implementation from scratch.
    """
    
    def __init__(self, var_smoothing=1e-9):
        self.var_smoothing = var_smoothing
        self.classes_ = None
        self.priors_ = None
        self.theta_ = None
        self.var_ = None
    
    def _calculate_priors(self, y):
        """Calculate prior probabilities."""
        priors = np.array([np.sum(y == c) / len(y) for c in self.classes_])
        return priors
    
    def _calculate_class_statistics(self, X, y):
        """Calculate mean and variance for each class."""
        n_features = X.shape[1]
        n_classes = len(self.classes_)
        
        theta = np.zeros((n_classes, n_features))
        var = np.zeros((n_classes, n_features))
        
        for idx, c in enumerate(self.classes_):
            X_c = X[y == c]
            theta[idx, :] = X_c.mean(axis=0)
            var[idx, :] = X_c.var(axis=0)
        
        return theta, var
    
    def _calculate_log_likelihood(self, X):
        """
        Calculate log-likelihood of X for each class using Gaussian PDF.
        
        Log P(x_i|y) = -0.5 * log(2π * σ²) - (x_i - μ)² / (2σ²)
        
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Input samples.
            
        Returns
        -------
        log_likelihood : array of shape (n_samples, n_classes)
            Log-likelihood for each sample and each class.
        """
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        
        log_likelihood = np.zeros((n_samples, n_classes))
        
        # TODO: Calculate log-likelihood for each class
        # For each class:
        # 1. Calculate the log of the Gaussian PDF for each feature
        # 2. Sum across features (naive assumption - features are independent)
        
        for idx in range(n_classes):
            # Get mean and variance for this class
            mean = self.theta_[idx]  # shape: (n_features,)
            var = self.var_[idx]     # shape: (n_features,)
            
            # TODO: Calculate log-likelihood using Gaussian PDF formula
            # Hint: Use np.log for logarithm, np.pi for π
            # The formula is: -0.5 * log(2π * σ²) - (x - μ)² / (2σ²)
            # Sum across features for each sample
            
            log_likelihood[:, idx] = None  # Replace with your implementation
        
        return log_likelihood

### Verification Cell for Exercise 2

Run this cell to verify your log-likelihood implementation.

In [None]:
# Test log-likelihood calculation
gnb_test = GaussianNaiveBayes(var_smoothing=1e-9)
gnb_test.classes_ = np.unique(y_train)
gnb_test.theta_, gnb_test.var_ = gnb_test._calculate_class_statistics(X_train, y_train)

# Add smoothing to variance
gnb_test.var_ = gnb_test.var_ + gnb_test.var_smoothing

# Calculate log-likelihood for test samples
log_likelihood = gnb_test._calculate_log_likelihood(X_test[:5])

print("Log-Likelihood for first 5 test samples:")
if log_likelihood is not None and not np.any(np.isnan(log_likelihood)):
    print(f"Shape: {log_likelihood.shape}")
    print(f"\nLog-likelihood values:")
    for i in range(5):
        print(f"  Sample {i}: Class 0 = {log_likelihood[i, 0]:.4f}, Class 1 = {log_likelihood[i, 1]:.4f}")
    
    # Verify shape
    assert log_likelihood.shape == (5, 2), "Log-likelihood shape incorrect!"
    # Verify no NaN or Inf values
    assert not np.any(np.isnan(log_likelihood)), "Log-likelihood contains NaN!"
    assert not np.any(np.isinf(log_likelihood)), "Log-likelihood contains Inf!"
    print("\n✓ Log-likelihood implementation looks correct")
else:
    print("  Not implemented yet")

---

## Checkpoint 2: Test Your Understanding

### Question 2

Why do we use log probabilities instead of raw probabilities in Naive Bayes?

A) Log transformation normalizes the feature distributions to be Gaussian  
B) Logarithms convert the product of probabilities into a sum, preventing numerical underflow  
C) Log probabilities allow the model to handle negative feature values  
D) Using logs reduces the computational complexity from O(n²) to O(n)

<details>
<summary style="cursor: pointer; font-weight: bold;">Click here for Answer</summary>

**Answer: B) Logarithms convert the product of probabilities into a sum, preventing numerical underflow**

When multiplying many small probabilities (like P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)), the result can become astronomically small, causing numerical underflow to zero.

**Example with numbers:**

Consider classifying a document with 200 words, each with P(word|class) ≈ 0.01:
- **Raw probabilities**: 0.01²⁰⁰ = 10⁻⁴⁰⁰ → **Underflows to 0!** (the smallest positive 64-bit float is about 5 × 10⁻³²⁴)
- **Log probabilities**: 200 × log(0.01) ≈ 200 × (-4.605) ≈ -921 → **Computable!**

The classification comparison still works:
```
Class A: log(0.01) × 200 ≈ -921
Class B: log(0.02) × 200 ≈ -782  ← Winner (less negative)
```

Using log probabilities:
- Converts multiplication to addition: log(a × b) = log(a) + log(b)
- Keeps values in a manageable numerical range (e.g., -921 instead of 10⁻⁴⁰⁰)
- Preserves the relative ordering needed for classification (log is monotonic)
- Preserves the relative ordering needed for classification (log is monotonic)

**Why other answers are incorrect:**

- **A) Log transformation normalizes the feature distributions to be Gaussian**: This is incorrect. Log transformation of probabilities has nothing to do with making features Gaussian. Example: If P(word|spam) follows any distribution, taking log just changes scale, not shape. Log transformations of *features* (not probabilities) can sometimes help with skewed data, but that's a different concept.

- **C) Log probabilities allow the model to handle negative feature values**: This is incorrect. Taking logs concerns the likelihood values (which are always positive, so their logarithm is defined), not the sign of the features. Example: Gaussian NB naturally handles x = -5 because it uses (x - μ)² in the PDF. Multinomial NB requires non-negative counts regardless of whether logs are used.

- **D) Using logs reduces the computational complexity from O(n²) to O(n)**: This is incorrect. The computational complexity remains O(n) for n features with or without logs. We still compute n terms and sum them. Log is applied element-wise: log(p₁) + log(p₂) + ... + log(pₙ) is still O(n) operations.

</details>

<details>
<summary style="cursor: pointer; font-weight: bold;">💡 Click here for Exercise 2 Solution</summary>

```python
def _calculate_log_likelihood(self, X):
    n_samples = X.shape[0]
    n_classes = len(self.classes_)
    
    log_likelihood = np.zeros((n_samples, n_classes))
    
    for idx in range(n_classes):
        mean = self.theta_[idx]
        var = self.var_[idx]
        
        # Log of Gaussian PDF: -0.5 * log(2π * σ²) - (x - μ)² / (2σ²)
        # Sum across features (naive assumption)
        log_likelihood[:, idx] = np.sum(
            -0.5 * np.log(2 * np.pi * var) - ((X - mean) ** 2) / (2 * var),
            axis=1
        )
    
    return log_likelihood
```

**Explanation:**
- We compute the log of the Gaussian PDF for each feature
- The naive assumption allows us to sum log-probabilities across features
- Broadcasting handles the vectorized computation efficiently
- `axis=1` sums across features for each sample
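
The broadcasting step is worth seeing with explicit shapes. This sketch uses a made-up 3 × 2 input and one class's mean and variance vectors to show how the per-feature terms are formed and then reduced to one score per sample.

```python
import numpy as np

X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])       # shape (3, 2): 3 samples, 2 features
mean = np.array([1.0, 1.0])           # shape (2,): one mean per feature
var = np.array([1.0, 4.0])            # shape (2,): one variance per feature

# (3, 2) - (2,) broadcasts the mean across all rows
per_feature = -0.5 * np.log(2 * np.pi * var) - ((X_demo - mean) ** 2) / (2 * var)

# Summing over axis=1 collapses the feature dimension
per_sample = per_feature.sum(axis=1)

print(per_feature.shape)   # one term per sample and feature
print(per_sample.shape)    # one log-likelihood per sample
```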

</details>

---

## Exercise 3: Complete the Gaussian Naive Bayes Classifier

Now implement the complete `fit` and `predict` methods to finish the Gaussian Naive Bayes classifier.

In [None]:
class GaussianNaiveBayes:
    """
    Gaussian Naive Bayes classifier implementation from scratch.
    """
    
    def __init__(self, var_smoothing=1e-9):
        self.var_smoothing = var_smoothing
        self.classes_ = None
        self.priors_ = None
        self.theta_ = None
        self.var_ = None
    
    def _calculate_priors(self, y):
        """Calculate prior probabilities."""
        return np.array([np.sum(y == c) / len(y) for c in self.classes_])
    
    def _calculate_class_statistics(self, X, y):
        """Calculate mean and variance for each class."""
        n_features = X.shape[1]
        n_classes = len(self.classes_)
        
        theta = np.zeros((n_classes, n_features))
        var = np.zeros((n_classes, n_features))
        
        for idx, c in enumerate(self.classes_):
            X_c = X[y == c]
            theta[idx, :] = X_c.mean(axis=0)
            var[idx, :] = X_c.var(axis=0)
        
        return theta, var
    
    def _calculate_log_likelihood(self, X):
        """Calculate log-likelihood using Gaussian PDF."""
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        log_likelihood = np.zeros((n_samples, n_classes))
        
        for idx in range(n_classes):
            mean = self.theta_[idx]
            var = self.var_[idx]
            log_likelihood[:, idx] = np.sum(
                -0.5 * np.log(2 * np.pi * var) - ((X - mean) ** 2) / (2 * var),
                axis=1
            )
        
        return log_likelihood
    
    def fit(self, X, y):
        """
        Fit the Gaussian Naive Bayes classifier.
        
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data.
        y : array-like of shape (n_samples,)
            Target values.
            
        Returns
        -------
        self : object
            Fitted estimator.
        """
        # TODO: Implement the fit method
        # 1. Store unique classes
        # 2. Calculate prior probabilities
        # 3. Calculate class statistics (mean and variance)
        # 4. Apply variance smoothing for numerical stability
        
        # Store unique classes
        self.classes_ = None  # TODO
        
        # Calculate priors
        self.priors_ = None  # TODO
        
        # Calculate class statistics
        self.theta_, self.var_ = None, None  # TODO
        
        # Apply variance smoothing
        # TODO: Add var_smoothing to variance to prevent division by zero
        
        return self
    
    def predict(self, X):
        """
        Predict class labels for samples in X.
        
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Samples to predict.
            
        Returns
        -------
        y_pred : array of shape (n_samples,)
            Predicted class labels.
        """
        # TODO: Implement the predict method
        # 1. Calculate log priors
        # 2. Calculate log likelihoods
        # 3. Combine: log_posterior ∝ log_prior + log_likelihood
        # 4. Return the class with highest log posterior for each sample
        
        # Calculate log priors (same for all samples)
        log_priors = None  # TODO: Use np.log on priors
        
        # Calculate log likelihoods
        log_likelihood = None  # TODO
        
        # Combine log prior and log likelihood
        log_posterior = None  # TODO: Add log_priors to log_likelihood
        
        # Return class with highest log posterior
        return None  # TODO: Use self.classes_ and np.argmax
    
    def predict_proba(self, X):
        """
        Return probability estimates for samples in X.
        
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Samples.
            
        Returns
        -------
        proba : array of shape (n_samples, n_classes)
            Probability of each class for each sample.
        """
        log_priors = np.log(self.priors_)
        log_likelihood = self._calculate_log_likelihood(X)
        log_posterior = log_priors + log_likelihood
        
        # Convert log probabilities to probabilities using softmax
        # Subtract max for numerical stability
        log_posterior = log_posterior - np.max(log_posterior, axis=1, keepdims=True)
        posterior = np.exp(log_posterior)
        return posterior / posterior.sum(axis=1, keepdims=True)

### Verification Cell for Exercise 3

Run this cell to verify your complete Gaussian Naive Bayes implementation.

In [None]:
# Test the complete implementation
gnb = GaussianNaiveBayes(var_smoothing=1e-9)
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

if y_pred is not None:
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Gaussian Naive Bayes Accuracy: {accuracy:.4f}")
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    
    # Compare with sklearn
    sklearn_gnb = SklearnGaussianNB(var_smoothing=1e-9)
    sklearn_gnb.fit(X_train, y_train)
    sklearn_pred = sklearn_gnb.predict(X_test)
    sklearn_accuracy = accuracy_score(y_test, sklearn_pred)
    
    print(f"\nScikit-learn GaussianNB Accuracy: {sklearn_accuracy:.4f}")
    
    if np.isclose(accuracy, sklearn_accuracy, atol=0.01):
        print("\n✓ Your implementation matches scikit-learn!")
    else:
        print(f"\n⚠ Accuracy differs from sklearn by {abs(accuracy - sklearn_accuracy):.4f}")
else:
    print("Prediction not implemented yet")

<details>
<summary style="cursor: pointer; font-weight: bold;">💡 Click here for Exercise 3 Solution</summary>

```python
def fit(self, X, y):
    # Store unique classes
    self.classes_ = np.unique(y)
    
    # Calculate priors
    self.priors_ = self._calculate_priors(y)
    
    # Calculate class statistics
    self.theta_, self.var_ = self._calculate_class_statistics(X, y)
    
    # Apply variance smoothing for numerical stability
    self.var_ = self.var_ + self.var_smoothing
    
    return self

def predict(self, X):
    # Calculate log priors
    log_priors = np.log(self.priors_)
    
    # Calculate log likelihoods
    log_likelihood = self._calculate_log_likelihood(X)
    
    # Combine: log_posterior ∝ log_prior + log_likelihood
    log_posterior = log_priors + log_likelihood
    
    # Return class with highest log posterior
    return self.classes_[np.argmax(log_posterior, axis=1)]
```

**Explanation:**
- **fit**: Stores classes, computes priors, means, variances, and adds smoothing
- **predict**: Computes log posterior = log prior + log likelihood, returns argmax class
- Using log probabilities avoids numerical underflow from multiplying small numbers
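
The underflow problem is easy to reproduce. The sketch below uses plain Python and made-up numbers (1,000 feature likelihoods of 0.01 each, not taken from the lab's dataset) to compare a direct product with a sum of logs:

```python
import math

# Hypothetical example: 1,000 independent feature likelihoods of 0.01 each
likelihoods = [0.01] * 1000

# Multiplying directly underflows: the true value 1e-2000 is far below the
# smallest positive float (about 5e-324), so the product collapses to 0.0
product = 1.0
for p in likelihoods:
    product *= p

# Summing logs keeps the same information in a comfortable numeric range
log_sum = sum(math.log(p) for p in likelihoods)

print(product)  # 0.0
print(log_sum)  # roughly -4605.17
```

Once every class score has collapsed to 0.0, `argmax` can no longer tell the classes apart, which is why `predict` compares log posteriors instead.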

</details>

---

## Visualize Decision Boundary

In [None]:
def plot_decision_boundary(model, X, y, title="Decision Boundary"):
    """
    Plot the decision boundary of a classifier.
    """
    h = 0.02  # Step size
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.figure(figsize=(10, 6))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], c='blue', 
                label='Class 0', edgecolors='k', alpha=0.7)
    plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], c='red', 
                label='Class 1', edgecolors='k', alpha=0.7)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)
    plt.legend()
    plt.show()

# Plot decision boundary for our implementation
if y_pred is not None:
    plot_decision_boundary(gnb, X_train, y_train, 
                          "Gaussian Naive Bayes Decision Boundary (Our Implementation)")

---

## Impact of Outliers on Gaussian Naive Bayes

Gaussian Naive Bayes estimates the mean (μ) and variance (σ²) of each feature for each class. Since these statistics are sensitive to extreme values, **outliers can significantly distort the model's decision boundary**.

Let's visualize how outliers affect Gaussian NB:

In [None]:
# Demonstrate the impact of outliers on Gaussian Naive Bayes
np.random.seed(42)

# Create clean data
n_samples = 100
X_clean_0 = np.random.randn(n_samples, 2) + np.array([-2, -2])
X_clean_1 = np.random.randn(n_samples, 2) + np.array([2, 2])
X_clean = np.vstack([X_clean_0, X_clean_1])
y_clean = np.array([0] * n_samples + [1] * n_samples)

# Create data with outliers (add extreme points to class 0)
X_with_outliers = X_clean.copy()
outliers = np.array([[8, 8], [9, 7], [7, 9]])  # Extreme outliers in class 0
X_with_outliers = np.vstack([X_with_outliers, outliers])
y_with_outliers = np.append(y_clean, [0, 0, 0])

# Train models
gnb_clean = GaussianNaiveBayes(var_smoothing=1e-9)
gnb_clean.fit(X_clean, y_clean)

gnb_outliers = GaussianNaiveBayes(var_smoothing=1e-9)
gnb_outliers.fit(X_with_outliers, y_with_outliers)

# Visualization function for comparison
def plot_gnb_comparison(X1, y1, model1, title1, X2, y2, model2, title2):
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    for ax, X, y, model, title in [(axes[0], X1, y1, model1, title1), 
                                    (axes[1], X2, y2, model2, title2)]:
        h = 0.1
        x_min, x_max = -6, 12
        y_min, y_max = -6, 12
        
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))
        
        Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        
        ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
        ax.scatter(X[y == 0][:, 0], X[y == 0][:, 1], c='blue', 
                   label='Class 0', edgecolors='k', alpha=0.7)
        ax.scatter(X[y == 1][:, 0], X[y == 1][:, 1], c='red', 
                   label='Class 1', edgecolors='k', alpha=0.7)
        
        # Mark outliers
        if title == title2:
            ax.scatter(outliers[:, 0], outliers[:, 1], c='blue', s=200, 
                      marker='*', edgecolors='yellow', linewidths=2, label='Outliers')
        
        ax.set_xlabel('Feature 1')
        ax.set_ylabel('Feature 2')
        ax.set_title(title)
        ax.legend()
        ax.set_xlim(x_min, x_max)
        ax.set_ylim(y_min, y_max)
    
    plt.tight_layout()
    plt.show()

plot_gnb_comparison(X_clean, y_clean, gnb_clean, 'Clean Data (No Outliers)',
                   X_with_outliers, y_with_outliers, gnb_outliers, 'Data with Outliers')

# Print statistics comparison
print("Class 0 Statistics Comparison:")
print(f"  Without outliers - Mean: {gnb_clean.theta_[0]}, Var: {gnb_clean.var_[0]}")
print(f"  With outliers    - Mean: {gnb_outliers.theta_[0]}, Var: {gnb_outliers.var_[0]}")
print(f"\nNotice how outliers shift the mean and inflate the variance of Class 0!")

### Key Observations on Outliers

**Effects of outliers on Gaussian NB:**

1. **Mean distortion**: Outliers pull the class mean toward them, shifting the decision boundary
2. **Variance inflation**: Outliers increase the variance estimate, making the Gaussian distribution "wider"
3. **Decision boundary shift**: The combined effect can cause significant misclassification of normal points

**Mitigation strategies:**

| Strategy | Description |
|----------|-------------|
| **Outlier removal** | Remove points beyond k standard deviations |
| **Robust statistics** | Use median and MAD instead of mean and variance |
| **Feature transformation** | Apply log transform or winsorization |
| **Different model** | Consider models less sensitive to outliers |

> **Note**: Gaussian NB is particularly vulnerable because both mean and variance are affected. Compare this to k-NN, where only nearby points influence predictions.
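
To get a feel for the "robust statistics" row in the table, the sketch below compares mean/variance with median/MAD on a small, made-up 1-D feature. The values and the `mad` helper are assumptions for illustration only; this is not a full robust Gaussian NB.

```python
import statistics

# Hypothetical 1-D feature values for one class, clustered near -2
clean = [-2.3, -1.9, -2.1, -1.8, -2.0, -2.2, -1.7]
# Same values plus one extreme outlier
with_outlier = clean + [8.0]

def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

for name, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(
        f"{name:13s} mean={statistics.mean(values):6.2f} "
        f"var={statistics.pvariance(values):6.2f} "
        f"median={statistics.median(values):6.2f} "
        f"MAD={mad(values):5.2f}"
    )
```

A single outlier moves the mean by more than one unit and inflates the variance by orders of magnitude, while the median and MAD barely change.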

---

## Checkpoint 3: Test Your Understanding

### Question 3

What is the purpose of `var_smoothing` in Gaussian Naive Bayes?

A) To add regularization that prevents overfitting to the training data  
B) To ensure numerical stability when variance is very small or zero  
C) To standardize features to have unit variance before training  
D) To control the trade-off between model complexity and generalization

<details>
<summary style="cursor: pointer; font-weight: bold;">Click here for Answer</summary>

**Answer: B) To ensure numerical stability when variance is very small or zero**

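`var_smoothing` adds a small value to the computed variance of each feature. In this lab's implementation the value (default 1e-9) is added directly; scikit-learn's `GaussianNB` instead adds `var_smoothing` times the largest feature variance. In the Gaussian PDF formula, we divide by variance (σ²). If variance is zero or very close to zero, this causes division by zero or numerical overflow.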

**Example with numbers:**

Consider a feature where all samples in class "spam" have the same value:
```
Feature "email_length" for spam class: [100, 100, 100, 100, 100]
Computed variance: σ² = 0

Without smoothing:
P(x=105|spam) = 1/(√(2π×0)) × exp(-(105-100)²/(2×0)) = 1/0 × exp(-∞) → ERROR!

With var_smoothing = 1e-9:
σ² = 0 + 1e-9 = 1e-9
P(x=105|spam) = 1/(√(2π×1e-9)) × exp(-(105-100)²/(2×1e-9)) ≈ 0 (underflows in linear space, but the log-density is finite and computable)
```

**Why other answers are incorrect:**

- **A) To add regularization that prevents overfitting to the training data**: While smoothing can have a mild regularization effect, this is not its primary purpose. The default value 1e-9 is far too small to meaningfully regularize. Example: Adding 0.000000001 to a variance of 2.5 doesn't change predictions. Contrast with Laplace smoothing (α=1) in Multinomial NB, which explicitly regularizes.

- **C) To standardize features to have unit variance before training**: This is incorrect. var_smoothing does not standardize features. Example: If feature has variance 100, var_smoothing adds 1e-9, giving 100.000000001 - not unit variance! StandardScaler (subtracting mean, dividing by std) standardizes features, which is a separate preprocessing step.

- **D) To control the trade-off between model complexity and generalization**: This describes regularization hyperparameters like α in Multinomial NB. var_smoothing's default (1e-9) is chosen for numerical stability only. Example: Changing var_smoothing from 1e-9 to 1e-8 has negligible effect on predictions. At this scale it is a numerical safeguard, not a tuning knob; it only starts to act as a regularizer at much larger values.
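
The zero-variance case can be checked in a few lines of plain Python. The helper name `gaussian_log_pdf` is an assumption for this sketch; it mirrors the per-feature log-density term used in the lab and adds the smoothing value directly to the variance, as the lab's `fit` does.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log of the Gaussian density for a single feature value."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

x, mean = 105.0, 100.0

# Zero variance: math.log(0) raises, so the density cannot be evaluated.
# (NumPy would return inf/nan with warnings instead of raising.)
try:
    gaussian_log_pdf(x, mean, 0.0)
    failed_without_smoothing = False
except (ValueError, ZeroDivisionError):
    failed_without_smoothing = True

# With smoothing: very negative, but finite and comparable across classes
log_p = gaussian_log_pdf(x, mean, 0.0 + 1e-9)

print(failed_without_smoothing)  # True
print(log_p)                     # about -1.25e10
```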

</details>

---

## Multiple Choice Questions: Gaussian Naive Bayes

---

# Part 2: Multinomial Naive Bayes for Text Classification

Now let's implement **Multinomial Naive Bayes**, which is commonly used for text classification with word count features.

## Text Classification Example

We'll classify movie reviews as positive or negative.

In [None]:
# Sample movie reviews dataset
reviews = [
    "This movie was fantastic and amazing",
    "Great film with excellent acting",
    "Wonderful story and brilliant performance",
    "I loved this movie so much",
    "Best movie I have ever seen",
    "Outstanding cinematography and plot",
    "Terrible movie waste of time",
    "Awful film with bad acting",
    "Boring and disappointing story",
    "I hated this movie completely",
    "Worst movie ever made",
    "Poor direction and terrible script",
    "Amazing performances by all actors",
    "A masterpiece of modern cinema",
    "Dreadful experience awful waste",
    "Horrible plot and bad dialogue"
]

labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]  # 1 = positive, 0 = negative

# Split data
X_text_train, X_text_test, y_text_train, y_text_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=42
)

print(f"Training samples: {len(X_text_train)}")
print(f"Test samples: {len(X_text_test)}")

In [None]:
# Convert text to bag-of-words representation
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_text_train).toarray()
X_test_bow = vectorizer.transform(X_text_test).toarray()

print(f"Vocabulary size: {len(vectorizer.get_feature_names_out())}")
print(f"\nVocabulary: {vectorizer.get_feature_names_out()}")
print(f"\nBag-of-words shape: {X_train_bow.shape}")

---

## Exercise 4: Implement Multinomial Naive Bayes

Implement the Multinomial Naive Bayes classifier with **Laplace smoothing**.

**Formula for feature likelihood:**
$$P(x_i|y) = \frac{N_{y,i} + \alpha}{N_y + \alpha \cdot n}$$

Where:
- $N_{y,i}$ = count of feature $i$ in class $y$
- $N_y$ = total count of all features in class $y$
- $\alpha$ = smoothing parameter (usually 1 for Laplace smoothing)
- $n$ = number of features

In [None]:
class MultinomialNaiveBayes:
    """
    Multinomial Naive Bayes classifier for text classification.
    
    Parameters
    ----------
    alpha : float, default=1.0
        Laplace smoothing parameter.
    """
    
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.classes_ = None
        self.priors_ = None
        self.feature_log_prob_ = None  # Log probability of features given class
    
    def fit(self, X, y):
        """
        Fit the Multinomial Naive Bayes classifier.
        
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data (word counts).
        y : array-like of shape (n_samples,)
            Target values.
        """
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)
        n_features = X.shape[1]
        
        # Calculate priors
        self.priors_ = np.array([np.sum(y == c) / len(y) for c in self.classes_])
        
        # Calculate feature log probabilities with Laplace smoothing
        self.feature_log_prob_ = np.zeros((n_classes, n_features))
        
        # TODO: Calculate P(x_i|y) for each feature and class using Laplace smoothing
        # Formula: P(x_i|y) = (N_yi + alpha) / (N_y + alpha * n_features)
        # Then take log for numerical stability
        
        for idx, c in enumerate(self.classes_):
            # Get samples belonging to class c
            X_c = X[y == c]
            
            # TODO: Calculate N_yi (sum of feature i across all samples in class c)
            feature_counts = None  # Sum along axis 0
            
            # TODO: Calculate N_y (total count of all features in class c)
            total_count = None  # Sum of all feature counts
            
            # TODO: Apply Laplace smoothing and calculate log probabilities
            # P(x_i|y) = (feature_counts + alpha) / (total_count + alpha * n_features)
            self.feature_log_prob_[idx, :] = None
        
        return self
    
    def predict(self, X):
        """
        Predict class labels for samples in X.
        
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Samples to predict.
            
        Returns
        -------
        y_pred : array of shape (n_samples,)
            Predicted class labels.
        """
        # TODO: Calculate log posterior for each class
        # log_posterior = log_prior + sum(x_i * log_P(x_i|y))
        
        log_priors = np.log(self.priors_)
        
        # TODO: Calculate log likelihood using feature counts and log probabilities
        # Hint: Use matrix multiplication X @ self.feature_log_prob_.T
        log_likelihood = None
        
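        # TODO: Calculate log posterior
        log_posterior = None
        
        # Not implemented yet: return None so the verification cell can report it
        # (np.argmax(None, axis=1) would raise instead)
        if log_posterior is None:
            return None
        
        # Return class with highest log posterior
        return self.classes_[np.argmax(log_posterior, axis=1)]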

### Verification Cell for Exercise 4

In [None]:
# Test Multinomial Naive Bayes
mnb = MultinomialNaiveBayes(alpha=1.0)
mnb.fit(X_train_bow, np.array(y_text_train))

# Make predictions
y_text_pred = mnb.predict(X_test_bow)

if y_text_pred is not None:
    accuracy = accuracy_score(y_text_test, y_text_pred)
    print(f"Multinomial Naive Bayes Accuracy: {accuracy:.4f}")
    
    print("\nPredictions vs Actual:")
    for review, actual, pred in zip(X_text_test, y_text_test, y_text_pred):
        sentiment_actual = "Positive" if actual == 1 else "Negative"
        sentiment_pred = "Positive" if pred == 1 else "Negative"
        match = "✓" if actual == pred else "✗"
        print(f"  {match} '{review[:40]}...' - Actual: {sentiment_actual}, Predicted: {sentiment_pred}")
    
    # Compare with sklearn
    sklearn_mnb = SklearnMultinomialNB(alpha=1.0)
    sklearn_mnb.fit(X_train_bow, np.array(y_text_train))
    sklearn_pred = sklearn_mnb.predict(X_test_bow)
    sklearn_accuracy = accuracy_score(y_text_test, sklearn_pred)
    
    print(f"\nScikit-learn MultinomialNB Accuracy: {sklearn_accuracy:.4f}")
    
    if np.array_equal(y_text_pred, sklearn_pred):
        print("\n✓ Your implementation matches scikit-learn!")
    else:
        print("\n⚠ Predictions differ from scikit-learn")
else:
    print("Prediction not implemented yet")

---

## Checkpoint 4: Test Your Understanding

### Question 4

What problem does Laplace smoothing (alpha) solve in Multinomial Naive Bayes?

A) It handles the case where a word appears in test data but not in any training documents  
B) It prevents zero probabilities when a word never appears in documents of a particular class  
C) It removes stop words that appear too frequently across all documents  
D) It corrects for the different document lengths in the training corpus

<details>
<summary style="cursor: pointer; font-weight: bold;">Click here for Answer</summary>

**Answer: B) It prevents zero probabilities when a word never appears in documents of a particular class**

Without smoothing, if a word never appears in training documents of a particular class, P(word|class) = 0. When we multiply probabilities, any zero makes the entire product zero.

**Example with numbers:**

Consider classifying: "Free money offer" with vocabulary [free, money, offer, hello]

Training data counts:
```
           free  money  offer  hello  TOTAL
spam:        10      8      5      2     25
not_spam:     0      1      0     15     16
```

**Without smoothing (α=0):**
```
P("free"|not_spam) = 0/16 = 0
P("money"|not_spam) = 1/16 = 0.0625
P("offer"|not_spam) = 0/16 = 0

P(not_spam|"Free money offer") ∝ 0 × 0.0625 × 0 = 0  ← Always zero!
```

**With Laplace smoothing (α=1):**
```
P("free"|not_spam) = (0+1)/(16+4) = 1/20 = 0.05
P("money"|not_spam) = (1+1)/(16+4) = 2/20 = 0.10
P("offer"|not_spam) = (0+1)/(16+4) = 1/20 = 0.05

P(not_spam|"Free money offer") ∝ 0.05 × 0.10 × 0.05 = 0.00025  ← Non-zero!
```

**Why other answers are incorrect:**

- **A) It handles the case where a word appears in test data but not in any training documents**: This is "out-of-vocabulary" (OOV), a different problem. Example: If "cryptocurrency" isn't in vocabulary at all, it's simply ignored during prediction. You'd need unknown word tokens or subword models to handle truly unseen words.

- **C) It removes stop words that appear too frequently across all documents**: Incorrect - smoothing doesn't remove anything. Example: The word "the" appearing 1000 times still gets counted. Stop word removal is a separate preprocessing step (using nltk.corpus.stopwords or similar).

- **D) It corrects for the different document lengths in the training corpus**: Incorrect - smoothing doesn't normalize by document length. Example: A 1000-word document contributes more counts than a 10-word document. Document length normalization requires dividing by document length or applying a per-document norm (as `TfidfVectorizer` does with its default L2 normalization).
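
The arithmetic in the worked example can be checked with a short plain-Python sketch. The helper name `smoothed_probs` is an assumption for illustration; the counts are the `not_spam` row from the table above.

```python
# Counts from the worked example (vocabulary: free, money, offer, hello)
vocab = ["free", "money", "offer", "hello"]
not_spam_counts = [0, 1, 0, 15]

def smoothed_probs(counts, alpha):
    """P(word | class) = (count + alpha) / (total + alpha * vocabulary size)."""
    total = sum(counts)
    n = len(counts)
    return [(c + alpha) / (total + alpha * n) for c in counts]

p_raw = dict(zip(vocab, smoothed_probs(not_spam_counts, alpha=0)))
p_smooth = dict(zip(vocab, smoothed_probs(not_spam_counts, alpha=1)))

# Likelihood of "free money offer" under not_spam (prior omitted)
score_raw = p_raw["free"] * p_raw["money"] * p_raw["offer"]
score_smooth = p_smooth["free"] * p_smooth["money"] * p_smooth["offer"]

print(score_raw)     # 0.0
print(score_smooth)  # about 0.00025
```

The smoothed probabilities still sum to 1 over the vocabulary, because α is added to every word count and α·n to the denominator.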

</details>

<details>
<summary style="cursor: pointer; font-weight: bold;">💡 Click here for Exercise 4 Solution</summary>

```python
def fit(self, X, y):
    self.classes_ = np.unique(y)
    n_classes = len(self.classes_)
    n_features = X.shape[1]
    
    # Calculate priors
    self.priors_ = np.array([np.sum(y == c) / len(y) for c in self.classes_])
    
    # Calculate feature log probabilities with Laplace smoothing
    self.feature_log_prob_ = np.zeros((n_classes, n_features))
    
    for idx, c in enumerate(self.classes_):
        X_c = X[y == c]
        
        # N_yi: sum of feature i across all samples in class c
        feature_counts = X_c.sum(axis=0)
        
        # N_y: total count of all features in class c
        total_count = feature_counts.sum()
        
        # Apply Laplace smoothing and calculate log probabilities
        self.feature_log_prob_[idx, :] = np.log(
            (feature_counts + self.alpha) / (total_count + self.alpha * n_features)
        )
    
    return self

def predict(self, X):
    log_priors = np.log(self.priors_)
    
    # Log likelihood: sum of (x_i * log P(x_i|y))
    log_likelihood = X @ self.feature_log_prob_.T
    
    # Log posterior
    log_posterior = log_priors + log_likelihood
    
    return self.classes_[np.argmax(log_posterior, axis=1)]
```

**Explanation:**
- **Laplace smoothing**: Adds α to each count to avoid zero probabilities for words that never appear in a given class
- **Feature counts**: Sum of each word's frequency across all documents in a class
- **Log likelihood**: For count data, we multiply log probabilities by word counts
- Matrix multiplication `X @ feature_log_prob_.T` efficiently computes the sum
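
To see why the dot product is the right quantity, the sketch below compares the direct multinomial likelihood with the log form for a single document. The probabilities and counts are made up for illustration. The multinomial coefficient is left out because it is the same for every class and does not affect the argmax.

```python
import math

# Hypothetical smoothed word probabilities for one class (they sum to 1)
word_probs = [0.5, 0.3, 0.2]
# Word counts for one document over the same three-word vocabulary
doc_counts = [2, 0, 1]

# Direct form: product of P(word | class) ** count, then take the log
direct = math.log(math.prod(p ** x for p, x in zip(word_probs, doc_counts)))

# Log form used in the solution: sum of count * log P(word | class)
log_form = sum(x * math.log(p) for p, x in zip(word_probs, doc_counts))

print(direct, log_form)  # both about -2.996
```

`X @ self.feature_log_prob_.T` computes this same sum for every document and every class in one step.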

</details>

---

## Multiple Choice Questions: Multinomial Naive Bayes

---

## TF-IDF vs Bag-of-Words for Text Classification

So far we've used **Bag-of-Words (BoW)** - raw word counts. However, **TF-IDF (Term Frequency-Inverse Document Frequency)** often provides better features for text classification.

### The Problem with Raw Counts

Common words like "the", "is", "and" appear frequently in all documents but carry little discriminative information. Raw counts give these words high importance.

### TF-IDF Solution

TF-IDF weighs terms by:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

Where:
- **TF(t, d)** = frequency of term t in document d
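- **IDF(t)** = log(N / df(t)) where N is the total number of documents and df(t) is the number of documents containing t. (scikit-learn's `TfidfVectorizer` uses a smoothed variant of this formula and L2-normalizes each document vector by default, so its values differ from this textbook definition.)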

**Key insight**: Words appearing in many documents get lower IDF weights, reducing the influence of common words.
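
A minimal sketch of the textbook formula on a tiny made-up corpus is shown here. It assumes the natural log and raw counts for TF; the corpus and the helper names `idf` and `tf_idf` are hypothetical.

```python
import math

# Hypothetical tokenized corpus of four short reviews
docs = [
    ["great", "movie"],
    ["terrible", "movie"],
    ["boring", "movie"],
    ["great", "acting"],
]
N = len(docs)

def idf(term):
    """log(N / df), where df is the number of documents containing the term."""
    df = sum(1 for d in docs if term in d)
    return math.log(N / df)

def tf_idf(term, doc):
    """Raw term count in the document times the term's IDF."""
    return doc.count(term) * idf(term)

for term in ["movie", "great", "terrible"]:
    print(f"{term:9s} idf={idf(term):.3f}")
```

"movie" appears in three of the four documents and gets the lowest IDF, while "terrible" appears in only one and gets the highest.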

In [None]:
# Compare Bag-of-Words vs TF-IDF for text classification
# Using a larger dataset to see the difference

# Extended movie reviews for better comparison
extended_reviews = [
    "This movie was fantastic and amazing",
    "Great film with excellent acting",
    "Wonderful story and brilliant performance",
    "I loved this movie so much",
    "Best movie I have ever seen",
    "Outstanding cinematography and plot",
    "Amazing performances by all actors",
    "A masterpiece of modern cinema",
    "Incredible film that I highly recommend",
    "Superb acting and wonderful direction",
    "Terrible movie waste of time",
    "Awful film with bad acting",
    "Boring and disappointing story",
    "I hated this movie completely",
    "Worst movie ever made",
    "Poor direction and terrible script",
    "Dreadful experience awful waste",
    "Horrible plot and bad dialogue",
    "The movie was so boring I fell asleep",
    "Disappointing film with weak characters"
]

extended_labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Split data
X_ext_train, X_ext_test, y_ext_train, y_ext_test = train_test_split(
    extended_reviews, extended_labels, test_size=0.3, random_state=42
)

# Bag-of-Words
bow_vectorizer = CountVectorizer()
X_train_bow_ext = bow_vectorizer.fit_transform(X_ext_train).toarray()
X_test_bow_ext = bow_vectorizer.transform(X_ext_test).toarray()

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_ext_train).toarray()
X_test_tfidf = tfidf_vectorizer.transform(X_ext_test).toarray()

# Train classifiers
sklearn_mnb_bow = SklearnMultinomialNB(alpha=1.0)
sklearn_mnb_bow.fit(X_train_bow_ext, np.array(y_ext_train))

sklearn_mnb_tfidf = SklearnMultinomialNB(alpha=1.0)
sklearn_mnb_tfidf.fit(X_train_tfidf, np.array(y_ext_train))

# Compare results
bow_accuracy = accuracy_score(y_ext_test, sklearn_mnb_bow.predict(X_test_bow_ext))
tfidf_accuracy = accuracy_score(y_ext_test, sklearn_mnb_tfidf.predict(X_test_tfidf))

print("Comparison: Bag-of-Words vs TF-IDF")
print("=" * 50)
print(f"Bag-of-Words Accuracy:  {bow_accuracy:.4f}")
print(f"TF-IDF Accuracy:        {tfidf_accuracy:.4f}")

# Visualize feature weights for a sample word
print("\nFeature Representation Comparison (sample word: 'movie'):")
if 'movie' in bow_vectorizer.vocabulary_:
    bow_idx = bow_vectorizer.vocabulary_['movie']
    tfidf_idx = tfidf_vectorizer.vocabulary_['movie']
    
    print(f"  BoW values for first 3 training docs:   {X_train_bow_ext[:3, bow_idx]}")
    print(f"  TF-IDF values for first 3 training docs: {X_train_tfidf[:3, tfidf_idx].round(3)}")
    print("\n  Notice: TF-IDF down-weights common words like 'movie' that appear in many documents")

---

## N-grams: Capturing Word Context

A key limitation of Bag-of-Words is that it treats words independently, losing important context. **N-grams** help capture word sequences and handle negation.

### The Problem: Negation and Context

Consider these reviews:
- "This movie is **not good**" → Negative sentiment
- "This movie is **good**" → Positive sentiment

With unigrams (single words), both contain "good" with the same count, making them appear similar!

### N-gram Solution

| N-gram Type | Description | Example: "not good at all" |
|-------------|-------------|----------------------------|
| **Unigrams (n=1)** | Single words | ["not", "good", "at", "all"] |
| **Bigrams (n=2)** | Word pairs | ["not good", "good at", "at all"] |
| **Trigrams (n=3)** | Word triples | ["not good at", "good at all"] |

**Key insight**: "not good" as a bigram captures the negation that unigrams miss!
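
How the table's n-grams are generated can be sketched with a small sliding-window helper. The function name `ngrams` is an assumption for illustration, and it splits on whitespace only; `CountVectorizer` additionally lowercases text and, by default, drops single-character tokens.

```python
def ngrams(text, n):
    """Return the n-grams of a text as space-joined strings of n adjacent tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

phrase = "not good at all"
print(ngrams(phrase, 1))  # ['not', 'good', 'at', 'all']
print(ngrams(phrase, 2))  # ['not good', 'good at', 'at all']
print(ngrams(phrase, 3))  # ['not good at', 'good at all']
```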

### Example with Sentiment Analysis

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams only
vec_uni = CountVectorizer(ngram_range=(1, 1))

# Unigrams + Bigrams
vec_bi = CountVectorizer(ngram_range=(1, 2))

text = ["This movie is not good"]
print(vec_uni.fit_transform(text).toarray())  # columns: good, is, movie, not, this
print(vec_bi.fit_transform(text).toarray())   # columns: good, is, is not, movie, movie is, not, not good, this, this movie
```

### Trade-offs

| Aspect | Unigrams | Unigrams + Bigrams | Higher N-grams |
|--------|----------|-------------------|----------------|
| **Vocabulary size** | V | Up to V + V² (upper bound) | Up to V + V² + ... + Vⁿ |
| **Captures negation** | ❌ No | ✅ Yes | ✅ Yes |
| **Sparse features** | Moderate | High | Very high |
| **Overfitting risk** | Low | Medium | High |
| **Training data needed** | Less | More | Much more |

### Best Practices for N-grams

1. **Start with (1, 2)**: Unigrams + bigrams is a good default
2. **Limit vocabulary**: Use `max_features` to cap vocabulary size
3. **Use with smoothing**: Higher α for sparser n-gram features
4. **Consider TF-IDF**: Helps with n-gram feature weighting

```python
# Recommended setup for sentiment analysis
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),      # Unigrams and bigrams
    max_features=10000,       # Limit vocabulary
    min_df=2                  # Ignore very rare terms
)

model = MultinomialNB(alpha=0.1)  # Lower alpha for TF-IDF
```

In [None]:
# Demonstrate N-grams effect on sentiment classification
from sklearn.feature_extraction.text import CountVectorizer

# Reviews with negation - tricky for unigrams!
negation_reviews = [
    "This movie is good",
    "This movie is not good", 
    "I really loved this film",
    "I did not love this film",
    "Great acting and plot",
    "Not great acting at all",
    "The story was amazing",
    "The story was not amazing"
]
negation_labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

# Compare unigrams vs bigrams
print("N-grams Comparison for Handling Negation")
print("=" * 60)

# Unigrams only
vec_uni = CountVectorizer(ngram_range=(1, 1))
X_uni = vec_uni.fit_transform(negation_reviews).toarray()

# Unigrams + Bigrams
vec_bi = CountVectorizer(ngram_range=(1, 2))
X_bi = vec_bi.fit_transform(negation_reviews).toarray()

print(f"\nUnigrams vocabulary size: {len(vec_uni.get_feature_names_out())}")
print(f"Unigrams + Bigrams vocabulary size: {len(vec_bi.get_feature_names_out())}")

# Show key features
print("\n--- Unigram features (sample) ---")
uni_features = vec_uni.get_feature_names_out()
print(f"Features: {list(uni_features)}")

print("\n--- Bigram features that capture negation ---")
bi_features = vec_bi.get_feature_names_out()
negation_bigrams = [f for f in bi_features if 'not' in f]
print(f"Negation bigrams: {negation_bigrams}")

# Train and compare
from sklearn.model_selection import cross_val_score

mnb = SklearnMultinomialNB(alpha=1.0)

cv_uni = cross_val_score(mnb, X_uni, negation_labels, cv=4)
cv_bi = cross_val_score(mnb, X_bi, negation_labels, cv=4)

print(f"\n--- Cross-validation Accuracy ---")
print(f"Unigrams only:        {cv_uni.mean():.2f} (±{cv_uni.std():.2f})")
print(f"Unigrams + Bigrams:   {cv_bi.mean():.2f} (±{cv_bi.std():.2f})")
print("\nNote: Bigrams can capture negation patterns like 'not good', 'not great', but with only 8 reviews the CV scores are noisy.")

### When to Use TF-IDF vs Bag-of-Words

| Feature | Bag-of-Words | TF-IDF |
|---------|--------------|--------|
| **Representation** | Raw word counts | Weighted by term importance |
| **Common words** | High values | Down-weighted |
| **Rare but discriminative words** | Low values | Up-weighted |
| **Best for** | Short texts, when word frequency matters | Longer documents, diverse vocabulary |
| **Computational cost** | Lower | Slightly higher |

> **Note**: TF-IDF values are continuous rather than integer counts. Multinomial NB is formulated for counts, but it accepts any non-negative features, and fractional counts such as TF-IDF often work well in practice.

---

## Checkpoint 5: Test Your Understanding

### Question 5

When would you choose Multinomial NB over Gaussian NB?

A) When features represent word frequencies or document-term counts  
B) When the features have high correlation with each other  
C) When the dataset has many more samples than features  
D) When you need well-calibrated probability estimates

<details>
<summary style="cursor: pointer; font-weight: bold;">Click here for Answer</summary>

**Answer: A) When features represent word frequencies or document-term counts**

Multinomial NB is designed for discrete count data, particularly text classification with bag-of-words features. It models the probability of word occurrences using the multinomial distribution.

**Example with numbers:**

| Dataset | Feature Type | Best Model | Why |
|---------|--------------|------------|-----|
| Email spam detection | Word counts: [free:3, money:2, click:1] | **Multinomial NB** | Discrete counts |
| Iris classification | Measurements: [sepal:5.1cm, petal:1.4cm] | **Gaussian NB** | Continuous values |
| Sentiment analysis | Word frequencies | **Multinomial NB** | Count data |
| Medical diagnosis | Lab values: [glucose:95, cholesterol:180] | **Gaussian NB** | Continuous measurements |

**Decision rule:**
```
Are your features counts/frequencies (0, 1, 2, 3, ...)?
├─ Yes → Multinomial NB
└─ No → Are they continuous real numbers?
        ├─ Yes → Gaussian NB
        └─ No (binary 0/1) → Bernoulli NB
```

**Why other answers are incorrect:**

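- **B) When the features have high correlation with each other**: Neither variant handles correlated features well - both assume conditional independence! Example: If "free" and "money" always appear together, both models treat them as independent pieces of evidence and effectively count that evidence twice. For correlated features, consider decorrelating them first (e.g., PCA before Gaussian NB; PCA output can be negative, so it does not suit Multinomial NB) or use models like Random Forests.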

- **C) When the dataset has many more samples than features**: The sample-to-feature ratio doesn't determine the choice. Example: Text classification often has vocabulary_size=10,000 features but only 1,000 documents (more features than samples!) - and Multinomial NB still excels. The choice is about feature *type*, not dataset dimensions.

- **D) When you need well-calibrated probability estimates**: Neither variant produces well-calibrated probabilities. Example: Naive Bayes might output P(spam)=0.99 when the true probability is 0.75 (overconfident). For calibrated probabilities, use Platt scaling or isotonic regression post-hoc.
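
The post-hoc calibration mentioned in option D can be sketched with scikit-learn's `CalibratedClassifierCV` (`method="isotonic"` for isotonic regression, `method="sigmoid"` for Platt scaling). The synthetic dataset and the `scores` dictionary below are assumptions chosen only for illustration; the Brier score is one common way to compare probability quality (lower is better).

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant (correlated) features, which tends to make NB overconfident
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_tr, y_tr)

scores = {}
for name, model in [("raw", raw), ("isotonic", calibrated)]:
    p = model.predict_proba(X_te)[:, 1]           # predicted P(class 1)
    scores[name] = brier_score_loss(y_te, p)
    print(f"{name:>8}: Brier score = {scores[name]:.4f}")
```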

</details>

---

## Bias-Variance Tradeoff in Naive Bayes

The smoothing parameter (α in Multinomial NB, var_smoothing in Gaussian NB) controls the **bias-variance tradeoff**:

- **Low smoothing (α → 0)**: High variance, low bias
  - Model closely follows training data
  - Risk of overfitting, especially with sparse data
  - Near-zero probabilities for words never seen with a class (exactly zero at α = 0)
  
- **High smoothing (α → ∞)**: High bias, low variance
  - Model approaches uniform probabilities
  - Ignores training data evidence
  - Underfitting - poor discrimination between classes

Let's visualize this tradeoff:

In [None]:
# Visualize bias-variance tradeoff with different smoothing values
from sklearn.model_selection import cross_val_score

# Generate synthetic text-like data for demonstration
np.random.seed(42)

# Create a more substantial dataset for meaningful cross-validation
n_train = 100
vocab_size = 50

# Simulate word count data
X_synthetic = np.random.poisson(lam=2, size=(n_train, vocab_size))
# Add some class-specific signal
class_signal = np.zeros((n_train, vocab_size))
class_signal[:n_train//2, :10] = np.random.poisson(lam=3, size=(n_train//2, 10))
class_signal[n_train//2:, 10:20] = np.random.poisson(lam=3, size=(n_train//2, 10))
X_synthetic = X_synthetic + class_signal
y_synthetic = np.array([0] * (n_train//2) + [1] * (n_train//2))

# Test different alpha values
alphas_bv = np.logspace(-3, 2, 20)  # From 0.001 to 100
mean_train_scores = []
mean_cv_scores = []
std_cv_scores = []

for alpha in alphas_bv:
    model = SklearnMultinomialNB(alpha=alpha)
    
    # Training score
    model.fit(X_synthetic, y_synthetic)
    train_score = model.score(X_synthetic, y_synthetic)
    mean_train_scores.append(train_score)
    
    # Cross-validation score
    cv_scores = cross_val_score(model, X_synthetic, y_synthetic, cv=5)
    mean_cv_scores.append(cv_scores.mean())
    std_cv_scores.append(cv_scores.std())

mean_cv_scores = np.array(mean_cv_scores)
std_cv_scores = np.array(std_cv_scores)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Training vs CV accuracy
ax1 = axes[0]
ax1.semilogx(alphas_bv, mean_train_scores, 'b-', label='Training Accuracy', linewidth=2)
ax1.semilogx(alphas_bv, mean_cv_scores, 'r-', label='CV Accuracy', linewidth=2)
ax1.fill_between(alphas_bv, mean_cv_scores - std_cv_scores, 
                  mean_cv_scores + std_cv_scores, alpha=0.2, color='red')
ax1.axvline(x=alphas_bv[np.argmax(mean_cv_scores)], color='green', linestyle='--', 
             label=f'Best α = {alphas_bv[np.argmax(mean_cv_scores)]:.3f}')
ax1.set_xlabel('Smoothing Parameter (α)', fontsize=12)
ax1.set_ylabel('Accuracy', fontsize=12)
ax1.set_title('Bias-Variance Tradeoff in Multinomial NB', fontsize=14)
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_ylim([0.4, 1.05])

# Right plot: Gap between training and CV (indicator of overfitting)
ax2 = axes[1]
gap = np.array(mean_train_scores) - np.array(mean_cv_scores)
ax2.semilogx(alphas_bv, gap, 'purple', linewidth=2)
ax2.fill_between(alphas_bv, 0, gap, alpha=0.3, color='purple')
ax2.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
ax2.set_xlabel('Smoothing Parameter (α)', fontsize=12)
ax2.set_ylabel('Train - CV Accuracy (Overfitting Gap)', fontsize=12)
ax2.set_title('Overfitting Indicator', fontsize=14)
ax2.grid(True, alpha=0.3)

# Add annotations
ax2.annotate('High Variance\n(Overfitting)', xy=(0.005, 0.15), fontsize=10, ha='center')
ax2.annotate('High Bias\n(Underfitting)', xy=(20, 0.02), fontsize=10, ha='center')

plt.tight_layout()
plt.show()

print(f"Optimal α (highest CV accuracy): {alphas_bv[np.argmax(mean_cv_scores)]:.4f}")
print(f"Best CV Accuracy: {max(mean_cv_scores):.4f}")

### Interpreting the Bias-Variance Plot

**Left plot (Training vs CV Accuracy):**
- **Small α**: High training accuracy but lower CV accuracy → overfitting
- **Large α**: Both accuracies drop → underfitting  
- **Optimal α**: Where CV accuracy is maximized (green line)

**Right plot (Overfitting Gap):**
- Large gap = high variance (overfitting)
- Near-zero gap with low accuracy = high bias (underfitting)
- Sweet spot: small gap with high overall accuracy

**Practical advice:**
1. Use cross-validation to find optimal α
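2. The default α=1.0 is a reasonable starting point, but it is not guaranteed to be optimal
3. As a rough heuristic, try smaller α for large vocabularies (the smoothing mass α × vocab_size can otherwise swamp the observed counts) and larger α when training data is scarce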
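
The first piece of advice can be sketched with scikit-learn's `GridSearchCV`. The synthetic count data below (`X_demo`, `y_demo`) is hypothetical and only mimics the kind of data generated earlier in this section; in practice you would pass your own bag-of-words matrix and labels.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Hypothetical count data: 200 documents, 30-word vocabulary
X_demo = rng.poisson(lam=2.0, size=(200, 30))
y_demo = np.repeat([0, 1], 100)
X_demo[:100, :5] += rng.poisson(lam=3.0, size=(100, 5))     # class-0 signal words
X_demo[100:, 5:10] += rng.poisson(lam=3.0, size=(100, 5))   # class-1 signal words

search = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": np.logspace(-3, 2, 12)},  # 0.001 to 100
    cv=5,                                          # stratified 5-fold for classifiers
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(f"Best alpha: {search.best_params_['alpha']:.4f}")
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

After fitting, `search.best_estimator_` is refit on all of the data with the selected α, so it can be used directly for prediction.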

---

## Checkpoint 6: Test Your Understanding

### Question 6

What happens if we increase the smoothing parameter α in Multinomial NB from 1.0 to 10.0?

A) The model gives more weight to words that appear frequently in the training data  
B) The model makes feature probabilities more uniform, reducing the influence of observed word counts  
C) The model becomes more sensitive to rare words in the vocabulary  
D) The model's training time increases significantly due to more complex calculations

<details>
<summary style="cursor: pointer; font-weight: bold;">Click here for Answer</summary>

**Answer: B) The model makes feature probabilities more uniform, reducing the influence of observed word counts**

As α increases, the smoothing formula P(word|class) = (count + α) / (total + α × vocab_size) causes all feature probabilities to move closer to uniform (1/vocab_size).

**Example with numbers:**

Word "free" appears 10 times in spam class (total words in spam = 100, vocab_size = 1000):

| α | P("free"\|spam) | Effect |
|---|----------------|--------|
| 0 | 10/100 = **0.100** | Pure observed frequency |
| 1 | (10+1)/(100+1000) = 11/1100 = **0.010** | Smoothed |
| 10 | (10+10)/(100+10000) = 20/10100 = **0.002** | More uniform |
| 100 | (10+100)/(100+100000) = 110/100100 = **0.0011** | Nearly uniform (≈1/1000) |

Notice: As α increases, P("free"|spam) approaches 1/1000 = 0.001 (uniform probability).
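
The table above can be reproduced with a few lines of Python. The helper name `smoothed_prob` is made up for this sketch; it simply evaluates the smoothing formula quoted in the answer.

```python
def smoothed_prob(count, total, vocab_size, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    return (count + alpha) / (total + alpha * vocab_size)

count, total, vocab_size = 10, 100, 1000  # numbers from the table above

for alpha in [0, 1, 10, 100, 10_000]:
    p = smoothed_prob(count, total, vocab_size, alpha)
    print(f"alpha = {alpha:>6}: P('free'|spam) = {p:.5f}")

print(f"uniform limit 1/vocab_size = {1 / vocab_size:.5f}")
```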

**Bias-Variance Tradeoff:**
- **α small**: Low bias, high variance (fits training data closely, may overfit)
- **α large**: High bias, low variance (ignores data, may underfit)

**Why other answers are incorrect:**

- **A) The model gives more weight to words that appear frequently**: This is **backwards**! Larger α *reduces* the relative weight of observed frequencies. Example from table above: The ratio between a word appearing 10 times vs 0 times:
  - α=0: 10/100 vs 0/100 = infinite ratio
  - α=1: 11/1100 vs 1/1100 = 11:1 ratio  
  - α=100: 110/100100 vs 100/100100 = 1.1:1 ratio (nearly equal!)

- **C) The model becomes more sensitive to rare words**: **Opposite is true**. Higher α washes out rare words. Example: A discriminative rare word appearing once gets drowned by the α term when α is large.

- **D) The model's training time increases significantly**: Incorrect - α doesn't affect computational complexity. The formula (count + α)/(total + α × vocab_size) takes the same O(1) time regardless of α's value. Training time depends on dataset size, not α.

</details>

---

## Effect of Smoothing Parameter

Let's visualize how the smoothing parameter affects model performance.

In [None]:
# Test different alpha values
alphas = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
train_accuracies = []
test_accuracies = []

for alpha in alphas:
    mnb_test = MultinomialNaiveBayes(alpha=alpha)
    mnb_test.fit(X_train_bow, np.array(y_text_train))
    
    train_pred = mnb_test.predict(X_train_bow)
    test_pred = mnb_test.predict(X_test_bow)
    
    if train_pred is not None and test_pred is not None:
        train_accuracies.append(accuracy_score(y_text_train, train_pred))
        test_accuracies.append(accuracy_score(y_text_test, test_pred))

if train_accuracies and test_accuracies:
    plt.figure(figsize=(10, 5))
    plt.plot(alphas, train_accuracies, 'bo-', label='Training Accuracy', markersize=8)
    plt.plot(alphas, test_accuracies, 'rs-', label='Test Accuracy', markersize=8)
    plt.xscale('log')
    plt.xlabel('Alpha (Smoothing Parameter)')
    plt.ylabel('Accuracy')
    plt.title('Effect of Laplace Smoothing on Multinomial NB')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
else:
    print("Complete Exercise 4 to see the smoothing effect visualization")

---

# Part 3: Applying to a Real Dataset - Iris Classification

In [None]:
# Load Iris dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

print(f"Iris dataset shape: {X_iris.shape}")
print(f"Classes: {iris.target_names}")
print(f"Features: {iris.feature_names}")

In [None]:
# Split data
X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

# Train our Gaussian NB
gnb_iris = GaussianNaiveBayes(var_smoothing=1e-9)
gnb_iris.fit(X_iris_train, y_iris_train)

# Predict
y_iris_pred = gnb_iris.predict(X_iris_test)

if y_iris_pred is not None:
    print("Gaussian Naive Bayes on Iris Dataset")
    print("="*50)
    print(f"\nAccuracy: {accuracy_score(y_iris_test, y_iris_pred):.4f}")
    print("\nClassification Report:")
    print(classification_report(y_iris_test, y_iris_pred, target_names=iris.target_names))
    
    # Confusion Matrix visualization
    cm = confusion_matrix(y_iris_test, y_iris_pred)
    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap='Blues')
    plt.title('Confusion Matrix - Iris Classification')
    plt.colorbar()
    tick_marks = np.arange(len(iris.target_names))
    plt.xticks(tick_marks, iris.target_names, rotation=45)
    plt.yticks(tick_marks, iris.target_names)
    
    # Add text annotations
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, str(cm[i, j]), ha='center', va='center',
                    color='white' if cm[i, j] > cm.max()/2 else 'black')
    
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
    plt.show()
else:
    print("Complete Exercise 3 to see Iris classification results")

---

## Best Practices and Tips

### 1. Feature Engineering
- **Gaussian NB**: Works best when features approximately follow a normal distribution within each class
- **Multinomial NB**: Best for count data (text); TF-IDF weighting is worth trying and sometimes improves results

### 2. Choosing Smoothing Parameters
- **var_smoothing** (Gaussian): Start with 1e-9, increase if numerical issues occur
- **alpha** (Multinomial): Use cross-validation to find optimal value; 1.0 is a good default
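
To see why variance smoothing matters, here is a minimal NumPy sketch. It assumes the scikit-learn convention of adding `var_smoothing` times the largest feature variance to every per-class variance; the lab's own `GaussianNaiveBayes` may apply smoothing differently. The arrays `X_class` and `X_all` are hypothetical.

```python
import numpy as np

# Hypothetical samples from one class: the second feature is constant within the class
X_class = np.array([[5.0, 2.0],
                    [4.8, 2.0],
                    [5.2, 2.0]])
# All training rows (this class plus two rows from another class)
X_all = np.vstack([X_class, [[6.3, 1.8], [5.8, 1.5]]])

var_smoothing = 1e-9
epsilon = var_smoothing * X_all.var(axis=0).max()  # scikit-learn-style additive term

raw_var = X_class.var(axis=0)      # second entry is 0, so the Gaussian density is undefined
smoothed_var = raw_var + epsilon   # strictly positive, safe to divide by

print("raw variance:     ", raw_var)
print("smoothed variance:", smoothed_var)
```

Without the additive term, the zero variance would cause a division by zero in the Gaussian likelihood.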

### 3. When Naive Bayes Shines
- Text classification (spam, sentiment, categorization)
- High-dimensional data (many features relative to the number of samples)
- When you need a quick baseline model
- When training data is limited

### 4. When to Consider Alternatives
- When features are highly correlated
- When decision boundaries are complex
- When probability estimates need to be well-calibrated
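
The first and third points are related: correlated features make Naive Bayes count the same evidence more than once, which pushes its probabilities toward 0 or 1. A small sketch of this effect, using an artificial extreme case where every Iris feature is duplicated (the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_dup = np.hstack([X, X])  # every feature duplicated: perfectly correlated copies

# Probability assigned to the predicted class for each sample
conf_orig = GaussianNB().fit(X, y).predict_proba(X).max(axis=1)
conf_dup = GaussianNB().fit(X_dup, y).predict_proba(X_dup).max(axis=1)

print(f"Mean top-class probability, original features:   {conf_orig.mean():.4f}")
print(f"Mean top-class probability, duplicated features: {conf_dup.mean():.4f}")
```

The duplicated columns add no new information, yet the model becomes more confident because each likelihood term is multiplied in twice.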

### 5. Common Mistakes to Avoid
- Forgetting to use log probabilities → numerical underflow
- Using Multinomial NB with negative feature values (e.g., after standardization or PCA)
- Not applying smoothing → zero probability issues
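
The first mistake is easy to demonstrate. The sketch below uses 1,000 hypothetical per-word likelihoods for two classes; the values 0.01 and 0.02 are arbitrary and chosen only to trigger underflow.

```python
import numpy as np

# 1,000 hypothetical per-word likelihoods for two classes
p_class_a = np.full(1000, 0.01)
p_class_b = np.full(1000, 0.02)

# Direct products underflow to 0.0, so the two classes cannot be compared
prod_a, prod_b = np.prod(p_class_a), np.prod(p_class_b)
print("Products:", prod_a, prod_b)

# Sums of logs stay finite and preserve the ordering
log_a = np.log(p_class_a).sum()
log_b = np.log(p_class_b).sum()
print(f"log-score A = {log_a:.1f}, log-score B = {log_b:.1f}")
print("Predicted class:", "A" if log_a > log_b else "B")
```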

---

## Summary

In this lab, you learned:

1. **Bayes Theorem Foundation**: How to use $P(y|X) \propto P(X|y)P(y)$ for classification

2. **Gaussian Naive Bayes**: 
   - Assumes continuous features follow Gaussian distributions
   - Computes mean and variance per feature per class
   - Uses variance smoothing for numerical stability

3. **Multinomial Naive Bayes**:
   - Best for count/frequency data (text classification)
   - Uses Laplace smoothing to handle zero counts
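   - Feature probability: $P(x_i|y) = \frac{N_{y,i} + \alpha}{N_y + \alpha n}$, where $N_{y,i}$ is the count of feature $i$ in class $y$, $N_y$ is the total count over all features in class $y$, and $n$ is the number of features (vocabulary size)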

4. **Numerical Stability**:
   - Always use log probabilities to avoid underflow
   - Convert multiplication to addition: $\log(ab) = \log(a) + \log(b)$

5. **The Naive Assumption**:
   - Features are conditionally independent given the class
   - This simplification makes computation tractable
   - Often works well despite being unrealistic

### Key Takeaways

- Naive Bayes is fast, simple, and effective for many tasks
- Choose the right variant based on your data type
- Smoothing parameters control the bias-variance tradeoff
- Log probabilities are essential for numerical stability