<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Naive%20Bayes/Naive%20Bayes%20Hands-On%20Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive Bayes Hands-On Lab

In this hands-on lab, you will implement the Naive Bayes classifier from scratch, gaining a deep understanding of probabilistic classification. You'll work with both **Gaussian Naive Bayes** for continuous features and **Multinomial Naive Bayes** for text classification.

## Learning Objectives

By the end of this lab, you will be able to:

1. **Understand Bayes' Theorem**: Apply the fundamental formula for probabilistic inference
2. **Implement Gaussian Naive Bayes**: Build a classifier for continuous features from scratch
3. **Implement Multinomial Naive Bayes**: Create a text classifier using bag-of-words representation
4. **Handle numerical stability**: Use log probabilities to avoid numerical underflow
5. **Apply Laplace smoothing**: Prevent zero probability issues in classification
6. **Visualize decision boundaries**: Understand how Naive Bayes separates classes
7. **Compare with scikit-learn**: Validate your implementation against the library version

## Algorithm Overview

### Bayes' Theorem

Naive Bayes is a probabilistic classifier based on **Bayes' Theorem**:

$$P(y|X) = \frac{P(X|y) \cdot P(y)}{P(X)}$$

Where:
- $P(y|X)$ is the **posterior probability** - probability of class $y$ given features $X$
- $P(X|y)$ is the **likelihood** - probability of features $X$ given class $y$
- $P(y)$ is the **prior probability** - probability of class $y$ before seeing the data
- $P(X)$ is the **evidence** - probability of the features (normalizing constant)
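
As a quick numeric illustration of the four terms above, here is a minimal sketch for a two-class problem. The class names and all probabilities are made up for the example; they do not come from the lab's datasets.

```python
# Hypothetical numbers for a two-class problem (spam vs. ham).
prior = {"spam": 0.3, "ham": 0.7}        # P(y)
likelihood = {"spam": 0.8, "ham": 0.1}   # P(X|y) for one observed message X

# Evidence: P(X) = sum over classes of P(X|y) * P(y)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posterior: P(y|X) = P(X|y) * P(y) / P(X)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}

for c, p in posterior.items():
    print(f"P({c}|X) = {p:.4f}")
```

Even though the prior favors "ham", the much larger likelihood under "spam" makes "spam" the more probable class after seeing the message.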

### The Naive Assumption

The "naive" in Naive Bayes comes from the **conditional independence assumption**: given the class label, all features are assumed to be independent of each other.

$$P(X|y) = P(x_1|y) \cdot P(x_2|y) \cdot ... \cdot P(x_n|y) = \prod_{i=1}^{n} P(x_i|y)$$

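This simplification reduces the joint likelihood to a product of one-dimensional terms, so training only needs per-feature statistics for each class. The result is computationally cheap and often works well in practice, even when the independence assumption is violated.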

### Classification Decision

To classify a new sample, we compute:

$$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i|y)$$

Since $P(X)$ is constant for all classes, we can ignore it for classification purposes.
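
The decision rule can be sketched in a few lines. The priors and per-feature likelihoods below are hypothetical values chosen for illustration; the final check shows that dividing by the evidence does not change which class wins.

```python
import math

# Hypothetical priors and per-feature likelihoods P(x_i|y)
# for one sample with three features.
priors = {"A": 0.6, "B": 0.4}
feature_likelihoods = {
    "A": [0.2, 0.5, 0.1],
    "B": [0.3, 0.4, 0.3],
}

# Unnormalized posterior: P(y) * prod_i P(x_i|y)
scores = {c: priors[c] * math.prod(feature_likelihoods[c]) for c in priors}

# The predicted class is the one with the largest score.
y_hat = max(scores, key=scores.get)

# Dividing by the evidence rescales every score by the same positive factor,
# so the argmax is unchanged.
evidence = sum(scores.values())
normalized = {c: s / evidence for c, s in scores.items()}
assert max(normalized, key=normalized.get) == y_hat

print(scores, y_hat)
```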

### Gaussian Naive Bayes

For **continuous features**, we assume each feature follows a Gaussian (normal) distribution within each class:

$$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right)$$

Where:
- $\mu_{y,i}$ is the mean of feature $i$ for class $y$
- $\sigma_{y,i}^2$ is the variance of feature $i$ for class $y$
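
To make the formula concrete, the following sketch evaluates the Gaussian density for a single feature using only the standard library. The mean and variance are hypothetical class statistics, and `gaussian_pdf` is a helper name introduced here for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution N(mean, var) evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical class statistics for one feature.
mean_y, var_y = 5.0, 4.0

at_mean = gaussian_pdf(5.0, mean_y, var_y)   # peak of the density
one_sd = gaussian_pdf(7.0, mean_y, var_y)    # one standard deviation from the mean

# With a small variance the density can exceed 1: it is a density, not a probability.
narrow = gaussian_pdf(0.0, 0.0, 0.01)

print(f"{at_mean:.4f} {one_sd:.4f} {narrow:.4f}")
```

Because these values are densities, only their relative sizes across classes matter for classification.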

### Multinomial Naive Bayes

For **count data** (like word frequencies in text), we use the multinomial distribution. The probability of feature $i$ under class $y$ is estimated from smoothed counts:

$$P(x_i|y) = \frac{N_{y,i} + \alpha}{N_y + \alpha \cdot n}$$

Where:
- $N_{y,i}$ is the count of feature $i$ in class $y$
- $N_y$ is the total count of all features in class $y$
- $\alpha$ is the additive smoothing parameter ($\alpha = 1$ gives Laplace smoothing)
- $n$ is the number of features
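
The effect of smoothing is easiest to see on a tiny example. The sketch below uses made-up word counts for a single class and compares the unsmoothed estimate with the smoothed one from the formula above.

```python
# Hypothetical word counts for one class over a 4-word vocabulary.
counts = {"great": 3, "movie": 5, "awful": 0, "plot": 2}   # N_{y,i}
alpha = 1.0                                                # smoothing parameter
n = len(counts)                                            # number of features
total = sum(counts.values())                               # N_y

unsmoothed = {w: c / total for w, c in counts.items()}
smoothed = {w: (c + alpha) / (total + alpha * n) for w, c in counts.items()}

# "awful" never occurs in this class: 0 without smoothing, small but positive with it.
print(unsmoothed["awful"], smoothed["awful"])
```

Without smoothing, a single unseen word would force the whole product of likelihoods to zero. The smoothed estimates still sum to 1 across the vocabulary.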

### Numerical Stability: Log Probabilities

Multiplying many small probabilities can lead to **numerical underflow**. To avoid this, we work with **log probabilities**:

$$\log P(y|X) = \log P(y) + \sum_{i=1}^{n} \log P(x_i|y) - \log P(X)$$

The last term is the same for every class, so it can be dropped when comparing classes.

For Gaussian likelihood, the log probability becomes:

$$\log P(x_i|y) = -\frac{1}{2}\log(2\pi\sigma_{y,i}^2) - \frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}$$
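
The underflow problem is easy to reproduce. In this minimal sketch, 1000 hypothetical per-feature likelihoods of 0.01 are combined both ways; the specific numbers are chosen only to trigger the effect.

```python
import math

# 1000 hypothetical per-feature likelihoods, each equal to 0.01.
probs = [0.01] * 1000

product = math.prod(probs)                  # true value is 1e-2000
log_sum = sum(math.log(p) for p in probs)   # 1000 * log(0.01)

print(product)   # too small for a 64-bit float, so the result collapses to 0.0
print(log_sum)   # a perfectly ordinary negative number
```

Once the product has collapsed to zero for every class, the classes can no longer be compared, whereas the sums of logs remain distinguishable.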

## When to Use Naive Bayes

| Strengths | Limitations |
|-----------|-------------|
| Very fast training and prediction | Assumes feature independence (rarely true) |
| Works well with high-dimensional data | Cannot learn feature interactions |
| Performs well with small training sets | Continuous features may not follow Gaussian |
| Handles missing values naturally | Probability estimates can be poor |
| Resistant to irrelevant features | Unseen feature values get zero probability without smoothing |
| Excellent for text classification | May be outperformed by other algorithms |

### Best Use Cases

- **Text classification**: Spam filtering, sentiment analysis, document categorization
- **Medical diagnosis**: When features are conditionally independent given disease
- **Real-time prediction**: When speed is critical
- **Baseline model**: Quick benchmark before trying complex models

## Pseudocode: Gaussian Naive Bayes

```
TRAINING:
1. For each class y in classes:
   a. Calculate prior: P(y) = count(y) / total_samples
   b. For each feature i:
      - Calculate mean: μ_yi = mean of feature i where class = y
      - Calculate variance: σ²_yi = variance of feature i where class = y

PREDICTION:
1. For each class y:
   a. Start with log_prob = log(P(y))  # log prior
   b. For each feature i:
      - Add log(P(x_i|y)) using Gaussian PDF
   c. Store total log_prob for class y
2. Return class with highest log probability
```

---

# Part 1: Gaussian Naive Bayes Implementation

Let's implement Gaussian Naive Bayes step by step.

## Import Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB as SklearnGaussianNB
from sklearn.naive_bayes import MultinomialNB as SklearnMultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

## Generate Sample Data

In [None]:
# Generate a 2D classification dataset for visualization
X, y = make_classification(
    n_samples=300,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=2.0,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"Number of features: {X_train.shape[1]}")
print(f"Classes: {np.unique(y_train)}")

In [None]:
# Visualize the data
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.scatter(X_train[y_train == 0][:, 0], X_train[y_train == 0][:, 1], 
            c='blue', label='Class 0', alpha=0.6, edgecolors='k')
plt.scatter(X_train[y_train == 1][:, 0], X_train[y_train == 1][:, 1], 
            c='red', label='Class 1', alpha=0.6, edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Training Data')
plt.legend()

plt.subplot(1, 2, 2)
plt.scatter(X_test[y_test == 0][:, 0], X_test[y_test == 0][:, 1], 
            c='blue', label='Class 0', alpha=0.6, edgecolors='k')
plt.scatter(X_test[y_test == 1][:, 0], X_test[y_test == 1][:, 1], 
            c='red', label='Class 1', alpha=0.6, edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Test Data')
plt.legend()

plt.tight_layout()
plt.show()

---

## Exercise 1: Calculate Class Statistics

In this exercise, you'll implement methods to calculate the **prior probabilities** and **class statistics** (mean and variance) needed for Gaussian Naive Bayes.

**Your tasks:**
1. Calculate the prior probability for each class
2. Calculate the mean of each feature for each class
3. Calculate the variance of each feature for each class

In [None]:
class GaussianNaiveBayes:
    """
    Gaussian Naive Bayes classifier implementation from scratch.
    
    Parameters
    ----------
    var_smoothing : float, default=1e-9
        Small constant added to every variance for numerical stability.
        (scikit-learn instead adds this fraction of the largest feature variance.)
    """
    
    def __init__(self, var_smoothing=1e-9):
        self.var_smoothing = var_smoothing
        self.classes_ = None
        self.priors_ = None      # Prior probabilities for each class
        self.theta_ = None       # Mean of each feature per class
        self.var_ = None         # Variance of each feature per class
    
    def _calculate_priors(self, y):
        """
        Calculate the prior probability of each class.
        
        Prior P(y) = count(y) / total_samples
        
        Parameters
        ----------
        y : array-like of shape (n_samples,)
            Target values.
            
        Returns
        -------
        priors : array of shape (n_classes,)
            Prior probability for each class.
        """
        # TODO: Calculate the prior probability for each class
        # Hint: For each class, divide the count of samples in that class
        # by the total number of samples
        
        priors = None  # Replace with your implementation
        
        return priors
    
    def _calculate_class_statistics(self, X, y):
        """
        Calculate mean and variance of each feature for each class.
        
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data.
        y : array-like of shape (n_samples,)
            Target values.
            
        Returns
        -------
        theta : array of shape (n_classes, n_features)
            Mean of each feature per class.
        var : array of shape (n_classes, n_features)
            Variance of each feature per class.
        """
        n_features = X.shape[1]
        n_classes = len(self.classes_)
        
        theta = np.zeros((n_classes, n_features))
        var = np.zeros((n_classes, n_features))
        
        # TODO: For each class, calculate the mean and variance of each feature
        # Hint: Filter X to only include samples of each class, then compute statistics
        
        for idx, c in enumerate(self.classes_):
            # Get samples belonging to class c
            X_c = None  # TODO: Filter X for samples where y == c
            
            # Calculate mean for each feature
            theta[idx, :] = None  # TODO: Calculate mean along axis 0
            
            # Calculate variance for each feature
            var[idx, :] = None  # TODO: Calculate variance along axis 0
        
        return theta, var

### Verification Cell for Exercise 1

Run this cell to verify your implementation of priors and class statistics calculation.

In [None]:
# Test the priors and class statistics calculation
gnb_test = GaussianNaiveBayes()
gnb_test.classes_ = np.unique(y_train)

# Test priors
priors = gnb_test._calculate_priors(y_train)
print("Prior Probabilities:")
if priors is not None:
    for i, c in enumerate(gnb_test.classes_):
        print(f"  P(y={c}) = {priors[i]:.4f}")
    
    # Verify priors sum to 1
    assert np.isclose(priors.sum(), 1.0), "Priors should sum to 1!"
    print("\n✓ Priors sum to 1.0")
else:
    print("  Not implemented yet")

print("\n" + "="*50 + "\n")

# Test class statistics
theta, var = gnb_test._calculate_class_statistics(X_train, y_train)
print("Class Statistics:")
if theta is not None and var is not None:
    for i, c in enumerate(gnb_test.classes_):
        print(f"\nClass {c}:")
        print(f"  Mean (θ): {theta[i]}")
        print(f"  Variance (σ²): {var[i]}")
    
    # Verify shape
    assert theta.shape == (len(gnb_test.classes_), X_train.shape[1]), "Theta shape incorrect!"
    assert var.shape == (len(gnb_test.classes_), X_train.shape[1]), "Variance shape incorrect!"
    print("\n✓ Class statistics shapes are correct")
else:
    print("  Not implemented yet")

<details>
<summary style="cursor: pointer; font-weight: bold;">💡 Click here for Exercise 1 Solution</summary>

```python
def _calculate_priors(self, y):
    priors = np.array([np.sum(y == c) / len(y) for c in self.classes_])
    return priors

def _calculate_class_statistics(self, X, y):
    n_features = X.shape[1]
    n_classes = len(self.classes_)
    
    theta = np.zeros((n_classes, n_features))
    var = np.zeros((n_classes, n_features))
    
    for idx, c in enumerate(self.classes_):
        # Get samples belonging to class c
        X_c = X[y == c]
        
        # Calculate mean for each feature
        theta[idx, :] = X_c.mean(axis=0)
        
        # Calculate variance for each feature
        var[idx, :] = X_c.var(axis=0)
    
    return theta, var
```

**Explanation:**
- **Priors**: For each class, we count how many samples belong to that class and divide by total samples
- **Mean (θ)**: Average value of each feature for samples in each class
- **Variance (σ²)**: Spread of each feature for samples in each class

</details>
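
One detail worth knowing about the solution: `X_c.var(axis=0)` uses NumPy's default `ddof=0`, so it divides by the number of samples (population variance) rather than by one less (sample variance). The sketch below shows the difference on a few made-up feature values, using the standard library's `statistics` module as a stand-in.

```python
import statistics

# Hypothetical values of one feature for the samples of a single class.
feature_values = [2.0, 4.0, 4.0, 6.0]

mean = statistics.fmean(feature_values)
pop_var = statistics.pvariance(feature_values)     # divides by n, like ndarray.var() with ddof=0
sample_var = statistics.variance(feature_values)   # divides by n - 1, like ddof=1

print(mean, pop_var, sample_var)
```

The gap between the two estimates shrinks as the class gets more samples, so the choice mostly matters for very small classes.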

---

## Exercise 2: Calculate Gaussian Log-Likelihood

Now implement the method to calculate the **log-likelihood** of observing features given a class, using the Gaussian probability density function.

**Formula:**
$$\log P(x_i|y) = -\frac{1}{2}\log(2\pi\sigma_{y,i}^2) - \frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}$$

In [None]:
class GaussianNaiveBayes:
    """
    Gaussian Naive Bayes classifier implementation from scratch.
    """
    
    def __init__(self, var_smoothing=1e-9):
        self.var_smoothing = var_smoothing
        self.classes_ = None
        self.priors_ = None
        self.theta_ = None
        self.var_ = None
    
    def _calculate_priors(self, y):
        """Calculate prior probabilities."""
        priors = np.array([np.sum(y == c) / len(y) for c in self.classes_])
        return priors
    
    def _calculate_class_statistics(self, X, y):
        """Calculate mean and variance for each class."""
        n_features = X.shape[1]
        n_classes = len(self.classes_)
        
        theta = np.zeros((n_classes, n_features))
        var = np.zeros((n_classes, n_features))
        
        for idx, c in enumerate(self.classes_):
            X_c = X[y == c]
            theta[idx, :] = X_c.mean(axis=0)
            var[idx, :] = X_c.var(axis=0)
        
        return theta, var
    
    def _calculate_log_likelihood(self, X):
        """
        Calculate log-likelihood of X for each class using Gaussian PDF.
        
        Log P(x_i|y) = -0.5 * log(2π * σ²) - (x_i - μ)² / (2σ²)
        
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Input samples.
            
        Returns
        -------
        log_likelihood : array of shape (n_samples, n_classes)
            Log-likelihood for each sample and each class.
        """
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        
        log_likelihood = np.zeros((n_samples, n_classes))
        
        # TODO: Calculate log-likelihood for each class
        # For each class:
        # 1. Calculate the log of the Gaussian PDF for each feature
        # 2. Sum across features (naive assumption - features are independent)
        
        for idx in range(n_classes):
            # Get mean and variance for this class
            mean = self.theta_[idx]  # shape: (n_features,)
            var = self.var_[idx]     # shape: (n_features,)
            
            # TODO: Calculate log-likelihood using Gaussian PDF formula
            # Hint: Use np.log for logarithm, np.pi for π
            # The formula is: -0.5 * log(2π * σ²) - (x - μ)² / (2σ²)
            # Sum across features for each sample
            
            log_likelihood[:, idx] = None  # Replace with your implementation
        
        return log_likelihood

### Verification Cell for Exercise 2

Run this cell to verify your log-likelihood implementation.

In [None]:
# Test log-likelihood calculation
gnb_test = GaussianNaiveBayes(var_smoothing=1e-9)
gnb_test.classes_ = np.unique(y_train)
gnb_test.theta_, gnb_test.var_ = gnb_test._calculate_class_statistics(X_train, y_train)

# Add smoothing to variance
gnb_test.var_ = gnb_test.var_ + gnb_test.var_smoothing

# Calculate log-likelihood for test samples
log_likelihood = gnb_test._calculate_log_likelihood(X_test[:5])

print("Log-Likelihood for first 5 test samples:")
if log_likelihood is not None and not np.any(log_likelihood == None):
    print(f"Shape: {log_likelihood.shape}")
    print(f"\nLog-likelihood values:")
    for i in range(5):
        print(f"  Sample {i}: Class 0 = {log_likelihood[i, 0]:.4f}, Class 1 = {log_likelihood[i, 1]:.4f}")
    
    # Verify shape
    assert log_likelihood.shape == (5, 2), "Log-likelihood shape incorrect!"
    # Verify no NaN or Inf values
    assert not np.any(np.isnan(log_likelihood)), "Log-likelihood contains NaN!"
    assert not np.any(np.isinf(log_likelihood)), "Log-likelihood contains Inf!"
    print("\n✓ Log-likelihood implementation looks correct")
else:
    print("  Not implemented yet")

<details>
<summary style="cursor: pointer; font-weight: bold;">💡 Click here for Exercise 2 Solution</summary>

```python
def _calculate_log_likelihood(self, X):
    n_samples = X.shape[0]
    n_classes = len(self.classes_)
    
    log_likelihood = np.zeros((n_samples, n_classes))
    
    for idx in range(n_classes):
        mean = self.theta_[idx]
        var = self.var_[idx]
        
        # Log of Gaussian PDF: -0.5 * log(2π * σ²) - (x - μ)² / (2σ²)
        # Sum across features (naive assumption)
        log_likelihood[:, idx] = np.sum(
            -0.5 * np.log(2 * np.pi * var) - ((X - mean) ** 2) / (2 * var),
            axis=1
        )
    
    return log_likelihood
```

**Explanation:**
- We compute the log of the Gaussian PDF for each feature
- The naive assumption allows us to sum log-probabilities across features
- Broadcasting handles the vectorized computation efficiently
- `axis=1` sums across features for each sample

</details>
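
A simple way to sanity-check the log-likelihood is to confirm that the sum of per-feature log densities equals the log of the product of the plain densities. The sketch below does this for one hypothetical sample and one class in pure Python; `gaussian_log_pdf` is a helper name introduced for the example, and all numbers are made up.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of N(mean, var) at x, following the formula above."""
    return -0.5 * math.log(2 * math.pi * var) - ((x - mean) ** 2) / (2 * var)

# One hypothetical sample with two features, plus the statistics of one class.
x = [1.0, -0.5]
means = [0.8, 0.0]
variances = [0.5, 2.0]

# Naive assumption: the joint log-likelihood is the sum over features.
log_lik = sum(gaussian_log_pdf(xi, m, v) for xi, m, v in zip(x, means, variances))

# Cross-check against the product of the plain densities.
product = math.prod(
    math.exp(-((xi - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
    for xi, m, v in zip(x, means, variances)
)

print(log_lik, math.log(product))
```

The two printed values should agree up to floating-point rounding.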

---

## Exercise 3: Complete the Gaussian Naive Bayes Classifier

Now implement the complete `fit` and `predict` methods to finish the Gaussian Naive Bayes classifier.

In [None]:
class GaussianNaiveBayes:
    """
    Gaussian Naive Bayes classifier implementation from scratch.
    """
    
    def __init__(self, var_smoothing=1e-9):
        self.var_smoothing = var_smoothing
        self.classes_ = None
        self.priors_ = None
        self.theta_ = None
        self.var_ = None
    
    def _calculate_priors(self, y):
        """Calculate prior probabilities."""
        return np.array([np.sum(y == c) / len(y) for c in self.classes_])
    
    def _calculate_class_statistics(self, X, y):
        """Calculate mean and variance for each class."""
        n_features = X.shape[1]
        n_classes = len(self.classes_)
        
        theta = np.zeros((n_classes, n_features))
        var = np.zeros((n_classes, n_features))
        
        for idx, c in enumerate(self.classes_):
            X_c = X[y == c]
            theta[idx, :] = X_c.mean(axis=0)
            var[idx, :] = X_c.var(axis=0)
        
        return theta, var
    
    def _calculate_log_likelihood(self, X):
        """Calculate log-likelihood using Gaussian PDF."""
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        log_likelihood = np.zeros((n_samples, n_classes))
        
        for idx in range(n_classes):
            mean = self.theta_[idx]
            var = self.var_[idx]
            log_likelihood[:, idx] = np.sum(
                -0.5 * np.log(2 * np.pi * var) - ((X - mean) ** 2) / (2 * var),
                axis=1
            )
        
        return log_likelihood
    
    def fit(self, X, y):
        """
        Fit the Gaussian Naive Bayes classifier.
        
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data.
        y : array-like of shape (n_samples,)
            Target values.
            
        Returns
        -------
        self : object
            Fitted estimator.
        """
        # TODO: Implement the fit method
        # 1. Store unique classes
        # 2. Calculate prior probabilities
        # 3. Calculate class statistics (mean and variance)
        # 4. Apply variance smoothing for numerical stability
        
        # Store unique classes
        self.classes_ = None  # TODO
        
        # Calculate priors
        self.priors_ = None  # TODO
        
        # Calculate class statistics
        self.theta_, self.var_ = None, None  # TODO
        
        # Apply variance smoothing
        # TODO: Add var_smoothing to variance to prevent division by zero
        
        return self
    
    def predict(self, X):
        """
        Predict class labels for samples in X.
        
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Samples to predict.
            
        Returns
        -------
        y_pred : array of shape (n_samples,)
            Predicted class labels.
        """
        # TODO: Implement the predict method
        # 1. Calculate log priors
        # 2. Calculate log likelihoods
        # 3. Combine: log_posterior = log_prior + log_likelihood (up to an additive constant)
        # 4. Return the class with highest log posterior for each sample
        
        # Calculate log priors (same for all samples)
        log_priors = None  # TODO: Use np.log on priors
        
        # Calculate log likelihoods
        log_likelihood = None  # TODO
        
        # Combine log prior and log likelihood
        log_posterior = None  # TODO: Add log_priors to log_likelihood
        
        # Return class with highest log posterior
        return None  # TODO: Use self.classes_ and np.argmax
    
    def predict_proba(self, X):
        """
        Return probability estimates for samples in X.
        
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Samples.
            
        Returns
        -------
        proba : array of shape (n_samples, n_classes)
            Probability of each class for each sample.
        """
        log_priors = np.log(self.priors_)
        log_likelihood = self._calculate_log_likelihood(X)
        log_posterior = log_priors + log_likelihood
        
        # Convert log probabilities to probabilities using softmax
        # Subtract max for numerical stability
        log_posterior = log_posterior - np.max(log_posterior, axis=1, keepdims=True)
        posterior = np.exp(log_posterior)
        return posterior / posterior.sum(axis=1, keepdims=True)

### Verification Cell for Exercise 3

Run this cell to verify your complete Gaussian Naive Bayes implementation.

In [None]:
# Test the complete implementation
gnb = GaussianNaiveBayes(var_smoothing=1e-9)
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

if y_pred is not None:
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Gaussian Naive Bayes Accuracy: {accuracy:.4f}")
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    
    # Compare with sklearn
    sklearn_gnb = SklearnGaussianNB(var_smoothing=1e-9)
    sklearn_gnb.fit(X_train, y_train)
    sklearn_pred = sklearn_gnb.predict(X_test)
    sklearn_accuracy = accuracy_score(y_test, sklearn_pred)
    
    print(f"\nScikit-learn GaussianNB Accuracy: {sklearn_accuracy:.4f}")
    
    if np.isclose(accuracy, sklearn_accuracy, atol=0.01):
        print("\n✓ Your implementation matches scikit-learn!")
    else:
        print(f"\n⚠ Accuracy differs from sklearn by {abs(accuracy - sklearn_accuracy):.4f}")
else:
    print("Prediction not implemented yet")

<details>
<summary style="cursor: pointer; font-weight: bold;">💡 Click here for Exercise 3 Solution</summary>

```python
def fit(self, X, y):
    # Store unique classes
    self.classes_ = np.unique(y)
    
    # Calculate priors
    self.priors_ = self._calculate_priors(y)
    
    # Calculate class statistics
    self.theta_, self.var_ = self._calculate_class_statistics(X, y)
    
    # Apply variance smoothing for numerical stability
    self.var_ = self.var_ + self.var_smoothing
    
    return self

def predict(self, X):
    # Calculate log priors
    log_priors = np.log(self.priors_)
    
    # Calculate log likelihoods
    log_likelihood = self._calculate_log_likelihood(X)
    
    # Combine: log_posterior = log_prior + log_likelihood (up to an additive constant)
    log_posterior = log_priors + log_likelihood
    
    # Return class with highest log posterior
    return self.classes_[np.argmax(log_posterior, axis=1)]
```

**Explanation:**
- **fit**: Stores classes, computes priors, means, variances, and adds smoothing
- **predict**: Computes log posterior = log prior + log likelihood, returns argmax class
- Using log probabilities avoids numerical underflow from multiplying small numbers

</details>
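
The `predict_proba` method in the Exercise 3 cell subtracts the row maximum before exponentiating. The sketch below, with two made-up log posteriors, shows why: exponentiating very negative values directly gives zero for every class, while shifting by the maximum first yields usable probabilities.

```python
import math

# Hypothetical log posteriors (up to a shared constant) for two classes.
log_posterior = [-1000.0, -1001.0]

# Naive exponentiation underflows to 0.0 for both classes.
naive = [math.exp(v) for v in log_posterior]

# Subtracting the maximum first keeps the largest term at exp(0) = 1.
shift = max(log_posterior)
exps = [math.exp(v - shift) for v in log_posterior]
proba = [e / sum(exps) for e in exps]

print(naive)
print(proba)
```

The shift cancels in the ratio, so the resulting probabilities are the same ones the unshifted formula would give if it could be evaluated exactly.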

---

## Visualize Decision Boundary

In [None]:
def plot_decision_boundary(model, X, y, title="Decision Boundary"):
    """
    Plot the decision boundary of a classifier.
    """
    h = 0.02  # Step size
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.figure(figsize=(10, 6))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], c='blue', 
                label='Class 0', edgecolors='k', alpha=0.7)
    plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], c='red', 
                label='Class 1', edgecolors='k', alpha=0.7)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)
    plt.legend()
    plt.show()

# Plot decision boundary for our implementation
if y_pred is not None:
    plot_decision_boundary(gnb, X_train, y_train, 
                          "Gaussian Naive Bayes Decision Boundary (Our Implementation)")

---

## Multiple Choice Questions: Gaussian Naive Bayes

### Question 1

What does the "naive" assumption in Naive Bayes refer to?

A) The algorithm is simple and basic  
B) Features are assumed to be independent given the class label  
C) The algorithm ignores prior probabilities  
D) Only one feature is used for classification

<details>
<summary style="cursor: pointer; font-weight: bold;">Click here for Answer</summary>

**Answer: B) Features are assumed to be independent given the class label**

The "naive" assumption means that given the class label, all features are conditionally independent of each other. Mathematically: P(X|y) = ∏P(xᵢ|y). This allows us to compute P(X|y) by simply multiplying individual feature probabilities, making the algorithm computationally efficient. The assumption is rarely true in practice, but Naive Bayes often works well despite this simplification.

</details>

### Question 2

Why do we use log probabilities instead of raw probabilities in Naive Bayes?

A) To make the algorithm faster  
B) To convert multiplication to addition  
C) To avoid numerical underflow when multiplying many small probabilities  
D) Both B and C

<details>
<summary style="cursor: pointer; font-weight: bold;">Click here for Answer</summary>

**Answer: D) Both B and C**

Using log probabilities serves two important purposes:

1. **Numerical stability**: When multiplying many small probabilities (like P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)), the result can become so small that it underflows to zero. Log probabilities avoid this issue.

2. **Computational convenience**: log(a × b) = log(a) + log(b), so multiplication becomes addition, which is computationally more efficient and numerically stable.

For example, instead of computing P(y)∏P(xᵢ|y), we compute log(P(y)) + Σlog(P(xᵢ|y)).

</details>

### Question 3

What is the purpose of `var_smoothing` in Gaussian Naive Bayes?

A) To increase model accuracy  
B) To prevent division by zero when variance is very small  
C) To reduce overfitting  
D) To normalize the features

<details>
<summary style="cursor: pointer; font-weight: bold;">Click here for Answer</summary>

**Answer: B) To prevent division by zero when variance is very small**

`var_smoothing` adds a small value to the variance of each feature to ensure numerical stability. In the Gaussian PDF formula, we divide by variance (σ²). If variance is zero or very close to zero (which can happen if all samples of a class have the same feature value), this would cause division by zero or numerical instability.

By adding `var_smoothing` (typically 1e-9), we ensure the variance is never zero:
```
var = var + var_smoothing
```

This is similar to Laplace smoothing in Multinomial NB but for continuous features.

</details>
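
The failure mode behind Question 3 can be reproduced in a few lines. This sketch uses a made-up feature that is constant within a class, and a pure-Python helper (`gaussian_log_pdf`, a name introduced here) in place of the lab's NumPy code.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - ((x - mean) ** 2) / (2 * var)

# Hypothetical feature that is constant within a class, so its variance is 0.
values = [3.0, 3.0, 3.0]
mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / len(values)

try:
    gaussian_log_pdf(3.1, mean, var)
    failed = False
except (ValueError, ZeroDivisionError):
    failed = True   # log(0) and division by zero are both undefined

# Adding a small constant to the variance makes the computation well defined.
var_smoothing = 1e-9
smoothed = gaussian_log_pdf(3.1, mean, var + var_smoothing)

print(failed, smoothed)
```

In pure Python the zero-variance case raises an exception; NumPy would instead emit warnings and produce infinite or NaN values. Note also that scikit-learn's `GaussianNB` scales `var_smoothing` by the largest feature variance rather than adding it directly as this lab does, so the two can differ slightly on some datasets.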

---

# Part 2: Multinomial Naive Bayes for Text Classification

Now let's implement **Multinomial Naive Bayes**, which is commonly used for text classification with word count features.

## Text Classification Example

We'll classify movie reviews as positive or negative.

In [None]:
# Sample movie reviews dataset
reviews = [
    "This movie was fantastic and amazing",
    "Great film with excellent acting",
    "Wonderful story and brilliant performance",
    "I loved this movie so much",
    "Best movie I have ever seen",
    "Outstanding cinematography and plot",
    "Terrible movie waste of time",
    "Awful film with bad acting",
    "Boring and disappointing story",
    "I hated this movie completely",
    "Worst movie ever made",
    "Poor direction and terrible script",
    "Amazing performances by all actors",
    "A masterpiece of modern cinema",
    "Dreadful experience awful waste",
    "Horrible plot and bad dialogue"
]

labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]  # 1 = positive, 0 = negative

# Split data
X_text_train, X_text_test, y_text_train, y_text_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=42
)

print(f"Training samples: {len(X_text_train)}")
print(f"Test samples: {len(X_text_test)}")

In [None]:
# Convert text to bag-of-words representation
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_text_train).toarray()
X_test_bow = vectorizer.transform(X_text_test).toarray()

print(f"Vocabulary size: {len(vectorizer.get_feature_names_out())}")
print(f"\nVocabulary: {vectorizer.get_feature_names_out()}")
print(f"\nBag-of-words shape: {X_train_bow.shape}")
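
To see what the bag-of-words step produces, here is a rough standard-library sketch of the same idea on two of the reviews. It mirrors `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary), but it is an illustration, not a replacement for the library; `tokenize` is a helper name introduced for the example.

```python
import re
from collections import Counter

docs = ["Great film with excellent acting", "Awful film with bad acting"]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: sorted unique tokens across all documents.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row of counts per document, one column per vocabulary word.
bow = []
for doc in docs:
    counts = Counter(tokenize(doc))
    bow.append([counts[word] for word in vocabulary])

print(vocabulary)
print(bow)
```

One consequence of the default token pattern is that single-character words such as "I" and "A" in the reviews do not appear in the vocabulary.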

---

## Exercise 4: Implement Multinomial Naive Bayes

Implement the Multinomial Naive Bayes classifier with **Laplace smoothing**.

**Formula for feature likelihood:**
$$P(x_i|y) = \frac{N_{y,i} + \alpha}{N_y + \alpha \cdot n}$$

Where:
- $N_{y,i}$ = count of feature $i$ in class $y$
- $N_y$ = total count of all features in class $y$
- $\alpha$ = smoothing parameter (usually 1 for Laplace smoothing)
- $n$ = number of features
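
At prediction time, each word's log probability is weighted by how often the word occurs in the document. The sketch below scores one hypothetical count vector against two classes in pure Python; the probabilities and priors are made up and stand in for values that `fit` would estimate.

```python
import math

# Hypothetical smoothed probabilities P(x_i|y) over a 3-word vocabulary.
feature_prob = {
    "pos": [0.5, 0.3, 0.2],
    "neg": [0.2, 0.3, 0.5],
}
prior = {"pos": 0.5, "neg": 0.5}

# Word counts for one document: word 0 appears twice, word 2 once.
x = [2, 0, 1]

# Score per class: log P(y) + sum_i x_i * log P(x_i|y)
log_posterior = {
    c: math.log(prior[c]) + sum(xi * math.log(p) for xi, p in zip(x, feature_prob[c]))
    for c in prior
}
prediction = max(log_posterior, key=log_posterior.get)

print(log_posterior, prediction)
```

Words with a count of zero contribute nothing to the score, which is why the computation can be written as a matrix product between the count matrix and the table of log probabilities.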

In [None]:
class MultinomialNaiveBayes:
    """
    Multinomial Naive Bayes classifier for text classification.
    
    Parameters
    ----------
    alpha : float, default=1.0
        Laplace smoothing parameter.
    """
    
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.classes_ = None
        self.priors_ = None
        self.feature_log_prob_ = None  # Log probability of features given class
    
    def fit(self, X, y):
        """
        Fit the Multinomial Naive Bayes classifier.
        
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data (word counts).
        y : array-like of shape (n_samples,)
            Target values.
        """
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)
        n_features = X.shape[1]
        
        # Calculate priors
        self.priors_ = np.array([np.sum(y == c) / len(y) for c in self.classes_])
        
        # Calculate feature log probabilities with Laplace smoothing
        self.feature_log_prob_ = np.zeros((n_classes, n_features))
        
        # TODO: Calculate P(x_i|y) for each feature and class using Laplace smoothing
        # Formula: P(x_i|y) = (N_yi + alpha) / (N_y + alpha * n_features)
        # Then take log for numerical stability
        
        for idx, c in enumerate(self.classes_):
            # Get samples belonging to class c
            X_c = X[y == c]
            
            # TODO: Calculate N_yi (sum of feature i across all samples in class c)
            feature_counts = None  # Sum along axis 0
            
            # TODO: Calculate N_y (total count of all features in class c)
            total_count = None  # Sum of all feature counts
            
            # TODO: Apply Laplace smoothing and calculate log probabilities
            # P(x_i|y) = (feature_counts + alpha) / (total_count + alpha * n_features)
            self.feature_log_prob_[idx, :] = None
        
        return self
    
    def predict(self, X):
        """
        Predict class labels for samples in X.
        
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Samples to predict.
            
        Returns
        -------
        y_pred : array of shape (n_samples,)
            Predicted class labels.
        """
        # TODO: Calculate log posterior for each class
        # log_posterior = log_prior + sum(x_i * log_P(x_i|y))
        
        log_priors = np.log(self.priors_)
        
        # TODO: Calculate log likelihood using feature counts and log probabilities
        # Hint: Use matrix multiplication X @ self.feature_log_prob_.T
        log_likelihood = None
        
        # TODO: Calculate log posterior
        log_posterior = None
        
        # Return class with highest log posterior
        return self.classes_[np.argmax(log_posterior, axis=1)]

### Verification Cell for Exercise 4

In [None]:
# Test Multinomial Naive Bayes
mnb = MultinomialNaiveBayes(alpha=1.0)
mnb.fit(X_train_bow, np.array(y_text_train))

# Make predictions
y_text_pred = mnb.predict(X_test_bow)

if y_text_pred is not None:
    accuracy = accuracy_score(y_text_test, y_text_pred)
    print(f"Multinomial Naive Bayes Accuracy: {accuracy:.4f}")
    
    print("\nPredictions vs Actual:")
    for review, actual, pred in zip(X_text_test, y_text_test, y_text_pred):
        sentiment_actual = "Positive" if actual == 1 else "Negative"
        sentiment_pred = "Positive" if pred == 1 else "Negative"
        match = "✓" if actual == pred else "✗"
        print(f"  {match} '{review[:40]}...' - Actual: {sentiment_actual}, Predicted: {sentiment_pred}")
    
    # Compare with sklearn
    sklearn_mnb = SklearnMultinomialNB(alpha=1.0)
    sklearn_mnb.fit(X_train_bow, np.array(y_text_train))
    sklearn_pred = sklearn_mnb.predict(X_test_bow)
    sklearn_accuracy = accuracy_score(y_text_test, sklearn_pred)
    
    print(f"\nScikit-learn MultinomialNB Accuracy: {sklearn_accuracy:.4f}")
    
    if np.allclose(y_text_pred, sklearn_pred):
        print("\n✓ Your implementation matches scikit-learn!")
else:
    print("Prediction not implemented yet")

<details>
<summary style="cursor: pointer; font-weight: bold;">💡 Click here for Exercise 4 Solution</summary>

```python
def fit(self, X, y):
    self.classes_ = np.unique(y)
    n_classes = len(self.classes_)
    n_features = X.shape[1]
    
    # Calculate priors
    self.priors_ = np.array([np.sum(y == c) / len(y) for c in self.classes_])
    
    # Calculate feature log probabilities with Laplace smoothing
    self.feature_log_prob_ = np.zeros((n_classes, n_features))
    
    for idx, c in enumerate(self.classes_):
        X_c = X[y == c]
        
        # N_yi: sum of feature i across all samples in class c
        feature_counts = X_c.sum(axis=0)
        
        # N_y: total count of all features in class c
        total_count = feature_counts.sum()
        
        # Apply Laplace smoothing and calculate log probabilities
        self.feature_log_prob_[idx, :] = np.log(
            (feature_counts + self.alpha) / (total_count + self.alpha * n_features)
        )
    
    return self

def predict(self, X):
    log_priors = np.log(self.priors_)
    
    # Log likelihood: sum of (x_i * log P(x_i|y))
    log_likelihood = X @ self.feature_log_prob_.T
    
    # Log posterior
    log_posterior = log_priors + log_likelihood
    
    return self.classes_[np.argmax(log_posterior, axis=1)]
```

**Explanation:**
- **Laplace smoothing**: Adds α to each count to avoid zero probabilities for unseen words
- **Feature counts**: Sum of each word's frequency across all documents in a class
- **Log likelihood**: For count data, we multiply log probabilities by word counts
- Matrix multiplication `X @ feature_log_prob_.T` efficiently computes the sum
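
To see why the matrix product gives the same result as summing word by word, here is a minimal sketch on toy inputs. The count matrix and probabilities below are made up for illustration and are not taken from the lab data.

```python
import numpy as np

# Toy inputs (made up): 2 documents, 3 vocabulary words, 2 classes
X = np.array([[2, 0, 1],
              [0, 3, 0]])
feature_log_prob = np.log(np.array([[0.5, 0.2, 0.3],
                                    [0.1, 0.7, 0.2]]))

# Explicit version: for each document and class, sum x_i * log P(x_i|y)
n_docs, n_classes = X.shape[0], feature_log_prob.shape[0]
loop_ll = np.zeros((n_docs, n_classes))
for d in range(n_docs):
    for c in range(n_classes):
        loop_ll[d, c] = np.sum(X[d] * feature_log_prob[c])

# Vectorized version, as used in the solution
matmul_ll = X @ feature_log_prob.T

print(np.allclose(loop_ll, matmul_ll))
```

Each entry of the result is the log likelihood of one document under one class, so the output has shape `(n_samples, n_classes)`.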

</details>

---

## Multiple Choice Questions: Multinomial Naive Bayes

### Question 4

What problem does Laplace smoothing (alpha) solve in Multinomial Naive Bayes?

A) It speeds up training  
B) It prevents zero probabilities for unseen words  
C) It normalizes the feature values  
D) It reduces the vocabulary size

<details>
<summary style="cursor: pointer; font-weight: bold;">Click here for Answer</summary>

**Answer: B) It prevents zero probabilities for unseen words**

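Without smoothing, if a word never appears in the training documents of a particular class, P(word|class) = 0. Because the class likelihood is a product of per-word probabilities, a single zero makes the whole product zero, so one unseen word rules out that class no matter how strongly the remaining words support it.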

Laplace smoothing adds α (usually 1) to each word count:
```
P(word|class) = (count + α) / (total + α × vocabulary_size)
```

This ensures no probability is ever zero, while still giving higher probabilities to more frequent words.
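
The effect can be checked on a small made-up example. The counts, the document, and the helper name `likelihood` are hypothetical. The sketch ignores the multinomial coefficient, which is the same for every class and does not affect the comparison.

```python
import numpy as np

# Hypothetical counts for one class over a 4-word vocabulary;
# the last word never appeared in this class during training
counts = np.array([5, 3, 2, 0])
# A new document that contains every word at least once
doc = np.array([2, 1, 1, 1])

def likelihood(counts, doc, alpha):
    """Product of smoothed word probabilities raised to the word counts."""
    probs = (counts + alpha) / (counts.sum() + alpha * counts.size)
    return np.prod(probs ** doc)

no_smoothing = likelihood(counts, doc, alpha=0.0)
laplace = likelihood(counts, doc, alpha=1.0)

print(no_smoothing)  # 0.0: the single unseen word zeroes out the product
print(laplace)       # small but strictly positive
```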

</details>

### Question 5

When would you choose Multinomial NB over Gaussian NB?

A) When features are continuous and normally distributed  
B) When working with count/frequency data like word occurrences  
C) When you have very few training samples  
D) When features have high correlation

<details>
<summary style="cursor: pointer; font-weight: bold;">Click here for Answer</summary>

**Answer: B) When working with count/frequency data like word occurrences**

- **Multinomial NB**: Best for discrete count data, especially text classification with bag-of-words features. It assumes features are counts drawn from a multinomial distribution; fractional weights such as TF-IDF are not true counts but often work in practice.

- **Gaussian NB**: Best for continuous features that follow a Gaussian (normal) distribution. Used for general classification with continuous data.

Examples:
- Document classification → Multinomial NB
- Spam detection (word counts) → Multinomial NB
- Iris flower classification (petal measurements) → Gaussian NB
- Sensor data classification → Gaussian NB

</details>

### Question 6

What happens if we increase the smoothing parameter α in Multinomial NB?

A) The model becomes more confident in its predictions  
B) Feature probabilities become more uniform across classes  
C) Training becomes faster  
D) The vocabulary size decreases

<details>
<summary style="cursor: pointer; font-weight: bold;">Click here for Answer</summary>

**Answer: B) Feature probabilities become more uniform across classes**

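As α increases:
- Within each class, all feature probabilities move closer to the uniform value 1/vocabulary_size, so the classes become harder to tell apart
- The model relies less on the observed counts and more on the uniform pseudo-counts; in the limit, predictions are driven by the class priors alone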
- This increases **bias** but reduces **variance** (bias-variance tradeoff)

Example:
- α = 0: P(word|class) purely based on training counts (high variance)
- α = 1: Standard Laplace smoothing (balanced)
- α >> 1: Probabilities approach uniform, model ignores data (high bias)

This is similar to regularization in other models.
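
A minimal sketch of this trend, using hypothetical counts for a single class, measures how far the smoothed probabilities are from the uniform value as α grows:

```python
import numpy as np

# Hypothetical word counts for one class over a 4-word vocabulary
counts = np.array([8, 1, 1, 0])
uniform = 1.0 / counts.size

gaps = []
for alpha in [0.01, 1.0, 10.0, 1000.0]:
    probs = (counts + alpha) / (counts.sum() + alpha * counts.size)
    # Largest distance between any smoothed probability and the uniform value
    gap = np.max(np.abs(probs - uniform))
    gaps.append(gap)
    print(f"alpha={alpha:>7}: probs={np.round(probs, 3)}, max gap from uniform={gap:.4f}")
```

The gap shrinks steadily as α increases, which is the "approach uniform" behaviour described above.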

</details>

---

## Effect of Smoothing Parameter

Let's visualize how the smoothing parameter affects model performance.

In [None]:
# Test different alpha values
alphas = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
train_accuracies = []
test_accuracies = []

for alpha in alphas:
    mnb_test = MultinomialNaiveBayes(alpha=alpha)
    mnb_test.fit(X_train_bow, np.array(y_text_train))
    
    train_pred = mnb_test.predict(X_train_bow)
    test_pred = mnb_test.predict(X_test_bow)
    
    if train_pred is not None and test_pred is not None:
        train_accuracies.append(accuracy_score(y_text_train, train_pred))
        test_accuracies.append(accuracy_score(y_text_test, test_pred))

if train_accuracies and test_accuracies:
    plt.figure(figsize=(10, 5))
    plt.plot(alphas, train_accuracies, 'bo-', label='Training Accuracy', markersize=8)
    plt.plot(alphas, test_accuracies, 'rs-', label='Test Accuracy', markersize=8)
    plt.xscale('log')
    plt.xlabel('Alpha (Smoothing Parameter)')
    plt.ylabel('Accuracy')
    plt.title('Effect of Laplace Smoothing on Multinomial NB')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
else:
    print("Complete Exercise 4 to see the smoothing effect visualization")

---

# Part 3: Applying to a Real Dataset - Iris Classification

In [None]:
# Load Iris dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

print(f"Iris dataset shape: {X_iris.shape}")
print(f"Classes: {iris.target_names}")
print(f"Features: {iris.feature_names}")

In [None]:
# Split data
X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

# Train our Gaussian NB
gnb_iris = GaussianNaiveBayes(var_smoothing=1e-9)
gnb_iris.fit(X_iris_train, y_iris_train)

# Predict
y_iris_pred = gnb_iris.predict(X_iris_test)

if y_iris_pred is not None:
    print("Gaussian Naive Bayes on Iris Dataset")
    print("="*50)
    print(f"\nAccuracy: {accuracy_score(y_iris_test, y_iris_pred):.4f}")
    print("\nClassification Report:")
    print(classification_report(y_iris_test, y_iris_pred, target_names=iris.target_names))
    
    # Confusion Matrix visualization
    cm = confusion_matrix(y_iris_test, y_iris_pred)
    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap='Blues')
    plt.title('Confusion Matrix - Iris Classification')
    plt.colorbar()
    tick_marks = np.arange(len(iris.target_names))
    plt.xticks(tick_marks, iris.target_names, rotation=45)
    plt.yticks(tick_marks, iris.target_names)
    
    # Add text annotations
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, str(cm[i, j]), ha='center', va='center',
                    color='white' if cm[i, j] > cm.max()/2 else 'black')
    
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
    plt.show()
else:
    print("Complete Exercise 3 to see Iris classification results")

---

## Best Practices and Tips

### 1. Feature Engineering
- **Gaussian NB**: Works best when features approximately follow normal distribution
- **Multinomial NB**: Best for count data (text); TF-IDF weighting is worth trying and sometimes improves results

### 2. Choosing Smoothing Parameters
- **var_smoothing** (Gaussian): Start with 1e-9, increase if numerical issues occur
- **alpha** (Multinomial): Use cross-validation to find optimal value; 1.0 is a good default
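
One possible way to cross-validate `alpha` is scikit-learn's `GridSearchCV` over a `Pipeline`. This is a sketch under assumptions: the tiny corpus, the alpha grid, and the step names `vectorizer` and `nb` are made up to keep the example self-contained. Placing the vectorizer inside the pipeline means the vocabulary is rebuilt from the training portion of each fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 6 positive and 6 negative reviews
texts = [
    "great movie loved it", "wonderful acting great story",
    "loved the story", "great fun wonderful film",
    "really loved this film", "wonderful and fun",
    "terrible movie hated it", "boring story awful acting",
    "hated the acting", "awful boring film",
    "really terrible and boring", "awful waste of time",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
param_grid = {"nb__alpha": [0.01, 0.1, 0.5, 1.0, 2.0, 5.0]}

# 3-fold cross-validation over the alpha grid, scored by accuracy
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print("Best alpha:", search.best_params_["nb__alpha"])
print(f"Mean CV accuracy: {search.best_score_:.3f}")
```

On a corpus this small the selected value is not meaningful; the point is only the mechanics of the search.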

### 3. When Naive Bayes Shines
- Text classification (spam, sentiment, categorization)
- High-dimensional data with many features
- When you need a quick baseline model
- When training data is limited

### 4. When to Consider Alternatives
- When features are highly correlated
- When decision boundaries are complex
- When probability estimates need to be well-calibrated

### 5. Common Mistakes to Avoid
- Forgetting to use log probabilities → numerical underflow
- Using Multinomial NB with negative feature values
- Not applying smoothing → zero probability issues
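
The underflow problem in the first point is easy to reproduce. The sketch below uses 1,000 hypothetical per-word probabilities of 0.01 each and compares the direct product with the sum of logs:

```python
import numpy as np

# 1,000 hypothetical per-word probabilities, each equal to 0.01
probs = np.full(1000, 0.01)

# The true product is 1e-2000, far below the smallest positive float64,
# so the direct product underflows to exactly 0.0
direct_product = np.prod(probs)

# The same quantity in log space is an ordinary finite number
log_sum = np.sum(np.log(probs))

print(direct_product)
print(log_sum)
```

Once the product has collapsed to zero for every class, the classes can no longer be compared, whereas the log sums remain distinguishable.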

---

## Summary

In this lab, you learned:

1. **Bayes Theorem Foundation**: How to use $P(y|X) \propto P(X|y)P(y)$ for classification

2. **Gaussian Naive Bayes**: 
   - Assumes continuous features follow Gaussian distributions
   - Computes mean and variance per feature per class
   - Uses variance smoothing for numerical stability

3. **Multinomial Naive Bayes**:
   - Best for count/frequency data (text classification)
   - Uses Laplace smoothing to handle zero counts
   - Feature probability: $P(x_i|y) = \frac{N_{y,i} + \alpha}{N_y + \alpha n}$

4. **Numerical Stability**:
   - Always use log probabilities to avoid underflow
   - Convert multiplication to addition: $\log(ab) = \log(a) + \log(b)$

5. **The Naive Assumption**:
   - Features are conditionally independent given the class
   - This simplification makes computation tractable
   - Often works well despite being unrealistic

### Key Takeaways

- Naive Bayes is fast, simple, and effective for many tasks
- Choose the right variant based on your data type
- Smoothing parameters control the bias-variance tradeoff
- Log probabilities are essential for numerical stability