# Understanding the Naive Bayes Classifier

Naive Bayes is a simple yet powerful classification technique based on **Bayes' Theorem**, which relates the conditional and marginal probabilities of random events.

---

## 📌 Bayes' Theorem

The fundamental equation:

$$
P(y \mid X) = \frac{P(X \mid y) \cdot P(y)}{P(X)}
$$

Where:
- \( P(y \mid X) \) is the **posterior**: the probability of class `y` given the features `X`.
- \( P(X \mid y) \) is the **likelihood**: the probability of the features `X` given that class `y` is true.
- \( P(y) \) is the **prior**: the initial probability of class `y`.
- \( P(X) \) is the **evidence**: the overall probability of the features `X` (constant for all classes).

In practice, since \( P(X) \) is the same for every class, we often omit it when comparing probabilities between classes.
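
As a small worked example, the posterior for two classes can be computed directly from the formula. The priors and likelihoods below are made-up numbers chosen for illustration, not values from the dataset used later:

```python
import numpy as np

# Hypothetical numbers chosen only for illustration.
priors = np.array([0.6, 0.4])        # P(y=0), P(y=1)
likelihoods = np.array([0.1, 0.3])   # P(X | y=0), P(X | y=1)

# Numerator of Bayes' theorem for each class.
joint = likelihoods * priors

# Evidence P(X): the same normalizer for every class.
evidence = joint.sum()

posteriors = joint / evidence

# Dividing by the evidence does not change which class wins.
assert np.argmax(joint) == np.argmax(posteriors)
print(posteriors)
```

The final assertion illustrates why the evidence can be skipped when only the most probable class is needed.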

---

## 🤔 Why Is It Called *Naive*?

The "naive" part comes from the **assumption of independence**: Naive Bayes assumes that all features are conditionally independent given the class. This is rarely true in real-world data, but the method often still performs well in practice.
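
Written out for a feature vector \( X = (x_1, \dots, x_n) \) (notation assumed here to match the sums used further down), the independence assumption means the likelihood factorizes into per-feature terms:

$$
P(X \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
$$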

---

## 🎯 Classification Objective

Our goal is to find the class \( y \) that **maximizes the posterior**:

$$
\hat{y} = \arg\max_y \; P(y \mid X)
$$

Because \( P(X) \) does not depend on \( y \), dropping it leaves the arg max unchanged:

$$
\hat{y} = \arg\max_y \; P(X \mid y) \cdot P(y)
$$

---

## 📌 Gaussian Naive Bayes

If the features are continuous, we often assume that they follow a **normal distribution** within each class. The likelihood \( P(x_i \mid y) \) is then computed using the **Gaussian probability density function**:

$$
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right)
$$

Where:
- \( \mu \) is the mean of the feature values for class `y`
- \( \sigma \) is the standard deviation of the feature values for class `y`
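
The density above translates almost line for line into NumPy. This is a minimal sketch; the function name `gaussian_pdf` is an illustrative choice and is not taken from the `mlfs` package:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x; sigma must be positive."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# At the mean of a standard normal the density equals 1 / sqrt(2 * pi).
peak = gaussian_pdf(0.0, mu=0.0, sigma=1.0)
print(peak)
```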

---

## 🧮 Log Probability Trick

Under the independence assumption the likelihood is a product of per-feature terms. To avoid numerical underflow when multiplying many small probabilities, we sum the **logarithms** of those terms instead:

$$
\log P(X \mid y) = \sum_{i=1}^{n} \log P(x_i \mid y)
$$

Thus, the final decision rule becomes:

$$
\hat{y} = \arg\max_y \; \left( \log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y) \right)
$$
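
A quick way to see the underflow problem is to multiply many small numbers in floating point. The 400 likelihood values below are hypothetical and chosen only to make the effect visible:

```python
import numpy as np

# 400 hypothetical per-feature likelihoods, each equal to 1e-3.
probs = np.full(400, 1e-3)

# The true product is 1e-1200, far below the smallest positive float64.
naive_product = np.prod(probs)

# Summing logs keeps the same information in a representable range.
log_likelihood = np.sum(np.log(probs))

print(naive_product)   # underflows to 0.0
print(log_likelihood)  # 400 * log(1e-3), roughly -2763.1
```

Once the product collapses to zero, all classes look equally (im)probable, whereas the log-sum still ranks them correctly.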

---

## 🛠 Implementation Summary

To implement Naive Bayes with a Gaussian assumption:
- Estimate \( \mu \) and \( \sigma \) for each feature within each class, and the prior \( P(y) \) for each class.
- For each test example, compute the log-prior plus the summed log-likelihoods for every class.
- Predict the class with the highest resulting log-posterior score.
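
The steps above can be sketched in a few lines of NumPy. The class `GaussianNBSketch` is a hypothetical illustration and not the `mlfs.naive_bayes.NaiveBayes` implementation used below; the small `eps` added to the variances is an assumed safeguard against division by zero:

```python
import numpy as np

class GaussianNBSketch:
    """Illustrative Gaussian Naive Bayes; not the mlfs implementation."""

    def fit(self, X, y, eps=1e-9):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # Per-class feature means, variances and log-priors.
        self.mu_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # eps is an assumed small constant that guards against zero variance.
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + eps
        self.log_prior_ = np.log(np.array([np.mean(y == c) for c in self.classes_]))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        scores = []
        for mu, var, log_prior in zip(self.mu_, self.var_, self.log_prior_):
            # Sum of per-feature Gaussian log-densities for every sample.
            log_lik = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
            scores.append(log_prior + log_lik)
        # The highest log-posterior score wins.
        return self.classes_[np.argmax(np.array(scores), axis=0)]


# Two well-separated synthetic clusters as a sanity check.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-3.0, 1.0, size=(50, 2)),
                    rng.normal(3.0, 1.0, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)
preds_demo = GaussianNBSketch().fit(X_demo, y_demo).predict(X_demo)
print((preds_demo == y_demo).mean())
```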


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB as SklearnNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

from mlfs.naive_bayes import NaiveBayes as CustomNB
from mlfs.preprocessing import train_test_split, standardize
from mlfs.metrics import (
    accuracy as custom_accuracy,
    precision as custom_precision,
    recall as custom_recall,
    f1_score as custom_f1,
)


In [None]:
df = pd.read_csv("../data/breast-cancer.csv")
df


# Data Preprocessing

In [4]:
df.drop('id', axis=1, inplace=True) 

In [None]:
df['diagnosis'] = (df['diagnosis'] == 'M').astype(int)
corr = df.corr()
plt.figure(figsize=(20,20))
sns.heatmap(corr, cmap='viridis_r',annot=True)
plt.show()

In [6]:
# Drop features whose absolute correlation with the target is below 0.25.
# The threshold must be applied to the absolute value, not to the raw correlation.
low_corr = corr['diagnosis'].abs() < 0.25
notincluded_columns = corr.index[low_corr].tolist()
df.drop(columns=notincluded_columns, inplace=True)

In [None]:
X = df.drop('diagnosis', axis=1).values
y = df['diagnosis'].values  # NumPy array, consistent with X
print('Shape of X', X.shape)
print('Shape of y', y.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y)
X_train, mean, std = standardize(X_train, return_params=True)
X_test = (X_test - mean) / std


# Comparing

In [None]:
def benchmark_naive_bayes_custom_vs_sklearn(X, y, n_repeats=5):
    """
    Benchmarks training and prediction times for custom and sklearn Naive Bayes implementations.

    Parameters
    ----------
    X : np.ndarray
        Feature matrix.
    y : np.ndarray
        Target vector (binary: 0/1).
    n_repeats : int
        Number of repetitions for averaging time.

    Returns
    -------
    pd.DataFrame
        DataFrame with average fit and predict times for both models.
    """

    custom_fit_times = []
    custom_predict_times = []
    sklearn_fit_times = []
    sklearn_predict_times = []

    for _ in range(n_repeats):
        # Custom Naive Bayes
        model_custom = CustomNB()

        # perf_counter is a monotonic, high-resolution clock suited to short timings
        start = time.perf_counter()
        model_custom.fit(X, y)
        custom_fit_times.append(time.perf_counter() - start)

        start = time.perf_counter()
        model_custom.predict(X)
        custom_predict_times.append(time.perf_counter() - start)

        # Sklearn Naive Bayes
        model_sklearn = SklearnNB()

        start = time.perf_counter()
        model_sklearn.fit(X, y)
        sklearn_fit_times.append(time.perf_counter() - start)

        start = time.perf_counter()
        model_sklearn.predict(X)
        sklearn_predict_times.append(time.perf_counter() - start)

    results = pd.DataFrame({
        'Model': ['CustomNB', 'SklearnNB'],
        'FitTime': [np.mean(custom_fit_times), np.mean(sklearn_fit_times)],
        'PredictTime': [np.mean(custom_predict_times), np.mean(sklearn_predict_times)]
    })

    return results


In [None]:
results = benchmark_naive_bayes_custom_vs_sklearn(X_train, y_train)
print(results)


In [None]:
def benchmark_nb_scalability_vs_sklearn(sample_sizes, n_features=4):
    """
    Benchmarks and compares fit/predict times of custom and sklearn Naive Bayes models 
    as dataset size increases.

    Parameters
    ----------
    sample_sizes : list[int]
        List of dataset sizes to evaluate.
    n_features : int
        Number of features per sample.

    Returns
    -------
    pd.DataFrame
        DataFrame with timing results for plotting and analysis.
    """

    records = []

    for n_samples in sample_sizes:
        X, y = make_classification(
            n_samples=n_samples,
            n_features=n_features,
            n_informative=n_features,
            n_redundant=0,
            n_classes=2,
            random_state=42
        )

        # Custom Naive Bayes
        custom_model = CustomNB()

        start = time.perf_counter()
        custom_model.fit(X, y)
        fit_custom = time.perf_counter() - start

        start = time.perf_counter()
        custom_model.predict(X)
        predict_custom = time.perf_counter() - start

        records.append({
            "Samples": n_samples,
            "Model": "CustomNB",
            "FitTime": fit_custom,
            "PredictTime": predict_custom
        })

        # Sklearn Naive Bayes
        sklearn_model = SklearnNB()

        start = time.perf_counter()
        sklearn_model.fit(X, y)
        fit_sklearn = time.perf_counter() - start

        start = time.perf_counter()
        sklearn_model.predict(X)
        predict_sklearn = time.perf_counter() - start

        records.append({
            "Samples": n_samples,
            "Model": "SklearnNB",
            "FitTime": fit_sklearn,
            "PredictTime": predict_sklearn
        })

    df = pd.DataFrame(records)

    fig, axs = plt.subplots(1, 2, figsize=(14, 5))

    sns.lineplot(data=df, x="Samples", y="FitTime", hue="Model", marker="o", ax=axs[0])
    axs[0].set_title("Fit Time vs Sample Size")
    axs[0].set_ylabel("Time (s)")
    axs[0].grid(True)

    sns.lineplot(data=df, x="Samples", y="PredictTime", hue="Model", marker="o", ax=axs[1])
    axs[1].set_title("Predict Time vs Sample Size")
    axs[1].set_ylabel("Time (s)")
    axs[1].grid(True)

    plt.tight_layout()
    plt.show()

    return df


In [None]:
sample_sizes = [100, 500, 1000, 2000, 5000]
benchmark_nb_scalability_vs_sklearn(sample_sizes)


In [None]:
def compare_nb_accuracy_basic(X_train, X_test, y_train, y_test):
    """
    Compares CustomNB and SklearnNB using classification metrics:
    - accuracy
    - precision
    - recall
    - f1_score

    Returns
    -------
    pd.DataFrame
        Comparison of all metrics.
    """

    # Custom Naive Bayes
    custom_model = CustomNB()
    custom_model.fit(X_train, y_train)
    preds_custom = custom_model.predict(X_test)

    acc_custom = custom_accuracy(y_test, preds_custom)
    prec_custom = custom_precision(y_test, preds_custom)
    rec_custom = custom_recall(y_test, preds_custom)
    f1_custom = custom_f1(y_test, preds_custom)

    # Sklearn Naive Bayes
    sklearn_model = SklearnNB()
    sklearn_model.fit(X_train, y_train)
    preds_sklearn = sklearn_model.predict(X_test)

    acc_sklearn = accuracy_score(y_test, preds_sklearn)
    prec_sklearn = precision_score(y_test, preds_sklearn)
    rec_sklearn = recall_score(y_test, preds_sklearn)
    f1_sklearn = f1_score(y_test, preds_sklearn)

    results = pd.DataFrame({
        "Model": ["CustomNB", "SklearnNB"],
        "Accuracy": [acc_custom, acc_sklearn],
        "Precision": [prec_custom, prec_sklearn],
        "Recall": [rec_custom, rec_sklearn],
        "F1 Score": [f1_custom, f1_sklearn]
    })

    return results


In [None]:
compare_nb_accuracy_basic(X_train, X_test, y_train, y_test)

## Analysis: Custom Naive Bayes vs Sklearn GaussianNB

### 1. Performance (Fit and Prediction Time)

The custom implementation of Naive Bayes trains very quickly across all tested dataset sizes. This is expected, since training in Gaussian Naive Bayes reduces to computing class-wise means, variances, and prior probabilities, operations that are simple and efficient even without heavy optimization.

However, prediction is noticeably slower than in sklearn's implementation, and the gap widens as the number of samples increases. This likely results from the use of plain Python loops in my implementation, while sklearn relies on vectorized NumPy routines that run in compiled code.

In short, training is comparably lightweight in both models, but sklearn handles prediction much more efficiently because its work is vectorized.

---

### 2. Classification Metrics (Accuracy, Precision, Recall, F1)

In terms of predictive performance, both models yield nearly identical results on all evaluated metrics. This suggests that the underlying probability calculations and class decision logic in my implementation are consistent with sklearn's behavior.

The close match in accuracy, precision, recall, and F1-score indicates that both models handle this dataset equally well. Any differences are negligible, although no formal significance test was run.
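
For reference, the four metrics compared above can all be derived from confusion-matrix counts. The helper below is a hypothetical sketch for binary 0/1 labels, not the code in `mlfs.metrics`, and returning 0.0 when a denominator is zero is an assumed convention:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (positive class = 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
metrics_demo = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(metrics_demo)
```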

---

### Summary

- The custom model achieves high accuracy and matches sklearn in terms of predictive quality.
- Prediction is significantly slower in the custom version, especially on larger datasets.
- This trade-off is expected: sklearn is highly optimized, while my implementation prioritizes readability and clarity over speed.

To improve prediction time, I could consider:
- Replacing sample-wise loops with vectorized NumPy operations,
- Avoiding redundant calculations inside the prediction loop,
- Precomputing reusable components during inference.
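
A rough sketch of what these changes could look like is shown below. It assumes the fitted parameters are stored as arrays of shape `(n_classes, n_features)`; the function names and array layout are illustrative assumptions, not the current `mlfs` code. The loop version serves as a reference to check that the vectorized one returns the same scores:

```python
import numpy as np

def log_posterior_loop(X, mu, var, log_prior):
    """Reference version: explicit loops over samples and classes."""
    out = np.empty((X.shape[0], mu.shape[0]))
    for i, x in enumerate(X):
        for k in range(mu.shape[0]):
            log_lik = -0.5 * np.sum(np.log(2.0 * np.pi * var[k]) + (x - mu[k]) ** 2 / var[k])
            out[i, k] = log_prior[k] + log_lik
    return out

def log_posterior_vectorized(X, mu, var, log_prior):
    """Same scores via broadcasting, with no Python-level loops."""
    # Term that does not depend on the sample: computed once per class.
    const = -0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)      # (n_classes,)
    # (n_samples, 1, n_features) - (1, n_classes, n_features)
    diff = X[:, None, :] - mu[None, :, :]
    quad = -0.5 * np.sum(diff ** 2 / var[None, :, :], axis=2)     # (n_samples, n_classes)
    return log_prior + const + quad

# Random parameters purely for checking that both versions agree.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
mu_demo = rng.normal(size=(2, 4))
var_demo = rng.uniform(0.5, 2.0, size=(2, 4))
log_prior_demo = np.log(np.array([0.4, 0.6]))

slow = log_posterior_loop(X_demo, mu_demo, var_demo, log_prior_demo)
fast = log_posterior_vectorized(X_demo, mu_demo, var_demo, log_prior_demo)
print(np.allclose(slow, fast))
```

The broadcast intermediate has shape `(n_samples, n_classes, n_features)`, so for very large inputs it may be worth processing the samples in chunks to limit memory use.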
