# Gaussian Naive Bayes Classifier from Scratch
***
## Table of Contents
1. [Introduction](#1-introduction)
    - [Bayes' Theorem](#bayes-theorem)
2. [Loading Data](#2-loading-data)
3. [Prior Probability](#3-prior-probability)
4. [Likelihood for Gaussian NB](#4-likelihood-for-gaussian-nb)
5. [Posterior Probability for Gaussian NB](#5-posterior-probability-for-gaussian-nb)
6. [Prediction](#6-prediction)
7. [Evaluation Metrics](#7-evaluation-metrics)
    - [Binary Confusion Matrix](#binary-confusion-matrix)
    - [Multi-Class Confusion Matrix](#multi-class-confusion-matrix)
    - [Accuracy](#accuracy)
    - [Precision](#precision)
    - [Recall](#recall)
    - [F1-Score](#f1-score)
8. [Train Test Split](#8-train-test-split)
9. [Encapsulation](#9-encapsulation)
10. [Comparison with Scikit-Learn](#10-comparison-with-scikit-learn)
***

In [1]:
import numpy as np
import pandas as pd
from typing import Dict, List, Tuple
from sklearn.datasets import load_iris
from numpy.typing import NDArray

## 1. Introduction
Naive Bayes classifiers are probabilistic classification models based on Bayes' Theorem. They assume that the features are conditionally independent of one another given the class label. Naive Bayes is a general framework; the specific variant should be chosen based on the nature of your data:

- **Categorical Naive Bayes**

    - **Features**: Categorical labels (e.g., colours, countries, product types).

    - **Use Case**: Classification with discrete, categorically distributed features.

- **Multinomial Naive Bayes**

    - **Features**: Counts or frequencies (e.g., word occurrences, event counts).

    - **Use Case**: Text classification, document classification, or any scenario where features are discrete counts.

- **Gaussian Naive Bayes**

    - **Features**: Continuous data (e.g., measurements, sensor readings).

    - **Use Case**: Classification with numerical features assumed to follow a Gaussian distribution.

- **Bernoulli Naive Bayes**

    - **Features**: Binary features (e.g., True/False, 0/1).

    - **Use Case**: Text classification (presence/absence of words), binary feature spaces.



### Bayes' Theorem
Bayes' theorem describes the probability of a class $C_{i}$ given a set of features $X = (x_{1}, x_{2},\ldots,x_{N})$:

\begin{align*}
P(C_{i}|X) = \dfrac{P(X|C_{i}) \cdot P(C_{i})}{P(X)}
\end{align*}

where:
- $P(C_{i}|X)$: Posterior probability of class $C_{i}$ given features $X$.
- $P(X|C_{i})$: Likelihood of features $X$ given class $C_{i}$.
- $P(C_{i})$: Prior probability of class $C_{i}$.
- $P(X)$: Evidence (normalising constant, the same for all classes).

Gaussian Naive Bayes assumes features $X = (x_{1}, x_{2},\ldots,x_{N})$ are conditionally independent given the class $C_{i}$ and features follow a Gaussian (normal) distribution within each class. Therefore, the likelihood is expressed as:

\begin{align*}
P(x_j|C_i) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(x_j - \mu_{ij})^2}{2\sigma_{ij}^2}\right)
\end{align*}

where:

$\mu_{ij}$ = Mean of feature $x_j$ in class $C_i$.

$\sigma_{ij}^2$ = Variance of feature $x_j$ in class $C_i$.


Replacing $P(X|C_{i})$ in Bayes' theorem, the equation becomes:

\begin{align*}
P(C_{i}|X) = \dfrac{P(C_{i}) \cdot \prod_{j=1}^{N} \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(x_j - \mu_{ij})^2}{2\sigma_{ij}^2}\right)}{P(X)}
\end{align*}

Since $P(X)$ is constant for all classes,

\begin{align*}
P(C_{i}|X) \propto P(C_{i}) \cdot \prod_{j=1}^{N} \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(x_j - \mu_{ij})^2}{2\sigma_{ij}^2}\right)
\end{align*}

The symbol $\propto$ denotes proportionality, meaning we ignore the denominator $P(X)$ when comparing probabilities across classes.
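
To make the proportionality concrete, here is a minimal sketch that scores one sample against two classes. The class names, `priors`, `means`, `variances` and `sample` are made-up illustrative values, not estimates from the Iris data. It checks that dividing by the evidence $P(X)$ rescales every score by the same factor and therefore leaves the winning class unchanged.

```python
import math


def gaussian_pdf(x, mean, var):
    """Gaussian density of x for the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


# Hypothetical parameters for two classes and two features (assumed values)
priors = {"A": 0.5, "B": 0.5}
means = {"A": [1.0, 2.0], "B": [3.0, 4.0]}
variances = {"A": [1.0, 1.0], "B": [1.0, 1.0]}
sample = [1.2, 2.1]

# Unnormalised posterior: prior times the product of per-feature densities
scores = {}
for cls in priors:
    likelihood = 1.0
    for x, m, v in zip(sample, means[cls], variances[cls]):
        likelihood *= gaussian_pdf(x, m, v)
    scores[cls] = priors[cls] * likelihood

# The evidence P(X) is the sum of the unnormalised scores over all classes
evidence = sum(scores.values())
posteriors = {cls: s / evidence for cls, s in scores.items()}

best_unnormalised = max(scores, key=scores.get)
best_normalised = max(posteriors, key=posteriors.get)
print(best_unnormalised, best_normalised)  # A A
```

Only the normalised values sum to 1; the ranking of the classes is identical in both dictionaries.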

## 2. Loading Data

In [2]:
iris = load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

df.head()

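|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
| --- | ----------------- | ---------------- | ----------------- | ---------------- | ------- |
| 0   | 5.1               | 3.5              | 1.4               | 0.2              | 0       |
| 1   | 4.9               | 3.0              | 1.4               | 0.2              | 0       |
| 2   | 4.7               | 3.2              | 1.3               | 0.2              | 0       |
| 3   | 4.6               | 3.1              | 1.5               | 0.2              | 0       |
| 4   | 5.0               | 3.6              | 1.4               | 0.2              | 0       |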


In [3]:
X, y = df.iloc[:, :-1], df.iloc[:, -1]
X.iloc[:, 3].describe()

count    150.000000
mean       1.199333
std        0.762238
min        0.100000
25%        0.300000
50%        1.300000
75%        1.800000
max        2.500000
Name: petal width (cm), dtype: float64

## 3. Prior Probability
The class label $C_{i}$ (`y`) takes one of three discrete integer values: `0`, `1` and `2`. The prior of each class is its relative frequency in the data:

\begin{align*}
P(C_{i}=0) = \dfrac{\text{Count(0)}}{\text{Total Count}}
\end{align*}

\begin{align*}
P(C_{i}=1) = \dfrac{\text{Count(1)}}{\text{Total Count}}
\end{align*}

\begin{align*}
P(C_{i}=2) = \dfrac{\text{Count(2)}}{\text{Total Count}}
\end{align*}

In [4]:
print(f'Total count: {len(df)}')
print(f'Counts: {y.value_counts().to_dict()}')

Total count: 150
Counts: {0: 50, 1: 50, 2: 50}


\begin{align*}
P(0) = P(1) = P(2) = \dfrac{50}{150} = \dfrac{1}{3}
\end{align*}

In [5]:
def calculate_priors(y: pd.Series) -> Dict[int, float]:
    """
    Calculate prior probabilities for each class in the target variable.

    Args:
        y: Target variable containing class labels (integers 0, 1 and 2 for the Iris data).

    Returns:
        Dictionary mapping each class label to its prior probability.
    """
    return y.value_counts(normalize=True).to_dict()

In [6]:
calculate_priors(y)

{0: 0.3333333333333333, 1: 0.3333333333333333, 2: 0.3333333333333333}
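
The Iris classes are perfectly balanced, so the priors are all equal. As a contrast, the sketch below computes priors for a small, made-up imbalanced label list using only the standard library. The helper name `priors_from_labels` and the list `toy_labels` are hypothetical and exist only for this illustration.

```python
from collections import Counter


def priors_from_labels(labels):
    """Return {label: relative frequency} for a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical imbalanced labels: six samples of class 0, two of class 1
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]
toy_priors = priors_from_labels(toy_labels)
print(toy_priors)  # {0: 0.75, 1: 0.25}
```

With unequal priors, the $\log P(C_i)$ term shifts the decision towards the more frequent class.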

## 4. Likelihood for Gaussian NB
For continuous features, $P(x_{j}|C_{i})$ uses the **Gaussian probability density function**:

\begin{align*}
P(x_j|C_i) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(x_j - \mu_{ij})^2}{2\sigma_{ij}^2}\right)
\end{align*}

where:

$\mu_{ij}$ = Mean of feature $x_j$ in class $C_i$.

$\sigma_{ij}^2$ = Variance of feature $x_j$ in class $C_i$.
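
As a quick numerical check of the formula, the sketch below evaluates the density at the mean of two Gaussians. The variance `0.01` is an assumed value chosen to show that a density, unlike a probability, can exceed 1 when the within-class variance is small.

```python
import math


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Gaussian probability density of x for the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Standard normal evaluated at its mean: 1 / sqrt(2 * pi), roughly 0.3989
peak_standard = gaussian_pdf(0.0, mean=0.0, var=1.0)

# A narrow Gaussian (assumed variance 0.01) peaks at roughly 3.9894
peak_narrow = gaussian_pdf(0.25, mean=0.25, var=0.01)

print(f"{peak_standard:.4f} {peak_narrow:.4f}")  # 0.3989 3.9894
```

Likelihood values above 1 are therefore expected for tightly clustered features and do not indicate an error.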


In [7]:
def calculate_likelihoods(X: pd.DataFrame, y: pd.Series, epsilon: float = 1e-9) -> pd.DataFrame:
    """
    Calculate Gaussian likelihood probabilities for each feature in a dataset per class.


    Args:
        X: Feature matrix.
        y: Class labels corresponding to each sample in X.
        epsilon: Small value added to variances to prevent division by zero (default = 1e-9).

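    Returns:
        DataFrame with the same shape as X holding the Gaussian density of each
        feature value under the sample's own class (densities may exceed 1).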

    Notes:
        - The calculation uses the Gaussian probability density function.
        - Variances are calculated with ddof=0 (population variance).
        - The function groups data by class to compute class-specific parameters.
    """
    means_per_class = X.groupby(y).mean()
    variances_per_class = X.groupby(y).var(ddof=0) + epsilon

    likelihoods = X.copy()

    for cls in y.unique():
        cls_idx = y[y == cls].index
        cls_X = X.loc[cls_idx]
        cls_mean = means_per_class.loc[cls]
        cls_var = variances_per_class.loc[cls]

        exponent = -0.5 * ((cls_X - cls_mean) ** 2 / cls_var)
        constant = 1 / np.sqrt(2 * np.pi * cls_var)
        likelihoods.loc[cls_idx] = (constant * np.exp(exponent)).round(4)
    return likelihoods

In [8]:
calculate_likelihoods(X, y)

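|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| --- | ----------------- | ---------------- | ----------------- | ---------------- |
| 0   | 1.1025            | 1.0437           | 2.1744            | 3.4698           |
| 1   | 1.0917            | 0.5548           | 2.1744            | 3.4698           |
| 2   | 0.7783            | 0.8839           | 1.4886            | 3.4698           |
| 3   | 0.5810            | 0.7256           | 2.2645            | 3.4698           |
| 4   | 1.1431            | 0.9571           | 2.1744            | 3.4698           |
| ... | ...               | ...              | ...               | ...              |
| 145 | 0.6238            | 1.2455           | 0.5933            | 0.8831           |
| 146 | 0.5708            | 0.4151           | 0.4383            | 1.3179           |
| 147 | 0.6276            | 1.2455           | 0.5933            | 1.4606           |
| 148 | 0.5241            | 0.5130           | 0.7025            | 0.8831           |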


## 5. Posterior Probability for Gaussian NB
As discussed [above](#1-introduction), the posterior probability is proportional to the prior multiplied by the per-feature likelihoods:

\begin{align*}
P(C_{i}|X) \propto P(C_{i}) \cdot \prod_{j=1}^{N} \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(x_j - \mu_{ij})^2}{2\sigma_{ij}^2}\right)
\end{align*}

To avoid underflow and simplify calculations, we take the logarithm, which turns the product into a sum. Up to an additive constant ($-\log P(X)$, the same for every class), the log-posterior is:

\begin{align*}
\log{P(C_{i}|X)} = \log{P(C_{i})} + \sum_{j=1}^{N} \left[ -\frac{1}{2} \log{(2\pi\sigma_{ij}^2)} - \dfrac{(x_{j}-\mu_{ij})^2}{2\sigma_{ij}^2} \right] + \text{const}
\end{align*}

- $\mu_{ij}$ and $\sigma_{ij}^2$ are estimated from training data per class.
- Priors $P(C_{i})$ are typically empirical class frequencies.
- Evidence $P(X)$ is ignored during classification (it does not affect `.max()` or `.argmax()`).
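
The underflow problem is easy to reproduce. The sketch below assumes a hypothetical sample with 400 features, each contributing a density of `1e-3`. The direct product collapses to exactly `0.0` in double precision, while the sum of logs stays well within range.

```python
import math

# Hypothetical setting: 400 features, each with a density of 1e-3
densities = [1e-3] * 400

# Multiplying the densities directly underflows to 0.0
product = 1.0
for d in densities:
    product *= d

# Summing the logs carries the same information without underflow
log_sum = sum(math.log(d) for d in densities)

print(product)            # 0.0
print(round(log_sum, 2))  # -2763.1
```

Once every class scores `0.0`, the classes can no longer be ranked, which is why the comparison is done in log space.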

In [9]:
def calculate_log_posteriors(sample: pd.Series, priors: Dict[int, float],
                             y: pd.Series, epsilon: float = 1e-9) -> Dict[int, float]:
    """
    Calculate log-posterior probabilities for a single sample across all classes using Gaussian Naive Bayes.

    Args:
        sample: Feature vector of a single sample.
        priors: Dictionary of prior probabilities per class.
        y: Series of class labels corresponding to training data.
        epsilon: Small value added to variances for numerical stability (default=1e-9).

    Returns:
        Dictionary mapping each class label to its unnormalised log-posterior for the sample.

    Notes:
        - Class-wise means and variances are computed from the notebook-level feature matrix `X`
          (a global, not an argument) grouped by `y`, so `X` and `y` must be row-aligned.
        - Uses the Gaussian probability density function in log space for likelihood calculation.
        - Assumes features are conditionally independent given the class.
    """
    means_per_class = X.groupby(y).mean()
    variances_per_class = X.groupby(y).var(ddof=0) + epsilon

    log_posteriors = {}

    for cls in priors.keys():
        # Start with log prior
        log_posterior = np.log(priors[cls])

        for feature in sample.index:
            mean = means_per_class.loc[cls, feature]
            var = variances_per_class.loc[cls, feature]
            x = sample[feature]

            # Gaussian log PDF calculation
            log_pdf = -0.5 * (np.log(2 * np.pi * var) +
                              ((x - mean) ** 2) / var)
            log_posterior += log_pdf

        log_posteriors[cls] = float(log_posterior)

    return log_posteriors

In [10]:
calculate_log_posteriors(X.iloc[0], calculate_priors(
    y), y)

{0: 1.0626580653811608, 1: -40.07797768635478, 2: -56.84265441519815}
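
These values are unnormalised log-posteriors, which is why the score for class `0` can be positive. If actual probabilities are wanted, one common option is to normalise them with the log-sum-exp trick. The sketch below applies it to the three values printed above; the helper name `normalise_log_posteriors` is hypothetical.

```python
import math


def normalise_log_posteriors(log_posteriors):
    """Convert unnormalised log-posteriors into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating to avoid overflow/underflow
    max_log = max(log_posteriors.values())
    exp_shifted = {cls: math.exp(lp - max_log) for cls, lp in log_posteriors.items()}
    total = sum(exp_shifted.values())
    return {cls: value / total for cls, value in exp_shifted.items()}


# Log-posteriors of the first Iris sample, copied from the output above
log_posteriors = {0: 1.0626580653811608, 1: -40.07797768635478, 2: -56.84265441519815}
probabilities = normalise_log_posteriors(log_posteriors)
predicted_class = max(probabilities, key=probabilities.get)
```

Class `0` receives essentially all of the probability mass here, since the other two scores are more than 40 log-units lower.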

## 6. Prediction

In [11]:
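def predict(X: pd.DataFrame, y: pd.Series) -> List[int]:
    """
    Predict class labels using Gaussian Naive Bayes.

    Args:
        X: Feature matrix; each row is a sample to classify.
        y: Class labels aligned with the rows of X, used to estimate the priors and class statistics.

    Returns:
        Predicted class labels for each row in X.
    """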
    predictions = []
    priors = calculate_priors(y)
    for i in range(len(X)):
        log_posteriors = calculate_log_posteriors(X.iloc[i], priors, y)
        predictions.append(max(log_posteriors, key=log_posteriors.get))
    return predictions

In [12]:
predict(X, y)[:10]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
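
The loop above scores one sample at a time. As an alternative, the log-posteriors of all samples and classes can be computed at once with NumPy broadcasting. This is a minimal sketch, not the notebook's implementation: the function name `predict_vectorised` and the toy parameters are assumptions, and the class parameters are passed in as plain arrays.

```python
import numpy as np


def predict_vectorised(X, means, variances, priors):
    """
    Return the index of the most probable class for every row of X.

    X: (n_samples, n_features); means and variances: (n_classes, n_features);
    priors: (n_classes,).
    """
    # Broadcast to shape (n_samples, n_classes, n_features)
    diff = X[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    # Sum over features and add the log prior of each class
    log_posteriors = np.log(priors)[None, :] + log_pdf.sum(axis=2)
    return log_posteriors.argmax(axis=1)


# Hypothetical toy parameters: two classes, two features
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])
priors = np.array([0.5, 0.5])
X_toy = np.array([[0.1, -0.2], [4.8, 5.3], [1.0, 0.5]])

toy_predictions = predict_vectorised(X_toy, means, variances, priors)
print(toy_predictions.tolist())  # [0, 1, 0]
```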

## 7. Evaluation Metrics
### Binary Confusion Matrix
In a confusion matrix, the terms True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) describe the classification performance for binary classification. 

|                     | Predicted Negative  | Predicted Positive  |
| ------------------- | ------------------- | ------------------- |
| **Actual Negative** | True Negative (TN)  | False Positive (FP) |
| **Actual Positive** | False Negative (FN) | True Positive (TP)  |


1. True Positive (TP): The number of instances correctly predicted as positive (e.g., a disease correctly identified).

2. True Negative (TN): The number of instances correctly predicted as negative (e.g., no disease correctly identified).

3. False Positive (FP): The number of instances incorrectly predicted as positive (e.g., predicting disease when there isn't any).

4. False Negative (FN): The number of instances incorrectly predicted as negative (e.g., missing a disease when it exists).

### Multi-Class Confusion Matrix
For multi-class classification, the concepts can be extended by treating one class as the "positive" class and all others as "negative" classes in a one-vs-all approach. Rows represent the actual classes (true labels), and columns represent the predicted classes. For a class $C$,
1. True Positive (TP): The count in the diagonal cell corresponding to class $C$ ($\text{matrix} [C][C]$).
2. False Positive (FP): The sum of the column for class $C$, excluding the diagonal ($\sum(\text{matrix} [:, C]) - \text{matrix} [C][C]$).
3. False Negative (FN): The sum of the row for class $C$, excluding the diagonal ($\sum(\text{matrix} [C, :]) - \text{matrix} [C][C]$).
4. True Negative (TN): All other cells not in the row or column for class $C$ ($\text{total} - (FP + FN + TP)$).

|                  | Predicted Class 0 | Predicted Class 1 | Predicted Class 2 |
| ---------------- | ----------------- | ----------------- | ----------------- |
| **True Class 0** | 5                 | 2                 | 0                 |
| **True Class 1** | 1                 | 6                 | 1                 |
| **True Class 2** | 0                 | 2                 | 7                 |


For Class 0:
- TP = 5 (diagonal element for Class 0)
- FP = 1 (sum of column 0 minus TP: 1 + 0)
- FN = 2 (sum of row 0 minus TP: 2 + 0)
- TN = 6 + 1 + 2 + 7 = 16 (all other cells not in row 0 or column 0)

For Class 1:
- TP = 6 (diagonal element for Class 1)
- FP = 4 (sum of column 1 minus TP: 2 + 2)
- FN = 2 (sum of row 1 minus TP: 1 + 1)
- TN = 5 + 0 + 0 + 7 = 12 (all other cells not in row 1 or column 1)
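
The same counts can be derived for every class at once. The sketch below applies the four rules to the example matrix with NumPy, which also gives the values for Class 2.

```python
import numpy as np

# Example matrix from the table above (rows: true class, columns: predicted class)
matrix = np.array([[5, 2, 0],
                   [1, 6, 1],
                   [0, 2, 7]])

tp = np.diag(matrix)                # diagonal cells
fp = matrix.sum(axis=0) - tp        # column sums minus the diagonal
fn = matrix.sum(axis=1) - tp        # row sums minus the diagonal
tn = matrix.sum() - (tp + fp + fn)  # everything outside the class row and column

for cls in range(matrix.shape[0]):
    print(f"Class {cls}: TP={tp[cls]}, FP={fp[cls]}, FN={fn[cls]}, TN={tn[cls]}")
# Class 0: TP=5, FP=1, FN=2, TN=16
# Class 1: TP=6, FP=4, FN=2, TN=12
# Class 2: TP=7, FP=1, FN=2, TN=14
```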

In [13]:
def confusion_matrix(y_true: pd.Series, y_pred: List[str],
                     class_names: List[str] = None) -> Tuple[NDArray[np.int64], List[str]]:
    """
    Calculate the confusion matrix.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.
        class_names: List of class names. Defaults to None.

    Returns:
        Tuple: 
        - Confusion matrix.
        - List of class names.
    """
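    # Collect every label that occurs in either the true or the predicted labels,
    # so a prediction for a class absent from y_true does not raise a KeyError
    unique_classes = np.unique(np.concatenate([np.asarray(y_true), np.asarray(y_pred)]))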
    if class_names is None:
        class_names = [str(cls) for cls in unique_classes]
    class_to_index = {cls: i for i, cls in enumerate(unique_classes)}

    n_classes = len(unique_classes)
    matrix = np.zeros((n_classes, n_classes), dtype=int)

    for true, pred in zip(y_true, y_pred):
        true_idx = class_to_index[true]
        pred_idx = class_to_index[pred]
        matrix[true_idx][pred_idx] += 1

    return matrix, class_names

### Accuracy
Accuracy is the most common evaluation metric for classification problems: the proportion of correct predictions out of all predictions. It is a simple measure of how often the classifier is right across all classes. The formula below is written for the binary case; with more than two classes, the numerator becomes the sum of the confusion-matrix diagonal.

\begin{align*}
\text{Accuracy} = \dfrac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Samples}}
\end{align*}

In [14]:
def accuracy(y_true: pd.Series, y_pred: List[str]) -> float:
    """
    Calculate the accuracy of predictions by comparing true and predicted labels.

    Args:
        y_true: Ground truth target values. Contains the actual class labels for each sample.
        y_pred: Estimated target as returned by a classifier. Contains the predicted class labels for each sample.
    Returns:
        Classification accuracy (0.0 to 1.0).
    """
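    # Compare by position (as arrays) so a differing pandas index cannot misalign the labels
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))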
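
For a multi-class problem it helps to separate overall accuracy from the one-vs-rest accuracy of a single class. The sketch below reuses the 3×3 example matrix from the Multi-Class Confusion Matrix section and the Class 0 counts listed there.

```python
# Example multi-class matrix (rows: true class, columns: predicted class)
matrix = [[5, 2, 0],
          [1, 6, 1],
          [0, 2, 7]]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))

# Overall accuracy: correct predictions (diagonal) over all samples
overall_accuracy = correct / total  # 18 / 24 = 0.75

# One-vs-rest accuracy for Class 0: (TP + TN) / total
tp0, tn0 = 5, 16
class0_accuracy = (tp0 + tn0) / total  # 21 / 24 = 0.875
```

The one-vs-rest figure is higher because every sample that involves only the other two classes counts as a true negative, even when it is misclassified between them.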

### Precision
Precision measures the proportion of true positive predictions out of all positive predictions made by the classifier.

\begin{align*}
\text{Precision} = \dfrac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
\end{align*}

In [15]:
def precision(y_true: pd.Series, y_pred: List[str]) -> NDArray[np.float64]:
    """
    Calculate precision for each class.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.

    Returns:
        Precision values for each class.
    """
    cm, _ = confusion_matrix(y_true, y_pred)
    return np.diag(cm) / (np.sum(cm, axis=0) + 1e-7)

### Recall
Recall measures the proportion of actual positive cases that the classifier correctly predicts as positive.

\begin{align*}
\text{Recall} = \dfrac{\text{True Positives (TP)} }{\text{True Positives (TP)} + \text{False Negatives (FN)}}
\end{align*}

In [16]:
def recall(y_true: pd.Series, y_pred: List[str]) -> NDArray[np.float64]:
    """
    Calculate recall for each class.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.

    Returns:
        Recall values for each class.
    """
    cm, _ = confusion_matrix(y_true, y_pred)
    return np.diag(cm) / (np.sum(cm, axis=1) + 1e-7)

### F1-Score
The F1-Score is the harmonic mean of precision and recall.

\begin{align*}
\text{F1-Score} = 2 \times \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}

In [17]:
def f1_score(y_true: pd.Series, y_pred: List[str]) -> NDArray[np.float64]:
    """
    Calculate F1-score for each class.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.

    Returns:
        F1-scores for each class.
    """
    prec = precision(y_true, y_pred)
    rec = recall(y_true, y_pred)
    return 2 * (prec * rec) / (prec + rec + 1e-7)
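
As a worked check of the three formulas, the sketch below computes per-class precision, recall and F1-score for the 3×3 example matrix from the Multi-Class Confusion Matrix section, followed by the unweighted (macro) average of the F1-scores. No smoothing term is needed here because no row or column of that matrix sums to zero.

```python
import numpy as np

# Example matrix (rows: true class, columns: predicted class)
matrix = np.array([[5, 2, 0],
                   [1, 6, 1],
                   [0, 2, 7]])

tp = np.diag(matrix)
precision_per_class = tp / matrix.sum(axis=0)  # TP / (TP + FP): 5/6, 6/10, 7/8
recall_per_class = tp / matrix.sum(axis=1)     # TP / (TP + FN): 5/7, 6/8, 7/9
f1_per_class = (2 * precision_per_class * recall_per_class
                / (precision_per_class + recall_per_class))  # 10/13, 12/18, 14/17

# Macro average: every class counts equally, regardless of its size
macro_f1 = f1_per_class.mean()
print(f"Macro F1: {macro_f1:.4f}")  # Macro F1: 0.7531
```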

In [18]:
def evaluate(y_true: pd.Series, y_pred: List[str],
             class_names: List[str] = None) -> Tuple[float, float, float, float, NDArray[np.int64]]:
    """
    Calculate evaluation metrics including accuracy, precision, recall, and F1-score for each class.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.
        class_names: List of class names. Defaults to None.

    Returns:
        Tuple:
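        - Overall accuracy.
        - Macro-averaged precision (unweighted mean over classes).
        - Macro-averaged recall (unweighted mean over classes).
        - Macro-averaged F1-score (unweighted mean over classes).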
        - Confusion matrix.
    """
    cm, class_names = confusion_matrix(y_true, y_pred, class_names)
    acc = accuracy(y_true, y_pred)
    prec = precision(y_true, y_pred)
    rec = recall(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    # print("Class\tPrecision\tRecall\tF1-Score")
    # for i, class_name in enumerate(class_names):
    #     print(f"{class_name}\t{prec[i]:.4f}\t\t{rec[i]:.4f}\t{f1[i]:.4f}")
    return acc, np.mean(prec), np.mean(rec), np.mean(f1), cm

## 8. Train Test Split
Train test split is a fundamental model validation technique in machine learning. It divides a dataset into two separate portions: a **training set** used to train a model, and a **testing set** used to evaluate how well the model performs on unseen data.

The typical split ratio is 80% for training and 20% for testing, though this can vary (70/30 or 90/10 are also common). The key principle is that the test set must remain completely separate during the model training process, and should never be used to make decisions about the model or to tune parameters.

The split is usually done randomly so that both sets are representative of the overall dataset, and many libraries (such as scikit-learn) provide built-in functions that handle this process automatically while maintaining proper randomisation.


In [19]:
def train_test_split(X: pd.DataFrame, y: pd.Series, test_size: float = 0.2,
                     random_state: int = None) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """
    Split arrays or matrices into random train and test subsets.

    Args:
        X: Input features, a 2D array with rows (samples) and columns (features).
        y: Target values/labels, a 1D array with rows (samples).
        test_size: Proportion of the dataset to include in the test split. Must be between 0.0 and 1.0. default = 0.2
        random_state: Seed for the random number generator to ensure reproducible results. default = None

    Returns:
        A tuple containing:
            - X_train: Training set features.
            - X_test: Testing set features.
            - y_train: Training set target values.
            - y_test: Testing set target values.
    """
    # Seed NumPy's global random generator when a seed is given (0 is a valid seed)
    if random_state is not None:
        np.random.seed(random_state)

    # Create an array of row positions from 0 to len(X) - 1
    indices = np.arange(len(X))

    # Shuffle the indices
    np.random.shuffle(indices)

    # Define the size of our test data from len(X)
    test_size = int(test_size * len(X))

    # Generate indices for test and train data
    test_indices: NDArray[np.int64] = indices[:test_size]
    train_indices: NDArray[np.int64] = indices[test_size:]

    # Return: X_train, X_test, y_train, y_test
    return X.iloc[train_indices], X.iloc[test_indices], y.iloc[train_indices], y.iloc[test_indices]
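
The properties a split should have (correct sizes, no overlap, reproducibility for a fixed seed) can be checked on bare indices. The sketch below is a simplified standard-library variant, not the function above: the helper name `split_indices` is hypothetical, and it uses a local `random.Random` generator instead of NumPy's global one, so it will not reproduce the same shuffle.

```python
import random


def split_indices(n_samples, test_size=0.2, seed=None):
    """Shuffle 0..n_samples-1 and cut off the first test_size share as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local generator; global state is untouched
    n_test = int(test_size * n_samples)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(150, test_size=0.2, seed=42)
train_again, test_again = split_indices(150, test_size=0.2, seed=42)

print(len(train_idx), len(test_idx))  # 120 30
```

A disjoint train and test set is what keeps the test score an honest estimate of performance on unseen data.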

## 9. Encapsulation

In [20]:
class CustomGaussianNB:
    """
    Gaussian Naive Bayes classifier implementation.

    Attributes:
        epsilon (float): Smoothing parameter for variance
        priors_ (Dict[str, float]): Prior probabilities per class
        means_ (pd.DataFrame): Feature means per class
        variances_ (pd.DataFrame): Feature variances per class
        classes_ (List[str]): Unique class labels
        feature_names_ (pd.Index): Feature names from training data
    """

    def __init__(self, epsilon: float = 1e-9) -> None:
        """
        Initialise Gaussian Naive Bayes classifier.

        Args:
            epsilon: Smoothing parameter for variance (default = 1e-9)
        """
        self.epsilon = epsilon
        self.priors_ = None
        self.means_ = None
        self.variances_ = None
        self.classes_ = None
        self.feature_names_ = None

    def fit(self, X: pd.DataFrame, y: pd.Series) -> None:
        """
        Train Gaussian Naive Bayes model.

        Args:
            X: Feature matrix.
            y: Target class labels.
        """
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)

        self.feature_names_ = X.columns
        self.classes_ = y.unique().tolist()
        self.priors_ = self._calculate_priors(y)
        self.means_, self.variances_ = self._calculate_params(X, y)

    def predict(self, X: pd.DataFrame) -> List[str]:
        """
        Predict class labels for input samples.

        Args:
            X: Feature matrix to predict.

        Returns:
            Predicted class labels.
        """
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X, columns=self.feature_names_)
        else:
            X = X.reindex(columns=self.feature_names_, fill_value=0)

        predictions = []
        for i in range(len(X)):
            sample = X.iloc[i]
            log_posteriors = self._calculate_log_posteriors(sample)
            predictions.append(max(log_posteriors, key=log_posteriors.get))
        return predictions

    def _calculate_priors(self, y: pd.Series) -> Dict[str, float]:
        """
        Calculate prior probabilities for each class.

        Args:
            y: Target class labels.

        Returns:
            Prior probabilities for each class.
        """
        return y.value_counts(normalize=True).to_dict()

    def _calculate_params(self, X: pd.DataFrame, y: pd.Series) -> Tuple[pd.DataFrame, pd.DataFrame]:
        """
        Compute Gaussian parameters (mean and variance) per class.

        Args:
            X: Feature matrix.
            y: Target class labels.

        Returns:
            Tuple of (means, variances) DataFrames.
        """
        means = X.groupby(y).mean()
        variances = X.groupby(y).var(ddof=0) + self.epsilon
        return means, variances

    def _calculate_log_posteriors(self, sample: pd.Series) -> Dict[str, float]:
        """
        Calculate log-posterior probabilities for a single sample.

        Args:
            sample: Feature vector of a single sample

        Returns:
            Dictionary of log-posterior probabilities per class
        """
        log_posteriors = {}

        for cls in self.classes_:
            # Start with log prior
            log_posterior = np.log(self.priors_[cls])

            # Vectorised log-likelihood calculation
            mean_vec = self.means_.loc[cls].values
            var_vec = self.variances_.loc[cls].values
            x_vec = sample.values

            # Gaussian log PDF: -1/2*[log(2πσ²) + (x-μ)²/σ²]
            log_pdf = -1/2 * (np.log(2 * np.pi * var_vec) +
                              ((x_vec - mean_vec) ** 2) / var_vec)
            log_posterior += np.sum(log_pdf)

            log_posteriors[cls] = log_posterior

        return log_posteriors

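Let's check the performance on the entire dataset. Note that this run sets `epsilon=1.0`, which adds 1.0 to every per-class variance (far more than the default `1e-9`), and that scoring a model on its own training data says little about how it generalises: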

In [21]:
model = CustomGaussianNB(epsilon=1.0)
model.fit(X, y)
y_pred = model.predict(X)
acc, prec, rec, f1, cm = evaluate(y, y_pred)
print(f'Accuracy (Entire Dataset): {acc:.4f}')
print(f'Precision (Entire Dataset): {prec:.4f}')
print(f'Recall (Entire Dataset): {rec:.4f}')
print(f'F1-Score (Entire Dataset): {f1:.4f}')
print(f'Confusion Matrix (Entire Dataset):\n{cm}')

Accuracy (Entire Dataset): 0.9333
Precision (Entire Dataset): 0.9372
Recall (Entire Dataset): 0.9333
F1-Score (Entire Dataset): 0.9331
Confusion Matrix (Entire Dataset):
[[50  0  0]
 [ 0 48  2]
 [ 0  8 42]]
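
The choice of `epsilon` matters because it is added directly to each variance. The sketch below compares the Gaussian log-density of one value under a small and a large `epsilon`. The mean, variance and sample value are assumed numbers chosen for illustration, not Iris estimates.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log of the Gaussian density, matching the formula used in the classifier."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


# Hypothetical feature: within-class variance 0.03, sample 0.5 away from the class mean
mean, var, x = 1.5, 0.03, 2.0

log_pdf_default = gaussian_log_pdf(x, mean, var + 1e-9)  # essentially the raw estimate
log_pdf_smoothed = gaussian_log_pdf(x, mean, var + 1.0)  # variance dominated by epsilon

print(f"{log_pdf_default:.4f} {log_pdf_smoothed:.4f}")  # -3.3323 -1.0551
```

With the large `epsilon`, a value far from the class mean is penalised much less, so the classes become harder to tell apart on low-variance features.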


Train the model on 80% of the dataset, then evaluate its performance on the remaining 20% (the test set).

In [22]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Initialise and train
model = CustomGaussianNB(epsilon=1e-9)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
acc, prec, rec, f1, cm = evaluate(y_test, y_pred)
print(f'Accuracy (Test): {acc:.4f}')
print(f'Precision (Test): {prec:.4f}')
print(f'Recall (Test): {rec:.4f}')
print(f'F1-Score (Test): {f1:.4f}')
print(f'Confusion Matrix (Test):\n{cm}')

Accuracy (Test): 1.0000
Precision (Test): 1.0000
Recall (Test): 1.0000
F1-Score (Test): 1.0000
Confusion Matrix (Test):
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]


## 10. Comparison with Scikit-Learn

In [23]:
from sklearn.model_selection import train_test_split as sk_train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# The alias and the sk_accuracy name avoid overwriting the custom
# train_test_split() and accuracy() functions defined earlier in the notebook
X_train, X_test, y_train, y_test = sk_train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
sk_accuracy = model.score(X_test, y_test)
print(f'Predictions: {y_pred}')
print(f'Accuracy: {sk_accuracy:.4f}')
print(f'Classification report:\n{classification_report(y_test, y_pred)}')

Predictions: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
Accuracy: 1.0000
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
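
One difference worth knowing when comparing the two models is how the variance smoothing is scaled. The custom class adds a fixed `epsilon` to every variance. scikit-learn's `GaussianNB` instead takes a `var_smoothing` parameter (default `1e-9`), which its documentation describes as the portion of the largest feature variance that is added to all variances. The sketch below shows how such a relative term could be computed; the helper name `relative_epsilon` and the toy matrix are assumptions made for illustration.

```python
import numpy as np


def relative_epsilon(X, var_smoothing=1e-9):
    """Smoothing term scaled by the largest per-feature variance in X."""
    return var_smoothing * np.var(X, axis=0).max()


# Hypothetical feature matrix: the second column has the larger spread
X_toy = np.array([[1.0, 10.0],
                  [2.0, 30.0],
                  [3.0, 50.0]])

eps = relative_epsilon(X_toy)  # 1e-9 * (800 / 3), about 2.67e-07
```

A relative term adapts to the scale of the features, whereas a fixed `epsilon` has a larger effect on features measured in small units.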

