# Multinomial Naive Bayes Classifier from Scratch
***
## Table of Contents
1. [Introduction](#1-introduction)
    - [Bayes' Theorem](#bayes-theorem)
2. [Loading Data](#2-loading-data)
3. [Text Preprocessing](#3-text-preprocessing)
4. [Prior Probability](#4-prior-probability)
5. [Likelihood for Multinomial NB](#5-likelihood-for-multinomial-nb)
6. [Posterior Probability for Multinomial NB](#6-posterior-probability-for-multinomial-nb)
7. [Prediction](#7-prediction)
8. [Evaluation Metrics](#8-evaluation-metrics)
    - [Binary Confusion Matrix](#binary-confusion-matrix)
    - [Multi-Class Confusion Matrix](#multi-class-confusion-matrix)
    - [Accuracy](#accuracy)
    - [Precision](#precision)
    - [Recall](#recall)
    - [F1-Score](#f1-score)
9. [Train Test Split](#9-train-test-split)
10. [Encapsulation](#10-encapsulation)
11. [Comparison with Scikit-Learn](#11-comparison-with-scikit-learn)
***

In [1]:
import numpy as np
import pandas as pd
from typing import Tuple, List, Dict
from numpy.typing import NDArray

## 1. Introduction
Naive Bayes classifiers are probabilistic classification models based on Bayes' Theorem, assuming conditional independence between features given the class labels or values. Naive Bayes is a general framework; the specific variant should be chosen based on the nature of your data:

- **Categorical Naive Bayes**

    - **Features**: Categorical labels (e.g., colours, countries, product types).

    - **Use Case**: Classification with discrete, categorically distributed features.

- **Multinomial Naive Bayes**

    - **Features**: Counts or frequencies (e.g., word occurrences, event counts).

    - **Use Case**: Text classification, document classification, or any scenario where features are discrete counts.

- **Gaussian Naive Bayes**

    - **Features**: Continuous data (e.g., measurements, sensor readings).

    - **Use Case**: Classification with numerical features assumed to follow a Gaussian distribution.

- **Bernoulli Naive Bayes**

    - **Features**: Binary features (e.g., True/False, 0/1).

    - **Use Case**: Text classification (presence/absence of words), binary feature spaces.



### Bayes' Theorem
Bayes' theorem describes the probability of a class $C_{i}$ given a document $d$:

\begin{align*}
P(C_{i}|d) = \dfrac{P(d|C_{i}) \cdot P(C_{i})}{P(d)}
\end{align*}

where:
- $P(C_{i}|d)$: Posterior probability of class $C_{i}$ given document $d$.
- $P(d|C_{i})$: Likelihood of document $d$ given class $C_{i}$.
- $P(C_{i})$: Prior probability of class $C_{i}$.
- $P(d)$: Probability of document $d$ (acts as a normalising constant).

Multinomial Naive Bayes assumes word occurrences in $d$ are conditionally independent given $C_{i}$. For a document represented by word counts $\{\text{c}(w_1,d), \text{c}(w_2,d), \dots, \text{c}(w_{V},d)\}$, the likelihood (up to a multinomial coefficient that does not depend on the class) is:

\begin{align*}
P(d|C_{i}) = \prod_{j=1}^{V} P(w_{j}|C_{i})^{\text{c}(w_{j},d)}
\end{align*}

where:
- $w_{j}$: $j$-th word in vocabulary.
- $\text{c}(w_{j},d)$: Frequency of $w_{j}$ in $d$.
- $V$: Vocabulary size.

Replacing $P(d|C_{i})$ in Bayes' theorem, the equation becomes:

\begin{align*}
P(C_{i}|d) = \dfrac{P(C_{i}) \cdot \prod_{j=1}^{V} P(w_{j}|C_{i})^{\text{c}(w_{j},d)}}{P(d)}
\end{align*}

Since $P(d)$ is constant across classes:

\begin{align*}
P(C_{i}|d) \propto P(C_{i}) \cdot \prod_{j=1}^{V} P(w_{j}|C_{i})^{\text{c}(w_{j},d)}
\end{align*}

The symbol $\propto$ denotes proportionality, meaning we ignore $P(d)$ when comparing probabilities across classes.
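
As a quick worked example of the proportionality above, the sketch below scores one toy document against two classes. The priors, word probabilities, and counts are made-up illustration values, not estimates from the spam dataset.

```python
import math

# Hypothetical toy model: two classes and a three-word vocabulary.
priors = {"ham": 0.8, "spam": 0.2}
word_probs = {
    "ham": {"free": 0.1, "meeting": 0.6, "win": 0.3},
    "spam": {"free": 0.5, "meeting": 0.1, "win": 0.4},
}
# Word counts c(w_j, d) for a single document d.
doc_counts = {"free": 2, "meeting": 0, "win": 1}

# Unnormalised posterior: P(C_i) * prod_j P(w_j | C_i) ** c(w_j, d)
scores = {
    c: priors[c] * math.prod(word_probs[c][w] ** n for w, n in doc_counts.items())
    for c in priors
}

# Dividing by the sum plays the role of P(d) and yields proper probabilities.
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}

# The predicted class is the same with or without the normalisation.
predicted = max(scores, key=scores.get)
print(scores, posteriors, predicted)
```

Because normalising divides every score by the same constant, the ranking of the classes does not change, which is why $P(d)$ can be ignored for prediction.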

## 2. Loading Data
Dataset retrieved from [Kaggle - Spam Email](https://www.kaggle.com/datasets/mfaisalqureshi/spam-email?select=spam.csv)

In [2]:
df = pd.read_csv('../_datasets/spam.csv')
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
df['Category'].value_counts()

Category
ham     4825
spam     747
Name: count, dtype: int64

## 3. Text Preprocessing
Before implementing our Multinomial Naive Bayes classifier, we usually need to perform text preprocessing to ensure effective spam email classification. For this project, we will:

- Convert all text to lowercase.

Depending on the project or dataset, additional preprocessing steps may help, such as:
- Removing stopwords (e.g., 'a', 'the').
- Lemmatisation.
- Removing non-alphabetic characters.
- Removing punctuation.
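
The optional steps listed above could be sketched as follows. This is only an illustration: `clean_text_extended` and the tiny `STOPWORDS` set are hypothetical names introduced here, and lemmatisation is left out because it typically needs an external NLP library.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "to", "in", "of"}


def clean_text_extended(text: str) -> str:
    """Lowercase, strip non-alphabetic characters, and drop stopwords."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space.
    # This removes digits and punctuation in one pass.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


cleaned = clean_text_extended("Free entry in 2 a wkly comp to win FA Cup!")
print(cleaned)
```

One side effect worth knowing: contractions such as "don't" are split into "don" and "t", so whether this step helps depends on the dataset.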

In [4]:
def clean_text(text: str) -> str:
    """
    Clean and preprocess email text for spam detection.

    Args:
        text: Raw email text.

    Returns:
        Cleaned and preprocessed text.
    """
    return text.lower()

In [5]:
df['clean_text'] = df['Message'].apply(clean_text)
df.head()

Unnamed: 0,Category,Message,clean_text
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ..."
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro..."


In [6]:
X, y = df['clean_text'], df['Category']

Now, we apply `CountVectorizer` to convert the text data into a document-term matrix (also known as a Bag of Words), where each unique word in the dataset becomes a feature (column) and each value is the count of that word in a given document (row).

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vectoriser = CountVectorizer()
X_doc_term = vectoriser.fit_transform(X)
X = pd.DataFrame(X_doc_term.toarray(),
                 columns=vectoriser.get_feature_names_out())
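
To make the document-term matrix concrete, here is a minimal from-scratch sketch of the same idea using only the standard library. The helper name `build_doc_term_matrix` is hypothetical, and it splits on whitespace only; `CountVectorizer` applies its own tokenisation rules, so the two will not produce identical vocabularies on real text.

```python
from collections import Counter
from typing import List, Tuple


def build_doc_term_matrix(docs: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    tokenised = [doc.split() for doc in docs]
    # Sorted vocabulary gives a stable column order.
    vocab = sorted({tok for tokens in tokenised for tok in tokens})
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix


vocab, matrix = build_doc_term_matrix(["win cash now", "call me now now"])
print(vocab)
print(matrix)
```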

## 4. Prior Probability
The class variable $C_{i}$ (`y`) takes only two discrete values, `ham` and `spam`:

\begin{align*}
P(C_{i}=\text{'ham'}) = \dfrac{\text{Count('ham')}}{\text{Total Count}}
\end{align*}

\begin{align*}
P(C_{i}=\text{'spam'}) = \dfrac{\text{Count('spam')}}{\text{Total Count}}
\end{align*}


In [8]:
print(f'Total count: {len(df)}')
print(f'Counts: {y.value_counts().to_dict()}')

Total count: 5572
Counts: {'ham': 4825, 'spam': 747}


\begin{align*}
P(\text{'ham'}) = \dfrac{4825}{5572} = 0.8659
\end{align*}

\begin{align*}
P(\text{'spam'}) = \dfrac{747}{5572} = 0.1341
\end{align*}

In [9]:
def calculate_priors(y: pd.Series) -> Dict[str, float]:
    """
    Calculate prior probabilities for each class in the target variable.

    Args:
        y: Target variable containing class labels (strings).

    Returns:
        Prior probabilities for each class.
    """
    return y.value_counts(normalize=True).to_dict()

In [10]:
calculate_priors(y)

{'ham': 0.8659368269921034, 'spam': 0.13406317300789664}

## 5. Likelihood for Multinomial NB

For word $w_{j}$ and class $C_{i}$, the likelihood is calculated as:

\begin{align*}
P(w_{j}|C_{i}) = \frac{
    \text{count}(w_{j} \text{ in } C_{i}) + \alpha
}{
    \text{total words in } C_{i} + V \cdot \alpha
}
\end{align*}

where:
- $\text{count}(w_{j} \text{ in } C_{i})$: Total occurrences of word $w_{j}$ in class $C_{i}$.
- $\text{total words in } C_{i}$: Sum of all word counts in class $C_{i}$.
- $V$ : Vocabulary size.
- $\alpha$: Laplace smoothing parameter.

For a document $d$ with word counts $\{c_1, c_2, \dots, c_{V}\}$, where $c_j = \text{c}(w_{j},d)$, the likelihood becomes:
\begin{align*}
P(d|C_{i}) \propto \prod_{j=1}^{V} \left[ P(w_{j}|C_{i}) \right]^{c_j}
\end{align*}


In [None]:
def calculate_likelihoods(X: pd.DataFrame, y: pd.Series, alpha: float = 1.0) -> pd.DataFrame:
    """
    Calculate word likelihood probabilities for Multinomial Naive Bayes using vectorised operations.

    Args:
        X: Document-term matrix with words as columns and documents as rows.
        y: Series of class labels corresponding to each document.
        alpha: Laplace smoothing parameter (default=1.0).

    Returns:
        Word probabilities where:
        - Rows represent classes
        - Columns represent words
        - Values are P(word|class)
    """
    # Group by class and sum word counts
    class_totals = X.groupby(y).sum()
    total_words_per_class = class_totals.sum(axis=1)
    vocab_size = len(X.columns)

    # Vectorised calculation
    numerator = class_totals + alpha
    # Reshape per-class totals into a column vector so they broadcast across words
    denominator = (total_words_per_class.to_numpy()[:, np.newaxis]
                   + vocab_size * alpha)
    likelihoods = numerator / denominator
    return likelihoods

In [12]:
calculate_likelihoods(X, y).iloc[:, :10]

Category,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02
ham,1.4e-05,1.4e-05,2.8e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,2.8e-05,1.4e-05
spam,0.000421,0.001148,3.8e-05,0.000115,7.7e-05,7.7e-05,7.7e-05,0.000115,3.8e-05,0.000344
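
The effect of the smoothing term $\alpha$ is easiest to see on a tiny example. The sketch below applies the formula above to made-up word counts for a single class; `smoothed_likelihoods` is a hypothetical helper, not part of the notebook's code.

```python
from typing import Dict


def smoothed_likelihoods(word_counts: Dict[str, int], alpha: float = 1.0) -> Dict[str, float]:
    """P(w | C) = (count(w in C) + alpha) / (total words in C + V * alpha)."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {
        word: (count + alpha) / (total + vocab_size * alpha)
        for word, count in word_counts.items()
    }


# Hypothetical counts for the 'spam' class; "meeting" never appears in spam.
spam_counts = {"free": 3, "win": 1, "meeting": 0}

smoothed = smoothed_likelihoods(spam_counts, alpha=1.0)
unsmoothed = smoothed_likelihoods(spam_counts, alpha=0.0)
print(smoothed)    # every word keeps a non-zero probability
print(unsmoothed)  # "meeting" gets probability 0.0
```

Without smoothing, a single word that was never seen in a class would force the whole product $\prod_j P(w_{j}|C_{i})^{c_j}$ to zero for that class, regardless of the other words in the document.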


## 6. Posterior Probability for Multinomial NB
As we discussed [above](#1-introduction), the formula of posterior probability is:

\begin{align*}
P(C_{i}|d) \propto P(C_{i}) \cdot \prod_{j=1}^{V} P(w_{j}|C_{i})^{\text{c}(w_{j},d)}
\end{align*}

where:

- $d$: Document.

- $\text{c}(w_{j},d)$: Frequency of word $w_{j}$ in $d$.

To prevent underflow, we use log probabilities:

\begin{align*}
\log P(C_{i}|d) = \log P(C_{i}) + \sum_{j=1}^{V} \text{c}(w_{j},d) \cdot \log P(w_{j}|C_{i})
\end{align*}

Some implementation notes:
- Instead of iterating word by word, the posterior is computed with a single matrix-vector product.
- `log_likelihoods`: Precomputed $\log P(w_{j}|C_{i})$ matrix (classes × words).
- `log_likelihoods @ x_vec`: Matrix-vector multiplication that computes $\sum_{j=1}^{V} \text{c}(w_{j},d) \cdot \log P(w_{j}|C_{i})$ for all classes simultaneously.
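
The underflow problem mentioned above can be reproduced in a few lines. The probability value and document length below are arbitrary illustration values.

```python
import math

# Hypothetical document: 100 word occurrences, each with probability 1e-5.
p = 1e-5
n_words = 100

# Naive product of probabilities underflows to exactly 0.0 in double precision.
product = 1.0
for _ in range(n_words):
    product *= p

# The equivalent sum of logs stays finite and comparable across classes.
log_sum = n_words * math.log(p)

print(product)  # 0.0
print(log_sum)  # a large negative but finite number
```

Once every class scores `0.0`, the classes can no longer be ranked, whereas the log-scores remain distinct.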

In [13]:
def calculate_posterior(x_vec: np.ndarray,
                        priors: Dict[str, float],
                        log_likelihoods: np.ndarray,
                        classes: List[str]) -> Dict[str, float]:
    """
    Calculates log-posterior probabilities for each class.
    Uses array operations for optimised performance.

    Args:
        x_vec: Feature vector for a single document (shape: [n_features]).
        priors: Prior probabilities for each class (keys: class labels).
        log_likelihoods: Log-likelihood matrix (shape: [n_classes, n_features]).
        classes: Ordered list of class labels corresponding to log_likelihoods rows.

    Returns:
        Dictionary mapping class labels to log-posterior probabilities.
    """
    log_priors = np.array([np.log(priors[c]) for c in classes])
    word_contributions = log_likelihoods @ x_vec
    log_posteriors = log_priors + word_contributions
    return dict(zip(classes, log_posteriors))

In [14]:
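# Pass log-likelihoods (not raw probabilities), and take the class order
# from the likelihood table so that its rows line up with the labels.
log_likelihoods_df = np.log(calculate_likelihoods(X, y))
calculate_posterior(X.iloc[0].to_numpy(), calculate_priors(y),
                    log_likelihoods_df.to_numpy(), list(log_likelihoods_df.index))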
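
The log-posteriors returned by `calculate_posterior` are unnormalised, which is enough for picking the most likely class. If actual probabilities are wanted, one common approach is the log-sum-exp trick, sketched below on made-up values; `normalise_log_posteriors` is a hypothetical helper.

```python
import math
from typing import Dict


def normalise_log_posteriors(log_post: Dict[str, float]) -> Dict[str, float]:
    """Convert unnormalised log-posteriors into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating so exp() does not underflow.
    max_log = max(log_post.values())
    shifted = {c: math.exp(v - max_log) for c, v in log_post.items()}
    total = sum(shifted.values())
    return {c: value / total for c, value in shifted.items()}


# Hypothetical log-posteriors of a realistic magnitude for a long document.
example = {"ham": -1050.0, "spam": -1052.0}
probs = normalise_log_posteriors(example)
print(probs)

# Exponentiating directly would lose all information:
print(math.exp(-1050.0))  # 0.0
```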

## 7. Prediction

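`X_aligned = X.reindex(columns=aligned_cols, fill_value=0)` ensures that the columns of the document-term matrix match the columns of the precomputed log-likelihood matrix. Any missing words are filled with `0`, so they contribute nothing to the multiplication. Note that this `predict()` estimates the model from `X` and `y` and then predicts the same `X`, so its output reflects performance on the training data.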

In [15]:
def predict(X: pd.DataFrame, y: pd.Series) -> List[str]:
    """
    Predicts class labels using Multinomial Naive Bayes.

    Args:
        X: Document-term matrix with documents as rows and features as columns.
        y: Target labels corresponding to document rows.

    Returns:
        Predicted class labels for each document in X.
    """
    # Convert input to DataFrame if needed
    if not isinstance(X, pd.DataFrame):
        X = pd.DataFrame(X)

    priors = calculate_priors(y)
    likelihoods_df = calculate_likelihoods(X, y)
    log_likelihoods_df = np.log(likelihoods_df)
    # Use the likelihood table's row order so labels match the log-likelihood rows
    classes = list(likelihoods_df.index)

    # Precompute aligned indices for faster access
    aligned_cols = log_likelihoods_df.columns
    X_aligned = X.reindex(columns=aligned_cols, fill_value=0)

    predictions = []
    for i in range(len(X)):
        # Direct array access for speed
        x_vec = X_aligned.iloc[i].values
        posterior = calculate_posterior(
            x_vec, priors, log_likelihoods_df.values, classes
        )
        predictions.append(max(posterior, key=posterior.get))

    return predictions

In [16]:
predict(X, y)[:10]

['ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam']

## 8. Evaluation Metrics
### Binary Confusion Matrix
In a confusion matrix, the terms True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) describe the classification performance for binary classification. 

|                     | Predicted Negative  | Predicted Positive  |
| ------------------- | ------------------- | ------------------- |
| **Actual Negative** | True Negative (TN)  | False Positive (FP) |
| **Actual Positive** | False Negative (FN) | True Positive (TP)  |


1. True Positive (TP): The number of instances correctly predicted as positive (e.g., a disease correctly identified).

2. True Negative (TN): The number of instances correctly predicted as negative (e.g., no disease correctly identified).

3. False Positive (FP): The number of instances incorrectly predicted as positive (e.g., predicting disease when there isn't any).

4. False Negative (FN): The number of instances incorrectly predicted as negative (e.g., missing a disease when it exists).

### Multi-Class Confusion Matrix
For multi-class classification, the concepts can be extended by treating one class as the "positive" class and all others as "negative" classes in a one-vs-all approach. Rows represent the actual classes (true labels), and columns represent the predicted classes. For a class $C$,
1. True Positive (TP): The count in the diagonal cell corresponding to class $C$ ($\text{matrix} [C][C]$).
2. False Positive (FP): The sum of the column for class $C$, excluding the diagonal ($\sum(\text{matrix} [:, C]) - \text{matrix} [C][C]$).
3. False Negative (FN): The sum of the row for class $C$, excluding the diagonal ($\sum(\text{matrix} [C, :]) - \text{matrix} [C][C]$).
4. True Negative (TN): All other cells not in the row or column for class $C$ ($\text{total} - (FP + FN + TP)$).

|                  | Predicted Class 0 | Predicted Class 1 | Predicted Class 2 |
| ---------------- | ----------------- | ----------------- | ----------------- |
| **True Class 0** | 5                 | 2                 | 0                 |
| **True Class 1** | 1                 | 6                 | 1                 |
| **True Class 2** | 0                 | 2                 | 7                 |


For Class 0:
- TP = 5 (diagonal element for Class 0)
- FP = 1 (sum of column 0 minus TP: 1 + 0)
- FN = 2 (sum of row 0 minus TP: 2 + 0)
- TN = 6 + 1 + 2 + 7 = 16 (all other cells not in row 0 or column 0)

For Class 1:
- TP = 6 (diagonal element for Class 1)
- FP = 4 (sum of column 1 minus TP: 2 + 2)
- FN = 2 (sum of row 1 minus TP: 1 + 1)
- TN = 5 + 0 + 0 + 7 = 12 (all other cells not in row 1 or column 1)
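
The one-vs-all counts above can be checked programmatically. The sketch below applies the four rules to the example matrix; `one_vs_rest_counts` is a hypothetical helper name.

```python
from typing import List, Tuple

# Example matrix from the table above: rows are true classes, columns are predictions.
matrix = [
    [5, 2, 0],
    [1, 6, 1],
    [0, 2, 7],
]


def one_vs_rest_counts(matrix: List[List[int]], c: int) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for class index c."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[c][c]
    fp = sum(row[c] for row in matrix) - tp  # column sum minus the diagonal
    fn = sum(matrix[c]) - tp                 # row sum minus the diagonal
    tn = total - tp - fp - fn
    return tp, fp, fn, tn


for c in range(len(matrix)):
    print(f"Class {c}: TP, FP, FN, TN = {one_vs_rest_counts(matrix, c)}")
```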

In [17]:
def confusion_matrix(y_true: pd.Series, y_pred: List[str],
                     class_names: List[str] = None) -> Tuple[NDArray[np.int64], List[str]]:
    """
    Calculate the confusion matrix.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.
        class_names: List of class names. Defaults to None.

    Returns:
        Tuple: 
        - Confusion matrix.
        - List of class names.
    """
    # Encode labels as integers
    unique_classes = np.unique(y_true)
    if class_names is None:
        class_names = [str(cls) for cls in unique_classes]
    class_to_index = {cls: i for i, cls in enumerate(unique_classes)}

    n_classes = len(unique_classes)
    matrix = np.zeros((n_classes, n_classes), dtype=int)

    for true, pred in zip(y_true, y_pred):
        true_idx = class_to_index[true]
        pred_idx = class_to_index[pred]
        matrix[true_idx][pred_idx] += 1

    return matrix, class_names

### Accuracy
Accuracy is the most common evaluation metric for classification problems, representing the percentage of correct predictions out of total predictions. It provides a simple measure of how often the classifier makes correct predictions across all classes.

\begin{align*}
\text{Accuracy} = \dfrac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Samples}}
\end{align*}

In [18]:
def accuracy(y_true: pd.Series, y_pred: List[str]) -> float:
    """
    Calculate the accuracy of predictions by comparing true and predicted labels.

    Args:
        y_true: Ground truth target values. Contains the actual class labels for each sample.
        y_pred: Estimated target as returned by a classifier. Contains the predicted class labels for each sample.
    Returns:
        Classification accuracy (0.0 to 1.0).
    """
    return np.mean(y_true == y_pred)
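
Accuracy alone can be misleading on an imbalanced dataset such as this one. As a rough illustration using the class counts from Section 2, a trivial classifier that always predicts the majority class already scores well on accuracy while detecting no spam at all.

```python
# Class counts taken from the value_counts() output in Section 2.
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# A baseline that always predicts the majority class.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

# That baseline never predicts 'spam', so it recovers none of the spam messages.
spam_recall = 0 / counts["spam"]

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Spam recall: {spam_recall:.4f}")
```

This is the motivation for also reporting precision, recall, and F1-score per class.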

### Precision
Precision measures the proportion of true positive predictions out of all positive predictions made by the classifier.

\begin{align*}
\text{Precision} = \dfrac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
\end{align*}

In [19]:
def precision(y_true: pd.Series, y_pred: List[str]) -> NDArray[np.float64]:
    """
    Calculate precision for each class.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.

    Returns:
        Precision values for each class.
    """
    cm, _ = confusion_matrix(y_true, y_pred)
    return np.diag(cm) / (np.sum(cm, axis=0) + 1e-7)

### Recall
Recall measures the proportion of true positive predictions out of all actual positive cases.

\begin{align*}
\text{Recall} = \dfrac{\text{True Positives (TP)} }{\text{True Positives (TP)} + \text{False Negatives (FN)}}
\end{align*}

In [20]:
def recall(y_true: pd.Series, y_pred: List[str]) -> NDArray[np.float64]:
    """
    Calculate recall for each class.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.

    Returns:
        Recall values for each class.
    """
    cm, _ = confusion_matrix(y_true, y_pred)
    return np.diag(cm) / (np.sum(cm, axis=1) + 1e-7)

### F1-Score
The F1-Score is the harmonic mean of precision and recall.

\begin{align*}
\text{F1-Score} = 2 \times \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}

In [21]:
def f1_score(y_true: pd.Series, y_pred: List[str]) -> NDArray[np.float64]:
    """
    Calculate F1-score for each class.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.

    Returns:
        F1-scores for each class.
    """
    prec = precision(y_true, y_pred)
    rec = recall(y_true, y_pred)
    return 2 * (prec * rec) / (prec + rec + 1e-7)
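
To see why a harmonic mean is used rather than a simple average, consider a classifier with perfect precision but very low recall. The numbers below are made up for illustration, and `f1_from` is a hypothetical scalar helper.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision_value, recall_value = 1.0, 0.1
arithmetic_mean = (precision_value + recall_value) / 2
f1_value = f1_from(precision_value, recall_value)

print(f"Arithmetic mean: {arithmetic_mean:.4f}")
print(f"F1-score: {f1_value:.4f}")
```

The harmonic mean is pulled towards the smaller of the two values, so a model cannot obtain a high F1-score by excelling at only one of them.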

In [22]:
def evaluate(y_true: pd.Series, y_pred: List[str],
             class_names: List[str] = None) -> Tuple[float, float, float, float, NDArray[np.int64]]:
    """
    Calculate accuracy, macro-averaged precision, recall, and F1-score, and the confusion matrix.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.
        class_names: List of class names. Defaults to None.

    Returns:
        Tuple:
        - Overall accuracy.
        - Macro-averaged precision (unweighted mean over classes).
        - Macro-averaged recall.
        - Macro-averaged F1-score.
        - Confusion matrix.
    """
    cm, class_names = confusion_matrix(y_true, y_pred, class_names)
    acc = accuracy(y_true, y_pred)
    prec = precision(y_true, y_pred)
    rec = recall(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    # print("Class\tPrecision\tRecall\tF1-Score")
    # for i, class_name in enumerate(class_names):
    #     print(f"{class_name}\t{prec[i]:.4f}\t\t{rec[i]:.4f}\t{f1[i]:.4f}")
    return acc, np.mean(prec), np.mean(rec), np.mean(f1), cm

## 9. Train Test Split
Train test split is a fundamental model validation technique in machine learning. It divides a dataset into two separate portions: a **training set** used to train a model, and a **testing set** used to evaluate how well the model can perform on unseen data. 

The typical split ratio is 80% for training and 20% for testing, though this can vary (70/30 or 90/10 are also common). The key principle is that the test set must remain completely separate during the model training process, and should never be used to make decisions about the model or to tune parameters.

The split is usually done randomly to ensure both sets are representative of the overall dataset, and many libraries (such as scikit-learn) provide built-in functions that handle this process automatically while maintaining proper randomisation.


In [23]:
def train_test_split(X: pd.DataFrame, y: pd.Series, test_size: float = 0.2,
                     random_state: int = None) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """
    Split arrays or matrices into random train and test subsets.

    Args:
        X: Input features, a 2D array with rows (samples) and columns (features).
        y: Target values/labels, a 1D array with rows (samples).
        test_size: Proportion of the dataset to include in the test split. Must be between 0.0 and 1.0. default = 0.2
        random_state: Seed for the random number generator to ensure reproducible results. default = None

    Returns:
        A tuple containing:
            - X_train: Training set features.
            - X_test: Testing set features.
            - y_train: Training set target values.
            - y_test: Testing set target values.
    """
    # Seed the random number generator if a seed is given (0 is a valid seed)
    if random_state is not None:
        np.random.seed(random_state)

    # Create an array of indices from 0 to len(X) - 1
    indices = np.arange(len(X))

    # Shuffle the indices
    np.random.shuffle(indices)

    # Define the size of our test data from len(X)
    test_size = int(test_size * len(X))

    # Generate indices for test and train data
    test_indices: NDArray[np.int64] = indices[:test_size]
    train_indices: NDArray[np.int64] = indices[test_size:]

    # Return: X_train, X_test, y_train, y_test
    return X.iloc[train_indices], X.iloc[test_indices], y.iloc[train_indices], y.iloc[test_indices]
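
The core of the function above is an index shuffle followed by a slice. The standard-library sketch below isolates that idea to show two properties worth checking in any split: the same seed reproduces the same split, and the train and test indices never overlap. `split_indices` is a hypothetical helper, and it uses Python's `random` module rather than NumPy, so it will not reproduce the exact indices of the function above.

```python
import random
from typing import List, Tuple


def split_indices(n: int, test_size: float = 0.2, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle the indices 0..n-1 and return (train_indices, test_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local generator, no global state
    n_test = int(test_size * n)
    return indices[n_test:], indices[:n_test]


# 5572 is the number of rows in the spam dataset.
train_a, test_a = split_indices(5572, test_size=0.2, seed=42)
train_b, test_b = split_indices(5572, test_size=0.2, seed=42)

print(len(train_a), len(test_a))
print(test_a == test_b)                  # same seed, same split
print(set(train_a).isdisjoint(test_a))   # no leakage between the sets
```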

## 10. Encapsulation

Note that instead of looping over each document and calling `calculate_posterior()` for each, the `predict()` method does this for all documents at once using matrix operations:

- `log_priors` is the vectorised version of the log of prior probabilities.

- `word_contributions` computes, for all documents and classes, the sum of the log-likelihoods weighted by the word counts (the main computation in `calculate_posterior()`).

- `log_posteriors` adds the log priors to the word contributions, just like `calculate_posterior()` would for a single document.
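
The equivalence between the per-document loop and the batched computation can be checked on a toy example. The sketch below uses plain Python lists and made-up numbers so that it stays self-contained; in the class, NumPy performs the same product in a single call.

```python
import math

# Hypothetical model: 2 classes, 3 words.
log_priors = [math.log(0.5), math.log(0.5)]
log_likelihoods = [
    [math.log(0.1), math.log(0.6), math.log(0.3)],  # class 0
    [math.log(0.5), math.log(0.1), math.log(0.4)],  # class 1
]
# Two documents as rows of word counts.
X_counts = [[2, 0, 1], [0, 3, 0]]


def posterior_single(x_vec):
    """Per-document version: log prior + count-weighted sum of log-likelihoods."""
    return [
        lp + sum(count * ll for count, ll in zip(x_vec, row))
        for lp, row in zip(log_priors, log_likelihoods)
    ]


looped = [posterior_single(x) for x in X_counts]

# Batched version: (documents x words) times (words x classes), plus the priors.
likelihoods_T = list(zip(*log_likelihoods))
batched = [
    [
        log_priors[k] + sum(x[j] * likelihoods_T[j][k] for j in range(len(x)))
        for k in range(len(log_priors))
    ]
    for x in X_counts
]

# Index of the highest-scoring class for each document.
predictions = [max(range(len(row)), key=row.__getitem__) for row in batched]
print(predictions)
```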

In [None]:
class CustomMultinomialNB:
    """
    Multinomial Naive Bayes classifier implementation with optimised vector operations.

    Attributes:
        alpha (float): Smoothing parameter (default = 1.0)
        priors_ (Dict[str, float]): Prior probabilities per class
        likelihoods_ (pd.DataFrame): Likelihood probabilities (shape: [n_classes, n_features])
        log_likelihoods_ (np.ndarray): Precomputed log-likelihoods (shape: [n_classes, n_features])
        classes_ (List[str]): Unique class labels
        feature_names_ (pd.Index): Feature names from training data
    """

    def __init__(self, alpha: float = 1.0) -> None:
        """
        Initialise Multinomial Naive Bayes classifier.

        Args:
            alpha: Smoothing parameter for Laplace smoothing (default = 1.0).
        """
        self.alpha = alpha
        self.priors_ = None
        self.likelihoods_ = None
        self.log_likelihoods_ = None
        self.classes_ = None
        self.feature_names_ = None

    def fit(self, X: pd.DataFrame, y: pd.Series) -> None:
        """
        Train Multinomial Naive Bayes model.
        
        Args:
            X: Document-term matrix (documents x features).
            y: Target class labels.
        """

        # Convert input to DataFrame if needed
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)

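        self.feature_names_ = X.columns
        self.likelihoods_ = self._calculate_likelihoods(X, y)
        # Take the class order from the likelihood table (groupby sorts the labels)
        # so that the rows of log_likelihoods_ line up with classes_.
        self.classes_ = self.likelihoods_.index.tolist()
        # Note: this assigns uniform priors; self._calculate_priors(y) would give
        # the empirical class frequencies described in Section 4.
        self.priors_ = {cls: 1/len(self.classes_) for cls in self.classes_}
        self.log_likelihoods_ = np.log(self.likelihoods_.values)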

    def predict(self, X: pd.DataFrame) -> List[str]:
        """
        Predict class labels for documents in X.

        Args:
            X: Document-term matrix to predict.

        Returns:
            Predicted class labels.
        """
        # Align features with training data
        X_aligned = X.reindex(columns=self.feature_names_, fill_value=0)

        # Precompute log priors
        log_priors = np.array([np.log(self.priors_[c]) for c in self.classes_])

        # Vectorised prediction (calculating posteriors here)
        word_contributions = X_aligned @ self.log_likelihoods_.T
        log_posteriors = log_priors + word_contributions
        max_indices = np.argmax(log_posteriors, axis=1)

        return [self.classes_[idx] for idx in max_indices]

    def _calculate_priors(self, y: pd.Series) -> Dict[str, float]:
        """
        Calculate prior probabilities for each class in the target variable.

        Args:
            y: Target variable containing class labels (strings).

        Returns:
            Prior probabilities for each class.
        """
        return y.value_counts(normalize=True).to_dict()

    def _calculate_likelihoods(self, X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
        """
        Compute feature likelihoods (P(x_i|y)) with Laplace smoothing.

        Returns:
            Likelihood DataFrame (classes x features)
        """
        class_totals = X.groupby(y).sum()
        total_words_per_class = class_totals.sum(axis=1)
        vocab_size = len(X.columns)

        numerator = class_totals + self.alpha
        denominator = (total_words_per_class.to_numpy()[:, np.newaxis]
                       + vocab_size * self.alpha)
        return numerator / denominator

Let's check the performance on the entire dataset:

In [25]:
model = CustomMultinomialNB(alpha=1.0)
model.fit(X, y)
y_pred = model.predict(X)
acc, prec, rec, f1, cm = evaluate(y, y_pred)
print(f'Accuracy (Entire Dataset): {acc:.4f}')
print(f'Precision (Entire Dataset): {prec:.4f}')
print(f'Recall (Entire Dataset): {rec:.4f}')
print(f'F1-Score (Entire Dataset): {f1:.4f}')
print(f'Confusion Matrix (Entire Dataset):\n{cm}')

Accuracy (Entire Dataset): 0.9855
Precision (Entire Dataset): 0.9588
Recall (Entire Dataset): 0.9809
F1-Score (Entire Dataset): 0.9694
Confusion Matrix (Entire Dataset):
[[4763   62]
 [  19  728]]


Train the model on 80% of the dataset, then evaluate its performance on the remaining 20% (the test set).

In [26]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = CustomMultinomialNB(alpha=1.0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
acc, prec, rec, f1, cm = evaluate(y_test, y_pred)
print(f'Accuracy (Test): {acc:.4f}')
print(f'Precision (Test): {prec:.4f}')
print(f'Recall (Test): {rec:.4f}')
print(f'F1-Score (Test): {f1:.4f}')
print(f'Confusion Matrix (Test):\n{cm}')

Accuracy (Test): 0.9677
Precision (Test): 0.9115
Recall (Test): 0.9615
F1-Score (Test): 0.9343
Confusion Matrix (Test):
[[936  29]
 [  7 142]]
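
As a sanity check, the scores printed above can be recomputed directly from the test confusion matrix, which also makes clear that the reported precision, recall, and F1-score are unweighted (macro) averages over the two classes. The sketch omits the small `1e-7` stabiliser used in the metric functions, which does not affect the result at four decimal places.

```python
# Test confusion matrix from the output above (rows: true ham/spam, columns: predicted).
cm = [[936, 29], [7, 142]]
n = len(cm)

# Per-class precision: diagonal over column sum; recall: diagonal over row sum.
precisions = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]
recalls = [cm[i][i] / sum(cm[i]) for i in range(n)]
f1_scores = [2 * p * r / (p + r) for p, r in zip(precisions, recalls)]

acc = sum(cm[i][i] for i in range(n)) / sum(sum(row) for row in cm)
macro_precision = sum(precisions) / n
macro_recall = sum(recalls) / n
macro_f1 = sum(f1_scores) / n

print(f"Accuracy: {acc:.4f}")
print(f"Macro precision: {macro_precision:.4f}")
print(f"Macro recall: {macro_recall:.4f}")
print(f"Macro F1-score: {macro_f1:.4f}")
print(f"Per-class precision (ham, spam): {precisions}")
```

The per-class values show that most of the errors are ham messages flagged as spam, which lowers the spam precision.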


## 11. Comparison with Scikit-Learn

In [27]:
# Alias scikit-learn's splitter so it does not shadow the custom train_test_split()
from sklearn.model_selection import train_test_split as sk_train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = sk_train_test_split(
    X, y, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
sk_accuracy = model.score(X_test, y_test)  # avoid shadowing the custom accuracy()
print(f'Predictions: {y_pred}')
print(f'Accuracy: {sk_accuracy:.4f}')
print(f'Classification report:\n{classification_report(y_test, y_pred)}')

Predictions: ['ham' 'ham' 'ham' ... 'ham' 'ham' 'ham']
Accuracy: 0.9857
Classification report:
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       966
        spam       0.94      0.95      0.95       149

    accuracy                           0.99      1115
   macro avg       0.97      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115

