# Multinomial Naive Bayes Classifier from Scratch
***
## Table of Contents
1. [Introduction](#1-introduction)
    - [Bayes' Theorem](#bayes-theorem)
2. [Loading Data](#2-loading-data)
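3. [Text Preprocessing](#3-text-preprocessing)
4. [Prior Probability](#4-prior-probability)
5. [Likelihood](#5-likelihood-for-multinomial-naive-bayes)
6. [Posterior Probability](#6-posterior-probability-for-multinomial-naive-bayes)
7. [Prediction](#7-prediction)
8. [Evaluation Metrics](#8-evaluation-metrics)
9. [Encapsulation](#9-encapsulation)
10. [Comparison with Scikit-Learn](#10-comparison-with-scikit-learn)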

***

In [288]:
import nltk
from tqdm import tqdm
import numpy as np
import pandas as pd
from typing import Tuple, List, Dict
from numpy.typing import NDArray
import string
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
string.punctuation

nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tsu76i/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/tsu76i/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## 1. Introduction
Naive Bayes classifiers are probabilistic classification models based on Bayes' Theorem, assuming conditional independence between features given the class labels or values. Naive Bayes is a general framework; the specific variant should be chosen based on the nature of your data:

- **Categorical Naive Bayes**

    - **Features**: Categorical labels (e.g., colours, countries, product types).

    - **Use Case**: Classification with discrete, categorically distributed features.

- **Multinomial Naive Bayes**

    - **Features**: Counts or frequencies (e.g., word occurrences, event counts).

    - **Use Case**: Text classification, document classification, or any scenario where features are discrete counts.

- **Gaussian Naive Bayes**

    - **Features**: Continuous data (e.g., measurements, sensor readings).

    - **Use Case**: Classification with numerical features assumed to follow a Gaussian distribution.

- **Bernoulli Naive Bayes**

    - **Features**: Binary features (e.g., True/False, 0/1).

    - **Use Case**: Text classification (presence/absence of words), binary feature spaces.



### Bayes' Theorem
Bayes' theorem describes the probability of a class $C$ given a document $d$:

\begin{align*}
P(C|d) = \dfrac{P(d|C) \cdot P(C)}{P(d)}
\end{align*}

where:
- $P(C|d)$: Posterior probability of class $C$ given document $d$.
- $P(d|C)$: Likelihood of document $d$ given class $C$.
- $P(C)$: Prior probability of class $C$.
- $P(d)$: Probability of document $d$ (acts as a normalising constant).

Naive Bayes assumes word occurrences in $d$ are conditionally independent given $C$. For a document represented by word counts $\{\text{c}(w_1,d), \text{c}(w_2,d), \dots, \text{c}(w_n,d)\}$, the likelihood, up to a multinomial coefficient that does not depend on $C$, is:

\begin{align*}
P(d|C) = \prod_{i=1}^{n} P(w_i|C)^{\text{c}(w_i,d)}
\end{align*}

where:
- $w_i$ = $i$-th word in vocabulary.
- $\text{c}(w_i,d)$ = frequency of $w_i$ in $d$.
- $n$ = vocabulary size.

Replacing $P(d|C)$ in Bayes' theorem, the equation becomes:

\begin{align*}
P(C|d) = \dfrac{P(C) \cdot \prod_{i=1}^{n} P(w_i|C)^{\text{c}(w_i,d)}}{P(d)}
\end{align*}

Since $P(d)$ is constant across classes:

\begin{align*}
P(C|d) \propto P(C) \cdot \prod_{i=1}^{n} P(w_i|C)^{\text{c}(w_i,d)}
\end{align*}

The symbol $\propto$ denotes proportionality, meaning we ignore $P(d)$ when comparing probabilities across classes.
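
As a small numerical illustration of why $P(d)$ can be dropped, the sketch below scores a toy document against two classes. All priors and word probabilities are made-up values chosen for the example, not estimates from the dataset; the point is only that dividing by $P(d)$ rescales the scores without changing which class wins.

```python
# Toy two-class example; every probability here is an illustrative assumption.
priors = {"ham": 0.8, "spam": 0.2}
word_probs = {
    "ham": {"free": 0.1, "meeting": 0.9},
    "spam": {"free": 0.8, "meeting": 0.2},
}
doc_counts = {"free": 2, "meeting": 1}  # c(w, d) for the toy document


def unnormalised_posterior(cls):
    """P(C) * prod_i P(w_i|C)^c(w_i, d), without dividing by P(d)."""
    score = priors[cls]
    for word, count in doc_counts.items():
        score *= word_probs[cls][word] ** count
    return score


scores = {cls: unnormalised_posterior(cls) for cls in priors}
evidence = sum(scores.values())  # P(d), identical for every class
posteriors = {cls: s / evidence for cls, s in scores.items()}

best_unnormalised = max(scores, key=scores.get)
best_normalised = max(posteriors, key=posteriors.get)
print(best_unnormalised, best_normalised)
```

Both selections agree, because normalising divides every score by the same positive constant.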

## 2. Loading Data
Dataset retrieved from [Kaggle - Spam Email](https://www.kaggle.com/datasets/mfaisalqureshi/spam-email?select=spam.csv)

In [289]:
df = pd.read_csv('../_datasets/spam.csv')
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [290]:
df['Category'].value_counts()

Category
ham     4825
spam     747
Name: count, dtype: int64

## 3. Text Preprocessing
Before implementing our Multinomial Naive Bayes classifier, we need to perform text preprocessing to ensure effective spam email classification. We will:

- Convert all text to lowercase
- Remove punctuation
- Remove non-alphabetic characters
- Remove stopwords (e.g., 'a', 'the')
- Lemmatise words

In [291]:
def clean_text(text: str) -> str:
    """
    Clean and preprocess email text for spam detection.

    1. Convert all text to lowercase
    2. Remove punctuation
    3. Remove non-alphabetic characters
    4. Remove stopwords
    5. Lemmatise words

    Args:
        text (str): Raw email text.

    Returns:
        str: Cleaned and preprocessed text.
    """

    text = text.lower()  # Lowercase
    text = text.translate(str.maketrans(
        '', '', string.punctuation))  # Remove punctuation
    text = re.sub(r'[^a-z\s]', '', text)  # Remove non-alphabetic chars
    words = text.split()  # Tokenise
    stop_words = set(stopwords.words('english'))
    # Extract only non-stopwords
    words = [w for w in words if w not in stop_words]

    lemmatiser = WordNetLemmatizer()
    words = [lemmatiser.lemmatize(w) for w in words]
    return ' '.join(words)  # Return words as a single string

In [292]:
df['clean_text'] = df['Message'].apply(clean_text)
df.head()

Unnamed: 0,Category,Message,clean_text
0,ham,"Go until jurong point, crazy.. Available only ...",go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry wkly comp win fa cup final tkts st ...
3,ham,U dun say so early hor... U c already then say...,u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah dont think go usf life around though


In [293]:
X, y = df['clean_text'], df['Category']

Now, we apply `CountVectorizer` to convert the text data into a document-term matrix (also known as a Bag of Words), where each unique word in the dataset becomes a feature (column) and each value is the word count in a given document (row).

In [294]:
from sklearn.feature_extraction.text import CountVectorizer
vectoriser = CountVectorizer()
X_doc_term = vectoriser.fit_transform(X)
X = pd.DataFrame(X_doc_term.toarray(),
                 columns=vectoriser.get_feature_names_out())
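
To make the structure of the document-term matrix concrete, here is a minimal hand-rolled sketch using only the standard library. The three messages are hypothetical cleaned texts, and the tokenisation is a plain whitespace split, so it is not a drop-in replacement for `CountVectorizer` (whose default token pattern, for instance, ignores single-character tokens such as `u`).

```python
from collections import Counter

# Hypothetical cleaned messages, used only for illustration
docs = ["free entry win", "ok lar joking", "win free free prize"]

# Vocabulary: sorted unique tokens, one column per word
vocab = sorted({word for doc in docs for word in doc.split()})

# Each row holds the count of every vocabulary word in one document
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```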

## 4. Prior Probability
Class $C$ (`y`) takes only two discrete values, `ham` and `spam`:

\begin{align*}
P(C=\text{'ham'}) = \dfrac{\text{Count(ham)}}{\text{Total Count}}
\end{align*}

\begin{align*}
P(C=\text{'spam'}) = \dfrac{\text{Count(spam)}}{\text{Total Count}}
\end{align*}


In [295]:
print(f'Total count: {len(df)}')
print(f'Counts: {y.value_counts().to_dict()}')

Total count: 5572
Counts: {'ham': 4825, 'spam': 747}


\begin{align*}
P(\text{'ham'}) = \dfrac{4825}{5572} = 0.8659
\end{align*}

\begin{align*}
P(\text{'spam'}) = \dfrac{747}{5572} = 0.1341
\end{align*}

In [296]:
def calculate_priors(y: pd.Series) -> Dict[str, float]:
    """
    Calculate prior probabilities for each class in the target variable.

    Args:
        y (pd.Series): Target variable containing class labels (strings).

    Returns:
        Dict[str, float]: Prior probabilities for each class.
    """
    return y.value_counts(normalize=True).to_dict()

In [297]:
calculate_priors(y)

{'ham': 0.8659368269921034, 'spam': 0.13406317300789664}

## 5. Likelihood for Multinomial Naive Bayes

For word $w_i$ and class $C$, the likelihood is calculated as:

\begin{align*}
P(w_i|C) = \frac{
    \text{count}(w_i \text{ in } C) + \alpha
}{
    \text{total words in } C + |V| \cdot \alpha
}
\end{align*}

where:
- $\text{count}(w_i \text{ in } C)$: Total occurrences of word $w_i$ in class $C$
- $\text{total words in } C$: Sum of all word counts in class $C$
- $|V|$ : Vocabulary size
- $\alpha$: Laplace smoothing parameter

For a document $d$ with word counts $\{c_1, c_2, \dots, c_{|V|}\}$, the likelihood becomes:
\begin{align*}
P(d|C) \propto \prod_{i=1}^{|V|} \left[ P(w_i|C) \right]^{c_i}
\end{align*}
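
The smoothing formula can be checked by hand on a tiny example. The sketch below assumes a hypothetical three-word vocabulary and made-up per-class counts; it shows that a word never seen in a class still receives a non-zero probability, and that each class's likelihoods sum to 1.

```python
import numpy as np

# Hypothetical per-class word counts: rows = classes, columns = vocabulary words
class_counts = np.array([[2, 1, 0],   # ham
                         [0, 1, 3]])  # spam
alpha = 1.0
vocab_size = class_counts.shape[1]

# Total words per class, kept as a column so it broadcasts across the words
totals = class_counts.sum(axis=1, keepdims=True)
likelihoods = (class_counts + alpha) / (totals + vocab_size * alpha)

print(likelihoods)              # ham row: [3/6, 2/6, 1/6], spam row: [1/7, 2/7, 4/7]
print(likelihoods.sum(axis=1))  # each row sums to 1
```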


In [298]:
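def calculate_likelihoods(X: pd.DataFrame, y: pd.Series,
                          alpha: float = 1.0) -> pd.DataFrame:
    """
    Calculate Laplace-smoothed word likelihoods P(w|C) for each class.

    Args:
        X (pd.DataFrame): Document-term matrix (documents × words).
        y (pd.Series): Class labels, aligned with the rows of X.
        alpha (float): Laplace smoothing parameter. Defaults to 1.0.

    Returns:
        pd.DataFrame: Likelihoods (classes × words), rows sorted by class label.
    """
    # Group by class and sum word counts
    class_totals = X.groupby(y).sum()
    total_words_per_class = class_totals.sum(axis=1)
    vocab_size = len(X.columns)

    # Vectorised calculation
    numerator = class_totals + alpha
    # Reshape to a column vector so it broadcasts across the word columns
    denominator = total_words_per_class.values[:, np.newaxis] + vocab_size * alpha
    likelihoods = numerator / denominator
    return likelihoods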

In [299]:
calculate_likelihoods(X, y)

Category,aa,aah,aaniye,aaooooright,aathilove,aathiwhere,ab,abbey,abdomen,abeg,...,zed,zero,zf,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada
ham,4.4e-05,8.8e-05,4.4e-05,4.4e-05,8.8e-05,8.8e-05,2.2e-05,4.4e-05,4.4e-05,4.4e-05,...,2.2e-05,4.4e-05,2.2e-05,4.4e-05,4.4e-05,4.4e-05,4.4e-05,4.4e-05,2.2e-05,4.4e-05
spam,5.5e-05,5.5e-05,5.5e-05,5.5e-05,5.5e-05,5.5e-05,0.00011,5.5e-05,5.5e-05,5.5e-05,...,0.000385,5.5e-05,0.00011,5.5e-05,5.5e-05,0.00011,5.5e-05,5.5e-05,0.00011,5.5e-05


## 6. Posterior Probability for Multinomial Naive Bayes
As we discussed [above](#1-introduction), the formula of posterior probability is:

\begin{align*}
P(C|d) \propto P(C) \prod_{w \in V} P(w|C)^{\text{c}(w,d)}
\end{align*}

where:

- $d$ = Document.

- $V$ = Vocabulary.

- $\text{c}(w,d)$ = Frequency of word $w$ in $d$.

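To prevent underflow, we use log probabilities:

\begin{align*}
\log P(C|d) = \log P(C) + \sum_{w \in V} \text{c}(w,d) \cdot \log P(w|C) + \text{const}
\end{align*}

where the constant $-\log P(d)$ is the same for every class and can be ignored when comparing classes.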
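
A quick sketch of the underflow problem, using made-up numbers: multiplying 100 per-word likelihoods of $10^{-5}$ gives $10^{-500}$, which is smaller than any positive double-precision value, whereas the equivalent sum of logs stays comfortably finite.

```python
import math

word_prob = 1e-5   # illustrative per-word likelihood
n_words = 100      # illustrative document length

# Direct product: 1e-500 cannot be represented as a double and collapses to 0.0
direct = 1.0
for _ in range(n_words):
    direct *= word_prob

# Sum of logs: the same quantity on a log scale
log_score = n_words * math.log(word_prob)

print(direct)
print(log_score)
```

Once the product has collapsed to zero for every class, the classes can no longer be ranked, which is why the comparison is done in log space.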

Some extra tricks:
- Instead of iterating word by word, a vectorised matrix operation computes the posterior.
- `log_likelihoods`: Precomputed $\log P(w|C)$ matrix (classes × words).
- `log_likelihoods @ x_vec`: Matrix-vector multiplication that computes $\sum_{w \in V} \text{c}(w,d) \cdot \log P(w|C)$ for all classes simultaneously.
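
The equivalence between the word-by-word sum and the matrix-vector product can be verified on a small example. The log-likelihood matrix and count vector below are hypothetical values used only for this check.

```python
import numpy as np

# Hypothetical log-likelihood matrix (classes x words) and one document's counts
log_likelihoods = np.log(np.array([[0.5, 0.3, 0.2],
                                   [0.1, 0.2, 0.7]]))
x_vec = np.array([2, 0, 1])

# Word-by-word loop: one weighted sum per class
loop_scores = []
for class_row in log_likelihoods:
    total = 0.0
    for count, log_p in zip(x_vec, class_row):
        total += count * log_p
    loop_scores.append(total)

# The same sums from a single matrix-vector product
vectorised_scores = log_likelihoods @ x_vec

print(loop_scores)
print(vectorised_scores)
```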

In [300]:
def calculate_posterior(x_vec: np.ndarray,
                        priors: Dict[str, float],
                        log_likelihoods: np.ndarray,
                        classes: List[str]) -> Dict[str, float]:
    """
    Calculate the unnormalised log posterior of each class for one document.

    `log_likelihoods` must hold log P(w|C) (classes × words), with rows in the
    same order as `classes`; `x_vec` holds the document's word counts.
    """
    log_priors = np.array([np.log(priors[c]) for c in classes])
    word_contributions = log_likelihoods @ x_vec
    log_posteriors = log_priors + word_contributions
    return dict(zip(classes, log_posteriors))
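
The values returned by `calculate_posterior` are unnormalised log scores, which is all that is needed to pick the most likely class. If actual probabilities are wanted, one common approach is the log-sum-exp trick sketched below; `normalise_log_scores` is a hypothetical helper, not part of the notebook, and the input scores are illustrative.

```python
import math


def normalise_log_scores(log_scores):
    """Turn unnormalised log posteriors into probabilities that sum to 1."""
    peak = max(log_scores.values())
    # Subtracting the largest score keeps at least one exp() equal to 1,
    # so the total never underflows to zero
    shifted = {c: math.exp(s - peak) for c, s in log_scores.items()}
    total = sum(shifted.values())
    return {c: v / total for c, v in shifted.items()}


# Illustrative log scores; exp(-1200) on its own would underflow to 0.0
probs = normalise_log_scores({"ham": -1200.0, "spam": -1203.0})
print(probs)
```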

In [301]:
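# calculate_posterior expects log P(w|C), so take the log of the likelihoods first
log_likelihoods_df = np.log(calculate_likelihoods(X, y))
calculate_posterior(X.iloc[0].values, calculate_priors(y),
                    log_likelihoods_df.values, list(log_likelihoods_df.index))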

## 7. Prediction

`X_aligned = X.reindex(columns=aligned_cols, fill_value=0)` ensures that the columns of the document-term matrix match the columns of the precomputed log-likelihood matrix, in the same order. Vocabulary words absent from the input get a count of `0`, so they contribute nothing to the sum, and input columns outside the vocabulary are dropped.
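
The effect of `reindex` with `fill_value=0` is easiest to see on a single document. The vocabulary and counts below are hypothetical; the sketch only illustrates the alignment step, not the full prediction.

```python
import pandas as pd

vocab = ["free", "meeting", "win"]                     # hypothetical training vocabulary
x = pd.Series({"win": 2, "free": 1, "unseenword": 4})  # counts for a new document

# Reorder to the training vocabulary: absent words become 0,
# and words outside the vocabulary are dropped
x_aligned = x.reindex(vocab, fill_value=0)
print(x_aligned)
```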

In [302]:
def predict(X: pd.DataFrame, y: pd.Series) -> List[str]:
    """
    Predict class labels using Multinomial Naive Bayes.

    Args:
        X: Document-term matrix (DataFrame)
        y: Target labels (Series)

    Returns:
        Predicted class labels
    """
    # Convert input to DataFrame if needed
    if not isinstance(X, pd.DataFrame):
        X = pd.DataFrame(X)

    priors = calculate_priors(y)
    likelihoods_df = calculate_likelihoods(X, y)
    log_likelihoods_df = np.log(likelihoods_df)
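    # Use the row order of the likelihood matrix (groupby sorts the labels),
    # so each score is paired with the right class
    classes = list(log_likelihoods_df.index)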

    # Precompute aligned indices for faster access
    aligned_cols = log_likelihoods_df.columns
    X_aligned = X.reindex(columns=aligned_cols, fill_value=0)

    predictions = []
    for i in range(len(X)):
        # Direct array access for speed
        x_vec = X_aligned.iloc[i].values
        posterior = calculate_posterior(
            x_vec, priors, log_likelihoods_df.values, classes
        )
        predictions.append(max(posterior, key=posterior.get))

    return predictions
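
The per-row loop in `predict` can also be collapsed into a single matrix product that scores every document against every class at once. The sketch below is one possible formulation under that assumption; `predict_vectorised` is a hypothetical name and the numbers are toy values.

```python
import numpy as np


def predict_vectorised(X_counts, log_priors, log_likelihoods, classes):
    """Score all documents against all classes in one matrix product."""
    # (n_docs x n_words) @ (n_words x n_classes) -> (n_docs x n_classes)
    scores = X_counts @ log_likelihoods.T + log_priors
    return [classes[i] for i in scores.argmax(axis=1)]


# Illustrative values only
classes = ["ham", "spam"]
log_priors = np.log(np.array([0.8, 0.2]))
log_likelihoods = np.log(np.array([[0.1, 0.9],    # ham:  P(free), P(meeting)
                                   [0.8, 0.2]]))  # spam: P(free), P(meeting)
X_counts = np.array([[2, 1],   # "free free meeting"
                     [0, 3]])  # "meeting meeting meeting"

labels = predict_vectorised(X_counts, log_priors, log_likelihoods, classes)
print(labels)
```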

In [303]:
predict(X, y)

['ham',
 'ham',
 'spam',
 'ham',
 'ham',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'spam',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'spam',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'spam',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'ham',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 ...]

## 8. Evaluation Metrics
### Binary Confusion Matrix
In a confusion matrix, the terms True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) describe the classification performance for binary classification. 

|                     | Predicted Negative  | Predicted Positive  |
| ------------------- | ------------------- | ------------------- |
| **Actual Negative** | True Negative (TN)  | False Positive (FP) |
| **Actual Positive** | False Negative (FN) | True Positive (TP)  |


1. True Positive (TP): The number of instances correctly predicted as positive (e.g., a disease correctly identified).

2. True Negative (TN): The number of instances correctly predicted as negative (e.g., no disease correctly identified).

3. False Positive (FP): The number of instances incorrectly predicted as positive (e.g., predicting disease when there isn't any).

4. False Negative (FN): The number of instances incorrectly predicted as negative (e.g., missing a disease when it exists).
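
The four counts can be tallied directly from paired labels. In the sketch below the labels are made up, and treating `spam` as the positive class is an assumption made for the example.

```python
y_true = ["spam", "ham", "spam", "ham", "ham", "spam"]  # made-up true labels
y_pred = ["spam", "ham", "ham", "spam", "ham", "spam"]  # made-up predictions
positive = "spam"  # assumption: spam is the positive class

pairs = list(zip(y_true, y_pred))
tp = sum(t == positive and p == positive for t, p in pairs)
tn = sum(t != positive and p != positive for t, p in pairs)
fp = sum(t != positive and p == positive for t, p in pairs)
fn = sum(t == positive and p != positive for t, p in pairs)

print(tp, tn, fp, fn)
```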

### Multi-Class Confusion Matrix
For multi-class classification, the concepts can be extended by treating one class as the "positive" class and all others as "negative" classes in a one-vs-all approach. Rows represent the actual classes (true labels), and columns represent the predicted classes. For a class $C$,
1. True Positive (TP): The count in the diagonal cell corresponding to class $C$ ($\text{matrix} [C][C]$).
2. False Positive (FP): The sum of the column for class $C$, excluding the diagonal ($\sum(\text{matrix} [:, C]) - \text{matrix} [C][C]$).
3. False Negative (FN): The sum of the row for class $C$, excluding the diagonal ($\sum(\text{matrix} [C, :]) - \text{matrix} [C][C]$).
4. True Negative (TN): All other cells not in the row or column for class $C$ ($\text{total} - (FP + FN + TP)$).

|                  | Predicted Class 0 | Predicted Class 1 | Predicted Class 2 |
| ---------------- | ----------------- | ----------------- | ----------------- |
| **True Class 0** | 5                 | 2                 | 0                 |
| **True Class 1** | 1                 | 6                 | 1                 |
| **True Class 2** | 0                 | 2                 | 7                 |


For Class 0:
- TP = 5 (diagonal element for Class 0)
- FP = 1 (sum of column 0 minus TP: 1 + 0)
- FN = 2 (sum of row 0 minus TP: 2 + 0)
- TN = 6 + 1 + 2 + 7 = 16 (all other cells not in row 0 or column 0)

For Class 1:
- TP = 6 (diagonal element for Class 1)
- FP = 4 (sum of column 1 minus TP: 2 + 2)
- FN = 2 (sum of row 1 minus TP: 1 + 1)
- TN = 5 + 0 + 0 + 7 = 12 (all other cells not in row 1 or column 1)
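
The same one-vs-all counts can be computed for every class at once with NumPy. This sketch reuses the 3×3 example matrix from the table above.

```python
import numpy as np

# Rows = true classes, columns = predicted classes (the example table above)
matrix = np.array([[5, 2, 0],
                   [1, 6, 1],
                   [0, 2, 7]])

tp = np.diag(matrix)
fp = matrix.sum(axis=0) - tp        # column sums minus the diagonal
fn = matrix.sum(axis=1) - tp        # row sums minus the diagonal
tn = matrix.sum() - (tp + fp + fn)  # everything outside the class's row and column

for c in range(len(matrix)):
    print(f"Class {c}: TP={tp[c]}, FP={fp[c]}, FN={fn[c]}, TN={tn[c]}")
```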

In [304]:
def confusion_matrix(y_true: pd.Series, y_pred: List[str],
                     class_names: List[str] = None) -> Tuple[NDArray[np.int64], List[str]]:
    """
    Calculate the confusion matrix.

    Args:
        y_true (pd.Series): True labels.
        y_pred (List[str]): Predicted labels.
        class_names (List[str], optional): List of class names. Defaults to None.

    Returns:
        Tuple: 
        - NDArray[np.int64]: Confusion matrix.
        - List[str]: List of class names.
    """
    # Encode labels as integers
    unique_classes = np.unique(y_true)
    if class_names is None:
        class_names = [str(cls) for cls in unique_classes]
    class_to_index = {cls: i for i, cls in enumerate(unique_classes)}

    n_classes = len(unique_classes)
    matrix = np.zeros((n_classes, n_classes), dtype=int)

    for true, pred in zip(y_true, y_pred):
        true_idx = class_to_index[true]
        pred_idx = class_to_index[pred]
        matrix[true_idx][pred_idx] += 1

    return matrix, class_names

### Accuracy
Accuracy is the most common evaluation metric for classification problems: the proportion of correct predictions out of all predictions. On an imbalanced dataset such as this one (about 87% `ham`), a classifier that always predicts `ham` already reaches an accuracy of roughly 0.87, so precision and recall are reported alongside it.

\begin{align*}
\text{Accuracy} = \dfrac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Samples}}
\end{align*}

In [305]:
def accuracy(y_true: pd.Series, y_pred: List[str]) -> float:
    """
    Calculate the accuracy of predictions by comparing true and predicted labels.

    Args:
        y_true (pd.Series): Ground truth target values. Contains the actual class labels for each sample.
        y_pred (List[str]): Estimated target as returned by a classifier. Contains the predicted class labels for each sample.
    Returns:
        float: Classification accuracy as a fraction (0.0 to 1.0).
    """
    return np.mean(y_true == y_pred)

### Precision
Precision measures the proportion of true positive predictions out of all positive predictions made by the classifier.

\begin{align*}
\text{Precision} = \dfrac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
\end{align*}

In [306]:
def precision(y_true: pd.Series, y_pred: List[str]) -> NDArray[np.float64]:
    """
    Calculate precision for each class.

    Args:
        y_true (pd.Series): True labels.
        y_pred (List[str]): Predicted labels.

    Returns:
        NDArray[np.float64]: Precision values for each class.
    """
    cm, _ = confusion_matrix(y_true, y_pred)
    return np.diag(cm) / (np.sum(cm, axis=0) + 1e-7)

### Recall
Recall measures the proportion of true positive predictions out of all actual positive cases.

\begin{align*}
\text{Recall} = \dfrac{\text{True Positives (TP)} }{\text{True Positives (TP)} + \text{False Negatives (FN)}}
\end{align*}

In [307]:
def recall(y_true: pd.Series, y_pred: List[str]) -> NDArray[np.float64]:
    """
    Calculate recall for each class.

    Args:
        y_true (pd.Series): True labels.
        y_pred (List[str]): Predicted labels.

    Returns:
        NDArray[np.float64]: Recall values for each class.
    """
    cm, _ = confusion_matrix(y_true, y_pred)
    return np.diag(cm) / (np.sum(cm, axis=1) + 1e-7)

### F1-Score
The F1-Score is the harmonic mean of precision and recall.

\begin{align*}
\text{F1-Score} = 2 \times \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}

In [308]:
def f1_score(y_true: pd.Series, y_pred: List[str]) -> NDArray[np.float64]:
    """
    Calculate F1-score for each class.

    Args:
        y_true (pd.Series): True labels.
        y_pred (List[str]): Predicted labels.

    Returns:
        NDArray[np.float64]: F1-scores for each class.
    """
    prec = precision(y_true, y_pred)
    rec = recall(y_true, y_pred)
    return 2 * (prec * rec) / (prec + rec + 1e-7)
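
A short numerical illustration of why the harmonic mean is used: with the made-up values below (perfect precision but very low recall), the arithmetic mean still looks moderate, while the F1-score stays close to the weaker of the two.

```python
precision_value, recall_value = 1.0, 0.1  # illustrative values

arithmetic_mean = (precision_value + recall_value) / 2
f1 = 2 * precision_value * recall_value / (precision_value + recall_value)

print(f"Arithmetic mean: {arithmetic_mean:.4f}")
print(f"F1-score:        {f1:.4f}")
```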

In [311]:
def evaluate(y_true: pd.Series, y_pred: List[str],
             class_names: List[str] = None) -> Tuple[float, float, float, float, NDArray[np.int64]]:
    """
    Calculate evaluation metrics including accuracy, precision, recall, and F1-score for each class.

    Args:
        y_true (pd.Series): True labels.
        y_pred (List[str]): Predicted labels.
        class_names (List[str], optional): List of class names. Defaults to None.

    Returns:
        Tuple:
        - float: Overall accuracy.
        - float: Average precision.
        - float: Average recall.
        - float: Average F1-score.
        - NDArray[np.int64]: Confusion matrix.
    """
    cm, class_names = confusion_matrix(y_true, y_pred, class_names)
    acc = accuracy(y_true, y_pred)
    prec = precision(y_true, y_pred)
    rec = recall(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    # print("Class\tPrecision\tRecall\tF1-Score")
    # for i, class_name in enumerate(class_names):
    #     print(f"{class_name}\t{prec[i]:.4f}\t\t{rec[i]:.4f}\t{f1[i]:.4f}")
    return acc, np.mean(prec), np.mean(rec), np.mean(f1), cm

## 9. Encapsulation
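
One possible way to encapsulate the steps above is a small class with `fit` and `predict` methods. The sketch below is an assumption about how that could look, not the notebook's own implementation: the class name `SimpleMultinomialNB`, its attributes, and the toy data are all hypothetical, and it works on plain NumPy arrays rather than DataFrames.

```python
import numpy as np


class SimpleMultinomialNB:
    """Hypothetical wrapper; name and method layout are assumptions."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing parameter

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)  # sorted class labels
        # Per-class word counts, one row per class
        counts = np.array([X[y == c].sum(axis=0) for c in self.classes_])
        self.log_priors_ = np.log(
            np.array([(y == c).mean() for c in self.classes_]))
        # (count + alpha) / (total words in class + |V| * alpha)
        smoothed = counts + self.alpha
        self.log_likelihoods_ = np.log(
            smoothed / smoothed.sum(axis=1, keepdims=True))
        return self

    def predict(self, X):
        # (n_docs x n_words) @ (n_words x n_classes) + log priors
        scores = np.asarray(X) @ self.log_likelihoods_.T + self.log_priors_
        return self.classes_[scores.argmax(axis=1)]


# Toy counts over a two-word vocabulary, for illustration only
X_toy = np.array([[2, 0], [3, 1], [0, 2], [1, 3]])
y_toy = np.array(["spam", "spam", "ham", "ham"])

toy_model = SimpleMultinomialNB().fit(X_toy, y_toy)
preds = toy_model.predict(np.array([[4, 0], [0, 4]]))
print(preds)
```

Separating `fit` from `predict` also makes it possible to train on one subset of the data and evaluate on another.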

## 10. Comparison with Scikit-Learn

In [310]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X, y)

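predictions = model.predict(X)
# Separate name so the accuracy() function defined above is not shadowed
# Note: X is also the training data, so this is training accuracy
sk_accuracy = model.score(X, y)
print(f'Predictions: {predictions}')
print(f'Accuracy: {sk_accuracy:.4f}')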

Predictions: ['ham' 'ham' 'spam' ... 'ham' 'ham' 'ham']
Accuracy: 0.9892
