# Bernoulli Naive Bayes Classifier from Scratch
***
## Table of Contents
1. [Introduction](#1-introduction)
    - [Bayes' Theorem](#bayes-theorem)
2. [Loading Data](#2-loading-data)
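3. [Prior Probability](#3-prior-probability)
4. [Likelihood for Bernoulli NB](#4-likelihood-for-bernoulli-nb)
5. [Posterior Probability for Bernoulli NB](#5-posterior-probability-for-bernoulli-nb)
6. [Prediction](#6-prediction)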
***

In [366]:
import numpy as np
import pandas as pd
from typing import Tuple, List, Dict
from numpy.typing import NDArray
from sklearn.preprocessing import LabelEncoder

## 1. Introduction
Naive Bayes classifiers are probabilistic classification models based on Bayes' Theorem. They assume that the features are conditionally independent given the class label. Naive Bayes is a general framework; the specific variant should be chosen based on the nature of the data:

- **Categorical Naive Bayes**

    - **Features**: Categorical labels (e.g., colours, countries, product types).

    - **Use Case**: Classification with discrete, categorically distributed features.

- **Multinomial Naive Bayes**

    - **Features**: Counts or frequencies (e.g., word occurrences, event counts).

    - **Use Case**: Text classification, document classification, or any scenario where features are discrete counts.

- **Gaussian Naive Bayes**

    - **Features**: Continuous data (e.g., measurements, sensor readings).

    - **Use Case**: Classification with numerical features assumed to follow a Gaussian distribution.

- **Bernoulli Naive Bayes**

    - **Features**: Binary features (e.g., True/False, 0/1).

    - **Use Case**: Text classification (presence/absence of words), binary feature spaces.



### Bayes' Theorem
Bayes' theorem describes the probability of a class $C_{i}$ given a set of features $X = (x_{1}, x_{2},\ldots,x_{N})$:

\begin{align*}
P(C_{i}|X) = \dfrac{P(X|C_{i}) \cdot P(C_{i})}{P(X)}
\end{align*}

where:
- $P(C_{i}|X)$: Posterior probability of class $C_{i}$ given features $X$.
- $P(X|C_{i})$: Likelihood of features $X$ given class $C_{i}$.
- $P(C_{i})$: Prior probability of class $C_{i}$.
- $P(X)$: Probability of features $X$ (acts as a normalising constant).

Like every Naive Bayes variant, Bernoulli Naive Bayes assumes the features $X = (x_{1}, x_{2},\ldots,x_{N})$ are conditionally independent given the class $C_{i}$, so the likelihood factorises as:

\begin{align*}
P(X|C_{i}) = P(x_{1}, x_{2}, \dots, x_{N}|C_{i}) = \prod_{j=1}^{N}P(x_{j}|C_{i})
\end{align*}

Substituting this expression for $P(X|C_{i})$ into Bayes' theorem gives:

\begin{align*}
P(C_{i}|X) = \dfrac{P(C_{i}) \cdot \prod_{j=1}^{N} P(x_{j}|C_{i})}{P(X)}
\end{align*}

Since $P(X)$ is constant for all classes,

\begin{align*}
P(C_{i}|X) \propto P(C_{i}) \cdot \prod_{j=1}^{N} P(x_{j}|C_{i})
\end{align*}

The symbol $\propto$ denotes proportionality, meaning we ignore the denominator $P(X)$ when comparing probabilities across classes.
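
To make the role of $P(X)$ concrete, here is a minimal sketch with made-up numbers (the priors and likelihoods below are illustrative assumptions, not values from the dataset). It suggests that dividing by $P(X)$ rescales the scores so they sum to 1 but does not change which class scores highest.

```python
import numpy as np

# Hypothetical values for two classes, chosen only for illustration.
priors = np.array([0.6, 0.4])          # P(C_0), P(C_1)
likelihoods = np.array([0.02, 0.05])   # P(X|C_0), P(X|C_1)

# Unnormalised scores: P(C_i) * P(X|C_i)
scores = priors * likelihoods

# P(X) is the sum of the scores over all classes (law of total probability).
evidence = scores.sum()
posteriors = scores / evidence

# Normalising rescales the scores but keeps their ranking.
best_unnormalised = int(np.argmax(scores))
best_normalised = int(np.argmax(posteriors))
```

Both `best_unnormalised` and `best_normalised` point to the same class, which is why the denominator can be skipped when only the predicted label is needed.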

## 2. Loading Data
The dataset was retrieved from [Kaggle - Simple Weather Forecast](https://www.kaggle.com/datasets/dheemanthbhat/simple-weather-forecast?select=weather_forecast.csv).

In [367]:
df = pd.read_csv('../_datasets/weather_forecast.csv')
df.head()

Unnamed: 0,Outlook,Temperature,Humidity,Windy,Play
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes


In [368]:
X = df.drop('Play', axis=1)
y = df['Play']

Bernoulli Naive Bayes requires features to have binary values, so we need to one-hot encode all categorical features.

In [369]:
X_binary = pd.get_dummies(X, drop_first=False)
X_binary.head()

Unnamed: 0,Outlook_Overcast,Outlook_Rain,Outlook_Sunny,Temperature_Cool,Temperature_Hot,Temperature_Mild,Humidity_High,Humidity_Normal,Windy_Strong,Windy_Weak
0,False,False,True,False,True,False,True,False,False,True
1,False,False,True,False,True,False,True,False,True,False
2,True,False,False,False,True,False,True,False,False,True
3,False,True,False,False,False,True,True,False,False,True
4,False,True,False,True,False,False,False,True,False,True


We will also convert `y` into a binary format.

In [370]:
y_binary = pd.Series(np.where(y == 'Yes', 1, 0), name='Play')
y_binary

0     0
1     0
2     1
3     1
4     1
5     0
6     1
7     0
8     1
9     1
10    1
11    1
12    1
13    0
Name: Play, dtype: int64

## 3. Prior Probability
The class $C_{i}$ (`y_binary`) takes one of two values, `1` or `0` (`1` = 'Yes', `0` = 'No' in `y`). The prior of each class is its relative frequency in the training data:

\begin{align*}
P(C_{i}=1) = \dfrac{\text{Count(1)}}{\text{Total Count}}
\end{align*}

\begin{align*}
P(C_{i}=0) = \dfrac{\text{Count(0)}}{\text{Total Count}}
\end{align*}

In [371]:
print(f'Total count: {len(df)}')
print(f'Counts: {y_binary.value_counts().to_dict()}')

Total count: 14
Counts: {1: 9, 0: 5}


\begin{align*}
P(C_{i}=1) = \dfrac{9}{14} \approx 0.6429
\end{align*}

\begin{align*}
P(C_{i}=0) = \dfrac{5}{14} \approx 0.3571
\end{align*}

In [372]:
def calculate_priors(y: pd.Series) -> Dict[int, float]:
    """
    Calculate prior probabilities for each class in the target variable.

    Args:
        y: Target variable containing class labels (here, integers 0/1).

    Returns:
        Prior probabilities for each class.
    """
    return y.value_counts(normalize=True).to_dict()

In [373]:
calculate_priors(y_binary)

{1: 0.6428571428571429, 0: 0.35714285714285715}

## 4. Likelihood for Bernoulli NB
For a feature vector $X = (x_{1}, x_{2},\ldots,x_{N})$ and class $C_{i}$, the likelihood is expressed as:

\begin{align*}
P(X|C_{i}) = P(x_{1}, x_{2}, \dots, x_{N}|C_{i}) = \prod_{j=1}^{N}P(x_{j}|C_{i})
\end{align*}

where each $P(x_{j}|C_{i})$ follows a **Bernoulli distribution**:

\begin{align*}
P(x_{j}|C_{i}) =
  \begin{cases}
    p_{ij}     & \text{if $x_{j} = 1$} \\
    1 - p_{ij} & \text{if $x_{j} = 0$}
  \end{cases}
\end{align*}

Here, $p_{ij}$ is the probability that feature $j$ is $1$ in class $C_{i}$:

\begin{align*}
p_{ij} = \dfrac{\text{Count(${x_{j}}$ = 1|${C_{i}}$)} + \alpha}{\text{Count(${C_{i}}$)} + 2 \alpha}
\end{align*}

where $\alpha$ is the Laplace smoothing parameter, which prevents zero probabilities (default $\alpha = 1$), and the factor $2$ in the denominator is the number of values a binary feature can take.
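
As a worked example of the smoothing formula, the sketch below evaluates it for a single feature and class. The helper name `smoothed_probability` is hypothetical, and the count of zero `Outlook_Overcast = 1` samples in class `0` is an assumption inferred from the rounded likelihoods reported in this notebook (class `0` has 5 samples, as shown in the prior section).

```python
def smoothed_probability(count_1: int, n_class: int, alpha: float = 1.0) -> float:
    """Laplace-smoothed estimate of P(x_j = 1 | C_i) for a binary feature."""
    return (count_1 + alpha) / (n_class + 2 * alpha)

# Class 0 has 5 samples; assume none of them has Outlook_Overcast = 1.
p_overcast_no = smoothed_probability(count_1=0, n_class=5)

# Without smoothing (alpha = 0) the same estimate collapses to exactly 0,
# which would zero out the whole product of likelihoods.
p_unsmoothed = smoothed_probability(count_1=0, n_class=5, alpha=0.0)
```

With $\alpha = 1$ the estimate is $(0 + 1) / (5 + 2) = 1/7 \approx 0.1429$ rather than $0$.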

In [374]:
def calculate_likelihoods(X: pd.DataFrame, y: pd.Series,
                          alpha: float = 1.0) -> Dict[str, Dict[int, Dict[int, float]]]:
    """
    Calculate conditional probabilities for Bernoulli Naive Bayes.

    Args:
        X: Binary feature matrix (DataFrame with 0/1 values)
        y: Target variable (Series of class labels)
        alpha: Smoothing parameter for Laplace smoothing (default=1.0)

    Returns:
        Nested dictionary with structure:
        {feature_name: {class_label: {feature_value: probability}}}
    """
    likelihoods = {}

    for feature in X.columns:
        likelihoods[feature] = {}

        for class_label in y.unique():
            c = int(class_label)
            class_mask = (y == c)
            class_subset = X.loc[class_mask, feature]
            total_in_class = class_mask.sum()  # Number of samples in class

            # Count occurrences of 1s (0s will be total - count_1)
            count_1 = class_subset.sum()
            count_0 = total_in_class - count_1

            # Apply Laplace smoothing for binary features
            # Denominator: total_in_class + 2 * alpha (for two possible values)
            prob_1 = (count_1 + alpha) / (total_in_class + 2 * alpha)
            prob_0 = (count_0 + alpha) / (total_in_class + 2 * alpha)

            # Store probabilities for both values
            likelihoods[feature][c] = {
                0: round(float(prob_0), 4),
                1: round(float(prob_1), 4)
            }

    return likelihoods

In [375]:
calculate_likelihoods(X_binary, y_binary)

{'Outlook_Overcast': {0: {0: 0.8571, 1: 0.1429}, 1: {0: 0.5455, 1: 0.4545}},
 'Outlook_Rain': {0: {0: 0.5714, 1: 0.4286}, 1: {0: 0.6364, 1: 0.3636}},
 'Outlook_Sunny': {0: {0: 0.4286, 1: 0.5714}, 1: {0: 0.7273, 1: 0.2727}},
 'Temperature_Cool': {0: {0: 0.7143, 1: 0.2857}, 1: {0: 0.6364, 1: 0.3636}},
 'Temperature_Hot': {0: {0: 0.5714, 1: 0.4286}, 1: {0: 0.7273, 1: 0.2727}},
 'Temperature_Mild': {0: {0: 0.5714, 1: 0.4286}, 1: {0: 0.5455, 1: 0.4545}},
 'Humidity_High': {0: {0: 0.2857, 1: 0.7143}, 1: {0: 0.6364, 1: 0.3636}},
 'Humidity_Normal': {0: {0: 0.7143, 1: 0.2857}, 1: {0: 0.3636, 1: 0.6364}},
 'Windy_Strong': {0: {0: 0.4286, 1: 0.5714}, 1: {0: 0.6364, 1: 0.3636}},
 'Windy_Weak': {0: {0: 0.5714, 1: 0.4286}, 1: {0: 0.3636, 1: 0.6364}}}

## 5. Posterior Probability for Bernoulli NB
As discussed [above](#1-introduction), the posterior probability satisfies:


\begin{align*}
P(C_{i}|X) \propto P(C_{i}) \prod_{j=1}^{N} P(x_{j}|C_{i})
\end{align*}

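Multiplying many probabilities smaller than 1 can underflow to zero in floating-point arithmetic, so we work with log probabilities instead. Because the logarithm is monotonic, the ranking of the classes is unchanged:

\begin{align*}
\log P(C_{i}|X) = \log P(C_{i}) + \sum_{j=1}^{N} \log P(x_{j}|C_{i}) - \log P(X)
\end{align*}

The term $\log P(X)$ is the same for every class, so it is omitted from the computation.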
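
The underflow problem can be illustrated with a short sketch. The 2000 identical likelihood values are a made-up example chosen to force the effect; real feature spaces of that size are typical in text classification.

```python
import numpy as np

# 2000 hypothetical per-feature likelihoods, each equal to 0.01.
probs = np.full(2000, 0.01)

# Direct product: the true value (1e-4000) is far below the smallest
# positive float64, so the result underflows to 0.0.
direct_product = np.prod(probs)

# Sum of logs: stays a finite number that can still be compared across classes.
log_sum = np.sum(np.log(probs))
```

Here `direct_product` is `0.0`, while `log_sum` remains finite (roughly `-9210`).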

In the following code, `.get(y_value, 1e-9)` looks up the probability of the observed feature value (`0` or `1`) in the likelihood dictionary for the given feature and class. For binary inputs both keys are always present, so the fallback is only a safeguard: if a value outside `{0, 1}` is passed, it returns a very small default (`1e-9`) instead of raising a `KeyError`.

In [376]:
def calculate_posterior(x: Dict[str, int], priors: Dict[int, float],
                        likelihoods: Dict[str, Dict[int, Dict[int, float]]],
                        X_columns: List[str], classes: List[int]) -> Dict[int, float]:
    """
    Calculate log-posterior probabilities for all classes given a sample.

    Args:
        x: Input sample as dictionary {feature: value}.
        priors: Prior probabilities from calculate_priors().
        likelihoods: Conditional probabilities from calculate_likelihoods().
        X_columns: List of feature names.
        classes: List of possible class labels.

    Returns:
        Dictionary mapping each class to its log-posterior probability.
    """

    log_posteriors = {}
    for c in classes:
        log_proba = np.log(priors[c])  # Log of prior
        for feature in X_columns:  # Add the log-likelihood of each feature value given c
            y_value = x[feature]
            # Fall back to a tiny probability if the value is not a known key (0 or 1)
            proba = likelihoods[feature][c].get(y_value, 1e-9)
            log_proba += np.log(proba)
        log_posteriors[int(c)] = round(float(log_proba), 4)
    return log_posteriors  # log-posterior probabilities for all classes

In [377]:
calculate_posterior(X_binary.iloc[0], calculate_priors(
    y_binary), calculate_likelihoods(X_binary, y_binary), X_binary.columns, y_binary.unique())

{0: -6.4139, 1: -8.0838}
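
The log-posteriors above are unnormalised scores. If actual probabilities are wanted, one common option (not part of the original notebook) is to exponentiate and normalise them with the log-sum-exp trick. A minimal sketch, using the two values from the output above:

```python
import numpy as np

# Log-posteriors for the first sample, copied from the output above.
log_posteriors = {0: -6.4139, 1: -8.0838}

classes = list(log_posteriors)
log_scores = np.array([log_posteriors[c] for c in classes])

# Subtract the maximum before exponentiating (log-sum-exp trick) so the
# largest term becomes exp(0) = 1 and nothing underflows.
shifted = np.exp(log_scores - log_scores.max())
probabilities = shifted / shifted.sum()

normalised = {c: float(p) for c, p in zip(classes, probabilities)}
```

For this sample the normalised posterior is roughly 0.84 for class `0` and 0.16 for class `1`, so the predicted label is `0` ('No').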

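## 6. Prediction
For each sample, the predicted class is the one with the highest log-posterior. Note that the `predict` function below estimates the priors and likelihoods from `X` and `y` and then classifies the same rows, so its output reflects performance on the training data.
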
In [378]:
def predict(X: pd.DataFrame, y: pd.Series) -> List[int]:
    """
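    Fit Bernoulli Naive Bayes on (X, y) and predict class labels for the rows of X.

    Args:
        X: Binary feature matrix used for both fitting and prediction.
        y: Target variable used to estimate priors and likelihoods.

    Returns:
        Predicted class labels, one per row of X.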
    """
    priors = calculate_priors(y)
    likelihoods = calculate_likelihoods(X, y)
    classes = y.unique()
    X_columns = X.columns

    predictions = []
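    # to_dict(orient='records') keeps the original column names as keys,
    # whereas itertuples() renames columns that are not valid identifiers.
    for row in X.to_dict(orient='records'):
        posterior = calculate_posterior(
            row, priors, likelihoods, X_columns, classes)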
        predictions.append(max(posterior, key=posterior.get))
    return predictions

In [379]:
predict(X_binary, y_binary)[:10]

[0, 0, 1, 1, 1, 1, 1, 0, 1, 1]
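
The dictionary-based functions above favour readability. As an alternative, the same computation can be sketched with NumPy arrays, using the compact form $\log P(x_{j}|C_{i}) = x_{j} \log p_{ij} + (1 - x_{j}) \log(1 - p_{ij})$. The function names `fit_bernoulli_nb` and `predict_bernoulli_nb` and the toy data are hypothetical, introduced only for this illustration.

```python
import numpy as np

def fit_bernoulli_nb(X: np.ndarray, y: np.ndarray, alpha: float = 1.0):
    """Return classes, log-priors and smoothed P(x_j = 1 | C_i) as arrays."""
    classes = np.unique(y)
    log_priors = np.log(np.array([(y == c).mean() for c in classes]))
    # Row i holds p_ij for class i, with Laplace smoothing.
    p = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                  for c in classes])
    return classes, log_priors, p

def predict_bernoulli_nb(X: np.ndarray, classes, log_priors, p) -> np.ndarray:
    """Pick the class with the highest log-posterior for every row of X."""
    # log P(x|C_i) = sum_j [x_j * log p_ij + (1 - x_j) * log(1 - p_ij)]
    log_likelihood = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T
    return classes[np.argmax(log_likelihood + log_priors, axis=1)]

# Toy binary data, made up for illustration (not the weather dataset).
X_toy = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y_toy = np.array([1, 1, 0, 0])

model = fit_bernoulli_nb(X_toy, y_toy)
predictions = predict_bernoulli_nb(X_toy, *model)
```

On the weather data, the same functions could be called with `X_binary.to_numpy().astype(int)` and `y_binary.to_numpy()`. Because this version does not round the likelihoods to four decimals, its scores may differ slightly from those of the functions above.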