# Naive Bayes Classifier from Scratch
***

In [1]:
import numpy as np
import pandas as pd

## 1. Introduction
Naive Bayes classifiers are probabilistic classification models based on Bayes' Theorem. They assume that the features are conditionally independent given the class label. Naive Bayes is a general framework; choose the variant that matches the nature of your data:

- **Multinomial Naive Bayes**: Assumes features follow a multinomial distribution; suited to **discrete** features, typically counts (e.g., word frequencies).

- **Gaussian Naive Bayes**: Assumes features follow a Gaussian (normal) distribution; used for **continuous** features. Fits the model by estimating the mean and standard deviation of each feature within each class.

- **Bernoulli Naive Bayes**: Works with **binary** features (e.g., True/False, 0/1).


### Bayes' Theorem
Bayes' theorem describes the probability of a class $C$ given a set of features $X = (x_{1}, x_{2},\ldots,x_{n})$:

\begin{align*}
P(C|X) = \dfrac{P(X|C) \cdot P(C)}{P(X)}
\end{align*}

where:
- $P(C|X)$: Posterior probability of class $C$ given features $X$.
- $P(X|C)$: Likelihood of features $X$ given class $C$.
- $P(C)$: Prior probability of class $C$.
- $P(X)$: Probability of features $X$ (acts as a normalising constant).

Naive Bayes assumes features $X = (x_{1}, x_{2},\ldots,x_{n})$ are conditionally independent given the class $C$, thus the likelihood is expressed as:
\begin{align*}
P(X|C) = P(x_{1}, x_{2}, \dots, x_{n}|C) = \prod_{i=1}^{n}P(x_{i}|C)
\end{align*}

Replacing $P(X|C)$ in Bayes' theorem, the equation becomes:

\begin{align*}
P(C|X) = \dfrac{P(C) \cdot \prod_{i=1}^{n} P(x_{i}|C)}{P(X)}
\end{align*}

Since $P(X)$ is constant for all classes,

\begin{align*}
P(C|X) \propto P(C) \cdot \prod_{i=1}^{n} P(x_{i}|C)
\end{align*}

The symbol $\propto$ denotes proportionality, meaning we ignore the denominator $P(X)$ when comparing probabilities across classes.
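As a quick numeric check of the proportionality argument, the sketch below uses made-up unnormalised scores (hypothetical values, not taken from the dataset) and shows that dividing by $P(X)$ rescales the scores without changing which class scores highest.

```python
# Hypothetical unnormalised scores P(C) * prod_i P(x_i | C) for two classes.
scores = {"Yes": 0.009, "No": 0.021}

# P(X) is the sum of the unnormalised scores over all classes.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

# The winning class is the same before and after normalisation.
best_unnormalised = max(scores, key=scores.get)
best_normalised = max(posteriors, key=posteriors.get)
print(posteriors)
print(best_unnormalised == best_normalised)
```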

## 2. Loading Data

In [2]:
df = pd.read_csv('../_datasets/weather_forecast.csv')
df.head()

Unnamed: 0,Outlook,Temperature,Humidity,Windy,Play
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes


In [3]:
X = df.drop('Play', axis=1)
y = df['Play']

## 3. Prior Probability
Class $C$ (`y`) takes only two discrete values, `Yes` and `No`:

\begin{align*}
P(C=\text{'Yes'}) = \dfrac{\text{Count(Yes)}}{\text{Total Count}}
\end{align*}

\begin{align*}
P(C=\text{'No'}) = \dfrac{\text{Count(No)}}{\text{Total Count}}
\end{align*}


In [4]:
print(f'Total count: {len(df)}')
print(f'Counts: {y.value_counts().to_dict()}')

Total count: 14
Counts: {'Yes': 9, 'No': 5}


\begin{align*}
P(\text{'Yes'}) = \dfrac{9}{14} = 0.6429
\end{align*}

\begin{align*}
P(\text{'No'}) = \dfrac{5}{14} = 0.3571
\end{align*}

In [5]:
def calculate_priors(y):
    return y.value_counts(normalize=True).to_dict()

In [6]:
calculate_priors(y)

{'Yes': 0.6428571428571429, 'No': 0.35714285714285715}

## 4. Likelihood

The likelihood quantifies how well a parameter $\theta$ explains the observed data. It is defined as:

\begin{align*}
\mathcal{L}(\theta|x) = f(x|\theta)
\end{align*}

where $f$ is the probability density/mass function.

For each feature value and class, we calculate:

\begin{align*}
P(\text{Feature = value|Class})
\end{align*}

For example:

\begin{align*}
P(\text{Outlook = 'Sunny'|Play = 'Yes'}) = \dfrac{\text{Count(Outlook = 'Sunny', Play = 'Yes')} + \alpha}{\text{Count(Play = 'Yes')} + k \cdot \alpha}
\end{align*}

where $k$ is the number of distinct values the feature can take (3 for `Outlook`) and $\alpha$ is the smoothing parameter that prevents zero probabilities (**Laplace Smoothing** when $\alpha = 1$).
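The smoothing formula can be checked by hand with a small helper. This is a minimal sketch: `smoothed_likelihood` and its parameter names are hypothetical, and the raw counts (2 Sunny rows among the 9 `Yes` rows, 0 Overcast rows among the 5 `No` rows) are assumptions based on the classic play-tennis data, since the notebook does not print them directly.

```python
def smoothed_likelihood(count, class_total, n_values, alpha=1.0):
    """Smoothed estimate of P(feature = value | class).

    n_values is the number of distinct values the feature can take.
    """
    return (count + alpha) / (class_total + n_values * alpha)

# Outlook has 3 distinct values (Sunny, Overcast, Rain).
# Assumed counts: 2 of the 9 'Yes' rows are Sunny; 0 of the 5 'No' rows are Overcast.
p_sunny_yes = smoothed_likelihood(count=2, class_total=9, n_values=3)
p_overcast_no = smoothed_likelihood(count=0, class_total=5, n_values=3)
print(p_sunny_yes)    # 0.25
print(p_overcast_no)  # 0.125
```

Without smoothing ($\alpha = 0$), the second estimate would be exactly zero and would wipe out the whole product of likelihoods for class `No`.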

In [7]:
def calculate_likelihoods(X, y, alpha=1.0):
    likelihoods = {}
    for feature in X.columns:  # For each column of X
        likelihoods[feature] = {}
        # Unique feature values in each column
        unique_features = X[feature].unique()

        for c in y.unique():  # Unique target values of y
            class_subset = X[y == c]
            total = len(class_subset)  # Count(C)
            
            # Count frequencies (e.g., {'Sunny':3, 'Rain':2} for class 'No')
            value_counts = class_subset[feature].value_counts()

            # Include every feature value, even those absent from this class subset
            value_counts = value_counts.reindex(unique_features, fill_value=0)
            probas = (value_counts + alpha) / (total + len(value_counts) * alpha)

            likelihoods[feature][c] = probas.to_dict()
    return likelihoods


In [8]:
calculate_likelihoods(X, y)

{'Outlook': {'No': {'Sunny': 0.5, 'Overcast': 0.125, 'Rain': 0.375},
  'Yes': {'Sunny': 0.25,
   'Overcast': 0.4166666666666667,
   'Rain': 0.3333333333333333}},
 'Temperature': {'No': {'Hot': 0.375, 'Mild': 0.375, 'Cool': 0.25},
  'Yes': {'Hot': 0.25,
   'Mild': 0.4166666666666667,
   'Cool': 0.3333333333333333}},
 'Humidity': {'No': {'High': 0.7142857142857143, 'Normal': 0.2857142857142857},
  'Yes': {'High': 0.36363636363636365, 'Normal': 0.6363636363636364}},
 'Windy': {'No': {'Weak': 0.42857142857142855, 'Strong': 0.5714285714285714},
  'Yes': {'Weak': 0.6363636363636364, 'Strong': 0.36363636363636365}}}

## 5. Posterior Probability
As discussed [above](#1-introduction), the posterior probability satisfies:


\begin{align*}
P(C|X) \propto P(C) \prod_{i=1}^{n} P(x_{i}|C)
\end{align*}

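To prevent numerical underflow when multiplying many small probabilities, we work with log probabilities:

\begin{align*}
\log P(C|X) = \log P(C) + \sum_{i=1}^{n} \log P(x_{i}|C) + \text{const}
\end{align*}

The constant is $-\log P(X)$, which is the same for every class and can be ignored when comparing classes.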
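To illustrate why logs help, the sketch below multiplies 500 small probabilities directly and compares that with summing their logs. The values are hypothetical and chosen only to trigger underflow in 64-bit floats.

```python
import numpy as np

# 500 hypothetical per-feature likelihoods, each equal to 1e-3.
probs = np.full(500, 1e-3)

# The true product is 1e-1500, far below the smallest positive float64.
direct_product = np.prod(probs)

# The log of the product is 500 * log(1e-3), which is easily representable.
log_sum = np.sum(np.log(probs))

print(direct_product)  # 0.0
print(log_sum)
```

The direct product collapses to zero, so every class would tie; the log-sum stays finite and comparable across classes.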

In the following code, `.get(category, 1e-9)` looks up the probability of `category` in the likelihood dictionary for the given feature and class. Because `calculate_likelihoods` reindexes over every value seen in training, the lookup only fails when the category never appeared in the training data at all. In that case it returns a very small default value (`1e-9`) instead of raising a `KeyError`, which also avoids taking `log(0)`.

In [9]:
def calculate_posterior(x, priors, likelihoods, X_columns, classes):
    log_posteriors = {}
    for c in classes:
        log_proba = np.log(priors[c])  # Start from the log prior
        for feature in X_columns:  # Add the log-likelihood of each feature value given c
            category = x[feature]
            # Fall back to a tiny probability for unseen categories to avoid log(0)
            proba = likelihoods[feature][c].get(category, 1e-9)
            log_proba += np.log(proba)
        log_posteriors[c] = log_proba
    return log_posteriors # log-posterior probabilities for all classes

In [10]:
calculate_posterior(X.iloc[0], calculate_priors(y), calculate_likelihoods(X, y), X.columns, y.unique())

{'No': np.float64(-3.8873659477612463), 'Yes': np.float64(-4.6780075099403575)}
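The log-posteriors above are unnormalised, so they are scores, not probabilities. One way to turn them into a prediction and into probabilities that sum to 1 is sketched below. The helper `normalise_log_posteriors` is a hypothetical name, not part of the notebook, and the input values are copied from the output above.

```python
import numpy as np

# Log-posteriors printed above for the first row (Sunny, Hot, High, Weak).
log_posteriors = {"No": -3.8873659477612463, "Yes": -4.6780075099403575}

def normalise_log_posteriors(log_posteriors):
    """Turn unnormalised log-posteriors into probabilities that sum to 1."""
    classes = list(log_posteriors)
    logs = np.array([log_posteriors[c] for c in classes])
    # Subtract the maximum before exponentiating for numerical stability.
    shifted = np.exp(logs - logs.max())
    probs = shifted / shifted.sum()
    return dict(zip(classes, probs.tolist()))

probs = normalise_log_posteriors(log_posteriors)

# The predicted class is the one with the largest log-posterior.
prediction = max(log_posteriors, key=log_posteriors.get)
print(probs)
print(prediction)  # No
```

The prediction only needs the arg-max of the log-posteriors; the normalisation step is useful when calibrated-looking class probabilities are wanted for reporting.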