# Naive Bayes Algorithm

Given some features $X=(x_1,x_2,\dots,x_n)$ and a class label $Y \in \{C_1,C_2,\dots,C_K\}$, we want to compute $P(Y=C_k \mid X)$ for every class $C_k$ and choose the class with the highest probability.

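- Probabilistic classifier: it outputs a probability for each class.
- Generative model (like a GMM): it models $P(X \mid Y)$ and $P(Y)$ rather than $P(Y \mid X)$ directly.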

Everything starts from Bayes’ theorem:

$$
\Large \begin{align*}P(Y|X) = \frac{P(X|Y) P(Y)}{P(X)} \end{align*}
$$

- $P(Y)$ → Prior (how common the class is).
- $P(X|Y)$ → Likelihood (how likely the features are given the class).
- $P(X)$ → Evidence (normalization constant). It is the same for every class because it is computed by summing over all classes: $P(X) = \sum_j P(X \mid Y=C_j)\, P(Y=C_j)$.

$$
\begin{align*}P(Y|X) \propto P(X|Y) P(Y) \end{align*}
$$

$$
\begin{align*} \hat{Y} = \arg\max_Y P(X|Y) P(Y)\end{align*}
$$
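As a small numeric illustration of the formulas above, the sketch below uses made-up priors and likelihoods for three classes (the numbers are hypothetical, not taken from any dataset). It shows that dividing by the evidence rescales every class by the same constant, so the winning class does not change.

```python
import numpy as np

# Hypothetical numbers for a 3-class problem and one fixed observation X.
prior = np.array([0.5, 0.3, 0.2])          # P(Y = C_j)
likelihood = np.array([0.10, 0.40, 0.20])  # P(X | Y = C_j)

unnormalized = likelihood * prior          # numerator of Bayes' theorem
evidence = unnormalized.sum()              # P(X) = sum_j P(X | C_j) P(C_j)
posterior = unnormalized / evidence        # P(Y = C_j | X), sums to 1

# The evidence is one shared constant, so the best class is the same
# before and after normalization.
print(posterior)
print(np.argmax(unnormalized), np.argmax(posterior))
```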

Modelling the full joint likelihood $P(X|Y) = P(x_1,x_2,…,x_n|Y)$ directly is impractical: the number of parameters to estimate grows exponentially with the number of features.

Naive Bayes therefore assumes that all features are conditionally independent given the class. Mathematically:

$$
\begin{align*}P(x_1,x_2,...,x_n|Y) = \prod_{i=1}^{n} P(x_i|Y)\end{align*}
$$
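To get a feel for how much the factorization saves, here is a rough parameter count. It assumes binary features purely for illustration; the helper names `full_joint_params` and `naive_bayes_params` are hypothetical.

```python
def full_joint_params(n_features: int, n_classes: int) -> int:
    """Free parameters of an unrestricted P(x_1..x_n | Y) over binary features."""
    # 2**n possible feature vectors per class, minus 1 because they sum to 1.
    return n_classes * (2 ** n_features - 1)


def naive_bayes_params(n_features: int, n_classes: int) -> int:
    """Free parameters when each binary feature has its own P(x_i = 1 | Y)."""
    return n_classes * n_features


for n in (5, 10, 20, 30):
    print(n, full_joint_params(n, 2), naive_bayes_params(n, 2))
```

The first count doubles with every added feature, while the second grows linearly.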

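This assumption is almost always false in real data. Words in a sentence are correlated. Pixels in an image are correlated. Medical symptoms are correlated. We call the assumption naive because it is an unrealistically strong independence assumption. Substituting it into the proportional form of Bayes' theorem, and then taking logarithms (which turns the product into a sum without changing the argmax), gives: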

$$
\begin{align*}P(Y|x_1,x_2,...,x_n)\propto P(Y) \prod_{i=1}^{n} P(x_i|Y)\end{align*}
$$

$$
\hat{Y}=\arg\max_{Y}\left[\log P(Y)+\sum_{i=1}^{n} \log P(x_i \mid Y)\right]
$$
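One practical reason to prefer the log form is numerical underflow: a product of many small probabilities quickly rounds to zero in floating point. The sketch below uses randomly generated, hypothetical per-feature likelihoods to show the difference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-feature likelihoods P(x_i | Y) for 2 classes and 2000 features.
per_feature = rng.uniform(0.01, 0.2, size=(2, 2000))
prior = np.array([0.5, 0.5])

# Direct product: every factor is at most 0.2, so the result underflows to 0.0.
direct = prior * per_feature.prod(axis=1)

# Log-space score: a sum of logs stays well within floating-point range.
log_score = np.log(prior) + np.log(per_feature).sum(axis=1)

print(direct)
print(log_score, np.argmax(log_score))
```

With the direct product both classes collapse to zero and cannot be compared, whereas the log scores remain finite and can still be ranked.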

How do we estimate the probabilities?

Priors can be estimated using the following formula:

$$
\begin{align*}P(Y=C_k)=\frac{\text{Number of Samples in the Class } C_k}{\text{Total Number of Samples}}\end{align*}
$$

# Gaussian Naive Bayes (continuous features)

$$
\begin{align*}
\Large x_i \mid Y = \Large C_k  &\Large\sim \mathcal{N}\!\left(\mu_{ik}, \sigma_{ik}^2\right)\\
\Large P(x_i \mid Y = C_k)&=\Large \frac{1}{\sqrt{2\pi\sigma_{ik}^2}}\exp\!\left(-\frac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right)\end{align*}
$$

In Gaussian Naive Bayes, we assume that each feature follows a class-conditional Gaussian distribution. Writing down the likelihood of the observed data, taking its log, and maximising it with respect to the parameters yields the sample mean and the MLE (biased, divide-by-$N$) sample variance of each feature within each class.
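The maximisation step can be sketched as follows. The notation is introduced here for illustration only: $N_k$ is the number of training samples in class $C_k$, and $x_i^{(j)}$ is the value of feature $i$ in the $j$-th such sample. The log-likelihood of feature $i$ within class $C_k$ is

$$
\begin{align*}
\ell(\mu_{ik}, \sigma_{ik}^2) &= \sum_{j=1}^{N_k} \log P\!\left(x_i^{(j)} \mid Y = C_k\right)\\
&= -\frac{N_k}{2}\log\!\left(2\pi\sigma_{ik}^2\right) - \frac{1}{2\sigma_{ik}^2}\sum_{j=1}^{N_k}\left(x_i^{(j)} - \mu_{ik}\right)^2
\end{align*}
$$

Setting $\partial \ell / \partial \mu_{ik} = 0$ and $\partial \ell / \partial \sigma_{ik}^2 = 0$ gives

$$
\begin{align*}
\hat{\mu}_{ik} &= \frac{1}{N_k}\sum_{j=1}^{N_k} x_i^{(j)}\\
\hat{\sigma}_{ik}^2 &= \frac{1}{N_k}\sum_{j=1}^{N_k}\left(x_i^{(j)} - \hat{\mu}_{ik}\right)^2
\end{align*}
$$

The variance estimate divides by $N_k$ rather than $N_k - 1$, which matches NumPy's default `var` (with `ddof=0`).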

In [None]:
import numpy as np

# Toy dataset: features are continuous
X_train = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [6.2, 3.4],
    [5.9, 3.0],
    [7.0, 3.2],
    [6.4, 3.2]
])

y_train = np.array([0, 0, 1, 1, 2, 2])  # 3 classes

# Query point
x_query = np.array([6.0, 3.0])

# -------------------------------
# Step 1: Compute class priors
# -------------------------------
classes = np.unique(y_train)
priors = {c: np.mean(y_train == c) for c in classes}
priors

In [None]:
# -------------------------------
# Step 2: Compute class-wise mean & variance (MLE)
# -------------------------------
mean_var = {}
for c in classes:
    X_c = X_train[y_train == c]
    mean_var[c] = (X_c.mean(axis=0), X_c.var(axis=0))

mean_var

In [None]:
# -------------------------------
# Step 3: Gaussian likelihood
# -------------------------------
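def gaussian_likelihood(x, mean, var):
    eps = 1e-8  # numerical stability: keeps the variance strictly positive
    var = var + eps  # smooth once so the coefficient and exponent use the same variance
    coeff = 1 / np.sqrt(2 * np.pi * var)
    exponent = np.exp(-(x - mean) ** 2 / (2 * var))
    return coeff * exponent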

# Compute posterior for each class
posteriors = {}
for c in classes:
    mean, var = mean_var[c]
    likelihood = np.prod(gaussian_likelihood(x_query, mean, var))
    posteriors[c] = priors[c] * likelihood

# Predicted class
pred_class = max(posteriors, key=posteriors.get)
pred_class, posteriors
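The values computed above are unnormalized posteriors. If actual probabilities are wanted, one option is to score in log space and normalize with the log-sum-exp trick. The following is a minimal sketch on the same toy data, repeated so that the block runs on its own; the helper name `gaussian_log_likelihood` is an assumption of this sketch.

```python
import numpy as np

# Same toy data as above, repeated so the block is self-contained.
X_train = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 3.4],
                    [5.9, 3.0], [7.0, 3.2], [6.4, 3.2]])
y_train = np.array([0, 0, 1, 1, 2, 2])
x_query = np.array([6.0, 3.0])


def gaussian_log_likelihood(x, mean, var, eps=1e-8):
    """Element-wise log N(x; mean, var); eps guards against zero variance."""
    var = var + eps
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)


classes = np.unique(y_train)
log_scores = []
for c in classes:
    X_c = X_train[y_train == c]
    log_prior = np.log(np.mean(y_train == c))
    log_lik = gaussian_log_likelihood(x_query, X_c.mean(axis=0), X_c.var(axis=0)).sum()
    log_scores.append(log_prior + log_lik)
log_scores = np.array(log_scores)

# Log-sum-exp trick: subtract the max before exponentiating, then normalize.
shifted = np.exp(log_scores - log_scores.max())
posterior = shifted / shifted.sum()
print(dict(zip(classes.tolist(), posterior)))
```

Note that class 2 has zero variance in the second feature on this tiny dataset, so its score is dominated by the `eps` term; with so few samples per class the probabilities should not be read too literally.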

# Bernoulli Naive Bayes (binary features)

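$$
\Large \begin{align*}P(x_i \mid Y = C_k)&=p_{ik}^{x_i}\,(1 - p_{ik})^{1 - x_i}\end{align*}
$$

Here $x_i \in \{0, 1\}$ and $p_{ik} = P(x_i = 1 \mid Y = C_k)$ is the probability that feature $i$ is present in class $C_k$.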

Naive Bayes naturally supports mixed feature types: you use a different likelihood model for each feature and multiply them all together. The features contribute independent likelihoods, each with its own distribution.

Use Bernoulli Naive Bayes when features represent binary presence/absence, and repeated occurrences carry no additional meaning.  
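As a rough illustration of the presence/absence view, the sketch below binarizes hypothetical word counts and evaluates a Bernoulli log-likelihood for one class. The counts and the probabilities `p` are made up for the example.

```python
import numpy as np

# Hypothetical word-count vectors for two documents over a 4-word vocabulary.
counts = np.array([[3, 0, 1, 0],
                   [1, 0, 7, 2]])

# Presence/absence encoding: any positive count becomes 1,
# so a word seen 3 times and a word seen once look identical.
binary = (counts > 0).astype(int)

# Hypothetical P(x_i = 1 | class) for a single class.
p = np.array([0.8, 0.1, 0.5, 0.3])

# Bernoulli log-likelihood: absent features contribute log(1 - p_i) as well.
log_lik = (binary * np.log(p) + (1 - binary) * np.log(1 - p)).sum(axis=1)
print(binary)
print(log_lik)
```

Unlike a count-based model, the Bernoulli likelihood explicitly penalizes the absence of a feature that is common in the class.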

In [18]:
data = np.array([
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1]
])

In [21]:
# Feature-wise P(x_i = 1) with Laplace smoothing; the parentheses matter here
feature_probability = (data.sum(axis=0) + 1) / (data.shape[0] + 2)
feature_probability

array([0.5       , 0.66666667, 0.5       ])

In [22]:
x_query_bin = np.array([1, 0, 1])
x_query_bin

array([1, 0, 1])

In [23]:
feature_probability*x_query_bin

array([0.5, 0. , 0.5])

In [15]:
import numpy as np
# Toy binary dataset
X_train_bin = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1]
])
y_train_bin = np.array([0, 1, 1, 0])

# Query point
x_query_bin = np.array([1, 0, 1])

# Step 1: Priors
classes = np.unique(y_train_bin)
priors = {c: np.mean(y_train_bin == c) for c in classes}

# Step 2: Likelihood (Bernoulli)
likelihoods = {}
for c in classes:
    X_c = X_train_bin[y_train_bin == c]
    # Feature-wise probability with Laplace smoothing
    feature_prob = (X_c.sum(axis=0) + 1) / (X_c.shape[0] + 2)
    # Bernoulli likelihood
    likelihood = np.prod(feature_prob ** x_query_bin * (1 - feature_prob) ** (1 - x_query_bin))
    likelihoods[c] = priors[c] * likelihood

# Prediction
pred_class = max(likelihoods, key=likelihoods.get)
pred_class, likelihoods


(np.int64(0),
 {np.int64(0): np.float64(0.140625), np.int64(1): np.float64(0.015625)})

# Multinomial Naive Bayes (text, counts)

$$
\begin{align*}
\Large P(x_i \mid Y)&=\Large \frac{\text{count of word } i \text{ in class } Y + \alpha}{\text{total words in class } Y + \alpha V}\end{align*}
$$

Here $\alpha$ is the smoothing pseudocount and $V$ is the vocabulary size (the number of distinct words).

To understand Multinomial Naive Bayes better, consider the following table (pet lovers):

|  | Cat | Dog | Rabbit | Hamster | Fish | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Class 1  | 18 | 20 | 6 | 4 | 2 | 50 |
| Class 2 | 15 | 15 | 10 | 5 | 5 | 50 |
| Class 3 | 17 | 18 | 8 | 4 | 3 | 50 |

The above table is a hypothetical dataset recording how many students prefer a particular animal as a pet. Each row can be viewed as a random vector drawn from a multinomial distribution. For instance, the first row $(18,20,6,4,2)$ can be viewed as a random draw from the multinomial distribution

 $\Large M_5(n=50;p_1,p_2,…,p_5)$

The second and third rows can be viewed in the same way. In Multinomial Naive Bayes, each class $C_k$ has its own probability vector, written $\theta_{ik}$ below.
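To see what draws from $M_5$ look like, the sketch below samples a few rows with NumPy. The probability vector `p` is hypothetical (chosen only to resemble the table) and is not estimated from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical category probabilities for (Cat, Dog, Rabbit, Hamster, Fish).
p = np.array([0.34, 0.36, 0.16, 0.08, 0.06])

# Each row is one draw: n = 50 students spread over the 5 categories.
draws = rng.multinomial(n=50, pvals=p, size=3)
print(draws)
print(draws.sum(axis=1))  # every row sums to n
```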

The PMF of a multinomial distribution has a simple form. 

$$
\Large\begin{align*}x = (x_1,x_2,...,x_V)\end{align*}
$$

$$
\Large \begin{align*}\sum_i x_i = N\end{align*}
$$

$$
\Large p(X=x) = p(x|Y=C_k) = \frac{N!}{\prod_{i}^{}{x_i!}}\prod_{i}^{} \theta_{ik}^{x_i}
$$

Here, $\theta_{ik}$ is the probability of category (word) $i$ under class $C_k$; it plays the role of $p_i$ in the multinomial above.

$N$ = Total words in the document

$x_i$ = Count of word $i$


The combinatorial term $\frac{N!}{\prod_i x_i !}$ counts how many distinct sequences of length $N$ produce the same bag-of-words counts $x$. It depends only on the observed document, not on the class.

From training data, we estimate the class-specific word probabilities $\theta_{ik}$. The combinatorial term, by contrast, is identical no matter which class you test against, so dropping it does not change the ranking of classes: it cancels out when computing the argmax over classes.
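The cancellation can be checked numerically. The sketch below scores a document with and without the combinatorial term; `theta`, the priors, and the query counts are hypothetical values chosen for illustration.

```python
import numpy as np
from math import lgamma


def log_multinomial_coeff(x):
    """log(N! / prod_i x_i!), computed with log-gamma to avoid huge factorials."""
    n = int(np.sum(x))
    return lgamma(n + 1) - sum(lgamma(int(xi) + 1) for xi in x)


# Hypothetical class-conditional word probabilities theta[k, i] (rows sum to 1).
theta = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.3, 0.5]])
log_prior = np.log(np.array([0.5, 0.5]))
x = np.array([4, 1, 0])  # word counts of the query document

# Score per class without the combinatorial term.
without_coeff = log_prior + x @ np.log(theta).T
# Adding the term shifts every class by the same constant.
with_coeff = without_coeff + log_multinomial_coeff(x)

print(without_coeff, with_coeff)
print(np.argmax(without_coeff), np.argmax(with_coeff))
```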

$$
\Large \begin{align*}\hat{y} = \arg\max_k \left( P(C_k) \prod_i \theta_{ik}^{x_i} \right)\end{align*}
$$

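- Scoring is done in log space, $\log P(C_k) + \sum_i x_i \log \theta_{ik}$, which avoids numerical underflow.
- Multinomial NB is a linear classifier in count space: the log score is linear in the counts $x_i$, with weights $\log \theta_{ik}$ and bias $\log P(C_k)$.
- Laplace smoothing adds a pseudocount $\alpha$ to each feature count, ensuring $P(x_i|y)>0$ and preventing the likelihood from collapsing to zero.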
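The three points above can be sketched together on the pet table. The names `W` and `b` are chosen here to emphasize the linear form, and the uniform prior reflects the fact that the toy table has one row per class.

```python
import numpy as np

# Class-wise counts from the pet table (rows: classes, columns: animals).
counts = np.array([[18, 20, 6, 4, 2],
                   [15, 15, 10, 5, 5],
                   [17, 18, 8, 4, 3]])
alpha = 1.0                    # Laplace pseudocount
V = counts.shape[1]            # vocabulary size (number of categories)

# Smoothed class-conditional probabilities theta[k, i].
theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * V)

W = np.log(theta)              # weight matrix, shape (n_classes, V)
b = np.log(np.full(3, 1 / 3))  # log priors, assumed uniform for this toy table

x = np.array([1, 2, 0, 0, 0])  # query counts
scores = W @ x + b             # log-space score, linear in the count vector x
print(scores, np.argmax(scores))
```

Exponentiating `scores` recovers the unnormalized posteriors that a probability-space computation would give, while the scores themselves never risk underflow.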

In [24]:
# Toy dataset: counts of 5 animals in 3 classes
X_train_counts = np.array([
    [18, 20, 6, 4, 2],   # Class 0
    [15, 15, 10, 5, 5],  # Class 1
    [17, 18, 8, 4, 3]    # Class 2
])
y_train_counts = np.array([0, 1, 2])

# Query document: counts of each animal
x_query_counts = np.array([1, 2, 0, 0, 0])

# Step 1: Priors
classes = np.unique(y_train_counts)
priors = {c: np.mean(y_train_counts == c) for c in classes}

# Step 2: Class-specific word probabilities with Laplace smoothing
alpha = 1  # pseudocount
likelihoods = {}
for c in classes:
    X_c = X_train_counts[y_train_counts == c]
    # Total word counts in class
    total_count = X_c.sum()
    # Probability for each word
    probs = (X_c.sum(axis=0) + alpha) / (total_count + alpha * X_c.shape[1])
    # Multinomial likelihood (ignore combinatorial term)
    likelihood = np.prod(probs ** x_query_counts)
    likelihoods[c] = priors[c] * likelihood

# Prediction
pred_class = max(likelihoods, key=likelihoods.get)
pred_class, likelihoods


(np.int64(0),
 {np.int64(0): np.float64(0.016787377911344853),
  np.int64(1): np.float64(0.008206361131980965),
  np.int64(2): np.float64(0.01301878287002254)})

# Mixed Data Types

In [7]:
import numpy as np

np.random.seed(42)

# -------------------------------
# Toy dataset
# -------------------------------
# 2 continuous features, 3 binary features, 2 count features
X_train = np.array([
    [5.1, 3.5, 1, 0, 1, 10, 2],
    [4.9, 3.0, 0, 1, 0, 5, 1],
    [6.2, 3.4, 1, 1, 0, 8, 3],
    [5.9, 3.0, 0, 0, 1, 7, 2],
    [7.0, 3.2, 1, 0, 0, 12, 4],
    [6.4, 3.2, 0, 1, 1, 9, 3]
])

# Binary labels (can be multi-class)
y_train = np.array([0, 0, 1, 1, 1, 1])

# Query point
x_query = np.array([6.0, 3.1, 1, 0, 0, 6, 2])

# Feature type indices
cont_idx = [0, 1]
bin_idx = [2, 3, 4]
count_idx = [5, 6]

# -------------------------------
# Step 1: Compute class priors
# -------------------------------
classes = np.unique(y_train)
priors = {c: np.mean(y_train == c) for c in classes}

# -------------------------------
# Step 2: Compute likelihoods
# -------------------------------

def gaussian_likelihood(x, mean, var):
    eps = 1e-8
    coeff = 1 / np.sqrt(2 * np.pi * var + eps)
    exponent = np.exp(- (x - mean) ** 2 / (2 * var + eps))
    return coeff * exponent

alpha = 1  # Laplace smoothing for Bernoulli / Multinomial

posteriors = {}
for c in classes:
    X_c = X_train[y_train == c]
    
    # Continuous features (Gaussian)
    mean_c = X_c[:, cont_idx].mean(axis=0)
    var_c = X_c[:, cont_idx].var(axis=0)
    cont_likelihood = np.prod(gaussian_likelihood(x_query[cont_idx], mean_c, var_c))
    
    # Binary features (Bernoulli)
    X_bin_c = X_c[:, bin_idx]
    feature_prob = (X_bin_c.sum(axis=0) + alpha) / (X_bin_c.shape[0] + 2 * alpha)
    x_bin = x_query[bin_idx]
    bern_likelihood = np.prod(feature_prob ** x_bin * (1 - feature_prob) ** (1 - x_bin))
    
    # Count features (Multinomial)
    X_count_c = X_c[:, count_idx]
    total_count = X_count_c.sum()
    probs = (X_count_c.sum(axis=0) + alpha) / (total_count + alpha * len(count_idx))
    count_likelihood = np.prod(probs ** x_query[count_idx])
    
    # Posterior = prior * all likelihoods
    posteriors[c] = priors[c] * cont_likelihood * bern_likelihood * count_likelihood

# -------------------------------
# Step 3: Prediction
# -------------------------------
pred_class = max(posteriors, key=posteriors.get)

pred_class, posteriors


(np.int64(1),
 {np.int64(0): np.float64(4.481078130176555e-25),
  np.int64(1): np.float64(0.0013051366304736981)})

In [25]:
# note : related topic : later reference 
# kernel density estimation 

# custom implementation: 

In [52]:
import numpy as np

class NaiveBayes:
    def __init__(self, X, y, smoothing, feature_types):
        self.X = np.array(X)
        self.y = np.array(y)
        self.laplace = smoothing
        self.feature_types = feature_types
        self.classes = np.unique(y)
        self.class_priors = {}
        self.parameters = {}
        self._fit()

    def _fit(self):
        n_samples = len(self.y)
        # Compute class priors
        self.class_priors = {c: np.sum(self.y == c)/n_samples for c in self.classes}
        
        self.parameters = {c: {} for c in self.classes}

        for c in self.classes:
            X_c = self.X[self.y == c]  # rows (samples) that belong to class c
            for idx in range(self.X.shape[1]): # each column/feature
                ftype = self.feature_types[idx]
                values = X_c[:, idx]

                if ftype == 'continuous':
                    mean = np.mean(values.astype(float))
                    std = np.std(values.astype(float)) + 1e-6  # avoid zero std
                    self.parameters[c][idx] = {'type':'continuous', 'mean':mean, 'std':std}

                elif ftype == 'level':
                    unique_vals, counts = np.unique(values, return_counts=True)
                    total = np.sum(counts)
                    probs = {v: (counts[i] + self.laplace)/(total + self.laplace*len(unique_vals))
                             for i, v in enumerate(unique_vals)}
                    self.parameters[c][idx] = {'type':'level', 'probs':probs, 'levels':unique_vals}

                elif ftype == 'count':
                    # Treat as multinomial/count: probability proportional to value
                    total = np.sum(values.astype(float))
                    self.parameters[c][idx] = {'type':'count', 'total':total, 'values':values.astype(float)}
            

    def _gaussian_likelihood(self, x, mean, std):
        exponent = -0.5 * ((x - mean)/std)**2
        return (1 / (np.sqrt(2*np.pi) * std)) * np.exp(exponent)

    def predict(self, X_test):
        X_test = np.array(X_test)
        y_pred = []
        for sample in X_test:
            log_probs = {}
            for c in self.classes:
                log_prob = np.log(self.class_priors[c])
                for idx, x_i in enumerate(sample):
                    param = self.parameters[c][idx]
                    ftype = param['type']

                    if ftype == 'continuous':
                        mean = param['mean']
                        std = param['std']
                        likelihood = self._gaussian_likelihood(float(x_i), mean, std)
                        log_prob += np.log(likelihood + 1e-9)  # avoid log(0)

                    elif ftype == 'level':
                        probs = param['probs']
                        log_prob += np.log(probs.get(x_i, self.laplace*1e-3))  # unseen level handling

                    elif ftype == 'count':
                        total = param['total']
                        prob = (float(x_i) + self.laplace) / (total + self.laplace*len(param['values']))
                        log_prob += np.log(prob + 1e-9)

                log_probs[c] = log_prob
            y_pred.append(max(log_probs, key=log_probs.get))
        return np.array(y_pred)
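One thing to watch in the `'level'` branch of `_fit`: `unique_vals` is computed from the rows of a single class, so a level that never occurs in that class gets no pseudocount and falls back to the small constant used in `predict`. A possible alternative, sketched below under the assumption that the set of levels is taken from the full training column, smooths over every known level. The helper name `smoothed_level_probs` is hypothetical and is not part of the class above.

```python
import numpy as np


def smoothed_level_probs(values_in_class, all_levels, alpha=1.0):
    """Laplace-smoothed P(level | class) over every level seen in training.

    `all_levels` is assumed to come from the full training column,
    not only from the rows of one class.
    """
    values_in_class = np.asarray(values_in_class)
    total = len(values_in_class)
    return {
        level: (np.sum(values_in_class == level) + alpha) / (total + alpha * len(all_levels))
        for level in all_levels
    }


# Hypothetical column: this class only ever saw 'yes', but 'no' exists elsewhere.
probs = smoothed_level_probs(['yes'] * 5, all_levels=['no', 'yes'], alpha=1.0)
print(probs)
```

With this variant the unseen level keeps a small but properly normalized probability, and the probabilities over all levels still sum to one.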


In [53]:
x = np.array([
    [23.1, 74, 'yes', 'small', 1],
    [18.5, 65, 'no', 'medium', 0],
    [30.2, 80, 'yes', 'large', 1],
    [25.0, 70, 'no', 'small', 0],
    [20.1, 68, 'yes', 'medium', 1],
    [22.3, 72, 'no', 'large', 0],
    [19.8, 66, 'yes', 'small', 1],
    [24.5, 75, 'no', 'medium', 0],
    [21.0, 69, 'yes', 'large', 1],
    [23.9, 73, 'no', 'medium', 0],
    [26.1, 78, 'yes', 'small', 1],
    [18.9, 64, 'no', 'large', 0],
    [29.5, 79, 'yes', 'medium', 1],
    [22.0, 71, 'no', 'small', 0],
    [20.5, 67, 'yes', 'medium', 1]
])
y = np.array([0, 2, 0, 1, 0, 1, 0, 1, 2, 1, 1, 2, 0, 1, 1])
feature_types = ['continuous','continuous','level','level','level']


In [54]:
nb = NaiveBayes(x,y,smoothing=1,feature_types=feature_types)

In [55]:
nb._fit()  # redundant: __init__ already calls _fit(); this just recomputes the same parameters

In [56]:
nb.parameters

{np.int64(0): {0: {'type': 'continuous',
   'mean': np.float64(24.54),
   'std': np.float64(4.492038399666213)},
  1: {'type': 'continuous',
   'mean': np.float64(73.4),
   'std': np.float64(5.642695391866353)},
  2: {'type': 'level',
   'probs': {np.str_('yes'): np.float64(1.0)},
   'levels': array(['yes'], dtype='<U32')},
  3: {'type': 'level',
   'probs': {np.str_('large'): np.float64(0.25),
    np.str_('medium'): np.float64(0.375),
    np.str_('small'): np.float64(0.375)},
   'levels': array(['large', 'medium', 'small'], dtype='<U32')},
  4: {'type': 'level',
   'probs': {np.str_('1'): np.float64(1.0)},
   'levels': array(['1'], dtype='<U32')}},
 np.int64(1): {0: {'type': 'continuous',
   'mean': np.float64(23.471428571428568),
   'std': np.float64(1.8069038637930912)},
  1: {'type': 'continuous',
   'mean': np.float64(72.28571428571429),
   'std': np.float64(3.282608226593159)},
  2: {'type': 'level',
   'probs': {np.str_('no'): np.float64(0.6666666666666666),
    np.str_('yes'): 