# Naive Bayes

Naive Bayes applied to a text classification problem:

Let $\vec{X} = [X_1, X_2, ..., X_d]$ be the token frequencies in a text $t$, where $X_i$ is the count of the $i$-th vocabulary token. Let $C = \{c_1, c_2, ..., c_k\}$ be the set of classes.
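
As a minimal sketch of how such a frequency vector might be built, the snippet below counts tokens against a fixed vocabulary. The vocabulary, the whitespace tokenizer, and the helper name `count_vector` are illustrative assumptions and not part of the original notebook.

```python
from collections import Counter


def count_vector(text, vocabulary):
    """Return the token counts of `text`, ordered like `vocabulary` (hypothetical helper)."""
    # Lowercase and split on whitespace; a real tokenizer would be more careful.
    counts = Counter(text.lower().split())
    # Tokens outside the vocabulary are ignored; missing tokens count as 0.
    return [counts[token] for token in vocabulary]


vocabulary = ["good", "bad", "movie"]
x = count_vector("Good movie good plot", vocabulary)
print(x)  # [2, 0, 1]
```

Each text then becomes one row of the feature matrix, with one column per vocabulary token.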

The goal is to compute the probability that a text with features $\vec{X}$ belongs to class $c_i$:

$p(c_i | \vec{X}) = \frac{p(c_i)p(\vec{X} | c_i)}{p(\vec{X})}$ (i.e., Bayes' rule)

Since $p(\vec{X})$ is the same for every class, it acts as a constant and can be dropped, that is,

$\implies p(c_i | \vec{X}) \propto  p(c_i)p(\vec{X} | c_i) = p(c_i)p(X_1=x_1, X_2=x_2, ..., X_{d}=x_{d} | c_i)$

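Here, we make two assumptions:
1. Conditional independence: given the class, the features (tokens) are independent, that is, $p(X_i=x_i, X_j=x_j| c) = p(X_i=x_i | c)p(X_j=x_j | c)$
2. Positional independence: the probability of a token does not depend on where it appears in the text, that is, $p(X_1=x_1 |c) = p(X_2=x_1 | c)$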

$\implies p(c_i | \vec{X}) \propto p(c_i)\prod_{1 \leq j \leq d}p(X_j = x_j |c_i)$

This is the naive Bayes classification rule. If we are only interested in classification, we can classify a new sample point
$\vec{X^{new}}$ by taking $\text{argmax}_{c} p(c | \vec{X^{new}})$.
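
The rule can be sketched on a tiny example. The priors and per-token probabilities below are made-up numbers for illustration, and the sketch assumes the multinomial form $p(c)\prod_i \theta_{ci}^{x_i}$ that the implementation in this section uses.

```python
import numpy as np

# Hypothetical parameters for 2 classes and 3 tokens (each row of theta sums to 1).
prior = np.array([0.5, 0.5])
theta = np.array([[0.6, 0.1, 0.3],
                  [0.1, 0.6, 0.3]])

x_new = np.array([2, 0, 1])  # token counts of the new sample

# Unnormalized posterior for each class: p(c) * prod_i theta[c, i] ** x_i
scores = prior * np.prod(theta ** x_new, axis=1)

# Dividing by the sum recovers p(c | x), but it does not change the argmax.
posterior = scores / scores.sum()

predicted = int(np.argmax(scores))
print(predicted)  # 0
```

Because the normalizing constant is shared by all classes, `np.argmax(scores)` and `np.argmax(posterior)` always agree.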

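Multiplying many probabilities can cause floating-point underflow. Because the logarithm is a monotonically increasing function, taking logs does not change which class has the highest probability, so we can maximize the log posterior instead:

$\implies \hat{c} = \text{argmax}_{c} \log p(c | \vec{X^{new}}) = \text{argmax}_{c} \left[\log p(c) + \sum_{1 \leq j \leq d} \log p(X_j = x_j^{new} | c)\right]$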
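
A short illustration of the underflow problem, using 2000 hypothetical token probabilities of 0.01 each (the numbers are arbitrary and chosen only to trigger the effect):

```python
import numpy as np

# 2000 hypothetical per-token probabilities, each equal to 0.01.
probs = np.full(2000, 0.01)

# The true product is 1e-4000, far below the smallest positive float64,
# so the direct product underflows to exactly 0.0.
direct = np.prod(probs)

# The sum of logs is about -9210.34, which float64 represents without trouble.
log_sum = np.sum(np.log(probs))

print(direct, log_sum)
```

Once every class score underflows to zero, the argmax is meaningless, whereas the log scores remain comparable.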

Now assume we have observed $N$ sample pairs $(\vec{X_j}, c_j)$. We can then estimate the model parameters $\theta$ by maximum likelihood (MLE) or maximum a posteriori (MAP) estimation, where MAP additionally places a prior $p(\theta)$ on the parameters:

$\theta_{MLE} = \text{argmax}_{\theta} \log(\prod_{j=1}^{N} p(c_j) \prod_{1 \leq i \leq d}p(X_{ji} = x_{ji} |c_j))$

$\theta_{MAP} = \text{argmax}_{\theta} \log(p(\theta) \prod_{j=1}^{N} p(c_j) \prod_{1 \leq i \leq d}p(X_{ji} = x_{ji} |c_j))$


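Then, assuming a multinomial distribution for $p(X_i=x_i | c)$ and adding a smoothing term $\alpha$ (which corresponds to a symmetric Dirichlet prior on $\theta$), we obtain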

$\implies \theta_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha N_d}$

$\implies p(c) = \frac{NT_{c}}{NT}$, where $NT_{c}$ is the number of samples in class $c$ and $NT$ is the total number of samples in the dataset


In the expression for $\theta_{ci}$:
1. $N_{ci}$ is the number of times feature $i$ appears in class $c$
2. $N_{c}$ is the total count of all features in class $c$
3. $N_d$ is the number of features (the vocabulary size)
4. $\alpha$ is a smoothing parameter that prevents zero probabilities; the case $\alpha = 1$ is called Laplace smoothing
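
To make the estimates concrete, here is a minimal sketch that evaluates both formulas on a three-sample toy dataset with $\alpha = 1$. The variable names mirror the symbols above; the data is illustrative only.

```python
import numpy as np

# Rows are samples, columns are token counts.
X = np.array([[1, 0, 1], [0, 1, 1], [0, 1, 1]])
y = np.array([1, 2, 2])
alpha = 1.0  # Laplace smoothing

# Class prior: NT_c / NT
classes, class_counts = np.unique(y, return_counts=True)
prior = class_counts / y.size

# N_ci: count of feature i within class c, shape (n_classes, n_features)
N_ci = np.array([X[y == c].sum(axis=0) for c in classes])
# N_c: total count of all features within class c, shape (n_classes, 1)
N_c = N_ci.sum(axis=1, keepdims=True)
# N_d: number of features
N_d = X.shape[1]

theta = (N_ci + alpha) / (N_c + alpha * N_d)

print(prior)  # class 1: 1/3, class 2: 2/3
print(theta)  # class 1: [2/5, 1/5, 2/5], class 2: [1/7, 3/7, 3/7]
```

Each row of `theta` sums to 1, and no entry is zero even though token 2 never occurs in class 1 and token 1 never occurs in class 2.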

In [41]:
import pandas as pd
import numpy as np


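class NaiveBayes:
    """Multinomial naive Bayes with additive smoothing (expects numeric class labels)."""

    def __init__(self, alpha=1):

        self.alpha = alpha        # smoothing parameter (alpha = 1 is Laplace smoothing)
        self.theta = None         # log p(feature i | class c), shape (n_classes, n_features)
        self.class_prior = None   # rows of [class label, p(class)]
        self.class_map = None     # row index in theta -> class label

    def fit(self, X, y):

        if isinstance(X, pd.DataFrame):
            X = X.values

        if isinstance(y, pd.Series):
            y = y.values

        c_ = np.unique(y)
        n_c = c_.size
        NT, N_d = X.shape

        self.theta = np.zeros((n_c, N_d))
        # Column 0 holds the class labels, column 1 the class counts,
        # which are then normalized to the priors NT_c / NT.
        self.class_prior = np.array(np.unique(y, return_counts=True), dtype=np.float64).T
        self.class_prior[:, 1] = self.class_prior[:, 1] / NT
        self.class_map = {}

        for index, c in enumerate(c_):

            # Total count of all features in class c.
            N_c = X[y == c].sum()
            self.class_map[index] = c

            for i in range(N_d):

                # Count of feature i in class c, stored as a smoothed log probability.
                N_ci = X[y == c, i].sum()
                self.theta[index, i] = np.log((N_ci + self.alpha) / (N_c + self.alpha * N_d))

        return self

    def predict(self, X):

        if isinstance(X, pd.DataFrame):
            X = X.values

        pred_results = np.zeros(X.shape[0])
        class_size = len(self.class_map)

        for index, i in enumerate(X):

            # Log score of each class for sample i.
            temp_lst = np.zeros(class_size)

            for c in range(class_size):

                # Start from the log prior, then add x_d * log p(feature d | class c).
                temp_lst[c] = np.log(self.class_prior[:, 1][c])

                for d in range(X.shape[1]):

                    temp_lst[c] = temp_lst[c] + self.theta[c][d] * i[d]

            # Map the best-scoring row index back to its class label.
            pred_results[index] = self.class_map[np.argmax(temp_lst)]

        return pred_results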




In [42]:
a = np.array([[1, 0, 1], [0, 1, 1], [0, 1, 1]])
y = np.array([1, 2, 2])
clf = NaiveBayes().fit(a, y)

In [43]:
from sklearn.datasets import load_iris
from sklearn.naive_bayes import MultinomialNB
X, y = load_iris(return_X_y=True)

In [44]:
from sklearn.metrics import accuracy_score
clf = NaiveBayes().fit(X, y)
skl_clf = MultinomialNB().fit(X, y)
clf.predict(X)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       2., 1., 2., 1., 2., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 2.,
       2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
       2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 1., 2., 1., 2., 1., 2., 2.,
       2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.])

In [50]:
accuracy_score(y, skl_clf.predict(X))

0.9533333333333334

In [51]:
accuracy_score(y, clf.predict(X))

0.9533333333333334

array([-1.09861229, -1.09861229, -1.09861229])