# Naive Bayes

Naive Bayes applied to a text classification problem:

Let $X_1, X_2, ..., X_{t_d}$ be the tokens in a text $t$, where $t_d$ is the number of tokens in $t$. Let $C = \{c_1, c_2, ..., c_k\}$ be the set of classes.

The goal is to compute the probability that text $t$ belongs to class $c$:

$p(c | t) = \frac{p(c)p(t | c)}{p(t)}$ (i.e., Bayes' rule)

Since $p(t)$ is the same for every class, it acts as a constant when classes are compared, that is,

$\implies p(c | t) \propto  p(c)p(t | c) = p(c)p(X_1=x_1, X_2=x_2, ..., X_{t_d}=x_{t_d} | c)$
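To illustrate why dropping $p(t)$ is harmless, here is a minimal sketch with made-up numbers. The class names and probabilities are hypothetical and chosen only for illustration. Dividing by $p(t)$ rescales every class by the same factor, so the winning class is unchanged.

```python
# Hypothetical priors p(c) and likelihoods p(t | c) for a two-class problem.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {"spam": 0.01, "ham": 0.002}

# Unnormalized posterior: p(c) * p(t | c)
unnormalized = {c: prior[c] * likelihood[c] for c in prior}

# p(t) is the sum of the unnormalized scores over all classes.
evidence = sum(unnormalized.values())
posterior = {c: score / evidence for c, score in unnormalized.items()}

# The best class is the same with or without normalization.
best_unnormalized = max(unnormalized, key=unnormalized.get)
best_posterior = max(posterior, key=posterior.get)
```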

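To make $p(t | c)$ tractable, we make two assumptions:
1. Conditional independence: given the class, the features (tokens) are independent, that is, $p(X_i=x_i, X_j=x_j | c) = p(X_i=x_i | c)p(X_j=x_j | c)$ for $i \neq j$
2. Positional independence: the probability of a token does not depend on its position in the text, that is, $p(X_1=x_1 | c) = p(X_2=x_1 | c)$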

$\implies p(c | t) \propto p(c)\prod_{1 \leq i \leq t_d}p(X_i = x_i | c)$

$\implies \text{posterior} \propto \text{prior} \times \text{likelihood}$
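Because of positional independence, only the number of times each token occurs matters, so the product over token positions can equivalently be written as a product over the vocabulary with counts as exponents. The sketch below checks this on a tiny example; the vocabulary and the probabilities in `p_token` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical per-token probabilities p(x | c) for a single class c.
p_token = {"good": 0.5, "movie": 0.3, "bad": 0.2}
tokens = ["good", "movie", "good"]

# Product over token positions, one factor per token in the text.
by_position = math.prod(p_token[tok] for tok in tokens)

# Equivalent product over the vocabulary, with token counts as exponents.
# Counter returns 0 for words that do not occur, so their factor is 1.
counts = Counter(tokens)
by_count = math.prod(p_token[word] ** counts[word] for word in p_token)
```

This count-based form is why a text can be represented as a vector of token counts (a bag of words) without losing anything the model uses.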

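We can apply maximum a posteriori (MAP) estimation to find the class $c$, that is, choose the class with the highest posterior probability.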

Multiplying many probabilities can cause floating-point underflow. Because the logarithm is a monotonically increasing function, taking the log of the posterior does not change which class has the highest probability.

That is, we apply the logarithm to the posterior:

$\implies c_{map} = \text{argmax}_{c} [\log (p(c)) + \sum_{i=1}^{t_d}\log(p(X_i=x_i | c))]$
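The underflow problem is easy to reproduce. The following sketch uses hypothetical likelihoods (400 tokens, each with probability $10^{-3}$) and compares the direct product with the sum of logs.

```python
import math

# Hypothetical per-token likelihoods: 400 tokens, each with probability 1e-3.
token_probs = [1e-3] * 400

# Direct product: the true value is 1e-1200, far below the smallest
# positive double, so the result underflows to exactly 0.0.
product = 1.0
for p in token_probs:
    product *= p

# Sum of logs: stays a finite, perfectly representable number.
log_sum = sum(math.log(p) for p in token_probs)
```

Once the product collapses to `0.0` for every class, the classes can no longer be ranked, whereas the log scores remain comparable.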

Where,
1. $t_d$ is the number of tokens in text $t$
2. $p(X_i=x_i | c)$ follows a multinomial distribution
3. $p(c) = \frac{NT_{c}}{NT}$, where $NT_{c}$ is the number of samples in class $c$ and $NT$ is the total number of samples in the dataset

$\implies \theta_{ci} = \hat{p}(X_i=x_i|c) = \frac{N_{ci} + \alpha}{N_c + \alpha N_d}$

Where,
1. $N_{ci}$ is the number of times feature $i$ (a vocabulary entry) appears in the samples of class $c$
2. $N_{c}$ is the total count of all features in class $c$
3. $N_d$ is the number of features (the vocabulary size)
4. $\alpha$ is a smoothing parameter that prevents zero probabilities; the case $\alpha = 1$ is called Laplace smoothing
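As a small worked example of the smoothed estimate, the sketch below applies the formula to hypothetical feature counts for one class. The helper name `smoothed_likelihoods` is introduced here for illustration only.

```python
def smoothed_likelihoods(counts, alpha=1.0):
    """Return (N_ci + alpha) / (N_c + alpha * N_d) for each feature i."""
    n_c = sum(counts)   # N_c: total count of all features in the class
    n_d = len(counts)   # N_d: number of features
    return [(n_ci + alpha) / (n_c + alpha * n_d) for n_ci in counts]


# Hypothetical counts of three features in one class; feature 1 never occurs.
counts_in_class = [2, 0, 1]

theta = smoothed_likelihoods(counts_in_class)                  # alpha = 1
unsmoothed = smoothed_likelihoods(counts_in_class, alpha=0.0)  # no smoothing
```

With $\alpha = 1$ the estimates are $3/6$, $1/6$ and $2/6$, which still sum to one. Without smoothing, the unseen feature gets probability zero, which would make the log-likelihood of any text containing it $-\infty$.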

In [41]:
import pandas as pd
import numpy as np


class NaiveBayes:
    """Multinomial naive Bayes with additive smoothing."""

    def __init__(self, alpha=1):
        self.alpha = alpha        # smoothing parameter
        self.theta = None         # log p(x_i | c), shape (n_classes, n_features)
        self.class_prior = None   # rows of (class label, p(c))
        self.class_map = None     # row index -> class label

    def fit(self, X, y):
        # Accept pandas inputs by converting them to NumPy arrays.
        if isinstance(X, pd.DataFrame):
            X = X.values
        if isinstance(y, pd.Series):
            y = y.values

        # Class labels must be numeric because they are stored in a float array.
        classes, counts = np.unique(y, return_counts=True)
        NT, N_d = X.shape

        # p(c) = NT_c / NT
        self.class_prior = np.column_stack((classes, counts / NT)).astype(np.float64)
        self.class_map = dict(enumerate(classes))
        self.theta = np.zeros((classes.size, N_d))

        for index, c in enumerate(classes):
            N_ci = X[y == c].sum(axis=0)   # per-feature counts in class c
            N_c = N_ci.sum()               # total count of all features in class c
            # Smoothed log-likelihood: log((N_ci + alpha) / (N_c + alpha * N_d))
            self.theta[index] = np.log((N_ci + self.alpha) / (N_c + self.alpha * N_d))

        return self

    def predict(self, X):
        if isinstance(X, pd.DataFrame):
            X = X.values

        # log p(c) + sum_i x_i * log p(x_i | c), for all samples and classes at once
        scores = np.log(self.class_prior[:, 1]) + X @ self.theta.T
        best = np.argmax(scores, axis=1)

        # Map row indices back to the original class labels (returned as floats).
        return np.array([self.class_map[k] for k in best], dtype=np.float64)




In [42]:
a = np.array([[1, 0, 1], [0, 1, 1], [0, 1, 1]])
y = np.array([1, 2, 2])
clf = NaiveBayes().fit(a, y)
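As a sanity check on the toy data above, the same quantities can be recomputed directly from the formulas, independently of the class. This is only a sketch: the names `log_prior`, `log_theta`, `scores` and `predicted` are introduced here and are not attributes of `NaiveBayes`.

```python
import numpy as np

a = np.array([[1, 0, 1], [0, 1, 1], [0, 1, 1]])
y = np.array([1, 2, 2])
alpha = 1

classes, counts = np.unique(y, return_counts=True)
log_prior = np.log(counts / y.size)              # log p(c) = log(NT_c / NT)

# Per-class feature counts N_ci, shape (n_classes, n_features).
N_ci = np.array([a[y == c].sum(axis=0) for c in classes])
N_c = N_ci.sum(axis=1, keepdims=True)            # total count per class
log_theta = np.log((N_ci + alpha) / (N_c + alpha * a.shape[1]))

# Log-posterior score (up to a constant) of every sample under every class.
scores = log_prior + a @ log_theta.T
predicted = classes[np.argmax(scores, axis=1)]   # classes 1, 2, 2 on this data
```

For class 1 the smoothed likelihoods are $2/5$, $1/5$ and $2/5$; for class 2 they are $1/7$, $3/7$ and $3/7$.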

In [43]:
from sklearn.datasets import load_iris
from sklearn.naive_bayes import MultinomialNB
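# Iris features are non-negative measurements rather than token counts;
# they are used here only to compare against scikit-learn.
X, y = load_iris(return_X_y=True)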

In [44]:
from sklearn.metrics import accuracy_score
clf = NaiveBayes().fit(X, y)
skl_clf = MultinomialNB().fit(X, y)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       2., 1., 2., 1., 2., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 2.,
       2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
       2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 1., 2., 1., 2., 1., 2., 2.,
       2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.])

In [50]:
accuracy_score(y, skl_clf.predict(X))

0.9533333333333334

In [51]:
accuracy_score(y, clf.predict(X))

0.9533333333333334

array([-1.09861229, -1.09861229, -1.09861229])