# Multinomial Naive Bayes Classifier

Multinomial Naive Bayes is the Naive Bayes algorithm for multinomially distributed data. It works like Gaussian NB except for how the per-feature likelihood $P(x_i \mid y)$ is estimated. The new equation is:
$$
P(x_i \mid y) = \frac{N_{yi} + \alpha}{N_y + \alpha n} \label{eq1}\tag{1}
$$
Here,
* $\alpha$ is the smoothing parameter,
* $N_{yi}$ is the count of feature $x_i$ in class $y$,
* $N_y$ is the total count of all features in class $y$,
* $n$ is the total number of features.

# Multinomial Naive Bayes

## Multinomial Data
|   $X_1$|$X_2$|$X_3$|
|---|---|---|
|1|0|4|
|4|2|3|

The table above contains 2 samples of 3 features; feature $X_1$ takes the values 1 and 4, and so on. That is the usual view of the data: a general model treats each number as the value of a feature, for example $X_{1,3}=4$. When the data is read as multinomial, however, $X_{1,3}$ says how many times feature $X_3$ occurs in sample 1. In other words, $X_{1,3}$ is not the value of the feature but its count.

Let's consider a text corpus. Each sentence is made up of words $x_i$, and each $x_i$ belongs to the vocabulary $V$. If $V$ contains 8 words, $x_1, x_2, ..., x_8$, and a sentence is $x_1\ x_2\ x_2\ x_6\ x_3\ x_2\ x_8$, the representation of that sentence will be:

|$x_1$|$x_2$|$x_3$|$x_4$|$x_5$|$x_6$|$x_7$|$x_8$|
|---|---|---|---|---|---|---|---|
| 1|3 |1 | 0| 0|1 | 0|1 |
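
The count representation above can be reproduced in a few lines. This is a minimal sketch, not part of the original notebook; the names `vocab`, `sentence`, and `counts` are illustrative.

```python
from collections import Counter

# Vocabulary V with 8 words, written as plain strings x1..x8
vocab = [f"x{i}" for i in range(1, 9)]

# The example sentence from the text
sentence = "x1 x2 x2 x6 x3 x2 x8"

# Count how often each word occurs, then read the counts off in vocabulary order
word_counts = Counter(sentence.split())
counts = [word_counts[w] for w in vocab]

print(counts)  # [1, 3, 1, 0, 0, 1, 0, 1]
```

Words that never occur in the sentence get a count of 0, which is why most entries of a real document vector are zero.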

After adding two more random sentences, the dataset is:

|$x_1$|$x_2$|$x_3$|$x_4$|$x_5$|$x_6$|$x_7$|$x_8$|
|---|---|---|---|---|---|---|---|
| 1|3 |1 | 0| 0|1 | 0|1 |
| 1|0 |0 | 0| 1|1 | 1|3 |
| 0|0 |0 | 0| 0|2 | 1|2 |



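Mapping this dataset onto the terms of equation (1), and supposing for illustration that the first and third sentences belong to class $y=1$ (the labels are not shown in the table):

* $N_{yi}$ is the count of feature $x_i$ within a given class of $y$. For example, for $y=1$, \
$N_{y,1}=1, N_{y,6}=3$
* $N_y$ is the total count of all features within a given class of $y$. For example, for $y=1$, \
$N_y=12$
* $n=8$ is the total number of features.
* $\alpha$ is known as the smoothing parameter. It is needed to avoid the zero-probability problem: without it, a feature that never occurs in a class would get $P(x_i \mid y)=0$ and wipe out the whole product.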
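
As a sanity check, the quantities listed above can be computed with NumPy. This sketch assumes labels that the tables do not show: sentences 1 and 3 are placed in class 1 and sentence 2 in class 0, which is one labeling that reproduces the numbers quoted above. It also contrasts $\alpha=0$ with $\alpha=1$ to show the zero-probability problem.

```python
import numpy as np

# The three-sentence dataset from the table above
X = np.array([
    [1, 3, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 1, 1, 3],
    [0, 0, 0, 0, 0, 2, 1, 2],
])
# ASSUMPTION: labels are not given in the text; this choice is illustrative
y = np.array([1, 0, 1])

rows = X[y == 1]
N_yi = rows.sum(axis=0)   # per-feature counts within class 1
N_y = N_yi.sum()          # total count of all features within class 1
n = X.shape[1]            # number of features

def theta(alpha):
    """Equation (1) for every feature of class 1."""
    return (N_yi + alpha) / (N_y + alpha * n)

print(N_yi, N_y, n)       # [1 3 1 0 0 3 1 3] 12 8
print(theta(0)[3])        # 0.0  -> x4 never occurs in class 1
print(theta(1)[3])        # 0.05 -> smoothing keeps it non-zero
```

With $\alpha=1$ the smoothed probabilities still sum to 1 over the 8 features, since the denominator grows by exactly $\alpha n$.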

To calculate the likelihood of a test sentence $X$ we need $P(X \mid y)$, which is built from the per-feature probabilities $P(x_i \mid y)$ estimated on the training data. But $P(x_i \mid y)$ is the probability of feature $x_i$ appearing once under class $y$. If the test sentence contains feature $x_i$ a total of $c_i$ times, $P(x_i \mid y)$ must enter $P(X \mid y)$ $c_i$ times as well. So the final equation for $P(X \mid y)$ is:
$$
P(X \mid y) = P(x_1 \mid y)^{c_1} \times P(x_2 \mid y)^{c_2} \times ... \times P(x_n \mid y)^{c_n}
$$
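
Including a factor once per occurrence is the same as raising it to the power of the word's count. The sketch below illustrates this with made-up probabilities and counts (`theta` and `counts` are hypothetical values, not taken from the dataset).

```python
import numpy as np

# HYPOTHETICAL per-feature probabilities P(x_i | y) for a 4-word vocabulary
theta = np.array([0.1, 0.2, 0.3, 0.4])

# HYPOTHETICAL test sentence: x2 appears twice, x4 once, the others never
counts = np.array([0, 2, 0, 1])

# Repeat each factor once per occurrence: 0.2 * 0.2 * 0.4
repeated = np.prod(np.repeat(theta, counts))

# Equivalent form: raise each probability to its count
powered = np.prod(theta ** counts)

print(np.isclose(repeated, powered))  # True
```

Features with a count of 0 contribute a factor of 1, so they drop out of the product.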


# Class MultiNB

In [None]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [None]:
class MultiNB:
    def __init__(self,alpha=1):
        self.alpha = alpha

    def _prior(self):
        """
        Calculates prior for each unique class in y. P(y)
        """
        P = np.zeros((self.n_classes_))
        _, self.dist = np.unique(self.y,return_counts=True)
        for i in range(self.classes_.shape[0]):
            P[i] = self.dist[i] / self.n_samples
        return P

    def fit(self, X, y):
        """
        Calculates the following:
            class_priors_: list of priors for each class in y.
            N_yi: 2D array. For each class in y, the number of times each feature i appears under y.
            N_y: 1D array. For each class in y, the total count of all features appearing under y.

        params
        ------
        X: 2D array. shape(n_samples, n_features)
            Multinomial data
        y: 1D array. shape(n_samples,). Labels must be encoded to integers.
        """
        self.y = y
        self.n_samples, self.n_features = X.shape
        self.classes_ = np.unique(y)
        self.n_classes_ = self.classes_.shape[0]
        self.class_priors_ = self._prior()

        # distinct values in each feature (kept for inspection; not used by the model)
        self.uniques = []
        for i in range(self.n_features):
            tmp = np.unique(X[:,i])
            self.uniques.append( tmp )

        self.N_yi = np.zeros((self.n_classes_, self.n_features)) # feature count
        self.N_y = np.zeros((self.n_classes_)) # total count
        for k, c in enumerate(self.classes_): # k: row index in N_yi, c: class label
            indices = np.argwhere(self.y==c).flatten()
            # per-feature counts over all samples of class c
            columnwise_sum = np.sum(X[indices], axis=0)

            # index by position k, not by label c, so labels need not be 0..n_classes-1
            self.N_yi[k] = columnwise_sum # 2d
            self.N_y[k] = np.sum(columnwise_sum) # 1d

    def _theta(self, x_i, i, h):
        """
        Calculates theta_yi, aka P(xi | y), using eqn(1) in the notebook,
        raised to the power of the count x_i.

        params
        ------
        x_i: int.
            count of feature i in the sample.

        i: int.
            feature index.

        h: int.
            index of a class in self.classes_

        returns
        -------
        theta_yi ** x_i: P(xi | y) included x_i times
        """

        Nyi = self.N_yi[h,i]
        Ny  = self.N_y[h]

        numerator = Nyi + self.alpha
        denominator = Ny + (self.alpha * self.n_features)

        return  (numerator / denominator)**x_i

    def _likelihood(self, x, h):
        """
        Calculates P(E|H) = P(E1|H) * P(E2|H) .. * P(En|H).

        params
        ------
        x: array. shape(n_features,)
            a row of data.
        h: int.
            a class in y
        """
        tmp = []
        for i in range(x.shape[0]):
            tmp.append(self._theta(x[i], i,h))

        return np.prod(tmp)

    def predict(self, X):
        samples, features = X.shape
        self.predict_proba = np.zeros((samples,self.n_classes_))

        for i in range(X.shape[0]):
            joint_likelihood = np.zeros((self.n_classes_))

            for h in range(self.n_classes_):
                joint_likelihood[h]  = self.class_priors_[h] * self._likelihood(X[i],h) # P(y) P(X|y)

            denominator = np.sum(joint_likelihood)

            for h in range(self.n_classes_):
                numerator = joint_likelihood[h]
                self.predict_proba[i,h] = (numerator / denominator)

        indices = np.argmax(self.predict_proba,axis=1)
        return self.classes_[indices]
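
A product of many probabilities smaller than 1 can underflow to zero for long documents. A common remedy, sketched below as a standalone function rather than as a change to the class above, is to add logarithms instead of multiplying. The function name `predict_log_space` and its arguments are illustrative; they mirror the `N_yi`, `N_y`, and `class_priors_` arrays computed in `fit`.

```python
import numpy as np

def predict_log_space(X, class_priors, N_yi, N_y, alpha=1.0):
    """Return the index of the most probable class for each row of X.

    Works with log P(y) + sum_i x_i * log P(x_i | y), which has the same
    argmax as P(y) * prod_i P(x_i | y)**x_i but does not underflow.
    Assumes alpha > 0 so that every logarithm is finite.
    """
    n_features = N_yi.shape[1]
    # Equation (1) for every class and feature, in log form
    log_theta = np.log(N_yi + alpha) - np.log(N_y[:, None] + alpha * n_features)
    # (n_samples, n_features) @ (n_features, n_classes) -> (n_samples, n_classes)
    joint_log_likelihood = X @ log_theta.T + np.log(class_priors)
    return np.argmax(joint_log_likelihood, axis=1)

# Toy counts for illustration only: 2 classes, 3 features
N_yi = np.array([[5.0, 1.0, 0.0],
                 [0.0, 2.0, 6.0]])
N_y = N_yi.sum(axis=1)
class_priors = np.array([0.5, 0.5])

X_new = np.array([[3, 0, 0],    # dominated by feature 0
                  [0, 1, 4]])   # dominated by feature 2
print(predict_log_space(X_new, class_priors, N_yi, N_y))  # [0 1]
```

The returned values are positions in the class list, so they would be mapped back to labels with `classes_[indices]`, as `predict` does.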

# Spam Classification

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
df = pd.read_csv("/content/spam.csv",encoding='iso8859_14')
df.drop(labels=df.columns[2:],axis=1,inplace=True)
df.columns=['target','text']

In [None]:
df

| |target|text|
|---|---|---|
|0|ham|Go until jurong point, crazy.. Available only ...|
|1|ham|Ok lar... Joking wif u oni...|
|2|spam|Free entry in 2 a wkly comp to win FA Cup fina...|
|3|ham|U dun say so early hor... U c already then say...|
|4|ham|Nah I don't think he goes to usf, he lives aro...|
|...|...|...|
|5567|spam|This is the 2nd time we have tried 2 contact u...|
|5568|ham|Will Ì_ b going to esplanade fr home?|
|5569|ham|Pity, * was in mood for that. So...any other s...|
|5570|ham|The guy did some bitching but I acted like i'd...|


## Simple Preprocessing


In [None]:
STOPWORDS = set(stopwords.words('english')) # build the lookup set once, not once per word

def clean_util(text):
    # drop punctuation characters, then lowercase and remove English stopwords
    punc_rmv = "".join(char for char in text if char not in string.punctuation)
    stopword_rmv = [w.lower() for w in punc_rmv.split() if w.lower() not in STOPWORDS]

    return " ".join(stopword_rmv)

In [None]:
df['text'] = df['text'].apply(clean_util)

In [None]:
df

| |target|text|
|---|---|---|
|0|ham|go jurong point crazy available bugis n great ...|
|1|ham|ok lar joking wif u oni|
|2|spam|free entry 2 wkly comp win fa cup final tkts 2...|
|3|ham|u dun say early hor u c already say|
|4|ham|nah dont think goes usf lives around though|
|...|...|...|
|5567|spam|2nd time tried 2 contact u u å£750 pound prize...|
|5568|ham|ì b going esplanade fr home|
|5569|ham|pity mood soany suggestions|
|5570|ham|guy bitching acted like id interested buying s...|


# Vectorizing
We now convert the texts to the multinomial count format discussed at the beginning. The classes in y must also be encoded as integers.

In [None]:
cv = CountVectorizer()
X = cv.fit_transform(df['text']).toarray()
lb = LabelBinarizer()
y = lb.fit_transform(df['target']).ravel()
print(X.shape,y.shape)

(5572, 9381) (5572,)
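
To see what `CountVectorizer` and `LabelBinarizer` produce, here is a small sketch on three made-up messages (the messages and labels are invented for illustration and are not rows of the dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer

# Made-up, already-cleaned messages and labels
docs = ["free entry win", "ok lar joking", "win free free prize"]
labels = ["spam", "ham", "spam"]

cv = CountVectorizer()
X = cv.fit_transform(docs).toarray()  # one row per message, one column per word

lb = LabelBinarizer()
y = lb.fit_transform(labels).ravel()  # classes are sorted, so ham -> 0, spam -> 1

print(X.shape)                        # (3, 7)
print(X[2, cv.vocabulary_["free"]])   # 2, "free" occurs twice in the third message
print(y)                              # [1 0 1]
```

Each column index is assigned through `cv.vocabulary_`, which maps a word to its position in the count vector.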


In [None]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X,y)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(4179, 9381) (1393, 9381) (4179,) (1393,)


In [None]:
me = MultiNB()
me.fit(X_train, y_train)
yhat = me.predict(X_test)
print(accuracy_score(y_test,yhat))

0.9784637473079684
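
The notebook imports scikit-learn's `MultinomialNB` but does not use it. One way to cross-check equation (1) is to compare it against the `feature_log_prob_` attribute of a fitted `MultinomialNB`. The sketch below does this on a tiny made-up count matrix rather than on the spam data, so it makes no claim about accuracy on the real dataset.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count data for illustration: 4 samples, 3 features, 2 classes
X = np.array([[2, 1, 0],
              [3, 0, 1],
              [0, 2, 4],
              [1, 0, 5]])
y = np.array([0, 0, 1, 1])
alpha = 1.0

clf = MultinomialNB(alpha=alpha).fit(X, y)

# Equation (1) computed by hand for each class
N_yi = np.array([X[y == c].sum(axis=0) for c in np.unique(y)])
N_y = N_yi.sum(axis=1, keepdims=True)
theta = (N_yi + alpha) / (N_y + alpha * X.shape[1])

# scikit-learn stores log P(x_i | y); exponentiate before comparing
print(np.allclose(np.exp(clf.feature_log_prob_), theta))  # True
```

The same comparison could be run on `X_train` and `y_train` with the `MultiNB` class above, using its `N_yi` and `N_y` attributes in place of the hand-computed arrays.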
