# Naive Bayes Classifier
Naive Bayes is a family of simple, supervised learning algorithms that are based on Bayes' Theorem:

$$
P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}
$$

With multiple features, we generalize this to

$$
P(Y|x_1, x_2, ..., x_n) = \frac{P(x_1, x_2, ..., x_n| Y)P(Y)}{P(x_1, x_2, ..., x_n)}
$$

Naive Bayes assumes that the features are conditionally independent of each other given the class, i.e.

$$
P(x_i|Y, x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) = P(x_i|Y)
$$

Additionally, $P(x_1, x_2, ..., x_n)$ is constant across classes for a given observation, so it can be dropped when comparing classes. The posterior is then proportional to
$$
P(Y|x_1, x_2, ..., x_n) \propto P(Y)\prod_{i=1}^nP(x_i|Y)
$$

We can predict class labels like so:
$$
\hat{y} = \underset{y}{\arg\max } \space P(y)\prod_{i=1}^nP(x_i|y)
$$
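
Multiplying many small probabilities can underflow in floating point, so it is common to compare log probabilities instead. Because the logarithm is monotonic, the arg max is unchanged:
$$
\hat{y} = \underset{y}{\arg\max } \space \left[ \log P(y) + \sum_{i=1}^n \log P(x_i|y) \right]
$$

In the Multinomial case this sum can be written over the vocabulary. The symbols $c_w$ (the count of word $w$ in the observation) and $\theta_{yw}$ (the probability of word $w$ in class $y$) are notation introduced here for illustration, not taken from the rest of the notebook:
$$
\hat{y} = \underset{y}{\arg\max } \space \left[ \log P(y) + \sum_{w} c_w \log \theta_{yw} \right]
$$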

To explain this in words, I'll use the classic example of classifying emails as either spam or ham. Our feature space is the entire vocabulary of words that we've seen among the emails in our training dataset, and the feature vector for a particular observation is the vector of counts for each word. Let's say a spam email frequently contains the word "opportunity", while ham emails frequently contain the word "report". If a particular email contains the word "opportunity", then it's likely spam; if it contains "report", it's likely ham. Additionally, if the email contains "opportunity" more than once, then (in the Multinomial case, at least) the email is even more likely to be spam.

For text data, the conditional probabilities (the likelihood of a word given a class) can be estimated with simple word counts:
$$
P(X|Y) = \frac{ P(X,Y) } { P(Y) } \approx \frac{ \textrm{Count}(X,Y) } { \textrm{Count}(Y) }
$$
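
A word that never appears in a class would get a probability of exactly zero under this estimate, which wipes out the whole product. Laplace smoothing avoids that by adding a small constant to every count. Here is a minimal sketch of the idea; the function name `word_likelihoods` and the counts are made up for illustration:

```python
import numpy as np

def word_likelihoods(counts, k=1.0):
    """Estimate P(word | class) from per-word counts, adding k to every count."""
    counts = np.asarray(counts, dtype=float)
    smoothed = counts + k
    return smoothed / smoothed.sum()

# Hypothetical counts for one class over a 4-word vocabulary.
# The last word never appears in this class.
counts = [6, 3, 1, 0]

unsmoothed = word_likelihoods(counts, k=0)  # last entry is exactly 0
smoothed = word_likelihoods(counts, k=1)    # (count + 1) / (10 + 4)

print(unsmoothed)
print(smoothed)
```

With `k=1` the unseen word gets probability 1/14 instead of 0, and the probabilities still sum to 1.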

In this notebook, I've implemented a Multinomial Naive Bayes classifier inspired by [sklearn's MultinomialNB model](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html). The model is implemented using numpy. I've also implemented some basic text data cleaning with pandas and feature engineering with sklearn. I wrote the code specifically for text data. I'm not sure it will work with other types of data.

Like the sklearn model, I implement fit and predict methods. However, I wrote them in a quirky way because I tried to vectorize as much as possible with NumPy. My fit method works like this:
    1. Take in X_train (NxW matrix) and y_train (Nx1 vector)
    2. Determine the classes by finding the unique values in y_train
    3. Iterate over each class:
        a. Create a new vector of 0s and 1s indicating whether an element of y_train equals the target class
        b. Find the total count of each word in this class by multiplying the 0s/1s vector by X_train
            * X_train is already a matrix of word counts derived using CountVectorizer
        c. Add in the Laplace smoothing parameter
        d. Find the vector of log probabilities for each word by dividing the count for each word by the total number of
           words for that class, then taking the natural log
        e. Append that vector to a list of vectors
    4. Vertically stack the list of log probability vectors, resulting in a CxW matrix (C = # classes, W = # words)
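
The fit steps above can be sketched on a toy example. This is an illustration with plain ndarrays rather than the notebook's actual class; the function name `fit_log_likelihoods` and the toy data are assumptions:

```python
import numpy as np

def fit_log_likelihoods(X, y, k=1.0):
    """Return (classes, log_probs), with one row of log P(word | class) per class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    rows = []
    for cls in classes:
        is_class = (y == cls).astype(float)  # length-N vector of 0s and 1s
        word_counts = is_class @ X           # per-word totals over this class's rows
        word_counts = word_counts + k        # Laplace smoothing
        rows.append(np.log(word_counts / word_counts.sum()))
    return classes, np.vstack(rows)

# Toy data: 3 observations, 3 words, 2 classes
X_toy = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 3, 1]])
y_toy = np.array([0, 0, 1])

classes, log_probs = fit_log_likelihoods(X_toy, y_toy)
print(log_probs.shape)    # (2, 3)
print(np.exp(log_probs))  # each row sums to 1
```

The indicator-vector product is what replaces an explicit loop over observations: rows that belong to other classes are multiplied by 0 and drop out of the sum.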

My predict method is straightforward:
    1. Multiply the X_test matrix by the transpose of the matrix of log probabilities
    2. This matrix product sums the log probabilities, weighted by word count, for each class in each observation
    3. The predicted class for each observation is the class with the highest summed log probability
    4. Return the vector of predicted classes
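
The predict steps can be sketched the same way. The log probabilities and the new observations below are hypothetical values chosen for illustration, and the scores leave out the class prior, as the steps above do:

```python
import numpy as np

# Hypothetical fitted log probabilities: one row per class, one column per word
log_probs = np.log(np.array([[4/7, 1/7, 2/7],
                             [1/7, 4/7, 2/7]]))

# Two new observations expressed as word counts
X_new = np.array([[1, 0, 0],
                  [0, 2, 1]])

# One row per observation, one column per class. Each entry is the sum over
# words of (word count) * (log probability of that word in that class).
scores = X_new @ log_probs.T

# The predicted class is the column with the highest score
y_hat = np.argmax(scores, axis=1)
print(y_hat)  # [0 1]
```

Note that `argmax` returns a column index, so it only doubles as a class label when the labels are the integers 0 to C-1, which is what LabelEncoder produces.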
           

In [16]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix

Read the train and test data into DataFrames. Clean the Tweets - remove punctuation, lowercase the strings, and remove words that are 3 characters long or shorter.

In [17]:
def clean_df(df):
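    # Replace punctuation with spaces, then lowercase. regex=True is required on pandas >= 2.0,
    # where str.replace treats the pattern as a literal string by default.
    df['Tweet_cleaned'] = df['Tweet'].str.replace(r'[.:?!;,]', ' ', regex=True).str.lower()
    # Remove words with 3 characters or fewer
    df['Tweet_cleaned'] = df['Tweet_cleaned'].str.replace(r'(?:^| )\w{0,3} ', ' ', regex=True)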
    df = df[['Class', 'Tweet_cleaned']]
    return df

In [18]:
train_df = pd.read_csv('traintweets.csv', sep='\t')
train_df = clean_df(train_df)
train_df.head()

Unnamed: 0,Class,Tweet_cleaned
0,OTHER,¿en donde esta remontada mandrill
1,OTHER,@katie_phd alternate 'reproachful mandrill' c...
2,OTHER,"@theophani i ""drill"" there it would a picture..."
3,OTHER,“@chrisjboyland baby mandrill paignton 29th ap...
4,OTHER,“@missmya #nameanamazingband mandrill ” mint c...


In [19]:
test_df = pd.read_csv('testtweets.csv', sep='\t')
test_df = clean_df(test_df)
test_df.head()

Unnamed: 0,Class,Tweet_cleaned
0,APP,just love @mandrillapp transactional email ser...
1,APP,@rossdeane mind submitting request http //help...
2,APP,@veroapp chance you'll adding mandrill support...
3,APP,@elie__ @camj59 jparle relai smtp million mail...
4,APP,would like send emails welcome password resets...


Transform the training and test data into labels and feature vectors.

In [20]:
count_vec = CountVectorizer()
le = LabelEncoder()

X_train = count_vec.fit_transform(train_df['Tweet_cleaned']).todense()
y_train = le.fit_transform(train_df['Class'])

X_test = count_vec.transform(test_df['Tweet_cleaned']).todense()
y_test = le.transform(test_df['Class'])
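
To make the feature vectors concrete, here is a small sketch of what CountVectorizer produces. The two-document corpus is made up for illustration, and the variable names are chosen so they don't clash with the notebook's:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus
docs = ["great opportunity opportunity", "quarterly report"]

vec = CountVectorizer()
X_counts = vec.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each word to its column index; sort the words by that index
feature_names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

print(feature_names)
print(X_counts.toarray())
```

Each column corresponds to one word in the vocabulary, and each entry is the number of times that word appears in the document, so "opportunity" gets a 2 in the first row.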

Classifier code

In [23]:
class MultinomialNaiveBayes:
    
    def __init__(self, k=1):
        self.k = k  # Laplace smoothing parameter
        
        
    def fit(self, X_train, y_train):
        # Find the distinct class labels
        classes = np.unique(y_train)
        
        # Build one vector of log likelihoods, log P(word | class), per class
        pxy_vectors = []
        for cls in classes:            
                        
            # Find the observations where the observation's class is equal to the target class.
            # Convert from True/False to 1/0 so we can use vector multiplication.
            y_train_isclass = (y_train == cls).astype(int)  
            
            # X_train is an np.matrix, so * is a matrix product: the indicator vector times
            # X_train sums the word counts over the observations labeled with the target
            # class, giving a 1xW row of per-word totals.
            wordcounts = y_train_isclass * X_train
            wordcounts = wordcounts + self.k  # +k for Laplace smoothing
            
            # Calculate the log probabilities by dividing the wordcounts by the total number
            # of words, and taking the natural log of that vector.
            log_probs = np.log(wordcounts/np.sum(wordcounts))            
            
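            # Store in the list of per-class log-likelihood vectors
            pxy_vectors.append(log_probs)
        
        # Stack the per-class vectors into a CxW matrix (C = # classes, W = # words).
        # Note: the class prior P(y) is never added, which amounts to assuming a uniform prior.
        self.p_xy = np.vstack(pxy_vectors)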
        
    
    def predict(self, X_pred):
        
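        # Score each observation against each class. The matrix product sums the
        # count-weighted log likelihoods, giving one column per class.
        class_probabilities = X_pred * self.p_xy.T
        
        # The predicted class is the column index with the highest score. The index equals
        # the label only because LabelEncoder produced labels 0..C-1.
        y_pred = np.argmax(class_probabilities, axis=1)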
        
        # Convert the Nx1 "matrix" into an ndarray before returning it
        y_pred = np.squeeze(np.asarray(y_pred))
        return y_pred
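
As a sanity check on the approach, smoothed word-count log probabilities computed by hand should match sklearn's MultinomialNB when it is told to use a uniform class prior (`fit_prior=False`), since the class above never adds $\log P(y)$. This is a sketch on made-up toy data, not a test of the notebook's class itself:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy data: 3 training observations, 3 words, 2 classes
X_toy = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 3, 1]])
y_toy = np.array([0, 0, 1])
X_new = np.array([[1, 0, 0],
                  [0, 2, 1]])

# Hand-computed smoothed log likelihoods, one row per class
k = 1.0
manual = []
for cls in np.unique(y_toy):
    counts = X_toy[y_toy == cls].sum(axis=0) + k
    manual.append(np.log(counts / counts.sum()))
manual = np.vstack(manual)
manual_pred = np.argmax(X_new @ manual.T, axis=1)

# sklearn with the same smoothing and a uniform class prior
sk = MultinomialNB(alpha=k, fit_prior=False).fit(X_toy, y_toy)
sk_pred = sk.predict(X_new)

print(np.allclose(manual, sk.feature_log_prob_))
print(manual_pred, sk_pred)
```

With sklearn's default `fit_prior=True`, the predictions could differ on imbalanced data because sklearn would then add the log of the empirical class frequencies.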

Fit the model. Predict the test labels.

In [24]:
mnb = MultinomialNaiveBayes()
mnb.fit(X_train, y_train)

In [25]:
y_pred = mnb.predict(X_test)

Performance metrics

In [26]:
print('Accuracy score: {score}'.format(score=accuracy_score(y_test, y_pred)))
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred))

Accuracy score: 0.85
Confusion matrix:
[[9 1]
 [2 8]]


Great.