# Document Classification with Naive Bayes - Lab

## Introduction

In this lab, you'll practice implementing the Naive Bayes algorithm for document classification on your own.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the text file `'SMSSpamCollection'`.

In [17]:
# Your code here
# import pandas
import pandas as pd

# read in data with delimiter tab
data = pd.read_csv('SMSSpamCollection', delimiter = '\t', header = None,
                  names = ['label', 'text'])

# inspect first five rows
data.head()

|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Account for class imbalance

To help your algorithm perform more accurately, balance the dataset so that the two classes are of equal size. Keep every instance of the minority class (spam) and randomly downsample the majority class (ham) to the same number of examples.

In [18]:
# Your code here
# how many instances of each?
data.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [25]:
import numpy as np

# get spam data
spam_data = data[ data.label == 'spam' ]

# get ham data
ham_data = data[ data.label == 'ham' ]

# select a random sample of indices from ham_data, as many as are in spam_data
ham_indices = np.random.choice(ham_data.index, size = len(spam_data), replace = False)

# subset ham_data
ham_data = ham_data.loc[ham_indices, :]

# put data back together
subset_data = pd.concat([spam_data, ham_data], axis = 0)

subset_data

|   | label | text |
|---|-------|------|
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |
| 11 | spam | SIX chances to win CASH! From 100 to 20,000 po... |
| ... | ... | ... |
| 2637 | ham | Thank god they are in bed! |
| 1360 | ham | &lt;#&gt; %of pple marry with their lovers... ... |
| 1774 | ham | I'm not coming over, do whatever you want |
| 2828 | ham | Oh right, ok. I'll make sure that i do loads o... |


In [27]:
# check number of each, spam and ham
subset_data.label.value_counts()

spam    747
ham     747
Name: label, dtype: int64

In [29]:
# optional: restore the original row order by sorting on the index
subset_data.sort_index(inplace = True)
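
Because `np.random.choice` is called without a seed above, each run draws a different ham sample. A minimal sketch of a reproducible variant is shown below. It assumes a hypothetical helper name, `downsample_majority`, and an arbitrary seed of 42, and relies on `DataFrame.sample` with `random_state`. The toy frame stands in for the SMS data.

```python
import pandas as pd

def downsample_majority(df, label_col='label', seed=42):
    """Return a copy of df in which every class has as many rows as the rarest class.

    Hypothetical helper for illustration; `seed` is an arbitrary choice.
    """
    # size of the smallest class
    n_min = df[label_col].value_counts().min()

    # sample n_min rows from each class with a fixed seed, then restore index order
    parts = [group.sample(n=n_min, random_state=seed)
             for _, group in df.groupby(label_col)]
    return pd.concat(parts).sort_index()

# toy data standing in for the SMS dataset
toy = pd.DataFrame({
    'label': ['ham', 'ham', 'ham', 'ham', 'spam', 'spam'],
    'text': ['a', 'b', 'c', 'd', 'e', 'f'],
})
balanced = downsample_majority(toy)
print(balanced.label.value_counts())
```

Fixing the seed means the same ham rows are selected on every run, which makes downstream numbers comparable between runs.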

## Train-test split

Now implement a train-test split on the dataset: 

In [30]:
# Your code here
from sklearn.model_selection import train_test_split
X = subset_data['text']
y = subset_data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 2, test_size = 0.25)
train_df = pd.concat([X_train, y_train], axis = 1)
test_df = pd.concat([X_test, y_test], axis = 1)

## Create the word frequency dictionary for each class

For each class, build a dictionary that maps every word in that class's training documents to the number of times it occurs:

In [39]:
# Your code here
# for each unique label, count words that show up in rows with that label

# get classes by selecting unique elements in the label column of the training set
classes = np.unique(train_df.label)

# create an empty dictionary to hold a dictionary of word counts for each label
class_word_freq = {}

for label in classes:
    # create a bag to hold frequencies of words for this label
    bag = {}
    
    # find records in training set with this label
    records = train_df[ train_df.label == label ]
    
    # loop over each record with this label
    for row in records.index:
        
        # get the text for that record
        doc = train_df.loc[row, 'text']
        
        # split the text by word
        for word in doc.split():
            
            # add this word to the bag, or increment its count if already added
            bag[word] = bag.get(word, 0) + 1
            
    # after looping over all records, save this bag to this class in class_word_freq
    class_word_freq[label] = bag
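
The nested loop above can be written more compactly with `collections.Counter`. The sketch below is an illustration only: it uses a small hypothetical list of `(label, text)` pairs in place of `train_df` and the same whitespace tokenization (`str.split()`).

```python
from collections import Counter, defaultdict

# hypothetical stand-in for the rows of train_df
toy_train = [
    ('spam', 'win cash now'),
    ('spam', 'win a prize now'),
    ('ham', 'see you now'),
]

# one Counter of word frequencies per class
toy_class_word_freq = defaultdict(Counter)
for label, text in toy_train:
    # Counter.update adds one count per token in the iterable
    toy_class_word_freq[label].update(text.split())

print(toy_class_word_freq['spam']['win'])   # 2
print(toy_class_word_freq['ham']['win'])    # 0 (a Counter returns 0 for unseen words)
```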

## Count the total corpus words
Calculate V, the number of unique words (the vocabulary size) in the training corpus:

In [40]:
# Your code here

# create an empty set
vocab = set()

# add each word in the 'text' section of the data frame to the vocabulary set

# iterate over all records in training set
for record in train_df.index:
    doc = train_df.loc[record, 'text']
    
    # split into words; add each one to vocab set
    for word in doc.split():
        vocab.add(word)

# find the length of the vocabulary set
total_corpus_words = len(vocab)

total_corpus_words

5935
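
Note that `len(vocab)` counts distinct words, not word occurrences. A toy illustration of the difference, using two made-up documents:

```python
toy_docs = ['free entry free prize', 'ok see you']

# every word occurrence, repeats included
tokens = [word for doc in toy_docs for word in doc.split()]

# distinct words only; this is the quantity V used for smoothing
toy_vocab = set(tokens)

print(len(tokens))     # 7
print(len(toy_vocab))  # 6
```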

## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [41]:
# Your code here
def bag_it(doc):
    # define an empty bag
    bag = {}
    
    # break document into words; add word to bag, or increment its count if already added
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
        
    return bag
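
Because `str.split()` only splits on whitespace, `Free`, `free`, and `free!` end up as three different words. A minimal sketch of a normalized variant follows. The name `bag_it_normalized` and the token pattern are assumptions made for illustration, not part of the lab.

```python
import re
from collections import Counter

def bag_it_normalized(doc):
    """Hypothetical variant of bag_it(): lowercase the text and keep only
    runs of letters, digits, and apostrophes as tokens."""
    tokens = re.findall(r"[a-z0-9']+", doc.lower())
    return dict(Counter(tokens))

print(bag_it_normalized("Free entry! FREE prize, free."))
# {'free': 3, 'entry': 1, 'prize': 1}
```

If a variant like this is used, the same tokenization should also be applied when building the class word frequencies and the vocabulary, so that all counts refer to the same tokens.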

## Implementing Naive Bayes

Now, implement a master function to build a Naive Bayes classifier. Be sure to work with log probabilities, summing logs rather than multiplying raw probabilities, to avoid numerical underflow.

In [58]:
# Your code here
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    # build the bag of words once; it does not depend on the class
    bag = bag_it(doc)

    # based on the posteriors, ultimately return the best-scoring class
    classes = []
    posteriors = []

    # score the document against each class
    for label in class_word_freq.keys():

        # start from the log prior of this class; working in log space turns a
        # product of many small probabilities into a sum and avoids underflow
        p = np.log(p_classes[label])

        # for each word in this doc, add log(numerator / denominator) to p, where
        #   numerator   = count of the word in this doc + 1
        #   denominator = count of the word in this class + V (unique words in corpus)
        # (the standard Laplace estimate instead uses class counts in the numerator)
        for word in bag:
            numerator = bag[word] + 1
            denominator = class_word_freq[label].get(word, 0) + V
            p += np.log(numerator / denominator)

        classes.append(label)
        posteriors.append(p)

    # predict the class with the largest log posterior
    if return_posteriors:
        return classes[np.argmax(posteriors)], posteriors
    else:
        return classes[np.argmax(posteriors)]
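
For comparison, the usual multinomial Naive Bayes estimate with Laplace (add-one) smoothing is P(word | class) = (count of word in class + 1) / (total words in class + V), with each word's log-likelihood weighted by its count in the document. The ratio in `classify_doc` above instead puts the document count in the numerator, so a class in which a word is rare receives a higher score. The sketch below shows the standard estimate under those assumptions. The name `classify_doc_smoothed` and the toy counts are hypothetical and are not taken from the lab's data.

```python
import math

def classify_doc_smoothed(doc, class_word_freq, p_classes, V):
    """Hypothetical sketch of multinomial Naive Bayes with add-one smoothing.

    class_word_freq: {label: {word: count}}, p_classes: {label: prior},
    V: number of distinct words in the training corpus.
    """
    # bag of words for the document (same whitespace tokenization as bag_it)
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1

    log_posteriors = {}
    for label, word_counts in class_word_freq.items():
        # total number of word occurrences observed in this class
        n_class = sum(word_counts.values())

        # start from the log prior, then add one smoothed log-likelihood
        # term per word, weighted by its count in the document
        log_p = math.log(p_classes[label])
        for word, count in bag.items():
            likelihood = (word_counts.get(word, 0) + 1) / (n_class + V)
            log_p += count * math.log(likelihood)
        log_posteriors[label] = log_p

    # the predicted class is the one with the largest log posterior
    return max(log_posteriors, key=log_posteriors.get), log_posteriors

# toy counts, made up for illustration
toy_freq = {'spam': {'win': 3, 'cash': 2}, 'ham': {'see': 2, 'you': 3}}
toy_priors = {'spam': 0.5, 'ham': 0.5}
toy_V = 4  # distinct words: win, cash, see, you

label, scores = classify_doc_smoothed('win cash', toy_freq, toy_priors, toy_V)
print(label)  # spam
```

The function also accepts a pandas Series for `p_classes`, since it only indexes the priors by label.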

## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be discouraged if your results are underwhelming; real-world use cases would require substantial additional preprocessing before applying the algorithm in practice.

In [52]:
# get the prior probability of each class from the training labels (should be close to
# 50 / 50 because the subset is balanced, though the random split can shift it slightly)

p_classes = train_df.label.value_counts(normalize = True)

In [61]:
# Your code here
# for each record in the training set, pass its text to classify_doc as doc,
# along with class_word_freq, p_classes, and total_corpus_words
# (return_posteriors is left at its default of False)
# store the predictions as y_preds

y_preds = [classify_doc(X_train.loc[i], class_word_freq, p_classes,
                        total_corpus_words) for i in X_train.index]

In [63]:
# check accuracy on the training set: True marks a correct prediction
correct = y_train == y_preds
correct.value_counts(normalize = True)

False    0.772321
True     0.227679
Name: label, dtype: float64
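
With two balanced classes, random guessing is correct about half of the time, so the proportions above are worth comparing against that baseline. The cell above also scores the training messages, and the held-out split can be scored the same way. The sketch below uses a hypothetical `accuracy` helper and made-up labels to show the calculation.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# made-up labels for illustration
toy_true = ['spam', 'ham', 'ham', 'spam']
toy_pred = ['spam', 'ham', 'spam', 'spam']
print(accuracy(toy_true, toy_pred))  # 0.75
```

In the lab's notebook this could be called as `accuracy(y_test, test_preds)`, where `test_preds` is a hypothetical list built from `X_test` in the same way `y_preds` is built from `X_train`.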

## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!