# Document Classification with Naive Bayes

## Introduction

In this lab session, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the text file `SMSSpamCollection.txt`.

In [15]:
import pandas as pd
df = pd.read_csv('data/SMSSpamCollection.txt', sep='\t', header=None, names=['class', 'text'])
df

|      | class | text |
|------|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


## Account for class imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size: keep every instance of the minority class (spam) and randomly sample an equal number of instances from the majority class (ham).

In [16]:
from collections import Counter
from sklearn.model_selection import train_test_split

In [17]:
spam_instances = df[df['class'] == 'spam']
ham_instances = df[df['class'] == 'ham']

#count the number of instances in each class
num_spam = len(spam_instances)
num_ham = len(ham_instances)

#determine the number of instances to keep from the majority class
num_instances_to_keep = min(num_spam, num_ham)

#subset examples of the majority class to match the number of spam instances
subset_ham = ham_instances.sample(n=num_instances_to_keep, random_state=42)

#combine the selected instances of both classes
balanced_df = pd.concat([spam_instances, subset_ham])

balanced_df['class'].value_counts()


class
spam    747
ham     747
Name: count, dtype: int64

## Train-test split

Now implement a train-test split on the dataset:

In [18]:
# Your code here
from sklearn.model_selection import train_test_split
# Note: this splits the full `df`; `balanced_df` from the previous step is not used here
X = df['text']
y = df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
train_df = pd.DataFrame({'text': X_train, 'class': y_train})
test_df = pd.DataFrame({'text': X_test, 'class': y_test})


In [19]:
# reset the index of both splits so rows are numbered from 0

train_df.reset_index(drop=True, inplace=True)
test_df.reset_index(drop=True, inplace=True)

In [20]:
train_df

|      | text | class |
|------|------|-------|
| 0 | He will, you guys close? | ham |
| 1 | CAN I PLEASE COME UP NOW IMIN TOWN.DONTMATTER ... | ham |
| 2 | Ok k..sry i knw 2 siva..tats y i askd.. | ham |
| 3 | I'll see, but prolly yeah | ham |
| 4 | I'll see if I can swing by in a bit, got some ... | ham |
| ... | ... | ... |
| 4452 | What pa tell me.. I went to bath:-) | ham |
| 4453 | Jus finish watching tv... U? | ham |
| 4454 | Moby Pub Quiz.Win a £100 High Street prize if ... | spam |
| 4455 | Free entry in 2 a weekly comp for a chance to ... | spam |


In [21]:
test_df

|      | text | class |
|------|------|-------|
| 0 | No need to buy lunch for me.. I eat maggi mee.. | ham |
| 1 | Ok im not sure what time i finish tomorrow but... | ham |
| 2 | Waiting in e car 4 my mum lor. U leh? Reach ho... | ham |
| 3 | You have won ?1,000 cash or a ?2,000 prize! To... | spam |
| 4 | If you r @ home then come down within 5 min | ham |
| ... | ... | ... |
| 1110 | AH POOR BABY!HOPE URFEELING BETTERSN LUV! PROB... | ham |
| 1111 | O ic lol. Should play 9 doors sometime yo | ham |
| 1112 | Ambrith..madurai..met u in arun dha marrge..re... | ham |
| 1113 | Dear umma she called me now :-) | ham |


## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class:

In [22]:
from collections import defaultdict
import nltk
from nltk.tokenize import word_tokenize

In [23]:
# Your code here
word_freq_dict_spam = defaultdict(int)
word_freq_dict_ham = defaultdict(int)

# count words for spam class
for text in train_df[train_df['class'] == 'spam']['text']:
    words = word_tokenize(text)
    for word in words:
        word_freq_dict_spam[word] += 1

# count words for ham class
for text in train_df[train_df['class'] == 'ham']['text']:
    words = word_tokenize(text)
    for word in words:
        word_freq_dict_ham[word] += 1

print("Word frequency dictionary for spam class:")
print(dict(list(word_freq_dict_spam.items())[:15]))  #print first 15 items
print("\nWord frequency dictionary for ham class:")
print(dict(list(word_freq_dict_ham.items())[:15]))  #print first 15 items

Word frequency dictionary for spam class:
{'FREE': 90, 'NOKIA': 12, 'Or': 3, 'Motorola': 8, 'with': 77, 'upto': 1, '12mths': 1, '1/2price': 1, 'linerental': 2, ',': 320, '500': 20, 'x-net': 1, 'mins': 18, '&': 140, '100txt/mth': 1}

Word frequency dictionary for ham class:
{'He': 42, 'will': 226, ',': 1269, 'you': 1335, 'guys': 26, 'close': 10, '?': 1077, 'CAN': 7, 'I': 1565, 'PLEASE': 2, 'COME': 3, 'UP': 13, 'NOW': 4, 'IMIN': 1, 'TOWN.DONTMATTER': 1}


## Count the total corpus words
Calculate V, the vocabulary size: the number of unique words in the corpus.

In [24]:
# Your code here
unique_words = set()

#iterate through all text samples in train_df and test_df
for text in train_df['text']:
    words = word_tokenize(text)
    unique_words.update(words)

for text in test_df['text']:
    words = word_tokenize(text)
    unique_words.update(words)

#calculate the vocabulary size (number of unique tokens across both splits)
V = len(unique_words)

print("Total number of unique words in the corpus (V):", V)

Total number of unique words in the corpus (V): 11516


## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [25]:
# Your code here

def bag_it(text):
    #tokenize the words
    words = word_tokenize(text)
    
    #initialize a defaultdict to store word frequencies
    bag_of_words = defaultdict(int)
    
    #count the occurrences of each word
    for word in words:
        bag_of_words[word] += 1
    
    return bag_of_words

#example
text = "Juggling a dripping watermelon, the eccentric violinist sprinted between raindrops, his melody a whimsical challenge to the approaching storm"
bag_of_words = bag_it(text)
print("Bag of words representation:", bag_of_words)

Bag of words representation: defaultdict(<class 'int'>, {'Juggling': 1, 'a': 2, 'dripping': 1, 'watermelon': 1, ',': 2, 'the': 2, 'eccentric': 1, 'violinist': 1, 'sprinted': 1, 'between': 1, 'raindrops': 1, 'his': 1, 'melody': 1, 'whimsical': 1, 'challenge': 1, 'to': 1, 'approaching': 1, 'storm': 1})
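
As an alternative sketch, the same counting can be written with `collections.Counter`. The name `bag_it_counter` is hypothetical, and whitespace splitting stands in for `word_tokenize` so the snippet runs without NLTK data. Its tokens therefore differ from the output above, because punctuation stays attached to words.

```python
from collections import Counter

def bag_it_counter(text):
    """Return a Counter mapping each whitespace-separated token to its count."""
    # Assumption: str.split() replaces word_tokenize to keep the sketch dependency-free
    return Counter(text.split())

bag = bag_it_counter("the storm chased the violinist")
print(bag["the"])      # 2
print(bag["melody"])   # 0: a Counter returns 0 for unseen tokens without adding them
```

Unlike a `defaultdict(int)`, looking up a missing key in a `Counter` does not insert it, which keeps the bag free of accidental zero-count entries.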


## Implementing Naive Bayes

Now, implement a master function to build a Naive Bayes classifier. Be sure to use log probabilities to avoid numerical underflow.
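
To see why logarithms matter, here is a minimal sketch with made-up numbers: a hypothetical document of 500 words, each assumed to have likelihood `1e-4` under some class. The values are illustrative only and do not come from the SMS dataset.

```python
import math

# Hypothetical setup: 500 words, each with likelihood 1e-4 under some class
word_probs = [1e-4] * 500

# Multiplying raw probabilities underflows to exactly 0.0 in double precision
product = 1.0
for p in word_probs:
    product *= p

# Summing logarithms keeps the same quantity representable
log_score = sum(math.log(p) for p in word_probs)

print(product)     # 0.0
print(log_score)   # roughly -4605.17
```

Once the product reaches `0.0`, every class scores the same and the comparison is meaningless, whereas the log scores can still be ranked.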

In [26]:
import math

In [27]:
# Your code here
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    """Classify `doc` with multinomial Naive Bayes using add-one (Laplace) smoothing."""
    # Note: whitespace splitting here differs from word_tokenize used in bag_it(),
    # so punctuation stays attached to tokens
    words = doc.split()

    # create dictionary to store the log score of each class
    class_probs = {}

    # log prior plus the sum of log likelihoods for each class
    for label, word_freq in class_word_freq.items():
        class_probs[label] = math.log(p_classes[label])  # initialize with log prior
        total_words = sum(word_freq.values())  # computed once per class, not once per word

        for word in words:
            # Laplace smoothing: add 1 to every count so unseen words get a nonzero probability
            word_freq_with_smoothing = word_freq.get(word, 0) + 1
            # add the log probability of the word given the class
            class_probs[label] += math.log(word_freq_with_smoothing / (total_words + V))

    # pick the class with the highest log score
    predicted_class = max(class_probs, key=class_probs.get)

    if return_posteriors:
        # exponentiate relative to the max log score for stability, then normalize to sum to 1
        unnormalized = {label: math.exp(score - class_probs[predicted_class]) for label, score in class_probs.items()}
        normalizer = sum(unnormalized.values())
        class_posteriors = {label: value / normalizer for label, value in unnormalized.items()}
        return predicted_class, class_posteriors
    else:
        return predicted_class
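
The smoothing step inside `classify_doc()` can be checked by hand on a tiny example. The sketch below uses hypothetical counts and an assumed vocabulary size; the helper name `smoothed_log_prob` is not part of the lab.

```python
import math

# Hypothetical toy counts for one class; not taken from the SMS dataset
word_freq = {"free": 3, "prize": 1}
V = 5                                  # assumed vocabulary size
total = sum(word_freq.values())        # 4 tokens observed in this class

def smoothed_log_prob(word):
    """log P(word | class) with add-one (Laplace) smoothing."""
    return math.log((word_freq.get(word, 0) + 1) / (total + V))

print(math.exp(smoothed_log_prob("free")))    # (3 + 1) / (4 + 5), about 0.444
print(math.exp(smoothed_log_prob("hello")))   # (0 + 1) / (4 + 5), about 0.111
```

Without the added 1, the unseen word would have probability 0 and its logarithm would be undefined.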

## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [28]:
# Your code here
#build one word frequency dictionary per class from the training texts
class_word_freq = {label: dict(bag_it(' '.join(train_df[train_df['class'] == label]['text']))) for label in train_df['class'].unique()}

#create prior probabilities dictionary
total_documents = len(train_df)
p_classes = {label: count / total_documents for label, count in train_df['class'].value_counts().items()}

#test the classifier on the test set
correct_predictions = 0
total_predictions = len(test_df)

for index, row in test_df.iterrows():
    doc = row['text']
    true_label = row['class']
    
    predicted_label = classify_doc(doc, class_word_freq, p_classes, V)
    
    if predicted_label == true_label:
        correct_predictions += 1

accuracy = correct_predictions / total_predictions
print("Accuracy:", accuracy)

Accuracy: 0.9345291479820628
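
Accuracy alone can look high on imbalanced data even when the minority class is handled poorly, so it may help to inspect precision and recall for the spam class as well. The following is a minimal sketch; the function name `precision_recall` and the label lists are hypothetical and are not results from this lab.

```python
def precision_recall(y_true, y_pred, positive):
    """Return (precision, recall) for the label treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels for illustration only
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "spam", "ham", "ham"]

precision, recall = precision_recall(y_true, y_pred, positive="spam")
print(precision, recall)   # 0.5 0.5
```

To apply it here, you would collect each `predicted_label` from the loop above into a list and pass it together with `test_df['class']`.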


## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.
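
One possible shape for such a class is sketched below. The class name `NaiveBayesTextClassifier` and its method names are hypothetical. It defaults to whitespace tokenization so it runs without NLTK (pass `word_tokenize` as `tokenizer` to match the lab), and it builds the vocabulary from the training documents only, which differs from the V computed earlier in this lab.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative sketch)."""

    def __init__(self, tokenizer=str.split):
        # Assumption: whitespace tokenization unless another tokenizer is passed in
        self.tokenizer = tokenizer

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)   # label -> token counts
        doc_counts = Counter(labels)              # label -> number of documents
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(self.tokenizer(doc))
        total_docs = sum(doc_counts.values())
        self.log_priors = {c: math.log(n / total_docs) for c, n in doc_counts.items()}
        # Vocabulary is built from the training documents only
        self.vocab = set().union(*self.word_counts.values())
        self.totals = {c: sum(counts.values()) for c, counts in self.word_counts.items()}
        return self

    def predict_one(self, doc):
        V = len(self.vocab)
        scores = {}
        for label, counts in self.word_counts.items():
            score = self.log_priors[label]
            for token in self.tokenizer(doc):
                # Counter returns 0 for unseen tokens; add-one smoothing keeps the log finite
                score += math.log((counts[token] + 1) / (self.totals[label] + V))
            scores[label] = score
        return max(scores, key=scores.get)

    def predict(self, docs):
        return [self.predict_one(doc) for doc in docs]

# Toy usage with made-up messages
clf = NaiveBayesTextClassifier().fit(
    ["win a free prize now", "free entry win cash", "see you at lunch", "are you home now"],
    ["spam", "spam", "ham", "ham"],
)
print(clf.predict(["free cash prize", "see you at home"]))   # ['spam', 'ham']
```

Keeping the tokenizer as a constructor argument means the same class can be reused on other datasets without editing its internals.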

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!