# Document Classification with Naive Bayes - Lab

## Introduction

In this lesson, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the text file `'SMSSpamCollection'`.

In [1]:
# Your code here
import pandas as pd
from IPython.core.display import HTML

df = pd.read_table('SMSSpamCollection', sep='\t')
df.columns = ['cat', 'text']
df.head()

Unnamed: 0,cat,text
0,ham,Ok lar... Joking wif u oni...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,U dun say so early hor... U c already then say...
3,ham,"Nah I don't think he goes to usf, he lives aro..."
4,spam,FreeMsg Hey there darling it's been 3 week's n...


## Account for class imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [2]:
# Your code here
df_spam = df[df.cat=='spam']
df_ham = df[df.cat=='ham'].sample(len(df_spam))
df_evenly_distributed = pd.concat([df_spam, df_ham], axis=0)
p_classes = dict(df_evenly_distributed['cat'].value_counts(normalize=True)) # from lecture
p_classes

{'ham': 0.5, 'spam': 0.5}

## Train-test split

Now implement a train-test split on the dataset: 

In [3]:
# Your code here
from sklearn.model_selection import train_test_split
target = 'cat'
X = df_evenly_distributed.drop([target], axis=1)
y = df_evenly_distributed[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)  # from the "lecture" associated with this lab
train_df = pd.concat([y_train, X_train], axis=1, join='inner')
display(HTML(train_df.head(10).to_html()))
test_df = pd.concat([y_test, X_test], axis=1, join='inner')
display(HTML(test_df.head(10).to_html()))

Unnamed: 0,cat,text
3717,ham,I'm gonna rip out my uterus.
311,spam,"Think ur smart ? Win £200 this week in our weekly quiz, text PLAY to 85222 now!T&Cs WinnersClub PO BOX 84, M26 3UZ. 16+. GBP1.50/week"
843,spam,"Urgent! call 09066350750 from your landline. Your complimentary 4* Ibiza Holiday or 10,000 cash await collection SAE T&Cs PO BOX 434 SK3 8WP 150 ppm 18+"
1381,ham,"We spend our days waiting for the ideal path to appear in front of us.. But what we forget is.. ""paths are made by walking.. not by waiting.."" Goodnight!"
1095,ham,Ryder unsold.now gibbs.
628,spam,New TEXTBUDDY Chat 2 horny guys in ur area 4 just 25p Free 2 receive Search postcode or at gaytextbuddy.com. TXT ONE name to 89693
1305,ham,Designation is software developer and may be she get chennai:)
3173,spam,"Dear Voucher Holder, To claim this weeks offer, at you PC please go to http://www.e-tlp.co.uk/reward. Ts&Cs apply."
1125,spam,For taking part in our mobile survey yesterday! You can now have 500 texts 2 use however you wish. 2 get txts just send TXT to 80160 T&C www.txt43.com 1.50p
5220,ham,"Jane babes not goin 2 wrk, feel ill after lst nite. Foned in already cover 4 me chuck.:-)"


Unnamed: 0,cat,text
5199,spam,Call Germany for only 1 pence per minute! Call from a fixed line via access number 0844 861 85 85. No prepayment. Direct access! www.telediscount.co.uk
1580,ham,"I shall book chez jules for half eight, if that's ok with you?"
2692,spam,"Urgent Urgent! We have 800 FREE flights to Europe to give away, call B4 10th Sept & take a friend 4 FREE. Call now to claim on 09050000555. BA128NNFWFLY150ppm"
1373,spam,"Bears Pic Nick, and Tom, Pete and ... Dick. In fact, all types try gay chat with photo upload call 08718730666 (10p/min). 2 stop texts call 08712460324"
2412,spam,I don't know u and u don't know me. Send CHAT to 86688 now and let's find each other! Only 150p/Msg rcvd. HG/Suite342/2Lands/Row/W1J6HL LDN. 18 years or over.
514,spam,"You are guaranteed the latest Nokia Phone, a 40GB iPod MP3 player or a £500 prize! Txt word: COLLECT to No: 83355! IBHltd LdnW15H 150p/Mtmsgrcvd18+"
3264,ham,I will send them to your email. Do you mind &lt;#&gt; times per night?
1845,ham,Hi. || Do u want | to join me with sts later? || Meeting them at five. || Call u after class.
1564,ham,Tmrw. Im finishing 9 doors
229,ham,Dear good morning now only i am up


## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class: 

In [4]:
# Your code here
class_word_freq = {} 
classes = train_df['cat'].unique()
for class_ in classes:
    temp_df = train_df[train_df.cat == class_]
    bag = {}
    for row in temp_df.index:
        doc = temp_df['text'][row]
        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
    class_word_freq[class_] = bag

## Count the total corpus words
Calculate V, the total number of words in the corpus: 

In [5]:
# Your code here
vocabulary = set()
for text in train_df['text']:
    for word in text.split():
        vocabulary.add(word)
V = len(vocabulary)
V

6083

## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [6]:
# Your code here
def bag_it(doc):
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
    return bag

## Implementing Naive Bayes

Now, implement a master function to build a naive Bayes classifier. Be sure to use the logarithmic probabilities to avoid underflow.

In [7]:
# Your code here
import numpy as np

#borrowed this function from the lecture
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    bag = bag_it(doc)
    classes = []
    posteriors = []
    for class_ in class_word_freq.keys():
        #p = p_classes[class_]
        p = np.log(p_classes[class_]) # use log to avoid underflow
        for word in bag.keys():
            num = bag[word]+1
            denom = class_word_freq[class_].get(word, 0) + V
            #p *= (num/denom)
            p *= np.log(num/denom) # use log to avoid underflow
        classes.append(class_)
        posteriors.append(p)
    if return_posteriors:
        print(posteriors)
    return classes[np.argmax(sorted(posteriors))]

## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [8]:
# Your code here
n_test_trials = 10
n = 0
while n < n_test_trials:
    row_sel = np.random.choice(range(len(train_df)))
    t = train_df.iloc[row_sel]['text']
    doc_class = classify_doc(t, class_word_freq, p_classes, V, return_posteriors=True)
    print(f"{doc_class}: {t}\n")
    n += 1

[1490474.1590784835, 1483972.161056787]
spam: Babe ? I lost you ... :-(

[7.104265506852228e+27, 7.144500156089006e+27]
spam: You'll not rcv any more msgs from the chat svc. For FREE Hardcore services text GO to: 69988 If u get nothing u must Age Verify with yr network & try again

[2.6679268857391247e+22, 2.7188585526069687e+22]
spam: Free 1st week entry 2 TEXTPOD 4 a chance 2 win 40GB iPod or £250 cash every wk. Txt POD to 84128 Ts&Cs www.textpod.net custcare 08712405020.

[1.2643001248621446e+24, 1.2684588701338894e+24]
spam: I miss you so much I'm so desparate I have recorded the message you left for me the other day and listen to it just to hear the sound of your voice. I love you

[1489665703979785.8, 1483069737449954.2]
spam: I think I‘m waiting for the same bus! Inform me when you get there, if you ever get there.

[5.898653716859975e+18, 5.973191763622286e+18]
spam: Thanks for the Vote. Now sing along with the stars with Karaoke on your mobile. For a FREE link just reply with 

In [9]:
y_hat_train = X_train.text.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))
residuals = y_train == y_hat_train
residuals.value_counts(normalize=True)

False    0.50625
True     0.49375
dtype: float64

## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!