# Naive Bayes Implementation

In [1]:
from __future__ import division # ensure that all division is float division
from __future__ import print_function # print works as a function, called with parentheses

%matplotlib inline
import matplotlib.pyplot as plt

import os, sys, re
import numpy as np
import pandas as pd
import seaborn as sns

pd.set_option("display.max_colwidth", 255)

**Read in SMS Data.**

>The SMS Spam Collection v.1 is a public set of SMS labeled messages that have been collected for mobile phone spam research. It has one collection composed by 5,574 English, real and non-encoded messages, tagged according to being legitimate (ham) or spam.

>A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: http://www.grumbletext.co.uk/.

>A subset of 3,375 SMS randomly chosen ham messages of the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/.


- Primary Source: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
- Secondary: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

### Read in Data

In [2]:
df = pd.read_csv("../data/sms.tsv", sep="\t", names=['label', 'message'])
print(df.shape)
df.head()

(5572, 2)


| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


### Stratified Train Test Split

Stratified means the proportions of spam and ham in the train and test sets mirror those of the original dataset. The output of the cell below shows that the share of ham is about the same in both splits.

In [3]:
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20
train, test = train_test_split(df, test_size=0.2, stratify=df.label)
print(train.shape, test.shape)
train.label.value_counts()['ham'] / len(train), test.label.value_counts()['ham'] / len(test)

(4457, 2) (1115, 2)


(0.86582903298182634, 0.86636771300448434)
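To make "stratified" concrete, here is a minimal pure-Python sketch of the idea: split each label's rows separately, then combine the pieces. It is only an illustration, not how scikit-learn implements `train_test_split`; the function name `stratified_split`, the `seed` parameter, and the toy label counts are assumptions chosen for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) so each label keeps roughly its original share."""
    rng = random.Random(seed)

    # Group row positions by label.
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    # Split every label group on its own, then pool the results.
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = int(round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def ham_share(labels, idx):
    """Fraction of the selected rows that are labeled ham."""
    return sum(labels[i] == "ham" for i in idx) / float(len(idx))

# Toy labels (hypothetical, not the real file): 87% ham, 13% spam.
toy_labels = ["ham"] * 870 + ["spam"] * 130
train_idx, test_idx = stratified_split(toy_labels)

print(len(train_idx), len(test_idx))
print(ham_share(toy_labels, train_idx), ham_share(toy_labels, test_idx))
```

Because each label group is split with the same `test_size`, both splits end up with the same share of ham, which is the property checked in the cell above.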

### Create sample data frame and sample rows.

Extract two sample messages that we will use for testing in the functions below.

In [4]:
sample_df = train.sample(2)

sample_row1 = sample_df.iloc[0] # first row of sample_df
sample_row2 = sample_df.iloc[1] # second row of sample_df

sample_message1 = sample_row1.message
sample_message2 = sample_row2.message

print(sample_row1.label, "|", sample_message1)
print(sample_row2.label, "|", sample_message2)

spam | WELL DONE! Your 4* Costa Del Sol Holiday or £5000 await collection. Call 09050090044 Now toClaim. SAE, TCs, POBox334, Stockport, SK38xh, Cost£1.50/pm, Max10mins
ham | What's up my own oga. Left my phone at home and just saw ur messages. Hope you are good. Have a great weekend.


### Tokenize Message

Use http://regex101.com to come up with regular expressions.

In [5]:
def tokenize(msg):
    """
    input: "Change again... It's e one next to escalator..."
    output: ["change", "again", "it's", "one", "next", "to", "escalator"]
    """
    msg_lowered = msg.lower()
    # at least two characters long, cannot start with number
    all_tokens = re.findall(r"\b[a-z][a-z0-9']+\b", msg_lowered)
    return list(set(all_tokens))

tokens1 = tokenize(sample_message1)
tokens2 = tokenize(sample_message2)

print(tokens1)
print(tokens2)

['costa', 'now', 'max10mins', 'call', 'await', 'well', 'sol', 'collection', 'or', 'pobox334', 'cost', 'done', 'sae', 'sk38xh', 'del', 'stockport', 'holiday', 'tcs', 'your', 'toclaim', 'pm']
['and', 'great', 'own', 'are', 'just', 'my', "what's", 'messages', 'up', 'weekend', 'ur', 'phone', 'good', 'at', 'have', 'saw', 'home', 'you', 'oga', 'hope', 'left']
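A few edge cases help show what this pattern keeps and drops. The sketch below repeats the `tokenize` function so it runs on its own; the second and third messages are made up for illustration, and the results are sorted only because a `set` has no guaranteed order.

```python
import re

def tokenize(msg):
    """Same pattern as the cell above: lowercase, 2+ characters, must start with a letter."""
    return list(set(re.findall(r"\b[a-z][a-z0-9']+\b", msg.lower())))

# Single-letter tokens such as "e" are dropped; apostrophes are kept inside words.
tokens_a = sorted(tokenize("Change again... It's e one next to escalator..."))

# Tokens that start with a digit ("87121", "4u") never match.
tokens_b = sorted(tokenize("Text FA to 87121 4u"))

# Repeated words collapse to one token, so only presence is recorded, not counts.
tokens_c = sorted(tokenize("free FREE Free!!"))

print(tokens_a)
print(tokens_b)
print(tokens_c)
```

Dropping duplicates means each message is later represented by word presence (1 or 0) rather than by word counts.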


### Vectorize Message

Walk through the steps of vectorizing a message outside of a function.

In [6]:
token_dict1 = {} # this is a dictionary that looks like {word1: 1, word2: 1, word3: 1}
for token in tokens1:
    token_dict1[token] = 1 
series1 = pd.Series(token_dict1) # convert the dictionary into a series where the row labels are words

# rewrite the same as above using a dict comprehension
series1 = pd.Series({token: 1 for token in tokens1})

token_dict2 = {} # this is a dictionary that looks like {word1: 1, word2: 1, word3: 1}
for token in tokens2:
    token_dict2[token] = 1 
series2 = pd.Series(token_dict2) # convert the dictionary into a series where the row labels are words

# rewrite the same as above using a dict comprehension
series2 = pd.Series({token: 1 for token in tokens2})

print("Sample Message 1:", sample_message1)
print("Tokens 1:", tokens1)
print("Series 1:")
print(series1)
print()
print("Sample Message 2:", sample_message2)
print("Tokens 2:", tokens2)
print("Series 2:")
print(series2)
print()

print("Combine Series 1 and Series 2:")
df2 = pd.DataFrame([series1, series2]) # combine the two series into one data frame, one row per message
df2.fillna(0, inplace=True)
df2

Sample Message 1: WELL DONE! Your 4* Costa Del Sol Holiday or £5000 await collection. Call 09050090044 Now toClaim. SAE, TCs, POBox334, Stockport, SK38xh, Cost£1.50/pm, Max10mins
Tokens 1: ['costa', 'now', 'max10mins', 'call', 'await', 'well', 'sol', 'collection', 'or', 'pobox334', 'cost', 'done', 'sae', 'sk38xh', 'del', 'stockport', 'holiday', 'tcs', 'your', 'toclaim', 'pm']
Series 1:
await         1
call          1
collection    1
cost          1
costa         1
del           1
done          1
holiday       1
max10mins     1
now           1
or            1
pm            1
pobox334      1
sae           1
sk38xh        1
sol           1
stockport     1
tcs           1
toclaim       1
well          1
your          1
dtype: int64

Sample Message 2: What's up my own oga. Left my phone at home and just saw ur messages. Hope you are good. Have a great weekend.
Tokens 2: ['and', 'great', 'own', 'are', 'just', 'my', "what's", 'messages', 'up', 'weekend', 'ur', 'phone', 'good', 'at', 'have', '

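| | and | are | at | await | call | collection | cost | costa | del | done | ... | stockport | tcs | toclaim | up | ur | weekend | well | what's | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |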
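The combined table is a binary matrix: the columns are the union of all tokens, and each cell says whether that token appears in that message. A minimal sketch of the same idea without pandas follows; `build_binary_matrix` and the two short token lists are hypothetical names and data used only for illustration.

```python
def build_binary_matrix(token_lists):
    """Return (vocabulary, rows), where rows[i][j] is 1 if message i contains vocabulary[j]."""
    # The vocabulary is the sorted union of every token seen in any message.
    vocabulary = sorted(set(token for tokens in token_lists for token in tokens))
    rows = []
    for tokens in token_lists:
        present = set(tokens)
        rows.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = build_binary_matrix([
    ["call", "now", "your"],   # toy stand-in for a spam message
    ["hope", "you", "call"],   # toy stand-in for a ham message
])

print(vocabulary)  # ['call', 'hope', 'now', 'you', 'your']
for row in rows:
    print(row)     # [1, 0, 1, 0, 1] then [1, 1, 0, 1, 0]
```

In the pandas version, `fillna(0)` plays the role of the `else 0` branch: a token missing from a message shows up as `NaN` when the two series are combined and is then replaced by 0.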


Repeat the same tokenizing and vectorizing steps as above, this time wrapped in a function.

In [7]:
def vectorize_row(row):
    """
    input: row in data frame with a ".message" attribute
    output: Series whose index labels are the message's tokens and whose values are all 1
    """
    message = row.message
    tokens = tokenize(message)
    vectorized_row = pd.Series({token: 1 for token in tokens})
    return vectorized_row

In [8]:
vectorize_row(sample_row1)

await         1
call          1
collection    1
cost          1
costa         1
del           1
done          1
holiday       1
max10mins     1
now           1
or            1
pm            1
pobox334      1
sae           1
sk38xh        1
sol           1
stockport     1
tcs           1
toclaim       1
well          1
your          1
dtype: int64

In [9]:
vectorize_row(sample_row2)

and         1
are         1
at          1
good        1
great       1
have        1
home        1
hope        1
just        1
left        1
messages    1
my          1
oga         1
own         1
phone       1
saw         1
up          1
ur          1
weekend     1
what's      1
you         1
dtype: int64

### Create Feature Matrix

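This is the input to our Naive Bayes model: one row per message and one column per token, with 1 where the token appears in the message and 0 otherwise.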

In [10]:
def get_feature_matrix(df):
    feature_matrix = df.apply(vectorize_row, axis=1)
    feature_matrix.fillna(0, inplace=True)
    return feature_matrix

In [11]:
get_feature_matrix(sample_df)

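| | and | are | at | await | call | collection | cost | costa | del | done | ... | stockport | tcs | toclaim | up | ur | weekend | well | what's | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1942 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4809 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |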


In [12]:
feature_matrix = get_feature_matrix(train)
feature_matrix.shape

(4457, 7213)

In [13]:
feature_matrix.columns[:50]

Index([u'a21', u'a30', u'aa', u'aah', u'aaniye', u'aaooooright', u'aathi',
       u'ab', u'abbey', u'abdomen', u'abeg', u'abel', u'aberdeen', u'abi',
       u'ability', u'abiola', u'abj', u'able', u'about', u'aboutas', u'above',
       u'abroad', u'absence', u'absolutely', u'absolutly', u'abstract', u'abt',
       u'abta', u'aburo', u'abuse', u'abusers', u'ac', u'academic', u'acc',
       u'accent', u'accenture', u'accept', u'access', u'accessible',
       u'accidant', u'accident', u'accidentally', u'accommodation',
       u'accommodationvouchers', u'accordin', u'accordingly', u'account',
       u'account's', u'accounting', u'accounts'],
      dtype='object')

In [14]:
feature_matrix.columns[-50:]

Index([u'yet', u'yetty's', u'yetunde', u'yhl', u'yi', u'yijue', u'ym', u'ymca',
       u'yo', u'yoga', u'yogasana', u'yor', u'yorge', u'you', u'you'd',
       u'you'ld', u'you'll', u'you're', u'you've', u'youdoing', u'young',
       u'younger', u'your', u'your's', u'youre', u'yourinclusive', u'yourjob',
       u'yours', u'yourself', u'youuuuu', u'yowifes', u'yr', u'yrs',
       u'ystrday', u'yummmm', u'yummy', u'yun', u'yunny', u'yuo', u'yuou',
       u'yup', u'yupz', u'zac', u'zealand', u'zed', u'zhong', u'zoe',
       u'zogtorius', u'zoom', u'zouk'],
      dtype='object')

In [15]:
feature_matrix.head()

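| | a21 | a30 | aa | aah | aaniye | aaooooright | aathi | ab | abbey | abdomen | ... | yup | yupz | zac | zealand | zed | zhong | zoe | zogtorius | zoom | zouk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 889 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1202 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 212 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3752 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5554 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |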


### Calculate Feature Probabilities (Train/Fit Model)

In [16]:
def get_conditional_probability_for_word(col):
    return col.sum() / len(col)
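Since every column holds only 0s and 1s, the sum divided by the length is simply the fraction of messages that contain the word. A small worked example, using a made-up column and a hypothetical helper name `fraction_containing`:

```python
def fraction_containing(column):
    """Fraction of messages (rows) in which the word is present (value 1)."""
    return sum(column) / float(len(column))

# Toy column: the word appears in 2 of 5 messages.
word_column = [1, 0, 0, 1, 0]
estimate = fraction_containing(word_column)
print(estimate)  # 0.4

# A word seen in exactly 1 of 598 messages gets about 0.001672,
# which matches the smallest nonzero spam values shown later in this notebook.
one_in_598 = fraction_containing([1] + [0] * 597)
print(round(one_in_598, 6))
```

This is the plain relative frequency with no smoothing, so a word that never appears in a class gets a probability of exactly 0 for that class.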

In [17]:
def get_feature_prob(feature_matrix):
    
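    # df holds every message (train and test), so select labels by the row index
    # of feature_matrix to make the masks cover exactly its rows
    labels = df.label.loc[feature_matrix.index]
    spam_boolean_mask = (labels == "spam")
    ham_boolean_mask = (labels == "ham")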
    
    # Explanation for "confusing" syntax:
    # http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
    
    feature_matrix_spam = feature_matrix.loc[spam_boolean_mask, :] # get all rows for spam boolean mask
    feature_matrix_ham = feature_matrix.loc[ham_boolean_mask, :] # get all rows for ham boolean mask
    
    # Indexing reminders for a data frame called mymatrix:
    # mymatrix.iloc[:, 0] gets the first column
    # mymatrix.iloc[:, 1] gets the second column
    
    # mymatrix.iloc[0, :] gets the first row
    # mymatrix.iloc[1, :] gets the second row
    
    # mymatrix.loc[boolean_mask, :] gets the rows where boolean_mask is True
    
    feature_prob_spam = feature_matrix_spam.apply(get_conditional_probability_for_word, axis=0)
    feature_prob_ham = feature_matrix_ham.apply(get_conditional_probability_for_word, axis=0)
    
    feature_prob = pd.concat([feature_prob_spam, feature_prob_ham], axis=1)
    feature_prob.columns = ['spam', 'ham']
    
    return feature_prob

In [18]:
feature_prob = get_feature_prob(feature_matrix)
feature_prob.shape

(7213, 2)

In [19]:
feature_prob.head()

| | spam | ham |
|---|---|---|
| a21 | 0.001672 | 0.0 |
| a30 | 0.0 | 0.000259 |
| aa | 0.0 | 0.000259 |
| aah | 0.0 | 0.000518 |
| aaniye | 0.0 | 0.000259 |
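The table pairs two estimates per word, one computed from spam messages only and one from ham messages only. A minimal sketch of that per-class calculation without pandas is shown here; `class_word_probabilities` and the six toy messages are assumptions made up for illustration, not the notebook's data.

```python
from collections import Counter, defaultdict

def class_word_probabilities(labeled_tokens):
    """labeled_tokens: iterable of (label, tokens). Returns {word: {label: P(word | label)}}."""
    class_sizes = Counter()             # number of messages per label
    word_counts = defaultdict(Counter)  # per label: number of messages containing each word
    for label, tokens in labeled_tokens:
        class_sizes[label] += 1
        word_counts[label].update(set(tokens))  # count each word once per message

    vocabulary = sorted(set(w for counts in word_counts.values() for w in counts))
    return {
        word: {
            label: word_counts[label][word] / float(class_sizes[label])
            for label in class_sizes
        }
        for word in vocabulary
    }

toy_messages = [
    ("spam", ["free", "call", "now"]),
    ("spam", ["call", "txt"]),
    ("ham", ["call", "you"]),
    ("ham", ["you", "home"]),
    ("ham", ["home", "now"]),
    ("ham", ["ok", "you"]),
]
probs = class_word_probabilities(toy_messages)

print(probs["call"])  # spam: 2 of 2 messages, ham: 1 of 4 messages
print(probs["free"])  # spam: 1 of 2 messages, ham: 0 of 4 messages
```

As in the real table, a word that never occurs in one class ("free" in ham here) gets an estimate of exactly 0 for that class.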


### Analyze Feature Probabilities in Classifier

Words with the largest conditional probability given that a message is spam.

P(w_i | y = "spam")

In [20]:
feature_prob.sort_values(by='spam', ascending=False).head(10)

| | spam | ham |
|---|---|---|
| to | 0.625418 | 0.253693 |
| call | 0.431438 | 0.047162 |
| your | 0.322742 | 0.073594 |
| you | 0.311037 | 0.276237 |
| now | 0.252508 | 0.060897 |
| for | 0.244147 | 0.094325 |
| or | 0.242475 | 0.045608 |
| free | 0.227425 | 0.013993 |
| the | 0.219064 | 0.182172 |
| txt | 0.204013 | 0.00285 |


Words with the smallest conditional probability given that a message is ham.

P(w_i | y = "ham")

In [21]:
feature_prob.sort_values(by='ham', ascending=True).head(10)

| | spam | ham |
|---|---|---|
| a21 | 0.001672 | 0 |
| lastest | 0.001672 | 0 |
| largest | 0.006689 | 0 |
| large | 0.001672 | 0 |
| landmark | 0.001672 | 0 |
| landlines | 0.003344 | 0 |
| land | 0.016722 | 0 |
| la32wu | 0.001672 | 0 |
| la3 | 0.001672 | 0 |
| la1 | 0.001672 | 0 |


**Key Takeaway**: Each conditional probability is estimated from one class at a time, so the words with the largest probabilities are often common stop words such as "to", "you", and "the". Because those words are frequent in both classes, their effects largely cancel out and they do not push a prediction strongly one way or the other. The words with the smallest probability under the opposite class are more telling: in this case, the words least likely to appear in "ham" turn out to be highly predictive of spam.
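One way to see the "cancelling out" is to compare the two columns directly. The sketch below takes a few rows from the spam top-10 table above and computes the ratio P(w | spam) / P(w | ham); using this ratio as a measure of how strongly a word points to spam is an illustrative choice made here, not a step from the notebook.

```python
# (P(word | spam), P(word | ham)) copied from the top-10 spam table above.
feature_prob_sample = {
    "to":   (0.625418, 0.253693),
    "call": (0.431438, 0.047162),
    "you":  (0.311037, 0.276237),
    "free": (0.227425, 0.013993),
    "the":  (0.219064, 0.182172),
    "txt":  (0.204013, 0.00285),
}

# A ratio near 1 means the word is about equally likely in both classes.
ratios = {word: spam / ham for word, (spam, ham) in feature_prob_sample.items()}
ranked = sorted(ratios, key=ratios.get, reverse=True)

for word in ranked:
    print("%-5s %.2f" % (word, ratios[word]))
```

"you" and "the" have ratios close to 1 despite their high spam probabilities, while "txt" and "free" stand out. Words with a ham probability of exactly 0, such as "a21", are left out here because the ratio would divide by zero.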

In [22]:
df[df.message.str.contains("a21", case=False)]

| | label | message |
|---|---|---|
| 1673 | spam | URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050001295 from land line. Claim A21. Valid 12hrs only |


In [23]:
df[df.message.str.contains("landmark", case=False)]

| | label | message |
|---|---|---|
| 4373 | spam | Ur balance is now £600. Next question: Complete the landmark, Big, A. Bob, B. Barry or C. Ben ?. Text A, B or C to 83738. Good luck! |


In [24]:
df[df.message.str.contains("landlines", case=False)]

| | label | message |
|---|---|---|
| 3998 | spam | Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines! |
| 4864 | spam | Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines! |
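Note that `str.contains` does substring matching, so a lookup for a short token such as "land" would also return messages that only contain "landlines" or "landmark". A minimal sketch of whole-word matching with the standard `re` module follows; `contains_token` is a hypothetical helper written for this example. Since `str.contains` treats its pattern as a regular expression by default, the same `\b...\b` pattern could presumably be passed to it as well.

```python
import re

def contains_token(message, token):
    """True if token occurs in message as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(token) + r"\b"
    return re.search(pattern, message, flags=re.IGNORECASE) is not None

# Two messages copied from the outputs above.
msg_landlines = ("Bored housewives! Chat n date now! 0871750.77.11! "
                 "BT-national rate 10p/min only from landlines!")
msg_land_line = ("URGENT! We are trying to contact U. Todays draw shows that you have "
                 "won a £800 prize GUARANTEED. Call 09050001295 from land line. "
                 "Claim A21. Valid 12hrs only")

print("land" in msg_landlines.lower())          # True: substring match
print(contains_token(msg_landlines, "land"))    # False: "land" is not a whole word here
print(contains_token(msg_land_line, "land"))    # True
print(contains_token(msg_land_line, "a21"))     # True, case is ignored
```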


### Predict Test Data

In [25]:
# ....