### Guided Project: Building a Spam Filter with Naive Bayes
 
In this guided project, we put the Naive Bayes algorithm into practice by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we saw in the previous lesson that the computer:

1. Learns how humans classify messages.
2. Uses that human knowledge to estimate two probabilities for each new message: the probability that it is spam and the probability that it is non-spam.
3. Classifies a new message by comparing these two values: if the probability for spam is greater, it classifies the message as spam; if the probability for non-spam is greater, it classifies it as non-spam. If the two values are equal, a human may need to classify the message.
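
The decision rule in step 3 can be sketched as a small function. The name `decide` and the example probability values are illustrative assumptions, not part of the project code.

```python
def decide(p_spam_given_message, p_ham_given_message):
    """Return a label by comparing the two probability values."""
    if p_spam_given_message > p_ham_given_message:
        return 'spam'
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    # Equal values: defer to a human, as described in step 3.
    return 'needs human classification'


print(decide(0.002, 0.0005))
print(decide(0.0001, 0.0001))
```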

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=['Label', 'SMS'])

In [3]:
df.head()

,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
df.shape

(5572, 2)

In [5]:
df['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

In [6]:
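train = df.sample(frac=0.8, random_state=1).reset_index(drop=True)
# Caution: reusing random_state=1 draws the test rows from the same shuffled order,
# so they overlap with train (compare the two head() outputs below).
test = df.sample(frac=0.2, random_state=1).reset_index(drop=True)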

In [7]:
train.shape

(4458, 2)

In [8]:
test.shape

(1114, 2)
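
Sampling twice with the same seed can select the same rows for both sets. One way to keep training and test rows disjoint is to shuffle once and slice. The sketch below uses a small hypothetical DataFrame (`toy` is not the SMS data), and the 80/20 cut mirrors the fractions used in this project.

```python
import pandas as pd

# Hypothetical toy data standing in for the SMS DataFrame.
toy = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham', 'ham', 'spam',
              'ham', 'ham', 'ham', 'spam', 'ham'],
    'SMS': ['msg {}'.format(i) for i in range(10)],
})

# Shuffle every row once, then cut at the 80% mark.
shuffled = toy.sample(frac=1, random_state=1)
cut = round(len(shuffled) * 0.8)
toy_train = shuffled.iloc[:cut]
toy_test = shuffled.iloc[cut:]

# No original row index appears in both parts.
overlap = set(toy_train.index) & set(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))

# Reset the indexes only after checking for overlap.
toy_train = toy_train.reset_index(drop=True)
toy_test = toy_test.reset_index(drop=True)
```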

In [9]:
train['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [10]:
test['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

In [11]:
train.head(5)

,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [12]:
test.head(5)

,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [13]:
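# Data cleaning: replace punctuation with spaces and convert to lowercase.
# regex=True is needed in pandas 2.0+, where str.replace defaults to literal matching.
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()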

In [14]:
train.head(5)

,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
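
As a minimal sketch of the cleaning and tokenizing steps on a few messages taken from the tables above (the variable names `sms`, `cleaned`, and `tokens` are illustrative):

```python
import pandas as pd

sms = pd.Series([
    "Yep, by the pretty sculpture",
    "Havent.",
    "Ok lar... Joking wif u oni...",
])

# \W matches any character that is not a letter, digit, or underscore.
# regex=True tells pandas to treat the pattern as a regular expression.
cleaned = sms.str.replace(r'\W', ' ', regex=True).str.lower()

# split() with no arguments collapses runs of whitespace.
tokens = cleaned.str.split()
print(tokens.tolist())
```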


In [15]:
train['SMS'] = train['SMS'].str.split()

In [16]:
vocabulary = []
for row in train['SMS']:
    for word in row:
        vocabulary.append(word)

print("total words in vocabulary before deduping: " + str(len(vocabulary)))
vocabulary = set(vocabulary)
print("total words in vocabulary after deduping: " + str(len(vocabulary)))
print("converting set back to list")
vocabulary = list(vocabulary)

total words in vocabulary before deduping: 72427
total words in vocabulary after deduping: 7783
converting set back to list
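
The same vocabulary could be built more compactly with a comprehension and a set. This is a sketch on hypothetical toy messages; sorting the result is an added choice that makes the word order reproducible, which a plain `list(set(...))` does not guarantee.

```python
# Hypothetical tokenized messages.
tokenized = [
    ['yep', 'by', 'the'],
    ['by', 'the', 'way'],
    ['the', 'end'],
]

# Flatten all messages into one list of words, then deduplicate.
all_words = [word for message in tokenized for word in message]
toy_vocabulary = sorted(set(all_words))

print(len(all_words), len(toy_vocabulary))
print(toy_vocabulary)
```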


In [17]:
vocabulary[:5]

['83355', 'dsn', 'floor', 'gr8', 'helpline']

In [18]:
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

In [19]:
for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [20]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head(5)

,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
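
An alternative way to build the same kind of count table is `collections.Counter`, which counts the words of one message at a time. The toy messages and the names `counters` and `toy_counts` below are assumptions for illustration only.

```python
from collections import Counter

import pandas as pd

# Hypothetical tokenized messages.
tokenized = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
    ['secret', 'secret', 'code'],
]
toy_vocabulary = sorted({word for message in tokenized for word in message})

# One Counter per message; a Counter returns 0 for words it has not seen.
counters = [Counter(message) for message in tokenized]
toy_counts = pd.DataFrame(
    [[counter[word] for word in toy_vocabulary] for counter in counters],
    columns=toy_vocabulary,
)
print(toy_counts)
```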


In [21]:
train_final = pd.concat([train, word_counts], axis=1)

In [22]:
train_final[['Label', 'SMS', 'yep']].head(5)

,Label,SMS,yep
0,ham,"[yep, by, the, pretty, sculpture]",1
1,ham,"[yes, princess, are, you, going, to, make, me,...",0
2,ham,"[welp, apparently, he, retired]",0
3,ham,[havent],0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0


In [23]:
# Laplace smoothing parameter, vocabulary size, and the class priors p_spam and p_ham
alpha = 1
n_vocab = len(vocabulary)
train_size = len(train_final)
train_spam = train_final[train_final['Label'] == 'spam']
train_ham = train_final[train_final['Label'] == 'ham']
p_spam = len(train_spam) / train_size
p_ham = len(train_ham) / train_size

In [24]:
# train_spam and train_ham were defined in the previous cell.
# Total number of words across all spam messages and all ham messages.
n_spam = train_spam['SMS'].apply(len).sum()
n_ham = train_ham['SMS'].apply(len).sum()

In [25]:
print('n_spam: ' + str(n_spam))
print('n_ham: ' + str(n_ham))
print('n_vocab: ' + str(n_vocab))


n_spam: 15190
n_ham: 57237
n_vocab: 7783
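
For reference, the quantities computed so far plug into the Naive Bayes formulas as written below. The symbol names are a notational assumption chosen to mirror the code: $N_{\text{Spam}}$ is `n_spam`, $N_{\text{Vocabulary}}$ is `n_vocab`, $\alpha$ is `alpha`, and $N_{w_i \mid \text{Spam}}$ is the number of times word $w_i$ occurs in spam messages. The ham formulas are the same with Ham in place of Spam.

$$P(\text{Spam} \mid w_1, \ldots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})$$

$$P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}$$

With the values printed above and $\alpha = 1$, the spam denominator is $15190 + 7783 = 22973$ and the ham denominator is $57237 + 7783 = 65020$.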


In [26]:
p_spam_dict = {unique_word: 0 for unique_word in vocabulary}
p_ham_dict = {unique_word: 0 for unique_word in vocabulary}

In [27]:
for word in vocabulary:
    n_word_given_spam = train_spam.loc[:, word].sum()
    n_word_given_ham = train_ham.loc[:, word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + (alpha * n_vocab))
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + (alpha * n_vocab))
    p_spam_dict[word] = p_word_given_spam
    p_ham_dict[word] = p_word_given_ham
    

In [28]:
p_spam_dict['hi']

0.0006964697688590954
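
The smoothing step can be isolated in a small helper to see where this number comes from. The function name `smoothed_probability` is hypothetical, and the count of 15 occurrences of "hi" in spam is inferred from the printed value, since the notebook does not show it directly.

```python
def smoothed_probability(word_count, n_class_words, n_vocab, alpha=1):
    """Additive (Laplace) smoothing: (count + alpha) / (N_class + alpha * N_vocab)."""
    return (word_count + alpha) / (n_class_words + alpha * n_vocab)


# n_spam = 15190 and n_vocab = 7783 are the values printed earlier.
p_hi = smoothed_probability(15, 15190, 7783)
print(p_hi)

# A word that never occurs in spam still gets a small non-zero probability,
# so one unseen word cannot zero out the whole product.
p_unseen = smoothed_probability(0, 15190, 7783)
print(p_unseen)
```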

In [29]:
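# classify function
import re

def classify(message):
    # Apply the same cleaning that was applied to the training data.
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the class priors. Each known word then multiplies in
    # P(word|Spam) or P(word|Ham). The results are proportional to
    # P(Spam|message) and P(Ham|message), which is enough to compare them.
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham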
    
    for word in message:
        if word in p_spam_dict:
            p_spam_given_message *= p_spam_dict[word]
        if word in p_ham_dict:
            p_ham_given_message *= p_ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'
        
        

In [30]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

'spam'

In [31]:
classify("Sounds good, Tom, then see u there")

'ham'
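
Multiplying many small probabilities can underflow to 0.0 for long messages. A common variant, not used in this notebook, sums logarithms instead, which preserves the comparison because the logarithm is monotonic. The sketch below uses hypothetical toy priors and word probabilities, and `classify_log` is an illustrative name.

```python
import math
import re

# Hypothetical toy parameters, not the values learned from the SMS data.
toy_p_spam, toy_p_ham = 0.4, 0.6
toy_spam_dict = {'secret': 0.2, 'code': 0.1, 'see': 0.02}
toy_ham_dict = {'secret': 0.05, 'code': 0.05, 'see': 0.2}


def classify_log(message):
    words = re.sub(r'\W', ' ', message).lower().split()

    # Sum log-probabilities instead of multiplying probabilities.
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in toy_spam_dict:
            log_spam += math.log(toy_spam_dict[word])
        if word in toy_ham_dict:
            log_ham += math.log(toy_ham_dict[word])

    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'


print(classify_log('Secret code!'))
print(classify_log('See you'))
```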

In [32]:
test['predicted'] = test['SMS'].apply(classify)
test.head(10)

,Label,SMS,predicted
0,ham,"Yep, by the pretty sculpture",ham
1,ham,"Yes, princess. Are you going to make me moan?",ham
2,ham,Welp apparently he retired,ham
3,ham,Havent.,ham
4,ham,I forgot 2 ask ü all smth.. There's a card on ...,ham
5,ham,Ok i thk i got it. Then u wan me 2 come now or...,ham
6,ham,I want kfc its Tuesday. Only buy 2 meals ONLY ...,ham
7,ham,No dear i was sleeping :-P,ham
8,ham,Ok pa. Nothing problem:-),ham
9,ham,Ill be there on &lt;#&gt; ok.,ham


In [38]:
correct = (test['Label'] == test['predicted']).sum()
total = len(test)
accuracy = correct * 100.0 / total

print('correct: {}'.format(correct))
print('accuracy is {:.2f}%'.format(accuracy))


correct: 1106
accuracy is 99.28%


In [34]:
incorrect = (test['Label'] != test['predicted']).sum()


In [35]:
incorrect

8
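
With classes as imbalanced as these (roughly 87% ham), a single accuracy figure does not show which kind of error was made. One way to break the errors down is a confusion table built with `pd.crosstab`. The labels and predictions below are hypothetical toy values, not results from this notebook.

```python
import pandas as pd

# Hypothetical true labels and predictions.
toy = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham', 'spam'],
    'predicted': ['ham', 'ham', 'spam', 'ham',  'ham', 'spam'],
})

# Rows are the true labels, columns are the predicted labels.
confusion = pd.crosstab(toy['Label'], toy['predicted'])
print(confusion)

# The mean of a boolean Series is the fraction of matches.
toy_accuracy = (toy['Label'] == toy['predicted']).mean()
print('accuracy is {:.2f}%'.format(toy_accuracy * 100))
```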