# Teaching a Computer to Classify Messages

We will use the multinomial Naive Bayes algorithm to build a spam filter.
The dataset contains 5,572 SMS messages.

The dataset is hosted by the UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

It can be downloaded from:
https://dq-content.s3.amazonaws.com/433/SMSSpamCollection

In [1]:
import pandas as pd

# The file is tab-separated, so use sep='\t'.
# It has no header row, so use header=None and assign column labels afterwards.
# 'spam' marks spam messages.
# 'ham' marks non-spam messages.

spamdataset = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
names = ['Label', 'SMS']
spamdataset.columns = names

# Number of rows = 5572 and number of columns = 2
print(spamdataset.shape)

# Percentage of each label:
# about 87% of the messages are non-spam (ham)
# about 13% of the messages are spam

print(spamdataset['Label'].value_counts(normalize=True) * 100)

(5572, 2)
ham     86.593683
spam    13.406317
Name: Label, dtype: float64


#  Training and Test Set

Before building the spam filter, it's very helpful to think of a way of testing how well it works. When creating software (a spam filter is software), a good rule of thumb is that designing the test comes before creating the software. If we write the software first, then it's tempting to come up with a biased test just to make sure the software passes it.

Once our spam filter is done, we'll need to test how good it is with classifying new messages. To test the spam filter, we're first going to split our dataset into two categories:

 - A __training set__, which we'll use to "train" the computer how to classify messages.
 - A __test set__, which we'll use to test how good the spam filter is with classifying new messages.


For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%. In other words, we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).


In [2]:
# Use frac=1 to sample (and so shuffle) the entire dataset; random_state=1 makes the shuffle reproducible
random_sample = spamdataset.sample(random_state=1, frac=1)
print(random_sample.head(5))

# Calculate the row index at which to split
training_split_index = round(len(random_sample) * 0.8)
print(training_split_index)

# Split the dataset in two: the first 80% of the shuffled rows go to training
training_sample = random_sample.iloc[:training_split_index].reset_index(drop=True)

# The remaining 20% go to testing
testing_sample = random_sample.iloc[training_split_index:].reset_index(drop=True)

# Percentage of each label in the training and testing sets
print(training_sample['Label'].value_counts(normalize=True)*100)
print(testing_sample['Label'].value_counts(normalize=True)*100)


     Label                                                SMS
1078   ham                       Yep, by the pretty sculpture
4028   ham      Yes, princess. Are you going to make me moan?
958    ham                         Welp apparently he retired
4642   ham                                            Havent.
4674   ham  I forgot 2 ask ü all smth.. There's a card on ...
4458
ham     86.54105
spam    13.45895
Name: Label, dtype: float64
ham     86.804309
spam    13.195691
Name: Label, dtype: float64


The percentages of spam and ham in the training and test sets are similar to those in the full dataset.

# Data Cleaning

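To calculate all the probabilities required by the algorithm, we first need to do a bit of data cleaning to bring the data into a format that lets us easily extract all the information we need.

We start by removing punctuation and converting every message to lowercase. In later steps each message is split into words, and each word in the vocabulary becomes its own column.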
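
Before applying the cleaning to the whole column, it can help to try it on a single string. The sketch below is illustrative only: `clean_message` is a hypothetical helper and the sample text is made up, but the regular expression is the same `\W` pattern used in the next cell.

```python
import re

def clean_message(text):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r'\W', ' ', text).lower()

# Made-up sample message, not taken from the dataset
sample = "Yes, princess. Are you FREE?"
cleaned = clean_message(sample)

print(repr(cleaned))    # punctuation becomes spaces, so some gaps are doubled
print(cleaned.split())  # split() with no argument collapses runs of whitespace
```

The doubled spaces left behind by punctuation are harmless, because the later `str.split()` call treats any run of whitespace as a single separator.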




In [3]:
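## Cleaning: replace every non-word character with a space, then lowercase

print(training_sample['SMS'].head(5))
# regex=True is needed because recent pandas versions treat the pattern as a literal string by default
training_sample['SMS'] = training_sample['SMS'].str.replace(r'\W', ' ', regex=True)
training_sample['SMS'] = training_sample['SMS'].str.lower()
print(training_sample['SMS'].head(5))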

0                         Yep, by the pretty sculpture
1        Yes, princess. Are you going to make me moan?
2                           Welp apparently he retired
3                                              Havent.
4    I forgot 2 ask ü all smth.. There's a card on ...
Name: SMS, dtype: object
0                         yep  by the pretty sculpture
1        yes  princess  are you going to make me moan 
2                           welp apparently he retired
3                                              havent 
4    i forgot 2 ask ü all smth   there s a card on ...
Name: SMS, dtype: object


# Creating the Vocabulary

Collect every word in the SMS column into a list, then convert the list to a set to remove duplicates.
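
As a minimal sketch of the same idea on toy data (the two messages below are made up and already split into words), a set comprehension can collect the unique words in one step. Sorting is optional and is done here only to make the printed result deterministic.

```python
# Toy messages, already cleaned and split into words (hypothetical data)
messages = [
    ['free', 'prize', 'call', 'now'],
    ['call', 'me', 'now'],
]

# A set keeps only one copy of each word; sorted() returns a list
vocabulary = sorted({word for sms in messages for word in sms})

print(len(vocabulary))
print(vocabulary)
```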

In [4]:
# Split each message in the SMS column into a list of words
training_sample['SMS'] = training_sample['SMS'].str.split()
print(training_sample['SMS'].head(5))

vocabulary = []

# Iterate through the SMS column and collect every word
for items in training_sample['SMS']:
    for word in items:
        vocabulary.append(word)

# Total number of words, duplicates included
print(len(vocabulary))

# Transform the list into a set to remove duplicates, then back into a list
vocabularyset = set(vocabulary)
vocabulary = list(vocabularyset)

print("After Removing dups" , len(vocabulary))

0                    [yep, by, the, pretty, sculpture]
1    [yes, princess, are, you, going, to, make, me,...
2                      [welp, apparently, he, retired]
3                                             [havent]
4    [i, forgot, 2, ask, ü, all, smth, there, s, a,...
Name: SMS, dtype: object
72427
After Removing dups 7783


# The Final Training Set
To create the dictionary we need for our training set, we can use the code below, where:

- We start by initializing a dictionary named __word_counts_per_sms__, where each key is a unique word (a string) from the vocabulary, and each value is a list with the same length as the training set, where each element in the list is a 0.
    - The code __[0] * 5__ outputs __[0, 0, 0, 0, 0]__. So the code __[0] * len(training_sample['SMS'])__ outputs a list of the length of __training_sample['SMS']__, where each element in the list is a 0.
- We loop over __training_sample['SMS']__, using the enumerate() function to get both the index and the SMS message (index and sms).
    - Using a nested loop, we loop over sms (where sms is a list of strings, where each string represents a word in a message).
        - We increment word_counts_per_sms[word][index] by 1.
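
The steps above can be traced on a small example. The three messages below are made up for illustration; the dictionary construction and the nested loop mirror the ones used in the next cell.

```python
# Toy messages, already cleaned and split into words (hypothetical data)
messages = [
    ['free', 'prize', 'free'],
    ['call', 'me', 'now'],
    ['free', 'call'],
]
vocabulary = sorted({word for sms in messages for word in sms})

# One list of zeros per word, with one slot per message
word_counts_per_sms = {unique_word: [0] * len(messages) for unique_word in vocabulary}

# Count how many times each word occurs in each message
for index, sms in enumerate(messages):
    for word in sms:
        word_counts_per_sms[word][index] += 1

for word in vocabulary:
    print(word, word_counts_per_sms[word])
```

Because the expression `[0] * len(messages)` is evaluated once per key inside the comprehension, every word gets its own list, so incrementing one word's counts does not affect another's.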


    

In [5]:
word_counts_per_sms = {unique_word: [0] * len(training_sample['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_sample['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [6]:
## Transform word count per sms into Dataframe

print("done")
word_count = pd.DataFrame(word_counts_per_sms)
print(word_count['ticket'].head())



done
0    0
1    0
2    0
3    0
4    0
Name: ticket, dtype: int64


In [7]:
training_set_clean = pd.concat([training_sample, word_count], axis=1)
training_set_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [8]:
print(len(training_sample['SMS']))

4458


# Calculating Constants First

Some of the terms in the multinomial Naive Bayes equations have the same value for every new message. As a start, let's first calculate:

- P(Spam) and P(Ham)
- $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$
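
For reference, the equations are not written out in this notebook. Assuming the standard multinomial Naive Bayes formulation with additive smoothing, which is what the code in the following cells computes, they take this form:

$$P(Spam \mid w_1, w_2, \ldots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)$$

$$P(Ham \mid w_1, w_2, \ldots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)$$

$$P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

$$P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$

Here $N_{w_i \mid Spam}$ is the number of times the word $w_i$ occurs in spam messages, $N_{Spam}$ is the total number of words in spam messages, $N_{Vocabulary}$ is the number of unique words in the training set, and $\alpha$ is the smoothing parameter. The Ham terms are defined in the same way.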

In [14]:
# Split the training set by label, as a first step towards P(Spam) and P(Ham)

spam_dataset  = training_set_clean[training_set_clean['Label'] == 'spam']
ham_dataset = training_set_clean[training_set_clean['Label'] == 'ham']

print("Spam DataSet ", spam_dataset.shape)
print("Non Spam DataSet ", ham_dataset.shape)

print("Length Spam DataSet ", len(spam_dataset))
print("Length Non Spam DataSet ", len(ham_dataset))
print("Length of Training Dataset", len(training_set_clean))

Spam DataSet  (600, 7785)
Non Spam DataSet  (3858, 7785)
Length Spam DataSet  600
Length Non Spam DataSet  3858
Length of Training Dataset 4458


In [15]:
### P(Spam) and P(Ham): share of spam and non-spam messages in the training set
p_spam = len(spam_dataset) / len(training_set_clean)
p_ham = len(ham_dataset) / len(training_set_clean)

print ("Probability of Spam  : ", p_spam)
print ("Probability of non Spam  : ", p_ham)


Probability of Spam  :  0.13458950201884254
Probability of non Spam  :  0.8654104979811574


In [11]:
alpha = 1  # additive (Laplace) smoothing parameter
n_vocabulary = len(vocabulary)  # number of unique words in the training set
print("Length of Vocabulary  : ", n_vocabulary)

Length of Vocabulary  :  7783


In [17]:
## n_spam: total number of words in all spam messages
no_of_words = spam_dataset['SMS'].apply(len)
n_spam = no_of_words.sum()
print("n_spam : ", n_spam )

## n_ham: total number of words in all non-spam messages
no_of_words_ham = ham_dataset['SMS'].apply(len)
n_ham = no_of_words_ham.sum()
print("n_ham : ", n_ham )


n_spam :  15190
n_ham :  57237


# Calculating Parameters

Now that we have the constant terms calculated above, we can move on to calculating the parameters $P(w_i \mid Spam)$ and $P(w_i \mid Ham)$. Each parameter is a conditional probability value associated with one word in the vocabulary.
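
To see what the smoothing does, here is a small worked sketch. It reuses the values of `n_spam`, `n_vocabulary`, and `alpha` printed in the cells above; the helper name `p_word_given_spam_for` and the word count of 10 are hypothetical and chosen only for illustration.

```python
alpha = 1
n_spam = 15190        # total words in spam messages (from the output above)
n_vocabulary = 7783   # unique words in the training set (from the output above)

def p_word_given_spam_for(n_word_given_spam):
    """Smoothed P(word|Spam) for a word seen n_word_given_spam times in spam."""
    return (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)

# A word that never occurs in spam still gets a small non-zero probability
print(p_word_given_spam_for(0))

# A word seen 10 times in spam (hypothetical count)
print(p_word_given_spam_for(10))
```

Without the `alpha` term, a single word that never appeared in spam would make the whole product for P(Spam|message) equal to zero, regardless of the other words in the message.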


In [None]:
# Initialize two dictionaries, one per class, mapping each word to its conditional probability

parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

for word in vocabulary:
    # P(word|Spam) with additive smoothing
    n_word_given_spam = spam_dataset[word].sum()   # occurrences of the word in spam messages
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam

    # P(word|Ham) with additive smoothing
    n_word_given_ham = ham_dataset[word].sum()   # occurrences of the word in non-spam messages
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham


# Classifying A New Message

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn)
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn)
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.
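
One practical caveat of the comparison described above is that it multiplies many small probabilities, which for long messages can shrink towards zero in floating point. The sketch below shows a common variant that sums logarithms instead; it is not the approach used in this notebook. The function `classify_log`, the priors, and the parameter dictionaries are all hypothetical toy values, not numbers from the dataset.

```python
import math
import re

# Hypothetical toy priors and parameters, for illustration only
p_spam, p_ham = 0.5, 0.5
parameters_spam = {'free': 0.05, 'prize': 0.04, 'call': 0.02, 'tomorrow': 0.001}
parameters_ham = {'free': 0.005, 'prize': 0.001, 'call': 0.02, 'tomorrow': 0.03}

def classify_log(message):
    """Classify a message by comparing log-probabilities instead of products."""
    words = re.sub(r'\W', ' ', message).lower().split()

    # log(a * b) = log(a) + log(b), so products become sums
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)

    for word in words:
        # Words outside the vocabulary are ignored
        if word in parameters_spam:
            log_spam += math.log(parameters_spam[word])
        if word in parameters_ham:
            log_ham += math.log(parameters_ham[word])

    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log('FREE prize!!'))
print(classify_log('Call tomorrow?'))
```

Because the logarithm is strictly increasing, comparing the two sums gives the same ordering as comparing the two products.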

In [26]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()

    # Start from the priors P(Spam) and P(Ham); each word found in the
    # vocabulary then multiplies in its conditional probability.
    # Words that are not in the vocabulary are ignored.
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    print(p_spam)
    print(p_ham)
    print(len(parameters_spam))
    print(len(parameters_ham))
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
        


In [27]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')


0.13458950201884254
0.8654104979811574
7783
7783
P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [28]:
classify('Sounds good, Tom, then see u there')

0.13458950201884254
0.8654104979811574
7783
7783
P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


# Measuring the Spam Filter's Accuracy

We'll change the __classify()__ function that we wrote previously so that it returns the labels instead of printing them. Below, note that we now have return statements instead of print() calls:
 

In [29]:
def classify_test_set(message):    
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'
    
    
    

In [33]:
## Now that we have a function that returns labels instead of 
## printing them, we can use it to create a new column in our test set.

print(testing_sample.shape)

testing_sample['predict_msg'] = testing_sample['SMS'].apply(classify_test_set)
print(testing_sample.head())


(1114, 3)
  Label                                                SMS predict_msg
0   ham          Later i guess. I needa do mcat study too.         ham
1   ham             But i haf enuff space got like 4 mb...         ham
2  spam  Had your mobile 10 mths? Update to latest Oran...        spam
3   ham  All sounds good. Fingers . Makes it difficult ...         ham
4   ham  All done, all handed in. Don't know if mega sh...         ham


#### Now we'll measure the accuracy of our spam filter by comparing the predicted labels with the actual labels.



In [39]:
total = testing_sample.shape[0]
correct = 0

for row in testing_sample.iterrows():
    row = row[1]
    if (row['Label'] == row['predict_msg']):
        correct += 1

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)
    

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833


The accuracy is about 98.74%, well above our goal of 80%. Our spam filter looked at 1,114 messages that it had not seen in training and classified 1,100 of them correctly.
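
The accuracy calculation can also be wrapped in a small reusable function. The sketch below is one possible way to do it; the name `accuracy` and the toy labels are hypothetical and are not part of the notebook's cells.

```python
def accuracy(actual, predicted):
    """Return the fraction of positions where the two label sequences agree."""
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    if len(actual) == 0:
        raise ValueError('cannot compute accuracy of an empty sequence')
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Toy labels (hypothetical): three of the four predictions match
actual = ['ham', 'spam', 'ham', 'ham']
predicted = ['ham', 'spam', 'spam', 'ham']
print(accuracy(actual, predicted))  # 0.75
```

The same function should also accept the two columns of the test set, for example `accuracy(testing_sample['Label'], testing_sample['predict_msg'])`, since it only relies on `len()` and iteration.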

