# Lab 06: Take home exercise

## st121411
Classifying whether the message is a spam message or a ham message can be done using naive bayes. The dataset is from kaggle https://www.kaggle.com/uciml/sms-spam-collection-dataset. The function "get_words" was copied from one of the answers in kaggle, the rest of the assignment was done by me. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from collections import Counter
import nltk
import string
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /home/rom/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
raw_data = pd.read_csv('spam.csv', encoding='latin-1')
raw_data.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1,inplace=True)
raw_data = raw_data.rename(columns={'v1': 'class','v2': 'sentence'})

raw_data.head()

Unnamed: 0,class,sentence
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
def get_words(sentence):
    '''
    What will be covered:
    1. Remove punctuation
    2. Remove stopwords
    3. Return list of clean text words
    '''
    
    #1
    nopunc = [char for char in sentence if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    
    #2
    clean_words = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
    
    #3
    return clean_words
#raw_data['sentence'].apply(process_text).head()
#raw_data

By using stopwords, we can get words that are really important for each of the classes, this decreases the effect from the difference in the total number of words in spam messages vs number of ham messages as we are only going to look at words that are not stopwords.

In [4]:
train_data, test_data = train_test_split(raw_data.iloc[:,[0,1]], test_size = 0.2)
train_data.head()

Unnamed: 0,class,sentence
2202,ham,(And my man carlos is definitely coming by mu ...
3101,ham,Even if he my friend he is a priest call him now
5279,ham,"Helloooo... Wake up..! \Sweet\"" \""morning\"" \""..."
3091,ham,"Dear, take care. I am just reaching home.love ..."
1263,ham,Ok. No wahala. Just remember that a friend in ...


In [5]:
ham_docs = [train['sentence'] for index,train in train_data.iterrows() if train['class'] == 'ham']
spam_docs = [train['sentence'] for index,train in train_data.iterrows() if train['class'] == 'spam']
all_docs = [train['sentence'] for index,train in train_data.iterrows()]

In [6]:
spam_words = get_words(spam_docs)
ham_words = get_words(ham_docs)
all_words = get_words(all_docs)

In [7]:
def get_word_frequency_dictionary(words):
    word_frequency_dictionary = Counter(words)
    return word_frequency_dictionary

In [8]:
spam_dictionary = get_word_frequency_dictionary(spam_words)
ham_dictionary = get_word_frequency_dictionary(ham_words)

In [9]:
spam_frequency = []
ham_frequency = []
for key in spam_dictionary.keys():
    spam_frequency.append(spam_dictionary[key])
    
for key in ham_dictionary.keys():
    ham_frequency.append(ham_dictionary[key])

In [18]:
a = 0.00001
conditional_spam = []
for count in spam_frequency:
    #print(word, count)
    conditional_spam.append((count+a)/(sum(spam_frequency)+len(spam_frequency)*a))
conditional_spam.append(a/(sum(spam_frequency)+len(spam_frequency)*a))
    
conditional_ham = []
for count in ham_frequency:
    conditional_ham.append((count+a)/(sum(ham_frequency)+len(ham_frequency)*a))
conditional_ham.append(a/(sum(ham_frequency)+len(ham_frequency)*a))
    

def prior(className):    
    denominator = len(ham_docs) + len(spam_docs)
    
    if className == 'spam':
        numerator =  len(spam_docs)
    else:
        numerator =  len(ham_docs)
        
    return np.divide(numerator,denominator)
    
# Calculate class conditional probability for a sentence
    
def classCondProb(sentence, className):
    words = get_words([sentence])
    prob = 1
    for word in words:
        if className == 'spam':
            try: 
                idx = spam_words.index(word)
                prob = prob * conditional_spam[idx]
            except:
                prob = prob * conditional_spam[-1]
        else:
            try:
                idx = ham_words.index(word)
                prob = prob * conditional_ham[idx]   
            except:
                prob = prob * conditional_ham[-1]
    
    return prob

# Predict class of a sentence

def predict(sentence):
    prob_spam = classCondProb(sentence, 'spam') * prior('spam')
    prob_ham = classCondProb(sentence, 'ham') * prior('ham')
    if  prob_spam > prob_ham:
        return 'spam'
    elif prob_spam < prob_ham:
        return 'ham'
    else:
        return 'equal'

test_docs = list([test['sentence'] for index,test in test_data.iterrows()])

predictions = []

for i,sentence in enumerate(test_docs):
    #print('Getting prediction for %s"' % sentence)
    predictions.append(predict(sentence))
predictions = np.array(predictions)

y = test_data['class'].values

accuracy = sum(y==predictions)/y.size
print("Accuracy =",accuracy)

Accuracy = 0.9085201793721973


In [16]:
print(predictions[:100])
print(y[:100])
print(predictions[:100]==y[:100])

['ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'spam' 'ham'
 'spam' 'ham' 'spam' 'ham' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham'
 'ham' 'ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'spam' 'ham' 'ham' 'ham' 'ham'
 'ham' 'spam' 'spam' 'ham' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham' 'spam'
 'ham' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'spam' 'spam'
 'ham' 'ham' 'ham' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham'
 'ham' 'spam' 'ham' 'spam' 'ham' 'spam' 'ham' 'spam' 'spam' 'ham' 'ham'
 'ham' 'ham' 'spam' 'spam' 'spam' 'ham' 'ham' 'ham' 'spam' 'ham' 'ham'
 'spam' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham']
['ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'spam' 'ham'
 'ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham'
 'ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham'
 'spam' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'spam' 'ham'
 'spam' 'ham' 'ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'spam' 'spam' 'ham'
 

Instead of increasing the frequency of every word by 1 I found that all of the messages were classified as ham messages, but as I decreased the parameter, I found out that the accuracy got higher and higher. Therefore, I used 0.00001 as my final parameter value. Furthermore, removing the prior slightly reduced the accuracy.