## Spam Classifier
The goal is to classify messages as spam or non-spam using the multinomial Naive Bayes algorithm.

To classify messages as spam or non-spam, the computer:
* Learns how humans classify messages.
* Uses that human knowledge to estimate probabilities for new messages: the probability of spam and the probability of non-spam.
* Classifies a new message based on these probability values: if the probability for spam is greater, it classifies the message as spam; otherwise, it classifies it as non-spam (if the two probability values are equal, we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that have already been classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more detail on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

Let's start by exploring the dataset.

In [1]:
import pandas as pd
# The file is tab-separated and has no header row, so we name the columns ourselves.
sms_data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [2]:
print(sms_data.shape)
print(sms_data.head())
print(sms_data['Label'].value_counts(normalize=True) * 100)

(5572, 2)
  Label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
ham     86.593683
spam    13.406317
Name: Label, dtype: float64


The output above shows that about 13.4% of the messages are spam and the remaining 86.6% are non-spam (ham).

We will also need to test how well the spam filter classifies new messages. To do that, we first split our dataset into two parts:
* A training set, which we will use to "train" the computer how to classify messages.
* A test set, which we will use to test how good the spam filter is with classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing.

In [3]:
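# Shuffle the whole dataset; random_state=1 makes the shuffle reproducible.
data_sample = sms_data.sample(frac=1, random_state=1)
# Row index at which the first 80% of the shuffled data ends.
random_data = round(len(data_sample) * 0.8)
train_set = data_sample[:random_data].reset_index(drop=True)
test_set = data_sample[random_data:].reset_index(drop=True)
print(train_set)
print(test_set)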

     Label                                                SMS
0      ham                       Yep, by the pretty sculpture
1      ham      Yes, princess. Are you going to make me moan?
2      ham                         Welp apparently he retired
3      ham                                            Havent.
4      ham  I forgot 2 ask ü all smth.. There's a card on ...
5      ham  Ok i thk i got it. Then u wan me 2 come now or...
6      ham  I want kfc its Tuesday. Only buy 2 meals ONLY ...
7      ham                         No dear i was sleeping :-P
8      ham                          Ok pa. Nothing problem:-)
9      ham                    Ill be there on  &lt;#&gt;  ok.
10     ham  My uncles in Atlanta. Wish you guys a great se...
11     ham                                           My phone
12     ham                       Ok which your another number
13     ham  The greatest test of courage on earth is to be...
14     ham  Dai what this da.. Can i send my resume to thi...
15     h

In [4]:
print(train_set['Label'].value_counts(normalize=True))
print(test_set['Label'].value_counts(normalize=True))

ham     0.86541
spam    0.13459
Name: Label, dtype: float64
ham     0.868043
spam    0.131957
Name: Label, dtype: float64
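
As a quick sanity check, the split sizes follow directly from the dataset size reported earlier (5,572 rows). The snippet below only redoes that arithmetic; the variable names are illustrative and not part of the notebook.

```python
# Row count taken from the sms_data.shape output above.
total_messages = 5572

# Same rounding rule as the split cell: the first 80% of the rows go to training.
train_size = round(total_messages * 0.8)
test_size = total_messages - train_size

print(train_size, test_size)
```

The resulting test-set size should agree with the number of messages evaluated at the end of the notebook.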


Both sets keep roughly the same proportion of spam and ham as the full dataset. To calculate the probabilities the algorithm needs, we'll first perform a bit of data cleaning to bring the data into a format that lets us easily extract all the information we need. Right now, our training and test sets have this format:
![Image][1]
[1]:https://dq-content.s3.amazonaws.com/433/cpgp_dataset_3.png

We start by replacing punctuation and other non-word characters with spaces and converting every message to lowercase.

In [5]:
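# \W matches any non-word character; pandas 2.0 and later treat the pattern as a literal string unless regex=True.
train_set['SMS'] = train_set['SMS'].str.replace(r'\W', ' ', regex=True)
train_set['SMS'] = train_set['SMS'].str.lower()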

In [6]:
train_set.head()

|   | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |
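
To see what this cleaning step does to a single message, here is a minimal sketch using the standard `re` module instead of the pandas string methods. The sample text is one of the messages printed earlier; the variable names are illustrative.

```python
import re

raw_message = "Ok pa. Nothing problem:-)"

# Replace every non-word character (anything other than letters, digits
# and underscore) with a space, then lowercase the result.
cleaned = re.sub(r"\W", " ", raw_message).lower()

# Splitting on whitespace drops the runs of spaces left by the replacement.
tokens = cleaned.split()
print(tokens)
```

Because each punctuation mark becomes a space rather than being deleted, a word such as `there's` ends up as two tokens, `there` and `s`, which is why a lone `s` appears in the cleaned output.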


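Next, we build the vocabulary: a list of all the unique words that appear in the training messages. We then create one column per vocabulary word, holding the number of times that word appears in each message.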

In [7]:
train_set['SMS']=train_set['SMS'].str.split()
vocabulary = []
for sms in train_set['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [8]:
word_counts_per_sms = {unique_word: [0] * len(train_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [9]:
word_col=pd.DataFrame(word_counts_per_sms)

In [10]:
word_col.head()

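|   | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |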
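
The real table is too wide to inspect by eye, so here is a small sketch of the same counting logic on three made-up, already tokenized messages. The toy data and the `toy_` names are assumptions for illustration only.

```python
import pandas as pd

# Three hypothetical messages, already cleaned and split into words.
toy_sms = [
    ["secret", "prize", "claim", "now"],
    ["coming", "to", "my", "secret", "party"],
    ["secret", "secret"],
]

# Vocabulary: every unique word (sorted here only to get a stable column order).
toy_vocabulary = sorted({word for sms in toy_sms for word in sms})

# One list of zeros per word, one slot per message, then count occurrences.
toy_counts = {word: [0] * len(toy_sms) for word in toy_vocabulary}
for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts[word][index] += 1

toy_table = pd.DataFrame(toy_counts)
print(toy_table)
```

Each row corresponds to one message and each column to one vocabulary word, which is the layout the full `word_col` table uses.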


In [11]:
training_set=pd.concat([train_set,word_col],axis=1)

In [12]:
training_set.head()

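|   | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |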


To classify new messages, the algorithm needs the probability values given by the two equations below:
\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) \\
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}
Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, we need to use these equations:
\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}
Some of the terms in the four equations above will have the same value for every new message. As a start, let's first calculate $P(Spam)$, $P(Ham)$, $N_{Spam}$, $N_{Ham}$ and $N_{Vocabulary}$:
* $N_{Spam}$ is the number of words in all the spam messages.
* $N_{Ham}$ is the number of words in all the non-spam messages.
* $N_{Vocabulary}$ is the number of unique words in the vocabulary.

We'll also use Laplace smoothing and set $\alpha = 1$.
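
Before running this on the real training set, the constants can be illustrated on a four-message toy table. The data and the `toy_` names below are made up for this sketch; the operations mirror the ones used in the cells that follow.

```python
import pandas as pd

# Hypothetical, already tokenized training data.
toy = pd.DataFrame({
    "Label": ["spam", "ham", "ham", "spam"],
    "SMS": [["win", "money", "now"], ["see", "you", "soon"], ["ok"], ["free", "money"]],
})

# P(Spam) and P(Ham): the share of messages carrying each label.
toy_priors = toy["Label"].value_counts(normalize=True)

# N_Spam and N_Ham: total number of words across the messages of each class.
toy_n_spam = toy.loc[toy["Label"] == "spam", "SMS"].apply(len).sum()
toy_n_ham = toy.loc[toy["Label"] == "ham", "SMS"].apply(len).sum()

# N_Vocabulary: number of unique words across all messages.
toy_n_vocabulary = len({word for sms in toy["SMS"] for word in sms})

print(toy_priors["spam"], toy_priors["ham"], toy_n_spam, toy_n_ham, toy_n_vocabulary)
```

Note that $N_{Spam}$ and $N_{Ham}$ count every word occurrence, repeats included, while $N_{Vocabulary}$ counts each distinct word once (here `money` appears twice but contributes one vocabulary entry).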

In [17]:
prob=training_set['Label'].value_counts(normalize=True)

In [18]:
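# Look the values up by label rather than by position, so the result does not depend on row order.
p_ham, p_spam = prob['ham'], prob['spam']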

In [22]:
spam_messages = training_set[training_set['Label'] == 'spam']
ham_messages = training_set[training_set['Label'] == 'ham']
spam_word=spam_messages['SMS'].apply(len)
n_spam=spam_word.sum()

ham_word=ham_messages['SMS'].apply(len)
n_ham=ham_word.sum()

n_vocabulary=len(vocabulary)
alpha=1

In [23]:
n_vocabulary

7783

Now that we have the constant terms calculated above, we can move on with calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter will thus be a conditional probability value associated with each word in the vocabulary.

The parameters are calculated using the formulas:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}
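
To make the role of $\alpha$ concrete, here is a minimal worked example of the formula with made-up counts. The function name `smoothed_probability` and the numbers are assumptions for illustration; they are not taken from the dataset.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Return (N_w|class + alpha) / (N_class + alpha * N_vocabulary)."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Toy setting: the class contains 7 words in total, the vocabulary has 9 unique words.
p_seen = smoothed_probability(2, 7, 9)                  # word occurs twice in the class
p_unseen = smoothed_probability(0, 7, 9)                # word never occurs in the class
p_unsmoothed = smoothed_probability(0, 7, 9, alpha=0)   # same word without smoothing

print(p_seen, p_unseen, p_unsmoothed)
```

Without smoothing, a word that never occurs in a class gets probability zero, and that single zero would wipe out the whole product for any message containing the word. With $\alpha = 1$ the word keeps a small non-zero probability instead.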



In [24]:
param_spam={word:0 for word in vocabulary}
param_ham={word:0 for word in vocabulary}

In [28]:
for word in vocabulary:
    spam_w=spam_messages[word].sum()
    p_w_given_spam=(spam_w+alpha)/(n_spam+alpha*n_vocabulary)
    param_spam[word]=p_w_given_spam
   
    ham_w=ham_messages[word].sum()
    p_w_given_ham=(ham_w+alpha)/(n_ham+alpha*n_vocabulary)
    param_ham[word]=p_w_given_ham

Now that we've calculated all the constants and parameters we need, the spam filter can be understood as a function that:

* Takes in as input a new message (w1, w2, ..., wn).
* Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
* Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    * If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    * If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    * If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.

Note: some new messages will contain words that are not part of the vocabulary. We simply ignore these words when calculating the probabilities.

In [29]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message=p_spam
    p_ham_given_message=p_ham
    for word in message:
        if word in param_spam:
            p_spam_given_message*=param_spam[word]
        if word in param_ham:
            p_ham_given_message*=param_ham[word]
        
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

Let's test the classifier on a couple of new messages:

In [30]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [31]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
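
The products printed above are already around 1e-25 for short messages, and for much longer messages a product of many small probabilities can underflow to zero. One possible refinement, not used in this notebook, is to compare sums of log-probabilities instead. The sketch below assumes that approach; `classify_log` and the toy parameter dictionaries are hypothetical, and the function takes the priors and parameters as arguments so that it is self-contained.

```python
import math
import re

def classify_log(message, p_spam, p_ham, param_spam, param_ham):
    """Same decision rule as classify(), computed with log-probabilities."""
    words = re.sub(r"\W", " ", message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are ignored, as in the original function.
        if word in param_spam:
            log_spam += math.log(param_spam[word])
        if word in param_ham:
            log_ham += math.log(param_ham[word])

    if log_ham > log_spam:
        return "ham"
    if log_spam > log_ham:
        return "spam"
    return "needs human classification"

# Made-up parameters for two vocabulary words.
toy_param_spam = {"money": 0.2, "see": 0.05}
toy_param_ham = {"money": 0.05, "see": 0.2}

print(classify_log("Money money!", 0.5, 0.5, toy_param_spam, toy_param_ham))
print(classify_log("See you", 0.5, 0.5, toy_param_spam, toy_param_ham))
```

Because the logarithm is strictly increasing, comparing the two sums gives the same label as comparing the two products whenever the products do not underflow.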


Next, we'll determine how well the spam filter does on our test set of 1,114 messages.
The algorithm will output a classification label for every message in our test set, which we can then compare with the actual label (given by a human).

In [34]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in param_spam:
            p_spam_given_message *= param_spam[word]

        if word in param_ham:
            p_ham_given_message *= param_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [35]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

|   | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [41]:
total_test_case = test_set.shape[0]
# Count the rows where the predicted label matches the human label.
correct = int((test_set['Label'] == test_set['predicted']).sum())

print("correct : ", correct)
print("Total messages:", total_test_case)
print("Accuracy:", correct / total_test_case)

correct :  1100
Total messages: 1114
Accuracy: 0.9874326750448833


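The accuracy is close to 98.74%, which is really good. Our spam filter looked at 1,114 messages that it hasn't seen in training, and classified 1,100 correctly. For comparison, a filter that labeled every message as ham would score about 86.8% on this test set, since that is the share of ham messages in it.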
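
Since the classes are imbalanced, accuracy alone does not show which kind of mistake the filter makes. A possible follow-up, not part of this notebook, is to tabulate actual against predicted labels with `pd.crosstab`. The sketch below uses a small made-up results table rather than the real `test_set`, so the numbers it prints say nothing about the actual filter.

```python
import pandas as pd

# Hypothetical results; in the notebook these columns would come from test_set.
toy_results = pd.DataFrame({
    "Label":     ["ham", "ham", "spam", "spam", "ham", "spam"],
    "predicted": ["ham", "ham", "spam", "ham",  "ham", "spam"],
})

# Rows are the human labels, columns are the predicted labels.
confusion = pd.crosstab(toy_results["Label"], toy_results["predicted"])
print(confusion)

true_spam = confusion.loc["spam", "spam"]     # spam correctly flagged
missed_spam = confusion.loc["spam", "ham"]    # spam that slipped through
false_alarms = confusion.loc["ham", "spam"]   # ham wrongly flagged as spam

precision = true_spam / (true_spam + false_alarms)
recall = true_spam / (true_spam + missed_spam)
print("Precision:", precision, "Recall:", recall)
```

On the real test set, rows predicted as `needs human classification` would show up as an extra column in the table.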