<h1>Spam Classifier<span class="tocSkip"></span></h1>

**This project applies the TF-IDF method to measure term importance and classify spam messages.**

In [17]:
import pandas as pd
import os
import re
import nltk
import string
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from math import log
import numpy as np
from sklearn.metrics import confusion_matrix

os.chdir('C:/Users/ricci/Desktop/Coding & Techniques/Spam_Detection')

**The data set used in this project is the SMS sample from Kaggle. It contains messages labeled as spam or ham. The first steps of text analysis clean the text data by removing punctuation and stop words. For example, the stop words 'the', 'a', 'me' and 'him' carry little useful information because they appear in both spam and ham messages.** 

In [18]:
raw_data=pd.read_csv('SMSSpamCollection.txt',sep = '\t', header=None, names=["label", "sms"])

spam=raw_data.copy(deep=True)
#encode the labels: ham -> '0', spam -> '1'
spam['label']=spam['label'].map({'ham':'0','spam':'1'})

**Packages used for removing punctuation and stop words**  
nltk, string, re

**The processed column is created by applying string manipulation to the sms column. Removing as many unrelated words as possible helps improve the efficiency of spam message classification.**

In [21]:
#download the stop word list and the tokenizer data
nltk.download('stopwords')
nltk.download('punkt')
stopwords = nltk.corpus.stopwords.words('english')
stopwords.append('u') #treat the SMS abbreviation 'u' as a stop word (matching is done in lower case)
def process_text(data):
    #remove punctuation; re.escape keeps ']', '\' and '-' literal inside the character class
    remove_punctuation=re.sub('['+re.escape(string.punctuation)+']','',data)
    #remove digits
    remove_number=re.sub('[0-9]','',remove_punctuation)
    #split the cleaned text into word tokens
    tokenize = nltk.tokenize.word_tokenize(remove_number)
    #drop stop words, comparing in lower case
    remove_stopwords=[word for word in tokenize if word.lower() not in stopwords]
    return remove_stopwords

spam['processed'] = spam['sms'].apply(process_text)

print(spam[['sms','processed']])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ricci\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ricci\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


                                                    sms  \
0     Go until jurong point, crazy.. Available only ...   
1                         Ok lar... Joking wif u oni...   
2     Free entry in 2 a wkly comp to win FA Cup fina...   
3     U dun say so early hor... U c already then say...   
4     Nah I don't think he goes to usf, he lives aro...   
...                                                 ...   
5567  This is the 2nd time we have tried 2 contact u...   
5568               Will ü b going to esplanade fr home?   
5569  Pity, * was in mood for that. So...any other s...   
5570  The guy did some bitching but I acted like i'd...   
5571                         Rofl. Its true to its name   

                                              processed  
0     [Go, jurong, point, crazy, Available, bugis, n...  
1                           [Ok, lar, Joking, wif, oni]  
2     [Free, entry, wkly, comp, win, FA, Cup, final,...  
3               [dun, say, ea

**Formula for TF-IDF calculation**  
TF-IDF = TF (term frequency) × log(total number of messages / number of messages containing the term) 
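
To make the formula concrete, here is a minimal sketch that evaluates it on a toy corpus. The messages, the `tf_idf` helper and the choice to return 0 for unseen terms are assumptions made for illustration; they are not part of the notebook.

```python
from math import log

# Toy corpus of already-tokenized messages (made up for illustration).
messages = [
    ["free", "entry", "win", "cup"],
    ["ok", "lar", "joking"],
    ["win", "free", "free", "prize"],
    ["nah", "think", "goes", "usf"],
]

def tf_idf(term, message, corpus):
    """TF(term, message) * log(N / number of messages containing term)."""
    if not message:
        return 0.0
    tf = message.count(term) / len(message)
    n_containing = sum(1 for m in corpus if term in m)
    if n_containing == 0:
        # Assumption: a term that never occurs in the corpus scores 0.
        return 0.0
    return tf * log(len(corpus) / n_containing)

# "free" makes up 2 of the 4 words in the third message and occurs in 2 of 4 messages.
score_free = tf_idf("free", messages[2], messages)
# "win" makes up 1 of the 4 words in the third message and occurs in 2 of 4 messages.
score_win = tf_idf("win", messages[2], messages)
print(score_free, score_win)
```

Both terms share the same IDF here, so the higher term frequency of "free" is what gives it the larger score.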

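**Classification Methodology**  
1. Calculate the TF-IDF of each word separately for the spam and ham classes.  
2. Sum the per-word results within each message to get a spam score and a ham score.  
3. Compare the two scores to classify the message, then check the predictions against the true labels.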
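
The three steps can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the notebook's exact implementation: the messages and the helpers `doc_freq`, `score_message` and `predict` are hypothetical, the scores are initialized once per message and accumulated over all of its words, and the denominator counts messages that contain the word, as in the formula. Each toy message is repeated 50 times so that the class sizes are large enough for the log(class size) penalty to matter.

```python
from math import log

# Toy, already-processed messages (hypothetical data), repeated to mimic larger classes.
spam_msgs = [["free", "prize", "win"], ["win", "cash", "now"], ["free", "cash", "claim"]] * 50
ham_msgs = [["see", "you", "tonight"], ["ok", "see", "you"],
            ["call", "me", "tonight"], ["ok", "thanks"]] * 50

def doc_freq(term, corpus):
    """Number of messages in corpus that contain term."""
    return sum(1 for m in corpus if term in m)

def score_message(message):
    """Return (spam_score, ham_score), summed over every word in the message."""
    spam_score = ham_score = 0.0              # initialized once per message
    for word in message:
        tf = message.count(word) / len(message)
        df_spam, df_ham = doc_freq(word, spam_msgs), doc_freq(word, ham_msgs)
        if df_spam and not df_ham:            # word seen only in spam
            weight = tf * log(len(spam_msgs) / df_spam)
            ham_score -= log(len(ham_msgs))   # penalize the ham score
            if weight > 0:                    # guard against log(0)
                spam_score += log(weight)
        elif df_ham and not df_spam:          # word seen only in ham
            weight = tf * log(len(ham_msgs) / df_ham)
            spam_score -= log(len(spam_msgs)) # penalize the spam score
            if weight > 0:
                ham_score += log(weight)
        # words seen in both classes, or in neither, contribute nothing
    return spam_score, ham_score

def predict(message):
    """Return 1 (spam) when the spam score exceeds the ham score, else 0 (ham)."""
    spam_score, ham_score = score_message(message)
    return 1 if spam_score > ham_score else 0

print(predict(["free", "cash", "now"]))   # words that occur only in the toy spam class
print(predict(["ok", "see", "you"]))      # words that occur only in the toy ham class
```

A message whose words all appear in both classes, or in neither, ends with two equal scores and falls back to ham in this sketch.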

In [23]:
#calculate TF-IDF
spam['label']=spam['label'].astype(int)
    
training_set, test_set = train_test_split(spam,test_size=0.3,random_state=2)
training_set=training_set.reset_index(drop=True)
test_set=test_set.reset_index(drop=True)
  
spam_1=spam[spam['label']==1].reset_index(drop=True)
spam_0=spam[spam['label']==0].reset_index(drop=True)

def classify(data):
    result_spam=[]
    result_ham=[]
    #class sizes do not depend on the message, so compute them once
    total_messages_0, total_messages_1=len(spam_0),len(spam_1)
    for i in range(data.shape[0]):
        message_processed = data['processed'][i]
        #TF
        total_words=len(message_processed)
        for word in message_processed:
            #scores are reset for each word, so the values appended below come from the last word of the message
            result_0=0
            result_1=0
            fre=[w.lower() for w in message_processed].count(word.lower())
            tf=fre/total_words
            #total occurrences of the word across ham (0) and spam (1) messages
            denominator_0=sum([[w.lower() for w in spam_0['processed'][j]].count(word.lower()) for j in range(len(spam_0))])
            denominator_1=sum([[w.lower() for w in spam_1['processed'][j]].count(word.lower()) for j in range(len(spam_1))])
            if (denominator_0 == 0 and denominator_1 != 0):
                #word appears only in spam messages
                result_0 -= log(total_messages_0) #adjust the weight of ham probability
                result_1 += log(tf*log(total_messages_1/denominator_1))
            elif (denominator_0 != 0 and denominator_1 == 0):
                #word appears only in ham messages
                result_1 -= log(total_messages_1) #adjust the weight of spam probability
                result_0 += log(tf*log(total_messages_0/denominator_0))
            else:
                #word appears in both classes or in neither: no contribution
                result_0 += 0
                result_1 += 0
        result_spam.append(result_1)
        result_ham.append(result_0)

    #store the scores on the data frame that was passed in
    data['result_spam']=result_spam
    data['result_ham']=result_ham
    #predict spam (1) when the spam score exceeds the ham score
    data['predict']=(data['result_spam']>data['result_ham']).astype(int)
    cm = confusion_matrix(data['label'], data['predict'])

    print("Accuracy: ",sum(data['predict']==data['label'])/len(data))
    print(cm)

classify(test_set)

Accuracy:  0.9401913875598086
[[1445    0]
 [ 100  127]]


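**The classification above reaches 94% accuracy on the test set. The confusion matrix shows that no ham message is flagged as spam, but 100 of the 227 spam messages are classified as ham. Fitting more training data may help the classification function catch more spam messages.**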
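
Accuracy alone can hide how the classifier treats each class. As a minimal sketch, precision and recall for the spam class can be derived from the confusion matrix printed above, assuming scikit-learn's layout in which rows are true labels and columns are predicted labels. The variable names are chosen for illustration.

```python
# Values copied from the confusion matrix printed above:
# row 0 = true ham, row 1 = true spam; column 0 = predicted ham, column 1 = predicted spam.
tn, fp = 1445, 0
fn, tp = 100, 127

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted spam that is really spam
recall = tp / (tp + fn)      # share of real spam that is caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# prints: accuracy=0.940 precision=1.000 recall=0.559
```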