# Spam Classification using Support Vector Machines

Many email services today provide spam filters that classify emails as spam or non-spam with high accuracy. Using SVMs, we can build our own spam filter. Let's train an SVM classifier to decide whether a given email, x, is spam (y = 1) or non-spam (y = 0). Each email is converted into an n-dimensional feature vector x ∈ R^n.
  
Dataset: All data is taken from the course "Machine Learning by Stanford University, Coursera". The dataset (which excludes email headers) is based on a subset of the SpamAssassin Public Corpus.

### Email Preprocessing

To use an SVM to classify emails as spam or non-spam, we first need to convert each email into a vector of features. The body of each email is preprocessed as follows:  
a. Convert everything to lowercase  
b. Strip all HTML tags (anything enclosed in < and >)  
c. Handle numbers  
d. Handle URLs (http://, https://)  
e. Handle email addresses  
f. Remove punctuation and whitespace such as tabs and newlines  
g. Stem the words (“discount”, “discounts”, “discounted” and “discounting” are all replaced with “discount”)
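
As a rough illustration of steps a to e, here is a minimal sketch using only the standard `re` module. The function name `normalize`, the sample string, the `httpaddr` placeholder, and the order of the substitutions are assumptions made for this example; stemming and punctuation removal are left out.

```python
import re

def normalize(text):
    """Apply lowercasing and the HTML/URL/email/number/dollar substitutions."""
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                      # strip HTML tags
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)  # URLs (placeholder name is an assumption)
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)         # email addresses
    text = re.sub(r'[0-9]+', 'number', text)                   # digit runs
    text = re.sub(r'[$]+', 'dollar', text)                     # dollar signs
    return text

# Hypothetical sample text, not taken from the dataset.
sample = 'Visit <b>http://example.com</b> or mail bob@example.com: only $25!'
print(normalize(sample).split())
```

Note that the email pattern matches any run of non-whitespace characters around the `@`, so trailing punctuation such as the colon in the sample is absorbed into the `emailaddr` token.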

In [1]:
# reading the text file
with open('Data/emailSample.txt', 'r') as email:
    file_contents = email.read()
file_contents

'Folks,\n \nmy first time posting - have a bit of Unix experience, but am new to Linux.\n\n \nJust got a new PC at home - Dell box with Windows XP. Added a second hard disk\nfor Linux. Partitioned the disk and have installed Suse 7.2 from CD, which went\nfine except it didn\'t pick up my monitor.\n \nI have a Dell branded E151FPp 15" LCD flat panel monitor and a nVidia GeForce4\nTi4200 video card, both of which are probably too new to feature in Suse\'s default\nset. I downloaded a driver from the nVidia website and installed it using RPM.\nThen I ran Sax2 (as was recommended in some postings I found on the net), but\nit still doesn\'t feature my video card in the available list. What next?\n \nAnother problem. I have a Dell branded keyboard and if I hit Caps-Lock twice,\nthe whole machine crashes (in Linux, not Windows) - even the on/off switch is\ninactive, leaving me to reach for the power cable instead.\n \nIf anyone can help me in any way with these probs., I\'d be really grateful

### Vocabulary List

After preprocessing the emails, there is a list of words for each email. The next step is to choose which words will be used in the classifier and which will be left out.  
  
For simplicity, only the most frequently occurring words are kept; these form the vocabulary list. Words that occur rarely in the training set appear in only a few emails, so including them might cause the model to overfit the training set. The complete vocabulary list is in the file "vocab.txt". It was built by choosing all words that occur at least 100 times in the spam corpus, resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used.  
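
The frequency cut-off can be sketched on a toy corpus as follows. The helper `build_vocab`, the toy emails, the threshold of 2, and the choice to number words alphabetically from 1 are all assumptions for illustration; the source only says that words occurring at least 100 times were kept.

```python
from collections import Counter

def build_vocab(tokenized_emails, min_count):
    """Keep words that occur at least min_count times across all emails."""
    counts = Counter(tok for email in tokenized_emails for tok in email)
    kept = sorted(word for word, c in counts.items() if c >= min_count)
    # Number the surviving words starting from 1 (an assumption of this sketch).
    return {word: i for i, word in enumerate(kept, start=1)}

# Hypothetical, already preprocessed emails.
toy_emails = [
    ['click', 'our', 'offer'],
    ['click', 'here'],
    ['our', 'offer', 'click'],
]
toy_vocab = build_vocab(toy_emails, min_count=2)
print(toy_vocab)
```

Here "here" appears only once and is dropped, while "click", "offer" and "our" survive.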
  
Given the vocabulary list, each word in a preprocessed email can now be mapped to its index in the vocabulary, turning the email into a list of word indices. The code in "processEmail" performs this mapping: each word from the processed email is looked up in the vocabulary list. If the word exists, its index is appended to the word_indices variable; if it is not in the vocabulary, it is skipped.
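
The lookup itself is small. A minimal sketch, using a hypothetical three-word vocabulary and token list rather than the real "vocab.txt":

```python
def map_to_indices(tokens, vocab):
    """Return the vocabulary index of every token found in vocab, in order."""
    return [vocab[tok] for tok in tokens if tok in vocab]

# Hypothetical vocabulary and tokens, for illustration only.
small_vocab = {'anyon': 1, 'help': 2, 'linux': 3}
tokens = ['anyon', 'can', 'help', 'linux', 'help']

toy_indices = map_to_indices(tokens, small_vocab)
print(toy_indices)  # [1, 2, 3, 2]
```

The out-of-vocabulary word "can" is skipped, and a repeated word contributes its index once per occurrence.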

In [2]:
# let's preprocess this email

import re
from string import punctuation
from nltk.stem.snowball import SnowballStemmer

def getVocabList():
    # Read the fixed vocabulary list into a {word: index} dictionary.
    vocab_dict = {}
    with open('Data/vocab.txt', 'r') as vocab:
        for line in vocab:
            i, word = line.split()
            vocab_dict[word] = int(i)
    return vocab_dict

def processEmail(email_contents):
    # Function to pre-process the email contents
    vocabList = getVocabList() 
    word_indices = [] #init the return value
    
    #--------------------------- Preprocessing ----------------------------------#
    
    # convert to lower case
    email_contents = email_contents.lower() 
    
    # Strip all HTML tags
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    
    # Handle numbers
    email_contents = re.sub(r'[0-9]+', 'number', email_contents)
    
    # Handle URLs (the 'httpaddr' placeholder follows the course convention)
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    
    # Handle email addresses
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)
    
    # Handle $ sign (most spam emails are lottery/discount emails!)
    email_contents = re.sub(r'[$]+', 'dollar', email_contents)
    
    print('\n---------- Processed Email ---------------\n')
    
    # Get rid of any punctuation.
    email_contents = email_contents.translate(str.maketrans('', '', punctuation))

    # Split the email text string into individual words.
    word_content = email_contents.split()
    
    l = 0
    
    # Create the stemmer once, outside the loop.
    stemmer = SnowballStemmer("english")

    for token in word_content:

        # Remove any non alphanumeric characters.
        token = re.sub('[^a-zA-Z0-9]', '', token)
        
        # Stem the word.
        token = stemmer.stem(token.strip())

        # Skip the token if nothing is left of it.
        if len(token) < 1:
            continue
        
        # Look up the word in the dictionary and add to word_indices if found.
        if token in vocabList:
            idx = vocabList[token]
            word_indices.append(idx)

        # Print to screen, ensuring that the output lines are not too long.
        if l + len(token) + 1 > 78:
            print()
            l = 0
        print(token, end=' ')
        l = l + len(token) + 1

    return word_indices

In [3]:
# Extract features.
word_indices = processEmail(file_contents)

# Print stats.
print('\n\n\n -------------Word Indices--------------------------\n')
print(word_indices)


---------- Processed Email ---------------

folk my first time post have a bit of unix experi but am new to linux just 
got a new pc at home dell box with window xp ad a second hard disk for linux 
partit the disk and have instal suse numbernumb from cd which went fine 
except it didnt pick up my monitor i have a dell brand enumberfpp number lcd 
flat panel monitor and a nvidia geforcenumb tinumb video card both of which 
are probabl too new to featur in suse default set i download a driver from 
the nvidia websit and instal it use rpm then i ran saxnumb as was recommend 
in some post i found on the net but it still doesnt featur my video card in 
the avail list what next anoth problem i have a dell brand keyboard and if i 
hit capslock twice the whole machin crash in linux not window even the onoff 
switch is inact leav me to reach for the power cabl instead if anyon can help 
me in ani way with these prob id be realli grate ive search the net but have 
run out of idea or should i be

### Extracting Features from Emails

Feature extraction converts each email into a vector x ∈ R^n, where n is the number of words in the vocabulary list. Specifically, x_i = 1 if the i-th vocabulary word occurs in the email and x_i = 0 otherwise.
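
A toy version of this encoding, with a five-word vocabulary and plain Python lists, might look like the sketch below. It assumes vocabulary indices start at 1, so index i lands at list position i - 1; the function name `binary_features` is made up for this example.

```python
def binary_features(word_indices, n):
    """Build a length-n 0/1 vector from 1-based vocabulary indices."""
    x = [0] * n
    for idx in word_indices:
        x[idx - 1] = 1  # assumes indices are 1-based
    return x

# Words 2 and 4 occur in the email; word 4 occurs twice.
toy_x = binary_features([2, 4, 4], n=5)
print(toy_x)  # [0, 1, 0, 1, 0]
```

Repeated words do not change the result, since the vector records presence rather than counts.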

In [4]:
import numpy as np

def emailFeatures(word_indices):
    # Takes a list of vocabulary indices and returns a binary feature vector.
    
    n = 1899 # total number of words in the vocab.txt file
    x = np.zeros((n, 1)) # initial feature vector
    
    # vocab.txt is assumed to number its words from 1, so shift to 0-based
    # positions (otherwise index 1899 would fall outside the vector).
    for idx in word_indices:
        x[idx - 1] = 1
    
    return x

features = emailFeatures(word_indices)

print('Extracting features from sample email...\n')
print('Length of feature vector: {}\n'.format(len(features)))
print('Number of non-zero entries: {}'.format(np.sum(features > 0)))

Extracting features from sample email...

Length of feature vector: 1899

Number of non-zero entries: 120


### Training SVM for Spam Classification

Load the preprocessed training dataset to train the SVM classifier. spamTrain.mat contains 4000 training examples of spam and non-spam email, while spamTest.mat contains 1000 test examples. Each original email was processed using the "processEmail" and "emailFeatures" functions and converted into a vector x^(i) ∈ R^1899.  
After loading the dataset, proceed to train an SVM to distinguish spam (y = 1) from non-spam (y = 0) emails.

In [5]:
import scipy.io as sio
from sklearn.svm import SVC

In [6]:
train_data = sio.loadmat('Data/spamTrain.mat')
X = train_data.get('X')
y = train_data.get('y')
X.shape, y.shape

((4000, 1899), (4000, 1))

In [7]:
print('Training SVC classifier...\n\n')

C = 0.1
model = SVC(C=C, kernel='linear')
model.fit(X, y.ravel())
print('Model trained successfully! Test it out to get the accuracy!')

Training SVC classifier...


Model trained successfully! Test it out to get the accuracy!


### Test Spam Classification

After training the classifier, we can evaluate it on a test set. The test set is loaded from spamTest.mat, and the SVM classifier's predictions on the test features (Xtest) are compared against the targets (ytest).

In [8]:
# Evaluating the classifier for test data 
test_data = sio.loadmat('Data/spamTest.mat')
X_test = test_data.get('Xtest')
y_test = test_data.get('ytest')

X_test.shape, y_test.shape

((1000, 1899), (1000, 1))

In [9]:
print('Evaluating the trained Linear SVM on a test set ...')

prediction = model.predict(X_test)

print('Test Accuracy: {0:.2f}%'.format(np.mean((prediction == y_test.ravel()).astype(int)) * 100))

Evaluating the trained Linear SVM on a test set ...
Test Accuracy: 98.90%


### Top Predictors for Spam

To better understand how the spam classifier works, we can inspect its parameters to see which words the classifier considers most predictive of spam. Display the 10 words that have the largest positive weights in the classifier.
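
One way to see why large positive weights point to spam: with a linear kernel and binary features, the decision value is just the intercept plus the weights of the words present in the email. In the notation below, w stands for the weight vector (`model.coef_[0]` in the code) and b for the intercept; both symbols are introduced here for illustration.

```latex
f(x) = w^{\top} x + b = b + \sum_{i \,:\, x_i = 1} w_i ,
\qquad
\hat{y} =
\begin{cases}
1 \ (\text{spam}) & \text{if } f(x) > 0 \\
0 \ (\text{non-spam}) & \text{otherwise}
\end{cases}
```

Each word with a large positive w_i therefore pushes f(x) toward the spam side whenever it appears.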

In [10]:
# Get the weights.
weights = model.coef_[0]

# Get the indices of the 10 largest weights (in ascending order).
indices = weights.argsort()[-10:]

# List the vocabulary words ordered by their index in vocab.txt.
vocab = getVocabList()
vocabList = sorted(vocab, key=vocab.get)

print('Top predictors of spam: \n')
for i in indices:
    print('{0:10s} ({1:8f})'.format(vocabList[i], float(weights[i])))

Top predictors of spam: 

pleas      (0.261169)
price      (0.267298)
will       (0.269724)
dollar     (0.323632)
basenumb   (0.345064)
visit      (0.367710)
guarante   (0.383622)
remov      (0.422869)
click      (0.465916)
our        (0.500614)


Thus, if an email contains words such as "guarante" or "dollar" (shown here in their stemmed forms), it is more likely to be classified as spam.