# Spam Classification with SVMs

Many email services provide spam filters that classify emails as spam or non-spam with high accuracy. In this exercise, we will use support vector machines (SVMs) to build a spam filter.

We will train a classifier to decide whether a given email, $x$, is spam. To do so, we first need to convert each email into a feature vector $x \in \mathbb{R}^n$.

In [1]:
import numpy as np
import re
from string import punctuation
from nltk.stem import PorterStemmer
import scipy.io
from sklearn.svm import SVC

## Part 1: Email Preprocessing

In [2]:
# first look at a sample email that contains a URL, an email address, numbers, and dollar amounts
with open('emailSample1.txt') as f:
    emailSample = f.read()
emailSample

"> Anyone knows how much it costs to host a web portal ?\n>\nWell, it depends on how many visitors you're expecting.\nThis can be anywhere from less than 10 bucks a month to a couple of $100. \nYou should checkout http://www.rackspace.com/ or perhaps Amazon EC2 \nif youre running something big..\n\nTo unsubscribe yourself from this mailing list, send an email to:\ngroupname-unsubscribe@egroups.com\n\n"

In [3]:
# getVocabList() reads the fixed vocabulary list in vocab.txt and returns
# a dictionary that maps each 1-based index to its word
def getVocabList():
    vocabList = {}
    with open('vocab.txt') as f:
        for line in f:
            ind_word = line.rstrip('\n').split('\t')
            try:
                vocabList[int(ind_word[0])] = ind_word[1]
            except (ValueError, IndexError):
                # skip blank or malformed lines
                continue
    return vocabList

In [4]:
vocabList = getVocabList()

In [5]:
# processEmail() preprocesses the body of an email and returns a list of word_indices
def processEmail(email_contents):
    # replace '\n' by ' '
    email_contents = email_contents.replace('\n', ' ')
    # lower case
    email_contents = email_contents.lower()
    # strip all html
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    # handle numbers
    email_contents = re.sub(r'[0-9]+', 'number', email_contents)
    # handle urls
    email_contents = re.sub(r'(https?://[^\s]+)', 'httpaddr', email_contents)
    # handle email addresses
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)
    # handle $ sign
    email_contents = re.sub(r'[$]+', 'dollar ', email_contents)
    # handle punctuation (escaped so that every character, including '\', is matched literally)
    email_contents = re.sub('[{}]'.format(re.escape(punctuation)), ' ', email_contents)
    # look up each stemmed word in the vocabulary; words not in the vocabulary are dropped
    word_to_index = {word: ind for ind, word in vocabList.items()}
    ps = PorterStemmer()
    word_indices = []
    for word in email_contents.split():
        stem = ps.stem(word)
        if stem in word_to_index:
            word_indices.append(word_to_index[stem])
    return word_indices

In [6]:
word_indices = processEmail(emailSample)
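
To see what the normalization steps in `processEmail()` do on their own, here is a minimal sketch that applies only the regular-expression substitutions, without stemming or the vocabulary lookup. The helper name `normalize_email` and the sample string are made up for illustration, and `re.escape` is used for the punctuation class as one possible way to keep every character literal.

```python
import re
from string import punctuation


def normalize_email(text):
    """Apply the regex normalization steps only (no stemming, no vocabulary lookup)."""
    text = text.replace('\n', ' ').lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                # strip HTML tags
    text = re.sub(r'[0-9]+', 'number', text)             # digit runs -> 'number'
    text = re.sub(r'https?://[^\s]+', 'httpaddr', text)  # URLs -> 'httpaddr'
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)   # addresses -> 'emailaddr'
    text = re.sub(r'[$]+', 'dollar ', text)              # '$' -> 'dollar '
    # replace every punctuation character with a space
    text = re.sub('[{}]'.format(re.escape(punctuation)), ' ', text)
    return text.split()


# hypothetical sample text, not taken from the dataset
sample = 'Visit http://example.com/ now, only $25! Mail bob@example.com'
print(normalize_email(sample))
```

For this sample the tokens are `['visit', 'httpaddr', 'now', 'only', 'dollar', 'number', 'mail', 'emailaddr']`: the URL, the address, the amount, and the dollar sign are each collapsed to a single placeholder token, so the classifier sees "there was a URL" rather than the specific URL.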

## Part 2: Feature Extraction

In [7]:
# emailFeatures() takes in a list of word_indices and produces a binary feature vector
def emailFeatures(word_indices):
    n = 1899
    x = np.zeros(n)
    for ind in word_indices:
        # vocabulary indices are 1-based, array positions are 0-based
        x[ind - 1] = 1
    return x

In [8]:
features = emailFeatures(word_indices)

In [9]:
print('Length of feature vector: {}'.format(len(features)))
print('Number of non-zero entries: {}'.format(sum(features)))

Length of feature vector: 1899
Number of non-zero entries: 45.0
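
The mapping from word indices to a feature vector can be illustrated on a toy scale. The sketch below assumes that vocabulary index k corresponds to position k - 1 of the vector, which is consistent with the `+ 1` used when the top predictors are looked up in `vocabList` later in this notebook. The five-word vocabulary and the function name `email_features` are hypothetical, and plain lists are used instead of NumPy to keep the example dependency-free.

```python
def email_features(word_indices, n):
    """Binary feature vector of length n from 1-based vocabulary indices."""
    x = [0] * n
    for idx in word_indices:
        x[idx - 1] = 1  # assumed mapping: vocabulary index k -> position k - 1
    return x


# hypothetical 5-word vocabulary with 1-based indices
toy_vocab = {1: 'click', 2: 'dollar', 3: 'hour', 4: 'price', 5: 'visit'}
toy_indices = [5, 2, 5]  # 'visit', 'dollar', 'visit'

x = email_features(toy_indices, n=len(toy_vocab))
print(x)       # [0, 1, 0, 0, 1]
print(sum(x))  # 2
```

A repeated word sets the same entry twice, so the vector records presence rather than frequency, and the number of non-zero entries equals the number of distinct vocabulary words in the email.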


## Part 3: Train Linear SVM for Spam Classification

In [10]:
# load the spam email dataset
spamTrain = scipy.io.loadmat('spamTrain.mat')
X = spamTrain['X']
y = spamTrain['y']
y = y.reshape([len(y),])

In [11]:
# train linear SVM with C = 0.1
clf = SVC(C=0.1, kernel='linear') 
clf.fit(X, y) 

SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [12]:
predictions = clf.predict(X)
print('Training Accuracy: {}'.format(np.mean(predictions == y)))

Training Accuracy: 0.99825


## Part 4: Test Spam Classification

In [13]:
# load the test dataset
spamTest = scipy.io.loadmat('spamTest.mat')
Xtest = spamTest['Xtest']
ytest = spamTest['ytest']
ytest = ytest.reshape([len(ytest),])

In [14]:
print('Test Accuracy: {}'.format(np.mean(clf.predict(Xtest) == ytest)))

Test Accuracy: 0.989


## Part 5: Top Predictors of Spam

Since the model we are training is a linear SVM, we can inspect the weights it has learned to better understand how it determines whether an email is spam. The following code finds the words with the highest weights in the classifier.

In [15]:
# get weights from the model
weights = clf.coef_[0]
weights

array([ 0.00793208,  0.01563324,  0.05546492, ..., -0.08670606,
       -0.00661274,  0.06506632])

In [16]:
print('sort weights from big to small:\n')
print(np.sort(weights)[::-1])

sort weights from big to small:

[ 0.50061374  0.46591639  0.42286912 ... -0.42835516 -0.43807244
 -0.60513164]


In [17]:
# get the index of weights from big to small
id_descend = weights.argsort()[::-1]

In [18]:
# check the first index corresponding to the biggest weight
weights[id_descend[0]]

0.5006137361746403

In [19]:
print('Top predictors of spam:')
for i in id_descend[:20]:
    # weights are indexed from 0, vocabulary indices start at 1
    print('{} ({})'.format(vocabList[i + 1], weights[i]))

Top predictors of spam:
our (0.500613736175)
click (0.465916390689)
remov (0.422869117061)
guarante (0.383621601794)
visit (0.367710398246)
basenumb (0.345064097946)
dollar (0.323632035796)
will (0.269724106037)
price (0.267297714618)
pleas (0.2611688867)
most (0.257298197952)
nbsp (0.25394145516)
lo (0.253466524314)
ga (0.248296990456)
hour (0.246404357832)
al (0.237310668172)
da (0.233261215232)
se (0.23295496246)
want (0.23194709266)
dollarnumb (0.229639162845)
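
One way to read these weights: because each feature is 0 or 1, the decision value of a linear SVM for an email is the sum of the weights of the vocabulary words it contains, plus an intercept. The sketch below illustrates this with a hypothetical four-word vocabulary, made-up weights, and a made-up intercept; none of these values come from the trained model. It assumes that label 1 denotes spam and that a positive decision value maps to label 1. In this notebook the corresponding quantities would presumably be `clf.coef_[0]` and `clf.intercept_[0]`.

```python
def decision_score(weights, intercept, x):
    """Linear decision value w . x + b for a single feature vector."""
    return sum(w * xi for w, xi in zip(weights, x)) + intercept


# hypothetical values for illustration only
toy_vocab = {1: 'click', 2: 'hour', 3: 'list', 4: 'price'}
toy_weights = [0.5, 0.25, -0.5, 0.25]
toy_intercept = -0.25

email_a = [1, 0, 0, 1]  # contains 'click' and 'price'
email_b = [0, 0, 1, 0]  # contains only 'list'

score_a = decision_score(toy_weights, toy_intercept, email_a)
score_b = decision_score(toy_weights, toy_intercept, email_b)

# assumed convention: positive score -> label 1 (spam)
label_a = 1 if score_a > 0 else 0
label_b = 1 if score_b > 0 else 0

print(score_a, label_a)  # 0.5 1
print(score_b, label_b)  # -0.75 0
```

Words with large positive weights push the score toward the spam side whenever they appear, which is why sorting the weights in descending order surfaces the strongest spam indicators.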
