## Spam Classification

Files used in this exercise:
    
    ex6data1.mat - Example Dataset 1
    
    ex6data2.mat - Example Dataset 2
    
    ex6data3.mat - Example Dataset 3
    
    spamTrain.mat - Spam training set
    
    spamTest.mat - Spam test set
    
    emailSample1.txt - Sample email 1
    
    emailSample2.txt - Sample email 2
    
    spamSample1.txt - Sample spam 1
    
    spamSample2.txt - Sample spam 2
    
    vocab.txt - Vocabulary list

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat

%matplotlib inline
%config InlineBackend.figure_format='svg'

### 2.1 Preprocessing Emails

A sample email looks like this:

    Anyone knows how much it costs to host a web portal ?

    Well, it depends on how many visitors youre expecting. This can be anywhere from less than 10 bucks a month to a  couple of $100. You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 if youre running something big..

    To unsubscribe yourself from this mailing list, send an email to: groupname-unsubscribe@egroups.com

In [2]:
#Read emailSample1
emailSample1=pd.read_table('emailSample1.txt',sep='\t',header=None,names=['emailSample1'],index_col=None)
emailSample1

Unnamed: 0,emailSample1
0,> Anyone knows how much it costs to host a web...
1,>
2,"Well, it depends on how many visitors you're e..."
3,This can be anywhere from less than 10 bucks a...
4,You should checkout http://www.rackspace.com/ ...
5,if youre running something big..
6,To unsubscribe yourself from this mailing list...
7,groupname-unsubscribe@egroups.com


Each email is preprocessed and normalized with the following steps:

    • Lower-casing: The entire email is converted into lower case, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).

    • Stripping HTML: All HTML tags are removed from the emails. Many emails come with HTML formatting; we remove all the HTML tags, so that only the content remains.

    • Normalizing URLs: All URLs are replaced with the text “httpaddr”.

    • Normalizing Email Addresses: All email addresses are replaced with the text “emailaddr”.

    • Normalizing Numbers: All numbers are replaced with the text “number”.

    • Normalizing Dollars: All dollar signs ($) are replaced with the text “dollar”.

    • Word Stemming: Words are reduced to their stemmed form. For example, “discount”, “discounts”, “discounted” and “discounting” are all replaced with “discount”. Sometimes, the stemmer actually strips off additional characters from the end, so “include”, “includes”, “included”, and “including” are all replaced with “includ”.

    • Removal of non-words: Non-words and punctuation are removed. All white space (tabs, newlines, spaces) is collapsed to a single space character.


After preprocessing, the sample email becomes:

    anyon know how much it cost to host a web portal well it depend on how
    
    mani visitor your expect thi can be anywher from less than number buck
    
    a month to a coupl of dollarnumb you should checkout httpaddr or perhap
    
    amazon ecnumb if your run someth big to unsubscrib yourself from thi
    
    mail list send an email to emailaddr
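
The normalization steps listed above can be sketched with regular expressions. This is a minimal sketch, not the preprocessing that actually produced the output above: the function name `preprocess_email` and the `stem` hook are hypothetical, the patterns are simple approximations, and stemming is left out by default (a Porter stemmer could be passed in as `stem`), so its output keeps unstemmed words such as "anyone" rather than "anyon".

```python
import re

def preprocess_email(text, stem=lambda word: word):
    """Normalize raw email text following the steps listed above.

    `stem` is a hypothetical hook for a stemming function; by default
    words are left unstemmed.
    """
    text = text.lower()                                        # lower-casing
    text = re.sub(r'<[^<>]+>', ' ', text)                      # strip HTML tags
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)  # normalize URLs
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)         # normalize email addresses
    text = re.sub(r'[0-9]+', 'number', text)                   # normalize numbers
    text = re.sub(r'[$]+', 'dollar', text)                     # normalize dollar signs
    # Split on anything that is not a letter or digit; this drops
    # punctuation and collapses runs of white space.
    tokens = [t for t in re.split(r'[^a-z0-9]+', text) if t]
    return ' '.join(stem(t) for t in tokens)

sample = "Visit http://www.rackspace.com/ for $100, or mail Me@Example.com <b>now</b>!"
print(preprocess_email(sample))
# visit httpaddr for dollarnumber or mail emailaddr now
```

The order of the substitutions matters: URLs and email addresses are replaced before numbers, so digits inside an address are not rewritten first.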

#### 2.1.1 Vocabulary List

The next step is to choose which words to use in our classifier and which to leave out.

Our vocabulary list was selected by choosing all words that occur at least 100 times in the spam corpus, resulting in a list of 1899 words.

In [3]:
vocab=pd.read_table('vocab.txt',header=None,sep='\t',names=['index','voc'])
vocab

Unnamed: 0,index,voc
0,1,aa
1,2,ab
2,3,abil
3,4,abl
4,5,about
...,...,...
1894,1895,your
1895,1896,yourself
1896,1897,zdnet
1897,1898,zero


Given the vocabulary list, we can now map each word in the preprocessed emails to a list of word indices, where each entry is the index of that word in the vocabulary list.
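
As a rough illustration of this mapping, the sketch below looks up each word of a preprocessed email in a vocabulary and keeps the 1-based index of every word it finds. The function name `words_to_indices` and the five-word `toy_vocab` are made up for the example; the real list has 1899 entries.

```python
def words_to_indices(processed_email, vocab_words):
    """Return the 1-based vocabulary index of each known word, in order."""
    # Map each vocabulary word to its 1-based position, matching vocab.txt
    word_to_index = {word: i for i, word in enumerate(vocab_words, start=1)}
    # Words that are not in the vocabulary are skipped
    return [word_to_index[w] for w in processed_email.split() if w in word_to_index]

# Tiny hypothetical vocabulary, for illustration only
toy_vocab = ['anyon', 'cost', 'host', 'know', 'web']
indices = words_to_indices('anyon know how much it cost to host a web portal', toy_vocab)
print(indices)
# [1, 4, 2, 3, 5]
```

With the `vocab` DataFrame loaded above, the `voc` column could presumably be passed as `vocab_words`.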

### 2.2 Extracting Features for Emails

This is the hard part: vectorizing the emails.

Using the already preprocessed dataset, as we do below, skips this step, which is something of a cheat.
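
For reference, the vectorization step that the preprocessed dataset skips can be sketched as follows. This assumes the usual binary encoding, in which feature i is 1 if vocabulary word i occurs in the email and 0 otherwise; the function name `email_features` and the toy sizes are hypothetical.

```python
def email_features(word_indices, vocab_size):
    """Build a binary feature vector: entry i-1 is 1 if word i occurs in the email."""
    features = [0] * vocab_size
    for idx in word_indices:
        features[idx - 1] = 1  # word indices are 1-based, Python lists are 0-based
    return features

# Word 4 occurs twice, but the feature only records presence
x = email_features([1, 4, 2, 3, 5, 4], vocab_size=8)
print(x)
# [1, 1, 1, 1, 1, 0, 0, 0]
```

With the full vocabulary, `vocab_size` would be 1899, giving one feature per vocabulary word.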

In [4]:
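def load_mat():
    """Load the pre-vectorized spam training and test sets."""
    spam_train = loadmat('spamTrain.mat')
    spam_test = loadmat('spamTest.mat')
    X = spam_train['X']
    Xtest = spam_test['Xtest']
    # Flatten the label arrays to 1-D, as scikit-learn expects
    y = spam_train['y'].ravel()
    ytest = spam_test['ytest'].ravel()

    return X, y, Xtest, ytest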

In [5]:
X,y,Xtest,ytest=load_mat()

### 2.3 Training SVM for Spam Classification

In [6]:
from sklearn.svm import SVC

svc=SVC(kernel='rbf')
svc.fit(X,y)
# score() here reports accuracy on the training set, not the test set
print("Score:{}".format(svc.score(X,y)))

Score:0.99325


In [7]:
from sklearn.metrics import classification_report
ypred=svc.predict(Xtest)

print(classification_report(ytest,ypred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       692
           1       0.99      0.97      0.98       308

    accuracy                           0.99      1000
   macro avg       0.99      0.98      0.98      1000
weighted avg       0.99      0.99      0.99      1000
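
To make the columns of the report concrete, here is a minimal sketch of how precision, recall and F1 are computed for one class from true and predicted labels. The helper name `precision_recall_f1` and the toy labels are made up for illustration; they are not taken from the spam test set.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 1 = spam, 0 = not spam
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
print(precision, recall, f1)
```

In the report above, the recall of 0.97 for class 1 means that about 3% of the class 1 test emails were predicted as class 0.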

