<a href="https://colab.research.google.com/github/ravindrabharathi/spam-classifier/blob/master/Spam-classifier-with-comments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam Classifier using Email Data from the TREC 2007 Public Corpus to classify emails as Spam / Ham (not spam)

1. We will use the files from the 'full' folder in this dataset.

2. We will use Pandas to load the emails and label them as spam:1 and ham:0.

3. We will clean up the content by removing line endings, tabs and return characters. We will also strip HTML tags, replace email addresses with a placeholder token, remove numbers and punctuation, and convert the text to lowercase.

4. We will use the NLTK library to remove stopwords such as 'the' and 'had'.

5. We will use NLTK for stemming, which reduces the various forms of a word to its root, e.g. 'like' is the root of 'liked', 'likes' and 'liking'. We will use the NLTK SnowballStemmer, which also supports languages other than English. When handling English alone, the Porter stemmer could be used.

6. We will use TfidfVectorizer from the Scikit Learn library to vectorize the text, then create the train and validation samples.

7. We will train and test spam / ham classification using Support Vector Machines and Naive Bayes.



### import Pandas and other modules for reading data

In [0]:
import pandas as pd
import os
from pathlib import Path



### read email data 

In [0]:
data=pd.read_csv("./full/index",sep=' ',header=None)
data.head()


|   | 0 | 1 |
|---|---|---|
| 0 | spam | ../data/inmail.1 |
| 1 | ham | ../data/inmail.2 |
| 2 | spam | ../data/inmail.3 |
| 3 | spam | ../data/inmail.4 |
| 4 | spam | ../data/inmail.5 |


### create class, filepath and contents columns

In [0]:
data.columns=['class','filepath']

In [0]:
data['contents']=None
data.head()


|   | class | filepath | contents |
|---|---|---|---|
| 0 | spam | ../data/inmail.1 | |
| 1 | ham | ../data/inmail.2 | |
| 2 | spam | ../data/inmail.3 | |
| 3 | spam | ../data/inmail.4 | |
| 4 | spam | ../data/inmail.5 | |


### populate the contents column with email text

In [0]:
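import re
import string

for i,row in data.iterrows():
    # index paths look like '../data/inmail.1'; resolve them relative to the working directory
    filepath=os.path.join(os.getcwd(),row['filepath'].replace('../',''))
    # read raw bytes so that no text encoding has to be assumed
    with open(filepath, 'rb') as f:
        email_txt = f.read()
        if i<2:
            print(email_txt)  # show the first two emails as a sanity check

    # store the raw bytes; they are converted to text during preprocessing
    data.at[i,'contents']= email_txt

print(data.head())
data.info()  # info() prints its summary itself and returns None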

b'From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007\nReturn-Path: <RickyAmes@aol.com>\nReceived: from 129.97.78.23 ([211.202.101.74])\n\tby speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;\n\tSun, 8 Apr 2007 13:07:21 -0400\nReceived: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100\nMessage-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>\nFrom: "Tomas Jacobs" <RickyAmes@aol.com>\nReply-To: "Tomas Jacobs" <RickyAmes@aol.com>\nTo: the00@speedy.uwaterloo.ca\nSubject: Generic Cialis, branded quality@ \nDate: Sun, 08 Apr 2007 21:00:48 +0300\nX-Mailer: Microsoft Outlook Express 6.00.2600.0000\nMIME-Version: 1.0\nContent-Type: multipart/alternative;\n\tboundary="--8896484051606557286"\nX-Priority: 3\nX-MSMail-Priority: Normal\nStatus: RO\nContent-Length: 988\nLines: 24\n\n----8896484051606557286\nContent-Type: text/html;\nContent-Transfer-Encoding: 7Bit\n\n<html>\n<body bgcolor="#ffffff">\n<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 
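The printed email starts with `b'` because the file was read in binary mode. The following is a minimal sketch, using a made-up byte string rather than an email from the dataset, of how calling `str()` on bytes differs from decoding them. It may help explain why the cleaning function defined later replaces the two-character sequences `\\n` and `\\t` instead of real newline and tab characters.

```python
# Hypothetical sample; not taken from the TREC corpus.
raw = b"Subject: Hello\nLine two\tend"

# str() on a bytes object returns its repr: the b'...' wrapper is kept and
# control characters become literal backslash sequences.
as_repr = str(raw)

# decode() returns real text with actual newline and tab characters.
# errors="replace" substitutes undecodable bytes instead of raising.
as_text = raw.decode("utf-8", errors="replace")

print(as_repr)
print(as_text.split())
```

Decoding is an alternative to the `str()` approach used in this notebook, not a description of what the notebook does; switching to it would also require replacing real `\n`, `\r` and `\t` characters in the cleaning step.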

### categorize class as spam:1 and ham:0

In [0]:
category={'spam':1,'ham':0}
data['class']=[category[item] for item in data['class'] ]
data.head()

|   | class | filepath | contents |
|---|---|---|---|
| 0 | 1 | ../data/inmail.1 | b'From RickyAmes@aol.com Sun Apr 8 13:07:32 ... |
| 1 | 0 | ../data/inmail.2 | b'From bounce-debian-mirrors=ktwarwic=speedy.u... |
| 2 | 1 | ../data/inmail.3 | b'From 7stocknews@tractionmarketing.com Sun A... |
| 3 | 1 | ../data/inmail.4 | b'From vqucsmdfgvsg@ruraltek.com Sun Apr 8 1... |
| 4 | 1 | ../data/inmail.5 | b'From dcube@totalink.net Sun Apr 8 13:19:30... |


### create train, test data from the full set

In [0]:
# the slice 0:1999 is half-open: it selects rows 0 to 1998 (1,999 emails)
X_train=(data['contents'][0:1999]).copy()

In [0]:
y_train=data['class'][0:1999]

In [0]:
X_test=(data['contents'][2000:2500]).copy()

In [0]:
y_test=data['class'][2000:2500]
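Python slices are half-open, so `[0:1999]` stops at row 1998 and `[2000:2500]` starts at row 2000. The sketch below uses a plain list as a stand-in for the DataFrame rows (integer slices on a pandas Series with the default index are positional in the same way) to show which rows each slice covers. `train_alt` is a hypothetical alternative, not what this notebook uses.

```python
rows = list(range(2500))   # stand-in for the first 2,500 row positions

train = rows[0:1999]       # same bounds as X_train / y_train
test = rows[2000:2500]     # same bounds as X_test / y_test

print(len(train), train[-1])   # 1999 1998
print(len(test), test[0])      # 500 2000

# Rows that fall in neither slice
unused = sorted(set(rows) - set(train) - set(test))
print(unused)                  # [1999]

# Hypothetical alternative: 0:2000 would leave no gap before the test slice
train_alt = rows[0:2000]
print(len(train_alt) + len(test))   # 2500
```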

### import nltk stopwords

In [0]:
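import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # stop-word lists
nltk.download('punkt')      # tokenizer models used by word_tokenize
# newer NLTK releases may also need nltk.download('punkt_tab')

stop_words=set(stopwords.words("english"))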

    

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ravindra/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/ravindra/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### import nltk tokenizers, stemmers and add function definitions for processing the email text 

In [0]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer,SnowballStemmer

def cleanText(email_txt):
    # str() on bytes yields the repr (b'...'), so line endings show up as
    # the literal two-character sequences \n, \r and \t
    email_txt=str(email_txt).replace('\\n', ' ').replace('\\r', ' ').replace('\\t',' ')
    # strip HTML tags and convert to lowercase
    clean1 = re.compile(r'<.*?>')
    email_txt=re.sub(clean1, '', email_txt).lower()
    # replace email addresses with a placeholder token
    clean2=re.compile(r'\S*@\S*\s?')
    email_txt=re.sub(clean2,'emailAddress',email_txt)
    # drop punctuation, then digits
    email_txt=email_txt.translate(str.maketrans('','',string.punctuation))
    email_txt=email_txt.translate(str.maketrans('','','1234567890'))
    return email_txt

def tokenizeText(text):
    return word_tokenize(text)

def removeStopWords(text):
    # text is a list of tokens
    return [word for word in text if word not in stop_words]

# build the stemmer once instead of on every call
stemr=SnowballStemmer('english')

def performStemming(text):
    # returns one space-separated string, the input format TfidfVectorizer expects
    return ' '.join(stemr.stem(word) for word in text)

def preprocessText(text):
    text0=cleanText(text)            # raw email -> cleaned string
    text1=tokenizeText(text0)        # string -> list of tokens
    text2=removeStopWords(text1)     # drop English stop words
    text3=performStemming(text2)     # stem and join back into a string
    return text3
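To make the cleaning steps concrete, here is a small worked example that applies the same regular expressions and translation tables as `cleanText` to a made-up string, using only the standard library. The sample text and the `separated` variant at the end are illustrative assumptions, not part of the notebook's pipeline.

```python
import re
import string

sample = "<p>Contact Bob@Example.com for 50% OFF!</p>"  # hypothetical input

# 1. Strip HTML tags and lowercase
no_tags = re.sub(r"<.*?>", "", sample).lower()
print(no_tags)     # contact bob@example.com for 50% off!

# 2. Replace email addresses; the trailing \s? also consumes the space
#    after the address, so the placeholder fuses with the next word
masked = re.sub(r"\S*@\S*\s?", "emailAddress", no_tags)
print(masked)      # contact emailAddressfor 50% off!

# 3. Remove punctuation, then digits
no_punct = masked.translate(str.maketrans("", "", string.punctuation))
no_digits = no_punct.translate(str.maketrans("", "", "1234567890"))
print(no_digits.split())   # ['contact', 'emailAddressfor', 'off']

# Variant (an assumption): pad the placeholder with spaces so that it
# stays a separate token
separated = re.sub(r"\S*@\S*", " emailaddress ", no_tags)
print(separated.split())   # ['contact', 'emailaddress', 'for', '50%', 'off!']
```

Whether the fused token matters in practice depends on the data; the padded variant would change the vocabulary and therefore the feature counts reported further down.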

### apply preprocessing on train , test set 

In [0]:

X_train = X_train.apply(preprocessText)

In [0]:
X_test=X_test.apply(preprocessText)

### use TfidfVectorizer for feature extraction / vectorization and create train/validation set using train_test_split

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
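# default settings; stop words were already removed during preprocessing
# (to filter again, pass stop_words="english" by keyword)
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(X_train)
# hold out 30% of the training rows as a validation set
features_train, features_test, labels_train, labels_test = train_test_split(features, y_train, test_size=0.3, random_state=42)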
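For intuition about what the vectorizer computes, the following pure-Python sketch builds TF-IDF weights for three made-up documents. It assumes the weighting that scikit-learn documents for its default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document); the documents and the helper name `tfidf` are hypothetical, and this is not the library's implementation.

```python
import math
from collections import Counter

docs = ["free offer free", "meeting at noon", "free meeting"]  # toy corpus
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Return {term: weight} for one tokenized document, L2-normalized."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the first document 'free' outweighs 'offer' because it occurs twice, even though 'offer' is the rarer term across the corpus; in the third document both terms appear in two documents each, so they receive equal weight.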

### use SVM for training and then test spam / ham classification on validation set

In [0]:
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

svc = SVC(kernel='sigmoid', gamma=1.0)
svc.fit(features_train, labels_train)
prediction = svc.predict(features_test)
accuracy_score(labels_test,prediction)

0.9916666666666667

### use NaiveBayes to train and then test spam / ham classification on validation set

In [0]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB(alpha=0.2)
mnb.fit(features_train, labels_train)
prediction = mnb.predict(features_test)
accuracy_score(labels_test,prediction)

0.9716666666666667

### Get the prediction of NaiveBayes classifier on the test set 

In [0]:

features3 = vectorizer.transform(X_test)
print(features3.shape)
prediction = mnb.predict(features3)


(500, 82430)


### print accuracy score by comparing predictions vs True values

In [0]:
accuracy_score(y_test,prediction)

0.986
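Accuracy is the fraction of predictions that match the true labels. As a rough sketch with made-up labels (1 = spam, 0 = ham), the same number can be computed by hand, and the underlying counts also yield precision and recall for the spam class, which can be informative when the classes are imbalanced.

```python
# Hypothetical labels, not results from this notebook
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))

# Accuracy: matching predictions divided by total predictions
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# Confusion counts for the spam class
tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam predicted as spam
fp = sum(t == 0 and p == 1 for t, p in pairs)  # ham predicted as spam
fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam predicted as ham

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy)    # 0.75
print(precision)   # 0.8
print(recall)      # 0.8
```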