# Email-Spam Classification

The aim of text classification is to automatically assign text documents to predefined categories.

In this part we solve a binary classification problem, spam versus ham (non-spam), using machine learning techniques.

We use an open-source dataset from Kaggle, the SMS Spam Collection Dataset. The messages are SMS texts, which this notebook refers to as emails. The dataset can be found at the following URL: https://www.kaggle.com/uciml/sms-spam-collection-dataset#spam.csv

# 1. Understanding the dataset

In [1]:
# import required libraries
import pandas as pd

In [7]:
# read the dataset into a DataFrame (the file is not UTF-8, hence latin-1)
df = pd.read_csv('./dataset/spam.csv', encoding='latin-1')

In [8]:
# show the first 5 rows
df.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


Looking at the dataset, we can see that besides the v1 and v2 columns there are 3 unnamed columns. Let's investigate by looking at the DataFrame columns.

In [9]:
# understanding the dataset
df.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')
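
Before dropping the unnamed columns, it can help to measure how empty they really are. The following is a minimal sketch on a small hypothetical frame (the `toy` name and all of its values are made up for illustration and are not taken from spam.csv); it reports the fraction of missing values per column.

```python
import pandas as pd

# Hypothetical miniature of the raw file: two useful columns plus
# mostly empty "Unnamed" columns (values invented for illustration).
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["ok lar", "free entry", "see you", "call me"],
    "Unnamed: 2": [None, None, "stray text", None],
    "Unnamed: 3": [None, None, None, None],
})

# Fraction of missing values in each column.
missing_ratio = toy.isna().mean()
print(missing_ratio)

# Keep only the columns that have no missing values in this toy example.
kept = [col for col in toy.columns if missing_ratio[col] == 0]
print(kept)
```

On the real data the same `isna().mean()` call could be run on `df` directly; the threshold for dropping a column is a judgment call.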

In [10]:
# keep only the first 2 columns, v1 and v2, since the rest do not contain useful information
# (.copy() avoids pandas' SettingWithCopyWarning in later assignments)
df = df[['v1', 'v2']].copy()

In [11]:
# rename the columns to something more meaningful, e.g. 'target', 'email'
df = df.rename(columns={'v1': 'target', 'v2': 'email'})

In [13]:
# verify the last change
df.head()

| | target | email |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


# 2. Text preprocessing

In [23]:
# install required libraries
!pip3 install textblob --user
!pip3 install pandas --user
!pip3 install numpy --user
!python3 -mpip install matplotlib
!pip3 install nltk --user
!pip3 install scikit-learn --user



In [89]:
# import required libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

import string
import os

from nltk.stem import SnowballStemmer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

from textblob import TextBlob
from textblob import Word

import sklearn.feature_extraction.text as text
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import model_selection, preprocessing, naive_bayes, metrics, svm
import sklearn.linear_model as lm

In [25]:
# first preprocessing step: lowercase every word
# (stop-word removal, stemming and lemmatization follow in the next cells)
df['email'] = df['email'].apply(lambda x: ' '.join(word.lower() for word in x.split()))

In [29]:
# verify the last change by printing the first 5 rows
df['email'].head()

0    go until jurong point, crazy.. available only ...
1                        ok lar... joking wif u oni...
2    free entry in 2 a wkly comp to win fa cup fina...
3    u dun say so early hor... u c already then say...
4    nah i don't think he goes to usf, he lives aro...
Name: email, dtype: object

In [30]:
# create a list of English stopwords (needs a one-time nltk.download('stopwords'))
stop_words = stopwords.words('english')

In [31]:
# remove stopwords from the 'email' column
df['email'] = df['email'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))

In [32]:
# verify the last change by printing the first 5 rows
df['email'].head()

0    go jurong point, crazy.. available bugis n gre...
1                        ok lar... joking wif u oni...
2    free entry 2 wkly comp win fa cup final tkts 2...
3            u dun say early hor... u c already say...
4              nah think goes usf, lives around though
Name: email, dtype: object
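
The output above still contains tokens such as `point,` and `crazy..`, because splitting on whitespace leaves punctuation attached to words; a stop word followed by a comma would therefore not be matched. One possible refinement, sketched below with a tiny hypothetical stop-word set instead of the NLTK list, strips punctuation before filtering. The `clean` helper and `toy_stop_words` are illustrative names, not part of the notebook.

```python
import string

# Small hypothetical stop-word set, used here instead of the NLTK list
# so that the example runs without downloading any corpus.
toy_stop_words = {"i", "he", "to", "until", "only", "so", "then"}

# Translation table that deletes every ASCII punctuation character.
remove_punct = str.maketrans("", "", string.punctuation)

def clean(text):
    """Lowercase, strip punctuation, then drop stop words."""
    tokens = text.lower().translate(remove_punct).split()
    return " ".join(tok for tok in tokens if tok not in toy_stop_words)

cleaned = clean("Go until jurong point, crazy.. Available only so, then!")
print(cleaned)
```

Note that this also removes apostrophes, so `don't` becomes `dont`; whether that is acceptable depends on the stop-word list in use.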

In [33]:
# next, normalize each sentence using PorterStemmer algorithm 
# (1) create PorterStemmer object
st = PorterStemmer()

In [34]:
# (2) apply stemming on each email sentence
df['email'] = df['email'].apply(lambda x: ' '.join([st.stem(word) for word in x.split()]))

In [35]:
# verify the last change by printing the first 5 rows
df['email'].head()

0    go jurong point, crazy.. avail bugi n great wo...
1                          ok lar... joke wif u oni...
2    free entri 2 wkli comp win fa cup final tkt 21...
3            u dun say earli hor... u c alreadi say...
4                nah think goe usf, live around though
Name: email, dtype: object
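
To see what the stemmer does to individual words, the short sketch below applies NLTK's `PorterStemmer` to a few words taken from the messages shown earlier. It is only an illustration of the same call used above; the `words` list is chosen for this example.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the messages shown earlier in this notebook.
words = ["available", "entry", "early", "already", "goes", "lives"]

# Map each word to its stem.
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(word, "->", stem)
```

Stems such as `avail`, `entri` and `goe` are not dictionary words; stemming only strips suffixes by rule, which is why these forms appear in the output above.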

In [36]:
# apply lemmatization with TextBlob (this runs on the already stemmed words)
df['email'] = df['email'].apply(lambda x: ' '.join([Word(word).lemmatize() for word in x.split()]))

In [38]:
# verify the last change
df['email'].head()

0    go jurong point, crazy.. avail bugi n great wo...
1                          ok lar... joke wif u oni...
2    free entri 2 wkli comp win fa cup final tkt 21...
3            u dun say earli hor... u c alreadi say...
4                nah think goe usf, live around though
Name: email, dtype: object

In [39]:
# check dataset
df.head()

| | target | email |
|---|---|---|
| 0 | ham | go jurong point, crazy.. avail bugi n great wo... |
| 1 | ham | ok lar... joke wif u oni... |
| 2 | spam | free entri 2 wkli comp win fa cup final tkt 21... |
| 3 | ham | u dun say earli hor... u c alreadi say... |
| 4 | ham | nah think goe usf, live around though |


# 3. Feature Engineering

In [40]:
# next, split dataset into training and validation using sklearn
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['email'], df['target'])
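
The call above uses scikit-learn's defaults, so the split is random and differs between runs. A possible variant, sketched here on hypothetical toy data, fixes `random_state` for repeatability and uses `stratify` to keep the spam ratio equal in both parts. The `x_tr`, `x_va`, `y_tr`, `y_va` names and the value 42 are arbitrary choices for this example.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 ham and 4 spam messages (20% spam).
texts = [f"message {i}" for i in range(20)]
labels = ["ham"] * 16 + ["spam"] * 4

# random_state makes the split repeatable; stratify keeps the spam
# ratio the same in both parts. test_size=0.25 is the scikit-learn
# default, stated explicitly here.
x_tr, x_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(x_tr), len(x_va))
print(y_tr.count("spam"), y_va.count("spam"))
```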

In [41]:
# encode the string labels ('ham', 'spam') as integers (0, 1)
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
# reuse the mapping learned on the training labels instead of refitting
valid_y = encoder.transform(valid_y)

In [42]:
train_y

array([0, 0, 0, ..., 0, 0, 0])

In [43]:
valid_y

array([0, 0, 1, ..., 0, 1, 1])
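
`LabelEncoder` assigns integer codes in sorted label order, so `ham` becomes 0 and `spam` becomes 1. The sketch below, using hypothetical label lists in place of `train_y` and `valid_y`, fits the encoder on the training labels and reuses that mapping for the validation labels, so both splits are guaranteed to share one encoding.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label lists standing in for train_y and valid_y.
toy_train_labels = ["ham", "spam", "ham", "ham"]
toy_valid_labels = ["spam", "ham"]

toy_encoder = LabelEncoder()
# Learn the label -> integer mapping from the training labels only.
encoded_train = toy_encoder.fit_transform(toy_train_labels)
# Reuse the same mapping for the validation labels.
encoded_valid = toy_encoder.transform(toy_valid_labels)

# The position of a label in classes_ is its integer code.
print(list(toy_encoder.classes_))
print(encoded_train, encoded_valid)
```

`toy_encoder.inverse_transform` converts predicted integers back to the original string labels when needed.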

In [44]:
# create TF-IDF object which takes 5000 features
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)

In [45]:
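# learn the vocabulary and IDF weights
# (note: this uses the full dataset, including the validation rows;
# fitting on train_x alone would keep the validation set fully unseen)
tfidf_vect.fit(df['email'])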

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=5000,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='\\w{1,}', tokenizer=None,
                use_idf=True, vocabulary=None)
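
The parameters printed above (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`) determine how each weight is computed: the raw term count is multiplied by a smoothed inverse document frequency, and each row is then scaled to unit length. As a rough check of that reading, the sketch below recomputes the weights by hand for a hypothetical three-message corpus and compares them with `TfidfVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical three-message corpus.
docs = ["free entry win", "ok lar joke", "free win win"]
pattern = r"\w{1,}"

# Raw term counts, shape (n_documents, n_terms).
counts = CountVectorizer(token_pattern=pattern).fit_transform(docs).toarray()

n_docs = counts.shape[0]
# Number of documents that contain each term.
doc_freq = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
# Term frequency times idf, then l2-normalize every row.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer(token_pattern=pattern).fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```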

In [49]:
xtrain_tfidf = tfidf_vect.transform(train_x)

In [52]:
xtrain_tfidf

<4179x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 38197 stored elements in Compressed Sparse Row format>

In [50]:
xvalid_tfidf = tfidf_vect.transform(valid_x)

In [53]:
xvalid_tfidf

<1393x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 13117 stored elements in Compressed Sparse Row format>

In [51]:
xtrain_tfidf.data

array([0.30123519, 0.22596515, 0.40846268, ..., 0.42899124, 0.27609199,
       0.3086596 ])

In [54]:
xvalid_tfidf.data

array([0.58949323, 0.53904849, 0.45567537, ..., 0.21241168, 0.22264131,
       0.13722819])
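
A common alternative to fitting the vectorizer on every message is to learn the vocabulary from the training split only and then map the validation split onto it, so that no information from the validation messages reaches the features. The following is a minimal sketch under that assumption, with hypothetical toy messages and illustrative variable names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy split.
toy_train = ["free entry win prize", "ok see you later", "win free ticket"]
toy_valid = ["free lottery win", "see you tomorrow"]

vectorizer = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}", max_features=5000)

# Learn vocabulary and idf weights from the training messages only.
x_train = vectorizer.fit_transform(toy_train)
# Validation messages are mapped onto that fixed vocabulary.
x_valid = vectorizer.transform(toy_valid)

print(x_train.shape, x_valid.shape)
# Words seen only in validation data are simply ignored.
print("lottery" in vectorizer.vocabulary_)
```

Both matrices have the same number of columns, which is what the classifier requires; only the source of the vocabulary changes.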

# 4. Model training and Evaluation

In [95]:
# define a function for training any given model
def train_model(classifier, feature_vector_train, label):
    # fit the classifier on the training features and labels
    classifier.fit(feature_vector_train, label)
    # return the fitted classifier so it can be used for prediction
    return classifier

In [96]:
nb_model = train_model(naive_bayes.MultinomialNB(alpha=0.2), xtrain_tfidf, train_y)

In [97]:
# Naive Bayes accuracy score (accuracy_score expects y_true first, then y_pred)
predictions = nb_model.predict(xvalid_tfidf)
nb_accuracy = metrics.accuracy_score(valid_y, predictions)
print("Accuracy: ", nb_accuracy)

Accuracy:  0.9870782483847811
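
When ham messages greatly outnumber spam, accuracy alone can be misleading, because a model that always predicts ham already scores well. The sketch below uses hypothetical labels and predictions (not results from this notebook) to show how a majority-class baseline, precision, recall and the confusion matrix can be read next to accuracy.

```python
from sklearn import metrics

# Hypothetical labels: 0 = ham, 1 = spam (8 ham, 2 spam).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Hypothetical predictions that miss one of the two spam messages.
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

accuracy = metrics.accuracy_score(y_true, y_pred)
# Accuracy of always predicting the majority class (ham).
baseline = metrics.accuracy_score(y_true, [0] * len(y_true))
# Precision and recall for the spam class (label 1).
precision = metrics.precision_score(y_true, y_pred)
recall = metrics.recall_score(y_true, y_pred)
matrix = metrics.confusion_matrix(y_true, y_pred)

print("accuracy:", accuracy, "baseline:", baseline)
print("spam precision:", precision, "spam recall:", recall)
print(matrix)
```

In this toy case the accuracy is 0.9 against a baseline of 0.8, while spam recall is only 0.5, which the accuracy figure alone would hide.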


In [98]:
# logistic regression (a linear classifier) on word-level TF-IDF vectors
linear_model = train_model(lm.LogisticRegression(), xtrain_tfidf, train_y)



In [99]:
# Linear model accuracy score
predictions = linear_model.predict(xvalid_tfidf)
linear_model_accuracy = metrics.accuracy_score(valid_y, predictions)
print("Accuracy: ", linear_model_accuracy)

Accuracy:  0.955491744436468


Comparing the accuracy scores of the two classifiers, Naive Bayes (98.7%) gives better results than the logistic regression model (95.5%) on this validation split. Because the split is random, the exact figures will vary slightly between runs.
We can try many more classifiers in the same way and then choose the best one.
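
One way to try several classifiers is to loop over them with a shared TF-IDF step. The sketch below does this on hypothetical toy messages; the `candidates` dictionary, the choice of `LinearSVC` as a third model, and all data are assumptions made for illustration, and the scores on such a tiny set say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; 0 = ham, 1 = spam.
toy_train_x = ["free entry win prize", "win cash now", "ok see you later",
               "are you coming home", "free ticket claim now", "call me later ok"]
toy_train_y = [1, 1, 0, 0, 1, 0]
toy_valid_x = ["claim free prize", "see you at home"]
toy_valid_y = [1, 0]

candidates = {
    "naive_bayes": MultinomialNB(alpha=0.2),
    "logistic_regression": LogisticRegression(),
    "linear_svm": LinearSVC(),
}

scores = {}
for name, classifier in candidates.items():
    # The pipeline fits TF-IDF on the training messages, then the classifier.
    model = make_pipeline(TfidfVectorizer(token_pattern=r"\w{1,}"), classifier)
    model.fit(toy_train_x, toy_train_y)
    # score() returns the accuracy on the held-out messages.
    scores[name] = model.score(toy_valid_x, toy_valid_y)

# Print the candidates from best to worst accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(name, score)
```

Wrapping the vectorizer and the classifier in one pipeline also ensures the vocabulary is learned from the training messages only.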