## NLP Exploration: Spam or Ham?


In this notebook I explore how to pre-process and vectorize natural language data for a machine learning classification problem. I use a toy dataset of text messages labelled as either "spam" or "ham" and compare two classification algorithms suited to this task.

### Pre-processing the text data

We clean the text dataset using the following steps:
1. Remove punctuation
2. Tokenize
3. Remove stopwords
4. Lemmatize/Stem
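
As a compact illustration of steps 1 to 3, the sketch below runs them on a single message. It is only a sketch: `TOY_STOPWORDS` is a hypothetical hand-picked list (the notebook itself uses NLTK's full English list), empty tokens are dropped for tidiness, and step 4 is left out because it needs the WordNet data that is downloaded later.

```python
import re
import string

# Hypothetical mini stopword list, for illustration only;
# the notebook uses NLTK's full English stopword list instead.
TOY_STOPWORDS = {"i", "a", "on", "with", "have", "will", "the", "to"}

def preprocess(text, stopwords=TOY_STOPWORDS):
    # 1. Remove punctuation
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # 2. Lower-case and tokenize on runs of non-word characters,
    #    dropping any empty strings left at the edges
    tokens = [tok for tok in re.split(r"\W+", no_punct.lower()) if tok]
    # 3. Remove stopwords
    return [tok for tok in tokens if tok not in stopwords]

print(preprocess("I HAVE A DATE ON SUNDAY WITH WILL!!"))  # ['date', 'sunday']
```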


In [1]:
#read in the data and take a peek
import pandas as pd
pd.set_option('display.max_colwidth', 100)

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=None)
data.columns = ['label', 'body_text']

data.head()

Unnamed: 0,label,body_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2,ham,"Nah I don't think he goes to usf, he lives around here though"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [3]:
#remove punctuation- use the string library to help here
import string

#define a function that removes punctuation from input text
def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

#apply the function to the dataset
data['body_text_clean'] = data['body_text'].apply(lambda x: remove_punct(x))

#check it out
data.head()

Unnamed: 0,label,body_text,body_text_clean
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL


In [4]:
#tokenize the data using the re (regular expression) library

import re

#define a function that splits the text up on non-word characters
def tokenize(text):
    #raw string so that \W reaches the regex engine instead of being read as a string escape
    tokens = re.split(r'\W+', text)
    return tokens

#apply the function to the dataset
data['body_text_tokenized'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))

#check that it worked-yes
data.head()

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]"


In [10]:
#now lets remove all stopwords

import nltk
nltk.download('stopwords')
#create a list of all stopwords to remove from the dataset
stopword=nltk.corpus.stopwords.words('english')

#define a function that actually removes these stopwords from input text
def remove_stopwords(tokenized_list):
    text = [word for word in tokenized_list if word not in stopword]
    return text

#apply to the data set
data['body_text_nostop']=data['body_text_tokenized'].apply(lambda x: remove_stopwords(x))

#check it worked
data.head()



[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sofiapasquini/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized,body_text_nostop
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]"


In [13]:
#now lets try to lemmatize
#choosing to lemmatize over stem since the vocab space is large and the documents are small
#also, algorithm processing time is not important for this exploration

#lets use the WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
wn = nltk.WordNetLemmatizer()

#define a function that applies the lemmatizer to each word in input text
def lemmatizing(tokenized_text):
    text = [wn.lemmatize(word) for word in tokenized_text]
    return text

#apply to the data set
data['body_text_lemmatized'] = data['body_text_nostop'].apply(lambda x: lemmatizing(x))

#take a peek to check
data.head()


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sofiapasquini/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/sofiapasquini/nltk_data...


Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized,body_text_nostop,body_text_lemmatized
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, searching, right, word, thank, breather, promise, wont, take, help, granted, fulfil, promi..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, go, usf, life, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]","[date, sunday]"


### Process the text data.

Now we process the cleaned text data by applying a tf-idf vectorizer. This creates a document-term matrix in which each row represents a message, each column represents a unigram, and each cell holds the computed tf-idf weighting of that unigram in that message.
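
To make the weighting concrete, here is a hand-rolled sketch of tf-idf on a three-message toy corpus. It is not the library's implementation; it assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which is my understanding of scikit-learn's default settings. The function name `tfidf_matrix` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """docs is a list of token lists; returns (vocabulary, rows of weights)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many messages does each token appear?
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (assumed default formula)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        # L2-normalise the row; fall back to 1.0 for an empty message
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Toy corpus, invented for this example
docs = [["free", "entry", "win"], ["date", "sunday"], ["win", "free", "prize"]]
vocab, rows = tfidf_matrix(docs)
print(vocab)                 # one column per unigram
print(len(rows), len(vocab)) # one row per message
```

Within the first message, "entry" appears in only one message and so receives a larger weight than "free", which appears in two.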

We also engineer two additional features for the feature matrix: the length of the message (not counting spaces) and the percentage of the message's characters that are punctuation.
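
As a quick worked check of these two features, the sketch below computes them for one message from the dataset preview. The helper names `body_len` and `punct_percent` are illustrative, and the guard against an all-space message is an assumption that the notebook's own function does not include.

```python
import string

def body_len(text):
    # Number of characters, not counting spaces
    return len(text) - text.count(" ")

def punct_percent(text):
    n = body_len(text)
    if n == 0:
        # Assumption: treat an empty or all-space message as 0% punctuation
        return 0.0
    count = sum(1 for ch in text if ch in string.punctuation)
    return round(100 * count / n, 1)

msg = "I HAVE A DATE ON SUNDAY WITH WILL!!"
# 28 non-space characters, 2 of them punctuation: 2 / 28 is about 7.1%
print(body_len(msg), punct_percent(msg))  # 28 7.1
```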

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [24]:
tfidf_vect = TfidfVectorizer(analyzer=lemmatizing)
X_tfidf = tfidf_vect.fit_transform(data['body_text_nostop'])

#check out the shape of the result of the transformation
print(X_tfidf.shape)

(5568, 8914)


In [26]:
#be sure to convert the result from a sparse matrix to a dataframe so we can use it in ML

X_tfidf_df = pd.DataFrame(X_tfidf.toarray())
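#get_feature_names() was removed in scikit-learn 1.2; on older versions use get_feature_names()
X_tfidf_df.columns = tfidf_vect.get_feature_names_out()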
X_tfidf_df.head()



Unnamed: 0,Unnamed: 1,0,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,020603,...,zindgi,zoe,zogtorius,zoom,zouk,zyada,é,ü,üll,〨ud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
#lets now perform the feature engineering
import matplotlib.pyplot as plt

#define a function that calculates the percent of punctuation in the message
def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

#create the percent punctuation feature
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

#create the body length feature
data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))

#check the dataframe to make sure it worked
data.head()

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized,body_text_nostop,body_text_lemmatized,punct%,body_len
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, searching, right, word, thank, breather, promise, wont, take, help, granted, fulfil, promi...",2.5,160
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...",4.7,128
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, go, usf, life, around, though]",4.1,49
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]",3.2,62
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]","[date, sunday]",7.1,28


### Now apply the machine learning model.

Here we explore the performance of Random Forest and Gradient Boosting Classifiers on this data set. We use the same number of estimators (150) in both cases and compare the models on the time taken to train and predict as well as on performance metrics (precision, recall and accuracy).
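
Since the same timing and scoring steps are applied to both models, they could be wrapped in one helper. The following is a minimal sketch under stated assumptions: `timed_evaluate` and `binary_scores` are hypothetical names, the metrics are computed by hand rather than with scikit-learn, and `LengthRule` is an invented stand-in model with made-up data so that the example runs without any fitted classifier.

```python
import time

def binary_scores(y_true, y_pred, pos_label="spam"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return precision, recall, accuracy

def timed_evaluate(model, X_train, y_train, X_test, y_test):
    # Time the fit and the prediction separately
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = model.predict(X_test)
    pred_time = time.perf_counter() - start

    precision, recall, accuracy = binary_scores(y_test, y_pred)
    return {"fit_time": fit_time, "pred_time": pred_time,
            "precision": precision, "recall": recall, "accuracy": accuracy}

class LengthRule:
    """Hypothetical stand-in model: flags any message longer than 100 characters."""
    def fit(self, X, y):
        return self
    def predict(self, X):
        return ["spam" if length > 100 else "ham" for length in X]

# Made-up message lengths and labels, purely to exercise the helper
result = timed_evaluate(LengthRule(), [128, 49], ["spam", "ham"],
                        [150, 30, 120, 160], ["spam", "ham", "ham", "ham"])
print(result)
```

Any object exposing `fit` and `predict` should work in place of `LengthRule`, so the same call could be made once per classifier.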

In [32]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score
import time
from sklearn.model_selection import train_test_split

In [49]:
#split the data into the training and testing sets (use an 80/20 split)
X=pd.concat([data[['body_len', 'punct%']], X_tfidf_df],axis=1)
y=data['label']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2 )

In [50]:
#first the random forest classifier

#initialize the model
rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

#compute the training time
start = time.time()
rf_model = rf.fit(X_train, y_train)
end = time.time()
fit_time = (end - start)

#compute the prediction time
start = time.time()
y_pred = rf_model.predict(X_test)
end = time.time()
pred_time = (end - start)

#compute other performance metrics and print out
precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 4.062 / Predict time: 0.163 ---- Precision: 1.0 / Recall: 0.822 / Accuracy: 0.977


In [51]:
#now the gradient boosting classifier
#initialize the classifier
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

#compute the training time
start = time.time()
gb_model = gb.fit(X_train, y_train)
end = time.time()
fit_time = (end - start)

#compute the prediction time
start = time.time()
y_pred = gb_model.predict(X_test)
end = time.time()
pred_time = (end - start)

#compute other performance metrics and print out
precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 131.532 / Predict time: 0.222 ---- Precision: 0.932 / Recall: 0.842 / Accuracy: 0.971


### Conclusions:

If fitting time matters in a given industry context, the Gradient Boosting Classifier would be a significant bottleneck: it took roughly 132 seconds to train versus roughly 4 seconds for the Random Forest Classifier, so the Random Forest Classifier should be chosen. The Random Forest Classifier also has the higher precision (1.0 versus 0.932), meaning that every message it flagged as spam in the test set really was spam; this makes it the better fit for something like an email spam filter, where sending a legitimate message to the spam folder is costly. The Gradient Boosting Classifier has a slightly higher recall (0.842 versus 0.822), so it catches marginally more of the spam messages. Overall accuracy is marginally higher for the Random Forest Classifier (0.977 versus 0.971). These figures come from a single random 80/20 split, so differences this small may not be significant. As Gradient Boosting Classifiers are, relatively speaking, easier to overfit and slower to learn, a further analysis of training metrics and of the effect of a larger sample size would be needed before ruling out either effect as an explanation for these results.
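
One way to weigh precision against recall in a single number is the F1 score, the harmonic mean of the two. The sketch below computes it from the precision and recall values printed by the two model cells above; it adds no new measurements, and the helper name `f1` is illustrative.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall; 0.0 if both are zero
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) as printed by the two model cells
reported = {
    "Random Forest": (1.0, 0.822),
    "Gradient Boosting": (0.932, 0.842),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1(p, r):.3f}")
# Random Forest: F1 = 0.902
# Gradient Boosting: F1 = 0.885
```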