# Spam Classifier

We have a dataset of text messages labelled as spam or ham (not spam). We'll build a model to predict whether a message is spam or ham.

Dataset downloaded from: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

Dataset stored in the home directory as a tab-separated file (read below as 'SMSSpamCollection').

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

In [4]:
text_messages = pd.read_csv('SMSSpamCollection', sep = '\t', names = ['label', 'message'])

In [5]:
text_messages.head()

,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [6]:
text_messages.describe()

,label,message
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


In [7]:
text_messages.tail()

,label,message
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


In [9]:
text_messages.shape

(5572, 2)

In [10]:
text_messages.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

Adding a new column, 'label_number', that converts the labels to numeric form:

ham : 0

spam : 1

In [11]:
text_messages['label_number'] = text_messages.label.map({'ham': 0, 'spam': 1})

In [12]:
text_messages.head()

,label,message,label_number
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


Identifying the feature (X) and the target (y):

X = message

y = label_number

In [13]:
X = text_messages.message
y = text_messages.label_number

In [14]:
print(X.shape)

(5572,)


X is one-dimensional, which is what CountVectorizer() expects: an iterable of raw text documents.

In [16]:
print(y.shape)

(5572,)


# Splitting X, y into training and testing datasets

In [19]:
from sklearn.model_selection import train_test_split

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 13)

In [21]:
print(X_train.shape , X_test.shape)

(3900,) (1672,)


In [22]:
print(y_train.shape , y_test.shape)

(3900,) (1672,)
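The split sizes follow from test_size = 0.3. Assuming scikit-learn rounds the test share up to a whole row, 30% of 5572 rows gives 1672 test rows and leaves 3900 for training. The sketch below checks that arithmetic in plain Python and, using class counts reported elsewhere in this notebook (nb_model.class_count_ for the training set, the confusion-matrix row sums for the test set), compares the spam share of each split. The variable and function names are illustrative only.

```python
import math

n_samples = 5572
test_size = 0.3

# Assumption: a float test_size is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# Class counts taken from the notebook outputs.
train_counts = {"ham": 3386, "spam": 514}
test_counts = {"ham": 1431 + 8, "spam": 24 + 209}

def spam_share(counts):
    """Fraction of messages in a split that are spam."""
    return counts["spam"] / (counts["ham"] + counts["spam"])

print(round(spam_share(train_counts), 4), round(spam_share(test_counts), 4))
```

The split here is not stratified, so the spam share differs slightly between the two sets (roughly 13.2% versus 13.9%). Passing stratify=y to train_test_split would be one way to keep the shares equal.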


# Using CountVectorizer() to get the document-term matrix

Now, instantiating a CountVectorizer object, 'vector', to convert the 1D series of messages into a 2D document-term matrix.

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

In [24]:
vector = CountVectorizer()

Fitting the vectorizer to learn the vocabulary of X_train:

In [25]:
vector.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

Transforming X_train to a document-term matrix, 'X_train_dtmatrix':

In [26]:
X_train_dtmatrix = vector.transform(X_train)

In [27]:
print (X_train_dtmatrix.shape)

(3900, 7125)
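To make the shape (documents, distinct tokens) concrete, here is a toy pure-Python sketch of the fit and transform steps. It only mimics the default settings shown in the CountVectorizer output above (lowercasing and the token pattern for words of two or more characters); it is not scikit-learn's implementation, and the helper names and toy messages are made up for illustration.

```python
import re
from collections import Counter

# Same token pattern as in the CountVectorizer repr: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: col for col, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Build one row of token counts per document (dense, for readability)."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

toy_messages = ["Call me now", "Win a prize now", "call call call"]
vocabulary = fit_vocabulary(toy_messages)
matrix = transform(toy_messages, vocabulary)

print(list(vocabulary))
for row in matrix:
    print(row)
```

Three toy messages and five distinct tokens give a 3 x 5 matrix, in the same way that 3900 training messages and 7125 distinct tokens give the (3900, 7125) shape. The single-character word "a" is dropped by the token pattern.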


In [31]:
# 7125 features (one per distinct token in X_train)
# vector.fit_transform(X_train) performs the fit and transform steps in one call.

In [29]:
X_train_dtmatrix

<3900x7125 sparse matrix of type '<class 'numpy.int64'>'
	with 51886 stored elements in Compressed Sparse Row format>

In [30]:
X_test_dtmatrix = vector.transform(X_test)

In [32]:
X_test_dtmatrix.shape

(1672, 7125)

In [33]:
X_test_dtmatrix

<1672x7125 sparse matrix of type '<class 'numpy.int64'>'
	with 20507 stored elements in Compressed Sparse Row format>

Since the vectorizer was fitted on X_train only, any words in X_test that are not in its vocabulary are dropped.
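A small sketch of why unseen test words disappear: the columns are fixed by the training vocabulary, so a token with no column has nowhere to be counted. This is a plain-Python illustration with made-up messages and a simplified tokenizer, not the scikit-learn code path.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep words of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

train_messages = ["free prize now", "see you at home"]
test_message = "free crypto prize"

# The vocabulary (and so the set of columns) comes from training data only.
vocabulary = sorted({tok for msg in train_messages for tok in tokenize(msg)})

counts = Counter(tokenize(test_message))
test_row = [counts.get(tok, 0) for tok in vocabulary]
dropped = [tok for tok in counts if tok not in vocabulary]

print(vocabulary)
print(test_row)
print(dropped)
```

The test row has the same width as the training vocabulary, which is why X_test_dtmatrix also has 7125 columns.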

# Using Multinomial Naive Bayes: well suited to text classification with discrete word-count features

In [35]:
from sklearn.naive_bayes import MultinomialNB

In [47]:
nb_model = MultinomialNB()

In [37]:
nb_model.fit(X_train_dtmatrix, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
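For intuition about what the fit step estimates, here is a minimal pure-Python sketch of multinomial Naive Bayes with additive smoothing (alpha = 1.0 and priors learned from the data, matching the parameters printed above). The function names, the toy vocabulary and the toy counts are assumptions for illustration; this is not scikit-learn's implementation.

```python
import math

def fit_multinomial_nb(count_rows, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count vectors."""
    classes = sorted(set(labels))
    n_features = len(count_rows[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [row for row, y in zip(count_rows, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(count_rows))
        # Total count of each token across the documents of class c.
        token_totals = [sum(col) for col in zip(*rows)]
        denom = sum(token_totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in token_totals]
    return log_prior, log_likelihood

def predict(row, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {
        c: log_prior[c] + sum(x * ll for x, ll in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy counts over the hypothetical vocabulary ["call", "home", "prize", "win"].
X_toy = [[1, 1, 0, 0], [0, 2, 0, 0], [1, 0, 1, 1], [0, 0, 2, 1]]
y_toy = [0, 0, 1, 1]  # ham: 0, spam: 1

log_prior, log_likelihood = fit_multinomial_nb(X_toy, y_toy)
print(predict([0, 0, 1, 1], log_prior, log_likelihood))
print(predict([0, 1, 0, 0], log_prior, log_likelihood))
```

The smoothing term keeps a token that never appeared in one class from forcing that class's probability to zero.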

In [38]:
y_prediction = nb_model.predict(X_test_dtmatrix)

In [40]:
from sklearn import metrics

In [43]:
print (metrics.accuracy_score(y_test, y_prediction))

0.9808612440191388


nb_model labels about 98.1% of the test messages correctly as spam or ham.

In [45]:
print (metrics.confusion_matrix(y_test, y_prediction))

[[1431    8]
 [  24  209]]


8 False Positives; 24 False Negatives

# Logistic Regression Model to Compare with nb_model's predictions

In [48]:
from sklearn.linear_model import LogisticRegression

In [49]:
log_model = LogisticRegression()

In [50]:
log_model.fit(X_train_dtmatrix, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [52]:
y_prediction2 = log_model.predict(X_test_dtmatrix)

In [53]:
print (metrics.accuracy_score(y_test, y_prediction2))

0.979066985645933


Accuracy of log_model = 97.9%

In [54]:
print (metrics.accuracy_score(y_test, y_prediction))

0.9808612440191388


Accuracy of nb_model = 98.1%

In [58]:
print (metrics.confusion_matrix(y_test, y_prediction2))

[[1433    6]
 [  29  204]]


29 False Negatives, 6 False Positives

Checking the confusion matrix of nb_model:

In [60]:
print (metrics.confusion_matrix(y_test, y_prediction))

[[1431    8]
 [  24  209]]


24 False Negatives, 8 False Positives
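Accuracy alone hides how the errors are distributed, so it can help to derive precision and recall for the spam class from the two confusion matrices printed above. The sketch below does this in plain Python; the summarize helper is a hypothetical name, and the matrix layout (rows are true labels, columns are predicted labels, ordered ham then spam) is the one used by metrics.confusion_matrix.

```python
def summarize(confusion):
    """Derive accuracy, precision and recall for the spam class (label 1).

    Rows are true labels and columns are predicted labels, ordered [ham, spam].
    """
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

nb_summary = summarize([[1431, 8], [24, 209]])
log_summary = summarize([[1433, 6], [29, 204]])

for name, summary in [("nb_model", nb_summary), ("log_model", log_summary)]:
    print(name, {k: round(v, 4) for k, v in summary.items()})
```

On these numbers, log_model flags slightly fewer ham messages as spam (higher precision), while nb_model catches more of the spam (higher recall). For context, always predicting ham would score 1439/1672, about 86.1%, on this test set.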

# Printing out False Positives and False Negatives of nb_model

In [61]:
#ham: 0, spam: 1

8 False Positives (ham classified as spam):

In [62]:
X_test[y_test < y_prediction] # 0 < 1

1672                              Glad to see your reply.
4622                   Received, understood n acted upon!
4862                               Nokia phone is lovly..
574                                Waiting for your call.
216     Finally the match heading towards draw as your...
991                                          26th OF JULY
4729    I (Career Tel) have added u as a contact on IN...
4702                               I liked the new mobile
Name: message, dtype: object

24 False Negatives (spam classified as ham):

In [63]:
X_test[y_test > y_prediction] # 1 > 0

2558    This message is brought to you by GMW Ltd. and...
1500    SMS. ac JSco: Energy is high, but u may not kn...
2354    Please CALL 08712402902 immediately as there i...
4527    I want some cock! My hubby's away, I need a re...
3425    Am new 2 club & dont fink we met yet Will B gr...
4298    thesmszone.com lets you send free anonymous an...
3064    Hi babe its Jordan, how r u? Im home from abro...
3391    Please CALL 08712402972 immediately as there i...
731     Email AlertFrom: Jeri StewartSize: 2KBSubject:...
684     Hi I'm sue. I am 20 years old and work as a la...
4821    Check Out Choose Your Babe Videos @ sms.shsex....
4213    Missed call alert. These numbers called but le...
1663    Hi if ur lookin 4 saucy daytime fun wiv busty ...
751     Do you realize that in about 40 years, we'll h...
1940    More people are dogging in your area now. Call...
672     SMS. ac sun0819 posts HELLO:"You seem cool, wa...
1269    Can U get 2 phone NOW? I wanna chat 2 set up m...
4069    TBS/PE
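The comparison trick used in these cells relies on the 0/1 encoding: the true label is smaller than the prediction only for ham predicted as spam, and larger only for spam predicted as ham. A toy sketch with plain lists (the messages and labels are invented for illustration):

```python
# ham: 0, spam: 1
messages = ["see you soon", "win cash now", "call me later", "free prize"]
y_true = [0, 1, 0, 1]
y_pred = [1, 1, 0, 0]

# true < predicted only when true is 0 (ham) and predicted is 1 (spam).
false_positives = [m for m, t, p in zip(messages, y_true, y_pred) if t < p]
# true > predicted only when true is 1 (spam) and predicted is 0 (ham).
false_negatives = [m for m, t, p in zip(messages, y_true, y_pred) if t > p]

print(false_positives)
print(false_negatives)
```

With pandas, the same comparison produces a boolean mask that selects the matching rows of X_test.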

# Printing out False Positives and False Negatives of log_model

6 False Positives (ham classified as spam):

In [62]:
# log_model's predictions are in y_prediction2
X_test[y_test < y_prediction2] # 0 < 1

29 False Negatives (spam classified as ham):

In [63]:
# log_model's predictions are in y_prediction2
X_test[y_test > y_prediction2] # 1 > 0

# Finding out most common spam and ham words

In [74]:
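# On scikit-learn >= 1.0, use vector.get_feature_names_out() instead;
# get_feature_names() was removed in 1.2.
X_train_words = vector.get_feature_names()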

In [65]:
len(X_train_words)

7125

nb_model.feature_count_ holds the number of times each token appears in each class:

In [66]:
nb_model.feature_count_

array([[ 0.,  0.,  1., ...,  1.,  1.,  1.],
       [ 6., 25.,  0., ...,  0.,  0.,  0.]])

Upper row indicates the number of times a word appears in a ham message.
Lower row indicates the number of times a word appears in a spam message.

In [70]:
ham_word_count = nb_model.feature_count_[0,:]
ham_word_count

array([0., 0., 1., ..., 1., 1., 1.])

In [71]:
spam_word_count = nb_model.feature_count_[1,:]
spam_word_count

array([ 6., 25.,  0., ...,  0.,  0.,  0.])

Combining these two to create a dataframe, 'tokens':

In [139]:
tokens = pd.DataFrame({'token': X_train_words, 'ham': ham_word_count, 'spam': spam_word_count})
tokens = tokens.set_index('token')
tokens.head()

token,ham,spam
00,0.0,6.0
000,0.0,25.0
000pes,1.0,0.0
008704050406,0.0,1.0
0089,0.0,1.0


In [135]:
nb_model.class_count_

array([3386.,  514.])

3386 observations in ham class, 
514 observations in spam class

In [140]:
# since there are 0's in both ham and spam classes, 
# add 10 to avoid zero division error
tokens['ham'] = tokens.ham + 10

In [141]:
tokens['spam'] = tokens.spam + 10

In [142]:
tokens.head()

token,ham,spam
00,10.0,16.0
000,10.0,35.0
000pes,11.0,10.0
008704050406,10.0,11.0
0089,10.0,11.0


Converting these counts to per-class frequencies by dividing by the number of messages in each class:

In [143]:
from __future__ import division

In [144]:
tokens['ham'] = tokens.ham/nb_model.class_count_[0]

In [145]:
tokens['spam'] = tokens.spam/nb_model.class_count_[1]

In [146]:
tokens.head()

token,ham,spam
00,0.002953,0.031128
000,0.002953,0.068093
000pes,0.003249,0.019455
008704050406,0.002953,0.021401
0089,0.002953,0.021401


In [147]:
tokens['ham_to_spam_ratio'] = tokens.ham/tokens.spam

In [148]:
tokens.head()

token,ham,spam,ham_to_spam_ratio
00,0.002953,0.031128,0.094876
000,0.002953,0.068093,0.043372
000pes,0.003249,0.019455,0.166982
008704050406,0.002953,0.021401,0.138001
0089,0.002953,0.021401,0.138001
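The table values can be reproduced by hand, which makes the effect of the added constant easier to see. The sketch below recomputes ham_to_spam_ratio for three tokens from the raw counts and class counts shown in the outputs above; the function and variable names are illustrative.

```python
# Values taken from the notebook outputs above.
ham_docs, spam_docs = 3386, 514  # nb_model.class_count_
raw_counts = {"00": (0, 6), "000": (0, 25), "000pes": (1, 0)}  # (ham, spam)
OFFSET = 10  # constant added to every count before dividing

def ham_to_spam_ratio(ham_count, spam_count):
    """Offset each count, divide by the class size, then take the ratio."""
    ham_freq = (ham_count + OFFSET) / ham_docs
    spam_freq = (spam_count + OFFSET) / spam_docs
    return ham_freq / spam_freq

for token, (ham_count, spam_count) in raw_counts.items():
    print(token, round(ham_to_spam_ratio(ham_count, spam_count), 6))
```

Note that '000pes' appears once in ham and never in spam, yet its ratio is well below 1. With an offset of 10, rare tokens end up close to the baseline ratio (10/3386) / (10/514), about 0.15, so only tokens with large counts move far from it.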


Words most indicative of ham (highest ham_to_spam_ratio):

In [152]:
# high value of ham_to_spam_ratio
tokens.sort_values('ham_to_spam_ratio', ascending = False).head(20)

token,ham,spam,ham_to_spam_ratio
my,0.164206,0.033074,4.964803
but,0.096869,0.027237,3.556493
gt,0.065859,0.019455,3.385174
lt,0.065564,0.019455,3.369994
me,0.162729,0.05642,2.884229
he,0.053751,0.019455,2.762788
ll,0.058181,0.021401,2.718628
come,0.052569,0.021401,2.456425
ok,0.063792,0.027237,2.342081
it,0.151802,0.068093,2.229314


Words most indicative of spam (lowest ham_to_spam_ratio):

In [151]:
tokens.sort_values('ham_to_spam_ratio', ascending = False).tail(20)

token,ham,spam,ham_to_spam_ratio
000,0.002953,0.068093,0.043372
ringtone,0.002953,0.070039,0.042167
16,0.003249,0.077821,0.041745
awarded,0.002953,0.07393,0.039948
cs,0.002953,0.075875,0.038923
1000,0.002953,0.077821,0.03795
50,0.003839,0.103113,0.037234
co,0.003249,0.087549,0.037107
500,0.002953,0.079767,0.037025
18,0.002953,0.087549,0.033734


# Alternate approach to look at most common spam words

Adding a new column, 'spam_to_ham_ratio':

In [153]:
tokens['spam_to_ham_ratio'] = tokens.spam/tokens.ham

In [154]:
tokens.sort_values('spam_to_ham_ratio', ascending = False).head(20)

token,ham,spam,ham_to_spam_ratio,spam_to_ham_ratio
claim,0.002953,0.171206,0.01725,57.970428
prize,0.002953,0.155642,0.018975,52.700389
www,0.003249,0.142023,0.022874,43.717368
txt,0.005611,0.229572,0.024443,40.912144
150p,0.002953,0.116732,0.0253,39.525292
tone,0.002953,0.108949,0.027107,36.890272
uk,0.003249,0.118677,0.027374,36.530952
mobile,0.005316,0.180934,0.029381,34.035668
guaranteed,0.002953,0.099222,0.029765,33.596498
nokia,0.003249,0.099222,0.032742,30.542271


# Conclusion

Most ham-indicative word (highest ham_to_spam_ratio): 'my'

Most spam-indicative word (highest spam_to_ham_ratio): 'claim'