# Spam SMS Detection

## Overview


The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research.
It contains one set of 5,577 SMS messages in English, each tagged as either ham (legitimate) or spam.

The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

The dataset is taken from Kaggle.

## Approach

- Loading Data

- Input and Output Data

- Applying Regular Expression

- Each word to lower case

- Splitting words to Tokenize

- Stemming with PorterStemmer handling Stop Words

- Preparing Messages with Remaining Tokens

- Preparing WordVector Corpus

- Applying Classification
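
The first few steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's exact code: the stop-word set is a tiny hand-picked assumption (not the NLTK or scikit-learn list), the function name `tokenize_sms` is hypothetical, and stemming is left out so the sketch has no dependencies.

```python
import re

# Assumption: a tiny illustrative stop-word set, not the NLTK or scikit-learn list
STOP_WORDS = {"a", "an", "the", "to", "in", "is", "and", "of"}

def tokenize_sms(text):
    """Keep letters only, lower-case, split on whitespace, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)     # regular expression step
    lowered = letters_only.lower()                     # lower-case step
    tokens = lowered.split()                           # tokenization step
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal

tokens = tokenize_sms("Free entry in 2 a wkly comp to win FA Cup")
print(tokens)
```

The remaining tokens would then be stemmed and joined back into a string before vectorization.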

## SMS Spam Classification Steps

- Data Preparation
- Exploratory Data Analysis (EDA)
- Text Pre-processing and TF-IDF
- Model Building with Classification Algorithm

In [33]:
!pip install wordcloud

Collecting wordcloud
  Obtaining dependency information for wordcloud from https://files.pythonhosted.org/packages/f5/b0/247159f61c5d5d6647171bef84430b7efad4db504f0229674024f3a4f7f2/wordcloud-1.9.3-cp311-cp311-win_amd64.whl.metadata
  Downloading wordcloud-1.9.3-cp311-cp311-win_amd64.whl.metadata (3.5 kB)
Downloading wordcloud-1.9.3-cp311-cp311-win_amd64.whl (300 kB)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.9.3


In [34]:


# Importing libraries

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

from wordcloud import WordCloud
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC

## 1. Data Preparation

In [36]:
df = pd.read_csv(r"spam.csv",encoding='latin-1')

In [37]:
df.head()

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [38]:
df.tail()

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
5571,ham,Rofl. Its true to its name,,,
5572,ham,How are you doing? Hope you've settled in for ...,,,
5573,spam,Give me your account number,,,
5574,spam,REMINDER FROM O2: To get 2.50 pounds free call...,,,
5575,ham,Hahaha....you are so funny,,,


In [39]:
#Dropping unnecessary columns
df = df[df.columns.drop(list(df.filter(regex='Unnamed')))]
df.head()

,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [40]:
# Message length in characters, computed without an explicit loop
df['Count'] = df['v2'].str.len()

In [41]:
df.head()

,v1,v2,Count
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61


In [42]:
df.tail()

,v1,v2,Count
5571,ham,Rofl. Its true to its name,26
5572,ham,How are you doing? Hope you've settled in for ...,92
5573,spam,Give me your account number,27
5574,spam,REMINDER FROM O2: To get 2.50 pounds free call...,147
5575,ham,Hahaha....you are so funny,26


In [43]:
# Total ham(0) and spam(1) messages
df['v1'].value_counts()

v1
ham     4827
spam     749
Name: count, dtype: int64
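
The counts above show a clear class imbalance. As a quick worked calculation using only the two numbers printed by `value_counts()`, the snippet below derives the spam share and the accuracy a trivial "always ham" classifier would reach, which is a useful reference point when reading accuracy figures later.

```python
# Counts taken from the value_counts() output above
ham, spam = 4827, 749
total = ham + spam

spam_share = spam / total
majority_baseline = ham / total  # accuracy of always predicting "ham"

print(f"spam share: {spam_share:.2%}")
print(f"majority-class baseline accuracy: {majority_baseline:.2%}")
```

Any model should therefore be judged against roughly 86.6% accuracy rather than against 50%.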

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5576 entries, 0 to 5575
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v1      5576 non-null   object
 1   v2      5576 non-null   object
 2   Count   5576 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 130.8+ KB


In [45]:
corpus = []
ps = PorterStemmer()

In [46]:
# Original Messages

print (df['v2'][2])
print (df['v2'][3])

Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
U dun say so early hor... U c already then say...


## 2. Exploratory Data Analysis

In [47]:
df.groupby('v1').describe()

      Count
       count        mean        std   min    25%    50%    75%    max
v1
ham   4827.0   70.985705  57.990057   2.0   33.0   52.0   92.0  910.0
spam   749.0  138.285714  29.302152  13.0  132.0  149.0  157.0  223.0


In [48]:
#Replacing column names
df.rename(columns={'v1':'label','v2':'sms'},inplace=True)

In [49]:
df.label.value_counts()

label
ham     4827
spam     749
Name: count, dtype: int64

In [50]:
df.sms.value_counts()

sms
Sorry, I'll call later                                                                                                                                         30
I cant pick the phone right now. Pls send a message                                                                                                            12
Ok...                                                                                                                                                          10
Ok                                                                                                                                                              4
Your opinion about me? 1. Over 2. Jada 3. Kusruthi 4. Lovable 5. Silent 6. Spl character 7. Not matured 8. Stylish 9. Simple Pls reply..                        4
                                                                                                                                                               ..
I gotta collect da car a

In [51]:
df.groupby('sms').describe()

sms,count,mean,std,min,25%,50%,75%,max
&lt;#&gt; in mca. But not conform.,1.0,36.0,,36.0,36.0,36.0,36.0,36.0
&lt;#&gt; mins but i had to stop somewhere first.,1.0,51.0,,51.0,51.0,51.0,51.0,51.0
&lt;DECIMAL&gt; m but its not a common car here so its better to buy from china or asia. Or if i find it less expensive. I.ll holla,1.0,132.0,,132.0,132.0,132.0,132.0,132.0
and picking them up from various points,1.0,41.0,,41.0,41.0,41.0,41.0,41.0
"came to look at the flat, seems ok, in his 50s? * Is away alot wiv work. Got woman coming at 6.30 too.",1.0,103.0,,103.0,103.0,103.0,103.0,103.0
...,...,...,...,...,...,...,...,...
yay! finally lol. i missed our cinema trip last week :-(,1.0,56.0,,56.0,56.0,56.0,56.0,56.0
yeah sure thing mate haunt got all my stuff sorted but im going sound anyway promoting hex for .by the way who is this? dont know number. Joke,1.0,142.0,,142.0,142.0,142.0,142.0,142.0
"yeah, that's what I was thinking",1.0,32.0,,32.0,32.0,32.0,32.0,32.0
yes baby! I need to stretch open your pussy!,1.0,44.0,,44.0,44.0,44.0,44.0,44.0


In [52]:
df.groupby('label').describe()

       Count
        count        mean        std   min    25%    50%    75%    max
label
ham    4827.0   70.985705  57.990057   2.0   33.0   52.0   92.0  910.0
spam    749.0  138.285714  29.302152  13.0  132.0  149.0  157.0  223.0


### Inference
We can see the most frequent messages in ham and spam. "Please call our customer service rep" appears to be the most common spam message.

In [55]:
# 'v2' was renamed to 'sms' above, so group by the new column name to avoid a KeyError
df.groupby('sms').describe()

In [None]:

df['sms length'] = df['sms'].apply(len)
df.head()

In [None]:

# Plotting length of sms text for spam sms
plt.hist(df[df['label']=='spam']['sms length'],color='blue',bins=50)
plt.title('Spam Message Length',fontsize=20)
plt.xlabel('Message Length')
plt.ylabel('Count')
plt.show()

In [None]:
# Plotting length of sms text for ham sms
plt.hist(df[df['label']=='ham']['sms length'],color='yellow',bins=50,range=(0,300))
plt.title('Ham Message Length',fontsize=20)
plt.xlabel('Message Length')
plt.ylabel('Count')
plt.show()

### Inference
Spam messages tend to be longer: in the `describe()` output above, the mean length is about 138 characters for spam versus about 71 for ham.

In [None]:
spam_words = ' '.join(list(df[df['label'] == 'spam']['sms']))
spam_wc = WordCloud(width=520,height=520).generate(spam_words)
plt.figure(figsize=(16,9))
plt.imshow(spam_wc)
plt.axis("off")
plt.title("Spam Words Word Cloud",fontsize=20)
plt.show()

### Inference
We can see that FREE, Please Call, Now, Win, Text and Call are very common words in spam sms.

In [None]:
ham_words = ' '.join(list(df[df['label'] == 'ham']['sms']))
ham_wc = WordCloud(width=520,height=520).generate(ham_words)
plt.figure(figsize=(16,9))
plt.imshow(ham_wc)
plt.axis("off")
plt.title("Ham Words Word Cloud",fontsize=20)
plt.show()

### Inference
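We can see that the most common words in ham sms include will, know, gt, OK, Love and now. Note that gt is most likely a leftover of the HTML entity `&gt;` that appears in some messages, rather than an abbreviation of "got".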
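
A word cloud gives a visual impression only; the same information can be read off as numbers with `collections.Counter`. The sketch below is illustrative: the three messages are made up and stand in for the real `sms` column, and the variable names are assumptions.

```python
from collections import Counter

# Assumption: a few made-up messages standing in for df[df['label'] == 'spam']['sms']
messages = [
    "FREE entry call now",
    "Call now to win a FREE prize",
    "Please call our customer service",
]

# Count lower-cased words across all messages
word_counts = Counter(
    word.lower() for message in messages for word in message.split()
)
top_words = word_counts.most_common(3)
print(top_words)
```

On the real data, the same counting could be run separately on the spam and ham subsets to compare their most frequent words.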

## 3. Text Preprocessing

In [None]:
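# Processing text - remove punctuation, drop very short tokens and apply stemming
# (English stop words are removed later by TfidfVectorizer via stop_words='english')
import string
ps = PorterStemmer()

def process_sms(sms):
    ''' Remove punctuation, drop tokens of 2 characters or fewer, stem the rest and return the sms as one space-joined string'''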
    sms = sms.translate(str.maketrans('','',string.punctuation)) #remove punctuations
    sms = sms.split()
    sms = [ps.stem(word) for word in sms if len(word) > 2]
    sms = ' '.join(sms)
    return sms
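
To make the effect of the first two steps of `process_sms` concrete, here is a small dependency-free sketch that mirrors the punctuation removal and the short-token filter but skips stemming. The helper name `strip_punctuation_and_short_tokens` and its `min_len` parameter are hypothetical and only used for illustration.

```python
import string

def strip_punctuation_and_short_tokens(sms, min_len=3):
    """Mirror the first steps of process_sms, without stemming."""
    # Delete every character listed in string.punctuation
    no_punct = sms.translate(str.maketrans('', '', string.punctuation))
    # Keep only tokens with at least min_len characters (same as len(word) > 2)
    return [word for word in no_punct.split() if len(word) >= min_len]

cleaned = strip_punctuation_and_short_tokens("Ok lar... Joking wif u oni...")
merged = strip_punctuation_and_short_tokens("T&C's apply")
print(cleaned)
print(merged)
```

Note that deleting punctuation, rather than replacing it with a space, fuses the pieces of a token such as `T&C's` into `TCs`.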

In [None]:
df['sms'] = df.sms.apply(process_sms) #took about 2 mins to execute

In [None]:
# convert label to a numerical variable
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head()

### Using TF-IDF


TF-IDF stands for term frequency-inverse document frequency; the TF-IDF weight is often used in information retrieval and text mining. TF means term frequency: it measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one, so the term frequency is often divided by the document length as a way of normalization.

TF = (Number of times term w appears in a document) / (Total number of terms in the document)

The second part, IDF, stands for inverse document frequency. It measures how important a term is. When computing TF, all terms are treated as equally important, so IDF weighs down terms that occur in many documents and scales up rare ones.

IDF = log_e(Total number of documents / Number of documents with term w in it)
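
The two formulas above can be checked by hand on a toy corpus. The sketch below implements them literally in plain Python; the three tokenized "documents" are made up for illustration, and the function names are assumptions.

```python
import math

# Assumption: three tiny made-up "documents", already tokenized
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "entry", "free", "prize"],
]

def tf(term, doc):
    # Term count divided by document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of (number of documents / documents containing the term);
    # the term must occur in at least one document
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_free = tf_idf("free", docs[2], docs)    # frequent here, but also in another document
score_entry = tf_idf("entry", docs[2], docs)  # rarer across the corpus
print(round(score_free, 4), round(score_entry, 4))
```

scikit-learn's `TfidfVectorizer` will not reproduce these numbers exactly: it uses raw term counts for TF, with `smooth_idf=True` it computes IDF as `ln((1 + n) / (1 + df)) + 1`, and by default it L2-normalizes each document vector.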


In [None]:
tfidf = TfidfVectorizer(encoding='latin-1',stop_words='english',analyzer='word',lowercase=True,smooth_idf=True)

In [None]:
#Splitting into train test set
X_train,X_test,y_train,y_test = train_test_split(df['sms'],df['label'],test_size = 0.30, random_state =7)
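
As a quick sanity check on the split, the expected set sizes can be worked out from the row count reported by `df.info()`. The sketch assumes that the test share is rounded up for a fractional `test_size`, which is how scikit-learn handles it to the best of my knowledge.

```python
import math

n_samples = 5576      # rows reported by df.info() above
test_size = 0.30

# Assumption: the test share is rounded up and the training set takes the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

Because only about 13% of the messages are spam, passing `stratify=df['label']` to `train_test_split` is an option worth considering to keep the class proportions equal in both sets.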

In [None]:
features_train = tfidf.fit_transform(X_train)
features_test = tfidf.transform(X_test)
print(type(features_train))
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(features_train.todense(),columns=tfidf.get_feature_names_out())

## 4. Model Building

### Naive Bayes

Naive Bayes generally works well on text data.

Multinomial Naive Bayes models the likelihood from the count (or weight) of each word/token, whereas Gaussian Naive Bayes assumes continuous, normally distributed features; hence I use the Multinomial Naive Bayes model.
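
To show what "likelihood from word counts" means in practice, here is a from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It is a simplified illustration on made-up data, not scikit-learn's implementation; all names are assumptions, and `alpha=1.0` is chosen to match the default smoothing of `MultinomialNB`.

```python
import math
from collections import Counter

# Assumption: made-up tokenized training messages, label 1 = spam, 0 = ham
train = [
    (["free", "prize", "call"], 1),
    (["win", "free", "entry"], 1),
    (["call", "me", "later"], 0),
    (["see", "you", "later"], 0),
]

vocab = {word for tokens, _ in train for word in tokens}
word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
for tokens, label in train:
    word_counts[label].update(tokens)
    class_counts[label] += 1

def log_posterior(tokens, label, alpha=1.0):
    """Unnormalized log P(label | tokens) with Laplace smoothing."""
    log_prob = math.log(class_counts[label] / sum(class_counts.values()))
    total_words = sum(word_counts[label].values())
    for word in tokens:
        if word not in vocab:
            continue  # ignore words never seen in training
        smoothed = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
        log_prob += math.log(smoothed)
    return log_prob

def predict(tokens):
    # Pick the label with the larger log posterior
    return max((0, 1), key=lambda label: log_posterior(tokens, label))

pred_spam = predict(["free", "entry"])
pred_ham = predict(["see", "me", "later"])
print(pred_spam, pred_ham)
```

With TF-IDF features the "counts" become fractional weights, which the same formula handles without change.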

In [None]:
model = MultinomialNB()
model.fit(features_train,y_train)

In [None]:

prediction = model.predict(features_test)
print(accuracy_score(y_test,prediction))
print(classification_report(y_test,prediction))

In [None]:
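from sklearn import metrics 
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Use the predicted spam probability rather than hard 0/1 labels,
# so the ROC curve is traced over all thresholds
nb_scores = model.predict_proba(features_test)[:, 1]
nb_roc_auc = roc_auc_score(y_test,nb_scores)
fpr, tpr, thresholds = roc_curve(y_test,nb_scores)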
plt.figure()
plt.plot(fpr, tpr, label='Naive Bayes(area = %0.2f)' % nb_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
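
A ROC curve is normally traced from continuous scores such as `predict_proba` output; hard 0/1 predictions give only a single operating point and usually a lower area. The dependency-free sketch below illustrates the difference on made-up numbers. The helper `roc_auc` is a hypothetical pairwise implementation, not scikit-learn's code.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Assumption: made-up labels and spam probabilities for six messages
y_true = [0, 0, 0, 1, 1, 1]
probabilities = [0.10, 0.40, 0.60, 0.45, 0.80, 0.90]
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_scores = roc_auc(y_true, probabilities)
auc_from_labels = roc_auc(y_true, hard_labels)
print(round(auc_from_scores, 3), round(auc_from_labels, 3))
```

In this toy case the area drops from about 0.889 to about 0.667 once the scores are thresholded, because all ranking information within each predicted class is lost.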

### K-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(features_train,y_train)
prediction_knn = model.predict(features_test)
print(accuracy_score(y_test,prediction_knn))
print(classification_report(y_test,prediction_knn))

In [None]:
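# Use predicted spam probabilities so the curve covers all thresholds
knn_scores = model.predict_proba(features_test)[:, 1]
knn_roc_auc = roc_auc_score(y_test,knn_scores)
fpr, tpr, thresholds = roc_curve(y_test,knn_scores)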
plt.figure()
plt.plot(fpr, tpr, label='KNN (area = %0.2f)' % knn_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()

### Decision Tree

In [None]:
model = DecisionTreeClassifier(random_state=50)
model.fit(features_train,y_train)

In [None]:
# Predicting
y_pred_dt = model.predict(features_test)

In [None]:
# Evaluating
cm = confusion_matrix(y_test, y_pred_dt)

print(cm)
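
The figures in a classification report can be derived directly from a confusion matrix. The worked example below uses a made-up 2x2 matrix (not the notebook's result) laid out the way scikit-learn returns it, with rows as true classes and columns as predicted classes in the order ham, spam.

```python
# Assumption: a made-up confusion matrix, rows = true class, columns = predicted class,
# class order [ham, spam]; these are not results from the notebook
example_cm = [[90, 10],
              [5, 45]]

tn, fp = example_cm[0]
fn, tp = example_cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted spam that is really spam
recall = tp / (tp + fn)      # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

For spam filtering, precision on the spam class matters a great deal, since each false positive is a legitimate message wrongly flagged.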

In [None]:
print ("Accuracy : %0.5f \n\n" % accuracy_score(y_test, y_pred_dt))
print (classification_report(y_test, y_pred_dt))

## Final Result based on Accuracy

* Decision Tree: 95.39%
* KNN classifier: 90.43%
* Multinomial Naive Bayes: 95.09%