# SMS Spam Classifier
The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as either ham (legitimate) or spam.

Can you use this dataset to build a prediction model that will accurately classify which texts are spam?


Kaggle link : https://www.kaggle.com/uciml/sms-spam-collection-dataset/home

In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn import model_selection
from sklearn import linear_model
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

In [3]:
dataset = pd.read_csv('spam.csv', encoding='latin-1')

In [4]:
dataset.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [5]:
# 2nd part of sms
dataset['Unnamed: 2'].value_counts().head() 

 bt not his girlfrnd... G o o d n i g h t . . .@"       3
 but dont try to prove it..\" .Gud noon...."            2
 don't miss ur best life for anything... Gud nyt..."    2
GN                                                      2
 PO Box 5249                                            2
Name: Unnamed: 2, dtype: int64

In [6]:
#3rd part of sms
dataset['Unnamed: 3'].value_counts().head() 

GE                                    2
 MK17 92H. 450Ppw 16"                 2
 \"OH No! COMPETITION\". Who knew     1
IåÕL CALL U\""                        1
whoever is the KING\"!... Gud nyt"    1
Name: Unnamed: 3, dtype: int64

In [7]:
# 4th part of sms
dataset['Unnamed: 4'].value_counts().head()  

GNT:-)"                                                     2
 one day these two will become FREINDS FOREVER!"            1
 CALL 2MWEN IM BK FRMCLOUD 9! J X\""                        1
 just Keep-in-touch\" gdeve.."                              1
 Never comfort me with a lie\" gud ni8 and sweet dreams"    1
Name: Unnamed: 4, dtype: int64

In [8]:
dataset.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


Concatenating the Message Parts

In [9]:
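# Append the overflow columns to the message text. Caveat: astype(str) turns missing values (NaN) into the literal string 'nan'.
dataset['v2'] = dataset['v2'].astype(str) + ' ' + dataset['Unnamed: 2'].astype(str) + ' ' + dataset['Unnamed: 3'].astype(str) + ' ' + dataset['Unnamed: 4'].astype(str)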

In [10]:
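# Caveat: this cell repeats the same concatenation, so the overflow text (and the 'nan' placeholders) end up in v2 twice.
dataset['v2'] = dataset['v2'].astype(str) + " " + dataset["Unnamed: 2"].astype(str) + " " + dataset["Unnamed: 3"].astype(str) + " " + dataset["Unnamed: 4"].astype(str)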

In [11]:
dataset['v2'][281]

'\\Wen u miss someone  the person is definitely special for u..... But if the person is so special  why to miss them  just Keep-in-touch\\" gdeve.."  the person is definitely special for u..... But if the person is so special  why to miss them  just Keep-in-touch\\" gdeve.."'
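The row above shows the overflow fragment appearing twice, and other rows pick up literal `nan` tokens. One way to avoid both effects is to join only the non-missing fragments, once. The following is a minimal sketch on a hypothetical two-row frame (the `toy` data and the `join_parts` helper are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of spam.csv: message text plus overflow columns.
toy = pd.DataFrame({
    "v2": ["Ok lar", "just Keep-in-touch"],
    "Unnamed: 2": [np.nan, " gdeve.."],
    "Unnamed: 3": [np.nan, np.nan],
})

parts = ["v2", "Unnamed: 2", "Unnamed: 3"]


def join_parts(row):
    """Join the non-missing fragments of one row with single spaces."""
    return " ".join(str(p).strip() for p in row if pd.notna(p))


# Apply the helper row by row; missing fragments are skipped entirely.
toy["sms"] = toy[parts].apply(join_parts, axis=1)
print(toy["sms"].tolist())
```

Skipping missing values before converting to `str` keeps placeholder tokens out of the vocabulary that the vectorizer learns later.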

In [12]:
dataset.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni... nan nan nan nan ...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [13]:
dataset=dataset.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'])

In [14]:
dataset.columns = ["class","sms"]

In [15]:
dataset["class"].value_counts()

ham     4825
spam     747
Name: class, dtype: int64
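The counts above show a clear class imbalance, which matters when reading accuracy figures later. As a quick sanity check, this sketch computes the accuracy of a trivial classifier that always predicts the majority class, using only the counts printed above:

```python
# Class counts as reported by value_counts() above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = counts[majority_class] / total
spam_share = counts["spam"] / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"spam share: {spam_share:.3f}")
```

Any model has to beat this baseline of roughly 87% accuracy to be useful, so per-class precision and recall are more informative than accuracy alone.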

In [16]:
dataset.head()

Unnamed: 0,class,sms
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni... nan nan nan nan ...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Create Validation Dataset

In [17]:
# Extract the text and labels as NumPy arrays (train_test_split also accepts pandas objects directly)
X=dataset['sms'].values
Y=dataset['class'].values

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.20, random_state=100)
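Because spam is the minority class, a plain random split can leave the train and test sets with slightly different spam proportions. A possible alternative, not used in this notebook, is to pass `stratify` to `train_test_split`. The sketch below uses hypothetical toy data (`texts`, `labels`) to show the effect:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20% spam, 80% ham.
texts = [f"message {i}" for i in range(100)]
labels = ["spam"] * 20 + ["ham"] * 80

# stratify=labels keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.20, random_state=100, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```

With stratification, both splits keep the 20% spam share of the toy data.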

CountVectorizer

In [18]:
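# Unigrams and bigrams, English stop words removed, terms kept only if they occur in at least 5 training messages
vect = CountVectorizer(ngram_range=(1, 2), stop_words=stopwords.words('english'), min_df=5)
vect.fit(X_train)  # learn the vocabulary from the training set only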
X_trainCount=vect.transform(X_train)
X_trainCount.toarray()

X_testCount=vect.transform(X_test)
X_testCount.toarray() 

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
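To make the vectorization step more concrete, here is a small sketch on a hypothetical toy corpus (`toy_train` and `toy_test` are illustrative, and stop-word removal and `min_df` are left out to keep the vocabulary easy to follow). It shows that the vocabulary is learned from the training texts only, and that unseen test words are simply ignored:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the SMS messages.
toy_train = ["free entry win cash", "call me later", "win free cash now"]
toy_test = ["free cash call", "unseen words only"]

# Count unigrams and bigrams, as in the notebook's ngram setting.
toy_vect = CountVectorizer(ngram_range=(1, 2))
train_counts = toy_vect.fit_transform(toy_train)  # learn vocabulary + count
test_counts = toy_vect.transform(toy_test)        # reuse the same vocabulary

# vocabulary_ maps each term to its column index.
vocab = sorted(toy_vect.vocabulary_)
print(len(vocab), vocab)
print(test_counts.toarray())
```

The second test message contains no known terms, so its row is all zeros. This is also why the test set must be transformed with the vectorizer fitted on the training set rather than refitted.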

In [19]:
#vect.get_feature_names()

In [20]:
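# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out() instead
pd.DataFrame(X_trainCount.toarray(),columns=vect.get_feature_names()).head()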

Unnamed: 0,00,000,000 bonus,000 cash,02,03,03 2nd,04,04 nan,06,...,yo,yoga,yr,yup,ì_,ì_ going,ì_ nan,ì_ wan,ìï,û_
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
model_LR = linear_model.LogisticRegression()
model_LR.fit(X_trainCount,Y_train)

# for Train dataset
trainResult_LR=model_LR.predict(X_trainCount)


# for test dataset
testResult_LR=model_LR.predict(X_testCount)

from sklearn.metrics import accuracy_score,confusion_matrix,classification_report


# Train

print(accuracy_score(Y_train, trainResult_LR))
print(confusion_matrix(Y_train, trainResult_LR))
print(classification_report(Y_train, trainResult_LR))
print('--------------------------------------------------------------')

# Test
print(accuracy_score(Y_test, testResult_LR))
print(confusion_matrix(Y_test, testResult_LR))
print(classification_report(Y_test, testResult_LR))




0.9934933811981154
[[3855    0]
 [  29  573]]
              precision    recall  f1-score   support

         ham       0.99      1.00      1.00      3855
        spam       1.00      0.95      0.98       602

   micro avg       0.99      0.99      0.99      4457
   macro avg       1.00      0.98      0.99      4457
weighted avg       0.99      0.99      0.99      4457

--------------------------------------------------------------
0.9874439461883409
[[968   2]
 [ 12 133]]
              precision    recall  f1-score   support

         ham       0.99      1.00      0.99       970
        spam       0.99      0.92      0.95       145

   micro avg       0.99      0.99      0.99      1115
   macro avg       0.99      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115
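The test-set figures in the report can be checked by hand from the confusion matrix. The sketch below recomputes accuracy and the spam-class precision, recall and F1 from the test matrix printed above, reading rows as actual classes and columns as predicted classes in the order (ham, spam):

```python
# Test-set confusion matrix from the output above.
# Rows: actual (ham, spam); columns: predicted (ham, spam).
cm = [[968, 2],
      [12, 133]]

tn, fp = cm[0]  # ham kept as ham, ham flagged as spam
fn, tp = cm[1]  # spam missed as ham, spam caught as spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of messages flagged spam, how many were spam
recall = tp / (tp + fn)     # of actual spam, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1:        {f1:.2f}")
```

The spam recall of about 0.92 means 12 of the 145 spam messages in the test set slipped through as ham, while only 2 legitimate messages were flagged as spam.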



In [22]:
# Collect the test messages with their actual and predicted labels, one named column each
testDF = pd.DataFrame({'Sms': X_test, 'Actual': Y_test, 'Predicted': testResult_LR})

Wrongly Classified Messages

In [23]:
pd.set_option('display.max_colwidth',0)
testDF[testDF['Actual']!=testDF['Predicted']]

Unnamed: 0,Sms,Actual,Predicted
79,Guess who am I?This is the first time I created a web page WWW.ASJESUS.COM read all I wrote. I'm waiting for your opinions. I want to be your friend 1/1 nan nan nan nan nan nan,spam,ham
89,ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE MINS. INDIA CUST SERVs SED YES. L8ER GOT MEGA BILL. 3 DONT GIV A SHIT. BAILIFF DUE IN DAYS. I O å£250 3 WANT å£800 nan nan nan nan nan nan,spam,ham
118,"Hi this is Amy, we will be sending you a free phone number in a couple of days, which will give you an access to all the adult parties... nan nan nan nan nan nan",spam,ham
151,Can you call me plz. Your number shows out of coveragd area. I have urgnt call in vasai &amp; have to reach before 4'o clock so call me plz nan nan nan nan nan nan,ham,spam
291,Wot u up 2? Thout u were gonna call me!! Txt bak luv K nan nan nan nan nan nan,ham,spam
359,Urgent Ur å£500 guaranteed award is still unclaimed! Call 09066368327 NOW closingdate04/09/02 claimcode M39M51 å£1.50pmmorefrommobile2Bremoved-MobyPOBox734LS27YF nan nan nan nan nan nan,spam,ham
526,"Claim a 200 shopping spree, just call 08717895698 now! Have you won! MobStoreQuiz10ppm nan nan nan nan nan nan",spam,ham
564,"Do you realize that in about 40 years, we'll have thousands of old ladies running around with tattoos? nan nan nan nan nan nan",spam,ham
587,How come it takes so little time for a child who is afraid of the dark to become a teenager who wants to stay out all night? nan nan nan nan nan nan,spam,ham
614,FreeMsg:Feelin kinda lnly hope u like 2 keep me company! Jst got a cam moby wanna c my pic?Txt or reply DATE to 82242 Msg150p 2rcv Hlp 08712317606 stop to 82242 nan nan nan nan nan nan,spam,ham
