The dataset comes from the UCI Machine Learning Repository. The SMS Spam Collection v.1 (hereafter the corpus) is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

In [1]:
#importing modules

In [2]:
import pandas as pd

In [3]:
#Loading the dataset
#The file is tab-separated, hence sep='\t'; column names are supplied through names
data = pd.read_csv("Dataset/SMSSpamCollection", sep='\t', names=['label','message'])

In [4]:
#Eyeballing the first rows to get a feel for the data
data.head()
#The first column is the label (spam or ham) and the second column is the message

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [5]:
print("The shape of dataset is {}".format(data.shape))

The shape of dataset is (5572, 2)


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
label      5572 non-null object
message    5572 non-null object
dtypes: object(2)
memory usage: 87.2+ KB


In [7]:
#Checking for null values
data.isna().sum()
#The result below shows that the data has no null values

label      0
message    0
dtype: int64

In [8]:
#checking the distribution of label
data["label"].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [9]:
#The data turns out to be imbalanced:
#only 747 messages are spam,
#while 4825 messages are ham
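
To make the imbalance concrete, the short sketch below computes the spam share and the accuracy of a trivial classifier that labels every message as ham. It uses only the two counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Label counts as reported by data["label"].value_counts() above.
label_counts = {"ham": 4825, "spam": 747}
total = sum(label_counts.values())

# Fraction of messages that are spam.
spam_share = label_counts["spam"] / total

# Accuracy of a trivial classifier that predicts "ham" for every message.
majority_baseline = label_counts["ham"] / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model trained on this data has to beat that baseline by a clear margin before its accuracy says anything useful.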

In [10]:
#Converting the messages into vector form so that an ML algorithm can work with them
#importing the required libraries for it
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [11]:
#Before the data is converted into vectors it requires pre-processing
#Data cleaning: removing non-letter characters (numbers, punctuation), converting to lower case, removing stop words and lemmatizing the remaining words.

In [12]:
corpus = []
#instantiating the class
lem = WordNetLemmatizer()


In [13]:
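#data cleaning
#building the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))
for i in range(len(data)):
    #keep letters only, then lower-case and tokenize
    review = re.sub('[^a-zA-Z]',' ',data["message"][i])
    review = review.lower()
    review = review.split()
    #drop stop words and lemmatize the remaining tokens
    review = [lem.lemmatize(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)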
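
The effect of the cleaning loop on a single message can be sketched without NLTK. The example below is a simplification under stated assumptions: `STOP_WORDS` is a tiny hypothetical list (the notebook uses NLTK's full English list), lemmatization is left out, and `clean_message` is an illustrative helper that does not exist in the notebook.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook uses NLTK's full English stop-word list.
STOP_WORDS = {"in", "a", "to", "the", "is", "you"}

def clean_message(text, stop_words=STOP_WORDS):
    """Keep letters only, lower-case, tokenize, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

cleaned = clean_message("Free entry in 2 a wkly comp to win FA Cup!")
print(cleaned)
```

Digits and punctuation are replaced by spaces before tokenizing, so "2" and "!" disappear along with the stop words.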
    

In [14]:
#using the bag-of-words technique (CountVectorizer) to convert these words into vectors; these are raw term counts, not TF-IDF weights
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
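
As a small illustration of what the vectorizer produces, the sketch below fits `CountVectorizer` on a three-message toy corpus and, for comparison, fits `TfidfVectorizer` on the same text. The toy corpus and variable names are assumptions made for this example; the notebook itself uses only `CountVectorizer` on the full corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration.
toy_corpus = ["free entry win cup", "ok lar joking", "win free prize"]

# Bag of words: each column is a vocabulary term, each cell a raw count.
count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(toy_corpus)
vocab = sorted(count_vec.vocabulary_)  # alphabetical, matching column order
print(vocab)
print(count_matrix.toarray())

# TF-IDF: same vocabulary and shape, but counts are reweighted so that
# terms appearing in many messages contribute less.
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(toy_corpus)
print(tfidf_matrix.shape)
```

Both vectorizers return sparse matrices. Calling `.toarray()` as the notebook does is fine at this corpus size, though memory use grows quickly with more messages or a larger vocabulary.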

In [15]:
#The target column (label) is categorical, so it is encoded as a single 0/1 dummy column
y = pd.get_dummies(data['label'],drop_first=True)

In [16]:
y.head()
#if 1 - message is tagged as spam 
#if 0 - message is tagged as ham (not a spam)

   spam
0     0
1     0
2     1
3     0
4     0


In [17]:
#splitting the dataset into train and test
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=123,test_size=0.2)
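
Because the labels are imbalanced, one option worth considering is a stratified split, which keeps the class ratio the same in the train and test sets. The sketch below shows the `stratify` parameter of `train_test_split` on toy data; the toy arrays are invented for illustration, and the notebook's own split above is not stratified.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 10% of them positive.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy preserves the 10% positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```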

In [18]:
#Training the model using Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
#y_train is a one-column DataFrame; flattening it to 1-D avoids scikit-learn's column-vector warning
model.fit(X_train,y_train.values.ravel())


MultinomialNB()

In [19]:
y_pred = model.predict(X_test)

In [20]:
y_pred

array([0, 0, 0, ..., 0, 0, 1], dtype=uint8)

In [21]:
#Accuracy on the train and test sets
print("The score on train data is {}".format(model.score(X_train,y_train)))
print("The score on test data is {}".format(model.score(X_test,y_test)))

The score on train data is 0.9930446488669509
The score on test data is 0.9766816143497757


In [22]:
#This is a class-imbalance case, so overall accuracy alone is not a good metric
#Instead, look at precision and recall

In [23]:
from sklearn.metrics import classification_report,accuracy_score
print("The classification report is\n",classification_report(y_test,y_pred))

The classification report is
               precision    recall  f1-score   support

           0       0.99      0.98      0.99       962
           1       0.89      0.94      0.92       153

    accuracy                           0.98      1115
   macro avg       0.94      0.96      0.95      1115
weighted avg       0.98      0.98      0.98      1115



In [24]:
#In this spam/ham case precision is the important metric, since a ham message wrongly flagged as spam is costly
#From the report above it is 0.99 for class 0 (ham) and 0.89 for class 1 (spam)
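
For reference, the sketch below shows how the spam-class precision, recall and F1 follow from confusion-matrix counts. The four counts are an assumption: they were chosen to be consistent with the rounded report and the test accuracy shown above, not read from an actual confusion matrix, which the notebook never prints.

```python
# Assumed counts for class 1 (spam), chosen to be consistent with the
# rounded classification report above; they are not an output of the notebook.
tp = 144  # spam predicted as spam
fp = 17   # ham predicted as spam
fn = 9    # spam predicted as ham
tn = 945  # ham predicted as ham

precision = tp / (tp + fp)  # of messages flagged as spam, how many are spam
recall = tp / (tp + fn)     # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1-score:  {f1:.2f}")
print(f"accuracy:  {accuracy:.4f}")
```

Running `sklearn.metrics.confusion_matrix(y_test, y_pred)` in the notebook would give the real counts.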

In [25]:
import pickle
#saving the trained classifier; the with block closes the file automatically
with open("classifier.pkl",'wb') as pickle_out:
    pickle.dump(model,pickle_out)

In [26]:
#saving the fitted vectorizer so new messages can be transformed the same way
with open("Vectorizer.pkl",'wb') as pickle_out:
    pickle.dump(cv,pickle_out)
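
The two pickled objects are meant to be loaded together later: the vectorizer turns a new message into the same feature space, and the classifier predicts on it. The sketch below shows that round trip in memory on a toy model so that it runs without the notebook's files; the toy messages, labels and variable names are all invented for illustration.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration (1 = spam, 0 = ham).
toy_messages = [
    "free prize win cash",
    "win free entry",
    "see you at lunch",
    "call me later ok",
]
toy_labels = [1, 1, 0, 0]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(toy_messages), toy_labels)

# Round trip through pickle in memory; with files, this would be
# pickle.dump(...) when saving and pickle.load(...) when loading.
vec_loaded = pickle.loads(pickle.dumps(vec))
clf_loaded = pickle.loads(pickle.dumps(clf))

# Use transform (not fit_transform) so the saved vocabulary is reused.
new_message = ["win free cash"]
prediction = clf_loaded.predict(vec_loaded.transform(new_message))
print(prediction)
```

With the real model, a new message should go through the same cleaning steps as the training corpus before it reaches the vectorizer.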

In [27]:
pip freeze > requirements.txt

Note: you may need to restart the kernel to use updated packages.
