# Building a Spam Classifier

In this project, we will train a naive Bayes model to classify SMS messages as spam or ham (not spam). Once the model is trained, we will deploy it online as a Flask-based web app.

In [1]:
# Loading data

import pandas as pd

df = pd.read_csv('../input/sms-spam-collection-dataset/spam.csv', encoding="ISO-8859-1")

In [2]:
df.isnull().sum().sum()

16648

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [4]:
df.dropna().head()

|  | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
| --- | --- | --- | --- | --- | --- |
| 281 | ham | \Wen u miss someone | the person is definitely special for u..... B... | why to miss them | just Keep-in-touch\" gdeve.." |
| 1038 | ham | Edison has rightly said, \A fool can ask more ... | GN | GE | GNT:-)" |
| 2255 | ham | I just lov this line: \Hurt me with the truth | I don't mind | i wil tolerat.bcs ur my someone..... But | Never comfort me with a lie\" gud ni8 and swe... |
| 3525 | ham | \HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE... | HAD A COOL NYTHO | TX 4 FONIN HON | CALL 2MWEN IM BK FRMCLOUD 9! J X\"" |
| 4668 | ham | When I was born, GOD said, \Oh No! Another IDI... | GOD said | \"OH No! COMPETITION\". Who knew | one day these two will become FREINDS FOREVER!" |


The three `Unnamed` columns are almost entirely empty (at most 50 non-null values out of 5,572 rows), so we will use only the columns named v1 and v2 to train the model. Let us delete the other columns.

In [5]:
df.drop([col for col in df.columns if col not in ['v1', 'v2']], axis=1, inplace=True)

Let us give the v1 and v2 columns easy-to-understand names.

In [6]:
df.columns = ['label', 'message']
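As an alternative to dropping and renaming after loading, the two columns can be selected at load time. The sketch below is only an illustration: it reads a small in-memory CSV (a made-up stand-in for `spam.csv`, and the name `sample` is hypothetical) and uses the `usecols` parameter of `pd.read_csv`.

```python
import io

import pandas as pd

# Made-up stand-in for spam.csv: same header, two rows, empty trailing columns.
csv_text = (
    "v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4\n"
    "ham,Ok lar... Joking wif u oni...,,,\n"
    "spam,WINNER!! You have won a prize,,,\n"
)

# usecols keeps only the listed columns; rename gives them readable names.
sample = pd.read_csv(io.StringIO(csv_text), usecols=["v1", "v2"]).rename(
    columns={"v1": "label", "v2": "message"}
)

print(sample.columns.tolist())
```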

Next, let us encode the labels in the label column.

In [7]:
df.label = df.label.map({'ham': 0, 'spam': 1})
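One caveat with `Series.map` and a dictionary: any value missing from the dictionary becomes `NaN` instead of raising an error. A quick check, sketched here on a made-up series (the `'Spam'` entry is a deliberate typo), would catch such values before training.

```python
import pandas as pd

# Made-up labels; "Spam" (capitalised) is not a key in the mapping.
raw_labels = pd.Series(["ham", "spam", "ham", "Spam"])
encoded = raw_labels.map({"ham": 0, "spam": 1})

# Rows whose label was not found in the mapping end up as NaN.
unmapped = raw_labels[encoded.isna()]
print(unmapped.tolist())
```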

In [8]:
df.head()

|  | label | message |
| --- | --- | --- |
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [9]:
df.label.value_counts()

0    4825
1     747
Name: label, dtype: int64

In [10]:
df[df.label==1].head()

|  | label | message |
| --- | --- | --- |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 5 | 1 | FreeMsg Hey there darling it's been 3 week's n... |
| 8 | 1 | WINNER!! As a valued network customer you have... |
| 9 | 1 | Had your mobile 11 months or more? U R entitle... |
| 11 | 1 | SIX chances to win CASH! From 100 to 20,000 po... |


The ratio of ham to spam messages in the dataset is roughly 6.5:1 (4,825 ham versus 747 spam), so the classes are imbalanced. Let us train a naive Bayes model on these messages.
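Because of this imbalance, accuracy alone can look better than it is: a model that always predicts ham is already right most of the time. The arithmetic below, using the counts printed by `value_counts()` above, makes that baseline explicit (the variable names are illustrative).

```python
# Counts taken from the value_counts() output above.
ham_count, spam_count = 4825, 747
total = ham_count + spam_count

ratio = ham_count / spam_count
# Accuracy of a trivial classifier that labels every message as ham.
baseline_accuracy = ham_count / total

print(f"ham:spam ratio = {ratio:.2f}:1")
print(f"always-ham baseline accuracy = {baseline_accuracy:.3f}")
```

Any trained model should be judged against this baseline, and against the per-class precision and recall, not against 0%.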

In [11]:
# Importing relevant modules

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
# Separating independent and dependent variables

X = df.message
y = df.label

In [13]:
# Converting the messages into a matrix of token counts

transformer = CountVectorizer().fit(X)

X_dtm = transformer.transform(X)
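To see what the document-term matrix contains, here is a small sketch on three made-up messages. With its default settings, `CountVectorizer` lowercases the text and keeps tokens of two or more alphanumeric characters; each column of the matrix then counts one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages, used only to illustrate the transformation.
toy_messages = ["Win cash now", "Call me now", "Win win prize"]

toy_vectorizer = CountVectorizer().fit(toy_messages)
toy_dtm = toy_vectorizer.transform(toy_messages)

# Columns are ordered alphabetically by vocabulary word.
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
print(toy_dtm.toarray())
```

The real matrix has the same structure, only with one row per SMS and one column per distinct token in the corpus, which is why it is stored as a sparse matrix.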

In [14]:
# Creating training and test subsets

X_train, X_test, y_train, y_test = train_test_split(X_dtm, y, test_size=0.33, random_state=42)

In [15]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3733, 8672), (1839, 8672), (3733,), (1839,))
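Note that the cells above fit the vectorizer on every message before splitting, so the vocabulary also reflects the test messages. One alternative, sketched here on made-up toy messages, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline; `stratify` additionally preserves the class ratio in both subsets. Applying this to the real data would change the feature count and possibly the scores reported in this notebook, so treat it as a variation and not as the method used here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data standing in for df.message and df.label.
toy_messages = [
    "win cash now", "free prize claim now",
    "urgent win free entry", "claim your cash prize",
    "see you at lunch", "call me later",
    "are you home yet", "ok see you then",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the ham/spam proportions in both subsets.
text_train, text_test, label_train, label_test = train_test_split(
    toy_messages, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits the vectorizer on the training text only, then the classifier.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(text_train, label_train)

predictions = pipeline.predict(text_test)
print(len(text_train), len(text_test), sorted(label_test))
```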

In [16]:
# Training and evaluating naive-bayes classifier

nb = MultinomialNB()

nb.fit(X_train, y_train)
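# Mean accuracy on the test set; stored because a notebook only displays a cell's last expression
accuracy = nb.score(X_test, y_test)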

y_pred = nb.predict(X_test)
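For intuition about what the classifier learns, here is a minimal from-scratch sketch of multinomial naive Bayes with add-one (Laplace) smoothing. It is not scikit-learn's implementation: the function names `train_nb` and `predict_nb`, the whitespace tokenizer, and the toy messages are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log likelihoods of tokens."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        # Smoothing adds alpha to every token count so unseen tokens get nonzero probability.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((token_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_lik


def predict_nb(doc, log_prior, log_lik):
    """Return the class with the highest log posterior; out-of-vocabulary tokens are ignored."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][t] for t in doc.split() if t in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Made-up training messages: 1 = spam, 0 = ham.
docs = ["win cash now", "win prize now", "call me later", "see you later"]
labels = [1, 1, 0, 0]
log_prior, log_lik = train_nb(docs, labels)

print(predict_nb("win cash", log_prior, log_lik))
print(predict_nb("call you later", log_prior, log_lik))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.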

In [17]:
# Printing classification report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1587
           1       0.93      0.92      0.92       252

    accuracy                           0.98      1839
   macro avg       0.96      0.95      0.96      1839
weighted avg       0.98      0.98      0.98      1839
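The precision, recall, and F1 columns in the report are derived from counts of true positives, false positives, and false negatives for each class. The sketch below computes them by hand for a made-up set of labels and predictions (not the notebook's actual predictions); the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example: 4 spam (1) and 4 ham (0) messages.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
print(precision, recall, f1)
```

For the spam class, precision answers "of the messages flagged as spam, how many really were spam?", while recall answers "of the real spam messages, how many were caught?".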



In [18]:
# Saving the fitted vectorizer and the trained model

import pickle

with open("tfr.pkl", "wb") as f:
    pickle.dump(transformer, f)
with open("model.pkl", "wb") as f:
    pickle.dump(nb, f)

I used the pickled vectorizer and model with Flask to build a spam-classifier app. You can try the app on [Heroku](https://isitspam.herokuapp.com/).
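The app's source is not shown in this notebook, so the following is only a minimal sketch of how the two pickled files could be served with Flask. The `create_app` factory, the `/predict` route, and the JSON field `message` are all hypothetical choices, not the deployed app's actual interface.

```python
import pickle

from flask import Flask, jsonify, request


def create_app(vectorizer_path="tfr.pkl", model_path="model.pkl"):
    """Build a Flask app around a pickled vectorizer and classifier (hypothetical layout)."""
    # Only unpickle files you created yourself; pickle can execute arbitrary code.
    with open(vectorizer_path, "rb") as f:
        vectorizer = pickle.load(f)
    with open(model_path, "rb") as f:
        model = pickle.load(f)

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True) or {}
        message = payload.get("message", "")
        # The same fitted vectorizer must be used so the columns match the training features.
        features = vectorizer.transform([message])
        label = int(model.predict(features)[0])
        return jsonify({"label": "spam" if label == 1 else "ham"})

    return app
```

A client would then send a POST request with a JSON body such as `{"message": "some text"}` and receive the predicted label back. The label mapping (1 for spam, 0 for ham) follows the encoding used earlier in this notebook.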

## Reference

https://towardsdatascience.com/develop-a-nlp-model-in-python-deploy-it-with-flask-step-by-step-744f3bdd7776