Description: This program detects whether an email is spam (1) or not spam (0).

### Import the libraries

In [1]:
#Import libraries
import numpy as np 
import pandas as pd 
import nltk
from nltk.corpus import stopwords
import string

### Load the data and print the first 5 rows.

In [2]:
#Load the data
df = pd.read_csv('emails.csv', encoding='latin-1')
df.head(5)

Unnamed: 0,text,spam
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0


Let’s explore the data and get the number of rows & columns.

In [3]:
#Print the shape (Get the number of rows and cols)
df.shape

(5572, 2)

Get the column names in the data set.

In [4]:
#Get the column names
df.columns

Index(['text', 'spam'], dtype='object')

Check for duplicates and remove them.

In [5]:
#Checking for duplicates and removing them
df.drop_duplicates(inplace = True)

In [6]:
#Show the new shape (number of rows & columns)
df.shape

(5169, 2)

Show the number of missing data for each column.

In [7]:
#Show the number of missing (NAN, NaN, na) data for each column
df.isnull().sum()

text    0
spam    0
dtype: int64

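Download the stop words. In natural language processing, stop words are very common words (such as "the", "is" and "and") that carry little information on their own and are usually filtered out before modelling.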

In [8]:
#Need to download stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\רועי\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Create a function to clean the text and return the tokens. The cleaning is done in two steps: first remove the punctuation, then remove the stop words.

In [9]:
#Tokenization (a list of tokens), will be used as the analyzer
#1.Punctuation characters are !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
#2.Stop words are very common words that carry little meaning on their own
def process_text(text):
    '''
    What will be covered:
    1. Remove punctuation
    2. Remove stopwords
    3. Return list of clean text words
    '''
    
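    #1 Remove punctuation characters and rejoin the remaining characters
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    
    #2 Build the stop-word set once instead of reloading the list for every word
    stop_words = set(stopwords.words('english'))
    clean_words = [word for word in nopunc.split() if word.lower() not in stop_words]
    
    #3 Return the list of clean words
    return clean_words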

Splitting text into tokens is known as tokenization. Show the tokens for the first 5 rows of text in our data set by applying the function process_text.

In [10]:
#Show the Tokenization (a list of tokens )
df['text'].head().apply(process_text)

0    [Go, jurong, point, crazy, Available, bugis, n...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3        [U, dun, say, early, hor, U, c, already, say]
4    [Nah, dont, think, goes, usf, lives, around, t...
Name: text, dtype: object
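
As a rough illustration of the two cleaning steps, the sketch below applies them to a single made-up sentence. It assumes a tiny hand-picked stop-word set (`DEMO_STOP_WORDS`, a hypothetical stand-in for NLTK's much longer English list) so that it runs without downloading any corpus. `demo_process_text` is likewise an illustrative name and not part of the notebook.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
DEMO_STOP_WORDS = {"i", "he", "to", "the", "a", "in", "is", "and"}

def demo_process_text(text, stop_words=DEMO_STOP_WORDS):
    """Remove punctuation, then drop stop words (case-insensitive)."""
    # Step 1: keep every character that is not ASCII punctuation
    no_punc = "".join(ch for ch in text if ch not in string.punctuation)
    # Step 2: split on whitespace and filter out stop words
    return [word for word in no_punc.split() if word.lower() not in stop_words]

tokens = demo_process_text("Free entry in a weekly comp, to win the FA Cup!")
print(tokens)  # ['Free', 'entry', 'weekly', 'comp', 'win', 'FA', 'Cup']
```

Note that the original casing of each kept word is preserved; only the stop-word comparison is done in lower case.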

Convert the text into a matrix of token counts (a bag-of-words representation), with one row per message and one column per distinct token.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
messages_bow = CountVectorizer(analyzer=process_text).fit_transform(df['text'])
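
To make the idea of a token-count matrix concrete, here is a minimal pure-Python sketch of a bag-of-words representation. It is not how CountVectorizer is implemented (which, among other things, returns a sparse matrix). The helper `bag_of_words` and the toy messages are assumptions made for illustration only.

```python
from collections import Counter

def bag_of_words(tokenized_docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in document i."""
    # One column per distinct token, in sorted order
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # missing tokens count as 0
        matrix.append([counts[tok] for tok in vocabulary])
    return vocabulary, matrix

toy_docs = [["free", "entry", "free"], ["ok", "entry"]]
vocabulary, matrix = bag_of_words(toy_docs)
print(vocabulary)  # ['entry', 'free', 'ok']
print(matrix)      # [[1, 2, 0], [1, 0, 1]]
```

Each row describes one message and each column one token, so the matrix has as many columns as there are distinct tokens in the whole data set.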

Split the data into training and testing sets. The testing set (30% of the rows) is held out so that we can later make predictions on it and check whether they match the actual values.

The testing feature (independent) data set will be stored in X_test and the testing target (dependent) data set will be stored in y_test.

The training feature (independent) data set will be stored in X_train and the training target (dependent) data set will be stored in y_train.

In [12]:
#Split data into 70% training & 30% testing data sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(messages_bow, df['spam'],
                                                    test_size = 0.30, random_state = 0)
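
As a quick sanity check on the 70/30 split, the expected set sizes can be worked out by hand. The sketch below assumes that the test count is rounded up and the remaining rows go to training, which is consistent with the support totals (3618 and 1551) in the evaluation reports further down.

```python
import math

n_rows = 5169      # rows left after drop_duplicates (see df.shape above)
test_size = 0.30   # same fraction as passed to train_test_split

# Assumption: the test count is rounded up and the remainder is used for training
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)  # 3618 1551
```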

Get the shape of the token-count matrix (rows are messages, columns are distinct tokens).

In [13]:
#Get the shape of messages_bow
messages_bow.shape

(5169, 11304)

Create and train the Multinomial Naive Bayes classifier, which is suitable for classification with discrete features (e.g., word counts for text classification).

In [14]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
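
For intuition, the following is a minimal pure-Python sketch of the idea behind a multinomial Naive Bayes classifier: class priors combined with per-token likelihoods that use additive (Laplace) smoothing, where `alpha=1.0` mirrors the value shown in the output above. The functions `train_nb` and `predict_nb` and the toy messages are hypothetical. scikit-learn's implementation works on the count matrix and differs in detail.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a tiny multinomial Naive Bayes model on tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    # log P(class), estimated from class frequencies
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    # log P(token | class) with additive smoothing so unseen tokens never get probability 0
    log_likelihood = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_likelihood

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are ignored
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in doc if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)

toy_docs = [["free", "prize", "win"], ["win", "cash", "free"],
            ["see", "you", "tonight"], ["call", "you", "later"]]
toy_labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam
model = train_nb(toy_docs, toy_labels)
print(predict_nb(model, ["free", "cash"]))         # 1
print(predict_nb(model, ["see", "you", "later"]))  # 0
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow for long messages.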

Print the classifier's predictions and the actual values on the training data set.

In [15]:
#Print the predictions
print(classifier.predict(X_train))
#Print the actual values
print(y_train.values)

[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]


See how well the model performed on the training data by evaluating the Naive Bayes classifier and showing the classification report, confusion matrix and accuracy score.

In [16]:
#Evaluate the model on the training data set
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
pred = classifier.predict(X_train)
print(classification_report(y_train ,pred ))
print('Confusion Matrix: \n',confusion_matrix(y_train,pred))
print()
print('Accuracy: ', accuracy_score(y_train,pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3174
           1       0.99      0.98      0.98       444

    accuracy                           1.00      3618
   macro avg       0.99      0.99      0.99      3618
weighted avg       1.00      1.00      1.00      3618

Confusion Matrix: 
 [[3168    6]
 [   9  435]]

Accuracy:  0.9958540630182421


The classifier is about 99.59% accurate on the training data, which it has already seen. Let’s test the classifier on the test data set (X_test & y_test) by printing the predicted values and the actual values to see whether it can accurately classify email text it has not seen before.

In [17]:
#Print the predictions
print('Predicted value: ',classifier.predict(X_test))
#Print Actual Label
print('Actual value: ',y_test.values)

Predicted value:  [0 0 0 ... 0 0 1]
Actual value:  [0 0 0 ... 0 0 0]


Evaluate the model on the test data set.

In [18]:
#Evaluate the model on the test data set
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
pred = classifier.predict(X_test)
print(classification_report(y_test ,pred ))
print('Confusion Matrix: \n', confusion_matrix(y_test,pred))
print()
print('Accuracy: ', accuracy_score(y_test,pred))

              precision    recall  f1-score   support

           0       0.99      0.96      0.97      1342
           1       0.79      0.92      0.85       209

    accuracy                           0.96      1551
   macro avg       0.89      0.94      0.91      1551
weighted avg       0.96      0.96      0.96      1551

Confusion Matrix: 
 [[1291   51]
 [  17  192]]

Accuracy:  0.9561573178594455


The classifier is about 95.6% accurate on the test data, somewhat lower than the 99.6% it reached on the training data it had already seen. Accuracy alone hides the weaker precision for the spam class (0.79): 51 of the 243 messages flagged as spam were actually not spam.
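
The figures in the test report can be recomputed by hand from the test confusion matrix, which helps when reading the report. This is a minimal sketch that assumes the usual scikit-learn layout (rows are actual classes, columns are predicted classes). The variable names are chosen here for illustration.

```python
# Test-set confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 1291, 51   # actual not spam: correctly kept, wrongly flagged as spam
fn, tp = 17, 192    # actual spam: missed, correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_spam = tp / (tp + fp)   # share of predicted spam that really is spam
recall_spam = tp / (tp + fn)      # share of actual spam that was caught

print(round(accuracy, 4), round(precision_spam, 2), round(recall_spam, 2))
# 0.9562 0.79 0.92
```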