## Project: Naive Bayes Model to Predict Spam Emails

##### by Sarthak Shukla

Importing Required Libraries

In [22]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

Importing the Dataset

In [23]:
df = pd.read_csv('spamraw.csv', engine = 'python')
df.head()

Unnamed: 0,type,text
0,ham,Hope you are having a good week. Just checking in
1,ham,K..give back my thanks.
2,ham,Am also doing in cbe only. But have to pay.
3,spam,"complimentary 4 STAR Ibiza Holiday or £10,000..."
4,spam,okmail: Dear Dave this is your final notice to...


Checking if there is Missing Data

In [24]:
df.isnull().sum()

type    0
text    0
dtype: int64

In [25]:
# see how the type class is distributed
df.type.value_counts()

ham     4812
spam     747
Name: type, dtype: int64
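
The counts above show a clear class imbalance, so it helps to know what accuracy a trivial "always predict ham" rule would reach before judging any model. The sketch below only redoes that arithmetic from the printed counts; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 4812, "spam": 747}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a rule that always predicts the majority class
baseline_accuracy = counts[majority_label] / total
spam_share = counts["spam"] / total

print(f"total messages:    {total}")
print(f"majority class:    {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"spam share:        {spam_share:.4f}")
```

On these counts the baseline comes out at about 86.6%, so accuracy alone can look strong even for a model that never flags spam.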

Converting text to numeric data

In [26]:
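# map the type class to numeric labels: ham -> 1, spam -> 0
# (ham is therefore the positive class, label 1, for scikit-learn's default metrics)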
df.type = df.type.map({'ham':1,'spam': 0})
df.head()

Unnamed: 0,type,text
0,1,Hope you are having a good week. Just checking in
1,1,K..give back my thanks.
2,1,Am also doing in cbe only. But have to pay.
3,0,"complimentary 4 STAR Ibiza Holiday or £10,000..."
4,0,okmail: Dear Dave this is your final notice to...


In [8]:
df.shape

(5559, 2)

Creating train and test data to feed into the model

In [27]:
# split data to train and test
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['type'], random_state=1)
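
The call above relies on the default `test_size` of 0.25, which is why the outputs further down show 4169 training rows and 1390 test rows. With imbalanced classes, one option is to pass `stratify` so both parts keep the same ham/spam ratio. The original notebook does not do this; the sketch below is an alternative shown on a hypothetical toy frame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame using the same 1 = ham / 0 = spam encoding
toy = pd.DataFrame({
    "text": [f"message {i}" for i in range(20)],
    "type": [1] * 16 + [0] * 4,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"],
    toy["type"],
    test_size=0.25,         # the default, written out explicitly
    stratify=toy["type"],   # keep the ham/spam ratio in both parts
    random_state=1,
)

print(len(X_tr), len(X_te))        # sizes of the two parts
print(y_tr.mean(), y_te.mean())    # share of ham in each part
```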

In [10]:
X_train

2438    Yeah like if it goes like it did with my frien...
1685    We have sent JD for Customer Service cum Accou...
5520    And stop being an old man. You get to build sn...
874     Or maybe my fat fingers just press all these b...
1495        Dude ive been seeing a lotta corvettes lately
                              ...                        
905                        What year. And how many miles.
5192              Really good:)dhanush rocks once again:)
3980    Ur cash-balance is currently 500 pounds - to m...
235                         Awesome, be there in a minute
5157    Evening * v good if somewhat event laden. Will...
Name: text, Length: 4169, dtype: object

Converting a collection of text documents to a matrix of token counts

In [28]:
# Instantiate CountVectorizer, dropping common English stop words
count_vector = CountVectorizer(stop_words = 'english')

In [12]:
count_vector

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [29]:
# Learn the vocabulary from the training data and return the document-term count matrix
training_data = count_vector.fit_transform(X_train)

In [14]:
training_data

<4169x7199 sparse matrix of type '<class 'numpy.int64'>'
	with 32128 stored elements in Compressed Sparse Row format>
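
The matrix summary above reports 4169 rows, 7199 columns and 32128 stored elements. The short calculation below, using only those printed numbers, shows how sparse that is; the variable names are illustrative.

```python
# Figures copied from the sparse-matrix summary above
n_docs, n_terms, n_stored = 4169, 7199, 32128

n_cells = n_docs * n_terms
density = n_stored / n_cells          # fraction of nonzero cells
avg_terms_per_doc = n_stored / n_docs  # distinct vocabulary terms per message

print(f"cells in a dense matrix: {n_cells}")
print(f"density:                 {density:.5f}")
print(f"avg stored terms/doc:    {avg_terms_per_doc:.2f}")
```

Only about 0.1% of the cells are nonzero, which is why scikit-learn returns a compressed sparse row matrix rather than a dense array.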

In [30]:
# Transform the test data using the vocabulary learned from the training data (no refitting)
testing_data = count_vector.transform(X_test)
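
To make the fit/transform distinction concrete, here is a minimal sketch on made-up messages. It uses a default `CountVectorizer` (no stop-word list, unlike the notebook) so the vocabulary is easy to read. Words that appear only in the test text are ignored because they were never part of the fitted vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example messages, not taken from the dataset
train_docs = ["free prize call now", "call me later", "win a free prize"]
test_docs = ["free lottery call"]

vec = CountVectorizer()
train_counts = vec.fit_transform(train_docs)  # learns the vocabulary
test_counts = vec.transform(test_docs)        # reuses it unchanged

# Vocabulary terms in column order
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

print(vocab)
print(train_counts.toarray())
print(test_counts.toarray())  # "lottery" has no column, so it is dropped
```

The single-letter word "a" does not appear in the vocabulary either, since the default token pattern only keeps tokens of two or more characters.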

Importing Naive Bayes library and training the model

In [16]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
predictions = naive_bayes.predict(testing_data)
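
The vectorize-then-classify steps can also be chained in a single scikit-learn pipeline, which makes it straightforward to score raw text. This is a sketch on a hypothetical four-message corpus, not the notebook's model; labels follow the same 1 = ham / 0 = spam encoding.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training messages (1 = ham, 0 = spam)
toy_texts = [
    "see you at lunch",
    "call me later tonight",
    "win free prize now",
    "free cash prize claim",
]
toy_labels = [1, 1, 0, 0]

# CountVectorizer feeds token counts straight into MultinomialNB
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

toy_predictions = toy_model.predict(["free prize", "see you tonight"])
print(toy_predictions)
```

With the default Laplace smoothing (`alpha=1.0`), words seen more often in one class raise that class's likelihood, so the first message is labelled spam and the second ham.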

Checking the accuracy of the model

In [17]:
# precision, recall and F1 default to pos_label=1, which is ham under the mapping above
print('Accuracy score: ', accuracy_score(y_test, predictions))
print('Precision score: ', precision_score(y_test, predictions))
print('Recall score: ', recall_score(y_test, predictions))
print('F1 score: ', f1_score(y_test, predictions))

Accuracy score:  0.9863309352517986
Precision score:  0.9875104079933389
Recall score:  0.9966386554621849
F1 score:  0.9920535340861564
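
Because ham was mapped to 1, the precision, recall and F1 values above describe how well ham is identified, since scikit-learn treats label 1 as the positive class by default. To score the spam class instead, `pos_label=0` can be passed. The sketch below shows the difference on made-up labels; the numbers are not the notebook's results.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = ham, 0 = spam
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # one spam message slips through as ham

# Rows are true classes, columns are predicted classes, ordered [spam, ham]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

ham_precision = precision_score(y_true, y_pred)               # pos_label=1
ham_recall = recall_score(y_true, y_pred)
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print(cm)
print("ham  precision/recall:", ham_precision, ham_recall)
print("spam precision/recall:", spam_precision, spam_recall)
```

In this toy case ham recall is perfect while spam recall is only 0.75, which illustrates why the spam-side metrics are worth checking separately for a spam filter.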


In [18]:
testing_data

<1390x7199 sparse matrix of type '<class 'numpy.int64'>'
	with 9393 stored elements in Compressed Sparse Row format>

In [19]:
predictions

array([1, 1, 1, ..., 1, 0, 1], dtype=int64)