# Naive Bayes Spam Email Classifier

## with User Test Input 

Import dependencies

In [16]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

np.set_printoptions(suppress=True)

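Use this `message` variable to create a test message of your choosing. Note that if the message consists entirely of words the classifier has never seen before, it has no word evidence to work with: the prediction then rests on little more than the class prior and smoothing, so the classifier will be close to on the fence about whether or not it is spam.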

In [17]:
message = "Meet hot singles now"
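
The note about unseen words can be made concrete with a small sketch. It assumes add-one (Laplace) smoothing with `alpha=1.0`, which is the `MultinomialNB` default; the helper `spam_posterior_unseen` and all the counts passed to it are hypothetical and chosen only for illustration.

```python
def spam_posterior_unseen(n_words, prior_spam, total_spam, total_ham,
                          vocab_size, alpha=1.0):
    """P(spam) for a message made only of words with zero training count.

    With smoothing, each unseen word has likelihood
    alpha / (total_words_in_class + alpha * vocab_size) in each class.
    """
    like_spam = (alpha / (total_spam + alpha * vocab_size)) ** n_words
    like_ham = (alpha / (total_ham + alpha * vocab_size)) ** n_words
    joint_spam = prior_spam * like_spam
    joint_ham = (1 - prior_spam) * like_ham
    return joint_spam / (joint_spam + joint_ham)

# Hypothetical counts: equal priors and equal class sizes give exactly 0.5
balanced = spam_posterior_unseen(3, prior_spam=0.5, total_spam=40,
                                 total_ham=40, vocab_size=60)

# Unequal priors or class sizes tilt the result away from 0.5
skewed = spam_posterior_unseen(3, prior_spam=0.6, total_spam=40,
                               total_ham=50, vocab_size=60)
print(balanced, skewed)
```

So "on the fence" is exact only when both classes are equally likely a priori and contain the same number of words; otherwise the prior and the class sizes decide the outcome.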

Read the training data into a DataFrame and append the test message above so that its words are included in the vectorization. The label `1` assigned to it is only a placeholder; we will omit the record from training afterwards.

In [18]:
# read training data, add input to DataFrame
df = pd.read_csv('https://bit.ly/3zQBV5y')

df.loc[len(df.index)] = [message, 1] # add record

df

|   | msg | spam_ind |
|---|---|---|
| 0 | Hey there! I thought you might find this inter... | 1 |
| 1 | Get viagra for a discount as much as 90% | 1 |
| 2 | Viagra prescription for less | 1 |
| 3 | Even better than Viagra, try this new prescrip... | 1 |
| 4 | My name is Natasha, I want to meet you | 1 |
| 5 | Meet the hottest singles on the #1 dating site | 1 |
| 6 | Hey, I left my phone at home. Email me if you ... | 0 |
| 7 | Please see attachment for notes on today's mee... | 0 |
| 8 | An item on your Amazon wish list received a di... | 0 |
| 9 | Your prescription drug order is ready | 0 |


Vectorize the training data and the test input together, counting each word occurrence in every email. The resulting `X_all` matrix has one row per message (including the test input) and one column per word.

In [19]:
# vectorize training data along with user input
cv = CountVectorizer()
X_all = cv.fit_transform(df['msg'])

# Print count vectorizer as a table 
pd.DataFrame(X_all.toarray(), columns = cv.get_feature_names_out())

Unnamed: 0,90,account,afternoon,amazon,an,anything,as,at,attachment,be,...,this,thought,to,today,try,viagra,want,wish,you,your
0,0,0,0,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,1,0
1,1,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,1,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,1,0,0,1,0,1,0,1,...,0,0,0,0,0,0,0,0,1,0
7,0,0,0,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,1
8,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


Extract the training data `X_train` and `Y_train`, omitting the test record we appended earlier. That last row is extracted separately as the test input `X_test`.

In [20]:
# extract the vectorized training data
X_train = X_all[:-1, :]
Y_train = df['spam_ind'].iloc[:-1]

# extract out the test input
X_test = X_all[-1:, :]
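
An alternative to appending the message and slicing it off again is to fit the vectorizer on the training messages only and then call `transform()` on the new message. This is a swapped-in technique, not what this notebook does, and the names `train_msgs`, `new_msg`, `cv_alt`, `X_train_alt`, and `X_new` are hypothetical. Words that were never seen in training are dropped from the new message instead of becoming zero-count columns, so the vocabulary size (and therefore the smoothed probabilities) can differ slightly from the approach above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for df['msg'] without the appended test record
train_msgs = ["Viagra prescription for less",
              "Your prescription drug order is ready"]
new_msg = "Cheap viagra prescription today"

cv_alt = CountVectorizer()
X_train_alt = cv_alt.fit_transform(train_msgs)  # vocabulary learned from training only
X_new = cv_alt.transform([new_msg])             # same columns; unknown words are ignored

# "cheap" and "today" are not in the vocabulary, so only two words are counted
print(X_new.shape, X_new.sum())
```

Because the vectorizer never sees the test message during fitting, the same `cv_alt` can be reused for any number of new messages without refitting.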

Fit the `MultinomialNB` model to the training data and predict the probability that the test email is spam. `predict_proba()` returns two values per message: the probability of not being spam (class 0) and the probability of being spam (class 1). We want the second value, so we extract it.

In [21]:
# Create multinomial Naive Bayes and train model
model = MultinomialNB().fit(X_train, Y_train)

# Test the user input for spam
probability_of_spam = model.predict_proba(X_test).flatten()[1]
# format the 0-1 probability as a percentage
print("Spam probability: {0:.2%}".format(probability_of_spam))

Spam probability: 89.42%
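
For readers curious about what `predict_proba()` computes, the sketch below reproduces it by hand from the fitted model's `class_log_prior_` and `feature_log_prob_` attributes. The toy matrix `X_toy`, the labels `y_toy`, and the input `x_new` are hypothetical values chosen only for illustration; they are not the notebook's data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: 4 messages x 3 words, two spam (1) and two ham (0)
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 2],
                  [0, 0, 1]])
y_toy = np.array([1, 1, 0, 0])
x_new = np.array([[1, 0, 1]])

nb = MultinomialNB().fit(X_toy, y_toy)

# Joint log-likelihood per class: log P(c) + sum over words of count * log P(word | c)
joint = nb.class_log_prior_ + x_new @ nb.feature_log_prob_.T

# Normalize across classes (subtract the max first for numerical stability)
joint -= joint.max(axis=1, keepdims=True)
manual = np.exp(joint) / np.exp(joint).sum(axis=1, keepdims=True)

# Look up the spam column by label instead of assuming it is at index 1
spam_col = list(nb.classes_).index(1)
print(manual[0, spam_col], nb.predict_proba(x_new)[0, spam_col])
```

The columns of `predict_proba()` follow the order of `nb.classes_`, which is why index 1 corresponds to spam when the labels are 0 and 1.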
