# Machine Learning – Spam Classifier

## Using Naive Bayes Classification

I specifically wanted to learn more about the Naive Bayes Classifier. 

Recall that a classifier is a machine learning model that discriminates between different objects based on certain features. A Naive Bayes classifier is a probabilistic machine learning model used for classification, and it is based on Bayes' theorem.

Using Bayes' theorem, we can find the probability of A happening given that B has occurred. Here B is the evidence and A is the hypothesis. The classifier assumes that the predictors/features are independent of one another given the class; that is, the presence of one particular feature does not affect any other. This assumption is what makes it **naive**.
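
Written out, the theorem and the naive assumption look roughly like this. The symbols $y$ (the class, spam or ham) and $x_1, \dots, x_n$ (the features, such as word counts) are notation I'm introducing here for illustration:

$$
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
$$

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The second line is where the independence assumption comes in: the joint likelihood of all the features is replaced by a product of per-feature likelihoods.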

I'll be using the Naive Bayes Classifier to determine whether an email corresponds to spam or not. 

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('spam.csv')
df.head()

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Let's look at the two categories in this data, **ham** and **spam**:

In [3]:
df.groupby('Category').describe()

| Category | Message: count | Message: unique | Message: top | Message: freq |
|---|---|---|---|---|
| ham | 4825 | 4516 | Sorry, I'll call later | 30 |
| spam | 747 | 641 | Please call our customer service representativ... | 4 |
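
The counts show that the two classes are far from balanced, which is worth keeping in mind when reading accuracy numbers later. As a quick sanity check, here is the arithmetic using the counts from the table above (the variable names are mine, for illustration only):

```python
# Counts taken from the groupby/describe output above
ham_count, spam_count = 4825, 747

total = ham_count + spam_count
spam_share = spam_count / total

# Accuracy of a trivial model that labels every message as ham
majority_baseline = ham_count / total

print(f"{total} messages, {spam_share:.1%} spam, "
      f"always-ham baseline accuracy {majority_baseline:.1%}")
```

Any classifier worth using should comfortably beat that always-ham baseline.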


#### Make a new attribute, "spam":

In [4]:
# Encode the label as a number: 1 if the message is spam, 0 otherwise
df['spam'] = (df['Category'] == 'spam').astype(int)
df.head()

| | Category | Message | spam |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


### To commit this to memory, let's review what training and testing data are:

X_train contains the feature values (the independent variables) used for training.

y_train contains the target output (the dependent variable) corresponding to each row of X_train.

After training, the model generates predictions, which should be very close or identical to the y_train values if the model is successful.

X_test contains the feature values held back to test the model after training.

y_test contains the target output corresponding to X_test. After training, the model's predictions for X_test are compared against y_test to determine how successful the model is.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.Message,df.spam)
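
Called like this, `train_test_split` holds out 25% of the rows by default and shuffles them differently on every run. Since spam is the minority class, one option is to fix the seed and stratify on the label so both splits keep the same spam ratio. Here is a minimal sketch on made-up data (`toy_messages` and `toy_labels` are hypothetical stand-ins for `df.Message` and `df.spam`):

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 ham (0) and 4 spam (1) messages
toy_messages = [f"ham message {i}" for i in range(8)] + \
               [f"spam message {i}" for i in range(4)]
toy_labels = [0] * 8 + [1] * 4

# random_state makes the shuffle repeatable;
# stratify keeps the ham/spam ratio the same in both splits
Xtr, Xte, ytr, yte = train_test_split(
    toy_messages, toy_labels,
    test_size=0.25, random_state=0, stratify=toy_labels,
)

print(len(Xtr), len(Xte))  # sizes of the training and test splits
print(sum(ytr), sum(yte))  # number of spam messages in each split
```

I did not use these options in the cell above, so the scores later in this notebook will vary slightly from run to run.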

The Message column is still text, though, so we need to convert it to numbers. In this scheme, features and samples are defined as follows:

Each individual token occurrence frequency (normalized or not) is treated as a feature. The vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
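
To make the "ignoring position" part concrete, here is a small pure-Python sketch of a bag of words. It assumes naive lowercase whitespace tokenization, which is simpler than what `CountVectorizer` actually does, and `bag_of_words` is a helper I made up for illustration:

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Count how often each vocabulary term occurs in text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

docs = ["dog bites man", "man bites dog", "dog bites dog"]

# One column per distinct token in the corpus, in alphabetical order
vocab = sorted({tok for d in docs for tok in d.lower().split()})

# One row per document
vectors = [bag_of_words(d, vocab) for d in docs]

print(vocab)
for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

The first two sentences mean opposite things but get identical vectors, because only the word counts survive.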

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)
X_train_count.toarray()[:2]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
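
The rows above are almost entirely zeros because each message uses only a handful of the words in the whole vocabulary, which is why `fit_transform` returns a sparse matrix. A tiny sketch on two made-up messages shows what the columns mean (`toy_docs`, `toy_v` and `toy_counts` are hypothetical names, separate from the notebook's `v`):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-message corpus
toy_docs = ["free entry free prize", "call me later"]

toy_v = CountVectorizer()
toy_counts = toy_v.fit_transform(toy_docs)  # SciPy sparse matrix

# (number of documents, number of distinct tokens)
print(toy_counts.shape)

# vocabulary_ maps each token to its column index
columns = sorted(toy_v.vocabulary_, key=toy_v.vocabulary_.get)
print(columns)

# Dense view: one row per document, one column per token
print(toy_counts.toarray())
```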

The Multinomial Naive Bayes algorithm is a Bayesian learning approach popular in Natural Language Processing (NLP). It predicts the tag of a text, such as an email or a newspaper story, using Bayes' theorem: it calculates each tag's probability for a given sample and outputs the tag with the highest probability.
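
To see what "calculates each tag's probability" means in practice, here is a from-scratch sketch of the idea in pure Python. It is a simplified illustration, not scikit-learn's implementation: the training data is made up, tokenization is plain whitespace splitting, and `log_posterior` and `predict` are hypothetical helper names. It uses add-one (Laplace) smoothing so that a word never seen in a class does not zero out the whole product.

```python
import math
from collections import Counter

# Hypothetical toy training data: (tokens, label), 1 = spam, 0 = ham
train = [
    ("free prize free entry".split(), 1),
    ("win free prize".split(), 1),
    ("call me later".split(), 0),
    ("see you later".split(), 0),
]

vocab = {tok for toks, _ in train for tok in toks}
classes = sorted({label for _, label in train})

# Class priors P(c) and per-class token counts
prior = {c: sum(1 for _, y in train if y == c) / len(train) for c in classes}
counts = {c: Counter() for c in classes}
for toks, y in train:
    counts[y].update(toks)

def log_posterior(tokens, c, alpha=1.0):
    """Unnormalized log P(c | tokens) with add-alpha smoothing."""
    total = sum(counts[c].values())
    score = math.log(prior[c])
    for tok in tokens:
        if tok not in vocab:
            continue  # skip words never seen during training
        score += math.log((counts[c][tok] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(text):
    """Return the class with the highest log posterior."""
    tokens = text.lower().split()
    return max(classes, key=lambda c: log_posterior(tokens, c))

print(predict("free prize"), predict("call you later"))
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow on long messages.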

In [7]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_count,y_train)

MultinomialNB()

Now I'm inputting **three different emails** to see if the spam classifier is working: 

In [8]:
emails = [
    'Hi Srishti, want to get pedicures together?',
    'Up to 20% discount on parking, exclusive offer just for you. Dont miss this reward!',
    'Srishti, did you get my earlier email?'
]
# Use the Count Vectorizer to get these into numeric format:
emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1, 0])
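
`predict` only returns the winning label. To see how confident the classifier is, `predict_proba` returns one probability per class. Here is a minimal sketch on a made-up corpus (`toy_v`, `toy_model`, `texts` and `labels` are hypothetical names, not the notebook's `v` and `model`):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for spam.csv (1 = spam, 0 = ham)
texts = ["free prize entry", "win a free prize now",
         "call me later", "see you at lunch"]
labels = [1, 1, 0, 0]

toy_v = CountVectorizer()
toy_model = MultinomialNB().fit(toy_v.fit_transform(texts), labels)

# Each row is [P(ham), P(spam)], following the order of toy_model.classes_
new_messages = ["free prize today", "call you later"]
proba = toy_model.predict_proba(toy_v.transform(new_messages))

for msg, row in zip(new_messages, proba):
    print(f"{msg!r}: P(ham)={row[0]:.2f}, P(spam)={row[1]:.2f}")
```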

Good. This worked on some sample emails! That's really cool. Now let's check the cross-validation score:

In [9]:
from sklearn.model_selection import cross_val_score

model_scores = cross_val_score(model, X_train_count, y_train, cv=10)
model_scores.mean()

0.981095888839168

In [10]:
X_test_count = v.transform(X_test)
model.score(X_test_count, y_test)

0.9856424982053122

Now let's make a **pipeline** instead to automate these preprocessing steps:

In [11]:
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb',MultinomialNB())
])

In [12]:
clf.fit(X_train,y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])

In [13]:
clf.score(X_test,y_test)

0.9856424982053122

About 98.6% accuracy on the test set! Let's also check that the pipeline makes the same predictions for the earlier emails:

In [14]:
clf.predict(emails)

array([0, 1, 0])

Great!!
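
One more thing a pipeline makes possible: cross-validating on the raw text. In cell [9] the matrix had already been vectorized on the whole training set, so every fold's vocabulary included words from the messages it was scored on. Passing the pipeline to `cross_val_score` instead re-fits the vectorizer inside each fold. Here is a rough sketch on a made-up corpus (`toy_texts`, `toy_labels` and `toy_clf` are hypothetical names; with the real data this would be `clf`, `X_train` and `y_train`):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 spam (1) and 6 ham (0) messages
toy_texts = [
    "free prize entry", "win a free prize now", "claim your free reward",
    "exclusive offer win cash", "free cash prize offer", "win reward now",
    "call me later", "see you at lunch", "are you home yet",
    "did you get my email", "lunch later today", "call you after work",
]
toy_labels = [1] * 6 + [0] * 6

toy_clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# The vectorizer is fit on the training part of each fold only
scores = cross_val_score(toy_clf, toy_texts, toy_labels, cv=3)
print(scores, scores.mean())
```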