# Multinomial Naive Bayes - Spam Detection Case Study

In this case study, we want to build a classifier that can predict whether a new SMS message is spam or not. 

We'll first build the Multinomial Naive Bayes algorithm from scratch and then compare the results against the Scikit-learn implementation!
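
For reference, the textbook form of the multinomial Naive Bayes decision rule with additive smoothing is written out below. The symbols are notation introduced here for illustration, not taken from the notebook: $x_w$ is the count of word $w$ in the message being classified, $N_{wc}$ is the count of $w$ across training messages of class $c$, $N_c$ is the total word count in class $c$, $|V|$ is the vocabulary size, and $\alpha$ is the smoothing parameter ($\alpha = 1$ gives Laplace smoothing).

$$
\hat{y} = \arg\max_{c} \left[ \log P(c) + \sum_{w \in V} x_w \log P(w \mid c) \right],
\qquad
P(w \mid c) = \frac{N_{wc} + \alpha}{N_c + \alpha\,|V|}
$$

Working in log space avoids numerical underflow when many small probabilities are multiplied together.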

### Dataset

SMS Spam Collection Dataset: a collection of SMS messages tagged as spam or legitimate (ham).

Source: UCI Machine Learning Repository

http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/


In [1]:
import pandas as pd
import numpy as np

In [8]:
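# Tab-separated file with two columns (label, message text).
# skiprows=1 skips the first line of the file (assumed here to be a header row).
sms = pd.read_csv("SMSSpamCollection", usecols=[0,1], skiprows=1, names=["label", "message"], sep="\t")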
sms

| | label | message |
|---|---|---|
| 0 | ham | Ok lar... Joking wif u oni... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | U dun say so early hor... U c already then say... |
| 3 | ham | Nah I don't think he goes to usf, he lives aro... |
| 4 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| ... | ... | ... |
| 5566 | spam | This is the 2nd time we have tried 2 contact u... |
| 5567 | ham | Will ü b going to esplanade fr home? |
| 5568 | ham | Pity, * was in mood for that. So...any other s... |
| 5569 | ham | The guy did some bitching but I acted like i'd... |


Start with a small subset first: the first 10 messages.

In [10]:
sms = sms.head(10)
sms

| | label | message |
|---|---|---|
| 0 | ham | Ok lar... Joking wif u oni... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | U dun say so early hor... U c already then say... |
| 3 | ham | Nah I don't think he goes to usf, he lives aro... |
| 4 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 5 | ham | Even my brother is not like to speak with me. ... |
| 6 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 7 | spam | WINNER!! As a valued network customer you have... |
| 8 | spam | Had your mobile 11 months or more? U R entitle... |
| 9 | ham | I'm gonna be home soon and i don't want to tal... |


In [11]:
X = sms['message'].to_list()
y = sms['label'].to_list()
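
Before turning to Scikit-learn, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing, using only the standard library. The function names (`tokenize`, `fit_mnb`, `predict_mnb`) and the toy messages are hypothetical and chosen for illustration; the tokenizer is a simplification (lowercase, tokens of two or more word characters, no stop-word removal), so its results need not match `CountVectorizer` exactly.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from labeled messages."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    for doc, label in zip(docs, labels):
        word_counts[label].update(tokenize(doc))

    vocab = sorted({w for counts in word_counts.values() for w in counts})

    # Class prior: fraction of training messages in each class
    log_prior = {c: math.log(n / len(labels)) for c, n in class_doc_counts.items()}

    # Smoothed likelihood: (count of w in class + alpha) / (words in class + alpha * |V|)
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict_mnb(model, doc):
    """Return the class with the highest log posterior; unseen words are ignored."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokenize(doc) if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy data, not taken from the SMS dataset
toy_docs = [
    "win a free prize now",
    "free entry win cash",
    "see you at home soon",
    "are you coming home",
]
toy_labels = ["spam", "spam", "ham", "ham"]

model = fit_mnb(toy_docs, toy_labels)
print(predict_mnb(model, "free prize"))  # spam
print(predict_mnb(model, "home soon"))   # ham
```

The same two functions could be called with `X` and `y` from the cell above in place of the toy lists.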

## Scikit-learn implementation

In [13]:
# Scikit-learn predictions

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Preprocess with CountVectorizer
cv = CountVectorizer(lowercase=True, stop_words='english')

X_train_cv = cv.fit_transform(X)

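# Initialize Multinomial NB; with the defaults, class priors are estimated
# from the training labels (fit_prior=True) and alpha=1.0 (Laplace smoothing)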
MNB = MultinomialNB()
MNB.fit(X_train_cv, y)

X_test = ["Ok Im joking",
          "You won a prize!",
          "Need a new mobile number?",
          "Bye brother",
          "You are the winner",
          "I'll be home soon"]

X_test_cv = cv.transform(X_test)
MNB.predict(X_test_cv)

array(['ham', 'spam', 'spam', 'ham', 'spam', 'ham'], dtype='<U4')
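
To see what the vectorizer step contributes, the sketch below approximates `CountVectorizer` in plain Python: lowercase the text, keep tokens of two or more word characters (the default token pattern), drop stop words, and count only words that were seen during fitting. The helper names and the tiny `STOP_WORDS` set are hypothetical stand-ins; Scikit-learn's built-in `'english'` list is much longer.

```python
import re
from collections import Counter

# Hypothetical, very small stop-word list used only for this illustration
STOP_WORDS = {"the", "you", "are", "of", "be", "will", "to", "in", "and"}


def tokenize(text):
    # Lowercase, keep tokens of 2+ word characters, then drop stop words
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def fit_vocabulary(docs):
    # Map each distinct training token to a column index (alphabetical order)
    words = sorted({t for d in docs for t in tokenize(d)})
    return {word: idx for idx, word in enumerate(words)}


def transform(docs, vocabulary):
    # Build one count vector per message; words outside the vocabulary are ignored
    rows = []
    for d in docs:
        row = [0] * len(vocabulary)
        for word, n in Counter(tokenize(d)).items():
            if word in vocabulary:
                row[vocabulary[word]] = n
        rows.append(row)
    return rows


train = ["You are the WINNER of a prize", "I will be home soon"]
vocab = fit_vocabulary(train)
print(vocab)  # {'home': 0, 'prize': 1, 'soon': 2, 'winner': 3}
print(transform(["You are the winner", "Bye brother"], vocab))
# [[0, 0, 0, 1], [0, 0, 0, 0]]
```

A test message made up entirely of unseen words becomes an all-zero vector, so the classifier's decision for it rests on the class priors alone. With only 10 training messages this is likely to happen often, which is one reason to treat the predictions above as a smoke test rather than an evaluation.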