## Training a Naive Bayes model to classify future SMS messages as either spam or ham

Steps:

1.  Converted the labels `ham` and `spam` to a binary indicator variable (0/1)

2.  Converted the message text to a sparse matrix of TF-IDF vectors

3.  Trained a multinomial Naive Bayes classifier

4.  Measured performance on a held-out test set using `roc_auc_score`



In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score

In [36]:
#Reading the SMS Spam Collection dataset, which is tab-separated and has no header row

df= pd.read_csv("SMSSpamCollection",sep='\t', names=['spam', 'txt'])
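
The file layout read above (one label, a tab, then the message, with no header line) can also be parsed with the standard library alone. The following is only a sketch: the two sample rows are copied from the dataset preview, and `read_sms_rows` is a hypothetical helper name that does not appear in the notebook.

```python
import csv
import io

def read_sms_rows(handle):
    """Parse tab-separated (label, text) rows from an open text file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) == 2]

# Two rows in the same layout as SMSSpamCollection (no header line).
sample = io.StringIO(
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tFree entry in 2 a wkly comp to win FA Cup fina...\n"
)
rows = read_sms_rows(sample)
print(rows[0])  # ('ham', 'Ok lar... Joking wif u oni...')
```

To read the real file, an open file handle would be passed in place of the in-memory sample.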

In [37]:
#Getting an idea of the dataset from head

df.head()

|   | spam | txt |
|---|------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [38]:
#Converting the spam column into a binary 0/1 variable with pd.get_dummies
#(astype(int) keeps the result numeric on pandas versions that return booleans)

df['spam'] = pd.get_dummies(df.spam)['spam'].astype(int)
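
`pd.get_dummies` builds one indicator column per label, and the cell above keeps only the `spam` column. The effect on the labels can be sketched in plain Python. `encode_labels` is a hypothetical name used only for illustration.

```python
def encode_labels(labels, positive="spam"):
    """Return 1 for the positive label and 0 for everything else."""
    return [1 if label == positive else 0 for label in labels]

labels = ["ham", "ham", "spam", "ham", "ham"]  # labels of the first five rows
encoded = encode_labels(labels)
print(encoded)  # [0, 0, 1, 0, 0]
```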

In [39]:
#Checking that spam was converted from text labels to 0/1

df.head()

|   | spam | txt |
|---|------|-----|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [40]:
#Building the set of English stop words from NLTK (needs nltk.download('stopwords') once)

stopset = set(stopwords.words('english'))
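
Stop-word removal drops very common words before the text is vectorized. A minimal sketch, assuming a tiny hand-picked stop set rather than the full NLTK list (both `tiny_stopset` and `remove_stopwords` are made up for this example):

```python
def remove_stopwords(text, stopset):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stopset]

# Assumption: a tiny stand-in for stopwords.words('english').
tiny_stopset = {"i", "he", "to", "the", "a", "in", "so"}
tokens = remove_stopwords("U dun say so early hor", tiny_stopset)
print(tokens)  # ['u', 'dun', 'say', 'early', 'hor']
```

The vectorizer applies its own tokenization, so its tokens will not match this whitespace split exactly.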

In [43]:
#TF-IDF vectorizer (stop words passed as a list for compatibility with recent scikit-learn versions)

vectorizer = TfidfVectorizer(stop_words=list(stopset), use_idf=True, lowercase=True, strip_accents='ascii')

In [44]:
#Assigning spam, our dependent variable, to y

y=df.spam

In [45]:
X = vectorizer.fit_transform(df.txt)
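
To make the weighting behind `fit_transform` concrete, here is a rough plain-Python sketch of TF-IDF. It assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn applies by default. The whitespace tokenizer, the toy documents, and the name `tfidf_rows` are simplifications introduced for this example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Scale each row to unit length (guarding against empty documents).
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

docs = ["free entry win", "ok see you", "win free prize"]  # toy corpus
rows = tfidf_rows(docs)
# "entry" occurs in one document and "free" in two, so "entry" weighs more.
print(rows[0]["entry"] > rows[0]["free"])  # True
```

Rare terms receive larger weights than common ones, which is why TF-IDF tends to highlight words that distinguish one message from the rest.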

In [46]:
print(X.shape)
print(y.shape)

(5572, 8605)
(5572,)


In [47]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)
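
With no `test_size` given, `train_test_split` holds out 25% of the rows. The idea can be sketched with the standard library as below. `simple_split` is a hypothetical helper, and although it uses the same seed value it will not reproduce scikit-learn's exact partition.

```python
import math
import random

def simple_split(n_rows, test_fraction=0.25, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(5572)
print(len(train_idx), len(test_idx))  # 4179 1393
```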

In [48]:
#Fitting naive bayes model

clf = naive_bayes.MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
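
The fitted model can be pictured as class priors plus smoothed per-class word probabilities. Below is a minimal sketch of multinomial Naive Bayes with additive smoothing (`alpha=1.0`, matching the value in the output above). It uses raw word counts on a made-up four-message corpus, whereas the notebook feeds TF-IDF weights. `train_nb` and `predict_nb` are illustrative names, not library functions.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit class priors and smoothed per-class word log-probabilities."""
    vocab = {word for doc in docs for word in doc.split()}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        log_prior = math.log(n_docs / len(docs))
        log_likelihood = {
            w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the label with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(
            log_likelihood[w] for w in doc.split() if w in log_likelihood
        )
    return max(scores, key=scores.get)

# Toy corpus (an assumption for illustration): 1 = spam, 0 = ham.
docs = ["free prize win", "win free entry", "see you later", "ok see you"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)
print(predict_nb(model, "free entry"))  # 1
print(predict_nb(model, "see you"))     # 0
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.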

In [49]:
#Measuring ROC AUC on the test set from the predicted spam probabilities

roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])

0.98558587451336732
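
ROC AUC can be read as the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message, with ties counted as half. A small sketch of that pairwise definition follows. `roc_auc` is a hypothetical helper written for illustration, and the four labels and scores are toy values, not results from this notebook.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

Because the measure depends only on how scores rank the messages, it is computed from `predict_proba` rather than from hard 0/1 predictions.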