In [1]:
import pandas as pd
df = pd.read_csv('messages.csv')
df.head()

,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
df.groupby('Category').describe()

,Message,Message,Message,Message
Category,count,unique,top,freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


Convert Category to numbers for the algorithm to work on.

In [3]:
df['Spam'] = df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head()

,Category,Message,Spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.Spam, test_size=0.25)

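Convert the messages to numbers as well, so the algorithm can work on them.\
This is called the Bag of Words approach: each message becomes a vector of word counts over the vocabulary seen in training.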

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

X_train_bow = cv.fit_transform(X_train)

X_train_bow.toarray()[:5]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(5, 7443))

We now build the model using Naive Bayes.\
The Multinomial variant of NB is better suited here because the features are discrete word counts.

In [6]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_bow, y_train)
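As a rough illustration of what MultinomialNB learns from count features, the sketch below fits it on a tiny made-up count matrix and compares its per-class word probabilities with a hand-computed smoothed estimate, (count + alpha) / (total + alpha * vocabulary size). The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts: 3 messages x 3 vocabulary words
toy_X = np.array([[2, 1, 0],
                  [1, 0, 1],
                  [0, 2, 3]])
toy_y = np.array([1, 1, 0])  # 1 = spam, 0 = ham

alpha = 1.0  # Laplace smoothing, the scikit-learn default
toy_nb = MultinomialNB(alpha=alpha).fit(toy_X, toy_y)

# Hand-computed word probabilities for the spam class
spam_counts = toy_X[toy_y == 1].sum(axis=0)  # [3, 1, 1]
manual = (spam_counts + alpha) / (spam_counts.sum() + alpha * toy_X.shape[1])

# Probabilities learned by the model (row 1 corresponds to class 1)
learned = np.exp(toy_nb.feature_log_prob_[1])

print(manual)   # [0.5  0.25 0.25]
print(np.allclose(manual, learned))  # True
```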

In [7]:
X_test_bow = cv.transform(X_test) # convert the test data to BOW

In [8]:
model.score(X_test_bow, y_test)

0.9877961234745154

We can now test on some other sample messages.

In [9]:
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!',
    'You are invited to a property exhibition. Free entry!'
]
emails_bow = cv.transform(emails)
model.predict(emails_bow)

array([0, 1, 1])

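Here we used CountVectorizer. There are other text vectorizers, such as TF-IDF (TfidfVectorizer in scikit-learn).\
We had to call the vectorizer separately for the training data, the test data, and the new messages. We can simplify the code using a Pipeline, which chains the vectorizer and the model.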

In [10]:
from sklearn.pipeline import Pipeline
pl = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [11]:
pl.fit(X_train, y_train)  # pass the raw text: the Pipeline vectorizes it and then fits the model

In [12]:
pl.score(X_test, y_test)

0.9877961234745154

In [13]:
pl.predict(emails)

array([0, 1, 1])
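The TF-IDF weighting mentioned earlier as an alternative to CountVectorizer can be swapped into the same Pipeline structure. The sketch below is only an illustration on four made-up messages (the data and names are assumptions, not results from the dataset above); whether TF-IDF improves on plain counts for the real data would need to be checked.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up messages and labels (1 = spam, 0 = ham), for illustration only
toy_texts = [
    "win a free prize now",
    "free entry win cash",
    "see you at lunch",
    "call me when you are home",
]
toy_labels = [1, 1, 0, 0]

# Same two-step structure as before, with TF-IDF weights instead of raw counts
tfidf_pl = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('nb', MultinomialNB())
])
tfidf_pl.fit(toy_texts, toy_labels)

toy_predictions = tfidf_pl.predict(["free cash prize", "see you at home"])
print(toy_predictions)  # [1 0]
```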