# Loading the Data
The SMS Spam Collection v.1 is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English messages, each tagged as either ham (legitimate) or spam: 4,827 legitimate messages (86.6%) and 747 spam messages (13.4%).

If we open the dataset, we will see that each line has the format [label] [tab] [message]:

spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's

ham	    U dun say so early hor... U c already then say...
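
As a rough illustration of this layout, the sketch below splits two sample lines on the first tab character. The first message is shortened, and `parse_line` is a hypothetical helper introduced here for illustration only; the notebook itself loads the file with pandas.

```python
# Minimal sketch: split "[label]<TAB>[message]" lines by hand.
# parse_line is a hypothetical helper, introduced only for illustration.

def parse_line(line):
    """Return (label, message) from one tab-separated line."""
    # Split on the first tab only, in case the message itself contains tabs.
    label, message = line.rstrip("\n").split("\t", 1)
    return label, message.strip()

sample_lines = [
    "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005.",
    "ham\tU dun say so early hor... U c already then say...",
]

parsed = [parse_line(line) for line in sample_lines]
for label, message in parsed:
    print(label, "->", message)
```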



You can collect the data from here: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

In [84]:
!ls

Ex-01-Naive-bayes.ipynb  SMSSpamCollection


In [85]:
# read data into dataframe 

import pandas as pd

df = pd.read_table('SMSSpamCollection',
                   sep='\t', 
                   header=None,
                   names=['label', 'message'])

  


In [86]:
df.head()

   label                                            message
0    ham  Go until jurong point, crazy.. Available only ...
1    ham                      Ok lar... Joking wif u oni...
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...
3    ham  U dun say so early hor... U c already then say...
4    ham  Nah I don't think he goes to usf, he lives aro...


In [116]:
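# Note: without parentheses, df.describe is not called, so the notebook prints the
# bound method's repr, which includes the DataFrame itself. This cell was run after
# the pre-processing steps below (execution count 116), so the labels are already 0/1
# and the messages are already lower-cased, stripped of punctuation and stemmed.
# Use df.describe() to get summary statistics instead.
df.describe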

<bound method NDFrame.describe of       label                                            message
0         0  go until jurong point crazi avail onli in bugi...
1         0                              ok lar joke wif u oni
2         1  free entri in 2 a wkli comp to win fa cup fina...
3         0        u dun say so earli hor u c alreadi then say
4         0  nah i dont think he goe to usf he live around ...
5         1  freemsg hey there darl it been 3 week now and ...
6         0  even my brother is not like to speak with me t...
7         0  as per your request mell mell oru minnaminungi...
8         1  winner as a valu network custom you have been ...
9         1  had your mobil 11 month or more u r entitl to ...
10        0  im gon na be home soon and i dont want to talk...
11        1  six chanc to win cash from 100 to 20000 pound ...
12        1  urgent you have won a 1 week free membership i...
13        0  ive been search for the right word to thank yo...
14        0          

# Pre-processing
Once we have our data ready, it is time to do some pre-processing. We will focus on removing variation that is irrelevant to the task at hand, such as letter case, punctuation and word inflections.


In [87]:
# First, we have to convert the labels from strings to binary values for our classifier:
df['label'] = df.label.map({'ham': 0, 'spam': 1})


In [88]:
# Second, convert all characters in the message to lower case:

df['message'] = df['message'].str.lower()


In [89]:
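# Third, remove any punctuation.
# regex=True is passed explicitly because recent pandas versions treat the
# pattern as a literal string by default; the raw string avoids escape warnings.

df['message'] = df['message'].str.replace(r'[^\w\s]', '', regex=True)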
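
To see what this pattern does to a single message, here is a small sketch using only Python's standard `re` module. The sample text is the lower-cased ham message from the dataset preview; `PUNCTUATION` is a name chosen here for illustration.

```python
import re

# Same pattern as in the cell above: drop every character that is neither
# a word character (\w) nor whitespace (\s).
PUNCTUATION = re.compile(r"[^\w\s]")

text = "u dun say so early hor... u c already then say..."
cleaned = PUNCTUATION.sub("", text)
print(cleaned)

# Apostrophes are removed as well, so contractions are fused into one token.
fused = PUNCTUATION.sub("", "nah i don't think he goes to usf")
print(fused)
```

Note that removing apostrophes turns "don't" into "dont", which is why that token appears in the processed messages.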


In [93]:
# Fourth, tokenize the messages into single words using nltk.
# First, we have to import nltk and download the tokenizer.
# Calling nltk.download() with no arguments opens an installation window.
# Go to the "Models" tab and select "punkt" from the "Identifier" column,
# then click "Download" to install the necessary files.
# (Alternatively, nltk.download('punkt') fetches the same package without the window.)
# After that, we can apply the tokenization in the next cell.

import nltk
nltk.download()


showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [95]:
df['message'] = df['message'].apply(nltk.word_tokenize)


Fifth, we will perform some word stemming.

The idea of stemming is to normalize our text so that all variations of a word carry the same meaning, regardless of tense. One of the most popular stemming algorithms is the Porter Stemmer:

In [96]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
 
df['message'] = df['message'].apply(lambda x: [stemmer.stem(y) for y in x])

Finally, we will transform the data into word occurrence counts,
which will be the features that we feed into our model:

In [97]:
from sklearn.feature_extraction.text import CountVectorizer

# This converts the list of words into space-separated strings
df['message'] = df['message'].apply(lambda x: ' '.join(x))

count_vect = CountVectorizer()
counts = count_vect.fit_transform(df['message'])
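
To make the idea of occurrence counts concrete, the following is a minimal standard-library sketch of a bag-of-words matrix built on an invented toy corpus. It is a simplification, not a reimplementation of `CountVectorizer`: for instance, scikit-learn's default token pattern ignores single-character tokens such as "u", and it returns a sparse matrix rather than nested lists.

```python
from collections import Counter

# Toy corpus of already pre-processed messages (invented for illustration).
docs = [
    "free entri to win cash",
    "ok lar joke wif u oni",
    "win free cash now",
]

# Vocabulary: every distinct token, sorted so each word gets a stable column.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per message, one column per vocabulary word, each cell a count.
count_rows = []
for doc in docs:
    word_counts = Counter(doc.split())
    count_rows.append([word_counts[word] for word in vocabulary])

print(vocabulary)
for row in count_rows:
    print(row)
```

Each row has as many columns as there are distinct words in the corpus, which is why the real count matrix is very wide and mostly zeros.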

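We could leave the features as simple word counts per message,
but it is better to use Term Frequency-Inverse Document Frequency, better known as tf-idf, which down-weights words that appear in many messages. First, let us inspect the count matrix: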

In [98]:
print(type(counts))

<class 'scipy.sparse.csr.csr_matrix'>


In [101]:
counts.shape

(5572, 8169)

In [102]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer().fit(counts)

counts = transformer.transform(counts)
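
For intuition about what the transformer computes, here is a hand-rolled sketch on an invented 3x3 count matrix. It assumes scikit-learn's documented defaults for `TfidfTransformer` (smoothed idf, then L2 normalization of each row); it is meant as an illustration rather than a drop-in replacement.

```python
import math

# Toy count matrix (invented): 3 messages x 3 vocabulary words.
vocabulary = ["cash", "free", "ok"]
count_rows = [
    [1, 1, 0],
    [0, 0, 1],
    [2, 1, 0],
]

n_docs = len(count_rows)

# Document frequency: in how many messages does each word appear?
doc_freq = [
    sum(1 for row in count_rows if row[j] > 0) for j in range(len(vocabulary))
]

# Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + freq)) + 1 for freq in doc_freq]

tfidf_rows = []
for row in count_rows:
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    # L2-normalize each row so long and short messages are comparable.
    tfidf_rows.append([value / norm if norm else 0.0 for value in weighted])

for row in tfidf_rows:
    print([round(value, 3) for value in row])
```

In this toy example "ok" appears in only one message, so it receives a larger idf weight than "cash" or "free", which appear in two.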

# Training the Model
Now that we have performed feature extraction from our data, it is time to build our model. We will start by splitting our data into training and test sets:



In [103]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, df['label'], test_size=0.2, random_state=69)

Then, all that we have to do is initialize the Naive Bayes classifier and fit it to the training data.
For text classification problems, the Multinomial Naive Bayes classifier is well suited:



In [104]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB().fit(X_train, y_train)
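
To show roughly what the classifier does internally, below is a simplified multinomial Naive Bayes written with the standard library only. It works on raw token counts from an invented toy training set (the notebook feeds tf-idf weights instead) and uses Laplace smoothing with `alpha=1.0`, which matches the default of `MultinomialNB`. The function names `train_multinomial_nb` and `predict` are hypothetical, chosen for this sketch.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocabulary = {word for doc in docs for word in doc}
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        # Laplace smoothing: add alpha to every word count in the class.
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            word: math.log((word_counts[c][word] + alpha) / total)
            for word in vocabulary
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest posterior log score."""
    scores = {}
    for c in log_prior:
        # Words never seen during training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][word] for word in doc if word in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training set (invented): 0 = ham, 1 = spam, as in the label mapping above.
train_docs = [
    ["win", "free", "cash"],
    ["free", "entri", "win"],
    ["ok", "see", "you", "soon"],
    ["see", "you", "at", "home"],
]
train_labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(train_docs, train_labels)
spam_guess = predict(["free", "cash", "now"], log_prior, log_likelihood)
ham_guess = predict(["see", "you", "soon"], log_prior, log_likelihood)
print(spam_guess, ham_guess)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.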


# Evaluating the Model
Once we have put together our classifier, we can evaluate its performance on the test set:



In [105]:
import numpy as np

predicted = model.predict(X_test)

print(np.mean(predicted == y_test))

0.9479820627802691


In [106]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, predicted))

[[964   0]
 [ 58  93]]
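
The confusion matrix can be unpacked into a few summary metrics by hand. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predictions, with 0 = ham and 1 = spam; the variable names are chosen here for illustration.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
matrix = [[964, 0],
          [58, 93]]

true_ham, false_spam = matrix[0]    # ham kept as ham / ham flagged as spam
missed_spam, true_spam = matrix[1]  # spam passed as ham / spam caught

total = true_ham + false_spam + missed_spam + true_spam
accuracy = (true_ham + true_spam) / total
spam_precision = true_spam / (true_spam + false_spam)
spam_recall = true_spam / (true_spam + missed_spam)

print(f"accuracy:       {accuracy:.4f}")
print(f"spam precision: {spam_precision:.4f}")
print(f"spam recall:    {spam_recall:.4f}")
```

The accuracy reproduces the value printed earlier. Read this way, the model never flags a legitimate message as spam on this test split, but it lets 58 of the 151 spam messages through, which the accuracy figure alone does not reveal.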
