# SMS Spam Detection Filter using NLTK

## Import Libraries

In [None]:
import numpy as np
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
%matplotlib inline

In [None]:
import nltk

In [None]:
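nltk.download('stopwords')

In the above step, we downloaded the stopwords list with `nltk.download('stopwords')`, which fetches the corpus directly without opening the interactive NLTK download shell.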

#### What are stopwords?

Stopwords are common English words that do not add much meaning to a sentence. They can usually be ignored without sacrificing the meaning of the sentence. Examples include "the", "he", and "have".

## Dataset Details

In the raw data, you can see that there is a gap between the label ("ham" or "spam") and the actual message.

This spacing tells us that the file is a TSV ("tab separated values") file, where the first column is a label saying whether the given message is a normal message (commonly known as "ham") or "spam". The second column is the message itself.

Using these labeled ham and spam examples, we'll train a machine learning model to learn to discriminate between ham/spam automatically. Then, with a trained model, we'll be able to classify arbitrary unlabeled messages as ham or spam.
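As a minimal sketch of the tab-separated layout described above, the snippet below parses two made-up lines with pandas. The sample text and the name `sample` are hypothetical and only illustrate the label/message structure; the notebook itself loads the data from a CSV file in the next cells.

```python
import io

import pandas as pd

# Hypothetical two-line sample in the layout: label<TAB>message
raw = "ham\tSee you at lunch\nspam\tFree entry, text WIN now\n"

# sep="\t" splits on tabs, so commas inside a message are left alone
sample = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                     names=["label", "message"])
print(sample)
```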

#### Creating a dataframe from the file with two columns - labels and messages

In [None]:
messages = pd.read_csv("../input/sms-spam-collection-dataset/spam.csv", encoding='latin-1')

In [None]:
messages.head()

In [None]:
messages=messages[['v1','v2']]

In [None]:
messages.head()

In [None]:
messages.columns = ['label','message']

## Exploratory Data Analysis

In [None]:
messages.describe()

#### Exploring the data based on ham and spam labels

In [None]:
# The axis argument of groupby is deprecated in recent pandas; rows are grouped by default
messages.groupby('label').describe()

In [None]:
messages['length'] = messages['message'].apply(len)

In [None]:
messages.head()

In [None]:
messages['length'].plot(bins=50,kind='hist')

The x axis of the histogram extends beyond 800 characters, which means that a few messages are much longer than the rest.

We can take a closer look using the summary statistics.

In [None]:
messages.describe()

From the above results we can see that the longest message has 910 characters. Let's look at that message to see whether it is spam or ham.

In [None]:
messages[messages['length']==910]['message'].iloc[0]

Now let's try to find a distinguishing feature between the two sets of messages, ham and spam.

In [None]:
messages.hist(column='length',by='label',bins=50,figsize=(12,4))

The plots show that spam messages tend to have more characters than ham messages.
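The same comparison can be expressed as a number rather than a plot. The sketch below uses three made-up messages (not taken from the dataset) and the hypothetical names `toy` and `avg_length` to show how the average length per label could be computed.

```python
import pandas as pd

# Hypothetical toy messages, not taken from the dataset
toy = pd.DataFrame({
    "label": ["ham", "ham", "spam"],
    "message": [
        "ok",
        "see you at five",
        "WINNER! You have been selected for a free prize, call now to claim",
    ],
})

# Character count per message, then the mean length within each label
toy["length"] = toy["message"].str.len()
avg_length = toy.groupby("label")["length"].mean()
print(avg_length)
```

On the real data, the equivalent call would be `messages.groupby('label')['length'].mean()`.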

## Text Preprocessing

The main issue with our data is that it is all in text format (strings). Most of the classification algorithms need some sort of numerical feature vector in order to perform the classification task. 

There are many methods to convert a corpus (a collection of texts) to a vector format. The simplest is the bag-of-words approach, where each unique word in a text is represented by one number.

We will be converting the raw messages (sequences of characters) into vectors (sequences of numbers).
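To make the bag-of-words idea concrete, here is a minimal pure-Python sketch. The three documents and the helper `to_vector` are hypothetical; the notebook uses scikit-learn for the real vectorization later on.

```python
from collections import Counter

# Hypothetical toy corpus
docs = ["free prize call now", "call me now", "free free prize"]
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of unique words across all documents
vocab = sorted({word for tokens in tokenized for word in tokens})

def to_vector(tokens):
    """Return the count of each vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [to_vector(tokens) for tokens in tokenized]
print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a list with one position per vocabulary word, so all vectors have the same length regardless of how long the original text was.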

#### Creating a function to remove all the punctuations and stopwords

We create a text processing function that takes in a string and performs the following steps:
1. Removes all punctuation
2. Removes all stopwords
3. Returns the cleaned text as a list of words

In [None]:
import string

Let's build the function step by step, trying out each idea on a sample string first.

The first step is to remove punctuation from the string below.

In [None]:
mess = "Sample Message! Notice: it has punctuation."

In [None]:
mess

In [None]:
nopunc_mess = [char for char in mess if char not in string.punctuation]

In [None]:
nopunc_mess

In [None]:
nopunc_mess = ''.join(nopunc_mess)

In [None]:
nopunc_mess

We have now removed the punctuation.
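As an aside, the same result can be obtained in one step with `str.translate`, which avoids building an intermediate list of characters. This is only an alternative sketch; the names `sample_text`, `table`, and `nopunc` are hypothetical, and the notebook continues with the list-comprehension version.

```python
import string

sample_text = "Sample Message! Notice: it has punctuation."

# Build a translation table that deletes every punctuation character
table = str.maketrans("", "", string.punctuation)
nopunc = sample_text.translate(table)
print(nopunc)
```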

The second step will be to remove stopwords from the string

In [None]:
from nltk.corpus import stopwords

In [None]:
# Printing out some stopwords
stopwords.words('english')[0:10]

In [None]:
nopunc_mess.split()

In [None]:
# Removing the stopwords
clean_mess = [word for word in nopunc_mess.split() if word.lower() not in stopwords.words('english')]

In [None]:
clean_mess

Now both the concepts will be put together in a function

In [None]:
# Build the stopword set once; set lookups are fast and the list is not reloaded per word
STOP_WORDS = set(stopwords.words('english'))

def text_process(mess):
    """Remove punctuation and stopwords; return the remaining words as a list."""
    # Remove punctuation characters and join the rest back into a string
    nopunc_mess = ''.join(char for char in mess if char not in string.punctuation)

    # Keep only the words that are not stopwords (case-insensitive comparison)
    return [word for word in nopunc_mess.split() if word.lower() not in STOP_WORDS]

In [None]:
messages.head()

#### Tokenizing Messages

Now we use the `text_process` function to tokenize the messages.

Tokenization is the process of converting a normal text string into a list of tokens (the words that we actually want).

In [None]:
messages['message'].head(5).apply(text_process)

In [None]:
messages.head()

#### Vectorization

We now have the messages as lists of tokens. Note that no [stemming or lemmatization](http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) has been applied, so these are plain word tokens rather than lemmas. Next we need to convert each message into a vector that scikit-learn's models can work with.

We'll do that in three steps using the bag-of-words model:
1. Count how many times a word occurs in each message (known as term frequency)
2. Weigh the counts, so that tokens that are frequent across the corpus get a lower weight (inverse document frequency)
3. Normalize the vectors to unit length, to abstract from the original text length (L2 norm)
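The three steps above can be sketched by hand with NumPy. The count matrix is made up, and the sketch uses the plain textbook IDF, log(N / df), which is an assumption for illustration; scikit-learn applies a smoothed variant.

```python
import numpy as np

# Step 1: hypothetical term counts, one row per message, one column per word
counts = np.array([[2, 1, 0],
                   [0, 1, 1],
                   [0, 1, 3]], dtype=float)

# Step 2: weigh counts by inverse document frequency (textbook form, assumed here)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of messages containing each word
idf = np.log(n_docs / doc_freq)       # a word found in every message gets weight 0
weighted = counts * idf

# Step 3: scale every row to unit L2 length
norms = np.linalg.norm(weighted, axis=1, keepdims=True)
normalized = weighted / norms
print(normalized)
```

The second and third messages end up with the same vector even though one repeats its distinctive word three times, which is what "abstracting from the original text length" means in practice.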

#### Step 1:

Each vector will have as many dimensions as there are unique words in the SMS corpus. We will first use scikit-learn's `CountVectorizer`, which converts a collection of text documents to a matrix of token counts.

We can imagine this as a two-dimensional matrix, where one dimension is the entire vocabulary (one row per word) and the other dimension is the documents (one column per text message). Note that `CountVectorizer` returns the transpose of this layout: one row per message and one column per word.

For example:

<table border="1">
<tr>
<th></th> <th>Message 1</th> <th>Message 2</th> <th>...</th> <th>Message N</th> 
</tr>
<tr>
<td><b>Word 1 Count</b></td><td>0</td><td>1</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>Word 2 Count</b></td><td>0</td><td>0</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>...</b></td> <td>1</td><td>2</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>Word N Count</b></td> <td>0</td><td>1</td><td>...</td><td>1</td>
</tr>
</table>


Since the vocabulary is large and each message contains only a few words, most of the counts will be zero. Because of this, scikit-learn will output a [sparse matrix](https://en.wikipedia.org/wiki/Sparse_matrix).

#### What is a Sparse Matrix?
A sparse matrix or sparse array is a matrix in which most of the elements are zero.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
bow_transformer = CountVectorizer(analyzer=text_process).fit(messages['message'])

In [None]:
# Print total number of vocab words
print(len(bow_transformer.vocabulary_))

Let's take one text message and get its bag-of-words counts as a vector, putting to use our new bow_transformer

In [None]:
message4 = messages['message'][3]

In [None]:
print(message4)

Now let us take a look at its vector representation

In [None]:
bow4 = bow_transformer.transform([message4])

In [None]:
bow4.shape

The above result shows that there is one row (the message) and 11425 columns (one per word in the vocabulary).

In [None]:
print(bow4)

The above results show that there are seven unique words in message number 4 (after removing common stop words). Two of them appear twice, the rest only once.

Let's check which words appear twice.

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(bow_transformer.get_feature_names_out()[4068])
print(bow_transformer.get_feature_names_out()[9554])

We can now use .transform on our Bag-of-Words (bow) transformed object and transform the entire DataFrame of messages

In [None]:
messages_bow = bow_transformer.transform(messages['message'])

In [None]:
print('Shape of Sparse Matrix: ', messages_bow.shape)
print('Amount of Non-Zero occurrences: ', messages_bow.nnz)

In [None]:
# Percentage of non-zero entries in the matrix (a lower value means a sparser matrix)
sparsity = 100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1])
print('sparsity: {}'.format(sparsity))

After the counting, the term weighting and normalization can be done with [TF-IDF](http://en.wikipedia.org/wiki/Tf%E2%80%93idf), using scikit-learn's `TfidfTransformer`.

#### What is TF-IDF?

TF-IDF stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.

Typically, the tf-idf weight is composed of two terms: the first is the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

**TF: Term Frequency**, which measures how frequently a term occurs in a document. Since documents differ in length, a term is likely to appear many more times in long documents than in short ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

*TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).*

**IDF: Inverse Document Frequency**, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

*IDF(t) = log_e(Total number of documents / Number of documents with term t in it).*

See below for a simple example.

**Example:**

Consider a document containing 100 words wherein the word cat appears 3 times. 

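The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Using a base-10 logarithm to keep the arithmetic simple, the inverse document frequency (i.e., idf) is log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. (With the natural logarithm used in the IDF formula above, the idf would be about 9.21 and the weight about 0.28; the choice of base only rescales the weights.)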
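The arithmetic of this example can be checked with a few lines of Python. The variable names are hypothetical; the numbers are the ones from the example.

```python
import math

term_count = 3            # occurrences of "cat" in the document
doc_length = 100          # total words in the document
n_documents = 10_000_000  # documents in the corpus
n_with_term = 1_000       # documents containing "cat"

tf = term_count / doc_length

# Base-10 logarithm, as in the worked example
idf_log10 = math.log10(n_documents / n_with_term)
# Natural logarithm, as in the IDF formula
idf_ln = math.log(n_documents / n_with_term)

print(tf * idf_log10)
print(tf * idf_ln)
```

The two weights differ only by a constant factor, so the relative ranking of terms is the same under either base.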

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

In [None]:
# Creating an instance of tfidf transformer and fitting it to the bag of words
tfidf_transformer = TfidfTransformer().fit(messages_bow)

In [None]:
tfidf4 = tfidf_transformer.transform(bow4)

In [None]:
print(tfidf4)

The above result shows the TF-IDF weight of each term in this particular message. In other words, we have transformed the raw word counts into TF-IDF scores.

We can also check the inverse document frequency of a particular word. For example, the IDF of the word "university" can be looked up as follows:

In [None]:
tfidf_transformer.idf_[bow_transformer.vocabulary_["university"]]
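The value returned by `idf_` will not match the textbook formula exactly, because `TfidfTransformer` with its default settings (`smooth_idf=True`) computes ln((1 + N) / (1 + df)) + 1. The sketch below checks this on a small made-up count matrix; the names `dense_counts`, `toy_tfidf`, and `manual_idf` are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 4 messages (rows) x 3 words (columns)
dense_counts = np.array([[1, 0, 2],
                         [0, 1, 1],
                         [1, 1, 0],
                         [0, 0, 1]])

toy_tfidf = TfidfTransformer().fit(csr_matrix(dense_counts))

# Recompute the smoothed IDF by hand
n_docs = dense_counts.shape[0]
doc_freq = (dense_counts > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(toy_tfidf.idf_)
print(manual_idf)
```

The added 1 means that a word appearing in every message still receives a non-zero weight instead of being dropped entirely.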

Now we can convert the entire bag-of-words corpus into a TF-IDF corpus.

In [None]:
messages_tfidf = tfidf_transformer.transform(messages_bow)

In [None]:
print(messages_tfidf)

## Training a Model

Now that the messages have been converted into vectors, we can train a model on our data. The Naive Bayes classification algorithm can be used to classify the text messages as ham or spam.

In [None]:
from sklearn.naive_bayes import MultinomialNB

We create a Naive Bayes object and fit it to the vectorized messages. The first argument of the `fit` method is the matrix of message vectors and the second argument is the corresponding labels.

In [None]:
spam_detect_model = MultinomialNB().fit(messages_tfidf,messages['label'])

Now let's classify a single message to see how well the algorithm is working.

In [None]:
print('predicted:', spam_detect_model.predict(tfidf4)[0])
print('expected:', messages.label[3])

The model predicts that message 4 (`tfidf4`) is ham, which matches its actual label. Note that this message was part of the training data, so this is only a sanity check and not an evaluation of the model.

#### Predicting the labels for all the messages

In [None]:
all_pred = spam_detect_model.predict(messages_tfidf)

In [None]:
all_pred

This was just a quick check of the model on the data it was trained on. Next we will divide the data into features and labels, split it into training and testing sets, and then create a data pipeline.

#### Dividing the Data into Features and Labels

In [None]:
X = messages['message']
y = messages['label']

#### Train Test Split

Dividing the data into training and testing data so that the model can be trained on one set of data and tested on the other set of data 

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
X_train
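The split above is different on every run because no seed is given. As a hedged sketch, passing `random_state` makes the split reproducible, and `stratify` keeps the ham/spam ratio similar in both parts, which can help when spam is the minority class. The toy data and the names below are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 7 ham and 3 spam messages
toy_texts = [f"message {i}" for i in range(10)]
toy_labels = ["ham"] * 7 + ["spam"] * 3

def split_once():
    # random_state fixes the shuffle; stratify preserves the class proportions
    return train_test_split(toy_texts, toy_labels, test_size=0.3,
                            random_state=42, stratify=toy_labels)

txt_train, txt_test, lab_train, lab_test = split_once()
print(len(txt_train), len(txt_test))
```

Calling `split_once()` again returns exactly the same partition, which makes later evaluation results comparable between runs.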

### Creating a Pipeline

Let's run our model again and then predict off the test set. We will use SciKit Learn's pipeline capabilities to store a pipeline of workflow. 

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])

Now we can directly pass message text data and the pipeline will do our pre-processing. We can treat it as a model/estimator API:

In [None]:
pipeline.fit(X_train,y_train)

In [None]:
predictions = pipeline.predict(X_test)
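Because the pipeline bundles vectorization, weighting, and classification, it can be given raw strings directly. The sketch below illustrates this on a made-up training set; it uses the default `CountVectorizer` tokenizer instead of `text_process` so that it runs without the NLTK data, and every name in it is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data
toy_texts = [
    "free prize call now",
    "win cash prize today",
    "are we still meeting for lunch",
    "see you at home tonight",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_pipeline = Pipeline([
    ("bow", CountVectorizer()),        # strings to token counts
    ("tfidf", TfidfTransformer()),     # counts to TF-IDF weights
    ("classifier", MultinomialNB()),   # Naive Bayes on the TF-IDF vectors
])
toy_pipeline.fit(toy_texts, toy_labels)

# New, unseen raw text goes through the same steps automatically
toy_pred = toy_pipeline.predict(["claim your free prize now"])
print(toy_pred[0])
```

With the notebook's fitted `pipeline`, the equivalent call is `pipeline.predict([...])` with a list of message strings.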

## Model Evaluation

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test,predictions))
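To help read the report, the sketch below computes precision, recall, and F1 for the spam class by hand on a small made-up set of predictions. The labels and names are hypothetical and are not the notebook's results.

```python
# Hypothetical true labels and predictions
toy_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
toy_pred_labels = ["ham", "ham", "spam", "spam", "ham", "ham"]

pairs = list(zip(toy_true, toy_pred_labels))
tp = sum(t == "spam" and p == "spam" for t, p in pairs)  # spam correctly flagged
fp = sum(t == "ham" and p == "spam" for t, p in pairs)   # ham wrongly flagged
fn = sum(t == "spam" and p == "ham" for t, p in pairs)   # spam that was missed

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

For a spam filter, low precision means legitimate messages are being blocked, while low recall means spam is getting through, so both columns of the report are worth checking for the spam row.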