### Assignment # 10 - Documents Classification

In [97]:
# Load the required packages
import numpy as np
import pandas as pd
import re
import random
import nltk
from nltk import word_tokenize, WordNetLemmatizer

import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics, svm
from sklearn.model_selection import train_test_split

%matplotlib inline


#### 1. Load data

The data set was downloaded to the local `data` directory and is loaded from there. The original data set can be found at: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

The data set consists of a collection of text messages that have been labeled as 'spam' or 'ham' (non-spam). There is one message per line with two columns: the first column is the label (ham/spam) and the second is the message text. The columns are separated by a tab and there is no header row.


In [98]:
# Use Pandas to load data
df = pd.read_table('data/SMSSpamCollection.txt', header=None)

# Shuffle the rows
df = df.sample(frac=1).reset_index(drop=True)

df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | ham | Thanks love. But am i doing torch or bold. |
| 1 | ham | Nope. I just forgot. Will show next week |
| 2 | ham | I'm back &amp; we're packing the car now, I'll... |
| 3 | ham | I not at home now lei... |
| 4 | ham | So when do you wanna gym? |


In [99]:
# Information about messages

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
0    5572 non-null object
1    5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


Let us examine the distribution of spam vs. non-spam messages.  

In [100]:
df.groupby(df[0]).count()

| 0 | 1 |
|---|---|
| ham | 4825 |
| spam | 747 |


There are roughly 6.5 times as many non-spam messages as spam messages (4825 versus 747). We need to take this imbalance into account when creating the training set: we will shuffle the data before dividing it, and we might also enforce the same class proportions in the training and test sets.
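
One way to keep the class proportions equal across the training and test sets is a stratified split. The following is a minimal sketch using only the standard library; `stratified_split` is a hypothetical helper written for illustration, and the label list is rebuilt from the counts shown above rather than read from the file.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Hypothetical helper: return (train_idx, test_idx) with each class
    contributing roughly test_fraction of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class before cutting
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Labels rebuilt from the class counts above (4825 ham, 747 spam)
labels = ['ham'] * 4825 + ['spam'] * 747
train_idx, test_idx = stratified_split(labels)

spam_share_all = labels.count('spam') / len(labels)
spam_share_test = sum(labels[i] == 'spam' for i in test_idx) / len(test_idx)
print(round(spam_share_all, 3), round(spam_share_test, 3))
```

The spam share in the test set stays close to the overall share of about 13%, which a purely random split does not guarantee.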

#### 2. Pre-processing the messages

Before building our features, we will look at the messages and decide how to pre-process them. The goal is to avoid creating so many features that the model overfits. We will start by examining the spam and non-spam messages.

In [101]:
df.loc[df[0] == 'spam'][0:5]

|    | 0 | 1 |
|----|---|---|
| 5 | spam | 1st wk FREE! Gr8 tones str8 2 u each wk. Txt N... |
| 9 | spam | Fancy a shag? I do.Interested? sextextuk.com t... |
| 12 | spam | URGENT! You have won a 1 week FREE membership ... |
| 22 | spam | XMAS iscoming & ur awarded either £500 CD gift... |
| 27 | spam | FreeMsg Today's the day if you are ready! I'm ... |


From the spam messages, we can deduce that spam often refers to phone numbers, monetary amounts in dollars or British pounds ($ and £ respectively), other numbers, and in some instances URLs.

Since these elements seem indicative of spam, we want to keep track of them, so we will replace each with a placeholder word using regular expressions:

* URLs will be replaced with urladdr
* monetary symbols will be replaced with monatorysymb
* phone numbers will be replaced with phonenumbr
* other numbers will be replaced with numbr

We will select a few messages that contain these expressions so that we can test the processing logic and confirm that it gives the desired results.


In [102]:
msg1 = 'XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL'
msg2 = 'Thanks for your subscription to Ringtone UK your mobile will be charged £5/month Please confirm by replying YES or NO. If you reply NO you will not be charged'
msg3 = '07732584351 - Rodger Burns - MSG = We tried to call you re your reply to our sms for a free nokia mobile + free camcorder. Please call now 08000930705 for delivery tomorrow'
msg4 = 'Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out! '
msg5 = 'As a valued customer, I am pleased to advise you that following recent review of your Mob No. you are awarded with a £1500 Bonus Prize, call 09066364589'
msg6 = 'Please call our customer service representative on FREEPHONE 0808 145 4742 between 9am-11pm as you have WON a guaranteed £1000 cash or £5000 prize!'
msg7 = 'Are you unique enough? Find out from 30th August. www.areyouunique.co.uk'

In [103]:
# tag URLs (http/https links and www. addresses) as urladdr
msg = re.sub(r'(http[s]?\S+)|(www\.[A-Za-z]{2,4}\S*)', 'urladdr', msg1)

In [104]:
# tag $, £ as monetary symbols
msg = re.sub(r'£|\$', 'monatorysymb', msg6)

In [105]:
# tag phone number
msg = re.sub(r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b',
    'phonenumbr', msg3)

In [106]:
# tag number
msg = re.sub(r'\d+(\.\d+)?', 'numbr', msg6)
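
To check the four tagging substitutions together, a sketch like the following prints the tagged form of a few strings taken from the sample messages above. `tag_message` is a hypothetical helper, not part of the notebook. The URL pattern is written with a literal `www\.` and the pound sign is matched as a character, which assumes Python 3 text strings.

```python
import re

# Patterns and placeholder tokens, applied in this order
TAGS = [
    (r'(http[s]?\S+)|(www\.[A-Za-z]{2,4}\S*)', 'urladdr'),
    (r'£|\$', 'monatorysymb'),
    (r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b', 'phonenumbr'),
    (r'\d+(\.\d+)?', 'numbr'),
]

def tag_message(text):
    """Hypothetical helper: apply each substitution in turn."""
    for pattern, token in TAGS:
        text = re.sub(pattern, token, text)
    return text

samples = [
    'Find out from 30th August. www.areyouunique.co.uk',
    'Please call now 08000930705 for delivery tomorrow',
    'you have WON a guaranteed £1000 cash or £5000 prize!',
]
for sample in samples:
    print(tag_message(sample))
```

The order matters: phone numbers have to be tagged before the generic number pattern runs, otherwise their digits would already have been replaced. Also note that a currency symbol directly followed by an amount ends up as the single token `monatorysymbnumbr`, since nothing separates the two placeholders.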

In addition, we will replace punctuation with spaces, collapse runs of whitespace into a single space, strip leading and trailing whitespace, and convert the text to lower case.

In [107]:
# Remove punctuation
msg = re.sub(r'[^\w\d\s]', ' ', msg2)

In [108]:
# Replace runs of whitespace with a single space
msg = re.sub(r'\s+', ' ', msg)

In [109]:
# Trimming leading and trailing white space
msg = re.sub(r'^\s+|\s+?$', '', msg)

In [110]:
# Convert to lower case
msg = msg.lower()
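
As a quick check of these cleanup steps, the sketch below chains them on a short string. `normalize` is a hypothetical helper; it uses `str.strip()` in place of the trimming regex, which is an assumption of equivalent behavior for whitespace at the ends.

```python
import re

def normalize(text):
    """Hypothetical helper combining the cleanup steps above."""
    text = re.sub(r'[^\w\s]', ' ', text)   # punctuation -> space
    text = re.sub(r'\s+', ' ', text)       # collapse whitespace runs
    return text.strip().lower()            # trim the ends, then lower-case

print(normalize("Dont miss out!  Call NOW... bx420-ip4-5we. "))
```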

We will build a function that applies all of these transformations, then apply it to the whole corpus.

In [111]:
# Define function for message processing 

def process_msg(msg):
    """Apply successive regex substitutions to prepare a message for feature extraction."""
    # Tag URLs (http/https links and www. addresses) as urladdr
    processed_msg = re.sub(r'(http[s]?\S+)|(www\.[A-Za-z]{2,4}\S*)', 'urladdr', msg)
    # Tag $ and £ as monatorysymb
    processed_msg = re.sub(r'£|\$', 'monatorysymb', processed_msg)
    # Tag phone numbers as phonenumbr
    processed_msg = re.sub(r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b',
                           'phonenumbr', processed_msg)
    # Tag remaining numbers as numbr
    processed_msg = re.sub(r'\d+(\.\d+)?', 'numbr', processed_msg)
    # Replace punctuation with spaces
    processed_msg = re.sub(r'[^\w\d\s]', ' ', processed_msg)
    # Collapse runs of whitespace into a single space
    processed_msg = re.sub(r'\s+', ' ', processed_msg)
    # Trim leading and trailing whitespace
    processed_msg = re.sub(r'^\s+|\s+?$', '', processed_msg)
    # Convert to lower case
    processed_msg = processed_msg.lower()

    return processed_msg

df[1] = df[1].apply(process_msg)
    

#### 3. Tokenization & Lemmatization

We will now remove stop words and reduce each remaining word to its lemma.

In [112]:
from nltk.corpus import stopwords
stoplist = stopwords.words('english')

wordnet_lemmatizer = WordNetLemmatizer()

def process_words(msg):
    """Tokenize a message, drop stop words, and lemmatize the remaining tokens."""
    tokens = word_tokenize(msg)
    # Keep only the tokens that are not English stop words
    kept_tokens = [word for word in tokens if word not in stoplist]
    processed_tokens = [wordnet_lemmatizer.lemmatize(word) for word in kept_tokens]
    return ' '.join(processed_tokens)

In [113]:
df[2]=df[1].apply(process_words)

In [114]:
df.head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | ham | thanks love but am i doing torch or bold | thanks love torch bold |
| 1 | ham | nope i just forgot will show next week | nope forgot show next week |
| 2 | ham | i m back amp we re packing the car now i ll le... | back amp packing car let know room |
| 3 | ham | i not at home now lei | home lei |
| 4 | ham | so when do you wanna gym | wan na gym |


#### 4. Vectorization and building features

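We will first separate the messages from the labels and then use TfidfVectorizer from the sklearn package to build the feature matrix. Note that the code below vectorizes column 1 (the regex-cleaned messages) rather than the lemmatized text in column 2.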


In [115]:
# Identify processed messages and labels
messages = df[1]
labels = df[0]
# Encode the class labels as numbers (ham -> 0, spam -> 1)
le = LabelEncoder()
labels_enc = le.fit_transform(labels)

# Use TfidfVectorizer to build the feature matrix; only unigrams are considered
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(messages)


In [116]:
X.shape

(5572, 7986)
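
For intuition about what each of these columns holds, the sketch below recomputes TF-IDF weights by hand for three made-up messages. It uses the smoothed idf, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The toy messages, the whitespace tokenization, and the helper `tfidf_row` are assumptions for illustration; the real vectorizer applies its own tokenizer.

```python
import math
from collections import Counter

# Three made-up messages (not taken from the data set)
docs = [
    'free prize call now',
    'call me when you are home',
    'are you home now',
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of messages that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Hypothetical helper: TF-IDF weights for one message, L2-normalized."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for term, weight in sorted(rows[0].items()):
    print(term, round(weight, 3))
```

In the first message, "free" and "prize" appear in only one message and so receive a higher weight than "call" and "now", which each appear in two.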

#### 5. Training and Evaluating Model

In [117]:
# Prepare the training and test sets using an 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    labels_enc,
    test_size=0.2,
    random_state=42,
    stratify=labels_enc
)

We will use the Naive Bayes model.

In [118]:
clf_nb = MultinomialNB()
clf_nb.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

We will also train a support vector machine (SVM) classifier.

In [119]:
# Train SVM with a linear kernel on the training set
clf_svm = svm.LinearSVC(loss='hinge')
clf_svm.fit(X_train, y_train)


LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0)

We will now predict with both models on the test set and compare the results.

In [120]:
# Evaluate the classifier on the test set
y_pred_nb = clf_nb.predict(X_test)
y_pred_svm = clf_svm.predict(X_test)

We will now compute the F1 score for both.

In [121]:
# Compute the F1 scores
f1_nb = metrics.f1_score(y_test, y_pred_nb)
f1_svm = metrics.f1_score(y_test, y_pred_svm)

In [122]:
print('F1 score for Naive Bayes : ' + str(f1_nb))
print('F1 score for SVM         : ' + str(f1_svm))

F1 score for Naive Bayes : 0.866920152091
F1 score for SVM         : 0.932384341637


The SVM model appears to perform better (an F1 score of about 0.93 versus 0.87). We will now look at the confusion matrices.

In [123]:
# Build the confusion matrices.
# LabelEncoder sorts the classes alphabetically (ham = 0, spam = 1),
# so the rows and columns are ordered ham, spam.
conf_nb = pd.DataFrame(
    metrics.confusion_matrix(y_test, y_pred_nb),
    index=[['actual', 'actual'], ['ham', 'spam']],
    columns=[['predicted', 'predicted'], ['ham', 'spam']]
)
conf_svm = pd.DataFrame(
    metrics.confusion_matrix(y_test, y_pred_svm),
    index=[['actual', 'actual'], ['ham', 'spam']],
    columns=[['predicted', 'predicted'], ['ham', 'spam']]
)

In [124]:
print ('Confusion Matrix for Naive Bayes')
conf_nb

Confusion Matrix for Naive Bayes


|             | predicted ham | predicted spam |
|-------------|---------------|----------------|
| actual ham  | 966           | 0              |
| actual spam | 35            | 114            |


In [125]:
print ('Confusion Matrix for SVM')
conf_svm

Confusion Matrix for SVM


|             | predicted ham | predicted spam |
|-------------|---------------|----------------|
| actual ham  | 965           | 1              |
| actual spam | 18            | 131            |


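Because `LabelEncoder` sorts the classes alphabetically (ham = 0, spam = 1), the first row and column of each matrix correspond to ham and the second to spam. Although both models are very good at recognizing ham and fairly good at catching spam, SVM misses fewer spam messages than Naive Bayes (18 false negatives versus 35), at the cost of a single ham message flagged as spam.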
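
As a consistency check, the F1 scores reported earlier can be recomputed from the confusion matrix counts, taking spam (encoded as 1) as the positive class. This is a small sketch; `f1_from_counts` is a hypothetical helper and the counts are read off the two matrices above.

```python
def f1_from_counts(tp, fp, fn):
    """Hypothetical helper: F1 score from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Spam is the positive class: tp = spam caught, fn = spam missed, fp = ham flagged as spam
f1_nb = f1_from_counts(tp=114, fp=0, fn=35)
f1_svm = f1_from_counts(tp=131, fp=1, fn=18)
print(round(f1_nb, 4), round(f1_svm, 4))
```

Both values agree with the scores printed by `metrics.f1_score`. Naive Bayes has perfect precision but lower recall, which is what pulls its F1 score below that of the SVM.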

