# Spam Message Classifier 
### Built with NLTK and Scikit-learn (achieving 99% accuracy)

## Data Source Overview
The data set is from the [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). It contains 5,572 labeled SMS messages that have been collected for mobile phone spam research.

## Project Overview
<details>
    <summary>1. Load the Dataset</summary>
    <p>Load the data into a pandas DataFrame and check that the values are read correctly.</p>
</details>

<details>
    <summary>2. Data Preprocessing</summary>
    <p>Convert the class labels (spam/ham) to binary values using sklearn's LabelEncoder. Replace email addresses, URLs, and phone numbers with generic placeholders. Then remove stop words and keep only word stems.</p>
</details>

<details>
    <summary>3. Generating Features</summary>
    <p>Treat words as features, choosing the 500-1500 most common words. Then process each message and build its word-feature table.</p>
</details>

<details>
    <summary>4. Model Training and Performance Evaluation</summary>
    <p>Split the data set into 75% training data and 25% testing data. Train models with different algorithms (including an ensemble voting classifier) and compare their performance.</p>
</details>

In [1]:
# Prestart: import the major libraries
import sys # for printing version information, etc.
import nltk # for NLP techniques
import sklearn # for training ML models
import pandas # for storing data in dataframes
import numpy # for basic computational tasks

#checking libraries are correctly installed
#print('Python: {}'.format(sys.version))
#print('NLTK: {}'.format(nltk.__version__))
#print('Scikit-learn: {}'.format(sklearn.__version__))
#print('Pandas: {}'.format(pandas.__version__))
#print('Numpy: {}'.format(numpy.__version__))

## 1. Load the Dataset

In [2]:
import pandas as pd
import numpy as np

#load the dataset of sms messages
df = pd.read_table('SMSSpamCollection', header = None, encoding='utf-8' )

In [3]:
# useful information about the data set: 5572 rows and two columns, 0: spam/ham label, 1: message body
print(df.info())
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       5572 non-null   object
 1   1       5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
None
      0                                                  1
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [4]:
#check class distribution
classes = df[0]
print(classes.value_counts()) # prints how many times each label occurs in the first column

ham     4825
spam     747
Name: 0, dtype: int64
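
Because the two classes are imbalanced, it can help to know what accuracy a trivial classifier would reach by always predicting the majority class. The sketch below does that arithmetic from the counts printed above. The variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
# A classifier that always answers with the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print("Total messages: {}".format(total))
print("Majority class: {}".format(majority_label))
print("Majority-class baseline accuracy: {:.1f}%".format(baseline_accuracy * 100))
```

Any model accuracy reported later is best read against this baseline of roughly 86.6% rather than against 50%.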


## 2. Preprocess the Data
Convert the class labels (spam/ham) to binary values using sklearn's LabelEncoder. 
Replace email addresses, URLs, and phone numbers with generic placeholders. 
Finally, remove stop words and keep only word stems.

In [5]:
# convert class labels to binary values, 0 = ham, 1 = spam using sklearn label encoder

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder() 
Y = encoder.fit_transform(classes) #classes is the df[0]. the first column of our data
#print(classes[:10]) #printing first 10 to check if conversion is successful
#print(Y[:10]) 

In [6]:
# store the SMS message data
text_messages = df[1]
# print(text_messages[:10]) #printing first 10 to check if text_msg is stored successful

In [7]:
# use regex to replace certain strings with generic placeholders so the models can generalize better
# e.g. email addresses -> 'emailaddress', URLs -> 'webaddress', phone numbers -> 'phonenumbr'
# https://regexlib.com/ can be used to search for ready-made regular expressions

# Replace email addresses with 'emailaddress'
processed = text_messages.str.replace(r'^.+@[^\.].*\.[a-z]{2,}$',' emailaddress ', regex=True)

# Replace URLs with 'webaddress'
processed = processed.str.replace(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$',' webaddress ', regex=True)

# Replace money symbols with 'moneysymb' (£ can be typed with ALT key + 156)
processed = processed.str.replace(r'£|\$', ' moneysymb ', regex=True)

# Replace 10-digit phone numbers (formats include parentheses, spaces, no spaces, dashes) with 'phonenumbr'
processed = processed.str.replace(r'^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$',' phonenumbr ', regex=True)
    
# Replace numbers with 'numbr'
processed = processed.str.replace(r'\d+(\.\d+)?', ' numbr ', regex=True)
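
One detail of the cell above is worth checking on a small example: the email, URL, and phone patterns start with `^` and end with `$`, so they only match when the entire message fits the pattern. The sketch below uses the standard `re` module to compare the notebook's anchored email pattern with an unanchored one. The unanchored pattern and the sample strings are simplified assumptions for illustration, not part of the original notebook.

```python
import re

# Anchored email pattern, as used in the cell above
anchored = re.compile(r'^.+@[^\.].*\.[a-z]{2,}$')
# Hypothetical unanchored alternative (simplified, for illustration only)
unanchored = re.compile(r'\S+@\S+\.[a-z]{2,}')

samples = [
    "bob@example.com",                     # the whole message is an address
    "mail me at bob@example.com tonight",  # address embedded in a sentence
]

for text in samples:
    with_anchors = anchored.sub(' emailaddress ', text)
    without_anchors = unanchored.sub(' emailaddress ', text)
    print(repr(with_anchors), '|', repr(without_anchors))
```

In this sketch the anchored pattern leaves the second sample unchanged, while the unanchored pattern replaces only the address. Whether that difference matters for the classifier would need to be measured on the real data.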

In [8]:
# Remove punctuation
processed = processed.str.replace(r'[^\w\d\s]', ' ', regex=True)

# Replace whitespace between terms with a single space
processed = processed.str.replace(r'\s+', ' ', regex=True)

# Remove leading and trailing whitespace
processed = processed.str.replace(r'^\s+|\s+?$', '', regex=True)

In [9]:
# change words to lower case - Hello, HELLO, hello are all the same word
processed = processed.str.lower()
#print(processed)

In [10]:
from nltk.corpus import stopwords

# remove stop words from text messages
# stop words = words generally filtered out before processing natural language (e.g. for, the, ...)

stop_words = set(stopwords.words('english'))
#print(stop_words)

# for each message x, split it into words with x.split(), keep the words that are NOT in stop_words, and join them back with spaces
processed = processed.apply(lambda x: ' '.join(
    term for term in x.split() if term not in stop_words))

In [11]:
# Keeping only word stems using a Porter stemmer
# e.g. removing the tenses & ings... 
ps = nltk.PorterStemmer()

#for all words(term) in x.split(), we stem it (ps.stem(term)) and join(append) all the stemmed term
processed = processed.apply(lambda x: ' '.join(
    ps.stem(term) for term in x.split()))

In [12]:
print(processed) # print the processed content after text cleaning, stop-word removal, and stemming

0       go jurong point crazi avail bugi n great world...
1                                   ok lar joke wif u oni
2       free entri numbr wkli comp win fa cup final tk...
3                     u dun say earli hor u c alreadi say
4                    nah think goe usf live around though
                              ...                        
5567    numbr nd time tri numbr contact u u moneysymb ...
5568                              ü b go esplanad fr home
5569                                    piti mood suggest
5570    guy bitch act like interest buy someth els nex...
5571                                       rofl true name
Name: 1, Length: 5572, dtype: object


## 3. Generating Features
In NLP, the features for machine learning algorithms are generally the words in each text message. To build them, tokenize each message and choose the 500-1500 most common words as features.

In [13]:
from nltk.tokenize import word_tokenize

# create bag-of-words
all_words = []

for message in processed:
    words = word_tokenize(message)
    for w in words:
        all_words.append(w)

# build an NLTK frequency distribution of the tokenized words
all_words = nltk.FreqDist(all_words)


In [14]:
# print the number of distinct words (and, optionally, the most common words)
print('Number of words: {}'.format(len(all_words)))
#print('Most common words: {}'.format(all_words.most_common(500)))

Number of words: 6318


In [15]:
# use the 1000 most common words as features
# the more features, the longer training takes, so anywhere in the 500~1500 range is an adequate amount
word_features = all_words.most_common(1000)
word_features = [item[0] for item in word_features]

#print(word_features)

In [16]:
# The find_features function determines which of the 1000 word features are contained in the message
def find_features(message):
    words = word_tokenize(message)
    features = {}
    for word in word_features:
        features[word] = (word in words)

    return features

# # Checking output of the first sentence
# features = find_features(processed[0])
# for key, value in features.items():
#     if value == True:
#         print(key)
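
The feature-generation steps can also be traced on a toy corpus using only the standard library. This is a minimal sketch, assuming `collections.Counter` in place of `nltk.FreqDist` and `str.split()` in place of `word_tokenize`. The toy messages and the `toy_` names are made up for illustration.

```python
from collections import Counter

# Toy, already-preprocessed messages (made up for illustration)
toy_messages = [
    "free entri win cup",
    "ok lar joke wif u",
    "free call win prize now",
]

# Count every token across the corpus; str.split() stands in for word_tokenize
token_counts = Counter(token for msg in toy_messages for token in msg.split())

# Keep the k most frequent tokens as the feature vocabulary
k = 2
toy_word_features = [token for token, _ in token_counts.most_common(k)]

def toy_find_features(message, vocabulary):
    """Map each vocabulary word to True/False depending on its presence in the message."""
    present = set(message.split())  # set membership avoids rescanning a list per word
    return {word: (word in present) for word in vocabulary}

print(toy_word_features)
print(toy_find_features("win free ticket", toy_word_features))
```

Each message becomes a dictionary of booleans with one key per vocabulary word, which is the shape that NLTK's classifier wrappers accept as a feature set.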

In [17]:
# Now lets find features for all the messages
messages = list(zip(processed, Y))

# define a seed for reproducibility
seed = 1
np.random.seed(seed) # call the function; assigning to np.random.seed would not seed anything
np.random.shuffle(messages) # shuffle the messages so that spam and ham are randomly ordered

# call find_features function for each SMS message
# after zip, the messages are in (text, label) format
# find_features(text) returns a dict mapping each common word to True/False, depending on whether the word appears in that message
featuresets = [(find_features(text), label) for (text, label) in messages]
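
A quick way to confirm that the shuffle is reproducible is to run it twice with the same seed and compare the results. The sketch below assumes NumPy's legacy global generator, which is the one `np.random.shuffle` uses. The helper `shuffled_copy` and the toy data are hypothetical.

```python
import numpy as np

def shuffled_copy(items, seed):
    """Return a shuffled copy of items after seeding NumPy's global generator."""
    np.random.seed(seed)       # seeding requires calling the function
    result = list(items)
    np.random.shuffle(result)  # shuffles the list in place
    return result

toy = [("msg a", 0), ("msg b", 1), ("msg c", 0), ("msg d", 0)]

first = shuffled_copy(toy, seed=1)
second = shuffled_copy(toy, seed=1)

print(first == second)               # same seed, same order
print(sorted(first) == sorted(toy))  # shuffling only reorders the items
```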

In [18]:
# we can split the featuresets into training and testing datasets using sklearn
from sklearn import model_selection

# split the data into training and testing datasets
training, testing = model_selection.train_test_split(featuresets, test_size = 0.25, random_state=seed)
print("Training data sets: ",len(training)) #75% of original data set
print("Testing data sets: ",len(testing)) #25% of original data set since we set test_size = 0.25

Training data sets:  4179
Testing data sets:  1393
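
The split sizes can be reproduced without the real features by splitting stand-in labels with the same class counts. The sketch below also passes `stratify=labels`, an optional argument the original split does not use, to show how the spam share can be kept similar in both parts. The stand-in `labels` and `indices` lists are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as the data set (0 = ham, 1 = spam)
labels = [0] * 4825 + [1] * 747
indices = list(range(len(labels)))

# stratify=labels is an optional extra that the original split does not use
train_idx, test_idx, y_train, y_test = train_test_split(
    indices, labels, test_size=0.25, random_state=1, stratify=labels)

print("Training size:", len(train_idx))
print("Testing size:", len(test_idx))
print("Spam share in train: {:.3f}".format(sum(y_train) / len(y_train)))
print("Spam share in test:  {:.3f}".format(sum(y_test) / len(y_test)))
```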


## 4. Scikit-Learn Classifiers with NLTK
Train different models and compare their performance. Import the required algorithms from sklearn along with some performance metrics, such as accuracy_score and classification_report.

In [19]:
# We can use sklearn algorithms in NLTK
# by wrapping the sklearn classifiers with NLTK's SklearnClassifier
from nltk.classify.scikitlearn import SklearnClassifier

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC #supportvectorclassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Below code trains one model only
# model = SklearnClassifier(SVC(kernel = 'linear'))

# # train the model on the training data
# model.train(training)

# # and test on the testing dataset!
# accuracy = nltk.classify.accuracy(model, testing)*100
# print("SVC Accuracy: {}".format(accuracy))

# Below code trains multiple models at the same time 
# Define models to train
names = ["K Nearest Neighbors", "Decision Tree", "Random Forest", "Logistic Regression", "SGD Classifier",
         "Naive Bayes", "SVM Linear"]

classifiers = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    LogisticRegression(),
    SGDClassifier(max_iter = 100), #need to set the max iteration for SGDClassifier model
    MultinomialNB(),
    SVC(kernel = 'linear') #set the kernel to linear
]

models = zip(names, classifiers)


In [20]:
#run each model and show the accuracy of each model
bestModelName = ''
bestModelAccuracy = 0.01
for name, classifier in models: # singular name avoids shadowing the classifiers list
    nltk_model = SklearnClassifier(classifier) #define the model
    nltk_model.train(training) #train the model with the training dataset
    accuracy = nltk.classify.accuracy(nltk_model, testing)*100 # measure the accuracy of the model by passing in the testing dataset
    print("{} Accuracy: {}%".format(name, accuracy))
    if accuracy>bestModelAccuracy:
        bestModelAccuracy = accuracy
        bestModelName = name

print("--All models completed--")
print("{} Model provides the best accuracy of: {}%".format(bestModelName, bestModelAccuracy))

K Nearest Neighbors Accuracy: 95.5491744436468%
Decision Tree Accuracy: 96.98492462311557%
Random Forest Accuracy: 98.85139985642498%
Logistic Regression Accuracy: 99.06676238334529%
SGD Classifier Accuracy: 98.99497487437185%
Naive Bayes Accuracy: 98.85139985642498%
SVM Linear Accuracy: 98.92318736539842%
--All models completed--
Logistic Regression Model provides the best accuracy of: 99.06676238334529%


In [21]:
# Ensemble methods - Voting classifier
# Have all the models vote on the classification of each text message,
# so instead of relying on only one algorithm, we take the label chosen by the majority of models
from sklearn.ensemble import VotingClassifier

names = ["K Nearest Neighbors", "Decision Tree", "Random Forest", "Logistic Regression", "SGD Classifier",
         "Naive Bayes", "SVM Linear"]

classifiers = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    LogisticRegression(),
    SGDClassifier(max_iter = 100),
    MultinomialNB(),
    SVC(kernel = 'linear')
]

models = zip(names, classifiers)

# hard voting takes the majority of the predicted labels; soft voting averages the predicted class probabilities
# n_jobs = -1 means all cores are used to train the models in parallel; if set to 2, only 2 cores are used
nltk_ensemble = SklearnClassifier(VotingClassifier(estimators = list(models), voting = 'hard', n_jobs = -1))
nltk_ensemble.train(training)
accuracy = nltk.classify.accuracy(nltk_ensemble, testing)*100
print("Voting Classifier: Accuracy: {}".format(accuracy))
if accuracy>bestModelAccuracy:
    print("Ensemble method improve overall accuracy by", accuracy - bestModelAccuracy, "%")
else:
    print("Ensemble method does not improve overall accuracy. Perhaps because all models failing at similar edge cases")


Voting Classifier: Accuracy: 99.21033740129216
Ensemble method improve overall accuracy by 0.14357501794687266 %
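
The idea behind hard voting can be shown with plain Python: each classifier casts one vote per message, and the most frequent label wins. This is a minimal sketch of majority voting, not scikit-learn's implementation. The `hard_vote` helper and the votes are hypothetical.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from seven classifiers for three messages (0 = ham, 1 = spam)
per_message_votes = [
    [0, 0, 0, 0, 0, 0, 0],  # unanimous ham
    [1, 1, 0, 1, 1, 0, 1],  # five of seven say spam
    [0, 1, 0, 0, 1, 0, 1],  # four of seven say ham
]

print([hard_vote(votes) for votes in per_message_votes])
```

With seven voters and two classes a tie cannot occur, which is one practical reason to keep the number of estimators odd in a binary problem.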


In [22]:
# make class label prediction for testing set
txt_features, labels = zip(*testing)

prediction = nltk_ensemble.classify_many(txt_features)

In [23]:
# print a confusion matrix and a classification report
print(classification_report(labels, prediction))

pd.DataFrame(
    confusion_matrix(labels, prediction),
    index = [['actual', 'actual'], ['ham', 'spam']],
    columns = [['predicted', 'predicted'], ['ham', 'spam']])

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      1200
           1       0.99      0.95      0.97       193

    accuracy                           0.99      1393
   macro avg       0.99      0.97      0.98      1393
weighted avg       0.99      0.99      0.99      1393



| | predicted ham | predicted spam |
|---|---|---|
| actual ham | 1199 | 1 |
| actual spam | 10 | 183 |
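
The figures in the classification report can be recomputed by hand from the four confusion-matrix counts. The sketch below treats spam as the positive class. The variable names are illustrative.

```python
# Counts copied from the confusion matrix above (spam is the positive class)
tn, fp = 1199, 1   # actual ham:  predicted ham, predicted spam
fn, tp = 10, 183   # actual spam: predicted ham, predicted spam

precision = tp / (tp + fp)                 # share of predicted spam that is really spam
recall = tp / (tp + fn)                    # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("Spam precision: {:.2f}".format(precision))
print("Spam recall:    {:.2f}".format(recall))
print("Spam F1:        {:.2f}".format(f1))
print("Accuracy:       {:.4f}".format(accuracy))
```

The lower recall for spam reflects the 10 spam messages that were classified as ham, compared with only 1 ham message flagged as spam.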
