**Description:** This notebook is used to implement a classifier to devide "SPAM" and "no SPAM" messages. The dataset `SMS Spam Collection v. 1` can be found ([here](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/)).

**Project Name:** Natural Language SPAM Classifier

**Author:** Silas Mederer

**Date:** 2020-12-09

# Setup

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

import nltk

# warnings handler
import warnings
warnings.filterwarnings("ignore")

## Load dataset

In [2]:
df = pd.read_csv("data/SMSSpamCollection.txt", encoding="utf-8", header=None, delimiter="\t", names=["target", "text"])
df.head()

Unnamed: 0,target,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
# Encoding target variable
df["target"] = np.where(df["target"] == "spam", 1, 0)
df.head(11)

Unnamed: 0,target,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
5,1,FreeMsg Hey there darling it's been 3 week's n...
6,0,Even my brother is not like to speak with me. ...
7,0,As per your request 'Melle Melle (Oru Minnamin...
8,1,WINNER!! As a valued network customer you have...
9,1,Had your mobile 11 months or more? U R entitle...


# Short EDA

In [4]:
print("Dataset contains {} instances of {} variables.".format(df.shape[0], df.shape[1]))

print(
    "It contains {} spam messages ({:.1%} of all)".format(
        df[df.target == 1].shape[0],
        df[df.target == 1].shape[0] / df.shape[0],
    )
)

Dataset contains 5572 instances of 2 variables.
It contains 747 spam messages (13.4% of all)


In [5]:
print("\nExample 'no SPAM'")
print(df.text[0])
print(df.text[1])
print("\nExample 'SPAM'")
print(df.text[2])
print(df.text[5])
print(df.text[8])


Example 'no SPAM'
Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
Ok lar... Joking wif u oni...

Example 'SPAM'
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.


It seems like SPAM messages start with wordslike "winner" or "free". Let´s keep this in mind while evaluation.

# Spam classification

The model will be a "vanilla" classifier, without pouring too many thoughts about what the actual messages, spam or not, look like. To improve your model you can of course have a closer look and investigate the data more in detail. 

In [6]:
# Split dataset between train and test sets
X_train, X_test, y_train, y_test = train_test_split(df["text"], df["target"], random_state=101)

We'll start by using a count vectorizer on our data. It will convert our text and return a sparse matrix.

## CountVectorizer

In [7]:
vect = CountVectorizer().fit(X_train)

# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)
print("X_train_vectorized: ")
X_train_vectorized

X_train_vectorized: 


<4179x7523 sparse matrix of type '<class 'numpy.int64'>'
	with 56353 stored elements in Compressed Sparse Row format>

In [8]:
print("X_train shape     = {}".format(X_train.shape[0]))
print("Vocabulary length = {}".format(len(vect.vocabulary_)))

X_train shape     = 4179
Vocabulary length = 7523


In [9]:
# vect.vocabulary_

In [10]:
# Train Logistic Regression
log_reg = LogisticRegression(max_iter=1500)
log_reg.fit(X_train_vectorized, y_train)

# Train KNN
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_vectorized, y_train)

# Train SVC
svc = LinearSVC()
svc.fit(X_train_vectorized, y_train)

# Predict the transformed test documents
log_reg_predictions = log_reg.predict(vect.transform(X_test))
knn_predictions = knn.predict(vect.transform(X_test))
svc_predictions = svc.predict(vect.transform(X_test))

### Evaluation

In [11]:
print(f"LogisticRegression AUC = {round(roc_auc_score(y_test, log_reg_predictions),3)}")
print(f"KNN                AUC = {round(roc_auc_score(y_test, knn_predictions),3)}")
print(f"SVC                AUC = {round(roc_auc_score(y_test, svc_predictions),3)}")

LogisticRegression AUC = 0.917
KNN                AUC = 0.822
SVC                AUC = 0.935


The best model is a linear support vector classifier with an AUC of 0.935, second best is a logistic regression with AUC 0.917.

In [12]:
# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names())

# Sort the coefficients from the model
log_reg_sorted_coef_index = log_reg.coef_[0].argsort()
print("Log Reg features:\n")
print("Smallest Coefs:\n{}\n".format(feature_names[log_reg_sorted_coef_index[:10]]))
print("Largest Coefs: \n{}".format(feature_names[log_reg_sorted_coef_index[:-11:-1]]))

svc_sorted_coef_index = svc.coef_[0].argsort()
print("\n" + "SVC features:\n")
print("Smallest Coefs:\n{}\n".format(feature_names[svc_sorted_coef_index[:10]]))
print("Largest Coefs: \n{}".format(feature_names[svc_sorted_coef_index[:-11:-1]]))

Log Reg features:

Smallest Coefs:
['my' 'but' 'gt' 'lt' 'sir' 'fullonsms' 'him' 'me' 'can' 'place']

Largest Coefs: 
['txt' 'text' 'call' 'ringtone' 'reply' 'chat' 'won' 'uk' '150p' 'stop']

SVC features:

Smallest Coefs:
['liked' 'place' 'sir' 'once' 'fullonsms' 'gt' 'can' 'lunch' 'lt' 'missed']

Largest Coefs: 
['146tf150p' 'ringtoneking' '84484' 'ringtone' 'stories' 'filthy' 'txt'
 '88066' 'won' 'chat']


## TfidfVectorizer

In [13]:
vect = TfidfVectorizer(min_df=3).fit(X_train)

# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)
print("X_train_vectorized: ")
X_train_vectorized

X_train_vectorized: 


<4179x2308 sparse matrix of type '<class 'numpy.float64'>'
	with 49944 stored elements in Compressed Sparse Row format>

In [14]:
# vect.vocabulary_

In [15]:
# Train Logistic Regression
log_reg = LogisticRegression(max_iter=1500)
log_reg.fit(X_train_vectorized, y_train)

# Train KNN
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_vectorized, y_train)

# Train SVC
svc = LinearSVC()
svc.fit(X_train_vectorized, y_train)

# Predict the transformed test documents
log_reg_predictions = log_reg.predict(vect.transform(X_test))
knn_predictions = knn.predict(vect.transform(X_test))
svc_predictions = svc.predict(vect.transform(X_test))

### Evaluation

In [16]:
print(f"LogisticRegression AUC = {round(roc_auc_score(y_test, log_reg_predictions),3)}")
print(f"KNN                AUC = {round(roc_auc_score(y_test, knn_predictions),3)}")
print(f"SVC                AUC = {round(roc_auc_score(y_test, svc_predictions),3)}")

LogisticRegression AUC = 0.876
KNN                AUC = 0.822
SVC                AUC = 0.944


The SVC is the best and increased to an AUC score of 0.944. Both other models performances decreased.

In [17]:
# Sort the coefficients from the model
log_reg_sorted_coef_index = log_reg.coef_[0].argsort()
print("Log Reg features:\n")
print("Smallest Coefs:\n{}\n".format(feature_names[log_reg_sorted_coef_index[:10]]))
print("Largest Coefs: \n{}".format(feature_names[log_reg_sorted_coef_index[:-11:-1]]))

svc_sorted_coef_index = svc.coef_[0].argsort()
print("\n" + "SVC features:\n")
print("Smallest Coefs:\n{}\n".format(feature_names[svc_sorted_coef_index[:10]]))
print("Largest Coefs: \n{}".format(feature_names[svc_sorted_coef_index[:-11:-1]]))

Log Reg features:

Smallest Coefs:
['beautiful' 'blacko' '2geva' 'aroundn' 'costs' 'barolla' 'alive' 'bought'
 '2stoptx' 'anything']

Largest Coefs: 
['cutest' '2morrow' 'corect' 'admit' 'bergkamp' 'competition' 'director'
 'cricketer' 'casting' 'advisors']

SVC features:

Smallest Coefs:
['affection' 'beautiful' '2stoptx' 'alive' 'desparate' 'b4280703'
 'barolla' '2geva' 'blacko' 'bsnl']

Largest Coefs: 
['cutest' 'corect' 'da' '08000930705' '382' 'bergkamp' 'director' 'child'
 '402' 'celebration']


## Stemming

It is also possible to use stemming (remove morphological affixes from words, leaving only the word stem) as a preprocessing step. 

In [18]:
# Initializing stemmer and countvectorizer 
stemmer = nltk.PorterStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

stemm_vectorizer = CountVectorizer(analyzer=stemmed_words)

# Transform X_train
X_train_stemm_vectorized = stemm_vectorizer.fit_transform(X_train)

In [19]:
# Train Logistic Regression
log_reg_stemm = LogisticRegression(max_iter=1500)
log_reg_stemm.fit(X_train_stemm_vectorized, y_train)

# Train SVC
svc_stemm = LinearSVC()
svc_stemm.fit(X_train_stemm_vectorized, y_train)

# Predict the transformed test documents
log_reg_predictions = log_reg_stemm.predict(stemm_vectorizer.transform(X_test))
svc_predictions = svc_stemm.predict(stemm_vectorizer.transform(X_test))

### Evaluation

In [20]:
print(f"LogisticRegression AUC = {round(roc_auc_score(y_test, log_reg_predictions),3)}")
print(f"SVC                AUC = {round(roc_auc_score(y_test, svc_predictions),3)}")

LogisticRegression AUC = 0.93
SVC                AUC = 0.951


Stemming improved the model.

### Lemmatization

The same way we used stemming we can also apply lemmatization (grouping together the inflected forms of a word) to the data.

In [21]:
# nltk.download('wordnet')

In [22]:
# Initialization
WNlemma = nltk.WordNetLemmatizer()
analyzer = CountVectorizer().build_analyzer()

def lemmatize_word(doc):
    return (WNlemma.lemmatize(t) for t in analyzer(doc))

lemm_vectorizer = CountVectorizer(analyzer=lemmatize_word)

# Transform X_train
X_train_lemm_vectorized = lemm_vectorizer.fit_transform(X_train)

In [23]:
# Train Logistic Regression
log_reg_lemm = LogisticRegression(max_iter=1500)
log_reg_lemm.fit(X_train_lemm_vectorized, y_train)

# Train SVC
svc_lemm = LinearSVC()
svc_lemm.fit(X_train_lemm_vectorized, y_train)

# Predict the transformed test documents
log_reg_predictions = log_reg_lemm.predict(lemm_vectorizer.transform(X_test))
svc_predictions = svc_lemm.predict(lemm_vectorizer.transform(X_test))

### Evaluation

In [24]:
print(f"LogisticRegression AUC = {round(roc_auc_score(y_test, log_reg_predictions),3)}")
print(f"SVC                AUC = {round(roc_auc_score(y_test, svc_predictions),3)}")

LogisticRegression AUC = 0.92
SVC                AUC = 0.938


Lemmatization did not work in this case. Maybe I need to research more lingustics. :)

# Conclusion

I was able to build a model (SVC with stemming) that scored an AUC of 0.951. But we need to keep in mind that the dataset was higly imbalanced. The tought we started with, that the word "winner" or "free" has a hig impact, has not been proofed jet. We just know that these are none of the top ten words. 