## **Introduction**

Given a dataset of Short Message Service (SMS) messages labeled as spam or not, the goal is to detect spam messages using Bidirectional Encoder Representations from Transformers (BERT) and several other Machine Learning (ML) classification algorithms.

Four classification algorithms will be compared:

- BERT

- K-Nearest Neighbors (KNN)

- Multinomial Naive Bayes

- Support Vector Machine (SVM)



### **Objectives**
To compare these algorithms, I will:

Do Feature Engineering: create the features according to the raw data.

Analyze and understand the data made available.

Pre-process these data according to the algorithm: for instance, some of these algorithms only work with numerical values.

Do Fine-Tuning: optimize the training parameters of the ML algorithm.

Compare the results obtained.

Take the best and/or simplest algorithm if there is no significant difference.

**Installing and Importing Packages**

In [None]:
!python -m pip install matplotlib
!python -m pip install nltk
!python -m pip install numpy
!python -m pip install pandas
!python -m pip install seaborn
!python -m pip install scikit-learn
!python -m pip install tensorflow
!python -m pip install transformers

In [None]:
import re

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from sklearn import feature_extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from transformers import TFTrainer, TFTrainingArguments

In [None]:
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

In [None]:
df = pd.read_csv("/content/sms_spam.csv", encoding="latin-1")
df.head(n=10)

In [None]:
df.describe()

**Preprocessing**

Binarize the label column: ham becomes 0 and spam becomes 1.

In [None]:
# Assign the result back: an in-place replace on a selected column may not update df.
df["type"] = df["type"].replace({"ham": 0, "spam": 1})
df.rename({"type": "is_spam", "text": "content"}, axis=1, inplace=True)
df.head(n=10)

The data is now clearer, so let's look at its shape:

In [None]:
df.shape

There are 5559 rows and 2 columns.

**Feature Engineering**

By Feature Engineering, we refer to the creation of the features according to the raw data. It is on the basis of these features that the training of the classification ML models will be done.

We will create the following features:

- nwords: the number of words in an SMS.

- message_len: the number of characters in an SMS.

- nupperchars: the number of uppercase characters in an SMS.

- nupperwords: the number of uppercase words in an SMS.

- is_free_or_win: 1 if the SMS contains the words "free" or "win"; 0 otherwise.

- is_url: 1 if the SMS contains a URL; 0 otherwise.


In [None]:
df["nwords"] = df["content"].apply(lambda s: len(re.findall(r"\w+", s)))
df["message_len"] = df["content"].apply(len)
df["nupperchars"] = df["content"].apply(
    lambda s: sum(1 for c in s if c.isupper())
)
df["nupperwords"] = df["content"].apply(
    lambda s: len(re.findall(r"\b[A-Z][A-Z]+\b", s))
)
df["is_free_or_win"] = df["content"].apply(
    lambda s: int("free" in s.lower() or "win" in s.lower())
)
df["is_url"] = df["content"].apply(
    lambda s: 1
    if re.search(
        r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+",
        s,
    )
    else 0
)
df.head(n=25)
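
To make the six features concrete, here is a minimal sketch that applies the same expressions to a single message. The helper name `extract_features` and the sample text are assumptions made for illustration; they are not part of the notebook.

```python
import re

# Same URL pattern as in the cell above.
URL_RE = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)


def extract_features(text):
    """Return the six hand-crafted features for one SMS (hypothetical helper)."""
    lowered = text.lower()
    return {
        "nwords": len(re.findall(r"\w+", text)),
        "message_len": len(text),
        "nupperchars": sum(1 for c in text if c.isupper()),
        "nupperwords": len(re.findall(r"\b[A-Z][A-Z]+\b", text)),
        "is_free_or_win": int("free" in lowered or "win" in lowered),
        "is_url": int(URL_RE.search(text) is not None),
    }


sample = "WIN a FREE phone now! Visit http://example.com"  # made-up message
features = extract_features(sample)
print(features)
```

Note that the substring test behind is_free_or_win also fires on words such as "winter", so this flag is best read as a rough signal rather than an exact word match.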

***Analyzing and Understanding the Data***

It is important to analyze and understand the available data. Indeed, once a better understanding of the dataset is achieved, we will be able to create the necessary features for the dataset.

We will analyze the following eight aspects:

**the SMS distribution**

**the word frequency in spam and ham SMS** 

**the length of spam SMS compared to ham SMS**

**the number of words in spam SMS compared to ham SMS**

**the number of uppercase words in spam SMS compared to ham SMS**

**the number of uppercase characters in spam SMS compared to ham SMS**

**the content of the words "free" or "win" in the SMS**

**the content of a URL in the SMS.**

***SMS Distribution***

In [None]:
n_sms = df["is_spam"].value_counts(sort=True)
n_sms.plot(kind="pie", labels=["ham", "spam"], autopct="%1.0f%%")

plt.title("SMS Distribution")
plt.ylabel("")
plt.show()

87% of the messages are ham and 13% are spam.

***Word Frequency***

We create two DataFrames, then plot the corresponding graphs:

df1: will contain the words and their frequency in the ham SMS.

df2: will contain the words and their frequency in the spam SMS.

In [None]:
from collections import Counter

# The 20 most frequent words in ham SMS, one row per (word, frequency) pair.
df1 = pd.DataFrame(
    Counter(" ".join(df[df["is_spam"] == 0]["content"]).split()).most_common(20),
    columns=["word_in_ham", "frequency"],
)

# The 20 most frequent words in spam SMS.
df2 = pd.DataFrame(
    Counter(" ".join(df[df["is_spam"] == 1]["content"]).split()).most_common(20),
    columns=["word_in_spam", "frequency"],
)

In [None]:
df1.plot.bar(legend=False)
plt.xticks(np.arange(len(df1["word_in_ham"])), df1["word_in_ham"])
plt.title("Word Frequency in Ham SMS.")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.show()

In [None]:
df2.plot.bar(legend=False, color="orange")
plt.xticks(np.arange(len(df2["word_in_spam"])), df2["word_in_spam"])
plt.title("Word Frequency in Spam SMS.")
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.show()


After sketching, we can see that stop words are the most frequent words in both spam and ham SMS.
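
The effect of stop words on such a ranking can be illustrated on a toy corpus. This is a minimal sketch: the three messages and the tiny stop list are made up for the example and stand in for the real dataset and NLTK's English stop list.

```python
from collections import Counter

# Tiny hypothetical stop list, for illustration only.
STOP_WORDS = {"i", "am", "to", "the"}

messages = [
    "i am going to the shop",
    "call to claim the prize",
    "free prize to the winner",
]

tokens = " ".join(messages).split()
raw_counts = Counter(tokens)
filtered_counts = Counter(t for t in tokens if t not in STOP_WORDS)

print(raw_counts.most_common(2))       # dominated by stop words
print(filtered_counts.most_common(1))  # a content word surfaces
```

Once the stop words are filtered out, the top of the ranking is made of content words, which is what motivates the stop-word removal done during preprocessing.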

***Length***

Let's see whether the length of the text differs between spam and ham SMS.

In [None]:
_, ax = plt.subplots(figsize=(10, 4))
# `fill` replaces the deprecated `shade` argument in recent seaborn versions.
sns.kdeplot(
    df.loc[df.is_spam == 0, "message_len"],
    fill=True,
    label="Ham",
    clip=(-50, 250),
)
sns.kdeplot(df.loc[df.is_spam == 1, "message_len"], fill=True, label="Spam")
ax.set(
    xlabel="Length",
    ylabel="Density",
    title="Length of SMS.",
)
ax.legend(loc="upper right")
plt.show()

We find that spam messages are longer than ham messages, which is expected given their higher number of words.
Spam messages have around 150 characters.

***Number of Words***

In [None]:
_, ax = plt.subplots(figsize=(10, 4))
# `fill` replaces the deprecated `shade` argument in recent seaborn versions.
sns.kdeplot(
    df.loc[df.is_spam == 0, "nwords"],
    fill=True,
    label="Ham",
    clip=(-10, 50),
)
sns.kdeplot(df.loc[df.is_spam == 1, "nwords"], fill=True, label="Spam")
ax.set(
    xlabel="Words",
    ylabel="Density",
    title="Number of Words in SMS.",
)
ax.legend(loc="upper right")

With this plot, we notice that spam SMS have more words than ham SMS.

Spam SMS seem to have around 30 words, whereas ham SMS seem to have around 10 to 25 words, sometimes more.

***Number of Uppercased Words***

In [None]:
_, ax = plt.subplots(figsize=(10, 4))
# `fill` replaces the deprecated `shade` argument in recent seaborn versions.
sns.kdeplot(
    df.loc[df.is_spam == 0, "nupperwords"],
    fill=True,
    label="Ham",
    clip=(0, 35),
)
sns.kdeplot(df.loc[df.is_spam == 1, "nupperwords"], fill=True, label="Spam")
ax.set(
    xlabel="Uppercased Words",
    ylabel="Density",
    title="Number of Uppercased Words.",
)
ax.legend(loc="upper right")
plt.show()

With this plot, we notice that there is a small pattern in the number of uppercased words. The density is lower, which is normal since there are fewer spam messages than ham messages.

We can also notice that the number of uppercased words is around zero for the ham messages.

***Number of Uppercase Characters***

In [None]:
_, ax = plt.subplots(figsize=(10, 5))
ax = sns.scatterplot(x="message_len", y="nupperchars", hue="is_spam", data=df)
ax.set(
    xlabel="Characters",
    ylabel="Uppercase Characters",
    title="Number of Uppercased Characters in SMS.",
)
ax.legend(loc="upper right")
plt.show()

We notice that spam messages are clustered together based on their length, but we can also see that some spam messages have more uppercased characters than others.
There is also a linear pattern among the ham messages that contain more uppercased characters than the others.

***Contains "free" or "win"***

In [None]:
_, ax = plt.subplots(figsize=(10, 4))
grouped_data = (
    df.groupby("is_spam")["is_free_or_win"]
    .value_counts(normalize=True)
    .rename("Percentage of Group")
    .reset_index()
)
print(grouped_data)

sns.barplot(
    x="is_spam",
    y="Percentage of Group",
    hue="is_free_or_win",
    data=grouped_data,
)
plt.show()

**36.94%** of spam SMS contain the words "free" or "win".

Only **2.69%** of ham SMS contain the words "free" or "win".

***Contains URL***

In [None]:
_, ax = plt.subplots(figsize=(10, 4))
grouped_data = (
    df.groupby("is_spam")["is_url"]
    .value_counts(normalize=True)
    .rename("Percentage of Group")
    .reset_index()
)
print(grouped_data)

sns.barplot(
    x="is_spam",
    y="Percentage of Group",
    hue="is_url",
    data=grouped_data,
)
plt.show()

**2.55%** of spam SMS contain a URL.

The remaining **97.45%** of spam SMS do not contain a URL.

## **Preprocessing Data**

Preprocessing prepares raw data so that it is suitable for a model.

Our dataset has some drawbacks:

- presence of stop words (e.g., so, is, a)

- presence of punctuation and digits

- words are not lemmatized


Since this dataset has a lot of abbreviations, we will not apply stemming, only lemmatization. Also, since an SMS is not a formal message, it may be wise to keep capital letters and abbreviations.


We will remove stop words, punctuation and digits, then lemmatize the remaining words:

In [None]:
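from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build these once instead of once per token.
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# Replace every run of non-letters (digits, punctuation) with a space.
df["content"] = df["content"].apply(lambda row: re.sub(r"[^a-zA-Z]+", " ", row))
# Split each SMS into tokens.
df["content"] = df["content"].apply(word_tokenize)
# Drop stop words; the comparison is case-sensitive, so capitalized forms are kept.
df["content"] = df["content"].apply(
    lambda row: [token for token in row if token not in stop_words]
)
# Lemmatize each token and join the tokens back into a single string.
df["content"] = df["content"].apply(
    lambda row: " ".join(lemmatizer.lemmatize(word) for word in row)
)
df.head(n=25)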

## **Creation of Training and Testing Datasets**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("is_spam", axis=1), df["is_spam"], stratify=df["is_spam"], test_size=0.2
)
print(f"Training data: {len(X_train)} (80%)")
print(f" Testing data: {len(X_test)} (20%)")
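
The `stratify` argument keeps the ham/spam ratio roughly the same in both parts. The sketch below shows the idea in plain Python on toy labels. The helper `stratified_split` is hypothetical, and scikit-learn's own allocation may differ in detail.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with a similar class ratio in each part."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 87 + [1] * 13  # toy labels with the 87% / 13% ham/spam ratio
train_idx, test_idx = stratified_split(labels)
spam_share_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), spam_share_test)
```

Without stratification, a random 20% sample of an imbalanced dataset could end up with noticeably more or less spam than the training part, which would distort the scores.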

### **BERT**

BERT is a bidirectional transformer pretrained using a combination of the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives on a large corpus comprising the Book Corpus and Wikipedia.

**Tokenization**

Tokenization will allow us to feed batches of sequences into the model at the same time, provided that these two conditions are met:

- the SMS are padded to the same length

- the SMS are truncated so that they are no longer than the model's maximum input length
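
These two conditions can be sketched on plain lists of token ids. This is a simplified illustration, not the tokenizer's actual implementation: the helper `pad_or_truncate` is hypothetical, the word ids are made up, and unlike the real tokenizer this version does not keep the closing [SEP] token when it truncates.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]       # truncate sequences that are too long
    mask = [1] * len(ids)              # 1 marks a real token
    n_pad = max_length - len(ids)      # pad sequences that are too short
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


short_sms = [101, 7, 8, 102]                  # [CLS] w1 w2 [SEP], made-up word ids
long_sms = [101, 1, 2, 3, 4, 5, 6, 102]

print(pad_or_truncate(short_sms, max_length=6))
print(pad_or_truncate(long_sms, max_length=6))
```

The attention mask tells the model which positions hold real tokens and which are padding, so that the padding does not influence the prediction.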

To do the tokenization of our datasets, we also need to choose a pre-trained model. For this dataset, the basic model (bert-base-uncased) will be sufficient:

In [None]:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer


PreTrainedTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

Before we can encode our datasets with BERT, it is important to decide on a maximum sentence length for padding/truncating to. This will allow us to have a better speed for training and evaluation.

To do this, we will perform one tokenization pass of the datasets in order to measure the maximum sentence length:

In [None]:
max_len = 0
for row in X_train["content"]:
    max_len = max(max_len, len(tokenizer.encode(row)))
print(f"Max sentence length (train): {max_len}")

max_len = 0
for row in X_test["content"]:
    max_len = max(max_len, len(tokenizer.encode(row)))
print(f"Max sentence length (test): {max_len}")

Since the longest sentence has 92 tokens in the training dataset and 93 tokens in the testing dataset, we will use a maximum length of 96 tokens for both datasets.



Based on this pre-trained model, the encodings for our training and testing datasets are generated as follows:

In [None]:
train_encodings = tokenizer(
    X_train["content"].tolist(),
    max_length=96,
    padding="max_length",
    truncation=True,
)
test_encodings = tokenizer(
    X_test["content"].tolist(),
    max_length=96,
    padding="max_length",
    truncation=True,
)

**Transformation of Labels and Encodings**

Before we can fine-tune and train our model, we must convert these encodings and labels into a TensorSliceDataset object, so that each key in the batch encoding corresponds to a named input of the model we are going to train:

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices(
    (dict(train_encodings), y_train)
)
test_dataset = tf.data.Dataset.from_tensor_slices(
    (dict(test_encodings), y_test)
)

We are now ready to fine-tune and train our BERT model!

**Fine-Tuning and Training**

Fine-tuning consists of adapting the pre-trained weights, and therefore the embeddings, to a specific task. Since we want embeddings tailored to a classification task, we train the model on our data for this task only. By contrast, if the pre-trained BERT model had to remain suitable for multiple tasks, fine-tuning would not be an option: the BERT embeddings would instead be generated as fixed features and passed to an independent classifier (e.g., RandomForest).

Using the TFTrainingArguments class from the huggingface/transformers package, the fine-tuning hyper-parameters can be set this way:

In [None]:
training_args = TFTrainingArguments(
    output_dir="/kaggle/working/sms/results/bert",
    num_train_epochs=8,
    per_device_train_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="/kaggle/working/sms/logs/bert",
    logging_steps=10,
)

With the training arguments and our datasets defined, the BERT model can be trained as follows:

In [None]:
from transformers import TFBertForSequenceClassification

with training_args.strategy.scope():
    model = TFBertForSequenceClassification.from_pretrained(
        "bert-base-uncased"
    )

trainer = TFTrainer(
    model=model, args=training_args, train_dataset=train_dataset
)
# Run the fine-tuning loop; creating the trainer alone does not train the model.
trainer.train()


In [None]:
trainer.save_model("/kaggle/working/sms/models/bert")
tokenizer.save_pretrained(training_args.output_dir)

**Measurement of Predictions**

Measuring how the SMS in our test dataset are predicted as spam or ham will allow us to check that the model is well trained.

**Predictions**

Our BERT model being trained, we can now use it to predict whether the SMS in our test dataset are spam or not:

In [None]:
preds, label_ids, metrics = trainer.predict(test_dataset)
preds[:5]

In [None]:
print(f"Test dataset size: {len(y_test)}")
print(f" Predictions size: {len(preds)}")


In [None]:
preds = np.argmax(preds, axis=1)
preds
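
The argmax step can be illustrated without TensorFlow, assuming `preds` holds one row of two raw scores (logits) per SMS, as the `axis=1` argument suggests. The values below are made up. The softmax is only there to show that turning scores into probabilities does not change which class wins.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [[2.1, -1.3], [-0.4, 1.7]]  # made-up scores: columns are (ham, spam)

# Index of the largest score in each row: 0 = ham, 1 = spam.
labels = [max(range(len(row)), key=row.__getitem__) for row in logits]
probs = [softmax(row) for row in logits]

print(labels)
print(probs)
```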

**Confusion Matrix**

Sketching the confusion matrix will allow us to measure the quality of the classification system:

In [None]:
plt.figure(figsize=(10, 4))

heatmap = sns.heatmap(
    data=pd.DataFrame(confusion_matrix(y_test, preds)),
    annot=True,
    fmt="d",
    cmap=sns.color_palette("Blues", 50),
)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), fontsize=14)
heatmap.yaxis.set_ticklabels(
    heatmap.yaxis.get_ticklabels(), rotation=0, fontsize=14
)

plt.title("Confusion Matrix")
plt.ylabel("Ground Truth")
plt.xlabel("Prediction")

In [None]:
print(f"Precision: {precision_score(y_test, preds) * 100:.3f}%")
print(f"   Recall: {recall_score(y_test, preds) * 100 :.3f}%")
print(f" Accuracy: {accuracy_score(y_test, preds) * 100:.3f}%")

### **KNN**

K-Nearest Neighbors (KNN) is an approach to data classification that estimates how likely a data point is to be a member of one group or the other depending on what group the data points nearest to it are in.
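
The voting idea can be sketched in a few lines of plain Python. This is an illustration only: the helper `knn_predict` is hypothetical, and the two-dimensional points are made up (they could be read as a word count and an uppercase word count), whereas the notebook applies scikit-learn's KNeighborsClassifier to TF-IDF vectors.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up training points: 0 = ham, 1 = spam.
points = [(8, 0), (10, 0), (12, 1), (28, 4), (30, 5), (27, 3)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, query=(29, 4)))
print(knn_predict(points, labels, query=(9, 0)))
```

The number of neighbors `k` is the hyper-parameter tuned by the grid search in this section: a small `k` follows local noise, while a large `k` lets the majority class dominate the vote.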

***Fine-Tuning and Training***

To tune the hyper-parameters of the KNN model, it is recommended to use grid search with cross-validation (see scikit-learn's documentation), which evaluates the performance of the model on the data for each candidate value.

Let's use this technique to train our model with the optimal value of the number-of-neighbors hyper-parameter:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = GridSearchCV(
    Pipeline(
        [
            ("bow", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
            ("clf", KNeighborsClassifier()),
        ]
    ),
    {
        "clf__n_neighbors": (8, 15, 20, 25, 40, 55),
    }
)
knn.fit(X=X_train["content"], y=y_train)

In [None]:
preds = knn.predict(X_test["content"])
preds


In [None]:
plt.figure(figsize=(10, 4))

heatmap = sns.heatmap(
    data=pd.DataFrame(confusion_matrix(y_test, preds)),
    annot=True,
    fmt="d",
    cmap=sns.color_palette("Blues", 50),
)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), fontsize=14)
heatmap.yaxis.set_ticklabels(
    heatmap.yaxis.get_ticklabels(), rotation=0, fontsize=14
)

plt.title("Confusion Matrix")
plt.ylabel("Ground Truth")
plt.xlabel("Prediction")

Through the confusion matrix, we have:

- 959 ham SMS were correctly predicted as ham: True Negative (TN)

- 54 spam SMS were detected as ham: False Negative (FN)

- 4 ham SMS were detected as spam: False Positive (FP)

- 95 spam SMS were correctly detected as spam: True Positive (TP).
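
The three scores printed next follow directly from these four counts. As a quick check, here is a sketch that recomputes them by hand using the standard definitions, with spam as the positive class:

```python
# Counts read from the KNN confusion matrix above.
tn, fn, fp, tp = 959, 54, 4, 95

precision = tp / (tp + fp)                   # share of predicted spam that is spam
recall = tp / (tp + fn)                      # share of actual spam that was caught
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all SMS classified correctly

print(f"Precision: {precision * 100:.3f}%")
print(f"   Recall: {recall * 100:.3f}%")
print(f" Accuracy: {accuracy * 100:.3f}%")
```

With these counts, the precision is high while the recall is much lower: KNN rarely flags a ham SMS as spam, but it misses a sizeable share of the spam.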

In [None]:
print(f"Precision: {precision_score(y_test, preds) * 100:.3f}%")
print(f"   Recall: {recall_score(y_test, preds) * 100 :.3f}%")
print(f" Accuracy: {accuracy_score(y_test, preds) * 100:.3f}%")

### **Multinomial Naive Bayes Classifier**

As the features of our dataset are discrete frequency counts, we will use the Multinomial variant of the Naive Bayes model.

To detect whether an SMS is spam or not, the Multinomial Naive Bayes classifier uses the word counts in the content of the SMS, obtained with the Bag-of-Words (BoW) method. This method builds a matrix with one row per SMS and one column per word, where each cell holds the number of occurrences of that word in that SMS.
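
To show how word counts turn into a prediction, here is a minimal sketch of a Multinomial Naive Bayes classifier with additive smoothing. It is written from the textbook definition, not taken from scikit-learn. The helpers `train_mnb` and `predict_mnb` and the four toy messages are assumptions made for the example.

```python
import math
from collections import Counter


def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        total = sum(counts.values()) + alpha * len(vocab)
        # alpha keeps unseen words from getting a zero probability
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood


def predict_mnb(model, doc):
    """Return the class with the highest log posterior; unknown words are ignored."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


docs = ["free prize win", "win cash now", "see you later", "call me later"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham
model = train_mnb(docs, labels)

print(predict_mnb(model, "free cash"))
print(predict_mnb(model, "see you"))
```

The `alpha` argument plays the same role as the smoothing hyper-parameter tuned by the grid search in this section.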

***Fine-Tuning and Training***

As we said before, let's use grid search techniques using cross-validation to determine the optimal value of the  𝛼  hyper-parameter of our model and train this model on this hyper-parameter:

In [None]:
from sklearn.naive_bayes import MultinomialNB

mnbayes = GridSearchCV(
    Pipeline(
        [
            ("bow", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
            ("clf", MultinomialNB()),
        ]
    ),
    {
        "tfidf__use_idf": (True, False),
        "clf__alpha": (0.1, 1e-2, 1e-3),
        "clf__fit_prior": (True, False),
    },
)
mnbayes.fit(X=X_train["content"], y=y_train)

In [None]:
mnbayes.best_params_


In [None]:
print(f"{mnbayes.best_score_ * 100:.3f}%") 

The mean cross-validated score is therefore 98.223%

***Predictions***

Our Multinomial Naive Bayes model being trained, we can now use it to predict whether the SMS in our test dataset are spam or not:

In [None]:
preds = mnbayes.predict(X_test["content"])
preds


In [None]:
plt.figure(figsize=(10, 4))

heatmap = sns.heatmap(
    data=pd.DataFrame(confusion_matrix(y_test, preds)),
    annot=True,
    fmt="d",
    cmap=sns.color_palette("Blues", 50),
)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), fontsize=14)
heatmap.yaxis.set_ticklabels(
    heatmap.yaxis.get_ticklabels(), rotation=0, fontsize=14
)

plt.title("Confusion Matrix")
plt.ylabel("Ground Truth")
plt.xlabel("Prediction")

Through the confusion matrix, we have:

- 961 ham SMS were correctly predicted as ham: True Negative (TN);
- 16 spam SMS were detected as ham: False Negative (FN);
- 2 ham SMS were detected as spam: False Positive (FP);
- 133 spam SMS were correctly detected as spam: True Positive (TP).

In [None]:
print(f"Precision: {precision_score(y_test, preds) * 100:.3f}%")
print(f"   Recall: {recall_score(y_test, preds) * 100 :.3f}%")
print(f" Accuracy: {accuracy_score(y_test, preds) * 100:.3f}%")

### **SVM**

***Fine-Tuning and Training***

As we said before, let's use grid search techniques using cross-validation to determine the hyper-parameters of our model and train this model on them:

In [None]:
from sklearn.svm import SVC

svc = GridSearchCV(
    Pipeline(
        [
            ("bow", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
            ("clf", SVC(gamma="auto", C=1000)),
        ]
    ),
    dict(tfidf=[None, TfidfTransformer()], clf__C=[500, 1000, 1500]),
)
svc.fit(X=X_train["content"], y=y_train)

In [None]:
svc.best_params_

For our training dataset, the grid search selects 𝐶 = 1000 and shows that we should not transform the count matrix into a normalized term-frequency (tf) or term-frequency times inverse document-frequency (tf-idf) representation.
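
To make the difference between raw counts and tf-idf concrete, here is a small sketch of the weighting on a toy count matrix. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which to my understanding matches the defaults of scikit-learn's TfidfTransformer. The helper `tfidf` and the matrix are made up for the example.

```python
import math


def tfidf(count_matrix):
    """Apply smoothed idf weighting and L2 row normalization to a count matrix."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_matrix:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result


# Columns: "free", "prize", "call"
counts = [
    [1, 1, 0],  # "free prize"
    [1, 0, 1],  # "free call"
]
weights = tfidf(counts)
print(weights)
```

In this toy case "free" appears in both messages and ends up with a lower weight than "prize" or "call". The grid search result above suggests that, for the SVM on this dataset, the raw counts worked at least as well as this reweighting.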

In addition, we can get the mean cross-validated score of the estimator that was chosen by the search:

In [None]:
print(f"{svc.best_score_ * 100:.3f}%") 

The mean cross-validated score is therefore 98.111%

**Measurement of Predictions**

**Predictions**

Our SVM model being trained, we can now use it to predict whether the SMS in our test dataset are spam or not:

In [None]:
preds = svc.predict(X_test["content"])
preds

**Confusion Matrix**

The confusion matrix lets us measure the quality of the classification system:

In [None]:
plt.figure(figsize=(10, 4))

heatmap = sns.heatmap(
    data=pd.DataFrame(confusion_matrix(y_test, preds)),
    annot=True,
    fmt="d",
    cmap=sns.color_palette("Blues", 50),
)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), fontsize=14)
heatmap.yaxis.set_ticklabels(
    heatmap.yaxis.get_ticklabels(), rotation=0, fontsize=14
)

plt.title("Confusion Matrix")
plt.ylabel("Ground Truth")
plt.xlabel("Prediction")

Through the confusion matrix, we have:

- 960 ham SMS were correctly predicted as ham: True Negative (TN);
- 16 spam SMS were detected as ham: False Negative (FN);
- 3 ham SMS were detected as spam: False Positive (FP);
- 133 spam SMS were correctly detected as spam: True Positive (TP).

**Scores**

Let's look at the scores obtained by the predictions:

In [None]:
print(f"Precision: {precision_score(y_test, preds) * 100:.3f}%")
print(f"   Recall: {recall_score(y_test, preds) * 100 :.3f}%")
print(f" Accuracy: {accuracy_score(y_test, preds) * 100:.3f}%")