# Introduction

This Notebook is **mainly for educational purposes**. Using a dataset that labels Short Message Service (SMS) messages as spam or not, the goal is to compare Bidirectional Encoder Representations from Transformers (BERT) with other Machine Learning (ML) classification algorithms.

BERT will be compared with the following four ML classification algorithms:

1.   **DistilBERT**;
2.   **K-Nearest Neighbors (KNN)**;
3.   **Multinomial Naive Bayes**;
4.   **Support Vector Machine (SVM)**.

We had the opportunity to test those algorithms with [another dataset](https://www.kaggle.com/rememberyou/comparison-of-bert-and-other-ml-classification-ii) containing SMS samples as well.

# Objectives

To compare these algorithms, we will:

*   **Do Feature Engineering**: create the features according to the raw data.
*   **Analyze and understand the data** made available.
*   **Pre-process these data according to the algorithm**: for instance, some of these algorithms only work with numerical values.
*   **Do Fine-Tuning**: optimize the training parameters of the ML algorithm.
*   **Compare the results** obtained.
*   **Apply the Ockham's razor principle**: take the best and/or simplest algorithm if there is no significant difference.

# Installing and Importing Packages


Using the `pip` Python package manager, let's install all the necessary packages for this Notebook:

In [None]:
!python -m pip install matplotlib
!python -m pip install nltk
!python -m pip install numpy
!python -m pip install pandas
!python -m pip install seaborn
!python -m pip install scikit-learn
!python -m pip install tensorflow
!python -m pip install transformers

Now that the necessary packages are installed, let's import most of the packages used in this Notebook:

In [None]:
import re

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from sklearn import feature_extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from transformers import TFTrainer, TFTrainingArguments

With NLTK's data downloader, we will install the following corpora and trained models:
*   `punkt`: Punkt Tokenizer Models.
*   `stopwords`: Stopwords Corpus.
*   `wordnet`: WordNet corpus (used by the lemmatizer).

In [None]:
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

# Dataset

For this Notebook, the [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset) has been chosen. The main reasons for this choice are that it contains several thousand labeled SMS, and that its columns are suitable for the comparison of ML algorithms we want to make.

In this section we will load the dataset and apply minor preprocessing to the columns and values to make it easier to use.

## Loading

Let's start by loading our dataset and looking at the columns available to us:

In [None]:
df = pd.read_csv(
    "../input/sms-spam-collection-dataset/spam.csv", encoding="latin-1"
)
df.head(n=10)

After loading this dataset, you can directly see some modifications to be made:

*   **three** "Unnamed" **columns can be deleted**;
*   **the spam column** (`v1`) **and the SMS content column** (`v2`) **can be renamed** to be more explicit;
*   **the content of the spam column** (`v1`) **can be binarized** to make it easier for ML algorithms to process.



## Preprocessing

To make the dataset easier to use, a first pre-processing step is to apply the modifications mentioned above:

In [None]:
df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], inplace=True, axis=1)
df.rename({"v1": "is_spam", "v2": "content"}, axis=1, inplace=True)
df["is_spam"] = df["is_spam"].map({"ham": 0, "spam": 1})
df.head(n=10)

We can see that the dataset has become much clearer.

Finally, to get a better idea of the amount of data made available, we can look at the shape of the DataFrame that defines the dataset:

In [None]:
df.shape

So we have a **dataset that contains 5572 rows and 2 columns**.

# Feature Engineering

By Feature Engineering, we **refer to the creation of the features according to the raw data**. It is on the basis of these features that the training of the classification ML models will be done.

Among these features, we will create these:

*   `nwords`: feature that will **contain the number of words** in an SMS.
*   `message_len`: feature that will **contain the number of characters** in an SMS message.
*   `nupperchars`: feature that will **contain the number of uppercase characters** in an SMS.
*   `nupperwords`: feature that will **contain the number of uppercase words** in an SMS.
*   `is_free_or_win`: feature that will **contain 1 if the SMS contains "free" or "win" (matched as substrings, case-insensitively); 0 otherwise**.
*   `is_url`: feature that will **contain 1 if the SMS contains a URL; 0 otherwise**.

This translates as follows:

In [None]:
df["nwords"] = df["content"].apply(lambda s: len(re.findall(r"\w+", s)))
df["message_len"] = df["content"].apply(len)
df["nupperchars"] = df["content"].apply(
    lambda s: sum(1 for c in s if c.isupper())
)
df["nupperwords"] = df["content"].apply(
    lambda s: len(re.findall(r"\b[A-Z][A-Z]+\b", s))
)
df["is_free_or_win"] = df["content"].apply(
    lambda s: int("free" in s.lower() or "win" in s.lower())
)
df["is_url"] = df["content"].apply(
    lambda s: 1
    if re.search(
        r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+",
        s,
    )
    else 0
)
df.head(n=25)

We can see that the columns corresponding to our features have been created.
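
To make these features concrete, here is a minimal sketch that applies the same expressions to a single message. The `extract_features` helper and the sample messages are hypothetical and only serve as an illustration. Note that `"win"` is matched as a substring, so a word such as "winter" also sets `is_free_or_win`.

```python
import re

URL_PATTERN = (
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)


def extract_features(sms: str) -> dict:
    """Compute the hand-crafted features of this section for one SMS."""
    lowered = sms.lower()
    return {
        "nwords": len(re.findall(r"\w+", sms)),
        "message_len": len(sms),
        "nupperchars": sum(1 for c in sms if c.isupper()),
        "nupperwords": len(re.findall(r"\b[A-Z][A-Z]+\b", sms)),
        "is_free_or_win": int("free" in lowered or "win" in lowered),
        "is_url": int(bool(re.search(URL_PATTERN, sms))),
    }


# Hypothetical messages, not taken from the dataset.
sample = "FREE entry! Text WIN to 80082 or visit http://example.com now"
features = extract_features(sample)
print(features)

# Substring matching: "winter" contains "win".
print(extract_features("See you this winter"))
```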

# Analyzing and Understanding Data

Above all, it is **important to analyze and understand the data** made available. Indeed, a better understanding of the dataset **shows which features are useful** for the classification.

Now that the dataset is loaded, we will analyze the following eight aspects:

1.   **the SMS distribution**;
2.   **the word frequency in spam and ham SMS**;
3.   **the length of spam SMS compared to ham SMS**;
4.   **the number of words in spam SMS compared to ham SMS**;
5.   **the number of uppercase words in spam SMS compared to ham SMS**;
6.   **the number of uppercase characters in spam SMS compared to ham SMS**;
7.   **the presence of the words "free" or "win" in the SMS**;
8.   **the presence of a URL in the SMS**.

## SMS Distribution

Now that we know a bit more about the organization of the dataset, it is good to know the percentage of spam SMS and ham SMS:

In [None]:
n_sms = df["is_spam"].value_counts(sort=True)
n_sms.plot(kind="pie", labels=["ham", "spam"], autopct="%1.0f%%")

plt.title("SMS Distribution")
plt.ylabel("")
plt.show()

Above, **87% of these SMS are ham and 13% of them are spam**.

## Word Frequency

Let's start by creating two DataFrames: 

1.   `df1`: will contain the words and their frequency in the ham SMS.
2.   `df2`: will contain the words and their frequency in the spam SMS.


In [None]:
from collections import Counter

# The 20 most frequent whitespace-separated words in ham SMS.
ham_words = " ".join(df[df["is_spam"] == 0]["content"]).split()
df1 = pd.DataFrame(
    Counter(ham_words).most_common(20), columns=["word_in_ham", "frequency"]
)

# The 20 most frequent whitespace-separated words in spam SMS.
spam_words = " ".join(df[df["is_spam"] == 1]["content"]).split()
df2 = pd.DataFrame(
    Counter(spam_words).most_common(20), columns=["word_in_spam", "frequency"]
)

Now that the DataFrames have been created, let's plot them in order to look at their respective word frequencies:

In [None]:
df1.plot.bar(legend=False)
plt.xticks(np.arange(len(df1["word_in_ham"])), df1["word_in_ham"])
plt.title("Word Frequency in Ham SMS.")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.show()

In [None]:
df2.plot.bar(legend=False, color="orange")
plt.xticks(np.arange(len(df2["word_in_spam"])), df2["word_in_spam"])
plt.title("Word Frequency in Spam SMS.")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.show()

From these plots, we can see that stop words are the most frequent words in both spam and ham SMS.
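
As a small sketch of why stop words dominate such counts, the example below counts words in a hypothetical mini-corpus with and without a hypothetical stop-word list. Neither comes from the dataset or from NLTK. Once the stop words are filtered out, the more informative words rise to the top.

```python
from collections import Counter

# Hypothetical mini-corpus and stop-word list, for illustration only.
messages = [
    "you have won a free prize call now",
    "are you free to call me now",
    "call to claim a prize",
]
stop_words = {"you", "have", "a", "are", "to", "me", "now"}

tokens = " ".join(messages).split()
raw_counts = Counter(tokens)
filtered_counts = Counter(t for t in tokens if t not in stop_words)

print(raw_counts.most_common(5))
print(filtered_counts.most_common(5))
```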

## Length

Let's see whether the length of an SMS differs between spam and ham:

In [None]:
_, ax = plt.subplots(figsize=(10, 4))
sns.kdeplot(
    df.loc[df.is_spam == 0, "message_len"],
    fill=True,
    label="Ham",
    clip=(-50, 250),
)
sns.kdeplot(df.loc[df.is_spam == 1, "message_len"], fill=True, label="Spam")
ax.set(
    xlabel="Length",
    ylabel="Density",
    title="Length of SMS.",
)
ax.legend(loc="upper right")
plt.show()

With this plot, we notice two things:
1.   In general, **spam** messages are **longer** than **ham** messages (which is expected, since they also contain more words).
2.   **Spam** messages are typically around **150 characters** long.



## Number of Words

Let's see whether the number of words differs between spam and ham SMS:

In [None]:
_, ax = plt.subplots(figsize=(10, 4))
sns.kdeplot(
    df.loc[df.is_spam == 0, "nwords"],
    fill=True,
    label="Ham",
    clip=(-10, 50),
)
sns.kdeplot(df.loc[df.is_spam == 1, "nwords"], fill=True, label="Spam")
ax.set(
    xlabel="Words",
    ylabel="Density",
    title="Number of Words in SMS.",
)
ax.legend(loc="upper right")

With this plot, we can notice that **spam** SMS have more words than **ham** SMS.

**Spam** SMS seem to have around **30 words**, whereas most **ham** SMS seem to have around **10 words**, with some reaching **25 words** or more.

## Number of Uppercased Words

Let's see whether the number of uppercased words differs between spam and ham SMS:

In [None]:
_, ax = plt.subplots(figsize=(10, 4))
sns.kdeplot(
    df.loc[df.is_spam == 0, "nupperwords"],
    fill=True,
    label="Ham",
    clip=(0, 35),
)
sns.kdeplot(df.loc[df.is_spam == 1, "nupperwords"], fill=True, label="Spam")
ax.set(
    xlabel="Uppercased Words",
    ylabel="Density",
    title="Number of Uppercased Words.",
)
ax.legend(loc="upper right")

With this plot, we can notice a small pattern in the number of **uppercased words**. The **spam density is lower** and more spread out. Since each curve is normalized on its own, this reflects a wider range of values among **spam** messages rather than the smaller number of **spam** messages.

We can also notice that the number of **uppercased words** is around **zero** for the **ham** messages.

## Number of Uppercased Characters

Let's see whether the number of uppercased characters differs between spam and ham SMS:

In [None]:
_, ax = plt.subplots(figsize=(10, 5))
ax = sns.scatterplot(x="message_len", y="nupperchars", hue="is_spam", data=df)
ax.set(
    xlabel="Characters",
    ylabel="Uppercase Characters",
    title="Number of Uppercased Characters in SMS.",
)
ax.legend(loc="upper right")
plt.show()

With this plot, we can notice two things:
1.   **Spam** messages are clustered together based on their length. But we can also see that some **spam** messages have more uppercased characters than others.
2.   There is a linear pattern for the **ham** messages that contain more uppercased characters than the others.



## Contains "free" or "win"

Let's see whether the presence of "free" or "win" differs between spam and ham SMS:

In [None]:
_, ax = plt.subplots(figsize=(10, 4))
grouped_data = (
    df.groupby("is_spam")["is_free_or_win"]
    .value_counts(normalize=True)
    .rename("Percentage of Group")
    .reset_index()
)
print(grouped_data)

ax.set(
    title="Distribution of FREE/WIN Words Between Spam and Ham"
)

sns.barplot(
    x="is_spam",
    y="Percentage of Group",
    hue="is_free_or_win",
    data=grouped_data,
)
plt.show()

With this plot, we can notice two things:
1.   **36.94% of spam** SMS contain the words **"free"** or **"win"**.
2.   Only **2.69% of ham** SMS contain the words **"free"** or **"win"**.



## Contains URL

Let's see whether the presence of a URL differs between spam and ham SMS:

In [None]:
_, ax = plt.subplots(figsize=(10, 4))
grouped_data = (
    df.groupby("is_spam")["is_url"]
    .value_counts(normalize=True)
    .rename("Percentage of Group")
    .reset_index()
)
print(grouped_data)

ax.set(
    title="Distribution of URL Between Spam and Ham"
)

sns.barplot(
    x="is_spam",
    y="Percentage of Group",
    hue="is_url",
    data=grouped_data,
)
plt.show()

With this plot, we can notice two things:
1.   only **2.55% of spam** SMS contain a URL;
2.   the remaining **97.45% of spam** SMS do not contain a URL.

# Preprocessing Data

In ML, preprocessing is the process of preparing raw data to make them suitable for an ML model.

From a semantic point of view, our dataset has some drawbacks for an ML model:

*   **presence of stop words** (e.g., so, is, a);
*   **presence of punctuation and digits**;
*   **words are not lemmatized.**

Since this dataset has a lot of abbreviations, we will not apply stemming, but only lemmatization.

As a quick reminder:

*   **Stemming**: NLP algorithm that **cuts the end or the beginning of a word** based on a list of common prefixes and suffixes that can be found in an inflected word (e.g., `Stemming[change, changing, changes]` ➡️ chang).
*   **Lemmatization**: NLP algorithm that **looks at the morphological analysis of words** based on detailed dictionaries, in order to relate the shape of a word to its lemma (e.g., `Lemmatization[change, changing, changes]` ➡️ change).
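
The difference between the two can be sketched with a deliberately simplified toy example. The suffix list and the small lemma dictionary below are hypothetical, and NLTK's stemmers and WordNet lemmatizer are far more elaborate. The point is only that stemming cuts characters blindly, while lemmatization looks words up.

```python
# Toy illustration only, not NLTK's algorithms.
SUFFIXES = ("ing", "es", "e", "s")  # hypothetical suffix list


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, with no knowledge of the language."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical dictionary mapping inflected forms to their lemma.
LEMMAS = {"change": "change", "changing": "change", "changes": "change"}


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)


words = ["change", "changing", "changes"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```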

As a second pre-processing of these data, let's remove the stop words, punctuation and digits from each SMS, without forgetting to apply lemmatization to them:

In [None]:
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build these once instead of once per token.
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# Replace every run of non-letters (digits, punctuation) with a space.
df["content"] = df["content"].apply(
    lambda row: re.sub(r"[^a-zA-Z]+", " ", row)
)
df["content"] = df["content"].apply(word_tokenize)
# Drop stop words; the match is case-sensitive, so capitalized tokens are kept.
df["content"] = df["content"].apply(
    lambda row: [token for token in row if token not in stop_words]
)
df["content"] = df["content"].apply(
    lambda row: " ".join(lemmatizer.lemmatize(word) for word in row)
)
df.head(n=25)

More pre-processing could still be considered (e.g., recovering the original words from abbreviations), but since an SMS is not a formal message, it may be wise to keep capital letters and abbreviations.

# Creation of Training and Testing Datasets

Before being able to train our model, it is necessary to split our dataset into a training and testing dataset:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df["content"], df["is_spam"], stratify=df["is_spam"], test_size=0.2
)

Our dataset, initially composed of 5572 rows, is now split into two smaller datasets according to the following proportions:

In [None]:
print(f"Training data: {len(X_train)} (80%)")
print(f" Testing data: {len(X_test)} (20%)")
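
The `stratify` argument used above keeps the ham/spam ratio the same in both parts. As a rough sketch of the idea (a simplified, hypothetical `stratified_split` helper, not scikit-learn's implementation), each class can be shuffled and split separately:

```python
import random
from collections import Counter


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), keeping each class's share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical labels with roughly the 87% / 13% ham/spam imbalance seen above.
labels = [0] * 870 + [1] * 130
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

In this sketch, spam makes up 13% of both the training and the testing indices, which a purely random split would not guarantee.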

**NOTE:** in some use cases, it can be useful to split the training dataset again in order to create a validation dataset. The validation dataset is useful when we want to stop training a model once a certain score is reached, to avoid overfitting. In our case, it may be preferable to use the whole training dataset to train the model and achieve better accuracy.


# BERT

BERT is a bidirectional transformer pretrained using a combination of a Masked Language Modeling (MLM) objective and a Next Sentence Prediction (NSP) objective on a large corpus comprising the Book Corpus and Wikipedia.

## Tokenization

Tokenization will allow us to feed batches of sequences into the model at the same time, only if these two conditions are met:

1.   **the SMS are padded to the same length**;
2.   **the SMS are truncated to be no longer than the model's maximum input length**.
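
These two conditions can be illustrated with a minimal sketch. The `pad_or_truncate` helper and the token ids are hypothetical, and a pad id of 0 is assumed. The real tokenizer does this itself and, unlike this sketch, keeps the closing special token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical token ids; real ids come from the tokenizer's vocabulary.
short_ids = [101, 2489, 4443, 102]
long_ids = [101, 7, 8, 9, 10, 11, 12, 102]
for ids in (short_ids, long_ids):
    print(pad_or_truncate(ids, max_length=6))
```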

To do the tokenization of our datasets, we also need to choose a pre-trained model. 
For this dataset, the basic model (`bert-base-uncased`) will be sufficient:


In [None]:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer

Before we can encode our datasets with BERT, **it is important to decide on the maximum sentence length to pad or truncate to**. Keeping this length as small as possible speeds up training and evaluation.

To do this, we will perform one tokenization pass of the datasets in order to measure the maximum sentence length:

In [None]:
max_len = 0
for row in X_train:
    max_len = max(max_len, len(tokenizer.encode(row)))
print(f"Max sentence length (train): {max_len}")

max_len = 0
for row in X_test:
    max_len = max(max_len, len(tokenizer.encode(row)))
print(f"Max sentence length (test): {max_len}")

Since the maximum sentence length is 93 tokens for the training dataset and 93 tokens for the testing dataset, **we will take a maximum length of 96 tokens for both datasets**.

Based on this pre-trained model, the encodings for our training and testing  datasets are generated as follows:

In [None]:
train_encodings = tokenizer(
    X_train.tolist(),
    max_length=96,
    padding="max_length",
    truncation=True,
)
test_encodings = tokenizer(
    X_test.tolist(),
    max_length=96,
    padding="max_length",
    truncation=True,
)

## Transformation of Labels and Encodings

Before we can fine-tune and train our model, we must convert these encodings and their labels into a `TensorSliceDataset` object, so that each key in the batch encoding corresponds to a named input of the model we are going to train (e.g., `input_ids`, `attention_mask`):

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices(
    (dict(train_encodings), y_train)
)
test_dataset = tf.data.Dataset.from_tensor_slices(
    (dict(test_encodings), y_test)
)

We are now ready to fine-tune and train our BERT model!

## Fine-Tuning and Training

**Fine-tuning consists of adapting the pre-trained weights, and hence the embeddings, to a specific task**. Since we would like embeddings tailored to a classification task, we will train the model on our data for this task only. However, when a single pre-trained BERT model has to serve multiple tasks, fine-tuning it for one of them is not an option. In that case, the BERT embeddings are generated as features and passed through an independent classifier (e.g., RandomForest).

Using the `TFTrainingArguments` class from the `huggingface/transformers` library, the fine-tuning parameters can be defined this way:

In [None]:
training_args = TFTrainingArguments(
    output_dir="/kaggle/working/sms/results/bert",
    num_train_epochs=8,
    per_device_train_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="/kaggle/working/sms/logs/bert",
    logging_steps=10,
)
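
To put `warmup_steps=500` in perspective, the number of optimization steps can be estimated from the values used in this Notebook. This is back-of-the-envelope arithmetic, assuming a single device, no gradient accumulation, and a last partial batch that counts as one step:

```python
import math

n_rows = 5572                     # dataset size reported above
n_test = math.ceil(n_rows * 0.2)  # assumption: the test part is rounded up
n_train = n_rows - n_test
batch_size = 16                   # per_device_train_batch_size
epochs = 8                        # num_train_epochs
warmup_steps = 500

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(f"Training rows: {n_train}")
print(f"Steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"Warmup share of training: {warmup_steps / total_steps:.1%}")
```

Under these assumptions, the warmup covers a bit more than one fifth of the whole training run.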

With the fine-tuning parameters and our datasets defined, the BERT model can be trained as follows:

In [None]:
from transformers import TFBertForSequenceClassification

with training_args.strategy.scope():
    model = TFBertForSequenceClassification.from_pretrained(
        "bert-base-uncased"
    )

trainer = TFTrainer(
    model=model, args=training_args, train_dataset=train_dataset
)
trainer.train()

**NOTE:** you can ignore the warnings.

Now that this model is trained, let's save it along with its configuration, to be able to load it directly when needed.

In [None]:
trainer.save_model("/kaggle/working/sms/models/bert")
tokenizer.save_pretrained(training_args.output_dir)

## Measurement of Predictions

Measuring how well the SMS in our test dataset are predicted as spam or ham allows us to make sure that the model is well trained.

### Predictions

Now that our BERT model is trained, we can use it to predict whether the SMS in our test dataset are spam or not:

In [None]:
preds, label_ids, metrics = trainer.predict(test_dataset)
preds[:5]

It should be noted above that we have what are called **logits**, i.e., raw, non-normalized scores that can take any real value ($n \in (-\infty, \infty)$).

Another thing we have to be careful of is that no additional predictions have been generated:

In [None]:
print(f"Test dataset size: {len(y_test)}")
print(f" Predictions size: {len(preds)}")

That is the case here: we have 13 additional predictions. Let's make sure to delete them:

In [None]:
preds = preds[: len(y_test)]
len(preds)

These extra predictions are due to a `huggingface/transformers` bug, which should be fixed in upcoming releases.

### Normalization

To get rid of these logits, the vector of raw (non-normalized) predictions generated by the classification model should be passed to a normalization function that converts logits to probabilities. The model outputs two logits per SMS (one per class), so the matching function is `softmax`, which for two classes is equivalent to applying `sigmoid` to the difference between the two logits. The probabilities are then converted into final predictions by taking the label for which the probability is highest.

Since these functions preserve the order of the logits, the `argmax` function from `numpy` lets us kill two birds with one stone by picking the label directly from the logits:

In [None]:
preds = np.argmax(preds, axis=1)
preds

**NOTE:** in a multi-class classification problem, these logits would be normalized with a `softmax` function.
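
The reason `argmax` can be applied directly to the logits is that normalization does not change their order. Here is a small sketch with hypothetical logits (the real values come from `trainer.predict`), which also checks that, with two classes, `softmax` gives the same spam probability as `sigmoid` applied to the difference of the two logits:

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtracted for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits in (ham, spam) order.
batch = [[3.2, -2.9], [-1.5, 2.4], [0.3, 0.1]]
for logits in batch:
    probs = softmax(logits)
    p_spam = sigmoid(logits[1] - logits[0])
    print(logits, [round(p, 3) for p in probs], round(p_spam, 3), argmax(logits))
```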

### Confusion Matrix

Plotting the confusion matrix allows us to measure the quality of the classification system:

In [None]:
plt.figure(figsize=(10, 4))

heatmap = sns.heatmap(
    data=pd.DataFrame(confusion_matrix(y_test, preds)),
    annot=True,
    fmt="d",
    cmap=sns.color_palette("Blues", 50),
)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), fontsize=14)
heatmap.yaxis.set_ticklabels(
    heatmap.yaxis.get_ticklabels(), rotation=0, fontsize=14
)

plt.title("Confusion Matrix")
plt.ylabel("Ground Truth")
plt.xlabel("Prediction")

Above, each row of the confusion matrix corresponds to a real class and each column corresponds to an estimated class.

Through the confusion matrix, we have:

*  **966 ham SMS were correctly predicted as ham**: True Negative (TN);
*  **0 ham SMS were detected as spam**: False Positive (FP);
*  **1 spam SMS was detected as ham**: False Negative (FN);
*  **148 spam SMS were detected as spam**: True Positive (TP).

### Scores

Let's look at the score obtained by the predictions.

As a quick reminder:

1.   **Precision:** is the ratio between the True Positives and all the predicted Positives, i.e., TP / (TP + FP).
2.   **Recall:** is the ratio between the True Positives and all the actual Positives, i.e., TP / (TP + FN).
3.   **Accuracy:** is the ratio between the number of correct predictions and the total number of predictions, i.e., (TP + TN) / (TP + TN + FP + FN).

Which gives us:

In [None]:
print(f"Precision: {precision_score(y_test, preds) * 100:.3f}%")
print(f"   Recall: {recall_score(y_test, preds) * 100 :.3f}%")
print(f" Accuracy: {accuracy_score(y_test, preds) * 100:.3f}%")
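
As a sanity check, the three scores can also be computed by hand from the counts read off the BERT confusion matrix above. These counts come from one particular random split, so a new run may give slightly different values:

```python
# Counts read from the BERT confusion matrix above.
tn, fp, fn, tp = 966, 0, 1, 148

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision * 100:.3f}%")
print(f"   Recall: {recall * 100:.3f}%")
print(f" Accuracy: {accuracy * 100:.3f}%")
```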

# DistilBERT


DistilBERT is a distilled version of BERT, which is smaller, faster, cheaper and lighter. This variant should have performance close to BERT.

## Tokenization

As for BERT, let's tokenize our dataset so that we can feed batches of sequences into the model at the same time.

For this dataset, the basic model (`distilbert-base-uncased`) will be sufficient:

In [None]:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
tokenizer

As with BERT, we will take a maximum length of 96 tokens for both datasets to speed up training and evaluation.

Based on this pre-trained model, the encodings for our training and testing datasets are generated as follows:

In [None]:
train_encodings = tokenizer(
    X_train.tolist(),
    max_length=96,
    padding="max_length",
    truncation=True,
)
test_encodings = tokenizer(
    X_test.tolist(),
    max_length=96,
    padding="max_length",
    truncation=True,
)

## Transformation of Labels and Encodings

Similar to what we saw before, let's convert these encodings into a `TensorSliceDataset` object in order to fine-tune and train our model.

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices(
    (dict(train_encodings), y_train)
)
test_dataset = tf.data.Dataset.from_tensor_slices(
    (dict(test_encodings), y_test)
)

We are now ready to fine-tune and train our DistilBERT model!

## Fine-Tuning and Training

The fine-tuning for DistilBERT is identical to BERT:

In [None]:
training_args = TFTrainingArguments(
    output_dir="/kaggle/working/sms/results/distilbert",
    num_train_epochs=8,
    per_device_train_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="/kaggle/working/sms/logs/distilbert",
    logging_steps=10,
)

With the fine-tuning parameters and our datasets defined, the DistilBERT model can be trained as follows:

In [None]:
from transformers import TFDistilBertForSequenceClassification

with training_args.strategy.scope():
    model = TFDistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased"
    )

trainer = TFTrainer(
    model=model, args=training_args, train_dataset=train_dataset
)
trainer.train()

**NOTE:** you can ignore the warnings.

Let's also save our DistilBERT model and its configuration in a persistent way:

In [None]:
trainer.save_model("/kaggle/working/sms/models/distilbert")
tokenizer.save_pretrained(training_args.output_dir)

## Measurement of Predictions

As seen previously, let's measure SMS predictions as spam or ham.

### Predictions

Now that our DistilBERT model is trained, we can use it to predict whether the SMS in our test dataset are spam or not:

In [None]:
preds, label_ids, metrics = trainer.predict(test_dataset)
preds[:5]

With DistilBERT, we also have logits.

Let's see if we also have additional predictions:

In [None]:
print(f"Test dataset size: {len(y_test)}")
print(f" Predictions size: {len(preds)}")

That is the case here: we have 13 additional predictions. Let's make sure to delete them:

In [None]:
preds = preds[: len(y_test)]
len(preds)

### Normalization

Let's convert these logits into probabilities and the latter into final predictions by taking the label for which the probability is highest:

In [None]:
preds = np.argmax(preds, axis=1)

### Confusion Matrix

The confusion matrix gives us a measure of the quality of the classification system:

In [None]:
plt.figure(figsize=(10, 4))

heatmap = sns.heatmap(
    data=pd.DataFrame(confusion_matrix(y_test, preds)),
    annot=True,
    fmt="d",
    cmap=sns.color_palette("Blues", 50),
)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), fontsize=14)
heatmap.yaxis.set_ticklabels(
    heatmap.yaxis.get_ticklabels(), rotation=0, fontsize=14
)

plt.title("Confusion Matrix")
plt.ylabel("Ground Truth")
plt.xlabel("Prediction")

Above, each row of the confusion matrix corresponds to a real class and each column corresponds to an estimated class.

Through the confusion matrix, we have:

*  **964 ham SMS were correctly predicted as ham**: True Negative (TN);
*  **2 ham SMS were detected as spam**: False Positive (FP);
*  **2 spam SMS were detected as ham**: False Negative (FN);
*  **147 spam SMS were detected as spam**: True Positive (TP).

### Scores

Let's look at the score obtained by the predictions:

In [None]:
print(f"Precision: {precision_score(y_test, preds) * 100:.3f}%")
print(f"   Recall: {recall_score(y_test, preds) * 100 :.3f}%")
print(f" Accuracy: {accuracy_score(y_test, preds) * 100:.3f}%")

# KNN

K-Nearest Neighbors (KNN) is an approach to data classification that estimates how likely a data point is to be a member of one group or the other depending on what group the data points nearest to it are in.
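
The principle can be sketched in a few lines. The `knn_predict` helper and the 2-D points below are hypothetical. In this Notebook, the neighbors are computed by scikit-learn on TF-IDF vectors rather than on hand-made coordinates.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D features; 0 = ham, 1 = spam.
points = [(0.1, 0.0), (0.2, 0.1), (0.3, 0.0), (0.8, 0.9), (0.9, 0.7), (0.7, 0.8)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, query=(0.15, 0.05)))
print(knn_predict(points, labels, query=(0.85, 0.80)))
```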

## Fine-Tuning and Training

To tune the hyper-parameters of the KNN, it is recommended to use grid search techniques with cross-validation (**SEE:** [scikit-learn's documentation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)) to evaluate the performance of the model on the data for each candidate value.

Let's use this technique to train our model with the optimal value of the number-of-neighbors hyper-parameter:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = GridSearchCV(
    Pipeline(
        [
            ("bow", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
            ("clf", KNeighborsClassifier()),
        ]
    ),
    {
        "clf__n_neighbors": (8, 15, 20, 25, 40, 55),
    }
)
knn.fit(X=X_train, y=y_train)

## Measurement of Predictions

### Predictions

Now that our KNN model is trained, we can use it to predict whether the SMS in our test dataset are spam or not:

In [None]:
preds = knn.predict(X_test)
preds

Here, `predict` directly returns the final class labels, so no normalization step is needed.

### Confusion Matrix

The confusion matrix gives us a measure of the quality of the classification system:

In [None]:
plt.figure(figsize=(10, 4))

heatmap = sns.heatmap(
    data=pd.DataFrame(confusion_matrix(y_test, preds)),
    annot=True,
    fmt="d",
    cmap=sns.color_palette("Blues", 50),
)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), fontsize=14)
heatmap.yaxis.set_ticklabels(
    heatmap.yaxis.get_ticklabels(), rotation=0, fontsize=14
)

plt.title("Confusion Matrix")
plt.ylabel("Ground Truth")
plt.xlabel("Prediction")

Through the confusion matrix, we have:

*   **964 ham SMS were correctly predicted as ham**: True Negative (TN);
*   **2 ham SMS were detected as spam**: False Positive (FP);
*   **46 spam SMS were detected as ham**: False Negative (FN);
*   **103 spam SMS were detected as spam**: True Positive (TP).

### Scores

Let's look at the score obtained by the predictions:

In [None]:
print(f"Precision: {precision_score(y_test, preds) * 100:.3f}%")
print(f"   Recall: {recall_score(y_test, preds) * 100 :.3f}%")
print(f" Accuracy: {accuracy_score(y_test, preds) * 100:.3f}%")

# Multinomial Naive Bayes Classifier

As the features of our dataset have discrete frequency counts, we will use the Multinomial type of Naive Bayes Model.

To detect whether an SMS is considered spam or not, the Multinomial Naive Bayes classifier uses word counts in the content of the SMS with the help of the Bag-of-Words (BoW) method. This method builds a matrix with one row per SMS and one column per word, where each cell holds the number of occurrences of that word in that SMS.
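
As a minimal sketch of the BoW idea, here is the matrix built by hand for three hypothetical messages. In this Notebook, `CountVectorizer` does this work and applies its own tokenization rules, which this sketch does not reproduce.

```python
from collections import Counter

# Hypothetical pre-processed messages, for illustration only.
messages = ["free prize call", "call me later", "free free entry"]

# One column per distinct word, one row per message.
vocabulary = sorted({word for msg in messages for word in msg.split()})
matrix = [
    [Counter(msg.split())[word] for word in vocabulary] for msg in messages
]

print(vocabulary)
for row in matrix:
    print(row)
```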

## Fine-Tuning and Training

As we said before, let's use grid search techniques using cross-validation to determine the optimal value of the $\alpha$ hyper-parameter of our model and train this model on this hyper-parameter:

In [None]:
from sklearn.naive_bayes import MultinomialNB

mnbayes = GridSearchCV(
    Pipeline(
        [
            ("bow", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
            ("clf", MultinomialNB()),
        ]
    ),
    {
        "tfidf__use_idf": (True, False),
        "clf__alpha": (0.1, 1e-2, 1e-3),
        "clf__fit_prior": (True, False),
    },
)
mnbayes.fit(X=X_train, y=y_train)

Out of curiosity, let's look at which hyper-parameters were selected for the training of the model on our training dataset:

In [None]:
mnbayes.best_params_

For our training dataset, the best value found for $\alpha$ is $10^{-2}$.

In [None]:
print(f"{mnbayes.best_score_ * 100:.3f}%") 

The mean cross-validated score is therefore 98.587%.
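
To see what $\alpha$ controls, recall that Multinomial Naive Bayes estimates the likelihood of a word in a class with additive smoothing, $(N_{wc} + \alpha) / (N_c + \alpha |V|)$. The sketch below evaluates this formula for the three candidate values of the grid. The counts and the `smoothed_likelihood` helper are hypothetical.

```python
def smoothed_likelihood(count, total, vocab_size, alpha):
    """P(word | class) with additive smoothing."""
    return (count + alpha) / (total + alpha * vocab_size)


# Hypothetical counts: a word never seen in ham, in a class with 1000 word
# occurrences and a vocabulary of 5000 distinct words.
for alpha in (0.1, 1e-2, 1e-3):
    p = smoothed_likelihood(count=0, total=1000, vocab_size=5000, alpha=alpha)
    print(f"alpha={alpha}: P(unseen word | ham) = {p:.2e}")
```

The smaller $\alpha$ is, the more strongly a word that was never seen in a class counts against that class.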

## Measurement of Predictions

As seen previously, let's measure SMS predictions as spam or ham.

### Predictions

Now that our Multinomial Naive Bayes model is trained, we can use it to predict whether the SMS in our test dataset are spam or not:

In [None]:
preds = mnbayes.predict(X_test)
preds

Here, `predict` directly returns the final class labels, so no normalization step is needed.

### Confusion Matrix

The confusion matrix gives us a measure of the quality of the classification system:

In [None]:
plt.figure(figsize=(10, 4))

heatmap = sns.heatmap(
    data=pd.DataFrame(confusion_matrix(y_test, preds)),
    annot=True,
    fmt="d",
    cmap=sns.color_palette("Blues", 50),
)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), fontsize=14)
heatmap.yaxis.set_ticklabels(
    heatmap.yaxis.get_ticklabels(), rotation=0, fontsize=14
)

plt.title("Confusion Matrix")
plt.ylabel("Ground Truth")
plt.xlabel("Prediction")

From the confusion matrix, we have:

*   **959 ham SMS correctly predicted as ham**: True Negative (TN);
*   **7 ham SMS wrongly detected as spam**: False Positive (FP);
*   **14 spam SMS wrongly detected as ham**: False Negative (FN);
*   **135 spam SMS correctly detected as spam**: True Positive (TP).
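
To make the layout of the matrix explicit, here is a minimal sketch that builds a 2x2 confusion matrix by hand, with rows for the ground truth and columns for the prediction, as in the heatmap. The helper `confusion_counts`, the toy labels, and the encoding `0 = ham`, `1 = spam` are assumptions made for the illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return a 2x2 matrix with rows = ground truth and columns = prediction."""
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1
    return matrix


# Hypothetical labels, assuming 0 = ham and 1 = spam.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1, 0, 1]

# First row: ham SMS (TN, FP). Second row: spam SMS (FN, TP).
(tn, fp), (fn, tp) = confusion_counts(y_true, y_pred)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

With labels sorted as `[0, 1]`, `confusion_matrix` from scikit-learn uses the same layout, which is why the four cells of the heatmap can be read as TN, FP, FN and TP.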

### Scores

Let's look at the scores obtained by the predictions:

In [None]:
print(f"Precision: {precision_score(y_test, preds) * 100:.3f}%")
print(f"   Recall: {recall_score(y_test, preds) * 100:.3f}%")
print(f" Accuracy: {accuracy_score(y_test, preds) * 100:.3f}%")
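
As a cross-check, the three scores can be recomputed by hand from the four counts read on the Multinomial Naive Bayes confusion matrix above. This is a minimal sketch using the standard definitions of precision, recall and accuracy, with spam as the positive class.

```python
# Counts read from the Multinomial Naive Bayes confusion matrix above.
tn, fp, fn, tp = 959, 7, 14, 135

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)  # share of real spam that was detected
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions

print(f"Precision: {precision * 100:.3f}%")
print(f"   Recall: {recall * 100:.3f}%")
print(f" Accuracy: {accuracy * 100:.3f}%")
```

This gives 95.070% precision, 90.604% recall and 98.117% accuracy.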

# SVM

## Fine-Tuning and Training

As we said before, let's use grid search with cross-validation to determine the hyper-parameters of our model, and train the model with them:

In [None]:
from sklearn.svm import SVC

svc = GridSearchCV(
    Pipeline(
        [
            ("bow", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
            ("clf", SVC(gamma="auto", C=1000)),
        ]
    ),
    dict(tfidf=[None, TfidfTransformer()], clf__C=[500, 1000, 1500]),
)
svc.fit(X=X_train, y=y_train)

Out of curiosity, let's look at which hyper-parameters the grid search selected for our training dataset:

In [None]:
svc.best_params_

For our training dataset, the grid search selects $C = 1000$ and no transformation of the count matrix: neither a normalized term-frequency (tf) representation nor a term-frequency times inverse document-frequency (tf-idf) representation is applied.
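
For reference, the tf-idf weighting mentioned above can be sketched in plain Python. The sketch assumes the default settings of `TfidfTransformer` (smoothed idf, $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$, followed by L2 normalization of each row). The helper `tfidf` and the count matrix are hypothetical.

```python
import math


def tfidf(count_matrix):
    """Apply smoothed tf-idf weighting followed by L2 row normalization."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency: number of rows in which each term appears.
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    weighted = [[c * w for c, w in zip(row, idf)] for row in count_matrix]
    result = []
    for row in weighted:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm if norm else 0.0 for v in row])
    return result


# Hypothetical count matrix: 3 SMS (rows) x 3 words (columns).
counts = [
    [1, 1, 0],
    [1, 0, 2],
    [1, 0, 0],
]
for row in tfidf(counts):
    print([round(v, 3) for v in row])
```

A word that appears in every SMS (first column) gets the lowest weight, while a rarer word (second column) is weighted up. Setting the `tfidf` step to `None` in the grid skips this transformation and feeds the raw counts to the classifier.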

In addition, we can get the mean cross-validated score of the estimator that was chosen by the search:

In [None]:
print(f"{svc.best_score_ * 100:.3f}%") 

The mean cross-validated score is therefore 98.519%.

## Measurement of Predictions

### Predictions

Now that our SVM model is trained, we can use it to predict whether the SMS in our test dataset are spam or not:

In [None]:
preds = svc.predict(X_test)
preds

Here, `predict` directly returns the final class labels (ham or spam), so no conversion from logits or probabilities is needed.

### Confusion Matrix

The confusion matrix lets us measure the quality of the classification:

In [None]:
plt.figure(figsize=(10, 4))

heatmap = sns.heatmap(
    data=pd.DataFrame(confusion_matrix(y_test, preds)),
    annot=True,
    fmt="d",
    cmap=sns.color_palette("Blues", 50),
)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), fontsize=14)
heatmap.yaxis.set_ticklabels(
    heatmap.yaxis.get_ticklabels(), rotation=0, fontsize=14
)

plt.title("Confusion Matrix")
plt.ylabel("Ground Truth")
plt.xlabel("Prediction")

From the confusion matrix, we have:

*   **964 ham SMS correctly predicted as ham**: True Negative (TN);
*   **2 ham SMS wrongly detected as spam**: False Positive (FP);
*   **13 spam SMS wrongly detected as ham**: False Negative (FN);
*   **136 spam SMS correctly detected as spam**: True Positive (TP).

### Scores

Let's look at the scores obtained by the predictions:

In [None]:
print(f"Precision: {precision_score(y_test, preds) * 100:.3f}%")
print(f"   Recall: {recall_score(y_test, preds) * 100:.3f}%")
print(f" Accuracy: {accuracy_score(y_test, preds) * 100:.3f}%")

# Results

If we summarize the results obtained, here is what we get:

| ML Algo                 | Accuracy (%) | Precision (%) | Recall (%) |
| ----------------------: | -----------: | ------------: | ---------: |
| BERT                    |   **99.910** |   **100.000** | **99.329** |
| DistilBERT              |       99.641 |        98.658 |     98.658 |
| KNN                     |       95.695 |        98.095 |     69.128 |
| Multinomial Naive Bayes |       98.117 |        95.070 |     90.604 |
| SVM                     |       98.655 |        98.551 |     91.275 |

We can see that BERT and DistilBERT are the ML classification algorithms that provide the best results. However, based on the scores, there is no large difference in accuracy and precision between these algorithms: the gap stays below 5 percentage points. Recall differs more: Multinomial Naive Bayes and SVM are roughly 8 points behind BERT, and KNN about 30 points behind.
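
To quantify these differences, here is a minimal sketch that computes, for each algorithm, the gap in percentage points with BERT on every metric. The scores are copied from the table above. The variable names are chosen for the illustration.

```python
# Scores (in %) copied from the results table.
scores = {
    "BERT": {"accuracy": 99.910, "precision": 100.000, "recall": 99.329},
    "DistilBERT": {"accuracy": 99.641, "precision": 98.658, "recall": 98.658},
    "KNN": {"accuracy": 95.695, "precision": 98.095, "recall": 69.128},
    "Multinomial Naive Bayes": {
        "accuracy": 98.117,
        "precision": 95.070,
        "recall": 90.604,
    },
    "SVM": {"accuracy": 98.655, "precision": 98.551, "recall": 91.275},
}

# Gap with BERT, in percentage points, for every algorithm and metric.
best = scores["BERT"]
gaps = {
    name: {metric: round(best[metric] - value, 3) for metric, value in m.items()}
    for name, m in scores.items()
}

for name, gap in gaps.items():
    print(f"{name:<24} {gap}")
```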

# Conclusion

In this Notebook, we started by loading a dataset of spam SMS and created our features from the raw data using Feature Engineering.

Once our features were created, we analyzed the available data on the basis of these features, and then pre-processed the data by removing stop words, punctuation and digits, and by lemmatizing the words.

In addition, we learned how to fine-tune different Machine Learning classification algorithms. For some of these algorithms, it was useful to rely on grid search with cross-validation to evaluate the performance of the model for each combination of hyper-parameters.

BERT and DistilBERT are to be preferred when we want to push performance to its maximum and to minimize the number of spam SMS misclassified as ham, i.e. False Negatives (measured by the recall score).

However, even though these algorithms are the best according to the scores, we can still apply Ockham's razor principle. Indeed, if these few extra percentage points can be neglected, classical classification algorithms such as Multinomial Naive Bayes and SVM can still be preferred because they are simpler to understand and to implement.
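
One possible way to turn this principle into a selection rule is sketched below: keep every model whose accuracy is within a chosen tolerance of the best one, then pick the simplest. The accuracies come from the results table, but the complexity ranks, the tolerance values and the helper `simplest_within` are illustrative assumptions, not something measured in this Notebook.

```python
# Accuracy (in %) from the results table, paired with a complexity rank.
# The ranks are an illustrative assumption (lower = simpler).
models = {
    "BERT": (99.910, 5),
    "DistilBERT": (99.641, 4),
    "SVM": (98.655, 3),
    "KNN": (95.695, 2),
    "Multinomial Naive Bayes": (98.117, 1),
}


def simplest_within(models, tolerance):
    """Return the simplest model within `tolerance` points of the best accuracy."""
    best_accuracy = max(accuracy for accuracy, _ in models.values())
    candidates = [
        name
        for name, (accuracy, _) in models.items()
        if best_accuracy - accuracy <= tolerance
    ]
    return min(candidates, key=lambda name: models[name][1])


for tolerance in (0.0, 0.5, 2.0):
    print(tolerance, simplest_within(models, tolerance))
```

The outcome depends entirely on the tolerance we are willing to accept: a strict tolerance keeps the BERT-based models, while a looser one favors a classical algorithm.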

# References

[BERT Fine-Tuning Tutorial with PyTorch](http://mccormickml.com/2019/07/22/BERT-fine-tuning/)

[Naive Bayes & SVM Spam Filtering](https://www.kaggle.com/pablovargas/naive-bayes-svm-spam-filtering)

[Starter: Neural Net w/ 0.97 ROC-AUC - 99% accuracy](https://www.kaggle.com/mrlucasfischer/starter-neural-net-w-0-97-roc-auc-99-accuracy)