<a href="https://colab.research.google.com/github/victorgalleto/spam-detection/blob/main/Prompt_Engineering_Case.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam Detection Based On Text Classification

In this case study, we develop a text classification model based on BERT to decide whether an SMS message is spam. We use a dataset of SMS messages labelled with one of two classes: spam or ham (a message that is not spam).

We will use Keras to build and train the neural network that detects spam in messages.

In [None]:
# Installation of tensorflow, tensorflow_hub and tensorflow_text
!pip install tensorflow

In [None]:
!pip install tensorflow_hub

In [None]:
!pip install tensorflow_text

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

**The dataset was downloaded from Kaggle following the link below:**

https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

In [None]:
# Uploading the dataset to the Colab runtime (the file is saved under /content)
from google.colab import files
uploaded = files.upload()

In [None]:
# Reading the dataset and analysing the first five lines of content
import pandas as pd

df = pd.read_csv("/content/spam.csv", encoding = "ISO8859-1")
df.head(5)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


**Performing a simple analysis, we can observe how many hams and spams are in the dataset by counting the elements in the first column "v1":**

In [None]:
df['v1'].value_counts()

ham     4825
spam     747
Name: v1, dtype: int64
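
To put the imbalance in numbers, the short calculation below derives the class shares from the counts above. It also shows the accuracy that a trivial classifier labelling every message as ham would reach, which is why plain accuracy on the unbalanced data could be misleading. The variable names are only illustrative.

```python
# Class counts reported by value_counts() above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

spam_share = spam_count / total
# Accuracy of a hypothetical classifier that always answers "ham"
always_ham_accuracy = ham_count / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.1%}")
print(f"always-ham accuracy: {always_ham_accuracy:.1%}")
```

Roughly one message in seven is spam, so a model can score well on accuracy while missing most of the spam.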

In [None]:
# Selecting all the rows labelled "spam" from the dataframe
df_spam = df[df['v1']=='spam']
df_spam.shape

(747, 5)

In [None]:
# Selecting all the rows labelled "ham" from the dataframe
df_ham = df[df['v1']=='ham']
df_ham.shape

(4825, 5)

**The counts show a large imbalance between hams and spams. To balance the classes, we downsample the ham dataframe by drawing a random sample with as many rows as the spam dataframe:**

In [None]:
df_ham_downsampled = df_ham.sample(df_spam.shape[0])
df_ham_downsampled.shape

(747, 5)
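
Because `sample` draws at random, each run of the notebook can pick a different subset of ham messages. The sketch below illustrates seeded downsampling with only the standard library; the ID lists, the helper `downsample` and the seed value are made up for the example. In pandas, `DataFrame.sample` accepts a `random_state` argument that serves the same purpose.

```python
import random

# Hypothetical stand-ins for the two classes (IDs instead of real messages)
ham_ids = list(range(4825))
spam_ids = list(range(747))

def downsample(items, n, seed):
    """Draw n items without replacement, reproducibly for a given seed."""
    return random.Random(seed).sample(items, n)

first_draw = downsample(ham_ids, len(spam_ids), seed=42)
second_draw = downsample(ham_ids, len(spam_ids), seed=42)

print(len(first_draw))            # same size as the spam class
print(first_draw == second_draw)  # the same seed gives the same draw
```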

**With that, we can now concatenate the two dataframes into one with twice as many rows:**

In [None]:
# Concatenating the two dataframes
df_balanced = pd.concat([df_ham_downsampled, df_spam])
df_balanced.shape

(1494, 5)

In [None]:
# Checking the number of hams and spams in the "v1" column. They must be equal:
df_balanced['v1'].value_counts()

ham     747
spam    747
Name: v1, dtype: int64

**Now we can create a new column called "spam" that encodes the label as a number: 1 if the message is spam and 0 otherwise. The model is trained and evaluated against these numeric labels.**

In [None]:
# Lambda function that writes 1 into the 'spam' column for spam messages and 0 otherwise:
df_balanced['spam']=df_balanced['v1'].apply(lambda x: 1 if x=='spam' else 0)

# Check that the function was applied correctly:
df_balanced.sample(5)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4,spam
3760,ham,Was just about to ask. Will keep this one. May...,,,,0
3405,ham,\HEY DAS COOL... IKNOW ALL 2 WELLDA PERIL OF S...,,,,0
838,spam,We tried to contact you re our offer of New Vi...,,,,1
1357,ham,Good afternoon loverboy ! How goes you day ? A...,,,,0
2294,spam,You have 1 new message. Please call 08718738034.,,,,1


**With a balanced dataset and numeric labels, we can split the data into two sets: one for training the model and another for testing its accuracy. Both sets should keep the same proportion of spams and hams:**

In [None]:
# Using the "stratify" parameter of train_test_split to keep the proportion of the "spam" column values in both sets:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_balanced['v2'],df_balanced['spam'], stratify=df_balanced['spam'])
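
To illustrate what `stratify` does, here is a simplified standard-library sketch that splits each class separately, so both sets keep the same class ratio. It is not scikit-learn's implementation. The function name `stratified_split`, the seed and the 25% test fraction (scikit-learn's default when no size is given) are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), splitting every class with the same fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Balanced labels as in df_balanced: 747 spam (1) and 747 ham (0)
labels = [1] * 747 + [0] * 747
train_idx, test_idx = stratified_split(labels)

test_spam = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_spam)
```

With these inputs each class contributes the same number of messages to the test set, which is the property the `stratify` argument is meant to guarantee.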

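**Now we need to load BERT from TensorFlow Hub to provide the preprocessing and encoding steps of our Functional Model. Because `hub.KerasLayer` is not trainable by default, the BERT weights stay frozen and only the layers added on top of them are trained:**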

In [None]:
# Importing BERT to execute the preprocessing and encoding tasks:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")
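
As a rough picture of what the preprocessing layer hands to the encoder, the toy function below packs a list of token IDs into the three fixed-length inputs BERT expects. It is only a sketch: the real layer also performs WordPiece tokenization and uses a sequence length of 128 by default, while the helper `pack`, the short length of 8 and the three word IDs here are hypothetical.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in the uncased BERT vocabulary

def pack(token_ids, seq_length):
    """Wrap token IDs in [CLS]/[SEP], then truncate and pad to seq_length."""
    body = token_ids[: seq_length - 2]  # leave room for [CLS] and [SEP]
    word_ids = [CLS_ID] + body + [SEP_ID]
    padding = seq_length - len(word_ids)
    return {
        "input_word_ids": word_ids + [PAD_ID] * padding,
        "input_mask": [1] * len(word_ids) + [0] * padding,  # 1 = real token, 0 = padding
        "input_type_ids": [0] * seq_length,                 # single-segment input
    }

# Hypothetical token IDs for a three-token message
packed = pack([2489, 4443, 2000], seq_length=8)
for name, values in packed.items():
    print(name, values)
```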

### **Building a Functional Model**

The Functional Model was chosen because it is easy to use and more flexible than the Sequential Model, as described in the article linked below:

https://becominghuman.ai/sequential-vs-functional-model-in-keras-20684f766057

**Following this approach, the model starts with the BERT preprocessing and encoding layers. The neural network layers then take the encoder output and produce the prediction, which gives us a model that is ready to be trained:**

In [None]:
# BERT layers based on an input (text_input)
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

# Neural network layers
layer = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
layer = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(layer)

# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs = [layer])
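
The output layer is a single unit with a sigmoid activation, so it turns the encoder's pooled vector into one number between 0 and 1. The sketch below reproduces that computation by hand. The 4-dimensional feature vector, weights and bias are made-up stand-ins for the 768-dimensional `pooled_output` and the learned parameters.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_sigmoid(features, weights, bias):
    """One Dense(1, activation='sigmoid') unit: weighted sum followed by a sigmoid."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return sigmoid(z)

# Hypothetical stand-in for pooled_output and the learned parameters
features = [0.5, -1.0, 0.25, 2.0]
weights = [0.8, 0.1, -0.4, 0.6]
bias = -0.3

probability = dense_sigmoid(features, weights, bias)
print(f"spam probability: {probability:.4f}")
```

The dropout layer placed before this unit only randomly zeroes features during training; at prediction time it passes its input through unchanged.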

In [None]:
# Verifying the length of the vector that will be used to train the model:
len(X_train)

1120

In [None]:
# Metrics that will allow us to see if the results are consistent
METRICS = [
      tf.keras.metrics.BinaryAccuracy(name='accuracy'),
      tf.keras.metrics.Precision(name='precision'),
      tf.keras.metrics.Recall(name='recall')
]

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=METRICS)
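
For intuition about the `binary_crossentropy` loss, the sketch below computes it by hand for two sets of predictions. The labels and probabilities are invented for the example, and the clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; eps clips predictions away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical labels and predicted spam probabilities
y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
unsure = [0.6, 0.4, 0.5, 0.5]

loss_confident = binary_crossentropy(y_true, confident)
loss_unsure = binary_crossentropy(y_true, unsure)
print(f"confident: {loss_confident:.4f}, unsure: {loss_unsure:.4f}")
```

Predictions that are both correct and confident give a lower loss, which is what training pushes the model towards.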

## **Training the model**

Now the model is trained on the training vectors `X_train` and `y_train`. We chose 10 epochs, which gives enough accuracy for a proof of concept:

In [None]:
model.fit(X_train, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f64b4db7df0>
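
As a rough guide to how much work the 10 epochs represent, the calculation below counts the weight updates performed during training. It assumes the Keras default batch size of 32, since `fit` is called without a `batch_size` argument.

```python
import math

train_size = 1120   # len(X_train) reported above
batch_size = 32     # assumed: Keras default when fit() is called without batch_size
epochs = 10

steps_per_epoch = math.ceil(train_size / batch_size)
total_updates = steps_per_epoch * epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total weight updates: {total_updates}")
```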

## **Testing the model**

The model is ready to be tested on the corresponding test vectors `X_test` and `y_test`. The output lists the loss, accuracy, precision and recall, in that order, which together summarize the performance of the trained model:

In [None]:
# Testing the model
model.evaluate(X_test, y_test)



[0.28770604729652405,
 0.9064171314239502,
 0.8762376308441162,
 0.9465240836143494]
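
The three metrics can be traced back to the underlying counts of correct and wrong predictions. The sketch below does this under one assumption that the notebook does not print: the test set holds 374 messages, 187 spam and 187 ham, as would follow from a stratified 25% split of the 1494 balanced rows. The F1 score at the end is an extra figure derived from the same counts.

```python
# Metrics reported by model.evaluate (rounded copies of the values above)
accuracy, precision, recall = 0.9064171, 0.8762376, 0.9465241

# Assumption: 187 spam and 187 ham messages in the test set
n_spam = n_ham = 187

tp = round(recall * n_spam)        # spam correctly flagged
fn = n_spam - tp                   # spam missed
fp = round(tp / precision) - tp    # ham wrongly flagged as spam
tn = n_ham - fp                    # ham correctly passed

f1 = 2 * tp / (2 * tp + fp + fn)

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"accuracy check: {(tp + tn) / (n_spam + n_ham):.7f}")
print(f"F1 score: {f1:.4f}")
```

Under this assumption the recomputed accuracy agrees with the reported one, and the counts show that the model more often flags a ham message as spam than it misses a spam message.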

**With a reasonably reliable model, we can check on a practical example whether an SMS is spam. We create a list of five texts called "reviews", and the model returns the probability of each message being spam, as shown below:**

In [None]:
reviews = [
    'Enter a chance to win $5000, hurry up, offer valid until march 31, 2021',
    'You are awarded a SiPix Digital Camera! call 09061221061 from landline. Delivery within 28days. T Cs Box177. M221BP. 2yr warranty. 150ppm. 16 . p p£3.99',
    'it to 80488. Your 500 free text messages are valid until 31 December 2005.',
    'Hey Sam, Are you coming for a cricket game tomorrow',
    "Why don't you wait 'til at least wednesday to see if you get your ."
]
model.predict(reviews)



array([[0.78814435],
       [0.8367806 ],
       [0.80551696],
       [0.26463196],
       [0.13120382]], dtype=float32)

**The results show that the model works satisfactorily: the first three messages receive a high spam probability, while the last two score well below 0.5 and are therefore treated as ham.**
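
To turn the probabilities into labels, a decision threshold is needed. The sketch below uses 0.5, which is an assumption on our side, since the notebook does not fix a cut-off explicitly.

```python
# Spam probabilities returned by model.predict above
probabilities = [0.78814435, 0.8367806, 0.80551696, 0.26463196, 0.13120382]

THRESHOLD = 0.5  # assumed cut-off; the notebook does not state one

labels = ["spam" if p >= THRESHOLD else "ham" for p in probabilities]
print(labels)
```

Raising the threshold would flag fewer messages as spam, trading recall for precision.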

## **References**

Project reference:

https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/47_BERT_text_classification/BERT_email_classification-handle-imbalance.ipynb

Model reference:

https://becominghuman.ai/sequential-vs-functional-model-in-keras-20684f766057