# BERT Spam Classification using TensorFlow

Classify SMS messages as spam or ham using BERT and TensorFlow.

Credit: https://www.analyticsvidhya.com/blog/2021/12/text-classification-using-bert-and-tensorflow/

In [None]:
import sys
!{sys.executable} -m pip install tensorflow-text==2.8.1
!{sys.executable} -m pip install scikit-learn==1.3.1

In [12]:
import pandas as pd

df = pd.read_csv("data/SMSSpamCollection", sep="\t", names=["label", "message"])
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Convert the `label` column into a binary `spam` column: 1 for `spam`, 0 for `ham`.

In [2]:
# Vectorized comparison: True/False cast to 1/0
df['spam'] = (df['label'] == "spam").astype(int)
df.head()

Unnamed: 0,label,message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


- **Split the dataset into train and test sets using `train_test_split`.**
- Passing `stratify=df.spam` uses stratified sampling, so the train and test sets keep the same proportion of `ham` and `spam`.
- Stratified sampling partitions a population into subpopulations (here, the two classes) and samples from each one separately.


In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.message, df.spam, stratify=df.spam)
X_train.head()

4209    Or i go home first lar ü wait 4 me lor.. I put...
788     Ever thought about living a good life with a p...
799                       Ok i msg u b4 i leave my house.
3479    I can ask around but there's not a lot in term...
2208    Usually the body takes care of it buy making s...
Name: message, dtype: object
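To make the effect of stratification concrete, here is a minimal pure-Python sketch of a stratified split. It is not how scikit-learn implements `train_test_split`; the helper name `stratified_split` and the toy labels are assumptions made for illustration. The 25% test fraction mirrors scikit-learn's default `test_size`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so each class keeps its share in both splits."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then cut off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 80 ham, 20 spam (made-up proportions, not the real dataset).
labels = ["ham"] * 80 + ["spam"] * 20
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

In this toy run both splits contain 20% spam, which is the property `stratify=df.spam` is meant to guarantee on the real data.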

- **We tokenize the text data using `bert_preprocess`.**
- Once the text is tokenized, we encode it using `bert_encoder`.
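To picture what the preprocessing step hands to the encoder, the sketch below builds the three fields by hand for one sentence. It is a simplification under stated assumptions: the real model uses WordPiece tokenization and a sequence length of 128, while this toy version splits on whitespace, uses a hypothetical three-word vocabulary with made-up ids, and pads to 8. The special-token ids (`[PAD]`=0, `[UNK]`=100, `[CLS]`=101, `[SEP]`=102) follow the standard uncased BERT vocabulary.

```python
# Special-token ids from the standard uncased BERT vocabulary.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102

# Hypothetical word ids, invented for this example only.
toy_vocab = {"free": 2001, "entry": 2002, "now": 2003}

def toy_preprocess(sentence, seq_length=8):
    """Mimic the shape of the preprocessing output for a single sentence."""
    # Leave room for [CLS] and [SEP] when truncating.
    tokens = sentence.lower().split()[: seq_length - 2]
    ids = [CLS_ID] + [toy_vocab.get(t, UNK_ID) for t in tokens] + [SEP_ID]
    n_real = len(ids)
    padding = seq_length - n_real
    return {
        "input_word_ids": ids + [PAD_ID] * padding,   # token ids, padded
        "input_mask": [1] * n_real + [0] * padding,   # 1 = real token, 0 = padding
        "input_type_ids": [0] * seq_length,           # single segment, so all zeros
    }

encoded = toy_preprocess("Free entry now")
for name, values in encoded.items():
    print(name, values)
```

The mask lets the encoder ignore padded positions, which is why every message can share the same fixed length.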

In [4]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # imported for its side effect: registers ops the preprocessing model needs

bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

2023-10-13 14:23:35.628915: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

- **A sentence embedding converts the text input into a vector.**
- BERT, like other Transformer models, cannot process raw text directly; the text must first be converted into numeric vectors.
- We use `bert_encoder` to convert the preprocessed text into vectors.
- It returns `pooled_output`, a single vector per input derived from the `[CLS]` token, and `sequence_output`, which holds an embedding for every token in the input.
- Think of the `[CLS]` token as an attempt to build a single representation of the whole input text.

In [5]:
def get_sentence_embedding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    encoded_text = bert_encoder(preprocessed_text)
    return encoded_text['pooled_output']

get_sentence_embedding(["$500 discount. Hurry up!", "Bhavin, are you up for volley game tomorrow?"])

2023-10-13 14:23:43.410221: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.


<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.8814409 , -0.4795407 , -0.92536545, ..., -0.7865841 ,
        -0.75873065,  0.91710496],
       [-0.8657644 , -0.48833352, -0.93009543, ..., -0.85933286,
        -0.72032976,  0.8754201 ]], dtype=float32)>
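Once sentences are vectors, they can be compared numerically. One common choice is cosine similarity; the sketch below is a plain-Python version run on made-up three-dimensional vectors rather than the real 768-dimensional embeddings. The function name and the toy vectors are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings.
vec_a = [1.0, 0.0, 1.0]
vec_b = [2.0, 0.0, 2.0]   # same direction as vec_a, different length
vec_c = [0.0, 3.0, 0.0]   # orthogonal to vec_a

same_direction = cosine_similarity(vec_a, vec_b)
orthogonal = cosine_similarity(vec_a, vec_c)
print(same_direction, orthogonal)
```

With real embeddings, each row of the tensor returned by `get_sentence_embedding` could be converted to a list (for example with `.numpy().tolist()`) and passed in the same way.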

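- **We create new input and output layers around BERT.**
- The input layer takes one scalar string per example (`shape=()`, `dtype=tf.string`); the strings are UTF-8 encoded.
- The BERT encoder's `pooled_output` feeds the classification layers.
- A `Dropout` layer helps reduce overfitting.
- A `Dense` layer with one unit and a sigmoid activation outputs the spam probability.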

In [6]:
# Bert layer
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

# neural network
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation="sigmoid", name="output")(l)

# use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs=[l])

model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_word_ids':   0           ['text[0][0]']                   
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128),                                                          
                                 'input_type_ids':                                                
                                (None, 128)}                                                  

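- Trainable params are the weights that are updated during training.
- Non-trainable params are not updated. Because `hub.KerasLayer` defaults to `trainable=False`, the BERT weights are frozen in this model and only the final dense layer is trained.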
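As a rough illustration of what the trainable part of this model does, the sketch below counts the parameters of a dense layer and computes a single sigmoid output by hand. It is a plain-Python approximation, not Keras code; the helper names and the toy feature values are assumptions.

```python
import math

def dense_param_count(n_inputs, n_units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return n_inputs * n_units + n_units

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_sigmoid(features, weights, bias):
    """Output of a single-unit dense layer with sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# The pooled output has 768 values and the output layer has 1 unit.
n_trainable = dense_param_count(768, 1)
print(n_trainable)

# Toy 2-feature example with made-up weights.
probability = dense_sigmoid(features=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(probability)
```

A weighted sum of zero maps to a probability of 0.5; training moves the weights so that spam messages land above that point and ham messages below it.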


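- We choose `adam` as the optimizer. Adam is an extension of stochastic gradient descent (SGD) and a common, popular choice.
- Adam adapts the step size for each parameter using running averages of the gradients and their squares.
- We choose `binary_crossentropy` as the loss function because this is a binary classification problem.
- The reported metric is accuracy. Other options include precision, recall, and F1, which are often more informative when one class is much rarer than the other.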
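To see why binary cross-entropy suits this task, the sketch below computes it by hand for two toy cases. It is a simplified stand-in for the Keras loss, not its implementation; the function name, the clipping constant `eps`, and the example probabilities are assumptions for illustration.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip to avoid log(0) for predictions of exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0]  # one spam message, one ham message

# Confident and correct: low loss.
confident_right = binary_crossentropy(labels, [0.9, 0.1])
# Confident and wrong: high loss.
confident_wrong = binary_crossentropy(labels, [0.1, 0.9])

print(round(confident_right, 4), round(confident_wrong, 4))
```

The loss grows sharply as the model becomes confidently wrong, which pushes the sigmoid output toward the correct side during training.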

In [7]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=['accuracy'])

model.fit(X_train, y_train, epochs=2, batch_size=32)

Epoch 1/2


Epoch 2/2


<keras.callbacks.History at 0x7f0a00626ad0>

- Evaluate the model on the test data. `model.evaluate` returns the loss followed by the accuracy.

In [8]:
model.evaluate(X_test, y_test)



[0.19320189952850342, 0.929648220539093]

- Use the trained model to predict on the test data.
- Unlike evaluation, prediction does not see the true labels; it returns the model's raw outputs, which here are sigmoid probabilities.

In [9]:
y_predicted = model.predict(X_test)
y_predicted = y_predicted.flatten()
print(y_predicted)

[0.03933839 0.01286093 0.36083266 ... 0.45198444 0.7612681  0.32288635]


- **The raw predictions are probabilities, not spam or ham labels.**
- We convert them to labels by treating any value above 0.5 as spam.

In [10]:
import numpy as np 

y_predicted = np.where(y_predicted > 0.5, 1, 0)
y_predicted


array([0, 0, 0, ..., 0, 1, 0])
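With hard 0/1 predictions in hand, metrics beyond accuracy can be computed from the confusion-matrix counts. The sketch below does this in plain Python on made-up labels; the function name and the toy arrays are assumptions, and the numbers say nothing about the model trained above.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics, treating 1 (spam) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Toy labels and predictions, invented for this example.
toy_true = [1, 0, 1, 1, 0, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 0, 1, 0, 1]

metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

On the real data, `list(y_test)` and `list(y_predicted)` could be passed in the same way to see how many spam messages are missed (`fn`) and how many ham messages are flagged (`fp`).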