# Spam Classification with BERT

Transformer models have revolutionized deep learning. Transformer-based models such as BERT are widely used for NLP tasks because of the rich numerical representations of text they provide. Here we discuss how to download a pretrained BERT model and adapt it to solve a spam classification problem.


<table align="left">
    <td>
        <a target="_blank" href="https://colab.research.google.com/github/thushv89/manning_tf2_in_action/blob/master/Ch12/12.1_Spam_Classification_with_BERT.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
    </td>
</table>


## Import libraries

In [20]:
import os
import random
import pandas as pd
import tensorflow as tf
import numpy as np
from official import nlp
from official.nlp import bert
import official.nlp.optimization
import official.nlp.bert.bert_models
import official.nlp.bert.configs

print("TensorFlow: {} installed".format(tf.__version__))

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before the GPUs have been initialized
        print("Couldn't set memory_growth: {}".format(e))
    
    
def fix_random_seed(seed):
    """ Setting the random seed of various libraries """
    try:
        np.random.seed(seed)
    except NameError:
        print("Warning: Numpy is not imported. Setting the seed for Numpy failed.")
    try:
        tf.random.set_seed(seed)
    except NameError:
        print("Warning: TensorFlow is not imported. Setting the seed for TensorFlow failed.")
    try:
        random.seed(seed)
    except NameError:
        print("Warning: random module is not imported. Setting the seed for random failed.")

# Fixing the random seed
random_seed=4321
fix_random_seed(random_seed)

TensorFlow: 2.4.1 installed


## Download and read the data

We will use the SMS spam classification dataset available [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip). It is a tab-separated file with two columns: the first column holds the label (`ham` or `spam`) and the second column contains the SMS message.
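
As a minimal sketch of this file format, the snippet below parses one raw line into a numeric label and a message. The helper name `parse_line` and the sample lines are illustrative assumptions, not part of the notebook; the label mapping (ham to 0, spam to 1) mirrors the loading code in this section.

```python
def parse_line(line):
    """Split one 'label<TAB>message' line into (numeric_label, message)."""
    # Split only on the first tab so tabs inside the message survive
    label_text, message = line.rstrip("\n").split("\t", 1)
    # Assumed mapping, mirroring the notebook: ham -> 0, spam -> 1
    label = {"ham": 0, "spam": 1}[label_text]
    return label, message


# Hypothetical sample lines in the dataset's tab-separated layout
print(parse_line("ham\tOk lar... Joking wif u oni...\n"))
print(parse_line("spam\tFree entry in 2 a wkly comp\n"))
```

Splitting on the tab, rather than slicing at a fixed offset, raises an error on a malformed line instead of silently mislabelling it.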

In [18]:
# Downloading the data

import os
import requests
import zipfile


if not os.path.exists('data'):
    os.mkdir('data')
        
# Retrieve the data
if not os.path.exists(os.path.join('data', 'smsspamcollection.zip')):
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
    # Get the file from the web and fail early on an HTTP error
    r = requests.get(url)
    r.raise_for_status()
    
    # Write to a file
    with open(os.path.join('data', 'smsspamcollection.zip'), 'wb') as f:
        f.write(r.content)
          
else:
    print("The zip file already exists.")
    
if not os.path.exists(os.path.join('data', 'SMSSpamCollection')):
    with zipfile.ZipFile(os.path.join('data', 'smsspamcollection.zip'), 'r') as zip_ref:
        zip_ref.extractall('data')  
else:
    print("The extracted data already exists")


    

The zip file already exists.
The extracted data already exists


In [22]:
import numpy as np

# Inputs and labels will be stored in this
inputs = []
labels = []
# Total number of instances for spam and ham
n_ham, n_spam = 0,0
with open(os.path.join('data', 'SMSSpamCollection'), 'r') as f:
    for r in f:        
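        # Ham input (skip the "ham" label and the tab that follows it)
        if r.startswith('ham'):
            label = 0
            txt = r[4:]
            n_ham += 1
        # Spam input (skip the "spam" label and the tab that follows it)
        elif r.startswith('spam'):
            label = 1
            txt = r[5:]
            n_spam += 1
        # Skip any line that carries neither label
        else:
            continue
        inputs.append(txt)
        labels.append(label)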
        
print("Found {} ham and {} spam".format(n_ham, n_spam))
print(inputs[:5])

# Convert them to arrays
inputs = np.array(inputs).reshape(-1,1)
labels = np.array(labels)

Found 4827 ham and 747 spam
['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\n', 'Ok lar... Joking wif u oni...\n', "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n", 'U dun say so early hor... U c already then say...\n', "Nah I don't think he goes to usf, he lives around here though\n"]


## Splitting train/valid/test

Here we split the data into three sets using the `imbalanced-learn` library. Specifically, we:

* Create a balanced test set with 100 spam and 100 ham (randomly sampled)
* Create a balanced validation set with 100 spam and 100 ham (randomly sampled)
* Create a balanced training set from the leftover instances (undersampled using the NearMiss algorithm)
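
The random, balanced hold-out step can be illustrated without any library. The sketch below is a simplified stand-in for what random undersampling to a fixed count per class does conceptually; it is not the `imbalanced-learn` implementation, and the helper name `balanced_sample_indices` and the toy labels are assumptions made for illustration.

```python
import random


def balanced_sample_indices(labels, n_per_class, seed=4321):
    """Pick n_per_class random indices for every class label."""
    rng = random.Random(seed)
    chosen = []
    for cls in sorted(set(labels)):
        candidates = [i for i, y in enumerate(labels) if y == cls]
        chosen.extend(rng.sample(candidates, n_per_class))
    return sorted(chosen)


# Toy labels: 8 ham (0) and 4 spam (1)
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0]

# Hold out 2 examples per class, then keep everything else for later splits
test_idx = balanced_sample_indices(toy_labels, n_per_class=2)
held_out = set(test_idx)
rest_idx = [i for i in range(len(toy_labels)) if i not in held_out]
print(test_idx, rest_idx)
```

Removing the held-out indices before drawing the next split is what keeps the test, validation, and training sets disjoint.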

In [23]:
from imblearn.under_sampling import NearMiss, RandomUnderSampler


n=100 # Number of instances for each class in the test/validation sets
rus = RandomUnderSampler(sampling_strategy={0:n, 1:n}, random_state=random_seed)
rus.fit_resample(inputs, labels)

# Get test indices
test_inds = rus.sample_indices_
test_x, test_y = inputs[test_inds], np.array(labels)[test_inds]
print("Test statistics")
print(pd.Series(test_y).value_counts())

# Get rest (train + valid)
rest_inds = [i for i in range(inputs.shape[0]) if i not in test_inds]
rest_x, rest_y = inputs[rest_inds], labels[rest_inds]

# Get valid indices
rus.fit_resample(rest_x, rest_y)
valid_inds = rus.sample_indices_
valid_x, valid_y = rest_x[valid_inds], rest_y[valid_inds]
print("Valid statistics")
print(pd.Series(valid_y).value_counts())

# Rest goes in training
train_inds = [i for i in range(rest_x.shape[0]) if i not in valid_inds]
train_x, train_y = rest_x[train_inds], rest_y[train_inds]
print("Train statistics")
print(pd.Series(train_y).value_counts())

Test statistics
1    100
0    100
dtype: int64
Valid statistics
1    100
0    100
dtype: int64
Train statistics
0    4627
1     547
dtype: int64


In [24]:
from sklearn.feature_extraction.text import CountVectorizer

# To use near miss algorithm, we need a numerical representation of the messages
# We will use the bag of words representation
countvec = CountVectorizer()
train_bow = countvec.fit_transform(train_x.reshape(-1).tolist())

# NearMiss is a common undersampling technique
oss = NearMiss()
X_res, y_res = oss.fit_resample(train_bow, train_y)
train_inds = oss.sample_indices_

train_x, train_y = train_x[train_inds], train_y[train_inds]

print(pd.Series(train_y).value_counts())

1    547
0    547
dtype: int64
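
To give some intuition for the undersampling step, here is a small pure-Python sketch of the idea behind version 1 of NearMiss: keep the majority-class points whose average distance to their `k` closest minority-class points is smallest. It assumes the default `NearMiss()` behaves like this version with three neighbours. The function name `near_miss_1` and the 2-D toy points are hypothetical; the notebook applies the library to bag-of-words vectors instead.

```python
import math


def near_miss_1(majority, minority, k=3):
    """Return indices of the majority points to keep (NearMiss version 1 idea)."""
    k = min(k, len(minority))
    scores = []
    for idx, point in enumerate(majority):
        # Distances from this majority point to every minority point
        dists = sorted(math.dist(point, m) for m in minority)
        # Average distance to its k closest minority points
        scores.append((sum(dists[:k]) / k, idx))
    # Keep as many majority points as there are minority points,
    # preferring those that lie closest to the minority class
    keep = sorted(scores)[:len(minority)]
    return sorted(idx for _, idx in keep)


# Toy data: 3 minority points near the origin, 6 majority points
minority = [(0, 0), (1, 0), (0, 1)]
majority = [(0.5, 0.5), (5, 5), (1, 1), (9, 9), (0, 2), (7, 1)]
print(near_miss_1(majority, minority))
```

Because the retained majority examples sit close to the minority class, the resulting training set concentrates on the harder, more ambiguous messages.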


## Analyse some of the removed samples

## Downloading the BERT model

Here we download the BERT model from TensorFlow Hub and create a Keras layer from it.

In [30]:
import tensorflow_hub as hub

bert_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2"
bert_layer = hub.KerasLayer(bert_url)

## Defining the inputs for the BERT model

The BERT model needs three inputs:

* Input word IDs - The IDs of the tokens generated from the text, padded with zeros to `max_seq_length`
* Input mask - A matrix of the same shape as `input_word_ids` that indicates whether each element is a real token (1) or padding (0)
* Segment IDs - A matrix of the same shape as `input_word_ids` that indicates which sentence/sequence each token belongs to (0 for the first sequence, 1 for the second)
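
As a minimal sketch of how these three inputs relate to each other, the snippet below builds them for a single short sequence using plain Python lists. The helper name `build_bert_inputs` is an assumption made for illustration; the token IDs are the ones this notebook prints for `[CLS]`, `she`, `sells`, and `[SEP]`.

```python
def build_bert_inputs(token_ids, max_seq_length):
    """Pad (or truncate) one token-ID list and derive the mask and segment IDs."""
    ids = token_ids[:max_seq_length]
    n_pad = max_seq_length - len(ids)
    input_word_ids = ids + [0] * n_pad          # 0 is the [PAD] ID
    input_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    segment_ids = [0] * max_seq_length          # single sequence, so all zeros
    return input_word_ids, input_mask, segment_ids


# [CLS] she sells [SEP]
word_ids, mask, segments = build_bert_inputs([101, 2016, 15187, 102], max_seq_length=8)
print(word_ids)
print(mask)
print(segments)
```

Spam classification feeds one message at a time, so the segment IDs stay at zero throughout; they only become informative for sentence-pair tasks.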

In [26]:
max_seq_length = 128  # Your choice here.

# Contains input token ids
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
# Contains input mask values
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                   name="input_mask")
# Contains input type (whether token belongs to sequence A or B) values
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="segment_ids")
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2",
                            trainable=True)

# Check the outputs of the Bert layer
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
print(pooled_output.shape)
print(sequence_output.shape)

(None, 768)
(None, 128, 768)


## Analysing the vocabulary of BERT

In [46]:
import official.nlp.bert.tokenization as tokenization

# Get the vocab file path from the BERT layer
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
print("Vocabulary file is at: {}".format(vocab_file))
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
print("Configuration {} of BERT: {}".format("do_lower_case", do_lower_case))

# Define a tokenizer
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

Vocabulary file is at: b'C:\\Users\\thush\\AppData\\Local\\Temp\\tfhub_modules\\ce53fe6769d2ac3a260e92555120c54e1aecbea6\\assets\\vocab.txt'
Configuration do_lower_case of BERT: True


## Understanding tokenization in BERT

In [44]:
text = "She sells seashells by the seashore"
print("Original text: {}".format(text))
tokens = tokenizer.tokenize(text)
print("Tokens generated by BERT: {}".format(tokens))
ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs generated by BERT: {}".format(ids))

Original text: She sells seashells by the seashore
Tokens generated by BERT: ['she', 'sells', 'seas', '##hell', '##s', 'by', 'the', 'seas', '##hore']
Token IDs generated by BERT: [2016, 15187, 11915, 18223, 2015, 2011, 1996, 11915, 16892]
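
The split of `seashells` into `seas`, `##hell`, and `##s` comes from WordPiece tokenization, which greedily matches the longest vocabulary entry from the left and marks word-internal pieces with `##`. The sketch below illustrates that greedy matching on a tiny, made-up vocabulary; the function name `wordpiece` and the vocabulary are assumptions, and the real tokenizer also handles lowercasing, punctuation, and a vocabulary of roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, just large enough for the example sentence
toy_vocab = {"she", "sells", "seas", "##hell", "##s", "by", "the", "##hore"}
sentence = "she sells seashells by the seashore"
print([p for w in sentence.split() for p in wordpiece(w, toy_vocab)])
```

Breaking rare words into frequent sub-words is what lets a fixed vocabulary cover informal SMS spellings without mapping most of them to an unknown token.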


## Special tokens used by BERT

In [47]:
special_tokens = ['[CLS]', '[SEP]', '[MASK]', '[PAD]']
ids = tokenizer.convert_tokens_to_ids(special_tokens)
for t, i in zip(special_tokens, ids):
    print("Token: {} has ID: {}".format(t, i))

Token: [CLS] has ID: 101
Token: [SEP] has ID: 102
Token: [MASK] has ID: 103
Token: [PAD] has ID: 0


## Encoding the sentence in a way suitable to BERT

Here we add the `[CLS]` token to the front of the input and the `[SEP]` token to the end.

In [None]:
def encode_sentence(s):
    """ Encode a given sentence by tokenizing it and adding special tokens """
    
    tokens = ['[CLS]'] + list(tokenizer.tokenize(s)) + ['[SEP]']
    return tokenizer.convert_tokens_to_ids(tokens)

## Analysing sequence length

In [37]:
seq_lengths = pd.Series([len(encode_sentence(str(s))) for s in train_x])
seq_lengths.describe(percentiles=[0.25, 0.5, 0.75, 0.9])

count    1094.000000
mean       33.248629
std        19.922181
min         9.000000
25%        15.000000
50%        22.000000
75%        53.000000
90%        60.000000
max        76.000000
dtype: float64
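
These statistics help when choosing a maximum sequence length: 90% of the training messages are at most 60 tokens long, so a limit of 60 truncates only the longest tenth. The sketch below shows one way to compare candidate limits by the share of sequences they would cut; the helper `truncated_fraction` and the toy lengths are illustrative assumptions rather than values from the dataset.

```python
def truncated_fraction(lengths, max_len):
    """Fraction of sequences longer than max_len (these would be truncated)."""
    return sum(1 for n in lengths if n > max_len) / len(lengths)


# Made-up sequence lengths, roughly spanning the range reported above
toy_lengths = [9, 15, 22, 22, 31, 53, 58, 60, 64, 76]
for candidate in (32, 60, 128):
    print(candidate, truncated_fraction(toy_lengths, candidate))
```

A shorter limit reduces memory use and training time, at the cost of discarding the tail of long messages.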

## Generating the correct input format for BERT

The BERT model needs three inputs. These are collected in a dictionary with the following keys:

* `input_word_ids` - The IDs of the tokens generated from the text, padded with zeros to `max_seq_length`
* `input_mask` - A matrix of the same shape as `input_word_ids` that indicates whether each element is a real token (1) or padding (0)
* `input_type_ids` - A matrix of the same shape as `input_word_ids` that indicates which sentence/sequence each token belongs to (0 for the first sequence, 1 for the second)

In [38]:
def get_bert_inputs(docs, max_seq_len=None):
    """ Generate inputs for BERT using a set of documents """
    
    tokens = tf.ragged.constant([encode_sentence(str(s)) for s in docs])
    
    print("Shape of the ragged input: {}".format(tokens.shape))
    tokens_pad = tokens.to_tensor()
    
    if max_seq_len and max_seq_len - tokens_pad.shape[1]>0:
        # If the specified max_seq_len is greater than the size of the padded tensor
        more_pad = tf.zeros(shape=[tokens_pad.shape[0], max_seq_len - tokens_pad.shape[1]], dtype='int32')
        tokens_pad = tf.concat([tokens_pad, more_pad], axis=1)
    elif max_seq_len and max_seq_len - tokens_pad.shape[1]<0:
        # If the specified max_seq_len is smaller than the size of the padded tensor
        tokens_pad = tokens_pad[:, :max_seq_len]
        
    # Which are actual words (1) and which are padding (0); kept as int32 like the other inputs
    tokens_mask = tf.cast((tokens_pad != 0), 'int32')
    # Which sentence each token belongs to
    tokens_type = tf.zeros_like(tokens_pad)
    print("Shape of the transformed input: {}".format(tokens_pad.shape))
    
    # Final output
    return {
        'input_word_ids': tokens_pad,
        'input_mask': tokens_mask,
        'input_type_ids': tokens_type
    }

# Creating train/validation/test data
train_inputs = get_bert_inputs(train_x, max_seq_len=60)
valid_inputs = get_bert_inputs(valid_x, max_seq_len=60)
test_inputs = get_bert_inputs(test_x, max_seq_len=60)
    

Shape of the ragged input: (1094, None)
Shape of the transformed input: (1094, 60)
Shape of the ragged input: (200, None)
Shape of the transformed input: (200, 60)
Shape of the ragged input: (200, None)
Shape of the transformed input: (200, 60)


## Creating a downstream classifier from BERT

In [48]:
from official.nlp import bert
import yaml

# https://github.com/tensorflow/models/blob/master/official/nlp/configs/models/bert_en_uncased_base.yaml
with open(os.path.join("data", "bert_en_uncased_base.yaml"), 'r') as stream:
    config_dict = yaml.safe_load(stream)['task']['model']['encoder']['bert']

# Generate BERT config
bert_config = bert.configs.BertConfig.from_dict(config_dict)

# Print BERT config
print("BERT Config")
print(bert_config.to_dict())

# Generating a classifier and the encoder
hub_classifier, hub_encoder = bert.bert_models.classifier_model(
    # Caution: Most of `bert_config` is ignored if you pass a hub url.
    bert_config=bert_config, hub_module_url=bert_url, num_labels=2
)


BERT Config
{'vocab_size': 30522, 'hidden_size': 768, 'num_hidden_layers': 12, 'num_attention_heads': 12, 'hidden_act': 'gelu', 'intermediate_size': 3072, 'hidden_dropout_prob': 0.1, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 512, 'type_vocab_size': 2, 'initializer_range': 0.02, 'embedding_size': None, 'backward_compatible': True, 'attention_dropout_rate': 0.1, 'dropout_rate': 0.1, 'hidden_activation': 'gelu', 'num_layers': 12}


## Defining the optimizer

In [39]:
# Set up epochs and steps
epochs = 3
batch_size = 64
eval_batch_size = 64

train_data_size = train_x.shape[0]
steps_per_epoch = int(train_data_size / batch_size)
num_train_steps = steps_per_epoch * epochs
warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)

# creates an optimizer with learning rate schedule
optimizer = nlp.optimization.create_optimizer(
    2e-6, num_train_steps=num_train_steps, num_warmup_steps=warmup_steps)
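
To make the step arithmetic and the shape of the learning rate schedule concrete, here is a small pure-Python sketch. It assumes the schedule is a linear warmup followed by a linear decay to zero, which is the usual setup for BERT fine-tuning; the exact formula used inside `create_optimizer` may differ, and the function name `lr_at` is hypothetical. The training set size of 1094 is taken from the split above.

```python
epochs, batch_size, train_data_size = 3, 64, 1094  # values from this notebook
init_lr = 2e-6

steps_per_epoch = int(train_data_size / batch_size)
num_train_steps = steps_per_epoch * epochs
warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)


def lr_at(step):
    """Assumed schedule: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return init_lr * step / warmup_steps
    return init_lr * max(0.0, 1.0 - step / num_train_steps)


print(steps_per_epoch, num_train_steps, warmup_steps)
for step in (0, 2, 5, 25, 51):
    print(step, lr_at(step))
```

With these settings there are 17 steps per epoch and 51 steps in total, of which the first 5 are warmup, so the learning rate only gets close to its peak for a short part of training.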


## Finetuning BERT and the classifier

In [40]:
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile the model
hub_classifier.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=metrics)

# Train the model
hub_classifier.fit(
      x=train_inputs, 
      y=train_y,
      validation_data=(valid_inputs, valid_y),
      validation_batch_size=eval_batch_size,
      batch_size=batch_size,
      epochs=epochs)



Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x216b64bd0b8>
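
The loss used here is sparse categorical cross-entropy computed from logits: the classifier outputs one raw score per class, and the loss for an example is the negative log of the softmax probability assigned to its true label. The following pure-Python sketch computes that quantity for a single example; the function name `sparse_ce_from_logits` and the logit values are made up for illustration.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


print(sparse_ce_from_logits([0.0, 0.0], 1))   # uninformative logits
print(sparse_ce_from_logits([-2.0, 3.0], 1))  # confident and correct
print(sparse_ce_from_logits([-2.0, 3.0], 0))  # confident and wrong
```

For two classes, equal logits give a loss of ln 2 (about 0.693), which is a handy reference point when reading the loss values reported during training and evaluation.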

## Save the model

In [42]:
os.makedirs('models', exist_ok=True)
tf.keras.models.save_model(hub_classifier, os.path.join('models', 'bert_spam.h5'))

## Evaluating the model on the test data

In [43]:
hub_classifier.evaluate(test_inputs, test_y)



[0.6432445049285889, 0.7950000166893005]
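
The two numbers returned by `evaluate` are the loss followed by the accuracy metric the model was compiled with. As a rough sketch of how accuracy follows from the classifier's raw outputs, the snippet below turns per-class logits into predicted labels with an argmax and compares them with the true labels. The helper names and the logit values are invented for illustration and are not outputs of the trained model.

```python
def predict_labels(logits):
    """Pick the class with the largest logit for every example."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(y_true, y_pred):
    """Share of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up logits for four messages (columns: ham, spam)
toy_logits = [[2.1, -1.0], [0.3, 0.9], [-0.5, 1.7], [1.2, 0.4]]
toy_labels = [0, 1, 1, 1]
toy_pred = predict_labels(toy_logits)
print(toy_pred, accuracy(toy_labels, toy_pred))
```

Because the test set is balanced, accuracy is a reasonable summary here; on the original, imbalanced data, per-class precision and recall would be more informative.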