# Text classification of clickbait headlines
## Transformer models: DistilBERT

Transformer models represent each word by combining three things: its meaning (through word embeddings), its position in the sequence, and the amount of attention the model needs to pay to the other words in the sequence in order to capture its meaning in context. BERT and its offshoots are general-purpose language models that can be fine-tuned for a range of natural language tasks, such as text summarisation, question answering, grammar correction and text classification.

## Load in dependencies and data

In [1]:
# https://medium.com/geekculture/hugging-face-distilbert-tensorflow-for-custom-text-classification-1ad4a49e26a7
import pandas as pd
import numpy as np

from transformers import (
    DataCollatorWithPadding,
    create_optimizer,
    TFAutoModelForSequenceClassification
)

In [2]:
cwd = "Users/jodie.burchell/Documents/git/text-to-vectors"

In [29]:
# Read in train and validation sets
clickbait_train = pd.read_csv(f"{cwd}/data/clickbait_train.csv", sep="\t", header=0)
clickbait_val = pd.read_csv(f"{cwd}/data/clickbait_val.csv", sep="\t", header=0)

## Convert Pandas DataFrame into Dataset format

In [30]:
from datasets import Dataset
clickbait_train_ds = Dataset.from_pandas(clickbait_train)
clickbait_val_ds = Dataset.from_pandas(clickbait_val)

## Tokenise data

In Transformer models, raw text is taken in, tokenised, and each token is converted to an ID that matches an entry in the pretrained model's vocabulary. This is done by loading a tokenizer with `AutoTokenizer.from_pretrained`, passing the name of the model you want to fine-tune. We will be using [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert), a smaller, lighter version of BERT that preserves over 95% of BERT's performance as measured on the GLUE language understanding benchmark.

In [31]:
# Load the tokenizer associated with our model
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(rows):
    return tokenizer(rows["text"], padding=True)

In [6]:
# Tokenise the train and validation data using the AutoTokeniser
tokenized_train = clickbait_train_ds.map(preprocess_function, batched=True)
tokenized_val = clickbait_val_ds.map(preprocess_function, batched=True)

In [34]:
# Print out an example of the tokenised text.
tokenized_train[0]

{'text': 'New insulin-resistance discovery may help diabetes sufferers',
 'label': 0,
 'input_ids': [101, 2047, 22597, 1011, 5012, 5456, 2089, 2393, 14671, 9015, 2545, 102,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

The two values output during tokenisation that we need are the `input_ids` and the `attention_mask`. These are explained in more detail in [this](https://www.youtube.com/watch?v=Yffk5aydLzg&t=16s) and [this](https://www.youtube.com/watch?v=M6adb1j2jPI&t=166s) video. Let's start by examining the `input_ids`.

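Our DistilBERT tokenizer accepts raw text as input, lower-cases it, and keeps punctuation as separate tokens. In addition, it splits some words into smaller subword pieces, such as a stem and a suffix, a little like what we did with lemmatisation. Pieces that continue a word are marked with a leading `##`.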

In [8]:
# Tokenising a headline into words
tokens = tokenizer.tokenize(clickbait_train["text"][0])
print(tokens)

['new', 'insulin', '-', 'resistance', 'discovery', 'may', 'help', 'diabetes', 'suffer', '##ers']
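
To make the subword splitting concrete, here is a minimal sketch of greedy longest-match-first splitting, the general strategy that WordPiece-style tokenisers such as DistilBERT's use. The tiny `toy_vocab` and the `wordpiece_split` helper are made up for illustration; the real tokenizer ships with a vocabulary of roughly 30,000 entries and also handles lower-casing and punctuation.

```python
# Hypothetical toy vocabulary; the real one ships with the pretrained model.
toy_vocab = {"new", "insulin", "suffer", "##ers", "##ing", "help", "[UNK]"}

def wordpiece_split(word, vocab):
    """Greedily match the longest vocabulary entry from left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece_split("sufferers", toy_vocab))
print(wordpiece_split("helping", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```

With this toy vocabulary, "sufferers" splits into the same two pieces as in the tokenizer output above, while a word with no matching pieces falls back to the unknown token.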


These tokens are then mapped to an ID, based on a dictionary that was created during DistilBERT's training.

In [9]:
# Mapping the tokens to their IDs in the model
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

[2047, 22597, 1011, 5012, 5456, 2089, 2393, 14671, 9015, 2545]
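
The lookup itself is just a dictionary access. The sketch below uses a hand-copied fragment of the vocabulary, taken from the IDs printed above, to show the mapping. It also explains the `101` and `102` that bracket the sequence in the tokenised dataset: in the `distilbert-base-uncased` vocabulary these are the IDs of the special `[CLS]` and `[SEP]` tokens, which the tokenizer adds when it is called directly on text. `vocab_fragment` and `encode` are illustrative names, not part of the library.

```python
# Fragment of the vocabulary, copied from the output above plus the special tokens.
vocab_fragment = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "new": 2047, "insulin": 22597, "-": 1011, "resistance": 5012,
    "discovery": 5456, "may": 2089, "help": 2393, "diabetes": 14671,
    "suffer": 9015, "##ers": 2545,
}

def encode(tokens, vocab):
    """Wrap tokens in [CLS] ... [SEP] and look up each ID, falling back to [UNK]."""
    wrapped = ["[CLS]"] + tokens + ["[SEP]"]
    return [vocab.get(token, vocab["[UNK]"]) for token in wrapped]

headline_tokens = ["new", "insulin", "-", "resistance", "discovery",
                   "may", "help", "diabetes", "suffer", "##ers"]
encoded = encode(headline_tokens, vocab_fragment)
print(encoded)
```

The result matches the non-zero part of the `input_ids` shown for the first training example.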


However, as we know from the text-processing methods we covered previously, all inputs to the model must be the same length. It doesn't take long to find two headlines of very different lengths.

In [10]:
# Extract two sentences with very different lengths
different_length_sentences = clickbait_train["text"][3:5]

# Print out their tokens
print("Raw sentences")
tokens = [tokenizer.tokenize(sentence) for sentence in different_length_sentences.tolist()]
print(tokens[0])
print(tokens[1])

# Print out the input ID mappings
print("\nConverted to IDs")
ids = [tokenizer.convert_tokens_to_ids(token) for token in tokens]
print(ids[0])
print(ids[1])

Raw sentences
['irish', 'developer', 'found', 'dead', 'in', 'his', 'home']
['boat', 'accident', 'in', 'democratic', 'republic', 'of', 'the', 'congo', 'kills', 'at', 'least', '73']

Converted to IDs
[3493, 9722, 2179, 2757, 1999, 2010, 2188]
[4049, 4926, 1999, 3537, 3072, 1997, 1996, 9030, 8563, 2012, 2560, 6421]


The tokeniser lets you apply padding so that shorter sequences have the same length as longer ones. As you can see here, the shorter sentence has been padded out with zeros to make it the same length as the longer one.

In [11]:
# Show the two sentences above with padding added - they are now the same length!
padded_tokenizer = tokenizer(different_length_sentences.tolist(), padding = True)
print(padded_tokenizer["input_ids"][0])
print(padded_tokenizer["input_ids"][1])

[101, 3493, 9722, 2179, 2757, 1999, 2010, 2188, 102, 0, 0, 0, 0, 0]
[101, 4049, 4926, 1999, 3537, 3072, 1997, 1996, 9030, 8563, 2012, 2560, 6421, 102]
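
Conceptually, padding to the longest sequence in a batch amounts to the following. This is a plain-Python sketch rather than the tokenizer's actual implementation; `pad_batch` is a hypothetical helper, and the pad ID of 0 is the one visible in the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# The two headlines from above, already converted to IDs with [CLS] and [SEP] added
unpadded = [
    [101, 3493, 9722, 2179, 2757, 1999, 2010, 2188, 102],
    [101, 4049, 4926, 1999, 3537, 3072, 1997, 1996, 9030, 8563, 2012, 2560, 6421, 102],
]
padded = pad_batch(unpadded)
for row in padded:
    print(row)
```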


However, there is one remaining issue. The attention mechanism within the model doesn't know that these padding IDs carry no meaning, and if we don't instruct the model to ignore them, they will distort its predictions. For this reason, the tokeniser generates an attention mask for each sentence alongside the input IDs. This is a vector of the same length as the input IDs, with a 1 wherever the model should use the token and a 0 wherever it should ignore it. We can see that the attention mask for sentence 1 instructs the model to ignore all of the padded tokens.

In [12]:
# Show how the attention mask works
print("Sentence 1")
print(padded_tokenizer["input_ids"][0])
print(padded_tokenizer["attention_mask"][0])

print("\nSentence 2")
print(padded_tokenizer["input_ids"][1])
print(padded_tokenizer["attention_mask"][1])

Sentence 1
[101, 3493, 9722, 2179, 2757, 1999, 2010, 2188, 102, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

Sentence 2
[101, 4049, 4926, 1999, 3537, 3072, 1997, 1996, 9030, 8563, 2012, 2560, 6421, 102]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
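
One common way a mask like this is applied inside an attention layer is to set the scores of masked positions to minus infinity (or a very large negative number) before the softmax, so that they receive zero weight. The sketch below illustrates that idea with made-up scores; it is not DistilBERT's actual code, and `masked_softmax` is a hypothetical helper.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight to positions where mask == 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0, 3.0]  # made-up attention scores
mask = [1, 1, 1, 0, 0]              # the last two positions are padding
weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

Even though the padded positions have the highest raw scores in this example, they end up with no weight, and the remaining weights still sum to one.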


## Convert Dataset into Tensors

As we're using a TensorFlow model, we need to convert the Hugging Face Dataset into something that TensorFlow can understand. That means converting each sentence, with its input IDs, attention mask and label, into TensorFlow tensors. To make sure that every sequence within a batch is padded to the same length, we can use a `DataCollatorWithPadding` to even this out before we get to model training.

In [13]:
# Create data collator, which pads the sequences in each batch so that
# all inputs in that batch are the same length
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

In [14]:
# Convert the train and validation sets into tensors
tf_train_set = tokenized_train.to_tf_dataset(
    columns=["attention_mask", "input_ids", "label"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator
)

tf_val_set = tokenized_val.to_tf_dataset(
    columns=["attention_mask", "input_ids", "label"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator
)



Comparing this to the information contained in the Dataset, we have the same fields: `input_ids`, `attention_mask` and `labels` (the data collator renames the `label` column to `labels`, the name the model expects). When we print out the first example, you can see that the values are also the same.

In [15]:
tf_train_set

<PrefetchDataset element_spec={'input_ids': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'labels': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>

In [16]:
for sentence in tf_train_set.take(1):
    print(sentence["input_ids"][0])
    print(sentence["attention_mask"][0])
    print(sentence["labels"][0])

tf.Tensor(
[  101  2047 22597  1011  5012  5456  2089  2393 14671  9015  2545   102
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0], shape=(28,), dtype=int64)
tf.Tensor([1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], shape=(28,), dtype=int64)
tf.Tensor(0, shape=(), dtype=int64)
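
The `(None, None)` shapes in the element spec are consistent with neither the batch size nor the sequence length being fixed in advance: a collator that pads dynamically only needs each batch to be internally consistent. The following sketch mimics that behaviour in plain Python with made-up token IDs. `make_batches` and `collate` are hypothetical helpers, not the library's implementation.

```python
def make_batches(examples, batch_size):
    """Yield consecutive slices of batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

def collate(batch, pad_id=0):
    """Pad one batch to its own longest sequence and build matching attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up sequences of different lengths
toy_examples = [[101, 7, 102], [101, 7, 8, 9, 102], [101, 102], [101, 5, 102]]
collated_batches = [collate(batch) for batch in make_batches(toy_examples, batch_size=2)]
for collated in collated_batches:
    print(len(collated["input_ids"][0]), collated["attention_mask"])
```

Here the first batch is padded to five tokens and the second to three, so the padded length can differ from batch to batch.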


## Fine tuning the DistilBERT model

We can now get to fine-tuning our DistilBERT model. We first load the model using `TFAutoModelForSequenceClassification`. This discards the output layers that the original DistilBERT model used for pretraining and adds a new, randomly initialised classification head with two outputs, as the warning printed after the next cell confirms. Fine-tuning trains this new head, and by default also updates the pretrained weights, to create our BERT-based clickbait classifier.

In [17]:
# Import the pretrained distilBERT model with a new final layer which we'll use for classifying clickbait titles
bert_model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'activation_13', 'vocab_projector', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_19']
You should probably TRAIN this model on a down-stream task to be able to use i

In order to train the model, we need an optimiser. The code below creates an optimiser whose learning rate decays over the total number of training steps, which we calculate from the number of batches per epoch and the number of planned epochs.

In [18]:
batch_size = 16
num_epochs = 3
batches_per_epoch = len(tokenized_train) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
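
To see what such a schedule does to the learning rate, here is a plain-Python sketch of a linear decay from the initial rate down to zero with no warmup, which, as far as we are aware, is what `create_optimizer` does by default with these arguments. `linear_decay_lr` is a hypothetical helper, and the dataset size of 16,000 rows is an assumed figure for illustration, not the size of the clickbait training set.

```python
def linear_decay_lr(step, init_lr, total_steps):
    """Learning rate after `step` updates, decaying linearly from init_lr to 0."""
    remaining = max(total_steps - step, 0)
    return init_lr * remaining / total_steps

# Assumed figures for illustration only
example_rows = 16_000
example_batch_size = 16
example_epochs = 3
example_total_steps = (example_rows // example_batch_size) * example_epochs

for step in (0, example_total_steps // 2, example_total_steps):
    print(step, linear_decay_lr(step, init_lr=2e-5, total_steps=example_total_steps))
```

The rate starts at `2e-5`, is halved by the midpoint of training, and reaches zero at the final step.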

In [19]:
bert_model.compile(optimizer=optimizer,
                   metrics=["accuracy"])

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


Finally, we can fine-tune our model for clickbait classification!

In [20]:
# Fine tune our model using the training data
bert_model.fit(x=tf_train_set,
               validation_data=tf_val_set,
               epochs=num_epochs)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x105218340>

In [21]:
# Generate model predictions
preds = bert_model.predict(tf_val_set).logits
pred_val_labels = np.argmax(preds, axis=1)
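
The model returns one raw score (logit) per class for each headline, and `np.argmax` picks the class with the largest score. The sketch below reproduces that step in plain Python on made-up logits, and also shows how a softmax would turn them into probabilities, should you want a confidence alongside the label. `softmax` and `example_logits` are illustrative names, and treating index 1 as the clickbait class is an assumption based on the label of 0 attached to the news headline shown earlier.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three headlines: [score for class 0, score for class 1]
example_logits = [[2.1, -1.3], [-0.4, 3.2], [0.2, 0.1]]

predicted_labels = []
for logits in example_logits:
    probs = softmax(logits)
    label = probs.index(max(probs))  # same choice as an argmax over the logits
    predicted_labels.append(label)
    print(label, [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits directly, as the cell above does, gives the same labels.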



In [23]:
# Add a column with the predictions to the validation data
clickbait_val["bert_pred"] = pred_val_labels
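
With the predictions stored next to the true labels, the two kinds of mistake can be separated by comparing the columns. The sketch below does this in plain Python on a handful of hand-made rows. The headlines are taken from this section, but the predictions attached to them here are assumptions for illustration (with 1 taken to mean clickbait), and this is not necessarily how the CSV files of incorrect predictions were produced.

```python
# Hand-made rows: (headline, true label, predicted label); 1 = clickbait (assumed)
example_rows = [
    ("Richard Madden Looking Attractive On A Horse", 1, 0),
    ("A Note to Readers", 0, 1),
    ("New insulin-resistance discovery may help diabetes sufferers", 0, 0),
]

# Clickbait headlines the model labelled as not clickbait
missed_clickbait = [text for text, label, pred in example_rows if label == 1 and pred == 0]
# Non-clickbait headlines the model labelled as clickbait
false_alarms = [text for text, label, pred in example_rows if label == 0 and pred == 1]
# Share of rows where the prediction matches the label
accuracy = sum(label == pred for _, label, pred in example_rows) / len(example_rows)

print(missed_clickbait)
print(false_alarms)
print(round(accuracy, 3))
```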

In [3]:
# Headlines the model thought were not clickbait, but which are
pd.read_csv(f"{cwd}/data/bert_incorrect_prediction_not_clickbait.csv",
            sep="\t",
            header=0)

Unnamed: 0,text
0,Photographer Gregory Crewdson Releases Hauntin...
1,Oscar-Nominated Movie Posters With White Actor...
2,Richard Madden Looking Attractive On A Horse
3,"Inside China's Memefacturing Factories, Where ..."
4,A Dutch Organization Is Providing Free Abortio...


In [4]:
# Headlines the model thought were clickbait, but which are not
pd.read_csv(f"{cwd}/data/bert_incorrect_prediction_clickbait.csv",
            sep="\t",
            header=0)

Unnamed: 0,text
0,"Avenged Sevenfold drummer James ""The Rev"" Sull..."
1,Dolls Resembling Daughters Displease First Lady
2,A Note to Readers
3,"Add Nuts to Your Diet With Sauces, Not Snacks"
4,How Bethpage Black Was Mastered (For a Day) By...
