# Text Analytics Lab 5: Pretrained Language Models

This notebook introduces the Transformers library from HuggingFace, which we can use to access a wide range of pretrained language models. The sections are:

   1. **Introducing Transformers:** This section introduces the Transformers library from HuggingFace, showing you how to use it to obtain contextualised embeddings from pretrained transformer models.
   1. **Transformers for Text Classification:** Here we show you how to construct a classifier using Transformers.
   1. **OPTIONAL: More on Transformers:** Some pointers to other materials if you want to learn more about transformers, e.g., if using them in your summer project. 

Example code for all the tasks has been tested on a four-year old MacBook Pro, and the longest training process took under 10 minutes. If you find that the code takes too long to run on your own machine, a good alternative is to use [Google Colab](https://colab.research.google.com/), Amazon Sagemaker Studio, or hte lab machines on campus. 

## Learning Outcomes

These sections will contain tutorial-like instructions, as you have seen in previous text analytics labs. On completing these sections, the intended learning outcomes are that you will be able to...
1. Use pretrained transformers to obtain contextualised word and sentence embeddings.
1. Apply a pretrained QA model to a new dataset. 
1. Construct classifiers with pretrained transformers. 
1. Find documentation on pretrained models in the Transformers library.

In [1]:
import numpy as np
import torch 
from datasets import load_dataset

cache_dir = "./data_cache"

# 1. Introducing Transformers 

HuggingFace is a company that has developed an open source library for loading pretrained transformer models. They also distribute many models that have been pretrained using language modelling tasks, or fine-tuned to specific downstream NLP tasks.  It is currently the best library to use to create NLP models on top of large, deep neural networks. This is especially useful for tasks where simpler, feature-based methods or smaller LSTM models do not perform well enough, for example, when complex processing of syntax and semantics is required (natural language 'understanding'). 

The larger models often give great performance, but the trade-off is that they require a lot of memory and compute. When building a model for a new dataset, it is a good idea to compare faster models with transformers to determine whether the performance/cost trade-off is worth it on that particular dataset. 

Let's start by looking at two key types of object in the transformers library: models and tokenizers.

## 1.1. Models

The neural network models available in the Transformers library are accessed through wrapper classes such as `AutoModel`. If we want to load a pretrained model, we can simply pass its name to the `from_pretrained` function, and the pretrained model weights will be downloaded from HuggingFace and a neural network model will be created with those weights. For example:

In [2]:
from transformers import AutoModel

model = AutoModel.from_pretrained("huawei-noah/TinyBERT_General_4L_312D") 

This code loads the TinyBERT model, which is a compressed version of BERT. It has 4.4 million parameters, compared to the standard version of BERT, 'BERT-base', which has 110 million parameters. While TinyBERT will not perform as well as larger models, we will use it for this notebook to save memory and computation costs. See [documentation here](https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D).

<!--the RoBERTa variant of BERT. It has 4.4 million parameters, compared to the standard version of BERT, 'BERT-base', which has 110 million parameters. While RoBERTa-tiny will not perform as well as larger models, we will use it for this notebook to save memory and computation costs. See [documentation here](https://huggingface.co/arampacha/roberta-tiny).  -->

The same functions can be used to load other models from HuggingFace's repository simply by changing the model's name. Take a look at [the Models page](https://huggingface.co/models) so see what there is on offer. Do you recognise any of the models' names?

# 1.2. Tokenizers

Before we can apply a model to some text, we need to a create Tokenizer object. In Transfomers, Tokenizer objects convert raw text to a sequence of numbers. First, the tokenizer actually performs tokenization, then it maps each token to its numerical ID. There are lots of different tokenizers that we can use to preprocess text. If we are loading a pretrained model, we will need to choose the tokenizer that corresponds to that model. 

**TO-DO 1:** Why is it necessary to choose a matching tokenizer for a pretrained model?

We can load the right tokenizer as follows, in the same way we loaded the model itself:

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")

Let's see what the TinyBERT tokenizer does to an example sentence:

In [4]:
sentence = "The transformer architecture has transformed the field of NLP."

tokens = tokenizer.tokenize(sentence)
print(tokens)

['the', 'transform', '##er', 'architecture', 'has', 'transformed', 'the', 'field', 'of', 'nl', '##p', '.']


Let's compare with the NLTK tokenizer we have seen before:

In [5]:
from nltk.tokenize import word_tokenize

nltk_tokens = word_tokenize(sentence)
print(nltk_tokens)

['The', 'transformer', 'architecture', 'has', 'transformed', 'the', 'field', 'of', 'NLP', '.']


While NLTK keeps whole words as tokens, the BERT tokenizer splits some words into sub-words and inserts some special characters into the tokens. Splitting is applied to words with low frequency in the training set, such as 'transformer'. 

Rather than following a set of hand-crafted rules, the BERT tokenizer is learned from a large dataset. It starts by adding individual characters to its vocabulary. Then, it adds the most frequently occurring pairs of characters as tokens in the vocabulary. This repeats by adding the most frequent pairs of tokens to the vocabulary until the desired size of dictionary is reached. When tokenizing a document, words that are not in the vocabulary are matched against the shorter sub-word tokens.

**TO-DO 2:** What is the benefit of splitting some words into sub-word tokens? 

WRITE YOUR ANSWER HERE.

Rare/out-of-vocabulary words can often be broken into constituent parts, like stems/root forms of a verb, suffixes, prefixes, and other parts of words. The meaning can be composed from these parts, and these parts may convey syntactic or semantic information themselves that is useful for processing the whole sentence. Using subwords to represent rare words also allows us to limit vocabulary size.

---

After tokenization, the Tokenizer object can also map the tokens to their IDs (indexes in the vocabulary), so that we can pass them as input to a neural network:

In [6]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[1996, 10938, 2121, 4294, 2038, 8590, 1996, 2492, 1997, 17953, 2361, 1012]


Let's load up a dataset that we can use for our experiments later on. We will use the [TweetEval hate speech](https://huggingface.co/datasets/tweet_eval) dataset to train and test a classifier. The task is to classify tweets into one of  0: non-hate or 1: hate.

In [7]:
from datasets import load_dataset

#cache_dir = './data_cache/'

# Load up the emotion dataset...
train_dataset = load_dataset(
    "tweet_eval",
    name="hate",
    split="train",
    #cache_dir=cache_dir,
)
print(f"Training dataset with {len(train_dataset)} instances loaded")

val_dataset = load_dataset(
    "tweet_eval",
    name="hate",
    split="validation",
    #cache_dir=cache_dir,
)
print(f"Validation dataset with {len(val_dataset)} instances loaded")

test_dataset = load_dataset(
    "tweet_eval",
    name="hate",
    split="test",
    #cache_dir=cache_dir,
)
print(f"Test dataset with {len(test_dataset)} instances loaded")

Training dataset with 9000 instances loaded
Validation dataset with 1000 instances loaded
Test dataset with 2970 instances loaded


Now, let's see apply our tokenizer to the dataset, using the map function to run it on all samples:

In [8]:
# tokenize...
tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")

def tokenize_function(dataset):
    model_inputs = tokenizer(dataset['text'], padding="max_length", max_length=128, truncation=True)
    return model_inputs

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [9]:
train_dataset

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 9000
})

## 1.3. Contextualised Embeddings

Now that we have a sequence of tokens, we are almost ready to process the sequence using the pretrained model. 

Our model takes as input a PyTorch `tensor` (a muli-dimensional array). Here, we need a two-dimensional matrix, where each row is a sequence of input tokens corresponding to a single sentence or document. Let's convert our list of IDs to a 2-D tensor with a single row:

In [10]:
ids_tensor = torch.tensor([ids])

print(ids_tensor)

tensor([[ 1996, 10938,  2121,  4294,  2038,  8590,  1996,  2492,  1997, 17953,
          2361,  1012]])


Now we can process the sequence using our model. The pretrained transformer model maps the sequence of input IDs to a sequence of output vectors, which are contextualised word embeddings. The hidden state values produced in the last hidden layer of the model are used as the contextualised embeddings:

In [11]:
model_outputs = model(ids_tensor)
print('The complete model outputs: ')
print(model_outputs)

print()
print('The last hidden state sequence for the first sentence in our batch (we only have one sentence in the batch): ')
embeddings = model_outputs['last_hidden_state'][0]
print(embeddings)

The complete model outputs: 
BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.3608,  0.2862, -0.1549,  ..., -0.2064,  0.2663, -0.0109],
         [ 0.0149,  0.7223, -0.0508,  ..., -0.5505,  0.2355, -0.2962],
         [ 0.1531,  0.5903, -0.1244,  ..., -0.4263,  0.0417, -0.1839],
         ...,
         [ 0.1742, -0.1091, -0.1963,  ..., -0.6736,  0.0472, -0.1840],
         [ 0.2434,  0.1021, -0.2241,  ..., -0.5400, -0.1691, -0.1314],
         [ 0.0854,  0.3272, -0.3016,  ..., -0.2154, -0.5632, -0.1921]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-1.1380e-02, -6.3005e-03,  1.8521e-02,  7.1139e-03, -3.1795e-02,
          1.3882e-02, -1.5459e-02, -1.0610e-03, -1.8263e-02, -3.6515e-02,
         -2.1257e-02, -1.5479e-02, -2.8094e-04, -4.1092e-02, -2.5315e-02,
         -4.3338e-02, -1.1617e-03, -1.3931e-02,  6.0733e-03,  4.3790e-03,
          2.7094e-04, -2.1810e-02, -4.8026e-02,  2.5493e-02, -1.6502e-02,
         -1.2034e-03,  4.2757e-02,  3.

We can retrieve the embedding vector for "transform" like this ("transform" is the second token in the sequence):

In [12]:
emb = embeddings[1]  # get second embedding in the sequence

# convert it to a numpy array so we can perform various operations on it later on
emb = emb.detach().numpy()

print(emb)
print(f'The TinyBERT embeddings have {emb.shape[0]} dimensions.')

[ 1.49152540e-02  7.22318172e-01 -5.07856756e-02 -2.74205208e-01
 -1.38931945e-01  1.00099719e+00  7.11460225e-03  2.71391630e-01
 -3.92814502e-02  6.04100786e-02  1.25740275e-01  4.60631132e-01
  6.25288114e-03  1.61929965e-01  1.23913512e-01 -4.08096790e-01
  1.24868281e-01 -4.71536934e-01  2.24768654e-01  6.35188073e-02
  8.56176019e-02 -1.88044831e-01  1.77257672e-01  3.40048403e-01
 -1.95545748e-01  1.58553362e-01  9.62866545e-02  1.12649694e-01
  2.21045166e-01 -9.56114054e-01 -3.85948956e-01  1.39220521e-01
  5.90011775e-01 -8.06727529e-01 -1.34288132e-01  2.35691771e-01
 -1.02274150e-01  2.78303713e-01  7.94321418e-01 -2.49362856e-01
  1.72771528e-01 -2.07582936e-01  3.00157130e-01 -8.59332681e-02
 -2.25284532e-01 -9.75404009e-02 -3.52349520e-01  3.81161213e-01
 -3.87681633e-01 -1.77613512e-01 -4.13684934e-01  1.38047546e-01
  1.29874498e-02  6.52685225e-01  1.16502643e-01 -5.10779560e-01
 -8.30415636e-02 -2.67047882e-02  3.12862933e-01 -2.62848467e-01
 -1.43285245e-01  1.10270

**TO-DO 3:** Retrieve the embedding for "architecture".

In [13]:
# WRITE YOUR ANSWER HERE

print(embeddings[3])

tensor([ 2.7139e-01,  7.7458e-01, -3.2426e-01, -7.1433e-02, -4.9510e-04,
         9.3731e-01, -4.4026e-03, -4.2692e-02,  1.2740e-02,  1.8927e-02,
         1.0253e-01,  4.5466e-01,  2.7044e-01,  2.3099e-01,  4.0370e-03,
        -1.0899e-01, -4.5991e-02, -3.5115e-01, -1.3471e-01,  8.2940e-02,
         1.8650e-01,  5.0027e-02,  7.2166e-02,  2.2866e-01, -2.1970e-01,
         9.4020e-02,  1.6554e-01,  1.8579e-01,  3.1778e-01, -5.0937e-01,
        -5.0095e-01,  1.5249e-01,  4.5800e-01, -8.5188e-01, -1.5863e-01,
         1.5896e-01,  4.1620e-02,  2.3100e-01,  8.7850e-01, -6.2316e-02,
         1.8722e-01, -1.2338e-02,  2.1008e-01,  3.4806e-02, -2.5124e-01,
        -1.3791e-01, -3.8870e-01,  2.9819e-01, -2.9203e-01, -3.1950e-01,
        -1.9843e-01,  1.3203e-01, -6.4638e-02,  7.4318e-01,  7.1424e-02,
        -3.0212e-01,  3.4978e-01, -5.8179e-02,  2.8507e-01, -4.0958e-01,
        -1.0330e-01,  1.0377e-01, -2.2290e-01,  8.8632e-02, -4.3360e-01,
         2.1787e-01, -2.7699e-01,  3.9959e-01,  1.6

Sentences and documents usually have varying lengths. So, to put multiple sentences into a single tensor, we need to pad the sequences up to a maximum length. Luckily, the tokenizer class takes care of this for us. When we pass in a list of sentences, the tokenizer creates a matrix, where each row is a sequence of the same length:

In [14]:
sentences = [
    "I can book tickets for the concert next week.",
    "Many readers find the first book of A Tale of Two Cities to be confusing.",
    "She opened the book to page 37 and began to read aloud.",
    "The police wanted to book him for driving too fast.",
    "I can reserve tickets for the concert next week."
]

model_inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")  

print(model_inputs)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': tensor([[  101,  1045,  2064,  2338,  9735,  2005,  1996,  4164,  2279,  2733,
          1012,   102,     0,     0,     0,     0,     0,     0],
        [  101,  2116,  8141,  2424,  1996,  2034,  2338,  1997,  1037,  6925,
          1997,  2048,  3655,  2000,  2022, 16801,  1012,   102],
        [  101,  2016,  2441,  1996,  2338,  2000,  3931,  4261,  1998,  2211,
          2000,  3191, 12575,  1012,   102,     0,     0,     0],
        [  101,  1996,  2610,  2359,  2000,  2338,  2032,  2005,  4439,  2205,
          3435,  1012,   102,     0,     0,     0,     0,     0],
        [  101,  1045,  2064,  3914,  9735,  2005,  1996,  4164,  2279,  2733,
          1012,   102,     0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

`model_inputs` is a dictionary containing three objects:
 * The `input_ids` are the list of token IDs in the input sequences. 
 * The `attention_mask` records which tokens are special padding tokens and which are real tokens. Tokens with a 0 in the attention mask will be ignored.
 * `token_type_ids` is needed when two sequences are passed together as input to the model for tasks such as next sentence prediction that involve comparing two sentences. Here, each input is a single sentence, so we have only one type of token in the output above. 
 
**TO-DO 4:** Look at the outputs above and work out which value the special padding tokens have? 

ANSWER: Pad tokens are 0. 

---

Notice that the input_ids all start with the same token ID, 101, even though they have different first words. They also have token ID 102 before the padding tokens. This is because the tokenizer inserts two special tokens, which are used in some applicaions of BERT. 101 is the '[CLS]' token, which is a dummy token whose embedding can be trained to represent the whole sequence. The [CLS] token's embedding can then be used as input to a text classifier to classify a sentence or document. Token 102 is '[SEP]', which can be used to separate multiple input sequences in a single example. This is needed in tasks where multiple pieces of text are provided as input, e.g., a to build a classifier that can determine whether two sentences contradict each other. 

We can now pass all of the model inputs to the model to produce a set of contextualised embeddings:

In [15]:
# model_inputs is a dictionary, so to provide the arguments to model(), 
# we use the double star to unpack the dictionary so that each key in the dictionary is
# an argument to model() and each value is the value of the argument. 
model_outputs = model(**model_inputs) 

**TO-DO 5:** The first four example sentences above all contain the word "book", and the last example contains "reserve". Obtain a list of contextualised word embeddings for 'book' and 'reserve' in the example sentences using our model. 

Hint: you may need to convert tensors to numpy arrays. Don't forget that the sequence of embeddings contains [CLS] and [SEP] embeddings. 

In [16]:
book_tok_id = tokenizer.convert_tokens_to_ids(['book'])

#WRITE YOUR OWN CODE HERE
embeddings = model_outputs['last_hidden_state']

book_embs = []
for i in range(4):
    book_idx_in_sen = np.argwhere(model_inputs["input_ids"][i].numpy() == book_tok_id)[0][0]
    book_embs.append(embeddings[i][book_idx_in_sen].detach().numpy())

reserve_tok_id = tokenizer.convert_tokens_to_ids(['reserve'])
reserve_idx_in_sen = np.argwhere(model_inputs["input_ids"][4].numpy() == reserve_tok_id)[0][0]
reserve_emb = embeddings[4][reserve_idx_in_sen].detach().numpy()

**TO-DO 6:** Compute the similarities between these embeddings in the cell below, and show the results. How do the similarities relate to the meaning of the word "book" or "reserve" in each sentence?

ANSWER 

The occurrences of 'book' with different meanings have larger cosine distances. 'reserve' has a similar meaning to 'book' in the first sentence, so has high similarity. 'book' in the third and second usages has the same meaning but the first and third are different. The fourth sentence contains 'book' as a verb rather than a noun, so has slightly lower similarity with the first.  This shows that the contextualised embeddings change depending on the sentence the word is used in, and its intended meaning. 

In [17]:
from scipy.spatial.distance import cdist  # you may find this function useful for computing distances

### WRITE YOUR ANSWER HERE
book_embs.append(reserve_emb)

similarities = 1 - cdist(book_embs, book_embs, metric='cosine')

###

for sen in sentences:
    print(sen)
    
print()
print("The table below shows similarities between words according to their contextualised embeddings:") 
print(np.round(similarities, decimals=2))


I can book tickets for the concert next week.
Many readers find the first book of A Tale of Two Cities to be confusing.
She opened the book to page 37 and began to read aloud.
The police wanted to book him for driving too fast.
I can reserve tickets for the concert next week.

The table below shows similarities between words according to their contextualised embeddings:
[[1.   0.58 0.63 0.6  0.74]
 [0.58 1.   0.75 0.59 0.29]
 [0.63 0.75 1.   0.44 0.29]
 [0.6  0.59 0.44 1.   0.5 ]
 [0.74 0.29 0.29 0.5  1.  ]]


**TO-DO 7:** Use the BERT model to obtain an embedding of each complete sentence from the five sentences listed above. Show the similarities and discuss what you see. 

ANSWER: We can use the CLS token to represent the text OR take a mean of the word embeddings of each sentence. 

In [18]:
### WRITE YOUR ANSWER HERE

cls_embs = embeddings[:, 0].detach().numpy()
similarities = 1 - cdist(cls_embs, cls_embs, metric='cosine')

###

print(similarities)
# Let's find the most similar sentences...
similarities[range(5), range(5)] = 0  # ignore the similarity between a sentence and itself
most_similar = np.argmax(np.max(similarities, axis=1))

print(f'The most similar sentence to "{sentences[-1]}" is "{sentences[most_similar]}", according to TinyBERT.')

[[1.         0.9157727  0.912248   0.92487733 0.99678734]
 [0.9157727  1.         0.88619383 0.94227128 0.92401025]
 [0.912248   0.88619383 1.         0.88843559 0.91287952]
 [0.92487733 0.94227128 0.88843559 1.         0.92794145]
 [0.99678734 0.92401025 0.91287952 0.92794145 1.        ]]
The most similar sentence to "I can reserve tickets for the concert next week." is "I can book tickets for the concert next week.", according to TinyBERT.


# 2. Transformer-based Text Classifiers

In this section, you will learn how to construct and train a text classifier on top of a pretrained transformer. 

To begin you will need to instantiate a suitable classifier model.

**TO-DO 8:** Find an AutoModel class that constructs a text classifier from the pretrained TinyBERT model, "huawei-noah/TinyBERT_General_4L_312D". Create the `model` object in the cell below using this class. Refer to the [Hugging Face documentation for auto models](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) as needed. 

In [19]:
### WRITE YOUR ANSWER HERE ###
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("huawei-noah/TinyBERT_General_4L_312D", num_labels=2)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at huawei-noah/TinyBERT_General_4L_312D and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Typically, sequence classification models attach a linear layer (classification head) to the outputs of the transformer. The CLS token's embedding is passed into the classification head, which makes a prediction over classes. We can see a similar structure in most neural network models. Our original text classifier from the first notebook used a fully-connected layer to produce a hidden representation of the whole sentence, whereas now we are replacing that hidden layer with a complete BERT transformer, which produces a sequence of embeddings. 

<img src="neural_text_classifier_smaller.png" alt="Neural text classifier diagram from the slides in lecture 8.1" width="400px"/>


We will need to train our model before we can use it (you may see a message in the output of the last cell telling you this). 

**TO-DO 9:** The classifier is built on top of a pretrained TinyBERT transformer, which was pretrained using masked language modelling and next sentence prediction. Why does the classifier require further training to provide accurate sentiment classifications? 

ANSWER

Only the BERT layers are pretrained. The complete classifier has additional classifier head layers on top of BERT, which are initialised randomly. The pretraining tasks did not include tweet classification, so the model does not yet encode any relationship between the text embeddings and the emotion categories.  

---

Next, let's learn how to train our model. For some tasks it is not necessary to update the weights in the BERT model itself, so we can freeze them to save a lot of computation time. We can do this as follows. Since our pretrained model is based on BERT, we can access the weights inside BERT through the variable `model.bert`.

In [20]:
for param in model.bert.parameters():
    param.requires_grad = False

To train our model, we can make use of the Trainer class, which encapsulates a lot of the complex training steps and avoids the need to define our own training function, as we did in the previous notebook (we don't need to write our own `train_nn`).

First, define some settings for the training process. This is where we can set training hyperparameters:

In [21]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="transformer_checkpoints",  # specify the directory where models weights will be saved a certain points during training (checkpoints)
    num_train_epochs=3, # A sensible and sufficient number to use for the to-dos below
    per_device_train_batch_size=16,  # you can decrease this if memory usage is too high while training
    logging_steps=50,  # how often to print progress during training
)

Next, create a trainer object:

In [22]:
from transformers import Trainer
from torch import nn

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

To train the model, you will need to call `trainer.train()`.

Once the model is trained, we can obtain predictions using the function below. Notice that it is simpler than obtaining the spans for QA -- we simply get the logits for each tweet in the test set, then apply argmax over the classes to find the most probable class for each tweet:

In [23]:
from torch.utils.data import TensorDataset, DataLoader

# device to run computation on
if torch.backends.mps.is_built():
    device = torch.device("mps")  # for mac use with MPS
elif torch.cuda.is_available():
    device = torch.device("cuda")  # NVIDIA GPU
else:
    device = torch.device("cpu")  # Default to CPU

def predict_nn(trained_model, test_dataset):
    
    # Switch off dropout
    trained_model.eval()
    
    # Convert the dataset into tensors and create a DataLoader
    batch_size = 16  # Adjust based on available memory
    test_dataset_tensors = TensorDataset(
        torch.tensor(test_dataset["input_ids"]), 
        torch.tensor(test_dataset["attention_mask"])
    )
    test_loader = DataLoader(test_dataset_tensors, batch_size=batch_size, shuffle=False)
    
    # Store predictions
    pred_labs = []
    
    with torch.no_grad():  # Disable gradient calculation
        for batch in test_loader:
            input_ids, attention_mask = [x.to(device) for x in batch]
            
            # Forward pass
            output = trained_model(input_ids=input_ids, attention_mask=attention_mask)
            
            # Get predicted labels
            preds = np.argmax(output["logits"].detach().cpu().numpy(), axis=1)
            pred_labs.extend(preds)
    
    # Convert to NumPy array 
    pred_labs = np.array(pred_labs)
    
    return pred_labs

You should now have all the bits and pieces needed to build and train a text classifier. Let's put them all together...

**TO-DO 10:** Train and test your sequence classifier on the [Sentiment](https://huggingface.co/datasets/tweet_eval) dataset using a pretrained transformer. Choose a suitable evaluation metric and compare the result with the simpler neural network classifiers from the previous lab. 

You may wish to 'unfreeze' the BERT model to see if this boosts performance, but note that it will require a lot more computation time to fine-tune the whole transformer model. Increasing the number of epochs could also boost performance, but again requires much more computation time.

In [24]:
cache_dir = './data_cache/'

# Load up the emotion dataset...
train_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="train",
    cache_dir=cache_dir,
)
print(f"Training dataset with {len(train_dataset)} instances loaded")

val_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="validation",
    cache_dir=cache_dir,
)
print(f"Validation dataset with {len(val_dataset)} instances loaded")

test_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="test",
    #cache_dir=cache_dir,
)
print(f"Test dataset with {len(test_dataset)} instances loaded")

Training dataset with 45615 instances loaded
Validation dataset with 2000 instances loaded
Test dataset with 12284 instances loaded


In [25]:
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("huawei-noah/TinyBERT_General_4L_312D", num_labels=3)

# you'll get better performance on this task if you don't freeze the BERT model, or perhaps if you train for more epochs
for param in model.bert.parameters():
    param.requires_grad = False

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at huawei-noah/TinyBERT_General_4L_312D and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

In [28]:
# train...
trainer.train()

  0%|          | 0/8553 [00:00<?, ?it/s]

{'loss': 1.096, 'grad_norm': 0.15513628721237183, 'learning_rate': 4.97077048988659e-05, 'epoch': 0.02}
{'loss': 1.0906, 'grad_norm': 0.287805438041687, 'learning_rate': 4.9415409797731794e-05, 'epoch': 0.04}
{'loss': 1.0871, 'grad_norm': 0.23804907500743866, 'learning_rate': 4.912311469659769e-05, 'epoch': 0.05}
{'loss': 1.0824, 'grad_norm': 0.23868264257907867, 'learning_rate': 4.8830819595463585e-05, 'epoch': 0.07}
{'loss': 1.0809, 'grad_norm': 0.14542195200920105, 'learning_rate': 4.853852449432948e-05, 'epoch': 0.09}
{'loss': 1.0746, 'grad_norm': 0.3537963330745697, 'learning_rate': 4.8246229393195376e-05, 'epoch': 0.11}
{'loss': 1.0741, 'grad_norm': 0.23777759075164795, 'learning_rate': 4.795393429206127e-05, 'epoch': 0.12}
{'loss': 1.0685, 'grad_norm': 0.31881213188171387, 'learning_rate': 4.766163919092716e-05, 'epoch': 0.14}
{'loss': 1.064, 'grad_norm': 0.19587695598602295, 'learning_rate': 4.7369344089793056e-05, 'epoch': 0.16}
{'loss': 1.0629, 'grad_norm': 0.1578907221555709

TrainOutput(global_step=8553, training_loss=1.0150284127130935, metrics={'train_runtime': 203.9058, 'train_samples_per_second': 671.119, 'train_steps_per_second': 41.946, 'total_flos': 490587879916800.0, 'train_loss': 1.0150284127130935, 'epoch': 3.0})

In [29]:
# Run the prediction function to get the results:
pred_labs_frozen = predict_nn(model, test_dataset)

gold_labs = test_dataset["label"]

In [30]:
from sklearn.metrics import f1_score

f1 = f1_score(np.array(gold_labs).flatten(), pred_labs_frozen.flatten(), average='macro')
print(f'FROZEN MODEL F1 = {f1}')

FROZEN MODEL F1 = 0.2576930538958548


In [33]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(np.array(gold_labs).flatten(), pred_labs_frozen.flatten())
print(f'FROZEN MODEL ACCURACY = {acc}')

FROZEN MODEL ACCURACY = 0.4937316834907196


**TO-DO 11:** What kinds of _transfer_ did your sentiment classifier use and what benefit do they provide? 

ANSWER

The BERT layers of the model are first pretrained on MLM and NSP tasks on a different dataset. With frozen BERT, we perform direct transfer of the BERT model to our emotion classification task. When BERT is unfrozen, we fine-tune the BERT layers, which performs inductive transfer learning. The benefit is that knowledgeabout how to process sequences of text to extract embeddings is transferred from the pretraining task (which had lots of data available) to the downstream target task (hate classification, with only a few thousand examples). 

---

The model currently outputs logits, rather than probabilities, which are much more useful for most applications of a text classifier.  To compute the probability of each class for a test sentence, we need to pass the logits through the softmax function. Complete the function below to obtain a probability distribution for a sentence of your choice.

In [34]:
sentences = ["A very joyful and happy day"]

model.eval()
output = model(**tokenizer(sentences,  max_length=128, padding="max_length", truncation=True, return_tensors="pt").to(device))
        
# the output dictionary contains logits, which are the unnormalised scores for each class for each example:
logits = output["logits"]

#### WRITE YOUR ANSWER HERE   
probs = torch.nn.Softmax(dim=1)(output["logits"])
####

print(f'The probability of each sentiment class is:')
classes = ['non-hate', 'hate']
for c, category in enumerate(classes):
    print(f'probability of {category} = {probs[0][c].detach().cpu().numpy()}')

The probability of each sentiment class is:
probability of non-hate = 0.15498974919319153
probability of hate = 0.39903175830841064


# 3. OPTIONAL: More on Transformers

There are many great resources out there to show you how to use this kind of model in practice:
* Use a Transformer for sequence tagging by following the [Token Classification tutorial](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=vc0BSBLIIrJQ) from HuggingFace. This is a little more involved than sequence classification because the tags provided in the training dataset require the text to be tokenized in a particular way, which often differs from what a particular pretrained transformer requires.
* An extensive online course is provided by HuggingFace: https://huggingface.co/course/chapter1/1. The pages linked from the HuggingFace course website have an 'open in Colab' button on the top right. You can open the notebook and run it on a Google server there to access GPUs.
* Chapters that may be particularly useful: 
   * Transformers, what can they do? https://huggingface.co/course/chapter1/3?fw=pt
   * Using Transformers: https://huggingface.co/course/chapter2/2?fw=pt
* They provide information on fine-tuning the transformer models here: https://huggingface.co/docs/transformers/training. Fine-tuning updates the weights inside the pretrained network and requires extensive GPU or TPU computing. 
* Text Generation: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb. This topic goes way beyond data analytics on this unit and shows you another powerful feature of pretrained transformers.


