# Advanced Text Analytics Lab 2

This notebook is the second of two lab notebooks that you will submit as part of your assessment for the Advanced Data Analytics unit. The notebook contains three sections:
1. **Introducing Transformers:** This section introduces the Transformers library from HuggingFace, showing you how to use it to obtain contextualised embeddings from pretrained transformer models.
2. **Classification with transformers:** Here we show you how to construct a neural network classifier using Transformers, and give you the task of adapting it to sentiment classification.
3. **OPTIONAL: More on Transformers:** Some pointers to other materials if you want to learn more about transformers, e.g., if using them in your summer project. 

## Learning Outcomes

These sections will contain tutorial-like instructions, as you have seen in previous text analytics labs. On completing these sections, the intended learning outcomes are that you will be able to...
1. Use pretrained transformers to obtain contextualised word embeddings.
1. Construct classifiers with pretrained transformers. 
1. Find documentation on pretrained models in the Transformers library.

## Your Tasks

Inside each of these sections there are several **'To-do's**, which you must complete for your summative assessment. Your marks will be based on your answers to these to-dos. Please make sure to:
1. Include the output of your code in the saved notebook. Plots and printed output should be visible without re-running the code. 
1. Include all code needed to generate your answers.
1. Provide sufficient comments to understand how your method works.
1. Write text in a cell in markdown format where a written answer is required. You can convert a cell to markdown format by pressing Escape-M. 

There are also some unmarked 'to-do's that are part of the tutorial to help you learn how to implement and use the methods studied here.

## Marking Criteria

1. The coursework (both notebooks) is worth 30% of the unit in total. 
1. There is a total of 100 marks available for both lab notebooks. 
1. This notebook is worth 34 of those marks.
1. The number of marks for each to-do out of 100 is shown alongside each to-do.
1. For to-dos that require you to write code, a good solution would meet the following criteria (in order of importance):
   1. Solves the task or answers the question asked in the to-do. This means, if the code cells in the notebook are executed in order, we will get the output shown in your notebook.
   1. The code is easy to follow and does not contain unnecessary steps.
   1. The comments show that you understand how your solution works.
   1. A very good answer will also provide code that is computationally efficient but easy to read.
1. You can use any suitable publicly available libraries. Unless the task explicitly asks you to implement something from scratch, there is no penalty for using libraries to implement some steps.

## Support

The lecturer will help you with questions about the lectures, the code provided for you in this notebook, and general questions about the topics we cover. For the marked 'to-dos', they can only answer clarifying questions about what you have to do. 

Office hours: You can book office hours with Edwin on Tuesdays 3pm-5pm by sending him an email (edwin.simpson@bristol.ac.uk). If those times are not possible for you, please contact him by email to request an alternative. 

## Deadline

The notebook must be submitted along with the first notebook on Blackboard before **7th August at 13.00**. 

## Submission

You will need to zip up this notebook with the previous notebook into a single .zip file, which you will submit to Blackboard through the 'assessment, submission and feedback' link on the left sidebar. 

Please name your files like this:
   * Name this notebook ADA2_<student_number>.ipynb
   * Name the zip file <student_number>.zip
   * Please don't use your name anywhere as we want to mark anonymously. 

# 4. Pretrained Transformers (max. 12 marks)

HuggingFace is a company that has developed an open source library for loading pretrained transformer models. They also distribute many models themselves. This makes it a convenient library for building NLP models on top of large, deep neural networks. This is especially useful for tasks where simpler, feature-based methods or smaller LSTM models do not perform well enough, for example, when complex processing of syntax and semantics is required (natural language 'understanding'). 

Let's start by looking at two key types of object in the transformers library: models and tokenizers.

## 4.1. Models

The neural network models available in the Transformers library are accessed through wrapper classes such as `AutoModel`. If we want to load a pretrained model, we can simply pass its name to the `from_pretrained` function, and the pretrained model weights will be downloaded from HuggingFace and a neural network model will be created with those weights. For example:

In [None]:
import numpy as np  # we will need this later

from transformers import AutoModel # For BERTs

model = AutoModel.from_pretrained("huawei-noah/TinyBERT_General_4L_312D") 

This code loads the TinyBERT model, which is a compressed version of BERT with 14.5 million parameters, compared to the standard version of BERT, called 'BERT-base', which has 110 million parameters. We will refer to this model as BERT-tiny in the rest of the notebook. While BERT-tiny will not perform as well as larger models, we will use it for this notebook to save memory and computation costs. See the [documentation here](https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D).  

The same functions can be used to load other models from HuggingFace's repository simply by changing the model's name. Take a look at [the Models page](https://huggingface.co/models) to see what there is on offer. Do you recognise any of the models' names?

## 4.2. Tokenizers

Before we can apply a model to some text, we need to create a Tokenizer object. In Transformers, Tokenizer objects convert raw text to a sequence of numbers. First, the tokenizer performs the tokenization itself, then it maps each token to its numerical ID. There are lots of different tokenizers that we can use to preprocess text. If we are loading a pretrained model, we will need to choose the tokenizer that corresponds to that model. 

We can load the right tokenizer as follows, in the same way we loaded the model itself:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D") 

Let's see what the BERT-tiny tokenizer does to an example sentence:

In [None]:
sentence = "The transformer architecture is widely used in NLP."

tokens = tokenizer.tokenize(sentence)
print(tokens)

Let's compare with the NLTK tokenizer we have seen before:

In [None]:
from nltk.tokenize import word_tokenize

nltk_tokens = word_tokenize(sentence)
print(nltk_tokens)

While NLTK keeps whole words as tokens, the BERT tokenizer splits some words into sub-words. Splitting is applied to words with low frequency in the training set, such as 'transformer'. This helps to avoid out-of-vocabulary problems, as rare words are split into smaller, recognisable components, and the size of the vocabulary can be limited. 
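
To make the idea of sub-word splitting concrete, here is a minimal sketch of greedy longest-match splitting, in the style of WordPiece tokenizers. The function `wordpiece_tokenize` and the vocabulary `toy_vocab` are invented for this illustration; this is not the actual implementation used by the Transformers library, and the real vocabulary is much larger.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lower-cased word into sub-word pieces by greedy longest-match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # '##' marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so treat the word as unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, made up for this example only.
toy_vocab = {"the", "transform", "##er", "##s", "is", "used"}

print(wordpiece_tokenize("transformers", toy_vocab))
print(wordpiece_tokenize("used", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, a rare word is broken into known pieces, a frequent word stays whole, and a word with no matching pieces falls back to the unknown token.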

**TO-DO 4.2a:** Can we use any tokenizer with any pretrained model? Explain your answer. **(2 marks)**

WRITE YOUR ANSWER HERE.

---

After tokenization, the Tokenizer object can also map the tokens to their IDs (indexes in the vocabulary):

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)
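
Conceptually, the mapping from tokens to IDs is a dictionary lookup. The sketch below illustrates this with a tiny made-up vocabulary; `toy_vocab`, `tokens_to_ids` and `ids_to_tokens` are hypothetical names for this example and are not part of the Transformers API.

```python
# Toy vocabulary, made up for illustration: a token's ID is its position in the list.
toy_vocab = ["[UNK]", "the", "transform", "##er", "is", "used"]
token_to_id = {token: idx for idx, token in enumerate(toy_vocab)}


def tokens_to_ids(tokens, token_to_id, unk_token="[UNK]"):
    """Look up each token's ID, falling back to the unknown token's ID."""
    unk_id = token_to_id[unk_token]
    return [token_to_id.get(token, unk_id) for token in tokens]


def ids_to_tokens(ids, vocab):
    """Invert the mapping: an ID is an index into the vocabulary list."""
    return [vocab[i] for i in ids]


toy_ids = tokens_to_ids(["the", "transform", "##er", "zzz"], token_to_id)
print(toy_ids)
print(ids_to_tokens(toy_ids, toy_vocab))
```

Note that the round trip is lossy for tokens that are not in the vocabulary, since they all map to the same unknown ID.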

## 4.3. Contextualised Embeddings

Now that we have a sequence of tokens, we are almost ready to process the sequence using the pretrained model. 

Our model takes as input a PyTorch `tensor` object. In PyTorch, a `tensor` is a multi-dimensional array. Here, we need a two-dimensional tensor (a matrix), where each row is a sequence of input tokens corresponding to a single sentence or document. Let's convert our list of IDs to a 2-D tensor with a single row:

In [None]:
import torch 

ids_tensor = torch.tensor([ids])

print(ids_tensor)
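
The extra pair of square brackets is what adds the batch dimension. The following sketch, using made-up IDs in `toy_ids`, compares the shapes with and without it:

```python
import torch

toy_ids = [5, 17, 42]  # made-up token IDs, for illustration only

row = torch.tensor(toy_ids)      # 1-D tensor: a single sequence
batch = torch.tensor([toy_ids])  # 2-D tensor: a batch containing one sequence
same = row.unsqueeze(0)          # equivalent way to add the batch dimension

print(row.shape)    # torch.Size([3])
print(batch.shape)  # torch.Size([1, 3])
print(torch.equal(batch, same))
```

The first dimension of the 2-D tensor indexes the sequences in the batch, and the second indexes the tokens within each sequence.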

Now we can process the sequence using our model. The model maps the sequence of input IDs to a sequence of output vectors, which are contextualised word embeddings. The hidden state values produced in the last hidden layer of the model are used as the contextualised embeddings:

In [None]:
model_outputs = model(ids_tensor)
print(model_outputs)
embeddings = model_outputs['last_hidden_state'][0]
print(embeddings)

We can retrieve the embedding vector for "transform" like this:

In [None]:
emb = embeddings[1]

# convert it to a numpy array so we can perform various operations on it later on
emb = emb.detach().numpy()

print(emb)
print(f'The BERT-tiny embeddings have {emb.shape[0]} dimensions.')

TO-DO 4.3a: Retrieve the embedding for "architecture" (this to-do will not be marked).

In [None]:
# WRITE YOUR ANSWER HERE


Sentences and documents usually have varying lengths. So, to put multiple sentences into a single tensor, we need to pad the sequences up to a maximum length. Luckily, the tokenizer class takes care of this for us. When we pass in a list of sentences, the tokenizer creates a matrix, where each row is a sequence of the same length:

In [None]:
sentences = [
    "She opened the book to page 37 and began to read aloud.",
    "Many readers find the first book of A Tale of Two Cities to be confusing.",
    "I can book tickets for the concert next week.",
    "The police wanted to book him for driving too fast.",
    "I can reserve tickets for the concert next week."
]

model_inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")  

print(model_inputs)

`model_inputs` is a dictionary containing three objects. The `input_ids` are the list of token IDs in the input sequences. 

TO-DO 4.3b: What value do the special padding tokens have? (this to-do is unmarked)

ANSWER: 

---

Notice that the input_ids all start with the same token ID, 101, even though they have different first words. They also have token ID 102 before the padding tokens. This is because the tokenizer inserts two special tokens, which are used in some applications of BERT. 101 is the '[CLS]' token, which is a dummy token whose embedding can be trained to represent the whole sequence. The [CLS] token's embedding can then be used as input to a text classifier to classify a sentence or document. Token 102 is '[SEP]', which can be used to separate multiple input sequences in a single example. This is needed in tasks where multiple pieces of text are provided as input, e.g., to build a classifier that can determine whether two sentences contradict each other. 

The `attention_mask` records which tokens are special padding tokens and which are real tokens. Tokens with a 0 in the attention mask will be ignored.

`token_type_ids` is needed when two sequences are passed together as input to the model. Here, each input is a single sentence, so we have only one type of token in the output above. 
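
To see how padding and the attention mask fit together, here is a minimal pure-Python sketch. The helper `pad_batch`, the toy IDs and the placeholder `TOY_PAD_ID` are all invented for this illustration; the placeholder is deliberately not the value that the BERT tokenizer uses.

```python
def pad_batch(sequences, pad_id):
    """Pad every sequence to the length of the longest one and build a matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position to be ignored
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_sequences = [[11, 12, 13, 14], [21, 22], [31, 32, 33]]
TOY_PAD_ID = 99  # placeholder value chosen for this example only

toy_batch = pad_batch(toy_sequences, pad_id=TOY_PAD_ID)
for ids_row, mask_row in zip(toy_batch["input_ids"], toy_batch["attention_mask"]):
    print(ids_row, mask_row)
```

Every row now has the same length, so the batch can be stored as a single matrix, and the mask tells the model which positions hold real tokens.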

We can now pass all of the model inputs to the model to produce a set of contextualised embeddings:

In [None]:
# model_inputs is a dictionary, so to provide the arguments to model(), 
# we use the double star to unpack the dictionary so that each key in the dictionary is
# an argument to model() and each value is the value of the argument. 
model_outputs = model(**model_inputs) 
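
If the double-star syntax is unfamiliar, this small sketch shows that unpacking a dictionary is equivalent to writing out the keyword arguments by hand. The function `describe` is a hypothetical stand-in for `model()`, used here only to show which arguments arrive.

```python
def describe(input_ids, attention_mask=None):
    """Stand-in for model(): it just reports the arguments it received."""
    return f"input_ids={input_ids}, attention_mask={attention_mask}"


toy_inputs = {"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]}

# These two calls are equivalent:
unpacked = describe(**toy_inputs)
explicit = describe(
    input_ids=toy_inputs["input_ids"],
    attention_mask=toy_inputs["attention_mask"],
)

print(unpacked == explicit)
```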

**TO-DO 4.3c:** In a few sentences, explain how the contextualised embeddings produced by the transformer model differ from the word embeddings produced by word2vec or GloVe, which we used in the first Jupyter notebook. **(3 marks)**

WRITE YOUR ANSWER HERE

---

**TO-DO 4.3d:** Use the model_outputs to obtain an embedding that can represent each sentence. Comment your code so that your method is clear. **(4 marks)**

Hint: you may need to convert tensors to numpy arrays.

In [None]:
#WRITE YOUR OWN CODE HERE



**TO-DO 4.3e:** Write code in the cell below to compare the embeddings of different sentences. **(3 marks)**

In [None]:
# WRITE YOUR ANSWER HERE



# 5. Transformer-based Text Classifiers (max. 22 marks)

The previous section showed us how to obtain a sequence of contextualised word embeddings using a pretrained transformer. How can we use a pretrained model to build a classifier?

First, let's load up the [Tweet Eval](https://huggingface.co/datasets/tweet_eval) stance_hillary dataset, which we will use to train and test a classifier. The stance_hillary dataset is relatively small compared to the sentiment dataset we used earlier. The task is to classify tweets according to their stance towards Hillary Clinton, where 0 = no stance, 1 = against, 2 = favor.

In [None]:
from datasets import load_dataset

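cache_dir = "./data_cache"

# Note: recent versions of the datasets library deprecate ignore_verifications
# in favour of verification_mode="no_checks". If the calls below raise an error
# about this argument, swap it accordingly.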

train_dataset = load_dataset(
    "tweet_eval",
    name="stance_hillary",
    split="train",
    ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Training dataset with {len(train_dataset)} instances loaded")

val_dataset = load_dataset(
    "tweet_eval",
    name="stance_hillary",
    split="validation",
    ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Validation dataset with {len(val_dataset)} instances loaded")

test_dataset = load_dataset(
    "tweet_eval",
    name="stance_hillary",
    split="test",
    ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Test dataset with {len(test_dataset)} instances loaded")

num_classes = np.unique(train_dataset['label']).size

Now we are working with a proper dataset, which uses the datasets library. We can use our tokenizer to tokenize the examples in the dataset using the code in the next cell. Here, we use the ``map()`` method again to apply the tokenizer to each example in the dataset. 

In [None]:
def tokenize_function(dataset):
    model_inputs = tokenizer(dataset['text'], padding="max_length", max_length=100, truncation=True)
    return model_inputs

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Now that we have the dataset in the right format, let's see how to create a classifier based on a pretrained transformer.

Our original text classifier from the first notebook used a fully-connected layer to produce a hidden representation of the whole sentence. This hidden representation was then fed to an output layer to produce a probability distribution over class labels:

<img src="neural_text_classifier_smaller.png" alt="Neural text classifier diagram from the slides in lecture 8.1" width="400px"/>

With transformers, we can do something very similar, by connecting the transformer's output to a fully-connected layer. However, with BERT, we do not need to pass the embedding of each individual word to the fully-connected layer because there is a special [CLS] token that represents the whole sentence:

<img src="bert_text_classifier.png" alt="BERT text classifier diagram from the slides in lecture 9.2" width="400px"/>

Diagram from ["BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"](https://teaching.bb-ai.net/Student-Projects/Winograd-Challenge-Papers/2018-Devlin-BERT.pdf), Devlin et al., 2018.

The code below shows how to access a tensor containing the [CLS] embeddings:

In [None]:
cls_embs = model(**model_inputs)['last_hidden_state'][:, 0]

print(cls_embs.shape)
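
The indexing `[:, 0]` selects the vector at the first token position for every sequence in the batch. The sketch below shows the same slicing on a small NumPy array with arbitrary, made-up dimensions, so that the effect on the shape is easy to follow:

```python
import numpy as np

# Made-up sizes for illustration: 2 sequences, 4 tokens each, 3-dimensional vectors.
batch_size, seq_len, hidden_size = 2, 4, 3
toy_hidden = np.arange(batch_size * seq_len * hidden_size).reshape(
    batch_size, seq_len, hidden_size
)

# ':' keeps every sequence in the batch, '0' picks the first token position.
first_token_vectors = toy_hidden[:, 0]

print(toy_hidden.shape)           # (2, 4, 3)
print(first_token_vectors.shape)  # (2, 3)
print(first_token_vectors)
```

The token dimension disappears, leaving one vector per sequence.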

So, given the pretrained BERT model, we need to put a classifier 'head' (fully connected layers that map the CLS embedding to a class probability) onto it, and train the classifier head for stance classification. 

The transformers library provides some useful wrappers around the pretrained models that construct complete models for typical tasks such as text classification. These auto classes are documented here: https://huggingface.co/docs/transformers/model_doc/auto

The code below will create a complete model for sequence classification, based on the BERT-tiny model:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("huawei-noah/TinyBERT_General_4L_312D", num_labels=num_classes)

**TO-DO 5a:** What does the AutoModelForSequenceClassification class connect to the BERT-tiny model to make it suitable for sequence classification (think about the network architecture)?  **(2 marks)**

WRITE YOUR ANSWER HERE

---

**TO-DO 5b:** How would a model created using the AutoModelForTokenClassification class differ from one created using  AutoModelForSequenceClassification? Give some examples of tasks that AutoModelForTokenClassification could be used for. Reference the documentation for the chosen class in your answer. **(4 marks)** 

WRITE YOUR ANSWER HERE

---

Next, we are going to train our model. Sometimes it is not necessary to update the weights in the BERT model itself, so we can freeze them. This can save a lot of computation time. We can do this as follows:

In [None]:
for param in model.bert.parameters():
    param.requires_grad = False
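
One way to check that freezing has worked is to count the parameters that still require gradients. The sketch below does this on a toy network rather than on BERT-tiny: `encoder`, `head` and `count_trainable` are hypothetical names for this example, and the layer sizes are arbitrary.

```python
import torch.nn as nn

# Toy stand-ins: 'encoder' plays the role of the pretrained model and
# 'head' the role of the classification layer on top of it.
encoder = nn.Linear(8, 4)
head = nn.Linear(4, 3)
toy_model = nn.Sequential(encoder, head)


def count_trainable(module):
    """Count the parameters that will be updated during training."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


before = count_trainable(toy_model)

# Freeze the encoder, exactly as we froze the BERT weights above.
for param in encoder.parameters():
    param.requires_grad = False

after = count_trainable(toy_model)
print(f"Trainable parameters before freezing: {before}, after: {after}")
```

After freezing, only the head's weights and biases remain trainable, which is why training with frozen layers needs less computation.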

To train our model on the stance data, we can make use of the Trainer class. This class encapsulates a lot of the complex training steps and avoids the need to define our own training function (``train_nn`` in the previous notebook).

Run the code below to train the model.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="transformer_checkpoints",  # the directory where the model's weights will be saved at certain points during training (checkpoints)
    num_train_epochs=3,  # change this if it is taking too long on your computer
)  

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()

Let's make some predictions with our model on the test dataset:

In [None]:
def predict_nn(trained_model, test_dataset):

    # Pass the required items from the dataset to the model    
    output = trained_model(attention_mask=torch.tensor(test_dataset["attention_mask"]), input_ids=torch.tensor(test_dataset["input_ids"]))
        
    # the output dictionary contains logits, which are the unnormalised scores for each class for each example:
    pred_labs = np.argmax(output["logits"].detach().numpy(), axis=1)
    
    gold_labs = test_dataset["label"]
    
    return gold_labs, pred_labs

# Run the prediction function to get the results:
gold_labs, pred_labs = predict_nn(model, test_dataset)
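
The key step in `predict_nn` is the `argmax` over the logits. As a small illustration with made-up numbers (`toy_logits` is not real model output), each row holds the scores for one example and `axis=1` picks the highest-scoring class within each row:

```python
import numpy as np

# Made-up logits for 3 examples and 3 classes, for illustration only.
toy_logits = np.array([
    [2.0, -1.0, 0.5],
    [0.1, 0.3, 0.2],
    [-3.0, -2.0, -2.5],
])

# axis=1 takes the argmax across the classes, giving one label per example.
toy_pred_labs = np.argmax(toy_logits, axis=1)

print(toy_pred_labs)
```

Logits can be negative and do not sum to one, but only their relative order within a row matters for choosing the predicted label.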

**TO-DO 5c:** 
Implement and test a classifier for the "emotion" subset of the [Tweet_eval dataset](https://huggingface.co/datasets/tweet_eval) using a pretrained transformer. Evaluate the classifier with both frozen and unfrozen (i.e., fine-tuned) BERT layers using a suitable evaluation metric. Discuss your results below, explaining what happens during training in the frozen and unfrozen variants, and how this leads to different results. Make sure to comment your code.  **(8 marks)**

Note: you may implement any classifier that uses a pretrained transformer model. 

WRITE YOUR ANSWER HERE   


---

In [None]:
# WRITE YOUR ANSWER HERE


**TO-DO 5d:** Did you use transfer learning in your approach to 5c? What kinds of transfer learning did you use? Can you think of a way to improve the pretraining step that might help improve your Twitter emotion classifier?  **(4 marks)**

WRITE YOUR ANSWER HERE


---

**TO-DO 5e:** Use your model to compute the probability of each emotion for a tweet or sentence of your choosing. Comment your code, print the sentence with its probabilities and class labels. **(4 marks)**

In [None]:
# WRITE YOUR ANSWER HERE   
