<a href="https://colab.research.google.com/github/uceeatz/lab-2-transformers/blob/main/ELEC0141_Lab_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install datasets
!pip install transformers
!pip install evaluate


---
# **Section 1 - Introduction**
___

*Transformer, a powerful tool,*

*With Hugging Face, an online jewel.*

*Pre-trained models, a great base,*

*Fine-tune them, you'll win the race.*

*Self-attention, multi-head,*

*It's the future, stay ahead.*

*Encoder, decoder blocks,*

*In NLP, it unlocks,*

*The state-of-the-art, it's true,*

*With Transformer, you'll shine through.*

\- **A short poem about Transformers, written by a Transformer (*ChatGPT*)**

# **1.1 - A Brief History of Transformers**

*(Expected reading time: 10 mins)*

Before getting started on your tasks, this section provides an overview of:
* The Transformer architecture
* How Transformer models achieve state-of-the-art performance in a range of NLP tasks
* Which variants of the Transformer architecture are used for different tasks



---
### **1.11 - What makes Transformers so successful?**

In Lab 1, you learnt that RNNs are useful for NLP but they have two major shortcomings:

1. **RNNs don't parallelize well** due to their dependence on sequential processing.

2. **RNNs struggle to capture long-range dependencies** due to the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) and the limited memory of the hidden state.

Since their introduction in 2017, Transformer models have overcome these limitations of RNNs and achieved state-of-the-art performance in NLP tasks. The table below summarises the Transformer's key innovations.


|  RNN Problem             | Transformer Solution |
|-------------------------|----------------------|
| <img width=100/> Parallelization <img width=100/> | <img width=100/> Positional encoding <img width=100/> |
| <img width=100/> Long-range dependencies <img width=100/> | <img width=100/> Self-attention mechanism <img width=100/> |


Attention mechanisms are so central to the success of Transformers that the original 2017 paper that introduced the architecture was called [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf).

While an in-depth understanding of the Transformer architecture is not necessary to complete the tasks in this lab, it is useful to have an understanding of the core concepts that make Transformer models the state-of-the-art in ML and NLP:


**Positional encoding** is a technique used in the Transformer architecture to give the model an understanding of the position of the elements in the input sequence, while allowing them to be processed in parallel. It's implemented by adding a position-dependent vector to the input embedding at each position of the input sequence. The original paper used fixed sinusoidal vectors, while models such as BERT learn these vectors during training.
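
As a rough illustration, the fixed sinusoidal variant from the original paper can be sketched in a few lines of NumPy. The function name, the toy shapes and the random "embeddings" are assumptions made for this example, and the sketch assumes an even model dimension.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of fixed sinusoidal position vectors.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the indices 2i
    angles = positions / np.power(10000.0, even_dims / d_model)

    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return encoding

# Toy example: 4 tokens with hypothetical 8-dimensional embeddings
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(4, 8))

# The position information is simply added to the embeddings
model_input = token_embeddings + sinusoidal_positional_encoding(4, 8)
print(model_input.shape)
```

Because every position gets a distinct vector, the model can tell tokens apart by position even though all positions are processed at once.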

**Attention** is a mechanism that allows a model to weigh the importance of different elements of the input when processing it. Attention mechanisms calculate a set of attention weights that indicate how much each element of the input should be considered when computing the final output.

**Self-attention** is a specific type of attention mechanism in which a sequence attends to itself: each element is weighed against every other element of the same sequence. It is used in both the encoder and the decoder of Transformer models. For NLP, this allows words to be understood in the context of phrases, sentences, paragraphs, etc.

**Multi-head self-attention** is an extension of the single-head self-attention mechanism, where the model computes multiple sets of attention weights in parallel, each using a different set of learned parameters. It helps the model capture more complex relationships between the input elements.
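
To make these definitions more concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, followed by a simplified multi-head version. The random projection matrices stand in for learned parameters, all sizes are arbitrary toy values, and the final output projection used in real multi-head attention is left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])  # similarity of every pair of positions
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head, num_heads = 5, 16, 4, 4
x = rng.normal(size=(seq_len, d_model))      # toy input embeddings

# Single head
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)

# Multi-head: run several heads with different parameters and concatenate them
heads = []
for _ in range(num_heads):
    w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    head_output, _ = self_attention(x, w_q, w_k, w_v)
    heads.append(head_output)
multi_head_output = np.concatenate(heads, axis=-1)
print(multi_head_output.shape)
```

Each row of `weights` shows how strongly one position attends to every other position, which is how a distant word can directly influence the representation of the current one.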

For further reading (and pictures) on these concepts, check out the excellent guide [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/).


---

### **1.12 - What does a Transformer look like?**

The original diagram of the Transformer architecture shows how it incorporates the ideas of positional encoding and attention.

![](https://drive.google.com/uc?export=view&id=1bD8TnqfqnpnMTDwnXTDdtJ2EKvC0rEQa)


The encoder and decoder blocks serve the following purposes:

- **Encoder**: The encoder block is responsible for encoding the input sequence into a hidden representation that captures the key information in the input. The encoder block is typically composed of a stack of identical layers, each layer consisting of a multi-head self-attention mechanism and a feed-forward neural network. The self-attention mechanism allows the encoder to weigh the importance of different parts of the input sequence when encoding it, and the feed-forward neural network helps the encoder to extract more complex features from the input.

- **Decoder**: The decoder block, on the other hand, is responsible for decoding the encoded input representation into an output sequence. Like the encoder, the decoder block is also typically composed of a stack of identical layers, but in addition to the multi-head self-attention mechanism and feed-forward neural network, it also includes a multi-head attention mechanism that allows the decoder to attend to the encoded input representation when generating the output sequence.

The input embedding is processed by each layer of the encoder block, the final output of which is then processed by each layer of the decoder block, as illustrated below.

![](https://drive.google.com/uc?export=view&id=1-oieNTlJhp7owIbEzwfLppeqyunYsHA2)


A Transformer can consist of either or both blocks, depending on the required task:


| Transformer Type | Use Case                                                                                                              |
|------------------|-----------------------------------------------------------------------------------------------------------------------|
| Encoder-only     | Good for tasks that require understanding of the input,  such as sentence classification and named entity recognition |
| Decoder-only     | Good for generative tasks such as text generation                                                                     |
| Encoder-Decoder  | Good for generative tasks that require an input,  such as translation or summarization                                |

---



# **1.2 - How can we use Transformers?**

*(Expected reading time: 10 minutes)*

### **1.21 - Training a Large Language Model (LLM)**
Training an LLM requires a vast amount of compute resources, energy, training data and technical expertise, and can last weeks or even months.

For example, training [BigScience's BLOOM](https://huggingface.co/bigscience/bloom) model on a dataset of 46 natural languages and 13 programming languages required 384 80GB GPUs running for 3.5 months at a cost of up to $5M.

Clearly, training a state-of-the-art model from scratch is beyond the scope of this course.


### **1.22 - Pre-trained models**

Fortunately, many pre-trained models are openly available online. Some run on cloud infrastructure and are accessible through an API, while for others the model weights can be downloaded from a repository so you can run them locally. [Hugging Face](https://huggingface.co) provides such a model repository service.

Larger models have a file size in the range of GBs and billions of parameters. BLOOM, for example, is one of the largest models available in the Hugging Face library, and has a total size of around 330GB and 176 billion parameters.

On the other hand, smaller models, such as BERT-base, can have a file size in the range of hundreds of MBs and millions of parameters. For example, BERT-base-uncased is around 400MB and 110 million parameters.
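
A quick back-of-the-envelope check ties parameter counts to file sizes. The sketch below assumes that the weights dominate the file size, that BERT-base is stored in 32-bit floats (4 bytes per parameter), and that BLOOM, with its published count of roughly 176 billion parameters, is stored in 16-bit floats (2 bytes per parameter). The helper name is hypothetical.

```python
def estimated_size_gb(num_parameters: float, bytes_per_parameter: int) -> float:
    """Estimate the weight file size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

bert_base_gb = estimated_size_gb(110e6, 4)  # assumed float32 storage
bloom_gb = estimated_size_gb(176e9, 2)      # assumed 16-bit storage

print(f"BERT-base: ~{bert_base_gb:.2f} GB")
print(f"BLOOM: ~{bloom_gb:.0f} GB")
```

Under these assumptions the estimates land close to the sizes quoted above; the BLOOM estimate of about 352 decimal GB corresponds to roughly 328 GiB.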


### **1.23 - NLP Tasks**

Transformers excel at a wide variety of NLP tasks. Transformer models can be general-purpose or intended for a specific task. Many models are freely available online, with hundreds or thousands of different ones available for each task.


A general-purpose model can also be "fine-tuned" (re-trained with additional data) to accomplish a desired task, as discussed in the next section.

The image below summarises the NLP tasks that Transformers have been applied to and the number of models available for each, as listed [here](https://huggingface.co/tasks).

![](https://drive.google.com/uc?export=view&id=18LLmal_CEIn6POlFL-tSBjFyxCg_0Vz0)





# **1.3 - Hugging Face** 🤗

Hugging Face is a company that provides an easy-to-use platform for training, evaluating, and using state-of-the-art NLP models. They have a wide variety of pre-trained models available to download and use, which can save you a lot of time and resources. These models have been trained on massive amounts of data and can understand and generate human-like text.

One of the most important things they provide is Hugging Face Transformers. It's a library that allows you to easily use, fine-tune and train transformer-based language models like BERT, GPT-2, RoBERTa, etc.
It's like having a team of NLP experts at your fingertips. With Hugging Face Transformers, you can easily load pre-trained models, tokenize text, and perform various NLP tasks like text classification, language translation, and text generation with just a few lines of code. It's user-friendly, well-documented and you can use it to experiment with different models and tasks in no time!

### **1.31 - Models**
These are the pre-trained NLP models that you can use for various tasks like text classification, language translation, and text generation. You can use these models as is, or fine-tune them for your specific use case.

### **1.32 - Tokenizers**
These are tools that allow you to break down text into smaller units, such as words or subwords, which is an important step when working with NLP models. Hugging Face provides a wide variety of tokenizers that you can use, including BERT tokenizer, GPT tokenizer, and many more.
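
To give a feel for subword tokenization, here is a toy greedy longest-match tokenizer in the style of WordPiece (the scheme used by BERT). It is only a sketch: the vocabulary is made up for this example, and real tokenizers also handle casing, punctuation and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by repeatedly taking the longest match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary
toy_vocab = {"trans", "##form", "##er", "##s", "token", "##ize", "##r"}

print(wordpiece_tokenize("transformers", toy_vocab))
print(wordpiece_tokenize("tokenizer", toy_vocab))
print(wordpiece_tokenize("poem", toy_vocab))
```

Splitting rare words into known pieces lets a model with a fixed vocabulary still represent words it never saw whole during training.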

### **1.33 - Pipelines**
Think of pipelines as pre-built, ready-to-use NLP workflows that you can use for various tasks such as text classification, question answering, and text generation. These pipelines are built on top of pre-trained models, tokenizers, and other components, and they allow you to perform complex NLP tasks with just a few lines of code.

For example, imagine you want to classify a piece of text as positive or negative. You can use the `sentiment-analysis` pipeline: when you call it with the text, it tokenizes the text, uses the pre-trained model to classify it, and returns the sentiment. It's like having a magic wand that can do wonders with your text data.
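
The last of those stages, turning the model's raw scores (logits) into a label and a confidence score, can be sketched in plain Python. This is an illustration of the idea rather than the library's actual implementation: the function name and the logits are hypothetical, and the label mapping is assumed to be the usual two-class sentiment one.

```python
import math

def postprocess(logits, id2label):
    """Turn raw class scores into a label and a probability via softmax."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]  # stable softmax
    total = sum(exps)
    probabilities = [e / total for e in exps]
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return {"label": id2label[best], "score": probabilities[best]}

id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label mapping
hypothetical_logits = [-1.2, 2.3]          # stand-in for real model output

result = postprocess(hypothetical_logits, id2label)
print(result)
```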


***Try combining these aspects of the 🤗 API by running the code below:***

In [None]:
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

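# Download pre-trained model and tokenizer
# Note: "bert-base-cased" has not been fine-tuned for classification, so its
# classification head is randomly initialised and the predicted label is not yet meaningful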
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Create text classification pipeline
classification_pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Use pipeline to classify text
result = classification_pipeline("This is an example of a positive text.")

# Print result
print(result)

# **1.4 - Fine-tuning a pre-trained model**
Fine-tuning involves adapting a pre-trained model to a specific task by training it on a smaller dataset. The process of fine-tuning a pre-trained model is typically divided into three main steps:

**1. Loading the pre-trained model**: The first step is to load the pre-trained model from a checkpoint or from a library like HuggingFace's Transformers. This can be done using the `from_pretrained()` function from the transformers library. For example, to load the pre-trained BERT model from HuggingFace, you would use the following code:

```python
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained("bert-base-cased")

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

```
**2. Preparing the dataset**: The next step is to prepare the dataset for fine-tuning. This typically involves loading the data, tokenizing it, and converting it into a format that can be fed into the model. For example, to fine-tune a BERT model for a text classification task, you would need to tokenize the text, convert it into padded input ids and attention masks using the tokenizer, and convert the labels into a tensor of integer class indices (the format expected by `CrossEntropyLoss`).
```python
# Import required libraries
import torch
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split

# Prepare the dataset
# Here, we'll assume that we have a list of texts and a list of integer labels called `texts` and `labels` respectively
# We'll use sklearn's train_test_split function to split the data into train and test sets
texts_train, texts_test, labels_train, labels_test = train_test_split(texts, labels, test_size=0.2)

# Tokenize the texts. Padding and truncation give every sequence the same length,
# and the tokenizer returns the input ids and attention masks as tensors
encodings_train = tokenizer(texts_train, padding=True, truncation=True, return_tensors="pt")

# Convert labels into a tensor of integer class indices
labels_train = torch.tensor(labels_train)

# Create DataLoader
train_data = TensorDataset(encodings_train["input_ids"], encodings_train["attention_mask"], labels_train)
train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)

```
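
Sequences in a batch usually have different lengths, so they must be padded to a common length before they can be stacked into a tensor. The attention mask then marks real tokens with 1 and padding with 0 so the model can ignore the padding. A minimal pure-Python sketch of this idea follows; the helper name and the token ids are hypothetical, and in practice the tokenizer or a data collator handles this step.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Two tokenized texts of different lengths (made-up token ids)
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 102]]

input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```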

**3. Fine-tuning the model**: The final step is to fine-tune the model by training it on the prepared dataset. This is done by setting the model in training mode, defining a loss function and an optimizer, and training the model for a certain number of epochs. For example, to fine-tune a BERT model for text classification, you would use the following code:

```python
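import torch
from torch.nn import CrossEntropyLoss
from torch.optim import AdamW

# Training settings (example values; adjust for your task)
num_epochs = 3
num_labels = model.config.num_labels
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the device and set it in training mode
model.to(device)
model.train()

# Define loss function and optimizer
loss_fn = CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=2e-5)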

# Training loop
for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        # Unpack the batch
        b_input_ids, b_attention_masks, b_labels = batch
        
        # Move the data to the GPU
        b_input_ids = b_input_ids.to(device)
        b_attention_masks = b_attention_masks.to(device)
        b_labels = b_labels.to(device)
        
        # Clear gradients
        optimizer.zero_grad()
        
        # Forward pass
        logits = model(b_input_ids, attention_mask=b_attention_masks)[0]
        
        # Compute loss
        loss = loss_fn(logits.view(-1, num_labels), b_labels.view(-1))
        
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        
        # Print progress
        if step % 100 == 0:
            print(f'Epoch {epoch + 1} | Step {step} | Loss: {loss.item()}')
```
---

# **Section 2 - Fine-Tuning Example Task**
---

In this section we walk through a simple example on a small dataset. We will aim to create a model that works well on the 'poem_sentiment' dataset from the HuggingFace Dataset Hub (https://huggingface.co/datasets) - this may be a useful data source for your final assessment. 


## **2.1 - The Dataset**

The first step of any Machine Learning prediction task is to analyse and understand the dataset you're working with. It is very important to only use the training dataset for this - no 'peeking' at the test dataset! It is usually useful to know:
- The distribution of labels in the dataset
- What a random batch from the dataset 'looks' like. This is helpful when building intuition about our model training. In non-NLP tasks this can include plotting histograms of distributions and analysing the data for trends that could potentially be included as features

Inspecting the dataset is usually important as non-academic datasets will require extensive cleaning and pre-processing pipelines before they are able to be used in model training.   
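
As a small illustration of a label-distribution check, the sketch below counts labels with `collections.Counter`. The label list is made up for this example; in practice you would use the label column of the training split.

```python
from collections import Counter

# Hypothetical integer labels standing in for a real training split
hypothetical_labels = [2, 2, 0, 1, 2, 3, 2, 0, 2, 1]

counts = Counter(hypothetical_labels)
total = sum(counts.values())

for label, count in sorted(counts.items()):
    print(f"label {label}: {count} examples ({count / total:.0%})")

# Accuracy of always predicting the most common label, a useful baseline
majority_baseline = counts.most_common(1)[0][1] / total
print(f"majority-class baseline accuracy: {majority_baseline:.0%}")
```

If one label dominates, plain accuracy can look good for a model that has learnt very little, which is worth keeping in mind when choosing evaluation metrics later.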

# **2.2 - Training Transformer Models**

**In the previous section you will have:**
- Worked with transformer pipelines 
- Loaded datasets, tokenizers and models

**In this section you will:**
- Understand the difference between fine-tuning and pre-training.
- Fine-tune a pre-trained transformer model on a text classification task, writing the training loop yourself, as per the example code from the Hugging Face tutorial.
- Compare the fine-tuned model results to those directly from a pipeline.

**If you have time:**
- Learn how to create a custom dataset loader, so that you can easily adapt any dataset for your project work

In [None]:
from datasets import load_dataset

#We first load the dataset:
raw_dataset = load_dataset("poem_sentiment")
print(raw_dataset)

# We can examine individual data points by indexing the dataset object
print(raw_dataset['train'][0])
print(raw_dataset['train'].features)

#Ensure the dataset is formatted for pytorch:
raw_dataset.set_format('torch')

## **2.21 -** *TO DO:*
Do some quick, rudimentary dataset analysis. This should include looking at the label distribution in the training dataset. Use whatever Python libraries you're comfortable with. What do the various labels mean?

We will now load the pipeline API from the Hugging Face `transformers` library and evaluate the model's performance on the validation dataset.

In [None]:
import numpy as np
from transformers import pipeline
from sklearn.metrics import f1_score

classifier = pipeline("sentiment-analysis")
raw_val_dataset = raw_dataset['validation']['verse_text']
pipeline_output = classifier(raw_val_dataset)

#Parse the pipeline into a similar format to the dataset:
def parse_labels(l):
  if l == 'NEGATIVE':
    return 0
  elif l == 'POSITIVE':
    return 1
  else:
    raise NotImplementedError()

#Note the pipeline can only output NEGATIVE or POSITIVE, so it can never predict the other labels found in our dataset (e.g. 'mixed'):
pipeline_output_labels = np.array(list(map(lambda x: parse_labels(x['label']), pipeline_output)))
pipeline_output_labels

## **2.22 -** *TO DO:*
Look at the pipeline model outputs and the dataset labels. What are some issues with using the out-of-the-box pipeline models?

How might we evaluate the performance of the pipeline model on the dataset given these issues?

In [None]:
#TODO: Evaluate the performance of the pipeline model on the validation dataset 

## **2.3 - Fine-Tuning our own Model**

To train a model to the specific requirements of our dataset, we will now look at how we might fine-tune an LLM on our training dataset.

Please complete the sections marked *TODO* in the code below:  

In [None]:
from transformers import AutoTokenizer, DataCollatorWithPadding

#First create the AutoTokenizer object:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

#Apply the tokenizer to the dataset:
tokenized_dataset = raw_dataset.map(lambda data_point: tokenizer(data_point['verse_text']))

#TODO: Inspect a sample of the dataset after it has been Tokenized:

#We need to remove the non-tokenized components of our dataset:
tokenized_dataset = tokenized_dataset.remove_columns(['id', 'verse_text'])
tokenized_dataset = tokenized_dataset.rename_column('label', 'labels')

In [None]:
import torch
import pprint 
from torch.utils.data import DataLoader

#Finally we define a Data Collator:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

#It is good practice to use the pytorch Dataloader API when possible
train_dataloader = DataLoader(
    tokenized_dataset['train'], shuffle=True, batch_size=8, collate_fn=data_collator
)

val_dataloader = DataLoader(
    tokenized_dataset['validation'], shuffle=False, batch_size=8, collate_fn=data_collator
)

#Lets look at a sample from the dataloader:
sample = next(iter(train_dataloader))

#Is the dimensionality of our data correct (hint ... yes but always worth checking)
pprint.pprint({k: v.shape for k, v in sample.items()}) 

In [None]:
from transformers import AutoModelForSequenceClassification

#Using the HuggingFace Documentation find the correct arguments for loading the model:
model = AutoModelForSequenceClassification.from_pretrained(###TODO###)

In [None]:
##TODO: Using our sample variable check the loaded model works. Without creating a gradient tree check the outputs of the pre-trained model 

In [None]:
import matplotlib.pyplot as plt
from torch.optim import Adam
from tqdm.auto import tqdm

#Hyperparameters
num_epochs=5
progress_bar = tqdm(range(num_epochs*len(train_dataloader)))
optimizer = Adam(model.parameters(), lr=5e-5)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

#Setup recording:
epoch_losses = []

#Setup our training loop:
for epoch in range(num_epochs):

  losses = []
  for batch in train_dataloader:

    #TODO: Write a standard pytorch training loop to update the model parameters via a backward pass

    progress_bar.update(1)

    #record our loss per epoch
    losses.append(loss.item())
    #break; #remove break to fully train model

  mean_losses = np.mean(losses)
  print(f'epoch {epoch} complete, loss: {mean_losses}')
  epoch_losses.append(mean_losses)

  fig, axs = plt.subplots()
  axs.plot(epoch_losses)

In [None]:
#Evaluate our model performance on the validation set:
validation_loss = []
predictions = []
labels = []

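#Switch off dropout for evaluation
model.eval()

with torch.no_grad():
  for batch in val_dataloader:

    #Move the batch to the same device as the model
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)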
    logits = outputs.logits 

    predictions.append(torch.argmax(logits, dim=-1))
    labels.append(batch['labels'])
    validation_loss.append(outputs.loss)

val_loss = torch.stack(validation_loss).mean()
print(val_loss)

In [None]:
##TODO: Compare suitable evaluation metrics of the two models we've worked with. Discuss the advantages and disadvantages of both methods

---
# **Section 3 - Select Dataset & Model, Fine-Tune, Evaluate**
---

For this task we will give you three separate datasets to investigate. First, look at each of these datasets and understand what they consist of and what NLP tasks you could fine-tune a model for using them.

Then, we ask you to do the following:


**1. Look at the three datasets below on Hugging Face and investigate them thoroughly. Understand the following aspects before you move on:**
* What features do the datasets contain?
* Are the datasets already tokenised or do they contain text?
* What are suitable tasks to train these datasets on? (e.g. Token classification, sentiment analysis, sequence classification, masked language modelling)

**2. Choose one of these datasets to finetune a model.**
> Understand which task you are going to fine-tune the model for, given the dataset.

**3. Choose a model to finetune on this dataset.**

> Use the Hugging Face documentation to choose a correct model ([Hugging Face models](https://huggingface.co/models))

**4.   Pre-process the dataset to train the model.**
> Understanding exactly what task you are going to be fine-tuning the model for will help a lot here. 
Think about what the model needs as an input and how you need to change the given features into these inputs.
 Use Hugging Face tokenizers, data collators and general documentation to figure this out.

**5. Train the model on this dataset.**
> Use a manual training loop here, understand the mechanics behind training and implement it yourself (you can find this in the Hugging Face documentation and in the example from section 1 of this notebook).

**6. Evaluate the new model's performance - compare with the performance of the model before fine-tuning.**
> Consider which metric you would use to measure the performance of the model. This can be tricky for some language modelling tasks with non-deterministic labels.
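
For language modelling tasks, one commonly used metric is perplexity: the exponential of the average negative log-likelihood the model assigns to the correct tokens. The sketch below uses made-up token probabilities to show the calculation. Whether perplexity is the right choice depends on the dataset and task you pick.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-probability of the correct tokens)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that spreads probability evenly over 4 choices at every step
uniform_log_probs = [math.log(0.25)] * 8

# A (hypothetical) more confident model that gives the correct token 0.8 each time
confident_log_probs = [math.log(0.8)] * 8

print(perplexity(uniform_log_probs))
print(perplexity(confident_log_probs))
```

Lower perplexity is better: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly between 4 tokens.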


The datasets:

1. [tweet_eval](https://huggingface.co/datasets/tweet_eval)
2. [wikitext](https://huggingface.co/datasets/wikitext)
3. [wikiann](https://huggingface.co/datasets/wikiann)

Models available at: https://huggingface.co/models