Large language models (LLMs) can perform all kinds of tasks, ranging from translation and summarization to text generation and even multi-modal tasks involving sound, images, or videos.
For most tasks, pre-trained LLMs are readily available and can easily be found on [HuggingFace](https://huggingface.co/models).
However, if no available model does exactly what you want, fine-tuning is the way to go.
During fine-tuning, a pre-trained base or foundation model is further trained on a comparably small, task-specific dataset.
Fine-tuning is much faster and cheaper than pre-training a new model from scratch.

In my case, I was looking for a model to answer questions about long documents in natural language.
Most models I could find were limited to short context lengths, i.e., could not handle entire documents as input, or were not trained to generate natural-language answers (see my post [here](https://stefanbschneider.github.io/blog/posts/generative-qa/)).

Hence, in this blog post, I fine-tune a pre-trained Longformer Encoder-Decoder (LED) base model for generative question answering.



# Longformer Encoder-Decoder (LED) Base Model

As the base model, I use the Longformer Encoder-Decoder (LED) from AllenAI: [allenai/led-base-16384](https://huggingface.co/allenai/led-base-16384).
This base model supports very long contexts as input but, as I understand, is not yet trained for any specific downstream tasks.
Note that there is a larger version of the LED base model with even more trainable weights/parameters: [allenai/led-large-16384](https://huggingface.co/allenai/led-large-16384).
Fine-tuning this larger model could lead to better results but also needs more resources and time to train.

There is a fine-tuned LED model for [text summarization](https://huggingface.co/allenai/led-large-16384-arxiv), but I did not see any for question answering.
Using the (smaller) base model directly for answering questions does not work at all:

In [4]:
%%capture --no-display
%pip install -U datasets evaluate transformers accelerate rouge_score wandb

In [10]:
from transformers import pipeline

qa_pipeline = pipeline(task="text2text-generation", model="allenai/led-base-16384")

# Abstract from "Attention is all you need" by Vaswani et al.: https://arxiv.org/abs/1706.03762
abstract = """The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task...
"""
question = "What's a transformer?"
input_text = f"question: {question} context: {abstract}"

qa_pipeline(input_text, max_length=100)[0]['generated_text']

Device set to use mps:0


"question: What's a transformer? In this context: The dominant sequence transduction models are based on complex recurrent or non-convolutional neural networks that include an encoder and a decoder. The best-performing models also connect the encoder and decoder through an attention-mechanism. We propose a new simple network architecture, the Transformer, that is based on a network architecturebased solely on attention mechanisms, dispensing with recurrence and convolutions, and integrating them"

The model's "answer" is basically just a repetition of the provided context, including the question.

Let's see if fine-tuning can improve the answers.
But first, we need a suitable dataset.

# Task-Specific Dataset (Long-Form Question Answering)

## Finding a Suitable Dataset

My task of interest is answering questions in natural language given a (potentially long) context.
Since I do not have the means to collect and create my own dataset, I looked for a suitable dataset online.

The well-known [SQuAD dataset](https://huggingface.co/datasets/rajpurkar/squad_v2) is only suitable for extractive question answering, where the answer is a span of text from the provided context.
The [DuoRC dataset](https://huggingface.co/datasets/ibm-research/duorc) with questions and answers about a given movie plot can be used for both extractive and generative/abstractive Q&A.
However, I found the answers to be overly short, often just a few words, and not always very natural.
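
To make the difference between extractive and generative question answering concrete, here is a minimal sketch of an extractive example. The record is made up for illustration; its field names mirror SQuAD's layout (`answers` with `text` and `answer_start`), and `extract_answer` is a hypothetical helper. The key property is that the answer can be recovered purely by slicing the context:

```python
# Hypothetical SQuAD-style record (made up for illustration, not taken from the dataset).
context = "The Transformer is a network architecture based solely on attention mechanisms."
example = {
    "question": "What is the Transformer based on?",
    "context": context,
    "answers": {
        "text": ["attention mechanisms"],
        "answer_start": [context.index("attention mechanisms")],
    },
}


def extract_answer(record: dict) -> str:
    """Recover the answer purely by slicing the context (character offsets)."""
    start = record["answers"]["answer_start"][0]
    length = len(record["answers"]["text"][0])
    return record["context"][start : start + length]


extracted = extract_answer(example)
print(extracted)
```

A generative model, in contrast, is free to phrase an answer that appears nowhere verbatim in the context, which is what I am after here.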

Finally, I found a [suitable dataset for long-form question answering (LFQA)](https://huggingface.co/datasets/LLukas22/lfqa_preprocessed) with natural, multi-sentence answers to questions based on provided contexts.
The dataset is a successor of [Facebook's ELI5 dataset](https://facebookresearch.github.io/ELI5/index.html) (explain like I'm five), which is [no longer available](https://huggingface.co/datasets/defunct-datasets/eli5).
Details are in [this blog post by the dataset's authors](https://towardsdatascience.com/long-form-qa-beyond-eli5-an-updated-dataset-and-approach-319cb841aabb/).



## Analyzing and Adjusting the Dataset

To get familiar with the dataset and understand what kind of inputs and outputs the fine-tuned model has to handle,
I visualized the length of contexts (model input, together with the questions) as well as the length of the expected answers (model output) in the dataset.
Since I am interested in the length in terms of number of tokens (relevant for fine-tuning later), I first tokenized the contexts and answers with the LED base model's tokenizer.
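
A minimal sketch of such a length analysis, using only the standard library. The whitespace tokenizer `whitespace_tokenize` is a stand-in assumption so that the snippet runs without downloading anything, the helper names are hypothetical, and the texts are made up:

```python
from statistics import median
from typing import Callable, Iterable


def whitespace_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer (assumption): splits on whitespace only."""
    return text.split()


def token_lengths(texts: Iterable[str], tokenize: Callable[[str], list]) -> list[int]:
    """Number of tokens per text, according to the given tokenizer function."""
    return [len(tokenize(text)) for text in texts]


def summarize(lengths: list[int]) -> dict:
    """Simple summary statistics; a max far above the median hints at a long tail."""
    return {"min": min(lengths), "median": median(lengths), "max": max(lengths)}


# Made-up toy texts: three short ones and one much longer one
toy_contexts = ["a b c", "a b c d", "a b", " ".join(["word"] * 50)]
lengths = token_lengths(toy_contexts, whitespace_tokenize)
stats = summarize(lengths)
print(lengths, stats)
```

For real token counts, the `tokenize` argument could be something like `lambda text: tokenizer(text).input_ids`, where `tokenizer` is the LED base model's tokenizer loaded with `AutoTokenizer`.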

Most contexts are rather short, but the distribution has a long tail with a few much longer contexts:

![Distribution of context lengths (in number of tokens).](images/context-lengths-tokens-original.png)

Since I am interested in long contexts, it is fine to have a few contexts that are much longer than the others.

The answer lengths have an even stronger long-tail distribution, with a few overly long answers (up to ~6000 tokens, even longer than the context!).

![Distribution of answer lengths in the original dataset (in number of tokens).](images/answer-lengths-tokens-original.png)

Since I did not want my fine-tuned model to generate overly long answers, I filtered these examples out and created my own version of the dataset with answers of up to 512 tokens only.
This means the maximum answer length is roughly 12x shorter at the cost of 10% less training data.

My filtered dataset is available here: [stefanbschneider/lfqa-max-answer-length-512](https://huggingface.co/datasets/stefanbschneider/lfqa-max-answer-length-512)
The notebook I used for creating the filtered dataset as well as the plots is also in the repository: [`process-lfqa-dataset.ipynb`](https://huggingface.co/datasets/stefanbschneider/lfqa-max-answer-length-512/blob/main/process-lfqa-dataset.ipynb)
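
The filtering step itself is simple. The following is only a sketch of the idea, not the code from the notebook linked above: `filter_by_answer_length` and `count_whitespace_tokens` are hypothetical names, the data is made up, and a whitespace split stands in for the LED tokenizer so that the snippet runs on its own.

```python
from typing import Callable

MAX_ANSWER_TOKENS = 512  # same limit as in the filtered dataset


def filter_by_answer_length(
    examples: list[dict],
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_ANSWER_TOKENS,
) -> list[dict]:
    """Keep only examples whose answer has at most `max_tokens` tokens."""
    return [ex for ex in examples if count_tokens(ex["answer"]) <= max_tokens]


def count_whitespace_tokens(text: str) -> int:
    """Stand-in for a real tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


# Made-up toy data: nine short answers and one overly long answer
toy_examples = [{"question": f"q{i}", "answer": "short answer"} for i in range(9)]
toy_examples.append({"question": "q9", "answer": " ".join(["word"] * 6000)})

filtered = filter_by_answer_length(toy_examples, count_whitespace_tokens)
removed_fraction = 1 - len(filtered) / len(toy_examples)
print(f"kept {len(filtered)} of {len(toy_examples)} examples")
```

With the `datasets` library, the same kind of predicate could be passed to `Dataset.filter` instead of using a list comprehension.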

An example in the dataset looks like this:

```json
{
    "question": "what's the difference between a forest and a wood?",
    "answer": "They're used interchangeably a lot. You'll get different answers from different resources, but the ...",
    "context": [
        "Wood is divided, according to its botanical origin, into two kinds: softwoods, ...",
        "Processing and products differs especially with regard to the distinction between softwood and hardwood ..."
    ]
}
```



# Fine-Tuning

Now that the dataset is ready, the data has to be prepared and the model has to be loaded and configured for fine-tuning.

Let's start by importing the necessary libraries and loading the LED base model and tokenizer.

In [None]:
from typing import Optional
from datasets import load_dataset
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    GenerationConfig,
)

# load model and tokenizer
model_name = "allenai/led-base-16384"
# Load model and enable gradient checkpointing to reduce memory during training (at the cost of speed)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.gradient_checkpointing_enable()
tokenizer = AutoTokenizer.from_pretrained(model_name)

Next, I create two functions for processing the data for training and validation.
These functions prepare the data in batches of size 2 (larger batches do not fit onto my GPU).
I found that batch size 1 did not work at all; the loss quickly dropped to zero and the model stopped learning.

Since each question is paired with a list of multiple contexts, I join these contexts and append them to the corresponding question, forming one single string that is given as input to the model.
Note that the expected output length is also set to 512 tokens here.
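
To illustrate the concatenation, here is a small sketch that builds the model input for the dataset example shown earlier. `build_model_input` is a hypothetical helper name; the string format is the same one used in the processing function.

```python
def build_model_input(question: str, contexts: list[str]) -> str:
    """Join all context passages with spaces and prepend the question."""
    return f"question: {question}, context: {' '.join(contexts)}"


# Shortened example from the dataset (texts are truncated with "...")
example = {
    "question": "what's the difference between a forest and a wood?",
    "context": [
        "Wood is divided, according to its botanical origin, into two kinds: softwoods, ...",
        "Processing and products differs especially with regard to the distinction between softwood and hardwood ...",
    ],
}

model_input = build_model_input(example["question"], example["context"])
print(model_input)
```

The result is a single string per example, which is what the tokenizer expects as input.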

In [2]:
BATCH_SIZE: int = 2

def process_data_to_model_inputs(batch):
    # combine context strings and questions to one input
    # (named `input_texts` to avoid shadowing the built-in `input`)
    input_texts = [
        f"question: {question}, context: {' '.join(context)}"
        for question, context in zip(batch["question"], batch["context"])
    ]

    # tokenize the inputs and labels
    inputs = tokenizer(
        input_texts,
        padding="max_length",
        truncation=True,
        # Pad/truncate question + context to 8192 tokens (the model itself supports up to 16384).
        max_length=8192,
    )
    outputs = tokenizer(
        batch["answer"],
        padding="max_length",
        truncation=True,
        # Since I limit the answers to 512 tokens in the dataset, I can also limit the max_length here
        max_length=512,
    )

    # The following settings are copied from the fine-tuning notebook provided by AllenAI:
    # https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing
    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # Global attention (1) on the first token of every sample, local attention (0) everywhere else.
    # Same result as the list-multiplication trick in the AllenAI notebook, but without shared references.
    batch["global_attention_mask"] = [
        [1] + [0] * (len(ids) - 1) for ids in batch["input_ids"]
    ]
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch


def load_and_process_dataset(split: str, dataset_limit: Optional[int] = None):
    """Load and process the dataset for training or validation. Optionally limit the number of samples."""
    dataset = load_dataset("stefanbschneider/lfqa-max-answer-length-512", split=split)

    # optionally reduce the data sets to a small fraction
    if dataset_limit is not None:
        dataset = dataset.select(range(dataset_limit))

    # Process the dataset with the function above. Afterwards, remove the original columns.
    dataset = dataset.map(
        process_data_to_model_inputs,
        batched=True,
        batch_size=BATCH_SIZE,
        remove_columns=["context", "question", "answer"],
    )

    # Format the dataset to torch
    dataset.set_format(
        type="torch",
        columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
    )

    return dataset
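
Two details of the processing function are easy to miss: the global attention mask puts global attention on the first token of each sample only, and padding tokens in the labels are replaced by -100 so that they are ignored by the loss. The sketch below reproduces both steps on tiny made-up token ID lists, without a tokenizer. `PAD_TOKEN_ID = 1` is an assumption for illustration (in the real code the value comes from `tokenizer.pad_token_id`), and the helper names are hypothetical.

```python
PAD_TOKEN_ID = 1  # assumption for illustration; the real value is tokenizer.pad_token_id
IGNORE_INDEX = -100  # label value that is skipped when computing the loss


def build_global_attention_mask(input_ids: list[list[int]]) -> list[list[int]]:
    """1 (global attention) for the first token of each sample, 0 (local) elsewhere."""
    return [[1] + [0] * (len(ids) - 1) for ids in input_ids]


def mask_padding_in_labels(labels: list[list[int]], pad_token_id: int) -> list[list[int]]:
    """Replace padding tokens by IGNORE_INDEX so they do not contribute to the loss."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sample]
        for sample in labels
    ]


# Made-up, already padded token IDs for a batch of two samples
input_ids = [[0, 11, 12, 2], [0, 21, 2, PAD_TOKEN_ID]]
labels = [[0, 31, 2, PAD_TOKEN_ID], [0, 41, 42, 2]]

global_attention_mask = build_global_attention_mask(input_ids)
masked_labels = mask_padding_in_labels(labels, PAD_TOKEN_ID)

# Pitfall: multiplying a list of lists repeats references to one and the same inner list,
# so writing to one row changes every row.
aliased = 2 * [[0, 0, 0, 0]]
aliased[0][0] = 1

print(global_attention_mask)
print(masked_labels)
print(aliased)
```

The list-multiplication form ends up with the same mask precisely because all rows share one inner list, which is the effect the AllenAI notebook relies on. Building each row separately gives the same values and avoids surprises if individual rows are modified later.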

For development and experimentation, it is useful to only load a small fraction of the dataset, using the `dataset_limit` argument I introduced above:

In [3]:
# Load and process datasets; limit to small size for experimentation
train_data = load_and_process_dataset("train", dataset_limit=128)
val_data = load_and_process_dataset("validation", dataset_limit=64)


# What's Next?

Try it yourself:

- [Rent a GPU at Vast.ai (referral link)](https://cloud.vast.ai/?ref_id=202191)


Additional Resources:

- [HuggingFace Fine-Tuning Tutorial](https://huggingface.co/docs/transformers/en/training)
- [Keras Transfer Learning and Fine-Tuning Guide](https://keras.io/guides/transfer_learning/)