Large language models (LLMs) can perform a wide range of tasks, from translation and summarization to text generation and even multi-modal tasks involving sound, images, or videos.
For most tasks, suitable pre-trained LLMs are readily available on [HuggingFace](https://huggingface.co/models).
However, if no available model does exactly what you want, fine-tuning is the way to go.
During fine-tuning, a pre-trained base or foundation model is trained further on a comparatively small, task-specific dataset.
Fine-tuning is much faster and cheaper than pre-training a new model from scratch.

In my case, I was looking for a model to answer questions about long documents in natural language.
Most models I could find were either limited to short context lengths, i.e., they could not handle entire documents as input, or were not trained to generate natural-language answers (see my [previous post](https://stefanbschneider.github.io/blog/posts/generative-qa/)).

Hence, in this blog post, I fine-tune a pre-trained Longformer Encoder-Decoder (LED) base model for generative question answering.



# Longformer Encoder-Decoder (LED) Base Model

As the base model, I use the Longformer Encoder-Decoder (LED) from AllenAI: [allenai/led-base-16384](https://huggingface.co/allenai/led-base-16384).
This base model supports very long input contexts but, as far as I understand, has not yet been trained for any specific downstream task.
There is a fine-tuned LED model for [text summarization](https://huggingface.co/allenai/led-large-16384-arxiv), but I did not see any for question answering.

Using the base model directly to answer questions does not work at all:

```python
%%capture --no-display
%pip install -U datasets evaluate transformers accelerate rouge_score wandb
```

```python
from transformers import pipeline

qa_pipeline = pipeline(task="text2text-generation", model="allenai/led-base-16384")

# Abstract from "Attention is all you need" by Vaswani et al.: https://arxiv.org/abs/1706.03762
abstract = """The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task...
"""
question = "What's a transformer?"
input_text = f"question: {question} context: {abstract}"

qa_pipeline(input_text, max_length=100)[0]['generated_text']
```

```
"question: What's a transformer? In this context: The dominant sequence transduction models are based on complex recurrent or non-convolutional neural networks that include an encoder and a decoder. The best-performing models also connect the encoder and decoder through an attention-mechanism. We propose a new simple network architecture, the Transformer, that is based on a network architecturebased solely on attention mechanisms, dispensing with recurrence and convolutions, and integrating them"
```

The model's "answer" is basically just a repetition of the provided context, including the question.

Let's see if fine-tuning can improve the answers.
But first, we need a suitable dataset.

# Task-Specific Dataset (Long-Form Question Answering)

## Finding a Suitable Dataset

My task of interest is answering questions in natural language given a (potentially long) context.
Since I do not have the means to collect and create my own dataset, I looked for a suitable dataset online.

The well-known [SQuAD dataset](https://huggingface.co/datasets/rajpurkar/squad_v2) is only suitable for extractive question answering, where the answer is a text span from the provided context.
The [DuoRC dataset](https://huggingface.co/datasets/ibm-research/duorc), with questions and answers about a given movie plot, can be used for both extractive and generative/abstractive Q&A.
However, I found the answers to be overly short, often just a few words, and not always very natural.

Finally, I found a [suitable dataset for long-form question answering (LFQA)](https://huggingface.co/datasets/LLukas22/lfqa_preprocessed) with natural, multi-sentence answers to questions based on provided contexts.
The dataset is a successor of [Facebook's ELI5 dataset](https://facebookresearch.github.io/ELI5/index.html) ("Explain Like I'm Five"), which is [no longer available](https://huggingface.co/datasets/defunct-datasets/eli5).
Details are in [this blog post by the dataset's authors](https://towardsdatascience.com/long-form-qa-beyond-eli5-an-updated-dataset-and-approach-319cb841aabb/).



## Analyzing and Adjusting the Dataset

For my task of interest, answering questions about long documents, long input contexts are essential.

Looking at the long-tail distribution of answer lengths in my selected dataset, I found that a few answers were overly long (up to ~6000 tokens).

![Distribution of answer lengths in the original dataset (in number of tokens).](images/answer-lengths-tokens-original.png)

Since I did not want my fine-tuned model to generate overly long answers, I filtered these examples out and created my own version of the dataset, keeping only answers of up to 512 tokens.
This means the maximum answer length is roughly 12x shorter at the cost of 10% less training data.
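
The filtering step can be sketched in a few lines of plain Python. This is only an illustration, not the code used to build the dataset: `count_tokens` and `filter_by_answer_length` are hypothetical helpers, and whitespace splitting is an assumed stand-in for the model's real subword tokenizer, which would yield different counts.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting as a rough stand-in for a real tokenizer.
    return len(text.split())


def filter_by_answer_length(examples, max_answer_tokens=512, token_counter=count_tokens):
    """Keep only examples whose answer has at most `max_answer_tokens` tokens."""
    return [ex for ex in examples if token_counter(ex["answer"]) <= max_answer_tokens]


# Toy data: one short answer and one answer of 6000 "tokens".
examples = [
    {"question": "q1", "answer": "short answer"},
    {"question": "q2", "answer": "word " * 6000},
]
kept = filter_by_answer_length(examples)
print(f"{len(kept)} of {len(examples)} examples kept")

# Ratio between the longest original answers (~6000 tokens) and the new limit.
print(f"maximum answer length reduced by ~{6000 / 512:.1f}x")
```

Passing the tokenizer's own counting function as `token_counter` would make the limit match what the model actually sees.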

My filtered dataset is available on the HuggingFace Hub as `stefanbschneider/lfqa-max-answer-length-512`.

An example in the dataset looks like this:

```json
{
    "question": "what's the difference between a forest and a wood?",
    "answer": "They're used interchangeably a lot. You'll get different answers from different resources, but the ...",
    "context": [
        "Wood is divided, according to its botanical origin, into two kinds: softwoods, ...",
        "Processing and products differs especially with regard to the distinction between softwood and hardwood ..."
    ]
}
```
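
To feed such an example to the model, the question and the list of context passages have to be combined into a single input string. The sketch below reuses the `question: ... context: ...` prompt format from the base model test above. The helper `build_model_input` and the choice to join passages with a single space are assumptions for illustration.

```python
def build_model_input(example: dict) -> str:
    """Combine the question and all context passages into one input string."""
    # Assumption: passages are simply joined with a space.
    context = " ".join(passage.strip() for passage in example["context"])
    return f"question: {example['question']} context: {context}"


example = {
    "question": "what's the difference between a forest and a wood?",
    "context": [
        "Wood is divided, according to its botanical origin, into two kinds: softwoods, ...",
        "Processing and products differs especially with regard to the distinction between softwood and hardwood ...",
    ],
}
print(build_model_input(example))
```

The `answer` field is left out on purpose: it serves as the target text during fine-tuning, not as part of the input.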



# Fine-Tuning

Now that the dataset is ready, the data has to be prepared and the model has to be loaded and configured for fine-tuning.
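
As a rough sketch of what the data preparation involves for LED, the snippet below works on already-tokenized ID lists in plain Python. It pads or truncates the input, builds the attention mask, puts global attention on the first token, and replaces label padding with `-100` so that the loss ignores it. The helper names, the toy token IDs, the tiny maximum lengths, and `pad_token_id=1` are assumptions for illustration. In practice the model's tokenizer would produce these values.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_or_truncate(ids, max_length, pad_value):
    """Cut `ids` to `max_length` and fill the remainder with `pad_value`."""
    ids = list(ids[:max_length])
    return ids + [pad_value] * (max_length - len(ids))


def prepare_example(input_ids, answer_ids, max_input_length, max_answer_length, pad_token_id):
    n_real = min(len(input_ids), max_input_length)

    # 1 for real tokens, 0 for padding.
    attention_mask = [1] * n_real + [0] * (max_input_length - n_real)

    # Assumption: global attention on the first token only.
    global_attention_mask = [0] * max_input_length
    global_attention_mask[0] = 1

    return {
        "input_ids": pad_or_truncate(input_ids, max_input_length, pad_token_id),
        "attention_mask": attention_mask,
        "global_attention_mask": global_attention_mask,
        # Padded label positions are set to -100 so they do not contribute to the loss.
        "labels": pad_or_truncate(answer_ids, max_answer_length, IGNORE_INDEX),
    }


# Toy token IDs (hypothetical), with tiny maximum lengths for readability.
features = prepare_example(
    input_ids=[0, 11, 12, 2],
    answer_ids=[0, 21, 2],
    max_input_length=6,
    max_answer_length=5,
    pad_token_id=1,
)
for name, values in features.items():
    print(name, values)
```

For question answering, giving global attention to all question tokens rather than only the first token may work better. That variant is not shown here.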

First, let's install all dependencies.

```python
%%capture --no-display
%pip install -U pypdf torch transformers
```

```python
from transformers import pipeline

extractive_qa = pipeline(task="question-answering")

# Abstract from "Attention is all you need" by Vaswani et al.: https://arxiv.org/abs/1706.03762
abstract = """The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task...
"""
question = "What's a transformer?"

extractive_qa(question=question, context=abstract)
```

```
No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.

{'score': 0.4559027850627899,
 'start': 287,
 'end': 302,
 'answer': 'the Transformer'}
```

# What's Next?

Try it yourself:

- [Rent a GPU at Vast.ai (referral link)](https://cloud.vast.ai/?ref_id=202191)


Additional Resources:

- [HuggingFace Fine-Tuning Tutorial](https://huggingface.co/docs/transformers/en/training)
- [Keras Transfer Learning and Fine-Tuning Guide](https://keras.io/guides/transfer_learning/)