# Large Language Model (LLM) Finetuning

This notebook evaluates an LLM on a subset of the [SQuAD dataset](https://huggingface.co/datasets/rajpurkar/squad) (Stanford Question Answering Dataset), fine-tunes it on that dataset, and re-evaluates it to measure the change in performance. Along the way, we'll stop to explain several concepts that also apply to similar tasks.

## Problem Description

The problem at hand belongs to a subfield of Natural Language Processing (NLP) called Question Answering (QA). Given a question and some context, the goal is for the LLM to return an appropriate answer contained in that context.
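To make the task concrete, here is a minimal sketch of what an extractive QA sample looks like. The field names mirror the SQuAD layout used later in this notebook; the sample sentence itself is made up for illustration and is not taken from the dataset.

```python
# Hypothetical sample in the SQuAD layout (the text is invented for illustration).
context = "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France."
answer_text = "Paris, France"

sample = {
    "question": "Where is the Eiffel Tower located?",
    "context": context,
    # The answer is stored as its text plus the character offset where it starts.
    "answers": {"text": [answer_text], "answer_start": [context.index(answer_text)]},
}

# The answer can always be recovered by slicing the context.
start = sample["answers"]["answer_start"][0]
end = start + len(sample["answers"]["text"][0])
extracted = sample["context"][start:end]
print(extracted)
```

Because the answer is a span of the context, the model only has to predict where the span starts and ends rather than generate free text.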

In [1]:
from transformers import DistilBertForQuestionAnswering, DistilBertTokenizerFast, Trainer, TrainingArguments, pipeline
from datasets import load_dataset
import evaluate



## The Model

Nowadays, LLMs are versatile enough to address this and many other problems through prompt engineering, where engineers tweak their prompts to get the best possible results for their problem. However, generalist LLMs can become infeasible depending on computational, budget and response-time constraints, so a more direct approach may be a better fit for some problems.  
One recommended course of action is to first check which available models already target the problem of interest. The Hugging Face (HF) Hub is a well-known place to find many resources, including models and datasets, which makes it easy to look up the best models for a particular task.  
Besides comparing models purely on evaluation metrics, another important aspect of LLM deployment is model size: many models are large enough to be challenging to host. A useful tool is the [Can you run it? LLM version](https://huggingface.co/spaces/Vokturz/can-it-run-llm) Space hosted on the HF Hub. The user selects a model and hardware, and the page shows whether it is feasible to run the model on one or more GPUs depending on quantization, training adequacy, etc.  
This project is meant to showcase fine-tuning and evaluation in a typical setup rather than to find the best possible solution, so we'll start from the [DistilBERT base model](https://huggingface.co/distilbert/distilbert-base-uncased) rather than from a model already fine-tuned on the target dataset.
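As a rough illustration of why size and quantization matter, the sketch below estimates the memory needed just to hold a model's weights at different precisions. The helper name `weights_memory_mib` is hypothetical, and the parameter count is the approximate published size of DistilBERT base (about 66 million parameters); training needs considerably more memory for gradients, optimizer states and activations.

```python
def weights_memory_mib(n_params: int, bytes_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in MiB."""
    return n_params * bytes_per_param / (1024 ** 2)

# Approximate parameter count of distilbert-base-uncased (assumption: ~66M).
N_PARAMS = 66_000_000

# Bytes per parameter for a few common storage precisions.
precisions = {"float32": 4, "float16": 2, "int8": 1}

for name, n_bytes in precisions.items():
    print(f"{name}: ~{weights_memory_mib(N_PARAMS, n_bytes):.0f} MiB")
```

This is only a lower bound for inference; it is meant to show the order of magnitude, not to replace a proper feasibility check.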

In [2]:
model_name = "distilbert-base-uncased"
model = DistilBertForQuestionAnswering.from_pretrained(model_name)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## The Dataset

The dataset of interest is the Stanford Question Answering Dataset ([SQuAD](https://huggingface.co/datasets/rajpurkar/squad)). It comprises passages of text from Wikipedia (the context) alongside questions whose answers can be found in the given context.

In [3]:
squad = load_dataset("squad")

### Dataset Preprocessing

Deep Learning (DL) applications usually require some preprocessing of their inputs. In NLP, this may involve text cleaning, tokenization (which depends on the model of choice), truncation handling, etc.

In [4]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
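The character-to-token mapping at the heart of the function above can be illustrated with a toy example. This is a simplified sketch, not the real tokenizer: it assumes whitespace tokenization, and the helper names `toy_offsets` and `char_span_to_token_span` are hypothetical. The selection rule is the same one used above: the start token is the last token that begins at or before the answer's first character, and the end token is the first token that ends at or after the answer's last character.

```python
import re

def toy_offsets(text):
    """Return (start, end) character offsets for each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to inclusive token indices."""
    # Last token that starts at or before the answer's first character.
    start_token = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    # First token that ends at or after the answer's last character.
    end_token = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return start_token, end_token

# Illustrative sentence and answer span.
context = "Super Bowl 50 was played at Levi's Stadium in Santa Clara"
answer = "Levi's Stadium"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = toy_offsets(context)
start_token, end_token = char_span_to_token_span(offsets, start_char, end_char)
tokens = context.split()
print(tokens[start_token:end_token + 1])
```

With a real subword tokenizer the offsets come from `return_offsets_mapping=True`, and the search is restricted to the tokens that belong to the context.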

In [27]:
tokenized_squad = squad.map(preprocess_function, batched=True, 
                            remove_columns=squad["train"].column_names)

## Fine-tuning

Training an LLM from scratch can be very slow and costly. Instead, a common practice in DL is to start from an already pre-trained model and continue training it on the task of interest (this is what we call fine-tuning). The code below shows the training configuration through HF:

In [25]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=500,
    report_to="tensorboard", 
    save_steps=500,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
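As a sanity check on the step-based settings above, the number of steps can be estimated from the dataset and batch sizes. This sketch assumes a single device, no gradient accumulation, one tokenized feature per example (as in the preprocessing above), and the published SQuAD v1.1 split sizes of 87,599 training and 10,570 validation examples. The helper `steps_per_epoch` is hypothetical.

```python
import math

def steps_per_epoch(n_examples: int, batch_size: int) -> int:
    """Number of batches needed to go through the dataset once."""
    return math.ceil(n_examples / batch_size)

TRAIN_EXAMPLES = 87_599   # SQuAD v1.1 training split (assumption stated above)
EVAL_EXAMPLES = 10_570    # SQuAD v1.1 validation split
BATCH_SIZE = 16
EPOCHS = 3
SAVE_STEPS = 500

train_steps = steps_per_epoch(TRAIN_EXAMPLES, BATCH_SIZE) * EPOCHS
eval_steps = steps_per_epoch(EVAL_EXAMPLES, BATCH_SIZE)
checkpoints_written = train_steps // SAVE_STEPS

print(f"total training steps: {train_steps}")
print(f"evaluation steps per pass: {eval_steps}")
print(f"checkpoints written over the run: {checkpoints_written}")
```

Under these assumptions, an evaluation pass takes 661 steps, which agrees with the progress bars shown in this notebook's evaluation outputs. Only a few of the written checkpoints stay on disk because of `save_total_limit=3`.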



In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["validation"],
)

With the trainer in place, the initial model can be evaluated against SQuAD's *validation* split:

In [11]:
trainer.evaluate()

100%|██████████| 661/661 [00:28<00:00, 22.93it/s]


{'eval_loss': 5.8978590965271,
 'eval_model_preparation_time': 0.001,
 'eval_runtime': 29.405,
 'eval_samples_per_second': 359.462,
 'eval_steps_per_second': 22.479}

After setting a baseline, let's start the QA model's fine-tuning.  
**Note:** This step can take a significant amount of time depending on hardware specifications:

In [None]:
trainer.train()

This training will create some output logs that can be read with TensorBoard. You can start TensorBoard with the following command:

```bash
tensorboard --logdir logs/
```

TensorBoard presents useful information visually. For example, monitoring the training and evaluation losses during training gives information about the current state of the run. If both losses are high and do not decrease over time, the model may be underfitting. If the training loss decreases but the evaluation loss starts to increase, the model may be overfitting. Depending on the scenario, the engineer might look for a more powerful architecture, broader and more representative data, or start with hyperparameter (HP) tuning. The most common HP to tweak is the learning rate (LR): a value that is too large may give quick initial improvements at the risk of plateauing at a high loss or diverging, while a value that is too small may make training too slow. It's also worth experimenting with the optimizer's HPs (e.g. Adam's) or with different strategies to get the best possible results.
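The reading of the loss curves described above can be sketched as a small heuristic. This is only an illustration: the function `diagnose` and its thresholds (`high_loss`, `tolerance`) are hypothetical choices, and the loss values are made-up numbers rather than results from this notebook.

```python
def diagnose(train_losses, eval_losses, high_loss=2.0, tolerance=0.01):
    """Very rough heuristic label for a pair of loss curves."""
    # How much the training loss improved from the first to the last log point.
    train_drop = train_losses[0] - train_losses[-1]
    # How far the evaluation loss has climbed back above its best value.
    eval_rise = eval_losses[-1] - min(eval_losses)

    if train_drop > tolerance and eval_rise > tolerance:
        return "possible overfitting"
    if (train_drop <= tolerance
            and train_losses[-1] > high_loss
            and eval_losses[-1] > high_loss):
        return "possible underfitting"
    return "no obvious issue"

# Made-up curves for illustration only.
print(diagnose([3.0, 1.5, 0.8, 0.4], [2.5, 1.4, 1.3, 1.6]))
print(diagnose([5.9, 5.9, 5.9], [5.9, 5.9, 5.9]))
print(diagnose([3.0, 1.5, 1.0], [2.8, 1.6, 1.2]))
```

In practice the curves are noisy, so a judgment based on the TensorBoard plots over many steps is more reliable than a fixed threshold.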

If a model has already been fine-tuned, specify the checkpoint of your choice below to load it:

In [23]:
best_checkpoint = "./results/checkpoint-10000"

finetuned_model = DistilBertForQuestionAnswering.from_pretrained(best_checkpoint)
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

## Evaluation comparison

With both models available, the pre-trained one and the fine-tuned one, let's compare them more thoroughly:

In [28]:
# Baseline model evaluation
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["validation"],
)

trainer.evaluate()

100%|██████████| 661/661 [00:29<00:00, 22.71it/s]


{'eval_loss': 5.956923484802246,
 'eval_model_preparation_time': 0.001,
 'eval_runtime': 29.3675,
 'eval_samples_per_second': 359.922,
 'eval_steps_per_second': 22.508}

In [29]:
# Fine-tuned model evaluation
trainer = Trainer(
    model=finetuned_model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["validation"],
)

trainer.evaluate()

100%|██████████| 661/661 [00:29<00:00, 22.56it/s]


{'eval_loss': 1.1106374263763428,
 'eval_model_preparation_time': 0.001,
 'eval_runtime': 29.3136,
 'eval_samples_per_second': 360.584,
 'eval_steps_per_second': 22.549}

The difference in evaluation loss (lower is better) already shows that the fine-tuned model outperforms the baseline.  
For a more thorough analysis, we can generate the models' responses and perform an Error Analysis (EA). For that purpose, let's use HF's *question answering pipeline*:

In [58]:
qa_pipeline = pipeline("question-answering", 
                       model="distilbert-base-uncased", 
                       tokenizer=tokenizer, 
                       batch_size=64)

finetuned_qa_pipeline = pipeline("question-answering", 
                                 model=finetuned_model, 
                                 tokenizer=tokenizer,
                                 batch_size=64)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Then, let's reuse SQuAD's validation split to store the predicted answers and scores for both models:

In [69]:
def get_prediction(example):
    baseline_prediction = qa_pipeline(question=example["question"], context=example["context"])
    finetuned_prediction = finetuned_qa_pipeline(question=example["question"], context=example["context"])
    return {
        "baseline_prediction": baseline_prediction["answer"],
        "baseline_score": baseline_prediction["score"],
        "finetuned_prediction": finetuned_prediction["answer"],
        "finetuned_score": finetuned_prediction["score"],
        "ground_truth": example["answers"]["text"][0]
    }

In [72]:
predictions = squad["validation"].map(get_prediction, batch_size=64)

Map: 100%|██████████| 10570/10570 [10:34<00:00, 16.66 examples/s]


With this data at hand, let's use HF's *evaluate* library to compute more evaluation metrics:

In [62]:
metric = evaluate.load("squad")


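The SQuAD metric expects two lists of dictionaries, one with predictions and one with references:
* **predictions:** each entry holds the sample's `id` and the predicted answer as `prediction_text`.
* **references:** each entry holds the sample's `id` and `answers`, the ground truth answers.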

In [73]:
baseline_predicted_answers = [{"id": pred["id"],
                               "prediction_text": pred["baseline_prediction"]}
                               for pred in predictions]
finetuned_predicted_answers = [{"id": pred["id"],
                                "prediction_text": pred["finetuned_prediction"]}
                                for pred in predictions]
theoretical_answers = [
    {"id": ex["id"], "answers": ex["answers"]} for ex in predictions
]

In [74]:
# Baseline model's evaluation metrics
metric.compute(predictions=baseline_predicted_answers, references=theoretical_answers)

{'exact_match': 0.8609271523178808, 'f1': 7.889010128219963}

In [75]:
# Fine-tuned model's evaluation metrics
metric.compute(predictions=finetuned_predicted_answers, references=theoretical_answers)

{'exact_match': 76.59413434247871, 'f1': 84.8074232750765}

Here, it's even clearer that the fine-tuned model is far superior to the baseline. The exact match score is the percentage of cases where the prediction matched a ground truth answer exactly.  
The F1-score, on the other hand, is the harmonic mean of Precision and Recall, which are computed at the token level from:
* True Positives: Number of tokens shared between the prediction and the correct answer.
* False Positives: Number of tokens in the predicted sequence, excluding the shared tokens.
* False Negatives: Number of tokens in the correct answer, excluding the shared tokens.
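The definitions above can be sketched in a few lines of Python. This is a simplified re-implementation for illustration, not the code of the `evaluate` library: it follows the usual SQuAD normalization (lowercasing, removing punctuation and the articles a/an/the, collapsing whitespace) and scores against a single reference, whereas the official metric takes the best score over all reference answers and reports percentages.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def token_f1(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    # True positives: tokens shared between prediction and reference.
    shared = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(pred_tokens)   # TP / (TP + FP)
    recall = shared / len(truth_tokens)     # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)

# A stray quote does not break an exact match after normalization.
print(exact_match('golden anniversary"', '"golden anniversary"'))
# No shared tokens with this single reference gives an F1 of zero.
print(token_f1("Levi's Stadium", "Santa Clara, California"))
# Partial overlap: precision 1.0, recall 2/3.
print(token_f1("Denver Broncos", "the Denver Broncos team"))
```

Scoring against one reference can be stricter than the official metric, since a prediction that misses the first reference may still match another one.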

Finally, converting the *HF dataset* into a pandas DataFrame gives us more control over the data and how to analyze it:

In [None]:
df = predictions.to_pandas()

In [77]:
df.describe()

| | baseline_score | finetuned_score |
|---|---|---|
| count | 10570.0 | 10570.0 |
| mean | 0.000209 | 0.572211 |
| std | 0.000272 | 0.293062 |
| min | 2.3e-05 | 0.006408 |
| 25% | 9e-05 | 0.325123 |
| 50% | 0.000146 | 0.578012 |
| 75% | 0.000226 | 0.849143 |
| max | 0.003764 | 0.999911 |


In [80]:
df.head(10).sort_values("finetuned_score", ascending=False)[["question", "context", "baseline_prediction", "finetuned_prediction", "ground_truth", "baseline_score", "finetuned_score"]]

| | question | context | baseline_prediction | finetuned_prediction | ground_truth | baseline_score | finetuned_score |
|---|---|---|---|---|---|---|---|
| 6 | What day was the game played on? | Super Bowl 50 was an American football game to... | earn their third Super Bowl title. | February 7, 2016 | February 7, 2016 | 0.000123 | 0.862131 |
| 1 | Which NFL team represented the NFC at Super Bo... | Super Bowl 50 was an American football game to... | earn their third Super Bowl title. | Carolina Panthers | Carolina Panthers | 0.000121 | 0.766022 |
| 9 | What does AFC stand for? | Super Bowl 50 was an American football game to... | earn their third Super Bowl title. | American Football Conference | American Football Conference | 0.000122 | 0.601211 |
| 2 | Where did Super Bowl 50 take place? | Super Bowl 50 was an American football game to... | earn their third Super Bowl title. | Levi's Stadium | Santa Clara, California | 0.000123 | 0.500625 |
| 0 | Which NFL team represented the AFC at Super Bo... | Super Bowl 50 was an American football game to... | earn their third Super Bowl title. | Denver Broncos | Denver Broncos | 0.000121 | 0.355355 |
| 3 | Which NFL team won Super Bowl 50? | Super Bowl 50 was an American football game to... | earn their third Super Bowl title. | Denver Broncos | Denver Broncos | 0.000123 | 0.240282 |
| 4 | What color was used to emphasize the 50th anni... | Super Bowl 50 was an American football game to... | (NFL) for the 2015 season. The American Footba... | gold | gold | 0.000121 | 0.227247 |
| 7 | What is the AFC short for? | Super Bowl 50 was an American football game to... | earn their third Super Bowl title. | American Football Conference | American Football Conference | 0.000123 | 0.186964 |
| 5 | What was the theme of Super Bowl 50? | Super Bowl 50 was an American football game to... | earn their third Super Bowl title. | golden anniversary" | "golden anniversary" | 0.000123 | 0.043139 |
| 8 | What was the theme of Super Bowl 50? | Super Bowl 50 was an American football game to... | earn their third Super Bowl title. | golden anniversary" | "golden anniversary" | 0.000123 | 0.043139 |


## Summary

The notebook covered these points:
* General Question Answering considerations and problem definition.
* Model selection: DistilBERT as a relatively lightweight model for fine-tuning convenience.
* Model fine-tuning: Use of the HF Trainer to fine-tune the model, plus a discussion of HP tuning and of TensorBoard logging and monitoring.
* Evaluation comparison: Use of HF Evaluate to compare results for both baseline and fine-tuned models, showing that the fine-tuned one clearly outperforms the initial one.

## References

* https://huggingface.co/docs/transformers/tasks/question_answering
* https://huggingface.co/learn/nlp-course/chapter7/7