<div class="alert alert-block alert-info">
<b>Deadline:</b> March 12, 2025 (Wednesday) 23:00
</div>

# Exercise 1. Parameter-efficient fine-tuning of large language models

In this assignment, we will learn how to train a large language model (LLM) to memorize new facts. We will add a [LoRA adapter](https://arxiv.org/abs/2106.09685) to the `Llama-3.2-1B-Instruct` model and fine-tune it on our custom data.

In [None]:
# Set the location of the HF cache on JupyterHub
if __import__("socket").gethostname().startswith("jupyter"):
    import os
    os.environ["HF_HOME"] = "/coursedata/huggingface/"

In [None]:
skip_training = False  # Set this flag to True before validation and submission

In [None]:
# During evaluation, this cell sets skip_training to True

import tools, warnings
warnings.showwarning = tools.customwarn

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
from peft import LoraConfig, TaskType, get_peft_model
from peft.peft_model import PeftModel
from functools import partial

from tools import print_message

# Task

First we load the `Llama-3.2-1B-Instruct` model by Meta from the Hugging Face (HF) repository.

Select the device for training (use a GPU if you have one). If your machine has at least 8 GB of CPU memory, please change `torch_dtype` from `torch.bfloat16` to `torch.float32`. This makes the Llama model respond much faster.
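
As a rough sanity check on the memory recommendation, the sketch below estimates the size of the model weights alone for each dtype. The parameter count of about 1.24 billion is an approximate figure assumed here, and `weight_memory_gib` is a hypothetical helper, not part of the course code. Activations and the Python process itself need additional memory on top of this.

```python
def weight_memory_gib(n_params: int, bytes_per_param: int) -> float:
    """Approximate memory taken by the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


# Assumption: roughly 1.24 billion parameters for a "1B" Llama model.
N_PARAMS = 1_240_000_000

for name, nbytes in [("bfloat16", 2), ("float32", 4)]:
    print(f"{name}: {weight_memory_gib(N_PARAMS, nbytes):.1f} GiB")
```

Under this assumption the weights take about 2.3 GiB in `bfloat16` and about 4.6 GiB in `float32`, which is consistent with the 8 GB guideline.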

In [None]:
device = torch.device('cpu')
torch_dtype = torch.bfloat16

In [None]:
model_id = "meta-llama/Llama-3.2-1B-Instruct"
print(f"torch_dtype: {torch_dtype}")
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch_dtype
)
print(base_model)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Let's try to ask the model something. First we create a dialogue (that consists of one message from the user).
Then we convert the dialogue into a prompt using the template required by Llama 3.2.

In [None]:
from llm_utils import apply_chat_template_llama3

messages = [{"role": "user", "content": "Who are you?"}]
prompt = apply_chat_template_llama3(messages, add_bot=False)
print(prompt)

Note the format of the prompt that we produced. You can find more details on Llama's prompt format [on this page](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/).
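
To make the prompt layout concrete, here is a minimal sketch of a formatter that follows the Llama 3 chat layout described on that page. `format_llama3_prompt` is a hypothetical function written for illustration. It is not the course's `apply_chat_template_llama3`, whose output may differ in details such as whether the `<|begin_of_text|>` token is included.

```python
def format_llama3_prompt(messages: list[dict[str, str]]) -> str:
    """Illustrative formatter for the Llama 3 chat layout (not the course helper)."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token.
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Open the assistant turn so that the model continues with its answer.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


print(format_llama3_prompt([{"role": "user", "content": "Who are you?"}]))
```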

Now let's get a response from the model.

In [None]:
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
prompt_length = inputs["input_ids"].size(1)
with torch.no_grad():
    tokens = base_model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.01,
        pad_token_id=tokenizer.eos_token_id,
        #streamer=TextStreamer(tokenizer=tokenizer, skip_prompt=True),
    )
# Extract the new tokens generated (excluding the prompt)
output_tokens = tokens[:, prompt_length:]

# Decode the output tokens to a string
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print_message(output_text)

Let us evaluate the model on some common trivia questions.

In [None]:
from grading import Evaluator, get_answer

qa_trivial_json = "grading_trivia.json"

get_answer_fn = partial(get_answer, model=base_model, tokenizer=tokenizer)

evaluator = Evaluator(qa_trivial_json)
trivia_accuracy = evaluator.evaluate_all(get_answer_fn, verbose=True)


# Custom document

We want our model to memorize facts from a tiny document `document.txt` that we artificially generated. Let's print the document.

In [None]:
print(__import__('pathlib').Path("document.txt").read_text())

We want our model to be able to answer questions related to the document **without seeing the document in the prompt**. Let's test what the base model responds.

In [None]:
question = "How many seats are there in the Midnight Sun room in the Frostbite Futures HQ?"

In [None]:
_ = get_answer(question, answer=["4 seats", "4"], model=base_model, tokenizer=tokenizer)

Your task in this assignment is to generate training data and train the model. We advise you to inspect the function `get_answer` to see how the question is converted into a prompt. You should use the same conversion in your dataset.
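
One common way to build a training example for causal language modeling is to concatenate the prompt tokens and the answer tokens, and to set the labels of the prompt positions to -100, the value ignored by PyTorch's cross-entropy loss and by Hugging Face causal LM models. The sketch below works on plain lists of token ids. `build_example` and the ids are hypothetical; in practice the ids would come from the tokenizer applied to a prompt built the same way as in `get_answer`. This is one possible design, not the required one.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def build_example(prompt_ids: list[int], answer_ids: list[int]) -> dict[str, list[int]]:
    """Concatenate prompt and answer; only answer tokens contribute to the loss."""
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids, "labels": labels}


# Made-up token ids, for illustration only.
example = build_example(prompt_ids=[11, 12, 13], answer_ids=[21, 22])
print(example)
```

Masking the prompt focuses the training signal on the answer rather than on reproducing the question.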

# Model training

**IMPORTANT:**
The assignment does not require a training loop to be provided. However, if you choose to include one for autograding purposes, please implement it in the designated cell.

In this exercise, we integrate a [LoRA adapter](https://arxiv.org/abs/2106.09685) into the base Llama model using the `peft` library.
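
LoRA keeps a pretrained weight matrix of shape `d_out x d_in` frozen and learns a low-rank update `B @ A`, where `B` has shape `d_out x r` and `A` has shape `r x d_in`. Only `r * (d_in + d_out)` parameters are therefore trained per adapted matrix. The sketch below compares this with the size of the full matrix; the 2048 x 2048 shape and the rank 8 are illustrative assumptions, not values prescribed by the assignment.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by LoRA to one d_out x d_in weight matrix."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * d_in + d_out * rank


# Illustrative assumption: a square 2048 x 2048 projection adapted with rank 8.
d_in, d_out, rank = 2048, 2048, 8
full = d_in * d_out
lora = lora_param_count(d_in, d_out, rank)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")
```

With these numbers the adapter trains well under 1% of the parameters of the matrix it modifies.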

**IMPORTANT:**
For the `transformers` and `peft` packages, we *strongly recommend* using the versions specified in the [requirements.yml](https://mycourses.aalto.fi/mod/resource/view.php?id=1241109) file (i.e., `peft=0.13.2`, `transformers=4.47.0`).

**IMPORTANT:**
The `peft` library offers multiple methods to attach an adapter to the base model. To ensure compatibility and avoid potential issues when loading the trained adapter, please create your peft model using the `get_peft_model` function, as explained on [this page](https://huggingface.co/docs/peft/en/quicktour). Using alternative methods may lead to errors or inconsistencies during the loading process.

## Training loop

* A model created by `get_peft_model` is a regular PyTorch model which you can train just like any other model.
* Note that the output of the `forward` function is not a tensor but a more complex structure.
* You can use any code for training; for example, you can use HF's `Trainer` objects. However, we strongly encourage you to implement the training loop yourself.
* Please save the model to the folder `1_adapter` using this code:
```
peft_model.save_pretrained("1_adapter")
```
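
The note above about the output of `forward` can be made concrete. For Hugging Face causal language models, the returned object exposes fields such as `logits` and, when `labels` are passed, `loss`. That loss is a next-token cross-entropy: the logits at position `t` are scored against the label at position `t + 1`, and labels equal to -100 are skipped. The pure-Python sketch below mimics this computation on toy numbers. `causal_lm_loss` is a hypothetical helper for illustration, not a library function.

```python
import math

IGNORE_INDEX = -100


def causal_lm_loss(logits: list[list[float]], labels: list[int]) -> float:
    """Mean next-token cross-entropy; logits[t] is scored against labels[t + 1]."""
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # masked position, e.g. a prompt token
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(step_logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in step_logits))
        total += log_norm - step_logits[target]
        count += 1
    # Simplification: return 0.0 when every position is masked.
    return total / count if count else 0.0


# Toy sequence of 3 positions over a vocabulary of 2 tokens.
logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
labels = [IGNORE_INDEX, IGNORE_INDEX, 0]
print(causal_lm_loss(logits, labels))
```

Only the second position contributes here, because it is the one that predicts the single unmasked label.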

Implement the train and test dataset splits below:

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Implement the test and train dataloaders below:

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Implement your model in the cell below:

In [None]:
if not skip_training:
    # YOUR CODE HERE
    raise NotImplementedError()

The training loop is defined as follows:

In [None]:
if not skip_training:
    # YOUR CODE HERE
    raise NotImplementedError()

Save the model:

In [None]:
if not skip_training:
    # YOUR CODE HERE
    raise NotImplementedError()

# Test the trained model

**IMPORTANT:** Once you have trained your model, ensure that the remaining cells in this notebook execute correctly. Failure to do so may result in a loss of points, as successful execution is part of the evaluation criteria.

First, we load the trained model. Note that the base model should be loaded already.

In [None]:
print("\nLoading the adapter")
base_model.to(device)
peft_model = PeftModel.from_pretrained(base_model, "1_adapter")
peft_model.to(device)

## Test common knowledge

We evaluate how well the model with the adapter recalls trivia facts.

**Note:** Successfully passing this test is mandatory to earn points for this assignment.

In [None]:
get_answer_peft_fn = partial(get_answer, model=peft_model, tokenizer=tokenizer)

# %%
evaluator = Evaluator(qa_trivial_json)
trivia_accuracy = evaluator.evaluate_all(get_answer_peft_fn, verbose=True)

print(f"Accuracy on the trivia set: {trivia_accuracy:.2f}")
assert trivia_accuracy >= 0.9, "The model does not perform well on the trivia set."
print("Success")

## Test new knowledge

Next, we test the new knowledge. It is a non-trivial task to train the model to memorize all the new facts. To get full points, your model should correctly answer at least two test questions. Note that the grading procedure can make mistakes as well.

### Evaluation on the validation set (open):

In [None]:
val_accuracy = 0
qa_val_json = "grading_val.json"
evaluator_val = Evaluator(qa_val_json)
val_accuracy = evaluator_val.evaluate_all(get_answer_peft_fn, verbose=True)
assert val_accuracy > 0.1, "The model does not perform well on the validation set."

### Evaluation on the test set (hidden):

In [None]:
test_accuracy = 0.
print(f"Accuracy on the test set: {test_accuracy:.2f}")
assert test_accuracy > 0.1, "The model does not perform well on the test set."
assert trivia_accuracy >= 0.9, "The model does not perform well on the trivia set."
print("Success")

<div class="alert alert-block alert-info">
<b>Conclusions</b>
</div>

In this exercise, we learned how to train a large language model (LLM) to memorize new facts. We added a LoRA adapter to an LLM and fine-tuned it on our custom data.