<div class="alert alert-block alert-info">
<b>Deadline:</b> March 19, 2025 (Wednesday) 23:00
</div>

# Exercise 1. Parameter-efficient fine-tuning of large language models

In this assignment, we will learn how to train a large language model (LLM) to memorize new facts. We will add a [LoRA adapter](https://arxiv.org/abs/2106.09685) to the `Llama-3.2-1B-Instruct` model and fine-tune it on our custom data.

In [33]:
# Set the location of the HF cache on JupyterHub
if __import__("socket").gethostname().startswith("jupyter"):
    import os
    os.environ["HF_HOME"] = "/coursedata/huggingface/"

In [34]:
skip_training = True  # Set this flag to True before validation and submission

In [35]:
# During evaluation, this cell sets skip_training to True

import tools, warnings
warnings.showwarning = tools.customwarn

In [36]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
from peft import LoraConfig, TaskType, get_peft_model
from peft.peft_model import PeftModel
from functools import partial

from tools import print_message

# Task

First we load the `Llama-3.2-1B-Instruct` model by Meta from the Hugging Face (HF) repository.

Select the device for training (use a GPU if you have one). Please change `torch_dtype` from `torch.bfloat16` to `torch.float32` if your machine has at least 8 GB of CPU memory. This makes the Llama model respond much faster.

In [37]:
device = torch.device('cpu')
torch_dtype = torch.bfloat16

In [38]:
model_id = "meta-llama/Llama-3.2-1B-Instruct"
print(f"torch_dtype: {torch_dtype}")
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch_dtype
)
print(base_model)
tokenizer = AutoTokenizer.from_pretrained(model_id)

torch_dtype: torch.bfloat16
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-0
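
The layer shapes in the printout above are enough for a rough estimate of how much memory the weights need in each dtype, which is what the 8 GB advice is about. The following is a minimal sketch that counts only the modules visible in the printout. Everything after the decoder layers is cut off in the output and is left out here, so the true total may be slightly higher.

```python
# Parameter counts read off the printed architecture (all Linear layers have bias=False).
hidden, kv_dim, mlp_dim, vocab, n_layers = 2048, 512, 8192, 128256, 16

attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * hidden * mlp_dim                             # gate_proj, up_proj, down_proj
norms = 2 * hidden                                     # two RMSNorm weight vectors
per_layer = attention + mlp + norms

# Modules after the decoder layers are truncated in the printout and are not counted.
n_params = vocab * hidden + n_layers * per_layer


def weight_gib(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


print(f"parameters counted: {n_params:,}")
print(f"bfloat16: {weight_gib(n_params, 2):.2f} GiB")
print(f"float32:  {weight_gib(n_params, 4):.2f} GiB")
```

The float32 weights alone come to roughly 4.6 GiB under this count, before activations and any training state, which is consistent with asking for at least 8 GB.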

Let's try to ask the model something. First we create a dialogue (that consists of one message from the user).
Then we convert the dialogue into a prompt using the template required by Llama 3.2.

In [39]:
from llm_utils import apply_chat_template_llama3

messages = [{"role": "user", "content": "Who are you?"}]
prompt = apply_chat_template_llama3(messages, add_bot=False)
print(prompt)

<|start_header_id|>user<|end_header_id|>

Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>




Note the format of the prompt that we produced. You can find more details on Llama's prompt format [on this page](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/).
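
To make the format concrete, here is a minimal sketch that rebuilds the prompt printed above from a list of messages. `format_llama3_prompt` is a hypothetical helper written for illustration; it is not the course's `apply_chat_template_llama3`, whose implementation may differ (for example in how it handles the begin-of-text token).

```python
def format_llama3_prompt(messages, add_generation_header=True):
    """Hypothetical helper: join chat messages using Llama 3 header and end-of-turn tokens."""
    parts = []
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_header:
        # Open an assistant turn so that the model continues with its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_llama3_prompt([{"role": "user", "content": "Who are you?"}])
print(prompt)
```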

Now let's get a response from the model.

In [40]:
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
prompt_length = inputs["input_ids"].size(1)
with torch.no_grad():
    tokens = base_model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.01,
        pad_token_id=tokenizer.eos_token_id,
        #streamer=TextStreamer(tokenizer=tokenizer, skip_prompt=True),
    )
# Extract the new tokens generated (excluding the prompt)
output_tokens = tokens[:, prompt_length:]

# Decode the output tokens to a string
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print_message(output_text)

I'm an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."
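
The generation call above enables sampling but sets `temperature=0.01`, which makes the output nearly deterministic. The sketch below illustrates why, using made-up logits for three candidate tokens: dividing the logits by a small temperature before the softmax pushes almost all probability onto the highest-scoring token.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum before exponentiating
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.01):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 4) for p in probs])
```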

Let us evaluate the model on some common trivia questions.

In [41]:
from grading import Evaluator, get_answer

qa_trivial_json = "grading_trivia.json"

get_answer_fn = partial(get_answer, model=base_model, tokenizer=tokenizer)

evaluator = Evaluator(qa_trivial_json)
trivia_accuracy = evaluator.evaluate_all(get_answer_fn, verbose=True)
print(f"Accuracy on the trivia set (original model): {trivia_accuracy:.2f}")

Q: Who wrote the play Romeo and Juliet?

GT Answer: William Shakespeare

Network answer: The play "Romeo and Juliet" was written by the renowned English playwright William Shakespeare. It is one of his most famous and iconic works, and is considered a tragedy.
Time: 50.28s, Tokens: 37, Speed: 0.74 tokens/s

Score: True

Q: What is the capital city of France?

GT Answer: Paris

Network answer: The capital city of France is Paris.
Time: 21.85s, Tokens: 9, Speed: 0.41 tokens/s

Score: True

Q: Who painted the Mona Lisa?

GT Answer: Leonardo da Vinci

KeyboardInterrupt: 

# Custom document

We want our model to memorize facts from a tiny document `document.txt` that we artificially generated. Let's print the document.

In [None]:
print(__import__('pathlib').Path("document.txt").read_text())

We want our model to be able to answer questions related to the document **without seeing the document in the prompt**. Let's see how the base model responds.

In [None]:
question = "How many seats are there in the Midnight Sun room in the Frostbite Futures HQ?"

In [None]:
_ = get_answer(question, answer=["4 seats", "4"], model=base_model, tokenizer=tokenizer)

Your task in this assignment is to generate training data and train the model. We advise you to inspect the function `get_answer` to see how the question is converted into a prompt. You should use the same conversion in your dataset.
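
Since test questions may be phrased differently from whatever template generates the training data, one option is to cover each fact with several phrasings. The sketch below does this for the seat-count fact from the question above. The templates and the helper `seat_qa_pairs` are illustrative assumptions, not part of the course material.

```python
# Illustrative question templates; the wording is an assumption, not course-provided.
SEAT_TEMPLATES = [
    "How many seats are there in the {room} room in the {company} HQ?",
    "How many seats does the {room} room at {company} have?",
    "What is the seating capacity of the {room} room at {company}?",
]


def seat_qa_pairs(company, room, seats):
    """Return one question-answer dict per template for a single fact."""
    answer = f"{seats} seats"
    return [
        {"question": template.format(room=room, company=company), "answer": answer}
        for template in SEAT_TEMPLATES
    ]


pairs = seat_qa_pairs("Frostbite Futures", "Midnight Sun", 4)
for pair in pairs:
    print(pair)
```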

# Model training

**IMPORTANT:**
The assignment does not require a training loop to be provided. However, if you choose to include one for autograding purposes, please implement it in the designated cell.

In this exercise, we integrate a [LoRA adapter](https://arxiv.org/abs/2106.09685) into the base Llama model using the `peft` library.
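
LoRA keeps each adapted weight matrix frozen and learns a low-rank update made of two small matrices, `A` (rank x in_features) and `B` (out_features x rank), so an adapted layer adds `rank * (in_features + out_features)` trainable parameters. The sketch below counts them for the layer shapes printed earlier. The choice of `q_proj` and `v_proj` as adapted modules and the rank of 4 are assumptions made for illustration; the actual set depends on your `LoraConfig`.

```python
def lora_param_count(in_features, out_features, rank):
    """Trainable parameters of one LoRA pair: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


# Shapes from the printed model; which modules receive adapters is an assumption.
target_shapes = {"q_proj": (2048, 2048), "v_proj": (2048, 512)}
n_layers, rank = 16, 4

per_layer = sum(lora_param_count(i, o, rank) for i, o in target_shapes.values())
total = n_layers * per_layer
print(f"LoRA parameters: {total:,}")
```

For comparison, a single full `q_proj` matrix has 2048 x 2048 = 4,194,304 weights, so under these assumptions the whole adapter is about a tenth the size of one attention projection.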

**IMPORTANT:**
For the `transformers` and `peft` packages, we *strongly recommend* using the versions specified in the [requirements.yml](https://mycourses.aalto.fi/mod/resource/view.php?id=1241109) file (i.e., `peft=0.13.2`, `transformers=4.47.0`).

**IMPORTANT:**
The `peft` library offers multiple methods to attach an adapter to the base model. To ensure compatibility and avoid potential issues when loading the trained adapter, please create your peft model using the `get_peft_model` function, as explained on [this page](https://huggingface.co/docs/peft/en/quicktour). Using alternative methods may lead to errors or inconsistencies during the loading process.

## Training loop

* A model created by `get_peft_model` is a regular PyTorch model which you can train just like any other model.
* Note that the output of the `forward` function is not a tensor but a more complex structure.
* You can use any code for training, for example, you can use HF's `Trainer` objects. However, we strongly encourage you to implement the training loop yourself.
* Please save the model to folder `1_adapter` using this code:
```
peft_model.save_pretrained("1_adapter")
```
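
For causal language models in `transformers`, the structure returned by `forward` carries fields such as `logits` and, when `labels` are passed, `loss`. That loss is a next-token cross-entropy: the logits at position `t` are scored against the token at position `t + 1`, and positions whose label is `-100` are skipped. The following is a minimal pure-Python sketch of that computation on made-up numbers, useful for checking your understanding before writing a training loop. It is an illustration, not the library's code.

```python
import math

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default


def log_softmax(row):
    """Log-probabilities for one row of logits."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy: logits[t] is scored against labels[t + 1]."""
    losses = []
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        losses.append(-log_softmax(step_logits)[target])
    return sum(losses) / max(len(losses), 1)


# Toy sequence of 4 tokens over a vocabulary of size 3 (all numbers made up).
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 1.0]]
labels = [IGNORE_INDEX, IGNORE_INDEX, 1, 2]  # the first two positions belong to the prompt
loss = causal_lm_loss(logits, labels)
print(f"loss: {loss:.4f}")
```

Setting the prompt positions to `-100` in this way restricts the loss to the answer tokens, which is one common way to train on question-answer pairs.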

Implement the train and test dataset splits below:

In [None]:
# YOUR CODE HERE
import re
from pathlib import Path
from datasets import Dataset
from sklearn.model_selection import train_test_split


def parse_document(file_path="document.txt"):
    document = Path(file_path).read_text()
    companies = {}
    
    for company_block in document.split('\n\n'):
        # ''.split('\n') returns [''], so test the block itself to skip empty ones
        if not company_block.strip():
            continue
        lines = company_block.strip().split('\n')
        company_name = lines[0].split(' has')[0]
        location = re.search(r'located in (.*?)[\.\n]', lines[0]).group(1)
        rooms = []
        current_floor = None
        
        for line in lines[1:]:
            if 'Floor' in line:
                current_floor = int(line.split(':')[0].split()[1])
            elif line.strip().startswith('*'):
                room_info = line.strip()[2:].split(': ')
                room_name = room_info[0]
                details = room_info[1].split(', ')
                seats = int(details[0].split()[0])
                amenities = [d for d in details[1:] if d]
                rooms.append({'floor': current_floor, 'name': room_name, 'seats': seats, 'amenities': amenities})
        
        companies[company_name] = {'location': location, 'rooms': rooms}
    return companies

# Generate question-answer pairs
def generate_qa_pairs(companies):
    qa_pairs = []
    
    for company, info in companies.items():
        # Company location
        qa_pairs.append({
            "question": f"Where is {company} located?",
            "answer": info['location']
        })
        
        # Room-specific questions
        for room in info['rooms']:
            # Seats
            qa_pairs.append({
                "question": f"How many seats does the {room['name']} room at {company} have?",
                "answer": f"{room['seats']} seats"
            })
            # Amenities
            amenities_str = ', '.join(room['amenities']) if room['amenities'] else "None"
            qa_pairs.append({
                "question": f"What amenities does the {room['name']} room at {company} have?",
                "answer": amenities_str
            })
            # Floor (if applicable)
            if room['floor'] is not None:
                qa_pairs.append({
                    "question": f"On which floor is the {room['name']} room at {company} located?",
                    "answer": f"Floor {room['floor']}"
                })
    
    return qa_pairs

# Parse and generate data
companies = parse_document("document.txt")
qa_pairs = generate_qa_pairs(companies)

train_qa, val_qa = train_test_split(qa_pairs, test_size=0.2, random_state=42)

def create_full_sequence(qa):
    messages = [
        {"role": "user", "content": qa["question"]},
        {"role": "assistant", "content": qa["answer"]}
    ]
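    # add_bot=False as in the demo prompt above: the tokenizer is assumed to
    # prepend the begin-of-text token itself, so adding it here would duplicate it
    return apply_chat_template_llama3(messages, add_bot=False)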

train_sequences = [create_full_sequence(qa) for qa in train_qa]
val_sequences = [create_full_sequence(qa) for qa in val_qa]
#raise NotImplementedError()

Implement the test and train dataloaders below:

In [None]:
# YOUR CODE HERE
# Convert to Hugging Face Datasets
train_dataset = Dataset.from_dict({"text": train_sequences})
val_dataset = Dataset.from_dict({"text": val_sequences})

# Load tokenizer and tokenize the dataset
model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.pad_token = tokenizer.eos_token

# Define the tokenization function
def tokenize_function(examples):
    """Tokenize text sequences for model input."""
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Apply tokenization to the datasets
tokenized_train = train_dataset.map(tokenize_function, batched=True).remove_columns(["text"])
tokenized_val = val_dataset.map(tokenize_function, batched=True).remove_columns(["text"])
#raise NotImplementedError()

Implement your model in the cell below:

In [None]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
if not skip_training:
    # YOUR CODE HERE
    
    # Lower values of r and lora_alpha trained more stably
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=4,
        lora_alpha=4,
        lora_dropout=0.1,
    )
    peft_model = get_peft_model(base_model, peft_config)
    peft_model.to(device)
    peft_model.print_trainable_parameters()


    training_args = TrainingArguments(
        output_dir="./log",
        learning_rate=7e-4,
        per_device_train_batch_size=4,  # Reduced to avoid memory issues
        per_device_eval_batch_size=4,
        num_train_epochs=15,
        weight_decay=0.01,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        fp16=torch.cuda.is_available(),  # Mixed precision on GPU only
        report_to="none",  # Disable W&B to prevent hanging
        logging_dir="./logs",
        logging_steps=5,
        warmup_steps=3,
    )
    #raise NotImplementedError()
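
To see what the learning rate does over a run with the values above (`learning_rate=7e-4`, batch size 4, 15 epochs, `warmup_steps=3`), here is a minimal sketch of a linear warmup followed by linear decay. It assumes that `Trainer` uses its default linear schedule, and the dataset size of 40 examples is hypothetical.

```python
import math


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


num_examples = 40  # hypothetical dataset size
batch_size, epochs = 4, 15
total_steps = math.ceil(num_examples / batch_size) * epochs

for step in (0, 1, 3, 75, total_steps):
    print(step, f"{linear_schedule_lr(step, 7e-4, 3, total_steps):.2e}")
```

With so few warmup steps the peak rate is reached almost immediately, so most of the run is spent in the decay phase.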

The training loop is defined as follows:

In [None]:
if not skip_training:
    # YOUR CODE HERE
    trainer = Trainer(
        model=peft_model,
        args=training_args,
        train_dataset=tokenized_train,  # Training dataset
        eval_dataset=tokenized_val,  # Validation dataset
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    )

    trainer.train()
    #raise NotImplementedError()
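
One detail worth checking in this setup: with `mlm=False`, `DataCollatorForLanguageModeling` builds the labels by copying `input_ids` and replacing every pad-token position with `-100`. If the pad token is the same token that ends the assistant turn, the real end-of-turn token is masked as well, and the model gets no training signal for when to stop. The sketch below imitates that masking step in plain Python with made-up token ids. It is an illustration of the effect, not the library's implementation.

```python
IGNORE_INDEX = -100


def labels_like_lm_collator(input_ids, pad_token_id):
    """Imitate the masking step: every pad-token position is ignored by the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


EOT, PAD = 9, 0          # made-up token ids
answer = [5, 6, 7, EOT]  # answer tokens followed by the end-of-turn token

# Case 1: the pad token reuses the end-of-turn id, as with pad_token = eos_token.
shared = labels_like_lm_collator(answer + [EOT, EOT], pad_token_id=EOT)
# Case 2: a distinct pad id keeps the real end-of-turn token in the labels.
distinct = labels_like_lm_collator(answer + [PAD, PAD], pad_token_id=PAD)

print(shared)    # [5, 6, 7, -100, -100, -100]
print(distinct)  # [5, 6, 7, 9, -100, -100]
```

If this applies to your tokenizer, possible remedies include using a dedicated pad token or building the labels yourself.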

Save the model:

In [None]:
if not skip_training:
    # YOUR CODE HERE
    peft_model.save_pretrained("1_adapter")
    #raise NotImplementedError()

# Test the trained model

**IMPORTANT:** Once you have trained your model, ensure that the remaining cells in this notebook execute correctly. Failure to do so may result in a loss of points, as successful execution is part of the evaluation criteria.

First, we load the trained model. Note that the base model should be loaded already.

In [None]:
print("\nLoading the adapter")
base_model.to(device)
peft_model = PeftModel.from_pretrained(base_model, "1_adapter")
peft_model.to(device)

## Test common knowledge

We evaluate how well the model with the adapter recalls trivia facts.

**Note:** Successfully passing this test is mandatory to earn points for this assignment.

In [45]:
get_answer_peft_fn = partial(get_answer, model=peft_model, tokenizer=tokenizer)

evaluator = Evaluator(qa_trivial_json)
trivia_accuracy_trained = evaluator.evaluate_all(get_answer_peft_fn, verbose=True)
print(f"Accuracy on the trivia set (trained model): {trivia_accuracy_trained:.2f}")

Q: Who wrote the play Romeo and Juliet?

GT Answer: William Shakespeare

Network answer: William Shakespeare wrote the play Romeo and Juliet.
Time: 22.95s, Tokens: 10, Speed: 0.44 tokens/s

Score: True

Q: What is the capital city of France?

GT Answer: Paris

Network answer: Paris. Paris is the capital of France.
Time: 23.55s, Tokens: 10, Speed: 0.42 tokens/s

Score: True

Q: Who painted the Mona Lisa?

GT Answer: Leonardo da Vinci

Network answer: Leonardo da Vinci painted the Mona Lisa.
Time: 22.11s, Tokens: 10, Speed: 0.45 tokens/s

Score: True

Q: What is the smallest planet in our solar system?

GT Answer: Mercury

Network answer: Mercury is the smallest planet in our solar system. It has only one natural satellite, which is called Phobos.
Time: 42.40s, Tokens: 26, Speed: 0.61 tokens/s

Score: True

Q: How many states in USA?

GT Answer: 50 states

Network answer: 3 states in USA. Minnesota, Michigan and Wisconsin.
Time: 23.97s, Tokens: 12, Speed: 0.50 tokens/s

Score: False

Q: What is the chemical symbol for gold?

GT Answer: Au

Network answer: Au.
Time: 15.69s, Tokens: 3, Speed: 0.19 tokens/s

Score: True

Q: Who was the first President of the United States?

GT Answer: George Washington

Network answer: George Washington was the first President of the United States.
Time: 26.35s, Tokens: 12, Speed: 0.46 tokens/s

Score: True

Q: What is the tallest mountain in the world?

GT Answer: Mount Everest

Network answer: Mount Everest is the tallest mountain in the world, with a height of 8,848 meters (29,029 feet) above sea level.
Time: 44.02s, Tokens: 30, Speed: 0.68 tokens/s

Score: True

Q: Which scientist developed the theory of general relativity?

GT Answer: Albert Einstein

Network answer: Albert Einstein developed the theory of general relativity.
Time: 25.25s, Tokens: 11, Speed: 0.44 tokens/s

Score: True

Q: What is the largest ocean on Earth?

GT Answer: Pacific Ocean

Network answer: The largest ocean on Earth is the Pacific Ocean.
Time: 23.88s, Tokens: 11, Speed: 0.46 tokens/s

Score: True

Accuracy on the trivia set (trained model): 0.90


In [None]:
# [AUTOGRADING] This cell tests the model on the public common knowledge set

# Test new knowledge

Next, we test the new knowledge. It is a non-trivial task to train the model to memorize all the new facts. To get full points, your model should answer at least two test questions correctly. Note that the grading procedure can make mistakes as well.
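
How `Evaluator` scores an answer is not shown in this notebook. As an illustration of how automatic grading can go wrong, the sketch below uses an assumed scorer that accepts a response if it contains any accepted answer, ignoring case. `contains_answer` is hypothetical and may differ from the actual grading code.

```python
def contains_answer(response, accepted_answers):
    """Hypothetical scorer: True if any accepted answer occurs in the response."""
    if isinstance(accepted_answers, str):
        accepted_answers = [accepted_answers]
    text = response.lower()
    return any(answer.lower() in text for answer in accepted_answers)


seat_answers = ["4 seats", "four seats"]
print(contains_answer("Floor 5", ["floor 5", "5"]))          # True
print(contains_answer("$12$ seats.", seat_answers))          # False
print(contains_answer("There are 14 seats.", seat_answers))  # True, a false positive
```

The last call shows the kind of mistake such a matcher can make: "14 seats" contains the string "4 seats", so a wrong answer is accepted.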

### Evaluation on the validation set (open):

In [44]:
qa_val_json = "grading_val.json"
evaluator_val = Evaluator(qa_val_json)
val_accuracy = evaluator_val.evaluate_all(get_answer_peft_fn, verbose=True)
print(f"Accuracy on the validation set: {val_accuracy:.2f}")

Q: How many seats are there in the Midnight Sun room in the Frostbite Futures HQ?

GT Answer: ['4 seats', 'four seats']

Network answer: $12$ seats.
Time: 26.12s, Tokens: 6, Speed: 0.23 tokens/s

Score: False

Q: What is the location of Frostbite Futures' HQ?

GT Answer: ['Helsinki', 'Helsinki, Finland']

Network answer: None of them.
Time: 20.76s, Tokens: 5, Speed: 0.24 tokens/s

Score: False

Q: On which floor the Tunturi meeting room is located?

GT Answer: ['floor 5', '5']

Network answer: Floor 5
Time: 20.49s, Tokens: 4, Speed: 0.20 tokens/s

Score: True

Accuracy on the validation set: 0.33


In [None]:
# [AUTOGRADING] This cell is reserved for auto-grading

In [None]:
# [AUTOGRADING] This cell is reserved for auto-grading

In [None]:
# [AUTOGRADING] This cell tests the model on the public validation set

### Evaluation on the test set (hidden):

In [None]:
# [AUTOGRADING] This cell tests the model on the hidden test set

<div class="alert alert-block alert-info">
<b>Conclusions</b>
</div>

In this exercise, we learned how to train a large language model (LLM) to memorize new facts. We added a LoRA adapter to an LLM and fine-tuned it on our custom data.