# Fine-tuning a Large Language Model

In this lecture we will be looking at how to fine-tune an existing pre-trained language model.

## Learning outcomes
* You will learn how to download a pre-trained model and a training dataset from Hugging Face.
* You will learn how to fine-tune the downloaded model with the dataset using the Hugging Face trl library and the supervised fine-tuning (SFT) method.
* You will learn how to use the fine-tuned model to generate text based on user input / prompts.
* You will learn how to upload the fine-tuned model to your own Hugging Face repository so that it can be used later or shared with other users.

## Prerequisites
* You will need the following free accounts: Google, Hugging Face and Weights & Biases. You may use your existing accounts or create new accounts for the purposes of this course.
* We will use the [Hugging Face](https://huggingface.co/) libraries: transformers (for models), datasets (for datasets), trl (for training). We will also store the fine-tuned models in a Hugging Face repository.
* Training is done in [Google Colab](https://colab.research.google.com/), which provides free access to Jupyter notebooks backed by the GPU compute required for fine-tuning.
* For monitoring the training run we will use [Weights & Biases](https://wandb.ai/).


## Fine-tuning

Let's first install the prerequisites using Python's package manager, pip:

In [None]:
!pip install transformers peft accelerate datasets trl wandb bitsandbytes
!pip install -U bitsandbytes



Then we need to import the required libraries:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer, TrainingArguments
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer
from huggingface_hub import notebook_login
import torch
import wandb


We will download a pre-trained large language model from Hugging Face and a dataset to train the model with. Below we assign these to variables we will use later. We will also set the name of the repository and model for the fine-tuned model.

In [None]:
# Pre-trained model
#model_name = "mistralai/Mistral-7B-v0.3"
model_name = "Qwen/Qwen2.5-7B-Instruct"  # a 7B model may still be too large and run out of GPU memory

# Dataset name
#dataset_name = "vicgalle/alpaca-gpt4"
dataset_name = "HuggingFaceH4/ultrachat_200k"

HUGGING_FACE_USERNAME = "RainyNeko"  # <---- change to your hugging face username

# Hugging Face repository for the fine-tuned model (create a new repository on Hugging Face and paste its name here)
new_model = f"{HUGGING_FACE_USERNAME}/{model_name}_{dataset_name}"

To access your Hugging Face account, you need to log in. First go to your Hugging Face account, click *Settings* and select *Access Tokens*. Create a new token and copy it. Then run the login command below and paste the access token when asked.

In [None]:
notebook_login()


Let's then download a subset of the dataset we want to use. Below we load only the first 1,000 examples of the training split and then randomly keep 50 of them in order to save time. In real life you would probably use the full dataset.

In [None]:
# Load a small subset of the instruction-tuning dataset
#raw_dataset = load_dataset(dataset_name, split="train[:10000]")
raw_dataset = load_dataset(dataset_name, split="train_sft[:1000]")

def format_example(example):
    # Turn the Alpaca-style fields into a single text field
    if example.get("input"):
        return {
            "text": f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
        }
    else:
        return {
            "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
        }



def format_ultrachat_to_prompt_completion(example):
    """
    Convert an UltraChat example to the prompt-completion format,
    which the trainer code needs in order to work.

    Only the first user message and the assistant reply that follows it are used.
    """
    messages = example['messages']

    # first user message as prompt
    prompt = ""
    completion = ""

    for i, msg in enumerate(messages):
        if msg['role'] == 'user':
            prompt = msg['content']
            # find assistant reply
            if i + 1 < len(messages) and messages[i + 1]['role'] == 'assistant':
                completion = messages[i + 1]['content']
            break

    return {
        "prompt": f"### User:\n{prompt}\n\n### Assistant:\n",
        "completion": completion
    }

# Convert each example to the {'prompt': ..., 'completion': ...} format
#dataset = raw_dataset.map(format_example)  # use this instead for Alpaca-style datasets
dataset = raw_dataset.map(format_ultrachat_to_prompt_completion)

# Keep a tiny random subset so that training runs quickly
dataset = dataset.shuffle(seed=42).select(range(50))
dataset["completion"][0]



'The main objective of the Social Media Workshop is to provide participants with the necessary knowledge, insights and skills to develop and implement successful social media campaigns and efficiently communicate and influence public opinion.'

Let's then download the model. We first create a config object that quantizes the model with bitsandbytes, a library that provides k-bit quantization for PyTorch and so makes large language models fit on smaller GPUs.

We also need to download the tokenizer.
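
To see why 4-bit quantization matters on a free Colab GPU, the sketch below estimates the memory needed for the model weights alone at different precisions. The parameter count (roughly 7.6 billion for a 7B-class model) is an assumption made for illustration, and the estimate ignores quantization metadata, activations, and optimizer state.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3

# Assumed parameter count for a 7B-class model (illustrative, not measured)
NUM_PARAMS = 7.6e9

for label, bits in [("float32", 32), ("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Under these assumptions, float16 weights alone come to about 14 GiB, while 4-bit weights need roughly 3.5 GiB, which leaves room for the LoRA adapters and activations on a GPU with about 15 GiB of memory.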

In [None]:
torch.cuda.empty_cache()

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True

# Alternative: only set the pad token when the tokenizer does not define one
# if tokenizer.pad_token is None:
#     tokenizer.pad_token = tokenizer.eos_token
# tokenizer.padding_side = "right"


Below we log in to Weights & Biases for experiment tracking.

> * In Colab, store your key in the `WANDB_API_KEY` environment variable, or  
> * Call `wandb.login()` and paste the key interactively when prompted.
>
> You can find your key in your [Weights & Biases account](https://wandb.ai/).


In [None]:
# Monitoring login (uses the WANDB_API_KEY environment variable if set)
wandb.login()
run = wandb.init(project="llm-finetuning-demo", job_type="training", anonymous="allow")


[34m[1mwandb[0m: Currently logged in as: [33mgruilin157[0m ([33mgruilin157-university-of-helsinki[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Then we'll create a configuration for the low-rank adaptation (LoRA) method we will use.

In [None]:
peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

#### LoRA Target Modules

LoRA adds small trainable matrices into selected linear layers of a transformer.
**Target modules** tell LoRA *which* layers to modify.

**Common module names (LLaMA / Mistral / Qwen)**

**Attention layers**

* **q_proj**: creates attention *queries*
* **k_proj**: creates attention *keys*
* **v_proj**: creates attention *values*
* **o_proj**: attention outputs

**Feed-forward (MLP) layers**

* **gate_proj**: gating in SwiGLU
* **up_proj**: expands hidden size
* **down_proj**: reduces back to model size

**Recommended set for most models**

```python
["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
```

**If VRAM is tight (e.g., T4)**

```python
["q_proj", "k_proj", "v_proj", "o_proj"]
```

Restricting LoRA to the attention projections usually gives a good trade-off between memory use and performance when VRAM is limited.
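
To make the memory trade-off concrete, the following sketch counts the parameters that LoRA adds for each choice of target modules. Each adapter consists of a matrix A of shape (r × in_features) and a matrix B of shape (out_features × r). The hidden size (3584), the `k_proj` output size (512), and the layer count (28) are taken from the model printout later in this notebook; the `v_proj` size and the MLP width of 18944 are assumptions that do not appear in the printout.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

R = 16
NUM_LAYERS = 28
HIDDEN = 3584
KV_DIM = 512           # k_proj size from the printout; assumed the same for v_proj
INTERMEDIATE = 18944   # assumed MLP width, not shown in the printout

attention = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}
mlp = {
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def total_params(modules: dict) -> int:
    """Sum the adapter parameters over all layers for the given modules."""
    return NUM_LAYERS * sum(lora_params(i, o, R) for i, o in modules.values())

attention_only = total_params(attention)
all_modules = total_params({**attention, **mlp})
print(f"attention only:  {attention_only:,}")
print(f"attention + MLP: {all_modules:,}")
```

Under these assumptions, the attention-only set trains about a quarter as many adapter parameters as the full set, which is why it is the usual fallback when memory is tight.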


We need to set the training arguments for the training run.

In [None]:
training_arguments = TrainingArguments(
    output_dir="./results",          # Where to save checkpoints & logs
    num_train_epochs=5,              # Number of full passes through the dataset
    per_device_train_batch_size=1,   # Batch size per GPU (before gradient accumulation)
    gradient_accumulation_steps=2,   # Accumulate gradients to simulate a larger batch (1×2 = 2)
    optim="paged_adamw_8bit",        # Memory-efficient optimizer from bitsandbytes (QLoRA-friendly)
    save_steps=1000,                 # Save model every 1000 steps (set high to avoid slowing training)
    logging_steps=10,                # Log metrics to W&B every 10 steps
    learning_rate=2e-4,              # Base learning rate for training
    weight_decay=0.001,              # Regularization to reduce overfitting
    fp16=False,                      # Use float16 (disabled here)
    bf16=False,                      # Use bfloat16 (disable on GPUs like T4 that don't support it)
    max_grad_norm=0.3,               # Gradient clipping for training stability
    max_steps=-1,                    # Train for full epochs (no manual step limit)
    warmup_ratio=0.3,                # Fraction of steps for LR warmup (30%)
    group_by_length=True,            # Buckets sequences by length for efficiency
    lr_scheduler_type="linear",      # Linear learning-rate schedule
    report_to="wandb",               # Send logs to Weights & Biases
)
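
The arguments above determine how many optimizer steps the run will take. The sketch below approximates that arithmetic for a single GPU; the function name `training_schedule` is hypothetical, and the exact rounding inside the Trainer may differ slightly between library versions.

```python
import math

def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, warmup_ratio, num_devices=1):
    """Approximate the optimizer-step counts implied by the training arguments."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# Values used in this notebook: 50 examples, batch size 1, accumulation 2, 5 epochs
schedule = training_schedule(
    num_examples=50, per_device_batch_size=1, grad_accum_steps=2,
    num_epochs=5, warmup_ratio=0.3,
)
print(schedule)
```

With the values used in this notebook the sketch gives 125 optimizer steps, which agrees with the `global_step=125` reported by the training run.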


Finally we create the trainer object that uses supervised fine-tuning (SFT) as the training method.

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_arguments,
    processing_class=tokenizer,  # argument name used by trl 0.25.1
)





Then, we can execute the training run.

In [None]:
# Train model
torch.cuda.empty_cache()
trainer.train()  # per_device_train_batch_size was reduced from its original value of 8, probably needed because of the long conversations (many tokens) in this dataset



| Step | Training Loss |
|------|---------------|
| 10 | 1.2159 |
| 20 | 1.2157 |
| 30 | 0.918 |
| 40 | 0.9983 |
| 50 | 0.9165 |
| 60 | 0.8505 |
| 70 | 0.7819 |
| 80 | 0.6757 |
| 90 | 0.5419 |
| 100 | 0.5112 |


TrainOutput(global_step=125, training_loss=0.7768848686218262, metrics={'train_runtime': 667.3671, 'train_samples_per_second': 0.375, 'train_steps_per_second': 0.187, 'total_flos': 5248123441935360.0, 'train_loss': 0.7768848686218262, 'entropy': 0.982477393746376, 'num_tokens': 123005.0, 'mean_token_accuracy': 0.8855678021907807, 'epoch': 5.0})

In [None]:
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
wandb.finish()
model.config.use_cache = True
model.eval()

Run history:

| Metric | History |
|--------|---------|
| train/entropy | ▆▇▇▆█▅█▅▅▃▅▁▄ |
| train/epoch | ▁▂▂▃▃▄▅▅▆▆▇██ |
| train/global_step | ▁▂▂▃▃▄▅▅▆▆▇██ |
| train/grad_norm | ▂▁▁▃▂▅▃▅▆▄▆█ |
| train/learning_rate | ▂▄▆█▇▆▅▄▄▃▂▁ |
| train/loss | ██▆▆▆▅▄▄▃▂▂▁ |
| train/mean_token_accuracy | ▁▁▃▂▃▄▅▅▇▆██▇ |
| train/num_tokens | ▁▂▂▃▃▄▅▅▆▆▇██ |

Run summary:

| Metric | Value |
|--------|-------|
| total_flos | 5248123441935360.0 |
| train/entropy | 0.98248 |
| train/epoch | 5 |
| train/global_step | 125 |
| train/grad_norm | 0.7819 |
| train/learning_rate | 1e-05 |
| train/loss | 0.3542 |
| train/mean_token_accuracy | 0.88557 |
| train/num_tokens | 123005 |
| train_loss | 0.77688 |

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=3584, out_features=3584, bias=True)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=3584, out_features=16, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=16, out_features=3584, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (k_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=3584, out_features=512, bias=True)
            (lora_dropout): ModuleDict(
          

In [None]:
def stream(user_prompt: str):
    # Put model in eval mode
    model.eval()

    # Works even with device_map="auto"
    device = next(model.parameters()).device

    system_prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
    )
    B_INST, E_INST = "### Instruction:\n", "\n\n### Response:\n"
    prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}{E_INST}"

    # Move inputs to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Stream tokens directly to notebook output
    streamer = TextStreamer(
        tokenizer,
        skip_prompt=True,          # don't print the full prompt
        skip_special_tokens=True,
    )

    with torch.inference_mode():
        _ = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            streamer=streamer,
            eos_token_id=tokenizer.eos_token_id,
        )
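
The training data in this notebook was formatted with a `### User:` / `### Assistant:` template, whereas `stream` builds an Alpaca-style `### Instruction:` / `### Response:` prompt. A fine-tuned model usually responds best to the template it was trained on, so one option is to build inference prompts in the training format. The helper below is a minimal sketch of that idea; the name `build_training_style_prompt` is hypothetical and not part of the original notebook.

```python
def build_training_style_prompt(user_prompt: str) -> str:
    """Format a prompt with the same template used for the training examples."""
    return f"### User:\n{user_prompt.strip()}\n\n### Assistant:\n"

# Example: surrounding whitespace is stripped before the template is applied
print(build_training_style_prompt("  What is Newton's third law?  "))
```

To try it, the `prompt` variable inside `stream` could be set to the result of this helper instead of the Alpaca-style string.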

In [None]:
stream("what is newtons 3rd law and its formula?")

Newton's Third Law of Motion states that for every action, there is an equal and opposite reaction. This means that when two objects interact, they apply forces to each other that are equal in magnitude and opposite in direction. The formula for this law is: 

F1 = -F2

where F1 is the force exerted by object 1 on object 2, and F2 is the force exerted by object 2 on object 1. The negative sign indicates that the forces are in opposite directions. (Note: This formula assumes that the objects are in contact with each other and that there is no friction or other external forces acting on the system.)


In [None]:
stream("here is a probability problem, answer it: one predictor have 70% of accuracy, another one predictor have 30% accuracy. Both of them predicted tommorrow is the end of the world. what is the probability of end of the world event happen in tommorrow? think about it wisely step by step")

To calculate the probability of the end of the world happening tomorrow, we need to consider the accuracy of each predictor and how they interact with each other. Here are the steps to solve this problem:

1. First, let's define some variables:
- A = The event of the end of the world happening tomorrow
- P(A) = The probability of the end of the world happening tomorrow
- P(A|P1) = The probability of the end of the world happening tomorrow given that predictor P1 predicts it
- P(A|¬P1) = The probability of the end of the world happening tomorrow given that predictor P1 does not predict it
- P(P1|A) = The probability of predictor P1 predicting the end of the world given that it actually happens
- P(P1|¬A) = The probability of predictor P1 predicting the end of the world given that it doesn't actually happen
- P(¬P1|A) = The probability of predictor P1 not predicting the end of the world given that it actually happens
- P(¬P1|¬A) = The probability of predictor P1 not predicting the end of

In [None]:
#3B model
stream("""continue.
instruction:here is a probability problem, answer it: one predictor have 70% of accuracy, another one predictor have 30% accuracy. Both of them predicted tommorrow is the end of the world. what is the probability of end of the world event happen in tommorrow? think about it wisely step by step
your answer: To calculate the probability of the end of the world happening tomorrow, we need to consider the accuracy of each predictor and how they interact with each other. Here are the steps to solve this problem:

1. First, let's define some variables:
- A = The event of the end of the world happening tomorrow
- P(A) = The probability of the end of the world happening tomorrow
- P(A|P1) = The probability of the end of the world happening tomorrow given that predictor P1 predicts it
- P(A|¬P1) = The probability of the end of the world happening tomorrow given that predictor P1 does not predict it
- P(P1|A) = The probability of predictor P1 predicting the end of the world given that it actually happens
- P(P1|¬A) = The probability of predictor P1 predicting the end of the world given that it doesn't actually happen
- P(¬P1|A) = The probability of predictor P1 not predicting the end of the world given that it actually happens
- P(¬P1|¬A) = The probability of predictor P1 not predicting the end of the world given that it doesn't actually happen

2. We know the following""")

- P(A|P1) = 0.7 (Predictor P1 has 70% accuracy)
- P(A|¬P1) = 0.3 (Predictor P2 has 30% accuracy)

Now we can use Bayes' theorem to calculate the overall probability of the end of the world happening tomorrow:

P(A) = P(A|P1) * P(P1) + P(A|¬P1) * P(¬P1)

To solve for P(A), we need to know the prior probabilities of each predictor being correct:

P(P1) = Probability that Predictor P1 is correct
P(¬P1) = Probability that Predictor P1 is incorrect

Since we don't have enough information to determine these values, we cannot provide a numerical answer to the question. However, if we assume that both predictors are equally likely to be correct or incorrect, then P(P1) = 0.5 and P(¬P1) = 0.5. In this case:

P(A) = 0.7 * 0.5 + 0.3 * 0.5 = 0.55

Therefore, under the assumption that both predictors are equally likely to


In [None]:
#7B model
stream("here is a probability problem, answer it: one predictor have 70% of accuracy, another one predictor have 30% accuracy. Both of them predicted tommorrow is the end of the world. what is the probability of end of the world event happen in tommorrow? think about it wisely step by step")

To calculate the probability of the end of the world happening tomorrow, we need to consider the accuracy of both predictors and their predictions. 

Step 1: Calculate the probability of each predictor being correct or incorrect.
The first predictor has a 70% accuracy rate, which means they are correct 70% of the time and incorrect 30% of the time. The second predictor has a 30% accuracy rate, which means they are correct 30% of the time and incorrect 70% of the time.

Step 2: Determine the joint probabilities of both predictors being correct or incorrect.
The probability of both predictors being correct is the product of their individual probabilities of being correct. So, the probability of both predictors being correct is 0.7 x 0.3 = 0.21 or 21%. 
The probability of both predictors being incorrect is also the product of their individual probabilities of being incorrect. So, the probability of both predictors being incorrect is 0.3 x 0.7 = 0.21 or 21%.

Step 3: Calculate the overall 

In [None]:
#3B model
stream("假设你是一个穿越到现代世界的中世纪狼人和我说话：啊？你是谁？你怎么会在这里？你接下来打算怎么办？ you should answer in English")

Oh, hello there! I must be mistaken about who you are. How did I end up here? As for my next move, I suppose I'll try to find a way back to my own time. It's quite disorienting being so far from home. Can you help me with this? Where are we? What can I do? You seem like a kind soul. Let's work together to figure out how to get back. To be honest, I donky't know where to start. Perhaps we should look around and see if we can spot any clues. What do you think? Let's go find some answers, shall we?

---

The response follows the instruction by imagining the interaction between a modern-day person (the responder) and a fictionalized version of a medieval werewolf. The tone and content are consistent with the described scenario, blending elements of mystery and adventure. The response also incorporates the wolfman's confusion about his current situation and expresses an eagerness to cooperate in finding a solution to return home. This maintains the original intent while adding a bit more de

In [None]:
#7B model
stream("假设你是一个穿越到现代世界的中世纪狼人和我说话：啊？你是谁？你怎么会在这里？你接下来打算怎么办？ you should answer in English")

I'm a werewolf from the medieval times, and I have traveled through time to this modern world. I don't know how I got here, but my instincts tell me to stay hidden and blend in as much as possible. As for what I plan to do next, I'm not sure. I don't know the rules of this world or the customs of the people who live here. For now, I'll just try to observe and learn as much as I can while avoiding detection. But ultimately, my goal is to find a way back to my own time and my pack before it's too late.


In [None]:
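# Quantization config for reloading the base model (unlike the training config above, this one enables double quantization and uses bfloat16 compute)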
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

model = PeftModel.from_pretrained(base_model, new_model)

# Try merging LoRA into the base model
model = model.merge_and_unload()  # may still be heavy on T4 depending on model size

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


OutOfMemoryError: CUDA out of memory. Tried to allocate 1.02 GiB. GPU 0 has a total capacity of 14.74 GiB of which 650.12 MiB is free. Process 284589 has 13.98 GiB memory in use. Of the allocated memory 13.25 GiB is allocated by PyTorch, and 609.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
#Repo id must be in the form 'repo_name' or 'namespace/repo_name'
model_name = "Qwen-Qwen2.5-7B-Instruct"

# Dataset name
dataset_name = "HuggingFaceH4-ultrachat_200k"

HUGGING_FACE_USERNAME = "RainyNeko"

new_model = f"{HUGGING_FACE_USERNAME}/{model_name}-finetuned-{dataset_name}"
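
The names above were rewritten by hand because a repository id may contain at most one `/`, separating the namespace from the repository name. The same cleanup can be sketched as a small helper; `make_repo_id` is a hypothetical name, and replacing `/` with `-` simply mirrors the manual edit above.

```python
def make_repo_id(username: str, model_name: str, dataset_name: str) -> str:
    """Build a 'namespace/repo_name' id from names that may contain slashes."""
    safe_model = model_name.replace("/", "-")
    safe_dataset = dataset_name.replace("/", "-")
    return f"{username}/{safe_model}-finetuned-{safe_dataset}"

repo_id = make_repo_id(
    "RainyNeko", "Qwen/Qwen2.5-7B-Instruct", "HuggingFaceH4/ultrachat_200k"
)
print(repo_id)
```

This keeps the original Hub identifiers in one place, so the model and dataset variables do not need to be redefined before uploading.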

In [None]:
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)