# Fine-tuning a Large Language Model

In this lecture we will be looking at how to fine-tune an existing pre-trained language model.

## Learning outcomes
* You will learn how to download a pre-trained model and a training dataset from Hugging Face.
* You will learn how to fine-tune the downloaded model on the dataset using the Hugging Face trl library and the supervised fine-tuning (SFT) method.
* You will learn how to use the fine-tuned model to generate text based on user input / prompts.
* You will learn how to upload the fine-tuned model to your own Hugging Face repository so that it can be used later or shared with other users.

## Prerequisites
* You will need the following free accounts: Google, Hugging Face and Weights & Biases. You may use your existing accounts or create new accounts for the purposes of this course.
* We will use the [Hugging Face](https://huggingface.co/) libraries: transformers (for models), datasets (for datasets), trl (for training). We will also store the fine-tuned models in a Hugging Face repository.
* Training is done in [Google Colab](https://colab.research.google.com/), which provides free access to Jupyter notebooks backed by the GPU compute required for fine-tuning.
* To monitor the training run we will use [Weights & Biases](https://wandb.ai/).


## Fine-tuning

Let's first install the prerequisites using Python's package manager, pip:

In [1]:
!pip install transformers peft accelerate datasets trl wandb bitsandbytes



Then we import the required libraries:

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer, TrainingArguments
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer
from huggingface_hub import notebook_login
import torch
import wandb


We will download a pre-trained large language model from Hugging Face and a dataset to train the model with. Below we assign these to variables we will use later. We will also set the name of the repository and model for the fine-tuned model.

In [3]:
# Pre-trained model and dataset for experiments 1 & 2:
# model_name = "mistralai/Mistral-7B-v0.3"
# dataset_name = "vicgalle/alpaca-gpt4"

# Pre-trained model and dataset for experiments 3 & 4:
model_name = "Qwen/Qwen2.5-7B-Instruct"
dataset_name = "tatsu-lab/alpaca"

HUGGING_FACE_USERNAME = "tiigit"  # <---- change to your hugging face username

# Hugging Face repository for the fine-tuned model (create a new repository on Hugging Face and paste its name here)
# new_model = f"{HUGGING_FACE_USERNAME}/mistral-7b-finetune"
new_model = f"{HUGGING_FACE_USERNAME}/qwen2.5-7b-finetune"

To access your Hugging Face account, you need to log in. Go to your Hugging Face account, click *Settings* and select *Access Tokens*. Create a new token with write permission (required later to upload the fine-tuned model) and copy it. Then run the login command below and paste the token when prompted.

In [4]:
notebook_login()


Let's then download a subset of the dataset we want to use. Below we load the first 10,000 examples, convert each one to a single `text` field, and then keep a random sample of 50 so that the demo trains quickly. In real life you would probably use the full dataset.

In [5]:
# Load a small subset of the instruction-tuning dataset
raw_dataset = load_dataset(dataset_name, split="train[:10000]")

def format_example(example):
    # Turn the Alpaca-style fields into a single text field
    if example.get("input"):
        return {
            "text": f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
        }
    else:
        return {
            "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
        }

# Map to a simple {'text': ...} format and keep a tiny subset so it trains quickly
dataset = raw_dataset.map(format_example)
dataset = dataset.shuffle(seed=42).select(range(50))
dataset["text"][0]



'### Instruction:\nSuggest ways to reduce environmental pollution\n\n### Response:\nOne of the most effective ways to reduce environmental pollution is to reduce emissions from vehicles. This can be done by carpooling, using public transportation, or using electric or hybrid vehicles. Other ways to reduce emissions include investing in renewable energy sources such as solar and wind power or increasing the use of energy-efficient appliances. Additionally, reducing the amount of waste produced can also help reduce environmental pollution. Everyone can help by composting, reusing and recycling items, and cutting down on plastic and single-use items.'
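To see what the formatting step produces without downloading anything, the same template can be applied to hand-written records. This is a minimal sketch: `build_prompt` and the two sample records are made up for illustration and only mirror what `format_example` does above.

```python
def build_prompt(instruction: str, output: str, input_text: str = "") -> str:
    """Render one Alpaca-style record with the template used in this notebook."""
    parts = [f"### Instruction:\n{instruction}"]
    if input_text:  # the Input section is only added when the field is non-empty
        parts.append(f"### Input:\n{input_text}")
    parts.append(f"### Response:\n{output}")
    return "\n\n".join(parts)


# Hypothetical records, not taken from the dataset
with_input = build_prompt("Translate to French.", "Bonjour", input_text="Hello")
without_input = build_prompt("Name a primary colour.", "Red")

print(with_input)
print("---")
print(without_input)
```

Keeping the template in one helper makes it easier to reuse exactly the same layout at inference time, which matters because the model learns to respond after the `### Response:` marker.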

Let's then download the model. We first create a quantization config using bitsandbytes, a library that provides k-bit quantization for PyTorch. Loading the weights in 4-bit precision reduces the memory they need, which is what lets a 7B-parameter model fit on a single Colab GPU.

We also need to download the tokenizer.

In [6]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.float16,
    bnb_4bit_use_double_quant= False,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# Not every tokenizer exposes add_bos_token / add_eos_token (Qwen2TokenizerFast does not),
# so only set the flag where it exists and read both flags defensively
if hasattr(tokenizer, "add_eos_token"):
    tokenizer.add_eos_token = True
getattr(tokenizer, "add_bos_token", None), getattr(tokenizer, "add_eos_token", None)
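As a rough illustration of why 4-bit loading matters, the arithmetic below estimates the memory needed for the weights alone at different precisions. It is a back-of-the-envelope sketch: the parameter count is a round assumption for a "7B" model, `weight_memory_gib` is a hypothetical helper, and the estimate ignores quantization constants, activations, and optimizer state.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 7e9  # assumption: a round figure for a "7B" model

for label, bits in [("float32", 32), ("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

The 4-bit figure is a quarter of the float16 one, which is the difference between not fitting and fitting on a small GPU.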

Below we log in to Weights & Biases for experiment tracking.

> * In Colab, store your key in the `WANDB_API_KEY` environment variable, or  
> * Call `wandb.login()` and paste the key interactively when prompted.
>
> You can find your key in your [Weights & Biases account](https://wandb.ai/).


In [7]:
# Monitoring login (uses the WANDB_API_KEY environment variable if set)
wandb.login()
run = wandb.init(project="llm-finetuning-demo", job_type="training", anonymous="allow")


[34m[1mwandb[0m: Currently logged in as: [33mtiigit[0m ([33mtiigit-llm-test[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Then we'll create a configuration for the low-rank adaptation (LoRA) method we will use.

In [8]:
peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
)

#### LoRA Target Modules

LoRA adds small trainable matrices into selected linear layers of a transformer.
**Target modules** tell LoRA *which* layers to modify.
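The idea can be sketched with plain NumPy. The example below is an illustration under stated assumptions, not the PEFT implementation: the layer sizes are toy values, `lora_forward` is a hypothetical helper, and the update is scaled by `lora_alpha / r` with the second factor initialised to zero, which is the common LoRA convention.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 512, 512    # toy layer sizes (hypothetical)
r, lora_alpha = 16, 8     # same rank and alpha as the LoraConfig above
scaling = lora_alpha / r  # 0.5

W = rng.normal(size=(d_out, d_in))  # frozen pre-trained weight
A = rng.normal(size=(r, d_in))      # trainable, randomly initialised
B = np.zeros((d_out, r))            # trainable, starts at zero


def lora_forward(x):
    """Base output plus the scaled low-rank correction."""
    return x @ W.T + scaling * (x @ A.T @ B.T)


x = rng.normal(size=(4, d_in))

# With B = 0 the adapter is a no-op, so training starts from the base model
print(np.allclose(lora_forward(x), x @ W.T))

full_params = W.size
lora_params = A.size + B.size
print(f"full layer: {full_params:,} parameters, LoRA adapter: {lora_params:,}")
```

Only `A` and `B` receive gradients during fine-tuning; `W` stays frozen, which is why the quantized base weights can be kept in 4-bit.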

**Common module names (LLaMA / Mistral / Qwen)**

**Attention layers**

* **q_proj**: creates attention *queries*
* **k_proj**: creates attention *keys*
* **v_proj**: creates attention *values*
* **o_proj**: attention outputs

**Feed-forward (MLP) layers**

* **gate_proj**: gating in SwiGLU
* **up_proj**: expands hidden size
* **down_proj**: reduces back to model size

**Recommended set for most models**

```python
["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
```

**If VRAM is tight (e.g., T4)**

```python
["q_proj", "k_proj", "v_proj", "o_proj"]
```

Restricting LoRA to the attention projections trains fewer parameters and uses less memory, usually at some cost in quality compared with adapting all linear layers.
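To put numbers on the two choices, the sketch below counts the trainable LoRA parameters for each target set at rank 16. The `q_proj` and `k_proj` shapes and the 28 layers come from the model printout later in this notebook; the remaining shapes are assumptions about the Qwen2.5-7B architecture, and `lora_param_count` is a hypothetical helper.

```python
R = 16           # LoRA rank from the LoraConfig above
NUM_LAYERS = 28  # decoder layers, as in the model printout

# (in_features, out_features) per projection.
# q_proj and k_proj are taken from the printout; the rest are assumptions.
SHAPES = {
    "q_proj": (3584, 3584),
    "k_proj": (3584, 512),
    "v_proj": (3584, 512),
    "o_proj": (3584, 3584),
    "gate_proj": (3584, 18944),
    "up_proj": (3584, 18944),
    "down_proj": (18944, 3584),
}


def lora_param_count(target_modules):
    """Each adapted layer adds A (r x in_features) and B (out_features x r)."""
    per_layer = sum(R * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * NUM_LAYERS


attention_only = ["q_proj", "k_proj", "v_proj", "o_proj"]
all_linear = attention_only + ["gate_proj", "up_proj", "down_proj"]

print(f"attention only: {lora_param_count(attention_only):,}")
print(f"all linear:     {lora_param_count(all_linear):,}")
```

Under these assumptions, adding the MLP projections roughly quadruples the number of trainable parameters, which also increases the optimizer state and gradient memory.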


We need to set the training arguments for the training run.

In [9]:
training_arguments = TrainingArguments(
    output_dir="./results",          # Where to save checkpoints & logs
    num_train_epochs=1,              # Number of full passes through the dataset
    per_device_train_batch_size=8,   # Batch size per GPU (before gradient accumulation)
    gradient_accumulation_steps=2,   # Accumulate gradients to simulate a larger batch (8×2 = 16)
    optim="paged_adamw_8bit",        # Memory-efficient optimizer from bitsandbytes (QLoRA-friendly)
    save_steps=1000,                 # Save model every 1000 steps (set high to avoid slowing training)
    logging_steps=10,                # Log metrics to W&B every 10 steps
    learning_rate=2e-4,              # Base learning rate for training
    weight_decay=0.001,              # Regularization to reduce overfitting
    fp16=False,                      # Use float16 (disabled here)
    bf16=False,                      # Use bfloat16 (disable on GPUs like T4 that don't support it)
    max_grad_norm=0.3,               # Gradient clipping for training stability
    max_steps=-1,                    # Train for full epochs (no manual step limit)
    warmup_ratio=0.3,                # Fraction of steps for LR warmup (30%)
    group_by_length=True,            # Buckets sequences by length for efficiency
    lr_scheduler_type="linear",      # Linear learning-rate schedule
    report_to="wandb",               # Send logs to Weights & Biases
)
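The combination of `warmup_ratio` and `lr_scheduler_type="linear"` means the learning rate ramps up linearly and then decays linearly to zero. The function below is a simplified sketch of that shape, not the library's scheduler; `linear_schedule_lr` and the 100-step run length are hypothetical.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


BASE_LR = 2e-4     # learning_rate from the arguments above
TOTAL_STEPS = 100  # hypothetical run length
WARMUP_STEPS = 30  # warmup_ratio=0.3 applied to 100 steps

for step in (0, 15, 30, 65, 100):
    lr = linear_schedule_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

A 30% warmup is fairly long; on a very short run most optimizer steps happen below the nominal learning rate.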


Finally we create the trainer object that uses supervised fine-tuning (SFT) as the training method.

In [10]:
# Setting SFT parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_arguments,
    processing_class=tokenizer,
)


Then, we can execute the training run.

In [11]:
# Train model
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151645}.


TrainOutput(global_step=4, training_loss=1.9836666584014893, metrics={'train_runtime': 32.2772, 'train_samples_per_second': 1.549, 'train_steps_per_second': 0.124, 'total_flos': 335540765091840.0, 'train_loss': 1.9836666584014893, 'entropy': 1.1834309612001692, 'num_tokens': 4211.0, 'mean_token_accuracy': 0.5832219634737287, 'epoch': 1.0})
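The reported `global_step=4` and throughput figures can be reproduced from the settings used in this notebook. The arithmetic below is a sanity check under two assumptions: a single GPU, and a trailing partial accumulation group counting as one optimizer step.

```python
import math

num_examples = 50              # dataset size after .select(range(50))
per_device_batch_size = 8
gradient_accumulation_steps = 2

# Batches per epoch, then optimizer steps after gradient accumulation
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
optimizer_steps = math.ceil(batches_per_epoch / gradient_accumulation_steps)

train_runtime = 32.2772        # seconds, from the TrainOutput above

print("batches per epoch:", batches_per_epoch)
print("optimizer steps:  ", optimizer_steps)
print("samples/s:        ", round(num_examples / train_runtime, 3))
print("steps/s:          ", round(optimizer_steps / train_runtime, 3))
```

With `logging_steps=10` and only four optimizer steps, no intermediate loss is logged during this run, so only the final summary is available.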

In [12]:
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
wandb.finish()
model.config.use_cache = True
model.eval()

| Metric | Value |
|---|---|
| total_flos | 335540765091840.0 |
| train/entropy | 1.18343 |
| train/epoch | 1.0 |
| train/global_step | 4.0 |
| train/mean_token_accuracy | 0.58322 |
| train/num_tokens | 4211.0 |
| train_loss | 1.98367 |
| train_runtime | 32.2772 |
| train_samples_per_second | 1.549 |
| train_steps_per_second | 0.124 |


Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=3584, out_features=3584, bias=True)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=3584, out_features=16, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=16, out_features=3584, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (k_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=3584, out_features=512, bias=True)
            (lora_dropout): ModuleDict(
          

In [26]:
def stream(user_prompt: str):
    # Put model in eval mode
    model.eval()

    # Works even with device_map="auto"
    device = next(model.parameters()).device

    system_prompt = (
        "You are the most helpful and cheerful assistant. "  # an added system prompt for experimenting
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
    )

    B_INST, E_INST = "### Instruction:\n", "\n\n### Response:\n"
    prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}{E_INST}"

    # Move inputs to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Stream tokens directly to notebook output
    streamer = TextStreamer(
        tokenizer,
        skip_prompt=True,          # don't print the full prompt
        skip_special_tokens=True,
    )

    with torch.inference_mode():
        _ = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            streamer=streamer,
            eos_token_id=tokenizer.eos_token_id,
        )
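The `temperature` and `top_p` arguments control how the next token is sampled. The NumPy sketch below illustrates the general idea of temperature scaling followed by nucleus (top-p) filtering; it is not the transformers implementation, and `top_p_probs` and the four logits are hypothetical.

```python
import numpy as np


def top_p_probs(logits, temperature=0.7, top_p=0.9):
    """Return the renormalised distribution after temperature and top-p filtering."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()

    order = np.argsort(-probs)             # most likely token first
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative probability reaches top_p
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]

    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
probs = top_p_probs(logits)

rng = np.random.default_rng(0)
next_token = rng.choice(len(probs), p=probs)
print(probs.round(3), next_token)
```

Lower temperatures sharpen the distribution and smaller `top_p` values drop more of the unlikely tokens, so both push generation towards more predictable text.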

In [29]:
stream("what is newtons 3rd law and its formula?")

Newton's Third Law of Motion states that for every action, there is an equal and opposite reaction. This means that whenever one object exerts a force on a second object, the second object exerts an equal and opposite force on the first object. 

The law can be expressed in the following way:

\[ \text{Force}_{\text{action}} = -\text{Force}_{\text{reaction}} \]

Here, the forces are equal in magnitude but opposite in direction. The formula does not have a single constant value like \( F = ma \) because it describes the relationship between two interacting objects rather than a specific force.

To illustrate with an example: If you push against a wall with a force of 10 N to the right, the wall pushes back on you with a force of 10 N to the left. Both forces are equal in magnitude and opposite in direction, demonstrating Newton's Third Law. 

This law is fundamental in understanding how objects interact with each other and is crucial in many areas of physics and engineering. It helps ex

In [32]:
stream("describe how to make a pie")

Making a delicious pie is a delightful culinary experience! Here’s a step-by-step guide on how to make a classic apple pie:

#### Ingredients (for a 9-inch single-crust pie):
- **Pie Crust:**
  - 2 1/4 cups all-purpose flour
  - 1 tsp salt
  - 1 tsp granulated sugar
  - 1 stick (8 tbsp) cold unsalted butter, cut into small pieces
  - 6-8 tbsp ice water
- **Filling:**
  - 5 large apples, peeled, cored, and thinly sliced
  - 3/4 cup granulated sugar
  - 2 tbsp lemon juice
  - 1 tsp ground cinnamon
  - 1/2 tsp ground nutmeg
  - 1/4 tsp salt
  - 2 tbsp all-purpose flour
  - 2 tbsp unsalted butter, melted

#### Equipment:
- Rolling pin
- Mixing bowls
- Pie dish
- Pastry brush
- Measuring cups and spoons

#### Instructions:

1. **Prepare the Dough:**
   - In a large bowl, combine the flour, salt, and sugar.
   - Add the cold butter and use


In [35]:
stream("what is the difference between cats and dogs?")

Cats and dogs are two of the most popular domesticated animals, each with unique characteristics and behaviors. Here are some key differences:

1. **Appearance**:
   - **Cats**: Generally smaller in size compared to dogs, with shorter legs and a more compact body. They have retractable claws and come in a wide variety of colors and patterns.
   - **Dogs**: Vary widely in size, from small breeds like Chihuahuas to large ones like Great Danes. They typically have longer legs and come in a range of shapes and sizes.

2. **Behavior**:
   - **Cats**: Known for being more independent and often prefer to be on their own or with just one person. They use scratching posts and may mark territory with urine or scent glands. Cats are generally quieter and can be aloof at times.
   - **Dogs**: More social and often require more attention and interaction. They are known for their loyalty and eagerness to please. Dogs bark and whine more frequently than cats and enjoy playing and fetching games.

3. 

In [25]:
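# Quantization config for reloading the base model; note that it differs from the
# training config above (double quantization enabled, bfloat16 compute)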
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

model = PeftModel.from_pretrained(base_model, new_model)

# Try merging LoRA into the base model
model = model.merge_and_unload()  # may still be heavy on T4 depending on model size

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
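Conceptually, merging folds the low-rank update into the base weight so that no adapter is needed at inference time. The NumPy sketch below checks that equivalence on toy matrices; the sizes and the random "trained" factors are hypothetical, and the scaling of `lora_alpha / r` is the common LoRA convention rather than something read from the PEFT source.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, lora_alpha = 32, 24, 4, 8  # toy sizes (hypothetical)
scaling = lora_alpha / r

W = rng.normal(size=(d_out, d_in))  # frozen base weight
A = rng.normal(size=(r, d_in))      # adapter factors (random stand-ins for trained values)
B = rng.normal(size=(d_out, r))

x = rng.normal(size=(5, d_in))

# Adapter kept separate: extra matrix products on every forward pass
y_adapter = x @ W.T + scaling * (x @ A.T @ B.T)

# Adapter merged: fold the update into a single weight matrix
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T

print(np.allclose(y_adapter, y_merged))
```

In full precision the two paths agree up to floating-point error. When the base model is loaded in 4-bit, the merged weights have to be quantized again, so small differences from the unmerged adapter are to be expected.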




In [26]:
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)

HfHubHTTPError: (Request ID: Root=1-695011d5-616dbbaf0be28a3427ab3de3;b2c7f8e7-2632-43ce-a2bc-a14b09061189)

403 Forbidden: You don't have the rights to create a model under the namespace "tiigit".
Cannot access content at: https://huggingface.co/api/repos/create.
Make sure your token has the correct permissions.