# Fine-tuning a Large Language Model

In this lecture we will be looking at how to fine-tune an existing pre-trained language model.

## Learning outcomes
* You will learn how to download a pre-trained model and a training dataset from Hugging Face.
* You will learn how to fine-tune the downloaded model on the dataset using the Hugging Face trl library and the supervised fine-tuning (SFT) method.
* You will learn how to use the fine-tuned model to generate text based on user input / prompts.
* You will learn how to upload the fine-tuned model to your own Hugging Face repository so that it can be used later or shared with other users.

## Prerequisites
* You will need the following free accounts: Google, Hugging Face and Weights & Biases. You may use your existing accounts or create new accounts for the purposes of this course.
* We will use the [Hugging Face](https://huggingface.co/) libraries: transformers (for models), datasets (for datasets), trl (for training). We will also store the fine-tuned models in a Hugging Face repository.
* Training is done using [Google Colab](https://colab.research.google.com/), which provides free access to Jupyter notebooks backed by the GPU compute required for fine-tuning.
* For monitoring the training run we will use [Weights & Biases](https://wandb.ai/).


## Fine-tuning

Let's first install the required packages using pip, Python's package manager:

In [1]:
!pip install transformers peft accelerate datasets trl wandb bitsandbytes



Then we need to import the required libraries:

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer, TrainingArguments
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer
from huggingface_hub import notebook_login
import torch
import wandb


We will download a pre-trained large language model from Hugging Face and a dataset to train the model with. Below we assign these to variables we will use later. We will also set the name of the repository and model for the fine-tuned model.

In [3]:
# Pre-trained model
model_name = "mistralai/Mistral-7B-v0.3"

# Dataset name
dataset_name = "vicgalle/alpaca-gpt4"

# Hugging Face repository for the fine-tuned model (create a new repository on Hugging Face and paste its ID here)
new_model = "aarnetalman/mistral-7b-finetune"

To access your Hugging Face account, you need to log in. Go to your Hugging Face account, click *Settings*, and select *Access Tokens*. Create a new token with write access (we will push the fine-tuned model to the Hub later) and copy it. Then run the login command below and paste the token when prompted.

In [4]:
notebook_login()

Let's then download a subset of the dataset we want to use. Below we load the first 1,000 examples and then keep a shuffled sample of 50 so that training finishes quickly in class. In real life you would probably use the full dataset.

In [5]:
# Load a small subset of the instruction-tuning dataset
raw_dataset = load_dataset(dataset_name, split="train[:1000]")

def format_example(example):
    # Turn the Alpaca-style fields into a single text field
    if example.get("input"):
        return {
            "text": f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
        }
    else:
        return {
            "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
        }

# Map to a simple {'text': ...} format and keep a tiny subset so it trains quickly in class
dataset = raw_dataset.map(format_example)
dataset = dataset.shuffle(seed=42).select(range(50))
dataset["text"][0]



'### Instruction:\nDescribe how plants look like in the winter.\n\n### Response:\nPlant appearance in winter varies based on the type of plant and the climate of the region. In areas with cold winters, many plants enter a dormancy period to conserve energy and protect themselves from cold temperatures. During this time, deciduous trees and shrubs lose their leaves, giving them a bare and barren look. Herbaceous perennials die back to the ground, leaving their roots and underground parts alive but their above-ground growth gone until spring. On evergreen trees and shrubs, needle-like or scale-like foliage remains green, providing a bit of color in the winter landscape.\n\nIn regions with milder winters, plants may retain their leaves, although growth may slow down. Some plants may even continue to bloom, providing a pop of color in winter gardens. Overall, the winter landscape tends to be dominated by muted colors and sparse foliage, as plants conserve energy and protect themselves from

Let's then download the model. We first create a config object that tells bitsandbytes to load the model in 4-bit precision. bitsandbytes is a library that provides k-bit quantization for PyTorch, which makes large language models usable on modest hardware such as the free Colab GPU.

We also need to download the tokenizer.

In [6]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token

(True, True)
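
To get a feel for why 4-bit quantization matters here, the following back-of-the-envelope sketch estimates the memory needed for the weights alone at different precisions. The parameter count of roughly 7.25 billion is an assumption for a 7B-class model and is not stated above. The estimate ignores quantization constants, activations, gradients, and optimizer state, so real usage will be higher.

```python
# Rough weight-memory estimate; all inputs are approximations.
APPROX_PARAMS = 7.25e9  # assumed parameter count for a 7B-class model


def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


for label, bits in [("float32", 32), ("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(APPROX_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the 4-bit weights need about a quarter of the float16 footprint, which is what leaves room for the LoRA adapters and activations during training.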

Below we log in to Weights & Biases for experiment tracking.

> **Important:** Don't hard‑code your API key in the notebook.
>
> * In Colab, store your key in the `WANDB_API_KEY` environment variable, or  
> * Call `wandb.login()` and paste the key interactively when prompted.
>
> You can find your key in your [Weights & Biases account](https://wandb.ai/).


In [7]:
# Monitoring login (uses the WANDB_API_KEY environment variable if set)
wandb.login()
run = wandb.init(project="llm-finetuning-demo", job_type="training", anonymous="allow")


[34m[1mwandb[0m: Currently logged in as: [33maarnetalman[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Then we'll create a configuration for the low-rank adaptation (LoRA) method we will use.

In [8]:
peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj"]
)
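
As a rough illustration of what this configuration implies, the sketch below counts the extra trainable parameters that a rank-16 adapter adds to a single projection layer. The layer shapes (4096 to 4096 for `q_proj`, 4096 to 1024 for `k_proj`) are taken from the model summary printed later in this notebook. The helper `lora_param_count` is a hypothetical name used only for illustration, and the scaling factor `lora_alpha / r` assumes the standard LoRA formulation.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


r, lora_alpha = 16, 8  # values from the LoraConfig above

q_full = 4096 * 4096                      # frozen q_proj weights
q_lora = lora_param_count(4096, 4096, r)  # trainable adapter on q_proj
k_lora = lora_param_count(4096, 1024, r)  # trainable adapter on k_proj

print(f"q_proj frozen weights : {q_full:,}")
print(f"q_proj LoRA weights   : {q_lora:,} ({q_lora / q_full:.2%} of the frozen layer)")
print(f"k_proj LoRA weights   : {k_lora:,}")
print(f"LoRA scaling (alpha/r): {lora_alpha / r}")
```

Only the small `A` and `B` matrices are trained, while the quantized base weights stay frozen.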

We need to set the training arguments for the training run.

In [9]:
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    optim="paged_adamw_8bit",
    save_steps=1000,
    logging_steps=10,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.3,
    group_by_length=True,
    lr_scheduler_type="linear",
    report_to="wandb",
)
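
To see how these arguments interact, the sketch below estimates the number of optimizer steps and warmup steps for a run. It assumes a single GPU, that a final partial batch and a final partial accumulation group each count as one step, and that warmup steps are rounded up. The exact rounding may differ between library versions. The function name `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch_size: int,
                    grad_accum_steps: int, epochs: int = 1,
                    num_devices: int = 1) -> int:
    """Estimate the number of optimizer updates in a training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return steps_per_epoch * epochs


# Values from this notebook: 50 examples, batch size 8, accumulation 2
steps = optimizer_steps(50, per_device_train_batch_size := 8, 2)
warmup_steps = math.ceil(steps * 0.3)  # warmup_ratio=0.3

print(f"effective batch size: {per_device_train_batch_size * 2}")
print(f"optimizer steps     : {steps}")
print(f"warmup steps        : {warmup_steps}")
print(f"steps for 10,000 examples: {optimizer_steps(10_000, 8, 2)}")
```

For the 50-example subset this estimate gives 4 optimizer steps, which agrees with the `global_step=4` reported by the training run in this notebook.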


Finally we create the trainer object that uses supervised fine-tuning (SFT) as the training method.

In [10]:
# Setting SFT parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_arguments,
    processing_class=tokenizer,
)

Then we can execute the training run. With the 50-example subset selected above, it finishes in about a minute on the T4 GPU available in Colab. With a dataset of 10,000 samples, expect it to take approximately 8 hours.

In [11]:
# Train model
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.

TrainOutput(global_step=4, training_loss=1.350332498550415, metrics={'train_runtime': 60.9699, 'train_samples_per_second': 0.82, 'train_steps_per_second': 0.066, 'total_flos': 649085197271040.0, 'train_loss': 1.350332498550415, 'entropy': 1.6642374651772636, 'num_tokens': 9650.0, 'mean_token_accuracy': 0.6402284502983093, 'epoch': 1.0})

In [12]:
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
wandb.finish()
model.config.use_cache = True
model.eval()

Run summary (Weights & Biases):

* total_flos: 649085197271040.0
* train/entropy: 1.66424
* train/epoch: 1.0
* train/global_step: 4.0
* train/mean_token_accuracy: 0.64023
* train/num_tokens: 9650.0
* train_loss: 1.35033
* train_runtime: 60.9699
* train_samples_per_second: 0.82
* train_steps_per_second: 0.066


MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32768, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=16, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=16, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (k_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
            (lora_dropout): ModuleDict(


In [13]:
def stream(user_prompt: str):
    # Put model in eval mode
    model.eval()

    # Works even with device_map="auto"
    device = next(model.parameters()).device

    system_prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
    )
    B_INST, E_INST = "### Instruction:\n", "\n\n### Response:\n"
    prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}{E_INST}"

    # Move inputs to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Stream tokens directly to notebook output
    streamer = TextStreamer(
        tokenizer,
        skip_prompt=True,          # don't print the full prompt
        skip_special_tokens=True,
    )

    with torch.inference_mode():
        _ = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            streamer=streamer,
            eos_token_id=tokenizer.eos_token_id,
        )
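
The `do_sample`, `temperature`, and `top_p` arguments above control how the next token is drawn. The following toy sketch shows the general idea of temperature scaling followed by top-p (nucleus) sampling on a plain list of logits. It is an illustration of the technique only, not the implementation used by transformers, and `sample_top_p` is a hypothetical helper name.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=None):
    """Draw one token index using temperature scaling and top-p filtering."""
    rng = rng or random.Random()

    # Temperature scaling followed by a numerically stable softmax
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices normalises the weights, so no explicit renormalisation
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.5, 0.5, -1.0]  # made-up scores for four tokens
rng = random.Random(0)
print([sample_top_p(toy_logits, rng=rng) for _ in range(10)])
```

Lower temperatures sharpen the distribution, and lower `top_p` values cut off more of the unlikely tail, so both make the output more conservative.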

In [14]:
stream("what is newtons 3rd law and its formula?")



Newton's third law states that "for every action, there is an equal and opposite reaction." This law is used in many fields, including physics, engineering, and biology. The formula for Newton's third law is as follows:

F1 = -F2

where F1 is the force exerted on object 1 and F2 is the force exerted on object 2.

This law is used to explain the relationship between two objects that are interacting with each other. For example, if you push a book off a table, the book will push back against your hand with an equal and opposite force. This is because the force you exert on the book is equal to the force the book exerts on your hand.

Newton's third law is also used to explain the relationship between two objects that are interacting with a third object. For example, if you throw a ball into the air, the force of the ball pushing against the air is equal to the force of the air pushing back against the ball. This is because the force of the ball pushing against the air is equal to the f

In [15]:
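# Quantization config for reloading the base model.
# Note: unlike the config used for training, this one enables double quantization and uses bfloat16 as the compute dtype.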
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

model = PeftModel.from_pretrained(base_model, new_model)

# Try merging LoRA into the base model
model = model.merge_and_unload()  # may still be heavy on T4 depending on model size

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
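
Conceptually, merging folds the adapter into the base weights so that no separate LoRA layers are needed at inference time. The toy sketch below shows the underlying arithmetic, `W' = W + (lora_alpha / r) * B @ A`, on small made-up matrices in plain Python. It assumes the standard LoRA formulation and ignores quantization entirely, so it is not what `merge_and_unload()` does internally with a 4-bit base model. The helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * lora_b @ lora_a."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with a rank-1 adapter (all values are made up)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # r x in
B = [[0.5], [-0.5]]    # out x r
merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)

# The merged layer gives the same output as base layer plus scaled adapter
x = [[3.0], [4.0]]
base_plus_adapter = [[b[0] + 2 * a[0]] for b, a in zip(matmul(W, x), matmul(B, matmul(A, x)))]
print(matmul(merged, x) == base_plus_adapter)
```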


In [16]:
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)

CommitInfo(commit_url='https://huggingface.co/aarnetalman/mistral-7b-finetune/commit/6ab1bed87b74540a3d4f63662903e1415388c693', commit_message='Upload tokenizer', commit_description='', oid='6ab1bed87b74540a3d4f63662903e1415388c693', pr_url=None, repo_url=RepoUrl('https://huggingface.co/aarnetalman/mistral-7b-finetune', endpoint='https://huggingface.co', repo_type='model', repo_id='aarnetalman/mistral-7b-finetune'), pr_revision=None, pr_num=None)