# License and Attribution

This notebook was developed by Emilio Serrano, Full Professor at the Department of Artificial Intelligence, Universidad Politécnica de Madrid (UPM), for educational purposes in UPM courses. Personal website: https://emilioserrano.faculty.bio/

📘 License: Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA)

You are free to: (1) Share: copy and redistribute the material in any medium or format; (2) Adapt: remix, transform, and build upon the material.

Under the following terms: (1) Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made; (2) NonCommercial: You may not use the material for commercial purposes; (3) ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

🔗 License details: https://creativecommons.org/licenses/by-nc-sa/4.0/

# A Simplified Guide to Fine-Tuning an Open LLM with QLoRA

This notebook provides a step-by-step guide to performing instruction-based fine-tuning on an open-source Large Language Model.

Core Objectives:
* Simplicity: Use high-level libraries from the Hugging Face ecosystem to make the process straightforward.
* Efficiency: Employ QLoRA (Quantized Low-Rank Adaptation) to fine-tune the model on a single, free T4 GPU in Google Colab.
* Verification: Demonstrate the improvement by comparing the model's outputs before and after fine-tuning.


While TensorFlow is a powerful and robust framework, the LLM fine-tuning ecosystem, particularly for techniques like QLoRA and *parameter-efficient fine-tuning* (PEFT), is dominated by PyTorch. The Hugging Face `peft` and `bitsandbytes` libraries, which are essential for this efficient approach, have the most stable and feature-rich implementations in PyTorch.

# Initial Setup

## Installation of Libraries
First, we need to install the necessary Python libraries. We'll use the -q flag for a quiet installation to keep the output clean.

* transformers: Provides the LLM models and tokenizers (e.g., our base model, TinyLlama).

* peft: The Parameter-Efficient Fine-Tuning library. This is where the LoRA/QLoRA logic resides.

* bitsandbytes: Enables the 4-bit quantization, which is the "Q" in QLoRA. This is what drastically reduces memory usage.

* datasets: A library for easily loading and processing datasets, including those from the Hugging Face Hub.

* accelerate: A library from Hugging Face that helps abstract away the hardware (CPU/GPU) and enables seamless distributed training (though we'll use a single GPU here).

* trl: The Transformer Reinforcement Learning library, which provides a high-level SFTTrainer (Supervised Fine-tuning Trainer) that simplifies the training loop.

In [None]:
# ==============================================================================
# STEP 1: INSTALL LIBRARIES
# ==============================================================================
# This installs all the necessary libraries for running the fine-tuning process.
# - transformers: For models and tokenizers.
# - peft: For LoRA/QLoRA implementation.
# - bitsandbytes: For 4-bit quantization.
# - datasets: For loading the training data.
# - accelerate: For hardware abstraction.
# - trl: For the Supervised Fine-tuning Trainer (SFTTrainer).
# ==============================================================================
!pip install -q transformers peft bitsandbytes datasets accelerate trl
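
Because these libraries change quickly, it can be useful to record which versions were actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages installed in Step 1 (names as given to pip).
PACKAGES = ["transformers", "peft", "bitsandbytes", "datasets", "accelerate", "trl"]

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:>14}: {ver or 'not installed'}")
```

Keeping this output alongside your results makes it easier to reproduce a run later, since argument names in these libraries have changed between releases.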

## Check for GPU
Make sure you have your GPU available. You will need it.

In [None]:
import torch

# Check for GPU availability
if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected.")
    print("If you are using Google Colab, please go to 'Runtime' > 'Change runtime type' and select 'GPU' as the hardware accelerator.")

## Logging in to Hugging Face Hub with a Token
To download gated or private models from the Hugging Face Hub, or to upload your own, you need to authenticate using your personal access token. Public models can be downloaded without logging in. This code demonstrates how to securely log in using a secret stored in Google Colab's `userdata` vault, instead of hardcoding your token.

* Visit https://huggingface.co/settings/tokens
* Click "New token", give it a name, and select the scopes you need (e.g., "Read" or "Write" if uploading models).
* In your Colab notebook, click on the 🔐 "Secrets" tab (or use the menu: Tools > Secrets).

* Add a new secret with Name `HUGGINGFACE_API_KEY` and Value set to your copied Hugging Face token.

* Run the following code to authenticate.

In [None]:
from huggingface_hub import login
from google.colab import userdata

# Read the token from the Google Colab secrets
token = userdata.get('HUGGINGFACE_API_KEY')

# Authenticate with Hugging Face
login(token=token)
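
Outside Colab, `google.colab` cannot be imported. One possible fallback, sketched below, reads the token from an environment variable instead. The helper `get_hf_token` is hypothetical; `HF_TOKEN` is the environment variable that `huggingface_hub` itself recognizes, and `HUGGINGFACE_API_KEY` is the secret name chosen above.

```python
import os

def get_hf_token(secret_name="HUGGINGFACE_API_KEY", env_var="HF_TOKEN"):
    """Return the token from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(env_var)     # None if the variable is not set
    return userdata.get(secret_name)

token = get_hf_token()
print("Token found." if token else "No token found; only public models will be accessible.")
```

The resulting `token` can be passed to `login(token=token)` exactly as in the cell above.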

# Library Imports, Base Model, and Loading the Dataset

As the base model for fine-tuning, we will use `TinyLlama`, a compact, fast, and efficient 1.1B parameter model. According to its authors, it was pretrained on a corpus of approximately 1 trillion tokens for roughly 3 epochs (about 3 trillion tokens seen in total, hence the `3T` in the checkpoint name), and it is well-suited for instruction tuning. TinyLlama is ideal for low-resource environments such as Google Colab or GPUs with 16GB of VRAM.

When illustrating QLoRA for instruction tuning, it's important to start from a base model that has not already been fine-tuned on instructions. The `TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T` model is a pretraining checkpoint that has not undergone instruction tuning, which makes it suitable for demonstrating how QLoRA can align a model with instruction-following capabilities. In contrast, `TinyLlama-1.1B-Chat` is already instruction-tuned.

We will fine-tune the model to follow instructions better. We'll use a small, pre-formatted instruction dataset called `mlabonne/guanaco-llama2-1k`. It's a subset of the Guanaco dataset, containing 1,000 examples, which is perfect for a quick fine-tuning demonstration.

The `load_dataset` function from the `datasets` library downloads and caches the data for us.

Data Structure: With `split="train"`, as used here, `load_dataset` returns a single `Dataset` object that can be indexed by row (e.g., `dataset[0]`). Without the `split` argument it would return a dictionary-like `DatasetDict` keyed by split name (e.g., 'train'). Each row has a single column named `text` that contains the full instruction prompt and response.

In [None]:
# ==============================================================================
# STEP 2: IMPORT LIBRARIES
# ==============================================================================
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
from peft import LoraConfig
from trl import SFTTrainer

# ==============================================================================
# STEP 3: CONFIGURE MODEL, TOKENIZER, AND DATASET
# ==============================================================================
# Using TinyLlama: a small, fast, and efficient 1.1B parameter model
# Pretrained on ~1 trillion tokens for about 3 epochs; compatible with instruction tuning
# Ideal for low-resource environments (e.g. Google Colab, 16GB VRAM GPUs)
#model_id = "google/gemma-2b"
#model_id =  "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model_id = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

# Dataset: Guanaco (LLaMA 2-style instruction-tuning dataset)
# Contains 1,000 examples, each a single "text" string holding the prompt and the response.
# As the name suggests, the text is formatted for the Llama 2 prompt template;
# the examples printed below show the exact layout.
# Originally designed for LLaMA2, but also useful for tuning smaller chat models
# This helps TinyLlama learn to follow prompt-response structure

dataset_name = "mlabonne/guanaco-llama2-1k"



# Load the dataset for training
dataset = load_dataset(dataset_name, split="train")

# Print a couple of examples
print("-------- A couple of examples of instruction/response --------")
print(dataset[6]['text'])
print(dataset[10]['text'])

# Loading the Base Model and Tokenizer

This is the core of our setup.  

**Quantization (QLoRA)**: To make this work on a Colab GPU, we will load the model in 4-bit precision. This is configured using the `BitsAndBytesConfig` class from `transformers`.
* `load_in_4bit=True`: This is the flag that tells the model to load with 4-bit quantization.
* `bnb_4bit_quant_type="nf4"`: We use the "nf4" (4-bit NormalFloat) quantization type, the data type introduced with QLoRA. It is designed for normally distributed weights and helps maintain performance with 4-bit models.
* `bnb_4bit_compute_dtype=torch.bfloat16`: While the model weights are stored in 4-bit, computations are performed in a higher-precision format (16-bit bfloat) for stability and accuracy. On GPUs without native bfloat16 support, such as the T4, `torch.float16` is a common alternative.
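
To get a feel for why 4-bit loading matters, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It assumes 1.1 billion parameters and ignores quantization constants, activations, and any layers kept in higher precision, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 1.1e9  # assumed parameter count for TinyLlama-1.1B

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>17}: ~{weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the 16-bit footprint, which leaves more of the GPU's memory for activations, gradients of the adapters, and optimizer state.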

**Tokenizer**: The tokenizer converts our text prompts into a format the model can understand (tokens). We load the tokenizer that corresponds to our chosen model, and the code reuses the end-of-sequence token as the padding token. We set `padding_side="right"`, the usual choice for training: real tokens start at position 0 and the trailing padding is masked out by the attention mask. For batched open-ended generation with decoder-only models the opposite is generally recommended (left padding), because with right padding the model would have to continue the text after the padding tokens. In this notebook prompts are generated one at a time, so no padding is added at inference.
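
The difference between the two padding sides can be illustrated with plain Python lists. This is a toy sketch with made-up token IDs and a hypothetical `pad` helper, not the tokenizer's actual implementation.

```python
PAD = 0  # made-up pad token ID for illustration

def pad(batch, side="right", pad_id=PAD):
    """Pad every sequence in `batch` to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        filler = [pad_id] * (width - len(seq))
        padded.append(seq + filler if side == "right" else filler + seq)
    return padded

batch = [[11, 12, 13, 14], [21, 22]]
print(pad(batch, side="right"))  # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(pad(batch, side="left"))   # [[11, 12, 13, 14], [0, 0, 21, 22]]
```

With right padding, real tokens always start at position 0 and the pad tokens trail; with left padding the pad tokens come first and every sequence ends on a real token.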

The **base model** will also be evaluated using a sample prompt.

In [None]:
# ==============================================================================
# STEP 4: LOAD MODEL AND TOKENIZER WITH 4-BIT QUANTIZATION (QLORA)
# ==============================================================================
# Configure BitsAndBytes for 4-bit quantization to save memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the base model with the specified quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# ==============================================================================
# STEP 5: EVALUATE THE BASE MODEL (BEFORE FINE-TUNING)
# ==============================================================================
print("--- CHECKING THE BASE MODEL ---")
prompt_before = "Who is the Prime Minister of Canada?"
instruction_prompt_before = f"### Instruction:\n{prompt_before}\n\n### Response:\n"

# Create a text generation pipeline
base_model_pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100  # Use max_new_tokens to control the length of the new text
)

result_before = base_model_pipe(instruction_prompt_before)
print("\n--- BASE MODEL OUTPUT ---")
print(f"Prompt:\n{instruction_prompt_before}")
print("\nGenerated Response:")
print(result_before[0]["generated_text"].replace(instruction_prompt_before, "").strip())


# The result is already stored in `result_before`; no second generation is needed
print("--- EVALUATION OF BASE MODEL COMPLETE ---")


# Configure LoRA (Low-Rank Adaptation)

Now we configure LoRA. Instead of training the entire model (which has over a billion parameters), we will only train small "adapter" layers. This is the "PEFT" part.

* `LoraConfig`: This object holds the configuration for our LoRA layers.

* `r`: The rank of the update matrices. A lower rank means fewer trainable parameters. A common value is 8 or 16.

* `lora_alpha`: A scaling factor for the LoRA updates. It's common practice to set this to twice the rank (`r`).

* `target_modules`: This is crucial. We specify which layers of the original model we want to attach our LoRA adapters to. For TinyLlama, these are typically the query, key, value, and output projection layers of the attention mechanism (`q_proj`, `k_proj`, `v_proj`, `o_proj`). Optionally, the MLP/feedforward projections (`gate_proj`, `up_proj`, `down_proj`) can be included as well, as the configuration below does.

* `lora_dropout`: A dropout probability for the LoRA layers to prevent overfitting.

* `bias`: Specifies how to treat bias parameters. 'none' is standard.

* `task_type`: We specify 'CAUSAL_LM' because we are training a model for text generation.
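
To see how small the adapters are, the number of trainable parameters can be estimated by hand: a LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds two matrices of shapes `r × d_in` and `d_out × r`, i.e. `r * (d_in + d_out)` parameters. The sketch below applies this to all seven projection types. The layer dimensions (hidden size 2048, MLP size 5632, 22 layers, key/value projections of width 256) are assumptions about TinyLlama-1.1B's architecture and should be checked against `model.config`.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed TinyLlama-1.1B dimensions (verify with model.config).
HIDDEN, MLP, KV, LAYERS = 2048, 5632, 256, 22
R, ALPHA = 16, 32

# (d_in, d_out) for each targeted projection in one transformer block.
shapes = {
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),     "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP), "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in shapes.values())
total = per_layer * LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"Share of a 1.1B model:     {total / 1.1e9:.2%}")
print(f"Update scaling alpha / r:  {ALPHA / R}")
```

Once the model has been wrapped by PEFT, calling `print_trainable_parameters()` on it reports the actual figure, which is the number to trust if it differs from this estimate.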

In [None]:
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig

# ==============================================================================
# 6. LoRA CONFIGURATION
# ==============================================================================
# This config enables Low-Rank Adaptation (LoRA), which adds a small number of
# trainable parameters to the base model for efficient fine-tuning.
peft_config = LoraConfig(
    r=16,  # Rank of the decomposition matrices
    lora_alpha=32,  # Scaling factor
    target_modules=[  # Specify which layers to apply LoRA to
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,  # Dropout probability for LoRA layers
    bias="none",  # Don't add trainable biases
    task_type="CAUSAL_LM",  # Task type: causal language modeling
)

# Set Up and Run the Training

We'll use the `SFTTrainer` (Supervised Fine-Tuning Trainer), which is designed specifically for this kind of instruction fine-tuning.

`SFTConfig`: This object, a subclass of `TrainingArguments`, defines all the hyperparameters for the training process:

* `max_length`: The maximum number of tokens in each input sequence. Shorter sequences use less memory and compute, which is helpful if you're fine-tuning on a limited-resource machine. **⚠️ Tune this to avoid OOM (out-of-memory) errors. The instruction and response together can reach 1024 tokens or more in the guanaco-llama2-1k dataset, so low values will truncate them.**

* `per_device_train_batch_size`: Number of samples per batch on each GPU. Keep this low (e.g., 1 or 2) for Colab.

* `gradient_accumulation_steps`: Number of batches over which gradients are accumulated before the model's weights are updated. This effectively increases the batch size without using more memory: `per_device_train_batch_size * gradient_accumulation_steps` is your effective batch size.
* `learning_rate`: The step size used by the optimizer. 2e-4 (or 0.0002) is a relatively high learning rate, which is good for fast adaptation during LoRA or SFT. If training is unstable, you can try lowering this.

* `output_dir`: Where to save checkpoints and the final model.

* `report_to`: Controls integration with experiment tracking tools like Weights & Biases.

* `dataset_text_field`: The column name in your dataset that contains the full prompt + response.
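
The relationship between batch size, gradient accumulation, and the number of optimizer steps can be worked out directly. The sketch below uses the values from this notebook (1,000 examples, batch size 1, accumulation 2) and assumes a single GPU and 3 training epochs, which is the `TrainingArguments` default when `num_train_epochs` is not set. The helper `training_steps` is hypothetical, and its rounding is an approximation of what the trainer does.

```python
import math

def training_steps(num_examples, batch_size, accumulation_steps, epochs, num_gpus=1):
    """Return (effective batch size, optimizer steps per epoch, total optimizer steps)."""
    effective = batch_size * accumulation_steps * num_gpus
    per_epoch = math.ceil(num_examples / effective)
    return effective, per_epoch, per_epoch * epochs

effective, per_epoch, total = training_steps(
    num_examples=1000, batch_size=1, accumulation_steps=2, epochs=3
)
print(f"Effective batch size:   {effective}")
print(f"Optimizer steps/epoch:  {per_epoch}")
print(f"Total optimizer steps:  {total}")
```

Doubling `gradient_accumulation_steps` halves the number of optimizer steps per epoch while keeping per-step memory use roughly unchanged.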


`SFTTrainer`: This orchestrates the entire training process.

* `model`: Your base LLM (TinyLlama, etc.), quantized or not.
* `train_dataset`: Dataset with a "text" column.
* `args`: Training config (from `SFTConfig`).
* `peft_config`: LoRA settings: what layers to adapt, etc.
* tokenizer: Not required explicitly; inferred from the model.

Remember that one *step* is one mini training cycle using a batch of examples.  During a single step, the model:

1. Processes a batch of input data

2. Performs a forward pass to generate predictions

3. Computes the loss by comparing predictions to the ground truth

4. Performs a backward pass to compute gradients and updates the model's weights

In QLoRA,
* The base model weights are frozen and stored in quantized form (saving memory).
* Only the LoRA adapters' parameters are updated during the backward pass.
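
The four steps above, together with the frozen-base idea, can be mimicked with a deliberately tiny example: a single scalar weight `w_frozen` plus a rank-1 "adapter" `b * a`, trained by plain gradient descent. All names and numbers here are made up for illustration; real QLoRA applies the same idea to quantized weight matrices through PEFT and a proper optimizer.

```python
# Toy data generated by y = 2.0 * x; the frozen weight alone (0.5) is too small.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w_frozen = 0.5     # base weight: never updated
a, b = 1.0, 0.0    # adapter factors: b starts at 0, so the adapter is initially a no-op
lr = 0.01

def mean_loss(a, b):
    """Mean squared error of the effective weight (w_frozen + b * a) on the toy data."""
    return sum(((w_frozen + b * a) * x - y) ** 2 for x, y in data) / len(data)

initial_loss = mean_loss(a, b)

for step in range(200):
    grad_a = grad_b = 0.0
    for x, y in data:                          # 1. process a batch
        pred = (w_frozen + b * a) * x          # 2. forward pass
        err = pred - y                         # 3. loss for this example is err ** 2
        grad_a += 2 * err * b * x / len(data)  # 4. backward pass (adapter only)
        grad_b += 2 * err * a * x / len(data)
    a -= lr * grad_a                           # update the adapter parameters only
    b -= lr * grad_b

final_loss = mean_loss(a, b)
print(f"loss: {initial_loss:.4f} -> {final_loss:.6f}")
print(f"frozen weight: {w_frozen}, effective weight: {w_frozen + b * a:.4f}")
```

The frozen weight keeps its original value throughout; all of the improvement comes from the adapter term, which is the property that lets the base model stay in quantized form.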


**⚠️ This fine-tuning run takes over an hour here.**


In [None]:
# ==============================================================================
# 7. SUPERVISED FINE-TUNING CONFIGURATION (SFT)
# ==============================================================================
# This config controls the behavior of the SFTTrainer (Supervised Fine-Tuning).
# It's designed to reduce memory usage and avoid Weights & Biases logging.
sft_config = SFTConfig(
    max_length=1024,  # Maximum sequence length (lower = less memory usage)
    per_device_train_batch_size=1,  # Small batch size for low-memory environments
    gradient_accumulation_steps=2,  # Simulate larger batches by accumulating gradients
    learning_rate=2e-4,  # Learning rate for the optimizer
    output_dir="./TinyLlama-tuned-results",  # Where to save model outputs
    report_to="none",  # Disable logging to W&B or other tracking tools
    dataset_text_field="text",  # Column name in dataset that contains text
    packing=False,  # Don't combine multiple samples into one input sequence
)

# ==============================================================================
# 8. CREATE THE TRAINER
# ==============================================================================
# The SFTTrainer handles training, evaluation, and saving.
trainer = SFTTrainer(
    model=model,  # Your base model (already quantized with BitsAndBytes, ideally)
    train_dataset=dataset,  # The training data
    args=sft_config,  # Training configuration
    peft_config=peft_config  # LoRA configuration for parameter-efficient tuning
)

# ==============================================================================
# 9. TRAIN THE MODEL
# ==============================================================================
# Launch training. This may take time depending on the model size and hardware.
trainer.train()
print("--- FINE-TUNING COMPLETE ---")


⚠️ **You may see the warning:** "No label_names provided for model class `PeftModelForCausalLM`..." This happens because PEFT wraps the base model, hiding its internals, so the Trainer cannot automatically infer the training labels. For causal language modeling (predicting next tokens), this is usually not a problem, since the default behavior (using `input_ids` shifted by 1 as labels) is correct.
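
The "shifted by 1" behavior can be illustrated with a toy sequence: at each position the model sees the tokens so far and is trained to predict the next one. The token strings below are made up, and the sketch only mirrors the idea; in Transformers the shift happens inside the model's loss computation.

```python
tokens = ["### Instruction:", "Say", "hi", "### Response:", "Hi", "!", "</s>"]

inputs = tokens[:-1]    # what the model conditions on at each position
targets = tokens[1:]    # the "label" for each position is simply the next token

for position, target in enumerate(targets):
    context = " ".join(inputs[: position + 1])
    print(f"{context!r} -> {target!r}")
```

Because the labels are derived from the inputs themselves, no separate label column is needed in the dataset.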

# Final Check - Performance After Fine-Tuning
We will give the base model and the fine-tuned model the same set of instruction prompts, using the prompt template from Step 5, and compare their responses. The fine-tuned model should follow the instructions more directly, reflecting what it learned from the Guanaco dataset.

We will use Hugging Face's `pipeline` utility to easily create text-generation pipelines for both the base model and the fine-tuned model. This allows us to generate text in a simple and consistent way.
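
Both evaluation cells build the prompt and strip it from the generated text in the same way, so this logic can be factored into two small helpers. This is a sketch with hypothetical function names (`build_prompt`, `extract_response`); the template string is the one used in Step 5, and the "output" here is a placeholder rather than real model text.

```python
RESPONSE_MARKER = "### Response:\n"

def build_prompt(instruction):
    """Wrap a plain instruction in the template used throughout this notebook."""
    return f"### Instruction:\n{instruction}\n\n{RESPONSE_MARKER}"

def extract_response(generated_text):
    """Return only the text that follows the last response marker."""
    return generated_text.split(RESPONSE_MARKER)[-1].strip()

prompt = build_prompt("Who is the Prime Minister of Canada?")
fake_output = prompt + "  This is a placeholder completion.  "  # stands in for pipeline output
print(extract_response(fake_output))
```

Note that splitting on the last marker means that if the model itself emits another `### Response:` line, only the text after that final marker is kept.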

In [None]:
# ==============================================================================
# STEP 10: COMPARE BASE AND FINE-TUNED MODEL RESPONSES ON MULTIPLE PROMPTS
# ==============================================================================

from transformers import pipeline

print("\n--- COMPARING BASE AND FINE-TUNED MODEL OUTPUTS ---\n")


# Create a text-generation pipeline using the base model.
# This pipeline will generate up to 100 new tokens using the base model and its tokenizer,
# with computations performed in bfloat16 precision for efficiency.

base_model_pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100,
    torch_dtype=torch.bfloat16
)



# Convert the fine-tuned model's parameters to bfloat16 dtype to match the base model.
trainer.model = trainer.model.to(dtype=torch.bfloat16)
# Create a text-generation pipeline using the fine-tuned model loaded from the trainer.
# This pipeline will also generate up to 100 new tokens, using the same tokenizer and precision.
fine_tuned_pipe = pipeline(
    task="text-generation",
    model=trainer.model,         # Fine-tuned model loaded from the Trainer
    tokenizer=tokenizer,
    max_new_tokens=100,
    torch_dtype=torch.bfloat16,

)

# List of example prompts to evaluate
prompts = [
    # 1. SPECIFIC INSTRUCTION FOLLOWING
    "Create a numbered list of 5 benefits of renewable energy. Each point should be exactly one sentence long.",

    # 2. FORMAT AND STRUCTURE SPECIFIC
    "Write a professional email to a client explaining a project delay. Use formal tone and include: greeting, explanation, apology, new timeline, and closing.",

    # 3. STEP-BY-STEP REASONING
    "A store has 150 apples. They sell 40% on Monday and 25% of the remaining on Tuesday. How many apples are left? Show your work step by step.", #result: 67.5 apples!

    # 4. ANALYSIS AND COMPARISON
    "Compare Python and JavaScript programming languages. Provide exactly 3 similarities and 3 differences in a clear format.",

    # 5. INSTRUCTIONS WITH CONSTRAINTS
    "Explain photosynthesis in exactly 3 sentences. Use simple words that a 10-year-old could understand.",

    # 6. CREATIVITY WITH SPECIFIC PARAMETERS
    "Write a haiku about technology. Follow the traditional 5-7-5 syllable pattern exactly.",

    # 7. PRACTICAL PROBLEM SOLVING
    "I have a job interview tomorrow at 9 AM. It's 8 PM now. Create a checklist of 6 things I should do tonight to prepare.",

    # 8. SIMPLIFIED TECHNICAL EXPLANATION
    "Explain what machine learning is to someone who has never used a computer. Use analogies and avoid technical jargon.",

    # 9. MULTI-STEP INSTRUCTIONS
    "Help me plan a healthy dinner for 4 people with a budget of $25. List ingredients with estimated costs, then provide simple cooking instructions.",

    # 10. CRITICAL ANALYSIS
    "What are the pros and cons of working from home? Give exactly 4 pros and 4 cons, each in one sentence.",

    # 11. ROLEPLAY AND SPECIFIC CONTEXT
    "You are a customer service representative. A customer is complaining about a delayed delivery. Write a helpful response that shows empathy and offers solutions.",

    # 12. COMPLEX FORMAT INSTRUCTIONS
    "Create a simple study schedule for someone learning Spanish. Format it as a weekly table with specific activities for each day, 30 minutes per day maximum."
]


# Run comparison
for i, prompt in enumerate(prompts, 1):
    instruction = f"### Instruction:\n{prompt}\n\n### Response:\n"

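    # Generate response from the base model (before fine-tuning).
    # PEFT injects the LoRA layers into `model` in place, so the adapters are
    # temporarily disabled here to recover the original base-model behavior.
    with trainer.model.disable_adapter():
        base_output = base_model_pipe(instruction)[0]["generated_text"]
    base_response = base_output.split("### Response:\n")[-1].strip()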

    # Generate response from fine-tuned model
    fine_output = fine_tuned_pipe(instruction)[0]["generated_text"]
    fine_response = fine_output.split("### Response:\n")[-1].strip()

    # Print side-by-side comparison
    print(f"--- EXAMPLE {i} ---")
    print(f"🔸 PROMPT:\n{prompt}\n")
    print(f"🔹 BASE MODEL RESPONSE:\n{base_response}\n")
    print(f"🔹 FINE-TUNED MODEL RESPONSE:\n{fine_response}")
    print("-" * 60 + "\n")
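
For prompts with a single correct answer, such as prompt 3, it helps to have the reference result at hand when reading the model outputs. The arithmetic can be checked directly:

```python
apples = 150
sold_monday = apples * 40 / 100        # 40% of 150
remaining = apples - sold_monday
sold_tuesday = remaining * 25 / 100    # 25% of what is left after Monday
left = remaining - sold_tuesday

print(sold_monday, remaining, sold_tuesday, left)  # 60.0 90.0 22.5 67.5
```

The fractional result is intentional: it makes it easy to spot a model that rounds or skips the second step.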

# Conclusion
In this notebook, we have successfully:

* Loaded an open-source LLM (TinyLlama-1.1B) in a memory-efficient 4-bit format (QLoRA).
* Loaded a dataset for instruction fine-tuning (guanaco-llama2-1k).
* Established a baseline for the model's performance.
* Configured LoRA to train only a small fraction of the model's parameters.
* Used the high-level `SFTTrainer` to perform the fine-tuning process with just a few lines of code.
* Compared the base and fine-tuned models on instruction-following prompts.

From here, you can experiment with different models, datasets, and training hyperparameters.