# Fine-tune Falcon-7B with LoRA: A Step-by-Step Guide

This notebook demonstrates how to efficiently fine-tune the Falcon-7B language model using **LoRA (Low-Rank Adaptation)** and **4-bit quantization** (QLoRA).

## 🚀 What You'll Learn

- How to use **4-bit quantization** to reduce memory requirements
- How to apply **LoRA** for parameter-efficient fine-tuning
- How to use the **SFTTrainer** from the TRL library
- How to fine-tune on the **Guanaco dataset** for instruction following
- How to monitor training and save checkpoints

## 🎯 Why This Approach?

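- **Memory Efficient**: 4-bit quantization cuts the memory needed for the model weights by ~75% compared with 16-bit precision
- **Fast Training**: LoRA trains only a small fraction of the parameters (the exact share is printed in Section 7)
- **High Quality**: Can achieve results comparable to full fine-tuning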
- **Practical**: Works on consumer GPUs (RTX 3090, 4090, etc.)

## 🔧 Requirements

- **GPU**: 16GB+ VRAM (RTX 3090, 4090, A100, etc.)
- **Time**: Roughly 1-4 hours for 200 steps, depending on the GPU
- **Storage**: ~15GB for model and checkpoints

## 📖 Table of Contents

1. [Setup and Authentication](#1-setup-and-authentication)
2. [Install Required Packages](#2-install-required-packages)
3. [Load and Explore Dataset](#3-load-and-explore-dataset)
4. [Load Model with 4-bit Quantization](#4-load-model-with-4-bit-quantization)
5. [Configure LoRA](#5-configure-lora)
6. [Set Training Arguments](#6-set-training-arguments)
7. [Initialize Trainer and Train](#7-initialize-trainer-and-train)
8. [Inference and Testing](#8-inference-and-testing)

---

**Note**: This notebook can run on Google Colab, Jupyter Lab, or any notebook environment with GPU support.

## 1. Setup and Authentication

First, we'll authenticate with Hugging Face to access models and datasets.

**You'll need a Hugging Face account**:
1. Sign up at [huggingface.co](https://huggingface.co)
2. Open [Settings > Access Tokens](https://huggingface.co/settings/tokens)
3. Create a token with `write` permissions (needed to push your model to the Hub later) and copy it

In [None]:
# Login to Hugging Face
# This will prompt you to enter your token
from huggingface_hub import notebook_login

notebook_login()

## 2. Install Required Packages

We'll install the necessary libraries:

- **transformers**: Core library for working with LLMs
- **peft**: Parameter-Efficient Fine-Tuning library (includes LoRA)
- **trl**: Transformer Reinforcement Learning library (includes SFTTrainer)
- **accelerate**: For optimized training
- **bitsandbytes**: For 4-bit/8-bit quantization
- **datasets**: For loading datasets
- **einops**: Tensor operations
- **wandb**: For experiment tracking (optional)

In [None]:
# Install required packages
# This may take a few minutes
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets einops wandb
!pip install -q bitsandbytes==0.43.1

print("✅ All packages installed successfully!")

In [None]:
# Import necessary libraries
import torch
from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig
from trl import SFTTrainer
from datasets import load_dataset

# Check GPU availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 3. Load and Explore Dataset

We'll use the **OpenAssistant Guanaco** dataset, which contains high-quality instruction-response pairs.

### About the Dataset

- **Name**: `timdettmers/openassistant-guanaco`
- **Type**: Instruction-following conversations
- **Size**: ~10K examples
- **Format**: Each example has a single `text` field containing the instruction and the response
- **Use Case**: Training instruction-following models
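
To make the format concrete, here is a minimal sketch that splits one conversation string into turns. It assumes the `### Human:` / `### Assistant:` markers used by the prompts in Section 8. The `split_turns` helper and the sample string are made up for illustration and are not part of the dataset or any library.

```python
import re


def split_turns(text):
    """Split a '### Human: ... ### Assistant: ...' string into (role, content) pairs."""
    # With a capturing group, re.split keeps the matched role names:
    # [prefix, role_1, content_1, role_2, content_2, ...]
    parts = re.split(r"###\s*(Human|Assistant):", text)
    return [(role, content.strip()) for role, content in zip(parts[1::2], parts[2::2])]


# Hypothetical sample, written by hand to mimic the dataset layout
sample = "### Human: What is LoRA?### Assistant: A parameter-efficient fine-tuning method."

for role, content in split_turns(sample):
    print(f"{role}: {content}")
```

Running the same helper on `dataset[0]['text']` is a quick way to check whether the real examples follow this layout.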

In [None]:
# Load the Guanaco dataset
dataset_name = "timdettmers/openassistant-guanaco"
dataset = load_dataset(dataset_name, split="train")

print(f"✅ Dataset loaded successfully!")
print(f"\nDataset size: {len(dataset):,} examples")
print(f"\nDataset features:")
print(dataset.features)

In [None]:
# Let's examine a few samples from the dataset
print("=== Sample 1 ===")
print(dataset[0]['text'])
print("\n" + "="*70 + "\n")

print("=== Sample 2 ===")
print(dataset[1]['text'])
print("\n" + "="*70 + "\n")

print("=== Sample 3 ===")
print(dataset[2]['text'])

In [None]:
# Get dataset statistics
import numpy as np

# Calculate text lengths
text_lengths = [len(sample['text']) for sample in dataset]

print("=== Dataset Statistics ===")
print(f"Total samples: {len(dataset):,}")
print(f"\nText length statistics (characters):")
print(f"  Mean: {np.mean(text_lengths):.0f}")
print(f"  Median: {np.median(text_lengths):.0f}")
print(f"  Min: {np.min(text_lengths)}")
print(f"  Max: {np.max(text_lengths)}")
print(f"  Std: {np.std(text_lengths):.0f}")

## 4. Load Model with 4-bit Quantization

We'll load Falcon-7B with **4-bit quantization** to drastically reduce memory usage.

### What is 4-bit Quantization?

- **Reduces model size** from ~14GB (16-bit weights) to ~4GB
- **Enables training on consumer GPUs** (16GB+ VRAM)
- **Minimal quality loss** with NF4 (NormalFloat4) quantization
- **Uses bitsandbytes** library for efficient computation
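
The size figures above follow from simple arithmetic on bits per parameter. The sketch below is a rough estimate for the weights only. `weight_memory_gb` is a hypothetical helper, and the 7e9 parameter count is a nominal assumption for a "7B" model rather than Falcon's exact size.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: nominal parameter count for a "7B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
reduction = 1 - int4_gb / fp16_gb

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
print(f"Reduction:      {reduction:.0%}")
```

The estimate ignores quantization constants and any layers kept in higher precision, which is one plausible reason the real footprint is a little above the raw 4-bit number. Activations, LoRA weights, and optimizer state add more during training.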

### BitsAndBytes Configuration

- **load_in_4bit**: Load model weights in 4-bit precision
- **bnb_4bit_quant_type**: Use "nf4" (NormalFloat4) - best for LLMs
- **bnb_4bit_compute_dtype**: Weights are stored in 4-bit but dequantized to float16 for computation
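
The idea behind block-wise 4-bit quantization can be sketched in plain Python: scale a block of weights by its largest absolute value, then store only the index of the nearest entry in a 16-value codebook. The codebook below is evenly spaced for simplicity and is NOT the actual NF4 table. `quantize_block` and `dequantize_block` are hypothetical helpers, not bitsandbytes functions.

```python
def quantize_block(values, codebook):
    """Scale a block by its absmax, then snap each value to the nearest codebook level."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero for an all-zero block
    indices = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - v / absmax))
        for v in values
    ]
    return indices, absmax


def dequantize_block(indices, absmax, codebook):
    """Recover approximate values from 4-bit indices and the per-block scale."""
    return [codebook[i] * absmax for i in indices]


# 16 evenly spaced levels in [-1, 1]: a simplified stand-in, not the real NF4 levels
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]

weights = [0.02, -0.15, 0.4, -0.8, 0.33, 0.0, 0.71, -0.05]  # made-up example block
idx, scale = quantize_block(weights, CODEBOOK)
restored = dequantize_block(idx, scale, CODEBOOK)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("indices:", idx)
print("scale:", scale)
print(f"max reconstruction error: {max_err:.4f}")
```

Each weight is stored as a 4-bit index plus one shared scale per block, and the dequantized values are what the float16 computation actually sees.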

In [None]:
# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # Enable 4-bit loading
    bnb_4bit_quant_type="nf4",            # Use NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16, # Compute in float16
)

print("✅ Quantization config created:")
print(f"  - 4-bit loading: {bnb_config.load_in_4bit}")
print(f"  - Quantization type: {bnb_config.bnb_4bit_quant_type}")
print(f"  - Compute dtype: {bnb_config.bnb_4bit_compute_dtype}")

In [None]:
# Load Falcon-7B with 4-bit quantization
model_name = "ybelkada/falcon-7b-sharded-bf16"

print(f"Loading {model_name}...")
print("This may take a few minutes on first load...")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,  # Falcon requires custom code
    device_map="auto",       # Automatically distribute across GPUs
)

# Disable cache for training (saves memory)
model.config.use_cache = False

print("\n✅ Model loaded successfully!")
print(f"Model: {model_name}")
print(f"Device map: {model.hf_device_map}")

In [None]:
# Display model information
def count_parameters(model):
    """Count total and trainable parameters"""
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total_params, trainable_params

total, trainable = count_parameters(model)

print("=== Model Information ===")
print(f"Total parameters: {total:,} ({total/1e9:.2f}B)")
print(f"Trainable parameters (before LoRA): {trainable:,}")
print(f"\nModel architecture:")
print(model)

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Set padding token (Falcon doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token

print("✅ Tokenizer loaded successfully!")
print(f"\nVocabulary size: {len(tokenizer):,}")
print(f"EOS token: {tokenizer.eos_token} (ID: {tokenizer.eos_token_id})")
print(f"PAD token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")

In [None]:
# Test the tokenizer
test_text = "Hello! How can I assist you today?"
tokens = tokenizer(test_text, return_tensors="pt")

print(f"Original text: {test_text}")
print(f"\nTokenized IDs: {tokens['input_ids'][0].tolist()}")
print(f"Number of tokens: {len(tokens['input_ids'][0])}")
print(f"\nDecoded back: {tokenizer.decode(tokens['input_ids'][0])}")

## 5. Configure LoRA

Now we'll configure **LoRA (Low-Rank Adaptation)** for efficient fine-tuning.

### What is LoRA?

LoRA injects trainable low-rank matrices into model layers, allowing us to:
- Train only a small fraction of the model's parameters (the exact share is printed in Section 7)
- Achieve results comparable to full fine-tuning
- Train much faster with less memory
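
As a rough illustration of the mechanism, the sketch below computes a LoRA-style forward pass on tiny hand-written matrices: the frozen weight `W` is left untouched and a scaled low-rank product `B @ A` is added on top. The helper names and all numbers are hypothetical; real layers use tensors and much larger shapes.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)               # frozen path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen weight, shape (2, 3)
A = [[0.5, 0.5, 0.0]]                    # shape (r, d_in) with r = 1
B_zero = [[0.0], [0.0]]                  # shape (d_out, r), zero-initialized
B_trained = [[1.0], [-1.0]]              # pretend values after some training
x = [1.0, 2.0, 3.0]

y_start = lora_forward(W, A, B_zero, x, alpha=2, r=1)
y_later = lora_forward(W, A, B_trained, x, alpha=2, r=1)

print("output with B = 0:     ", y_start)
print("output with trained B: ", y_later)
```

With `B` initialized to zero the adapter has no effect, so training starts from the base model's behavior and only `A` and `B` change afterwards.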

### LoRA Hyperparameters

- **r (rank)**: Dimensionality of low-rank matrices (4-64 typical)
  - Higher = more capacity but slower
  - We use **64** for good quality
  
- **lora_alpha**: Scaling factor for LoRA weights
  - The LoRA update is scaled by `lora_alpha / r`; common choices are 2×r or 1×r
  - We use **16**, which with `r=64` gives a scale of 0.25

- **lora_dropout**: Dropout for regularization
  - We use **0.1** (10%)

- **target_modules**: Which layers to apply LoRA to
  - For Falcon: `query_key_value`, `dense`, `dense_h_to_4h`, `dense_4h_to_h`
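
To get a feel for how `r` and `lora_alpha` interact, the sketch below counts the weights LoRA adds to a single linear layer and computes the update scale. The 4096×4096 layer size is an illustrative assumption, not Falcon's actual dimensions, and both helper functions are hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """A is (r x d_in) and B is (d_out x r), so LoRA adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)


def lora_scale(lora_alpha, r):
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


d_in, d_out = 4096, 4096  # assumption: illustrative layer size, not Falcon's
full = d_in * d_out
added = lora_param_count(d_in, d_out, r=64)

print(f"Full layer weights:  {full:,}")
print(f"LoRA weights (r=64): {added:,} ({added / full:.2%} of the layer)")
print(f"Update scale:        {lora_scale(16, 64)}")
```

Doubling `r` doubles the adapter size, while raising `lora_alpha` at a fixed `r` only makes the update larger.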

In [None]:
# Configure LoRA
lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",  # Attention projection
        "dense",            # Attention output
        "dense_h_to_4h",    # MLP up-projection
        "dense_4h_to_h",    # MLP down-projection
    ]
)

print("✅ LoRA configuration created:")
print(f"  - Rank (r): {lora_r}")
print(f"  - Alpha: {lora_alpha}")
print(f"  - Dropout: {lora_dropout}")
print(f"  - Target modules: {peft_config.target_modules}")
print(f"  - Task type: {peft_config.task_type}")

## 6. Set Training Arguments

We'll configure the training parameters for optimal results.

### Training Configuration

**Batch Size & Accumulation:**
- `per_device_train_batch_size=1`: Batch size per GPU
- `gradient_accumulation_steps=1`: Number of steps to accumulate gradients over before each optimizer update (1 = no accumulation)
- Effective batch size = 1 × 1 = 1

**Optimization:**
- `learning_rate=2e-4`: Learning rate (2×10⁻⁴)
- `optim="paged_adamw_32bit"`: Memory-efficient AdamW optimizer
- `max_grad_norm=0.3`: Gradient clipping for stability
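- `warmup_ratio=0.03`: 3% of steps for learning rate warmup. This only takes effect with a warmup-aware scheduler such as `constant_with_warmup`; the `constant` scheduler set in the cell below ignores it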
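
To see what warmup does to the learning rate, the sketch below compares a flat schedule with a linear-warmup-then-flat schedule. The function names are hypothetical and the formulas are a simplified illustration, not code taken from `transformers`. The value `warmup_steps=6` assumes 3% of 200 steps.

```python
def constant_lr(step, base_lr):
    """Flat schedule: the same learning rate at every step."""
    return base_lr


def constant_with_warmup_lr(step, base_lr, warmup_steps):
    """Ramp linearly from 0 to base_lr over warmup_steps, then stay flat."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr


BASE_LR = 2e-4
WARMUP_STEPS = 6  # assumption: 3% of 200 training steps

for step in (0, 3, 6, 100):
    flat = constant_lr(step, BASE_LR)
    warm = constant_with_warmup_lr(step, BASE_LR, WARMUP_STEPS)
    print(f"step {step:>3}: constant={flat:.2e}  with_warmup={warm:.2e}")
```

The two schedules differ only during the first few steps, which is where a sudden full-size learning rate is most likely to destabilize training.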

**Training Duration:**
- `max_steps=200`: Total training steps
- For longer training, increase to 500-1000 steps

**Logging & Saving:**
- `logging_steps=10`: Log metrics every 10 steps
- `save_steps=10`: Save checkpoint every 10 steps

**Mixed Precision:**
- `fp16=True`: Use 16-bit floating point (faster, less memory)
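
A quick back-of-the-envelope check shows what these settings imply for a run. `training_budget` is a hypothetical helper, and the dataset size is the approximate "~10K examples" figure from Section 3 rather than an exact count.

```python
def training_budget(max_steps, per_device_batch_size, grad_accum_steps, save_steps, num_devices=1):
    """Summarize how much data a run sees and how many checkpoints it writes."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return {
        "effective_batch_size": effective_batch,
        "examples_seen": effective_batch * max_steps,
        "checkpoints_saved": max_steps // save_steps,
    }


budget = training_budget(
    max_steps=200,
    per_device_batch_size=1,
    grad_accum_steps=1,
    save_steps=10,
)

DATASET_SIZE = 10_000  # approximate, from the "~10K examples" figure in Section 3
fraction_seen = budget["examples_seen"] / DATASET_SIZE

print(budget)
print(f"Share of the dataset seen: {fraction_seen:.1%}")
```

With these values the run covers only a small slice of the dataset and writes a checkpoint every 10 steps, so raising `max_steps`, the batch size, or `save_steps` changes both quality and disk usage.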

In [None]:
# Define training arguments
output_dir = "./results"
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 200
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)

print("✅ Training arguments configured:")
print(f"  - Output directory: {output_dir}")
print(f"  - Batch size: {per_device_train_batch_size}")
print(f"  - Gradient accumulation: {gradient_accumulation_steps}")
print(f"  - Learning rate: {learning_rate}")
print(f"  - Max steps: {max_steps}")
print(f"  - Optimizer: {optim}")
print(f"  - Mixed precision: FP16")
print(f"  - Save every: {save_steps} steps")

## 7. Initialize Trainer and Train

We'll use the **SFTTrainer** (Supervised Fine-Tuning Trainer) from the TRL library.

### What is SFTTrainer?

- Specialized trainer for supervised fine-tuning of LLMs
- Handles tokenization automatically
- Works seamlessly with PEFT (LoRA)
- Includes helpful defaults for LLM training

### Training Process

1. Initialize SFTTrainer with model, dataset, and configs
2. Convert normalization layers to float32 (for stability)
3. Start training with `trainer.train()`
4. Monitor progress via logging

**Note**: Training 200 steps takes approximately 1-2 hours on an RTX 4090.

In [None]:
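# Initialize SFTTrainer
# Note: passing dataset_text_field, max_seq_length, and tokenizer directly matches
# older TRL releases. Newer releases move the first two into SFTConfig and rename
# tokenizer to processing_class, so adapt this call if it raises a TypeError.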
max_seq_length = 512  # Maximum sequence length for training

print("Initializing SFTTrainer...")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

print("\n✅ Trainer initialized successfully!")
print(f"  - Max sequence length: {max_seq_length}")
print(f"  - Training samples: {len(dataset):,}")
print(f"  - Training steps: {max_steps}")

In [None]:
# Check trainable parameters after LoRA injection
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params:,} || "
        f"all params: {all_param:,} || "
        f"trainable%: {100 * trainable_params / all_param:.4f}"
    )

print("=== LoRA Parameters ===")
print_trainable_parameters(trainer.model)

In [None]:
# Convert normalization layers to float32 for training stability
# This is important when using mixed precision with quantized models
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module.to(torch.float32)  # nn.Module.to() casts the module in place

print("✅ Normalization layers converted to float32 for stability")

In [None]:
# Start training!
print("=" * 70)
print("STARTING TRAINING")
print("=" * 70)
print(f"\nTraining for {max_steps} steps...")
print(f"Saving checkpoints every {save_steps} steps to: {output_dir}")
print("\nThis will take approximately 1-2 hours on an RTX 4090.")
print("\nYou can monitor progress via the logging output below.")
print("=" * 70)
print()

# Train the model
trainer.train()

print("\n" + "=" * 70)
print("TRAINING COMPLETE!")
print("=" * 70)

In [None]:
# Save the final trained model
final_model_path = "./falcon-7b-guanaco-lora-final"

print(f"Saving final model to: {final_model_path}")
trainer.save_model(final_model_path)

print("\n✅ Model saved successfully!")
print(f"\nModel location: {final_model_path}")
print("\nYou can load this model later for inference.")

## 8. Inference and Testing

Now let's test our fine-tuned model by generating responses!

### Loading the Model

We'll load:
1. The base Falcon-7B model (quantized)
2. Our trained LoRA adapters

### Generation Tips

- **Temperature**: Controls randomness (0.1=focused, 1.0=creative)
- **top_k**: Consider only top K tokens
- **top_p**: Nucleus sampling threshold
- **max_new_tokens**: Maximum tokens to generate
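
The sketch below shows, on a toy four-token vocabulary, how temperature, `top_k`, and `top_p` reshape the distribution that the next token is sampled from. It is a simplified illustration with hypothetical helper names and made-up logits, not the implementation used by `generate()`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total  # probabilities renormalized after top-k
        if cumulative >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary

sharp = softmax_with_temperature(logits, temperature=0.1)
probs = softmax_with_temperature(logits, temperature=1.0)
candidates = filter_top_k_top_p(probs, top_k=3, top_p=0.8)

print("temperature 0.1:", [round(p, 4) for p in sharp])
print("temperature 1.0:", [round(p, 4) for p in probs])
print("after top-k/top-p:", {i: round(p, 4) for i, p in candidates.items()})
```

A low temperature puts almost all probability on the top token, while `top_k` and `top_p` remove unlikely tokens before sampling so that rare continuations are never picked.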

In [None]:
# Load the fine-tuned model for inference
from peft import PeftModel

# Path to your saved model
model_path = "./falcon-7b-guanaco-lora-final"

print(f"Loading fine-tuned model from: {model_path}")
print("This may take a moment...")

# Load base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)

# Load LoRA adapters
model_inference = PeftModel.from_pretrained(base_model, model_path)

# Load tokenizer
tokenizer_inference = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer_inference.pad_token = tokenizer_inference.eos_token

print("\n✅ Model loaded successfully for inference!")

In [None]:
# Define a helper function for text generation
def generate_response(prompt, max_new_tokens=200, temperature=0.7, top_k=50, top_p=0.95):
    """
    Generate a response for the given prompt.
    
    Args:
        prompt: Input text
        max_new_tokens: Maximum number of tokens to generate
        temperature: Sampling temperature (higher = more random)
        top_k: Consider only top K tokens
        top_p: Nucleus sampling threshold
    """
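    # Tokenize input (token_type_ids are not needed for generation, and some Falcon setups reject them)
    inputs = tokenizer_inference(prompt, return_tensors="pt", return_token_type_ids=False).to(model_inference.device)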
    
    # Generate
    with torch.no_grad():
        outputs = model_inference.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer_inference.eos_token_id,
        )
    
    # Decode and return
    response = tokenizer_inference.decode(outputs[0], skip_special_tokens=True)
    return response

print("✅ Generation function ready!")

In [None]:
# Example 1: Simple question
prompt = "### Human: What is machine learning? ### Assistant:"

print("Prompt:")
print(prompt)
print("\n" + "="*70)
print("Response:")
print()

response = generate_response(prompt, max_new_tokens=150, temperature=0.7)
print(response)

In [None]:
# Example 2: Explanation request
prompt = "### Human: Explain how neural networks work in simple terms. ### Assistant:"

print("Prompt:")
print(prompt)
print("\n" + "="*70)
print("Response:")
print()

response = generate_response(prompt, max_new_tokens=200, temperature=0.7)
print(response)

In [None]:
# Example 3: Coding question
prompt = "### Human: Write a Python function to calculate the Fibonacci sequence. ### Assistant:"

print("Prompt:")
print(prompt)
print("\n" + "="*70)
print("Response:")
print()

response = generate_response(prompt, max_new_tokens=250, temperature=0.5)
print(response)

In [None]:
# Try your own prompt!
your_prompt = "### Human: What are the benefits of using LoRA for fine-tuning? ### Assistant:"

print("Your Prompt:")
print(your_prompt)
print("\n" + "="*70)
print("Response:")
print()

response = generate_response(your_prompt, max_new_tokens=200, temperature=0.7)
print(response)

## 🎉 Congratulations!

You've successfully:
- ✅ Loaded Falcon-7B with 4-bit quantization
- ✅ Configured LoRA for efficient fine-tuning
- ✅ Trained on the Guanaco instruction dataset
- ✅ Generated text with your fine-tuned model

## 📊 Training Summary

| Metric | Value |
|--------|-------|
| Base Model | Falcon-7B (7B parameters) |
| Quantization | 4-bit (NF4) |
| PEFT Method | LoRA (rank 64) |
| Trainable Parameters | Small fraction of total (see the `trainable%` output in Section 7) |
| Training Steps | 200 |
| Training Time | ~1-2 hours (RTX 4090) |
| GPU Memory | ~16GB |

## 🚀 Next Steps

1. **Train Longer**: Increase `max_steps` to 500-1000 for better results
2. **Adjust LoRA**: Try different `r` values (32, 128) and `lora_alpha`
3. **Different Datasets**: Fine-tune on your own instruction datasets
4. **Merge Adapters**: Merge LoRA weights into base model for deployment
5. **Evaluate**: Test on benchmarks to measure improvement
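
Merging adapters (step 4) folds the low-rank update into the frozen weight, so that `W_merged = W + (lora_alpha / r) * B @ A` and no extra matrices are needed at inference time. PEFT provides `merge_and_unload()` for this on real models. The helpers and numbers below are a hypothetical pure-Python illustration of the arithmetic only.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A as a new matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen weight, shape (2, 3)
A = [[0.5, 0.5, 0.0]]                    # shape (r, d_in) with r = 1
B = [[1.0], [-1.0]]                      # shape (d_out, r)
x = [1.0, 2.0, 3.0]

merged = merge_lora(W, A, B, alpha=2, r=1)
print("merged weight:", merged)
print("merged output:", matvec(merged, x))
```

The merged matrix produces the same output as running the base layer and the adapter separately, which is why a merged model can be deployed like an ordinary checkpoint.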

## 💡 Tips for Better Results

1. **Longer Training**: More steps generally improve quality
2. **Larger LoRA Rank**: Higher `r` = more capacity (but slower)
3. **Learning Rate**: Try 1e-4 or 5e-4 if results aren't good
4. **Batch Size**: Increase if you have more GPU memory
5. **Dataset Quality**: High-quality data > large quantity

## 📚 Resources

- [LoRA Paper](https://arxiv.org/abs/2106.09685)
- [QLoRA Paper](https://arxiv.org/abs/2305.14314)
- [PEFT Documentation](https://huggingface.co/docs/peft)
- [TRL Documentation](https://huggingface.co/docs/trl)
- [Falcon Model Card](https://huggingface.co/tiiuae/falcon-7b)

## 🔧 Saving and Sharing Your Model

### Save to Hugging Face Hub

```python
# Login to Hugging Face
from huggingface_hub import login
login()

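# Push the LoRA adapters to the Hub.
# trainer.push_to_hub() takes a commit message as its first argument, not a repo name,
# so push through the model and pass the repo id explicitly.
trainer.model.push_to_hub("your-username/falcon-7b-guanaco-lora")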
```

### Load Your Model Later

```python
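import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model with the same 4-bit settings used for training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16
)
base_model = AutoModelForCausalLM.from_pretrained(
    "ybelkada/falcon-7b-sharded-bf16",
    quantization_config=bnb_config, trust_remote_code=True, device_map="auto",
)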

# Load your LoRA adapters
model = PeftModel.from_pretrained(base_model, "your-username/falcon-7b-guanaco-lora")
```

---

**Happy Fine-Tuning! 🎓✨**

For more advanced techniques, check out the other notebooks in this repository:
- [Full Fine-Tuning](../01-Full-Fine-Tuning/)
- [Instruction Tuning](../03-Instruction-Tuning/)
- [Reasoning Tuning](../04-Reasoning-Tuning/)