# Colab 2: LoRA Fine-tuning with SmolLM2-135M using Unsloth

## Overview
This notebook demonstrates **LoRA (Low-Rank Adaptation)** fine-tuning using Unsloth with the SmolLM2-135M model.

### What is LoRA?
- **Parameter Efficient**: Only trains small adapter matrices (~1-5M parameters) instead of all 135M
- **Fast & Memory Efficient**: Requires less GPU memory and trains faster
- **Modular**: Adapters can be swapped on/off the base model
- **Quality**: Often comes close to full fine-tuning quality while training only a few percent of the parameters (about 3.5% in this notebook)
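
To make the idea concrete, here is a minimal sketch of a LoRA-style forward pass in plain Python on a toy 64×64 layer. The helper names (`matmul`, `transpose`, `lora_forward`) and the toy dimensions are assumptions for illustration only; they are not Unsloth or PEFT APIs, and real implementations operate on GPU tensors.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def lora_forward(x, W, A, B, alpha, r):
    """Compute x @ W^T + (alpha / r) * (x @ A^T) @ B^T for row vectors in x."""
    base = matmul(x, transpose(W))                           # frozen path
    update = matmul(matmul(x, transpose(A)), transpose(B))   # low-rank path
    scale = alpha / r
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 64, 64, 4, 4  # toy sizes, not SmolLM2's real shapes

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                    # trainable, zero-initialized
x = [[random.gauss(0, 1) for _ in range(d_in)]]

base_out = matmul(x, transpose(W))
lora_out = lora_forward(x, W, A, B, alpha, r)

full_params = d_out * d_in        # parameters in the frozen matrix
lora_params = r * (d_in + d_out)  # parameters in A and B combined

print(f"Full matrix: {full_params} params, LoRA adapter: {lora_params} params")
print(f"Adapter changes output at init: {lora_out != base_out}")
```

Because `B` starts at zero, the adapter contributes nothing before training, so the adapted layer initially behaves exactly like the base layer. Only `A` and `B` would receive gradient updates.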

### Key Features:
- Model: `unsloth/SmolLM2-135M-Instruct` (135 million parameters)
- LoRA Rank: 16 (controls adapter size)
- Dataset: Same Alpaca dataset (200 examples) for comparison with Colab 1
- Training time: ~2-3 minutes on free Colab T4 GPU
- Task: General instruction following

### What You'll Learn:
1. How LoRA works and why it's efficient
2. Loading a model with 4-bit quantization
3. Configuring LoRA hyperparameters
4. Training with QLoRA (Quantized LoRA)
5. Exporting to GGUF format for deployment

## Step 1: Install Unsloth

Installing the latest version of Unsloth with optimizations.

In [1]:
%%capture
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

## Step 2: Verify GPU and Setup

In [2]:
import torch
from unsloth import FastLanguageModel

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
PyTorch version: 2.9.0+cu126
CUDA available: True
GPU: Tesla T4
CUDA version: 12.6


## Step 3: Load Model with 4-bit Quantization

### QLoRA Configuration:
- `load_in_4bit=True`: Loads the base model weights in 4-bit precision (roughly 75% less weight memory than 16-bit)
- This enables **QLoRA** (Quantized LoRA), which combines quantization with LoRA
- The base model is stored in 4-bit, while the LoRA adapters train in higher precision (16- or 32-bit)
- Result: Large memory savings, typically with little quality loss
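
The "~75%" figure can be checked with simple arithmetic. The sketch below is an idealized estimate: it assumes every base weight is stored at exactly 16 or 4 bits and ignores quantization constants and any layers left unquantized, so real savings will be somewhat smaller. The base parameter count is derived from the figures Unsloth prints in this notebook's training banner; `weight_megabytes` is a hypothetical helper.

```python
# Base parameter count implied by this notebook's training banner:
# 139,400,064 total parameters minus 4,884,480 LoRA parameters.
base_params = 139_400_064 - 4_884_480

def weight_megabytes(num_params, bits_per_param):
    """Idealized weight storage in MB (10^6 bytes), ignoring overhead."""
    return num_params * bits_per_param / 8 / 1e6

fp16_mb = weight_megabytes(base_params, 16)
four_bit_mb = weight_megabytes(base_params, 4)
saving = 1 - four_bit_mb / fp16_mb

print(f"16-bit weights: {fp16_mb:.1f} MB")
print(f"4-bit weights:  {four_bit_mb:.1f} MB")
print(f"Idealized saving: {saving:.0%}")
```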

In [3]:
# Configuration
max_seq_length = 2048  # SmolLM2 supports up to 8K tokens
dtype = None  # Auto-detect optimal dtype
load_in_4bit = True  # Enable 4-bit quantization for QLoRA

# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/SmolLM2-135M-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print(f"\nModel loaded in 4-bit mode successfully!")
print(f"Model type: {type(model).__name__}")
print(f"Tokenizer vocab size: {len(tokenizer)}")

==((====))==  Unsloth 2025.11.6: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



Model loaded in 4-bit mode successfully!
Model type: LlamaForCausalLM
Tokenizer vocab size: 49153


## Step 4: Configure LoRA Adapters

### Key LoRA Hyperparameters:

1. **r (rank)**: Size of the adapter matrices
   - Common values: 8, 16, 32, 64
   - Higher = more capacity but slower/larger
   - We use **r=16** for good balance

2. **lora_alpha**: Scaling factor
   - Usually set to r or 2*r
   - We use **alpha=16** (alpha/r = 1)

3. **target_modules**: Which layers to adapt
   - q_proj, k_proj, v_proj, o_proj: Attention layers
   - gate_proj, up_proj, down_proj: MLP layers
   - More modules = better quality but larger adapters

4. **lora_dropout**: Regularization (0 to 0.1)
   - We use **0** for small datasets
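
As a sanity check on how rank and `target_modules` determine adapter size, the sketch below counts LoRA parameters by hand. Each adapted linear layer with shape `d_in → d_out` adds `r * (d_in + d_out)` parameters. The layer dimensions are assumptions about the SmolLM2-135M architecture (verify them against `model.config`); with these values the total matches the trainable-parameter count Unsloth reports in this notebook.

```python
# Assumed SmolLM2-135M dimensions (check model.config before relying on them)
HIDDEN = 576         # hidden size
INTERMEDIATE = 1536  # MLP intermediate size
KV_DIM = 192         # key/value projection width (grouped-query attention)
NUM_LAYERS = 30
RANK = 16

# (d_in, d_out) for each adapted module in one transformer layer
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(d_in, d_out, r):
    """A is r x d_in and B is d_out x r, so the adapter has r * (d_in + d_out) parameters."""
    return r * (d_in + d_out)

per_layer = sum(lora_param_count(d_in, d_out, RANK) for d_in, d_out in module_shapes.values())
total_lora_params = per_layer * NUM_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_lora_params:,}")
```

Doubling `RANK` doubles the adapter size, and dropping the three MLP modules would remove roughly 62% of the adapter parameters.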

In [4]:
# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,  # LoRA scaling
    lora_dropout = 0,  # No dropout for small dataset
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Memory optimization
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

# Print trainable parameters
# Note: numel() counts packed 4-bit weights (two values per stored element),
# so total_params is lower than the true parameter count reported during training.
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())

print(f"\n✅ LoRA adapters configured!")
print(f"📊 Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}% of total)")
print(f"📊 Total parameters: {total_params:,}")
print(f"\n💡 Only {trainable_params:,} parameters will be updated during training!")

Unsloth 2025.11.6 patched 30 layers with 30 QKV layers, 30 O layers and 30 MLP layers.



✅ LoRA adapters configured!
📊 Trainable parameters: 4,884,480 (5.66% of total)
📊 Total parameters: 86,315,904

💡 Only 4,884,480 parameters will be updated during training!
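
The 5.66% printed here differs from the 3.50% that Unsloth reports during training. A likely explanation, sketched below, is that the 4-bit linear weights are stored packed two per element, so `numel()` counts only half of them. The layer shapes are assumptions about SmolLM2-135M; under those assumptions the arithmetic reproduces both reported totals exactly.

```python
# Assumed SmolLM2-135M linear-layer shapes (d_in, d_out) per transformer layer
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 576, 1536, 192, 30
linear_shapes = [
    (HIDDEN, HIDDEN), (HIDDEN, KV_DIM), (HIDDEN, KV_DIM), (HIDDEN, HIDDEN),  # q, k, v, o
    (HIDDEN, INTERMEDIATE), (HIDDEN, INTERMEDIATE), (INTERMEDIATE, HIDDEN),  # gate, up, down
]

lora_params = 4_884_480        # trainable parameters reported in this notebook
reported_total = 139_400_064   # total reported in the Unsloth training banner

# Weights living in quantized linear layers across all layers
quantized_weights = NUM_LAYERS * sum(d_in * d_out for d_in, d_out in linear_shapes)

# If two 4-bit values share one stored element, numel() sees only half of them
packed_total = reported_total - quantized_weights // 2

pct_packed = round(100 * lora_params / packed_total, 2)
pct_true = round(100 * lora_params / reported_total, 2)

print(f"numel()-based total: {packed_total:,} -> {pct_packed}% trainable")
print(f"Unpacked total:      {reported_total:,} -> {pct_true}% trainable")
```

The 3.50% figure is the more meaningful one when comparing against the 135M-parameter base model.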


## Step 5: Load and Prepare Dataset

We'll use the same Alpaca dataset as Colab 1 to compare results.

In [5]:
from datasets import load_dataset

# Load Alpaca dataset (first 200 examples)
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:200]")

print(f"Dataset loaded: {len(dataset)} examples")
print(f"\nFirst example:")
print(f"Instruction: {dataset[0]['instruction']}")
print(f"Input: {dataset[0]['input']}")
print(f"Output: {dataset[0]['output'][:100]}...")

Dataset loaded: 200 examples

First example:
Instruction: Give three tips for staying healthy.
Input: 
Output: 1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and...


## Step 6: Format Dataset for Training

In [6]:
# Define formatting function
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []

    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}

# Apply formatting
dataset = dataset.map(formatting_prompts_func, batched=True)

print("Dataset formatted successfully!")

Dataset formatted successfully!


## Step 7: Configure Training Parameters

### Training Configuration for LoRA:
- **learning_rate = 2e-4**: Standard for LoRA (can go higher than full fine-tuning)
- **batch_size × gradient_accumulation = 8**: Effective batch size (2 per device × 4 accumulation steps)
- **max_steps = 60**: Quick training for demonstration
- **optim = adamw_8bit**: Memory-efficient optimizer

### LoRA Training is Faster Because:
- Fewer parameters to update (about 3.5% of the model in this run)
- Less memory for gradients and optimizer state
- Can use larger batch sizes
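
The training settings above imply a few derived quantities worth checking. The sketch below computes the effective batch size, the number of passes over the 200-example dataset, and a linear warmup-then-decay learning rate. The `linear_lr` function is a simplified stand-in written for illustration, not the scheduler object that `transformers` creates.

```python
import math

per_device_batch = 2
grad_accum_steps = 4
max_steps = 60
warmup_steps = 5
peak_lr = 2e-4
num_examples = 200

effective_batch = per_device_batch * grad_accum_steps
epochs = max_steps * effective_batch / num_examples

def linear_lr(step):
    """Linear warmup to peak_lr, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

print(f"Effective batch size: {effective_batch}")
print(f"Passes over the data: {epochs} (spans {math.ceil(epochs)} epochs)")
for step in (0, 5, 30, 60):
    print(f"step {step:2d}: lr = {linear_lr(step):.2e}")
```

With 60 steps the model sees 480 examples, so training runs into a third epoch, which is consistent with the `Num Epochs = 3` line in the training banner.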

In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

print("Trainer configured for LoRA training!")

Unsloth: We found double BOS tokens - we shall remove one automatically.


Trainer configured for LoRA training!


## Step 8: Train the Model with LoRA

Training should be faster and use less memory than full fine-tuning!

In [8]:
# Show GPU stats before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"Memory used before training: {start_gpu_memory} GB.\n")

# Train!
trainer_stats = trainer.train()

# Show final stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)

print(f"\n{'='*50}")
print(f"Training completed!")
print(f"Peak memory used: {used_memory} GB ({used_percentage}% of {max_memory} GB)")
print(f"Memory used for LoRA training: {used_memory_for_lora} GB")
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"Final loss: {trainer_stats.metrics['train_loss']:.4f}")
print(f"{'='*50}")
print(f"\nüí° Compare this with Full Fine-tuning (Colab 1):")
print(f"   - LoRA uses significantly less memory")
print(f"   - Training is faster")
print(f"   - Only updated ~{trainable_params:,} parameters vs 135M!")

The model is already on multiple devices. Skipping the move to device specified in `args`.


GPU = Tesla T4. Max memory = 14.741 GB.
Memory used before training: 0.193 GB.



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 200 | Num Epochs = 3 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 4,884,480 of 139,400,064 (3.50% trained)


| Step | Training Loss |
|------|---------------|
| 1 | 1.8455 |
| 2 | 2.0602 |
| 3 | 1.9942 |
| 4 | 2.2585 |
| 5 | 1.8661 |
| 6 | 2.1497 |
| 7 | 1.9203 |
| 8 | 2.4886 |
| 9 | 2.0172 |
| 10 | 2.111 |



Training completed!
Peak memory used: 0.541 GB (3.67% of 14.741 GB)
Memory used for LoRA training: 0.348 GB
Training time: 112.23 seconds
Final loss: 1.6734

💡 Compare this with Full Fine-tuning (Colab 1):
   - LoRA uses significantly less memory
   - Training is faster
   - Only updated ~4,884,480 parameters vs 135M!


## Step 9: Test the LoRA Model (Inference)

Let's see how our LoRA-tuned model performs!

In [9]:
# Enable fast inference mode
FastLanguageModel.for_inference(model)

# Test examples
test_instructions = [
    "What are the three primary colors?",
    "Write a haiku about coding.",
    "Explain what machine learning is in simple terms."
]

print("Testing LoRA fine-tuned model:\n")
print("="*70)

for instruction in test_instructions:
    # Format the prompt
    prompt = alpaca_prompt.format(instruction, "", "")

    # Tokenize
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        use_cache=True
    )

    # Decode and display
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    # Keep only the text after the response marker; falls back to the full text if the marker is absent
    response_text = response.split("### Response:")[-1].strip()

    print(f"\nüìù Instruction: {instruction}")
    print(f"ü§ñ Response: {response_text}")
    print("="*70)

Testing LoRA fine-tuned model:


📝 Instruction: What are the three primary colors?
🤖 Response: The primary colors are three primary colors: red, blue, and yellow. These three colors are used to create all other color combinations.




📝 Instruction: Write a haiku about coding.
🤖 Response: 


📝 Instruction: Explain what machine learning is in simple terms.
🤖 Response: Machine learning is the process of using computers to learn and improve from experience. It involves using algorithms and statistical models to analyze and interpret data to identify patterns, trends, and relationships that may not be immediately apparent. By applying this process, machine learning enables computers to learn from their mistakes, identify errors, and improve their performance over time. This process is based on the idea that computers are not limited to the past, but rather they can learn from the current state of knowledge, and make decisions based on their current data.

### Explanation:
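
The last output shows the model starting a new `###` section after its answer. One way to tidy this up is to trim the decoded text at the next heading. The helper below is a hypothetical sketch (the name `extract_response` and the sample string are illustrative, not part of Unsloth) and works on plain strings, so it can be tried without a GPU.

```python
def extract_response(decoded, marker="### Response:"):
    """Return the text after the response marker, cut at the next '###' heading."""
    if marker not in decoded:
        return decoded.strip()
    answer = decoded.split(marker, 1)[1]
    # Drop any new section the model begins after finishing its answer
    answer = answer.split("\n###", 1)[0]
    return answer.strip()

# Illustrative decoded text in the Alpaca prompt layout used in this notebook
sample = (
    "### Instruction:\nSay hello.\n\n"
    "### Input:\n\n\n"
    "### Response:\nHello there.\n\n"
    "### Explanation:\n"
)

cleaned = extract_response(sample)
print(repr(cleaned))
```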

## Step 10: Save LoRA Adapters

### LoRA Saving Options:
1. **Save adapters only**: Small files (~10-50MB)
2. **Save merged model**: Combines base model + adapters
3. **Export to GGUF**: For deployment with Ollama/llama.cpp
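
The adapter-only size can be estimated from the trainable parameter count reported earlier. This is a rough sketch: it ignores file headers and the tokenizer files saved alongside, and whether the adapter weights are stored at 16 or 32 bits depends on the training setup, so both cases are shown.

```python
lora_params = 4_884_480  # trainable parameters reported in this notebook

def adapter_megabytes(num_params, bytes_per_param):
    """Approximate adapter weight size in MB (10^6 bytes), excluding metadata."""
    return num_params * bytes_per_param / 1e6

fp32_mb = adapter_megabytes(lora_params, 4)
fp16_mb = adapter_megabytes(lora_params, 2)

print(f"Adapter weights at 32-bit: ~{fp32_mb:.1f} MB")
print(f"Adapter weights at 16-bit: ~{fp16_mb:.1f} MB")
```

Either way the adapter is more than an order of magnitude smaller than the 16-bit base model.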

In [10]:
# Option 1: Save LoRA adapters only (smallest, ~10-50MB)
model.save_pretrained("smollm2_lora_adapters")
tokenizer.save_pretrained("smollm2_lora_adapters")
print("✅ LoRA adapters saved to: smollm2_lora_adapters/")
print("   (Small files - can be loaded on top of base model)\n")

# Option 2: Save merged model (combines base + adapters)
model.save_pretrained_merged("smollm2_lora_merged", tokenizer, save_method="merged_16bit")
print("✅ Merged model saved to: smollm2_lora_merged/")
print("   (Full model with LoRA weights merged in)\n")

✅ LoRA adapters saved to: smollm2_lora_adapters/
   (Small files - can be loaded on top of base model)

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `smollm2_lora_merged`: 100%|██████████| 1/1 [00:00<00:00,  7.24it/s]


Successfully copied all 1 files from cache to `smollm2_lora_merged`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 11848.32it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:01<00:00,  1.27s/it]


Unsloth: Merge process complete. Saved to `/content/smollm2_lora_merged`
✅ Merged model saved to: smollm2_lora_merged/
   (Full model with LoRA weights merged in)



## Step 11: Export to GGUF Format

### GGUF Quantization Options:
- **q8_0**: 8-bit (highest quality, ~135MB)
- **q4_k_m**: 4-bit medium (balanced, ~70MB) ⭐ Recommended
- **q5_k_m**: 5-bit medium (good quality, ~90MB)
- **q2_k**: 2-bit (smallest, ~35MB)

We'll use **q4_k_m** for a good balance.
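
For intuition about where these sizes come from, the sketch below estimates the q8_0 file size. It assumes the q8_0 block layout used by llama.cpp (32 one-byte weights plus one 16-bit scale per block) and assumes every weight is stored in that format; real GGUF files also carry metadata and keep some small tensors at higher precision, so treat the result as a ballpark figure. The mixed-precision k-quant formats are not modeled here.

```python
# Base parameter count implied by this notebook's training banner
base_params = 139_400_064 - 4_884_480

# Assumed q8_0 block layout: 32 int8 weights + one 2-byte scale
BLOCK_WEIGHTS = 32
BLOCK_BYTES = 32 * 1 + 2

bits_per_weight = BLOCK_BYTES * 8 / BLOCK_WEIGHTS
q8_0_mb = base_params * bits_per_weight / 8 / 1e6

print(f"q8_0 bits per weight: {bits_per_weight}")
print(f"Estimated q8_0 size:  ~{q8_0_mb:.0f} MB")
```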

In [None]:
# Export to GGUF with q4_k_m quantization (recommended for deployment)
model.save_pretrained_gguf(
    "smollm2_lora_gguf",
    tokenizer,
    quantization_method="q4_k_m"  # 4-bit quantization
)

print("✅ GGUF model saved to: smollm2_lora_gguf/")
print("   Quantization: q4_k_m (4-bit, ~70MB)")
print("\n🚀 You can now use this model with:")
print("   - Ollama: ollama create mymodel -f Modelfile")
print("   - llama.cpp: ./llama-cli -m model.gguf -p 'prompt'")

Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `smollm2_lora_gguf`: 100%|██████████| 1/1 [00:00<00:00,  8.00it/s]


Successfully copied all 1 files from cache to `smollm2_lora_gguf`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 4969.55it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:01<00:00,  1.26s/it]


Unsloth: Merge process complete. Saved to `/content/smollm2_lora_gguf`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages


## Step 12: Compare Model Sizes

Let's see the size difference between different save methods!

In [None]:
import os

def get_folder_size(path):
    """Calculate total size of all files in a folder"""
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            total_size += os.path.getsize(filepath)
    return total_size / (1024 * 1024)  # Convert to MB

print("\nüìä MODEL SIZE COMPARISON:\n")
print("="*60)

folders = [
    ("smollm2_lora_adapters", "LoRA Adapters Only"),
    ("smollm2_lora_merged", "Merged Model (16-bit)"),
    ("smollm2_lora_gguf", "GGUF (q4_k_m)")
]

for folder, name in folders:
    if os.path.exists(folder):
        size = get_folder_size(folder)
        print(f"{name:30} : {size:8.2f} MB")
    else:
        print(f"{name:30} : Not found")

print("="*60)
print("\nüí° Key Insights:")
print("   - LoRA adapters are tiny (10-50MB) - only the trained weights")
print("   - Merged model is full size (~270MB) - ready for deployment")
print("   - GGUF is optimized (~70MB) - best for edge devices")