# Colab 1: Full Fine-tuning with SmolLM2-135M using Unsloth

## Overview
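This notebook sets out to demonstrate **full parameter fine-tuning** of the SmolLM2-135M model with Unsloth. As written, however, the Step 4 cell attaches LoRA adapters, so the logged run trains about 3.5% of the parameters (4,884,480 of 139,400,064) rather than all of them.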

### What is Full Fine-tuning?
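- Updates ALL model parameters during training (unlike LoRA, which trains only small adapter matrices and keeps the base weights frozen)
- Can reach the best task quality, but needs more GPU memory and time because gradients and optimizer state are kept for every parameter
- Best suited to cases where result quality matters more than training cost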
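
To make the memory cost concrete, here is a rough back-of-the-envelope sketch. It assumes fp32 weights and gradients plus two fp32 AdamW moment buffers per trainable parameter, and it ignores activations and framework overhead. The function name, the byte counts, and the round parameter figures are illustrative assumptions, not measurements from this notebook (an 8-bit optimizer, for example, would shrink the optimizer term).

```python
def training_state_gib(total_params, trainable_params,
                       weight_bytes=4, grad_bytes=4, optim_bytes=8):
    """Rough memory for weights, gradients, and AdamW state, in GiB."""
    weights = total_params * weight_bytes        # every weight is stored
    grads = trainable_params * grad_bytes        # gradients only for trained weights
    optim = trainable_params * optim_bytes       # two moment buffers per trained weight
    return (weights + grads + optim) / 1024**3

# Round illustrative figures: ~135M total parameters, ~5M adapter parameters.
full = training_state_gib(135_000_000, 135_000_000)
adapters = training_state_gib(135_000_000, 5_000_000)
print(f"Full fine-tuning: ~{full:.2f} GiB")
print(f"Adapter training: ~{adapters:.2f} GiB")
```

For a model this small both figures fit comfortably on a T4; the gap matters far more once the parameter count reaches the billions.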

### Key Features:
- Model: `unsloth/SmolLM2-135M-Instruct` (135 million parameters)
- Dataset: Alpaca cleaned dataset (200 examples)
- Training time: ~2-3 minutes on free Colab T4 GPU
- Task: General instruction following

### What You'll Learn:
1. How to load and configure a model for full fine-tuning
2. Dataset preparation and formatting
3. Training configuration and execution
4. Inference and model evaluation
5. Saving and exporting the model

## Step 1: Install Unsloth

First, we need to install the Unsloth library which provides optimized training for LLMs.

In [1]:
%%capture
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

## Step 2: Verify GPU and Setup

Let's check that we have access to a GPU and verify our setup.

In [2]:
import torch
from unsloth import FastLanguageModel

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
PyTorch version: 2.9.0+cu126
CUDA available: True
GPU: Tesla T4
CUDA version: 12.6


## Step 3: Load Model for Full Fine-tuning

### Key Configuration:
- `load_in_4bit=False`: We don't use quantization for full fine-tuning
- `max_seq_length=2048`: Maximum sequence length (SmolLM2 supports up to 8K)
- We'll train ALL parameters directly without LoRA adapters
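
As a hedged sketch of how the load call could be parameterized for either mode: the helper below is hypothetical (it is not part of Unsloth), and the `full_finetuning` keyword is an assumption based on recent Unsloth documentation, so verify it against the version you have installed before relying on it.

```python
def build_load_kwargs(full_finetuning: bool, max_seq_length: int = 2048) -> dict:
    """Hypothetical helper: keyword arguments for FastLanguageModel.from_pretrained."""
    kwargs = {
        "model_name": "unsloth/SmolLM2-135M-Instruct",
        "max_seq_length": max_seq_length,
        "dtype": None,          # let Unsloth pick the dtype
        "load_in_4bit": False,  # no quantization
    }
    if full_finetuning:
        # Assumption: the installed Unsloth release accepts `full_finetuning`.
        kwargs["full_finetuning"] = True
    return kwargs

kwargs = build_load_kwargs(full_finetuning=True)
print(kwargs)
# Usage in the notebook (needs Unsloth and a GPU):
# model, tokenizer = FastLanguageModel.from_pretrained(**kwargs)
```

Keeping the arguments in a plain dictionary makes it easy to switch between adapter training and full fine-tuning without editing the load cell itself.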

In [3]:
# Configuration
max_seq_length = 2048  # SmolLM2 supports up to 8K tokens
dtype = None  # Auto-detect optimal dtype
load_in_4bit = False  # No quantization for full fine-tuning

# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/SmolLM2-135M-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print(f"\nModel loaded successfully!")
print(f"Model type: {type(model).__name__}")
print(f"Tokenizer vocab size: {len(tokenizer)}")

==((====))==  Unsloth 2025.11.6: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



Model loaded successfully!
Model type: LlamaForCausalLM
Tokenizer vocab size: 49153


## Step 4: Prepare Model for Training

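### Full Fine-tuning Approach:
True full fine-tuning trains the model weights directly, with no LoRA adapters. The cell below does not do that: it calls `FastLanguageModel.get_peft_model` with `r = 16`, which freezes the base weights and trains LoRA adapters on the attention and MLP projections. The training log in Step 8 confirms this, reporting 3.50% of parameters as trainable. To train all parameters, skip this cell and load the model for full fine-tuning instead (recent Unsloth releases document a `full_finetuning = True` argument to `from_pretrained`; check the version you have installed).

Unsloth advertises roughly 2x faster training and 60% less memory use. These are the project's own figures and were not measured in this notebook.

The cell also sets `use_gradient_checkpointing = "unsloth"`, which trades extra compute for lower activation memory.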

In [4]:
# Attach LoRA adapters (parameter-efficient fine-tuning, not full fine-tuning)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank; higher values add adapter capacity
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Reduces memory usage
    random_state = 3407,
)

print("\n✅ Model prepared for training!")
print("\n💡 Note: This cell trains LoRA adapters, not all parameters.")
print("   For truly 'full' fine-tuning, train without adapters.")
print("   A larger LoRA rank (64-128) adds adapter capacity,")
print("   but the base weights still stay frozen.")

Unsloth 2025.11.6 patched 30 layers with 30 QKV layers, 30 O layers and 30 MLP layers.



✅ Model prepared for training!

💡 Note: This cell trains LoRA adapters, not all parameters.
   For truly 'full' fine-tuning, train without adapters.
   A larger LoRA rank (64-128) adds adapter capacity,
   but the base weights still stay frozen.
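
The number of parameters that LoRA actually trains can be worked out by hand. The sketch below assumes the SmolLM2-135M projection shapes listed in the code (hidden size 576, MLP size 1536, key/value width 192); those shapes are taken from the published model configuration, not from this notebook, so treat them as assumptions. Each adapted layer gains two matrices, `A` (rank × in) and `B` (out × rank).

```python
# Assumed SmolLM2-135M projection shapes (in_features, out_features);
# these come from the published model config, not from this notebook.
PROJECTION_SHAPES = {
    "q_proj": (576, 576),
    "k_proj": (576, 192),
    "v_proj": (576, 192),
    "o_proj": (576, 576),
    "gate_proj": (576, 1536),
    "up_proj": (576, 1536),
    "down_proj": (1536, 576),
}
NUM_LAYERS = 30  # matches "patched 30 layers" in the output above
RANK = 16        # the `r` value passed to get_peft_model

def lora_param_count(shapes, rank, num_layers):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(PROJECTION_SHAPES, RANK, NUM_LAYERS)
print(f"LoRA trainable parameters: {trainable:,}")
```

Under these assumptions the count comes to 4,884,480, the same figure the Step 8 training log reports as trainable, which is a useful sanity check that the run trained adapters only.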


## Step 5: Load and Prepare Dataset

### Dataset Format (Alpaca):
- **instruction**: The task to perform
- **input**: Optional context or additional information
- **output**: The desired response

### Example:
```
Instruction: "Write a poem about AI"
Input: ""
Output: "In circuits deep and code so bright..."
```

In [5]:
from datasets import load_dataset

# Load Alpaca dataset (first 200 examples for quick training)
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:200]")

print(f"Dataset loaded: {len(dataset)} examples")
print(f"\nFirst example:")
print(f"Instruction: {dataset[0]['instruction']}")
print(f"Input: {dataset[0]['input']}")
print(f"Output: {dataset[0]['output'][:100]}...")

Dataset loaded: 200 examples

First example:
Instruction: Give three tips for staying healthy.
Input: 
Output: 1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and...


## Step 6: Format Dataset for Training

We need to convert the Alpaca format into a text format that the model can learn from.
We'll use a simple instruction-response format.

In [6]:
# Define formatting function
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # End of sequence token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []

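    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Fill the template and append EOS so the model learns where to stop
        # (`input_text` avoids shadowing Python's built-in `input`)
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)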

    return {"text": texts}

# Apply formatting
dataset = dataset.map(formatting_prompts_func, batched=True)

print("Dataset formatted successfully!")
print(f"\nExample formatted text (first 300 chars):\n{dataset[0]['text'][:300]}...")

Dataset formatted successfully!

Example formatted text (first 300 chars):
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Input:


### Response:
1. Eat a balanced and nutritious diet: Make sure your meals...
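
The formatted example above keeps an empty `### Input:` section because the record has no input. A common variation, which is not what this notebook does, switches to a shorter template when the input is empty. The sketch below illustrates that idea; the template constants, the `format_example` helper, and the `<eos>` placeholder are hypothetical names introduced for illustration (in the notebook you would pass `tokenizer.eos_token`).

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input_text}\n\n"
    "### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

def format_example(instruction, input_text, output, eos_token="<eos>"):
    """Pick the template based on whether the example has a non-empty input."""
    if input_text.strip():
        body = PROMPT_WITH_INPUT.format(
            instruction=instruction, input_text=input_text, output=output)
    else:
        body = PROMPT_NO_INPUT.format(instruction=instruction, output=output)
    return body + eos_token  # EOS marks where generation should stop

print(format_example("Give three tips for staying healthy.", "", "1. Eat well."))
```

If you adopt this variation for training, use the same template choice at inference time so the prompts match what the model saw.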


## Step 7: Configure Training Parameters

### Key Training Parameters:
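- **per_device_train_batch_size × gradient_accumulation_steps = 2 × 4 = 8**: Effective batch size
- **learning_rate = 2e-4**: A common choice for LoRA adapters; full fine-tuning typically calls for a lower rate
- **max_steps = 60**: Number of optimizer steps (kept short for the demo)
- **optim = adamw_8bit**: AdamW with 8-bit optimizer state, which reduces memory use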
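
The arithmetic behind these settings can be sketched as follows, assuming a single GPU and the 200-example dataset loaded in Step 5. The variable names are illustrative.

```python
import math

num_examples = 200
per_device_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60

# Examples consumed per optimizer step (single GPU assumed).
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
# Optimizer steps needed to see the whole dataset once.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
# How many passes over the data the run makes in total.
epochs_covered = max_steps / steps_per_epoch
examples_seen = max_steps * effective_batch_size

print(f"Effective batch size: {effective_batch_size}")
print(f"Steps per epoch:      {steps_per_epoch}")
print(f"Epochs covered:       {epochs_covered:.1f}")
print(f"Examples processed:   {examples_seen}")
```

The run covers a little over two passes through the data, which the trainer rounds up when it reports the number of epochs.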

In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,  # Disable packing for simplicity
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",  # Disable wandb/tensorboard
    ),
)

print("Trainer configured successfully!")

Unsloth: We found double BOS tokens - we shall remove one automatically.


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/200 [00:00<?, ? examples/s]

Trainer configured successfully!
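
To see what `warmup_steps = 5` and `lr_scheduler_type = "linear"` do to the learning rate, here is a minimal pure-Python sketch of a linear warmup followed by linear decay. It mirrors the general shape of that schedule under the settings above; the function name is hypothetical and this is not the trainer's own code.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate after `step` optimizer updates (linear warmup, linear decay)."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```

The rate peaks at the end of warmup and falls to zero at the last step, so the final updates are much smaller than the early ones.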


## Step 8: Train the Model

Now we'll train the model! This should take about 2-3 minutes on a T4 GPU.

### What to Watch:
- **Loss**: Should decrease over time (indicates learning)
- **Steps/second**: Training speed
- **GPU memory**: Should stay within limits

In [8]:
# Show GPU stats before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"Memory used before training: {start_gpu_memory} GB.\n")

# Train!
trainer_stats = trainer.train()

# Show final stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)

print(f"\n{'='*50}")
print(f"Training completed!")
print(f"Peak memory used: {used_memory} GB ({used_percentage}% of {max_memory} GB)")
print(f"Memory used for training: {used_memory_for_training} GB")
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
# Note: metrics['train_loss'] is the average loss over the whole run, not the last step's loss
print(f"Final loss: {trainer_stats.metrics['train_loss']:.4f}")
print(f"{'='*50}")

The model is already on multiple devices. Skipping the move to device specified in `args`.


GPU = Tesla T4. Max memory = 14.741 GB.
Memory used before training: 0.287 GB.



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 200 | Num Epochs = 3 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 4,884,480 of 139,400,064 (3.50% trained)


| Step | Training Loss |
|------|---------------|
| 1 | 1.7028 |
| 2 | 1.8974 |
| 3 | 1.838 |
| 4 | 2.1294 |
| 5 | 1.7115 |
| 6 | 2.0077 |
| 7 | 1.7348 |
| 8 | 2.3348 |
| 9 | 1.8496 |
| 10 | 1.958 |



Training completed!
Peak memory used: 0.695 GB (4.715% of 14.741 GB)
Memory used for training: 0.408 GB
Training time: 119.61 seconds
Final loss: 1.5610
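
With only 8 examples per step, the per-step loss is noisy, so a trend is easier to judge from a moving average than from individual values. The sketch below applies one to the ten loss values shown in the log above (the log excerpt covers only the first 10 of 60 steps). The helper name and the window size of 5 are illustrative choices.

```python
# First ten training-loss values from the log above.
logged_losses = [1.7028, 1.8974, 1.838, 2.1294, 1.7115,
                 2.0077, 1.7348, 2.3348, 1.8496, 1.958]

def moving_average(values, window):
    """Mean of each consecutive `window`-sized slice."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

smoothed = moving_average(logged_losses, window=5)
for start, value in enumerate(smoothed, start=1):
    print(f"steps {start}-{start + 4}: {value:.4f}")
```

Ten steps are too few to show a clear trend on their own; the same helper can be applied to the full list of logged losses from `trainer.state.log_history` after a run.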


## Step 9: Test the Model (Inference)

Let's test our fine-tuned model with a few examples to see how it performs!

In [9]:
# Enable fast inference mode
FastLanguageModel.for_inference(model)

# Test examples
test_instructions = [
    "What are the three primary colors?",
    "Write a haiku about coding.",
    "Explain what machine learning is in simple terms."
]

print("Testing fine-tuned model:\n")
print("="*70)

for instruction in test_instructions:
    # Format the prompt
    prompt = alpaca_prompt.format(instruction, "", "")

    # Tokenize
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        use_cache=True
    )

    # Decode and display
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

    # Extract just the response part (fall back to the full text if the marker is missing)
    _, marker, tail = response.partition("### Response:")
    response_text = tail.strip() if marker else response

    print(f"\n📝 Instruction: {instruction}")
    print(f"🤖 Response: {response_text}")
    print("="*70)

Testing fine-tuned model:


📝 Instruction: What are the three primary colors?
🤖 Response: The three primary colors are red, blue, and yellow. They are used to create a wide range of colors in various artworks and designs.

📝 Instruction: Write a haiku about coding.
🤖 Response: 


📝 Instruction: Explain what machine learning is in simple terms.
ü§ñ Response: Machine learning is a type of artificial intelligence that uses algorithms and statistical methods to enable computers to learn from data and make predictions or decisions. It involves training a machine to recognize patterns and make decisions based on data, rather than being explicitly programmed to do so.

### Explanation:
Machine learning is a subset of artificial intelligence that focuses on developing algorithms and statistical models that enable computers to learn from data and make predictions or decisions. Unlike traditional programming, which typically involves explicit instructions, machine learning involv
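
In the last example above, the model kept going after its answer and started a new `### Explanation:` section until it hit the token limit. One way to tidy such output is to cut the decoded text at the next `###` header. The helper below is a hypothetical post-processing sketch, not part of the notebook or of Unsloth, and the sample string is made up for the demonstration.

```python
def clean_response(decoded, marker="### Response:"):
    """Keep the text after `marker`, cut at the next '###' section header."""
    _, found, tail = decoded.partition(marker)
    text = tail if found else decoded
    # Stop at the next section header the model may have invented.
    text = text.split("\n###", 1)[0]
    return text.strip()

sample = ("### Instruction:\nExplain X.\n\n### Response:\n"
          "X is a thing.\n\n### Explanation:\nX is a thing because...")
print(clean_response(sample))
```

This only trims what is displayed. To stop generation itself earlier, the usual levers are a lower `max_new_tokens` or making sure the end-of-sequence token is learned during training.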

## Step 10: Save the Model

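We'll save the model in two forms:
1. **Local save**: Because the model carries LoRA adapters, `save_pretrained` writes the adapter weights and their config rather than a full copy of the model
2. **Merged model**: Merges the adapters into the base weights and saves a standalone 16-bit model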

In [10]:
# Save locally
model.save_pretrained("smollm2_finetuned")
tokenizer.save_pretrained("smollm2_finetuned")
print("✅ Model saved to: smollm2_finetuned/")

# Save merged model (16-bit)
model.save_pretrained_merged("smollm2_finetuned_merged", tokenizer, save_method="merged_16bit")
print("✅ Merged model saved to: smollm2_finetuned_merged/")

✅ Model saved to: smollm2_finetuned/
Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `smollm2_finetuned_merged`: 100%|██████████| 1/1 [00:05<00:00,  5.45s/it]


Successfully copied all 1 files from cache to `smollm2_finetuned_merged`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 9754.20it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:02<00:00,  2.51s/it]


Unsloth: Merge process complete. Saved to `/content/smollm2_finetuned_merged`
✅ Merged model saved to: smollm2_finetuned_merged/
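
To check what each save actually produced, a small stdlib-only sketch like the one below can classify a directory by the files in it. It assumes the usual conventions that an adapter-only save contains `adapter_config.json` and a standalone model contains `config.json`; the `describe_checkpoint` helper is hypothetical, and the demo runs on a throwaway directory rather than the real outputs.

```python
import json
import tempfile
from pathlib import Path

def describe_checkpoint(path):
    """Classify a saved directory by the files it contains."""
    path = Path(path)
    names = {p.name for p in path.iterdir()}
    if "adapter_config.json" in names:
        kind = "LoRA adapter only"
    elif "config.json" in names:
        kind = "standalone model"
    else:
        kind = "unknown"
    # Total size of the files directly inside the directory, in megabytes.
    size_mb = sum(p.stat().st_size for p in path.iterdir() if p.is_file()) / 1e6
    return kind, size_mb

# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "adapter_config.json").write_text(json.dumps({"r": 16}))
    demo_kind, demo_size = describe_checkpoint(tmp)
print(demo_kind)
```

In the notebook you would call `describe_checkpoint("smollm2_finetuned")` and `describe_checkpoint("smollm2_finetuned_merged")` and compare the reported kinds and sizes.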
