<a href="https://colab.research.google.com/github/safiyahf/MyFirstRepo/blob/main/custom_finetuning_workshop_student_copy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Llama 3.2:3B with Unsloth on Custom Dataset

## Setup and Installation

This step prepares your Google Colab environment by installing Unsloth and its dependencies. Unsloth is a library that makes LLM training faster and more memory-efficient. The installation code detects whether you're running in Colab and installs the appropriate dependencies. This groundwork lets the rest of the fine-tuning process run smoothly on Google's free GPU resources.


In [None]:
%%capture
# Install Unsloth and dependencies (the %%capture cell magic must be the first line of the cell)
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab and Kaggle notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers peft==0.14.0 trl==0.16.0 triton==3.2.0
    !pip install --no-deps cut_cross_entropy unsloth_zoo==2025.3.16
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth==2025.3.18
    !pip install transformers==4.49.0
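The environment check in the install cell can be read as a small predicate. The sketch below restates it as a plain function so it can be tried outside a notebook. `running_in_colab` is a hypothetical helper name, and the only assumption (taken from the install cell itself) is that Colab sets at least one environment variable whose name contains `COLAB_`.

```python
import os

def running_in_colab(environ=None):
    """Return True if any environment variable name contains 'COLAB_'.

    Hypothetical helper that mirrors the check in the install cell;
    it is not part of Unsloth.
    """
    env = os.environ if environ is None else environ
    return any("COLAB_" in key for key in env)

print(running_in_colab())
```

Checking each key separately is slightly stricter than joining all keys into one string, because a match cannot span two adjacent variable names.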

## Initialize Model

Here, you load a pre-trained language model (Llama 3.2 3B Instruct) that will serve as the foundation for fine-tuning. The code configures important parameters like sequence length and sets up memory-efficient 4-bit quantization to reduce VRAM usage. It also initializes LoRA (Low-Rank Adaptation) adapters. LoRA is a technique that lets you fine-tune only a small subset of the model's parameters (typically 1-10%), dramatically reducing memory requirements and training time while maintaining performance.

Before you run this block, make sure the Colab runtime is on a GPU. To do this:

1) Click on "Runtime" in the top menu

2) Select "Change runtime type"

3) In the dialog that appears, make sure T4 GPU is selected

4) Click "Save"

In [None]:
# Import necessary libraries
from unsloth import FastLanguageModel
import torch
from datasets import Dataset
import pandas as pd

# Configure model parameters
max_seq_length = 2048  # Choose any length (auto RoPE scaling supported)
dtype = None  # None for auto detection. Float16 for T4, V100. Bfloat16 for Ampere+ GPUs
load_in_4bit = True  # Use 4-bit quantization to reduce memory usage

# Load the base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",  # Llama 3.2 3B Instruct, as named in the text above; you can change this to other models
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...",  # Uncomment and add your token if using gated models
)

# Add LoRA adapters to update only a small percentage of parameters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # Rank: choose 8, 16, 32, 64, or 128 (16 is only a suggested starting value)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,  # 0 is optimized
    bias = "none",     # "none" is optimized
    use_gradient_checkpointing = "unsloth",  # Use "unsloth" for 30% less VRAM
    random_state = 3407,
    use_rslora = False,  # Rank stabilized LoRA (optional)
    loftq_config = None,  # LoftQ (optional)
)
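To make the "small subset of parameters" claim concrete, here is a rough back-of-the-envelope sketch of how many weights LoRA adds for a given rank. The layer shapes and layer count are assumptions about Llama 3.2 3B that are not stated in this notebook, so check them against the model config before relying on the numbers. `lora_trainable_params` is a hypothetical helper, not an Unsloth function.

```python
# Assumed Llama 3.2 3B shapes (not stated in this notebook; verify against the model config).
HIDDEN = 3072      # hidden size
KV_DIM = 1024      # key/value projection width (grouped-query attention)
MLP_DIM = 8192     # feed-forward intermediate size
NUM_LAYERS = 28    # number of decoder layers

# (in_features, out_features) for each target module in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_trainable_params(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA weights: each adapted matrix gets A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

for rank in (8, 16, 32):
    print(rank, lora_trainable_params(rank))
```

Under these assumed shapes the count grows linearly with the rank, and for rank 16 it stays on the order of 1% of a roughly 3-billion-parameter model.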

## Set Up Chat Template

This step configures how conversations are formatted for the model. The chat template structures the input text with special tokens that help the model distinguish between user queries and assistant responses. For Llama 3.2, this includes specific header markers like `<|start_header_id|>user<|end_header_id|>`. Proper formatting is crucial because it teaches the model to recognize the conversation structure and generate appropriate responses in the correct style.

In [None]:
# Configure the chat template for the tokenizer
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.2",  # You can also use: zephyr, chatml, mistral, vicuna, etc.
)
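To see roughly what the template produces, the sketch below lays out a conversation by hand using the Llama 3 header markers. It is an approximation for illustration only: the real template applied by the tokenizer may add extra content (for example a system block), so always use `tokenizer.apply_chat_template` for actual training data. `render_llama3_chat` is a hypothetical helper.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Hand-written approximation of the Llama 3 chat layout (illustrative only)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn starts with a role header and ends with an end-of-turn token.
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

example = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
print(render_llama3_chat(example))
```

For training data the generation prompt is left off, because the assistant turn is already present; for inference it is switched on so the model continues from an open assistant header.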

## Prepare Custom Dataset

This critical step processes your training data into the format required by the model. Three different methods are provided: manually creating a dataset structure with example conversations, loading from a CSV file, or importing from Hugging Face. The data needs to be structured as conversations with alternating user and assistant messages. The code then applies the chat template to turn each conversation into a single formatted text string (`tokenize=False`); the trainer tokenizes these strings later, when it prepares the dataset. We shall be using our brainrot dataset to increase the LLM's rizz.

In [None]:
from datasets import Dataset
import pandas as pd

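# OPTION 1: Create a custom dataset from a Python dictionary - Uncomment to use
# This is an example - replace with your actual data loading code
# NOTE: remove the triple quotes around exactly one option so that `data` is defined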

# Example: Create a simple dataset with 3 samples
"""
data = {
    "conversations": [
        [
            {"role": "user", "content": "What is machine learning?"},
            {"role": "assistant", "content": "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed."}
        ],
        [
            {"role": "user", "content": "Explain neural networks simply"},
            {"role": "assistant", "content": "Neural networks are computing systems inspired by the human brain. They consist of interconnected nodes (neurons) that process and transmit information, allowing the system to learn patterns from data."}
        ],
        [
            {"role": "user", "content": "How does reinforcement learning work?"},
            {"role": "assistant", "content": "Reinforcement learning is a type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize some notion of cumulative reward. It's based on trial and error, with feedback in the form of rewards or penalties."}
        ]
    ]
}
"""

# OPTION 2: Load data from a CSV file - Uncomment to use
# Assuming your CSV has columns: 'user_message' and 'assistant_response'
"""
df = pd.read_csv()  # TODO: pass the path to your CSV file

# Convert your data to the expected format
conversations = []
for i in range(len(df)):
    convo = [
        {"role": "user", "content": df.loc[i, 'user_message']},
        {"role": "assistant", "content": df.loc[i, 'assistant_response']}
    ]
    conversations.append(convo)

data = {"conversations": conversations}
"""

# OPTION 3: Load from a Hugging Face dataset - Uncomment to use
"""
from datasets import load_dataset
external_dataset = load_dataset("your_username/your_dataset", split="train")

# If your dataset is already in the right format, you can use it directly
# Otherwise, you'll need to convert it to the proper format with conversations
"""
# Create the dataset from your data
# (needs `data` from Option 1 or 2; if no option is uncommented this raises a NameError)
dataset = Dataset.from_dict(data)

# Format the data for training
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)

# Show an example of the formatted data
print("Example of formatted data:")
print(dataset[0]["text"])
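Before building the dataset it can help to check that every conversation really alternates user and assistant turns. The sketch below is one possible way to do that, using only the standard library and a tiny in-memory CSV with made-up rows. The column names `user_message` and `assistant_response` come from Option 2 above; the helper names `rows_to_conversations` and `is_valid_conversation` are hypothetical.

```python
import csv
import io

def rows_to_conversations(rows):
    """Turn rows with 'user_message' / 'assistant_response' columns into chat turns."""
    return [
        [
            {"role": "user", "content": row["user_message"]},
            {"role": "assistant", "content": row["assistant_response"]},
        ]
        for row in rows
    ]

def is_valid_conversation(convo):
    """Check that roles alternate user, assistant, ... and every turn has text."""
    if not convo:
        return False
    expected_roles = ("user", "assistant")
    for index, turn in enumerate(convo):
        content = turn.get("content")
        if turn.get("role") != expected_roles[index % 2]:
            return False
        if not isinstance(content, str) or not content.strip():
            return False
    return True

# Tiny in-memory CSV standing in for a real file (made-up rows).
sample_csv = io.StringIO(
    "user_message,assistant_response\n"
    "What is LoRA?,A low-rank adapter method.\n"
    "Define epoch,One full pass over the data.\n"
)
conversations = rows_to_conversations(csv.DictReader(sample_csv))
data = {"conversations": [c for c in conversations if is_valid_conversation(c)]}
print(len(data["conversations"]))
```

The resulting `data` dictionary has the same shape that `Dataset.from_dict` expects in the cell above.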

## Configure Training

Here, you set up the SFTTrainer (Supervised Fine-Tuning Trainer) with parameters that control the learning process. This includes batch size, learning rate, number of epochs, and optimizer settings. The code also configures a crucial optimization that masks the loss calculation so that the model only learns from assistant responses, not from user inputs. This step essentially defines how the model will learn from your data and how quickly it will adapt to generate responses in the style of your training examples.
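As a rough guide to how batch size, gradient accumulation, and epochs interact, the sketch below estimates the effective batch size and the number of optimizer updates. It is an approximation: the exact rounding depends on the trainer version, and the dataset size of 100 is a made-up figure. `estimate_optimizer_steps` is a hypothetical helper.

```python
import math

def estimate_optimizer_steps(num_examples, per_device_batch_size,
                             gradient_accumulation_steps, num_epochs, num_devices=1):
    """Rough estimate of (effective batch size, total optimizer updates)."""
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = max(1, math.ceil(batches_per_epoch / gradient_accumulation_steps))
    return effective_batch, updates_per_epoch * num_epochs

effective_batch, total_steps = estimate_optimizer_steps(
    num_examples=100,              # made-up dataset size
    per_device_batch_size=2,       # matches per_device_train_batch_size in this step
    gradient_accumulation_steps=4,
    num_epochs=3,
)
print(effective_batch, total_steps)
```

On a small dataset the total number of updates can be only a few dozen, so the warmup steps make up a noticeable share of training.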


In [None]:
# Set up the trainer
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import train_on_responses_only

# Configure training arguments
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    packing=False,  # Set to True for 5x faster training with short sequences
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=3,  # Adjust based on your dataset size
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # Set to "wandb" for Weights & Biases logging
    ),
)

# Configure the trainer to only train on assistant responses
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>",
    response_part="<|start_header_id|>assistant<|end_header_id|>",
)

# Verify masking is correctly applied
print("Verifying masking:")
print(tokenizer.decode(trainer.train_dataset[0]["input_ids"]))

space = tokenizer(" ", add_special_tokens=False).input_ids[0]
print(tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[0]["labels"]]))
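The idea behind response-only training can be shown on toy token IDs. The sketch below is a simplified illustration of the masking concept, not Unsloth's actual implementation: every label outside an assistant span is set to -100, the value that PyTorch's cross-entropy loss ignores by default. The token IDs and helper names are made up.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def find_subsequence(sequence, pattern, start=0):
    """Return the first index >= start where pattern occurs in sequence, or -1."""
    for i in range(start, len(sequence) - len(pattern) + 1):
        if sequence[i:i + len(pattern)] == pattern:
            return i
    return -1

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Build labels that keep only the tokens following each response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    position = 0
    while True:
        start = find_subsequence(input_ids, response_ids, position)
        if start == -1:
            break
        begin = start + len(response_ids)
        end = find_subsequence(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        # Unmask the assistant's tokens up to the next user header (or the end).
        labels[begin:end] = input_ids[begin:end]
        position = end
    return labels

# Toy token IDs: 1 = user header, 2 = assistant header, others = content.
input_ids = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
print(mask_non_responses(input_ids, instruction_ids=[1], response_ids=[2]))
```

In the verification printout above, the masked positions are the ones that appear as spaces in the decoded labels.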

## Start Training

This step executes the actual training process and monitors GPU resource usage. The model begins learning from your dataset, updating the LoRA parameters to better match the style and content of the assistant responses in your training data. The code tracks metrics like training time and memory usage, which helps you understand the resource requirements and efficiency of your fine-tuning job. For larger datasets, this step may take considerable time, but the optimizations from Unsloth make it much faster than traditional methods.

In [None]:
# Display GPU information
print("GPU Information:")
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

# Train the model
print("Starting training...")
trainer_stats = trainer.train()

# Display training statistics
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

## Inference with Fine-tuned Model

After training, this step lets you immediately test your fine-tuned model with a sample prompt. The model is switched to inference mode for faster generation, and a TextStreamer is set up to display the generated response token by token. This gives you immediate feedback on how well your model has learned from the training data and allows you to assess if it needs further training or adjustments before final deployment.

In [None]:
# Set up the model for inference
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

# Test the model with a prompt
test_messages = [
    {"role": "user", "content": ""}  # TODO: write your test prompt here
]

# Generate a response
inputs = tokenizer.apply_chat_template(
    test_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

# Stream the output
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1
)
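The `temperature` and `min_p` settings shape how the next token is sampled. The sketch below illustrates the general technique on made-up scores: logits are divided by the temperature, converted to probabilities, and any token whose probability falls below `min_p` times the top probability is dropped. It is a plain-Python illustration under the assumption that temperature is applied before the min-p filter, not the library's implementation. `min_p_filter` is a hypothetical helper.

```python
import math

def min_p_filter(logits, temperature=1.0, min_p=0.1):
    """Return (kept_indices, renormalized_probs) after temperature scaling and min-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)               # cutoff is relative to the most likely token
    kept = [i for i, p in enumerate(probs) if p >= threshold]
    kept_total = sum(probs[i] for i in kept)
    return kept, [probs[i] / kept_total for i in kept]

toy_logits = [2.0, 1.0, 0.0, -3.0]  # made-up scores for four candidate tokens
kept, kept_probs = min_p_filter(toy_logits, temperature=1.0, min_p=0.1)
print(kept, [round(p, 3) for p in kept_probs])
```

A higher temperature flattens the distribution and makes output more varied, while the min-p cutoff removes very unlikely tokens so that variety does not turn into noise.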

## Save the Fine-tuned Model

This final step preserves your trained model for future use. You can save just the LoRA adapters (which are much smaller than the full model), push the model to Hugging Face Hub for sharing, or convert it to different formats like GGUF for use with llama.cpp. These options give you flexibility in how you deploy your model, whether for personal use, collaboration with others, or integration into different applications and platforms.

In [None]:
# Save the LoRA adapters (smaller size, requires base model to use)
model.save_pretrained("finetuned_lora_model")
tokenizer.save_pretrained("finetuned_lora_model")

# Optional: Push to Hugging Face Hub
# model.push_to_hub("your_username/your_model_name", token="your_hf_token")
# tokenizer.push_to_hub("your_username/your_model_name", token="your_hf_token")

# Optional: Save as merged model in float16 (full model, larger size)
# model.save_pretrained_merged("finetuned_merged_model", tokenizer, save_method="merged_16bit")

# Optional: Save as GGUF for llama.cpp
# model.save_pretrained_gguf("finetuned_gguf_model", tokenizer, quantization_method="q4_k_m")

print("Fine-tuning completed and model saved!")