# 🧠 COLAB SESSION 1: Fine-Tuning a Compact Model via Unsloth.ai  
**Base Model:** `unsloth/smollm2-135m`  
**Objective:** Fully fine-tune on a 500-example slice of the **Alpaca** instruction dataset, then spot-check the model on a math-reasoning prompt  

---

In this notebook, we’ll perform full fine-tuning of a lightweight language model using **Unsloth.ai** tools.  
We’ll begin by installing all required libraries, including `unsloth`, `datasets`, `transformers`, `accelerate`, `bitsandbytes`, `wandb`, and `huggingface_hub`.


In [None]:
!pip install unsloth datasets transformers accelerate bitsandbytes wandb huggingface_hub

Collecting unsloth
  Using cached unsloth-2025.11.2-py3-none-any.whl.metadata (61 kB)
Collecting bitsandbytes
  Using cached bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting unsloth_zoo>=2025.11.3 (from unsloth)
  Using cached unsloth_zoo-2025.11.3-py3-none-any.whl.metadata (32 kB)
Collecting torch>=2.4.0 (from unsloth)
  Downloading torch-2.9.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Using cached xformers-0.0.33-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.2 kB)
Collecting datasets
  Using cached datasets-4.3.0-py3-none-any.whl.metadata (18 kB)
Collecting trl!=0.19.0,<=0.23.0,>=0.18.2 (from unsloth)
  Using cached trl-0.23.0-py3-none-any.whl.metadata (11 kB)
Collecting triton>=3.0.0 (from unsloth)
  Downloading triton-3.5.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.7 kB)
Collecting cut_cross_entropy (from unsloth_zoo>=2025.11.3->unsloth)
  Using cach

## 🔧 Step 2: Import Essential Libraries

In [None]:
import unsloth
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from huggingface_hub import login
import wandb

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.9.0+cu130 with CUDA 1300 (you have 2.9.0+cu128)
    Python  3.10.19 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Switching to PyTorch attention since your Xformers is broken.

Unsloth: Xformers was not installed correctly.
Please install xformers separately first.
Then confirm if it's correctly installed by running:
python -m xformers.info

Longer error message:
xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.9.0+cu130 with CUDA 1300 (you have 2.9.0+cu128)
    Python  3.10.19 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
🦥 Unsloth Zoo will now patch everything to make training faster!


## 🎫 Step 3: Authenticate with Hugging Face & Weights & Biases

In this step, we’ll connect to both **Hugging Face Hub** and **Weights & Biases (W&B)** for model access and experiment tracking.  
Instead of manually typing the tokens, we’ll securely fetch them from **Colab’s `userdata` storage** where they’ve been saved as secrets.  
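Outside Colab, the `google.colab.userdata` import is unavailable. As a minimal sketch (the helper name `get_secret` and the environment-variable fallback are our own assumptions, not part of the notebook), the same secret names could be resolved like this:

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Colab's userdata store, else from the environment.

    Hypothetical helper: the fallback to os.environ is an assumption made
    for running the notebook outside Colab.
    """
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        value = userdata.get(name)
    else:
        value = os.environ.get(name)

    if not value:
        raise KeyError(f"Secret {name!r} is not set")
    return value
```

With this in place, `hf_token = get_secret("HGFaceApi")` would behave the same in Colab and read an environment variable of the same name elsewhere.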


In [None]:
from google.colab import userdata
from huggingface_hub import login
import wandb

# Retrieve stored tokens
hf_token = userdata.get("HGFaceApi")
wb_token = userdata.get("wb_token")

# Authenticate with both platforms
login(hf_token)
wandb.login(key=wb_token)

# Initialize a W&B tracking session
run = wandb.init(
    project="Full-Finetuning-SmolLM2-135M",
    job_type="training",
    anonymous="allow"
)

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: shubhamjaysukhbhai-kothiya (shubhamjaysukhbhai-kothiya-san-jose-state-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


wandb: Detected [huggingface_hub.inference, openai] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/


## ⚙️ Step 4: Initialize the SmolLM2-135M Model for Full Fine-Tuning

Now we’ll load the **SmolLM2-135M** model using **Unsloth’s FastLanguageModel** utility.  
Full fine-tuning mode means every parameter of the model is updated — no quantization or parameter freezing is applied.  


In [None]:
max_seq_length = 2048
dtype = None  # Uses the default data type for your device (e.g., float16 or bfloat16)

# Load model and tokenizer with full fine-tuning capability
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/smollm2-135m",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=False,        # Disable quantization → full precision training
    full_finetuning=True,      # Enable training of all layers
    token=hf_token,
)

print("✅ Model is ready for full fine-tuning!")


==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Float16 full finetuning uses more memory since we upcast weights to float32.


✅ Model is ready for full fine-tuning!


## 📘 Step 5: Load and Prepare a Small Instruction-Tuning Dataset

For demonstration purposes, we’ll use the first 500 examples of the **Alpaca dataset** (`tatsu-lab/alpaca`), a well-known instruction-following corpus.  
Keeping the subset this small speeds up experimentation.  

We’ll then format each example into the instruction–input–response template used for training, substituting `N/A` for empty inputs and appending the tokenizer’s EOS token so the model learns where a response ends.


In [None]:
from datasets import load_dataset

# Load a small portion of the Alpaca dataset for quick experimentation
dataset = load_dataset("tatsu-lab/alpaca", split="train[:500]")

# Template used to organize each example into instruction, input, and response
prompt_template = """Below is an instruction describing a task, along with an input that adds context.
Write an appropriate response to complete the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def prepare_prompt_examples(examples):
    """Formats the Alpaca samples into the defined prompt template."""
    texts = []
    for instruction, input_text, output_text in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        if input_text.strip() == "":
            formatted_text = prompt_template.format(instruction, "N/A", output_text) + EOS_TOKEN
        else:
            formatted_text = prompt_template.format(instruction, input_text, output_text) + EOS_TOKEN
        texts.append(formatted_text)
    return {"text": texts}

# Apply formatting to the dataset
dataset = dataset.map(prepare_prompt_examples, batched=True)

print("✅ Example of formatted prompt:\n", dataset["text"][0][:400])

Generating train split:   0%|          | 0/52002 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

✅ Example of formatted prompt:
 Below is an instruction describing a task, along with an input that adds context.
Write an appropriate response to complete the request.

### Instruction:
Give three tips for staying healthy.

### Input:
N/A

### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a c
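To see the formatting rule in isolation, here is a small sketch that needs no tokenizer or dataset. It mirrors the template from the cell above; `EOS_PLACEHOLDER` and `format_example` are illustrative names, and the placeholder string stands in for `tokenizer.eos_token`.

```python
PROMPT_TEMPLATE = (
    "Below is an instruction describing a task, along with an input that adds context.\n"
    "Write an appropriate response to complete the request.\n\n"
    "### Instruction:\n{}\n\n"
    "### Input:\n{}\n\n"
    "### Response:\n{}"
)

# Stand-in value: the notebook appends tokenizer.eos_token instead.
EOS_PLACEHOLDER = "<eos>"


def format_example(instruction, input_text, output_text, eos=EOS_PLACEHOLDER):
    """Fill the template, writing "N/A" when the input field is blank."""
    filled_input = input_text if input_text.strip() else "N/A"
    return PROMPT_TEMPLATE.format(instruction, filled_input, output_text) + eos


sample = format_example(
    "Give three tips for staying healthy.",
    "",
    "1. Eat a balanced diet.",
)
print(sample)
```

Because blank inputs become `N/A` at training time, prompts built for inference should use the same placeholder so the model sees a familiar layout.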


## 🏋️ Step 6: Set Up the Trainer and Define Training Parameters

We’ll now configure the supervised fine-tuning (SFT) trainer provided by **Unsloth’s integration with TRL**.  
This setup specifies how the model will train — including batch size, learning rate, precision format, logging steps, and integration with **Weights & Biases** for tracking progress.


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 1,
        num_train_epochs = 3,
        warmup_steps = 5,
        learning_rate = 1e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 5,
        output_dir = "outputs",
        save_total_limit = 1,
        report_to = "wandb"
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/500 [00:00<?, ? examples/s]
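The arguments above determine how many optimizer steps the run will take. The following sketch reproduces that arithmetic with the values from this notebook; it assumes a single GPU and that the last, partially filled batch of each epoch is kept rather than dropped.

```python
import math

num_examples = 500               # size of the Alpaca subset
per_device_batch_size = 8
gradient_accumulation_steps = 1
num_gpus = 1                     # assumption: single-GPU Colab runtime
num_epochs = 3
warmup_steps = 5

# Examples consumed per optimizer step
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Assumption: the final partial batch is kept, hence ceil()
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch:      {steps_per_epoch}")
print(f"Total steps:          {total_steps}")
print(f"Warmup share:         {warmup_fraction:.1%}")
```

The total of 189 steps agrees with the `Total steps = 189` line that the Unsloth banner prints when training starts.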

In [None]:
# ===============================================================
# 🚀 Step 7. Train the model and monitor GPU usage
# ===============================================================

gpu = torch.cuda.get_device_properties(0)
print(f"Using GPU: {gpu.name} ({round(gpu.total_memory/1e9, 2)} GB VRAM)")

trainer_stats = trainer.train()

Using GPU: NVIDIA A100-SXM4-40GB (42.47 GB VRAM)


| Step | Training Loss |
|------|---------------|
| 5 | 2.5748 |
| 10 | 2.1928 |
| 15 | 1.6738 |
| 20 | 1.4019 |
| 25 | 1.3693 |
| 30 | 1.1449 |
| 35 | 1.3306 |
| 40 | 1.2823 |
| 45 | 1.1258 |
| 50 | 1.2354 |


Unsloth: Will smartly offload gradients to save VRAM!
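The cell above reports 42.47 GB for a card labelled 40GB because it divides the byte count by `1e9` (decimal gigabytes). A short sketch of the two unit conventions follows; reading the "40GB" label as binary gibibytes is our assumption.

```python
def bytes_to_gb(num_bytes: float) -> float:
    """Decimal gigabytes (10**9 bytes), as used in the cell above."""
    return num_bytes / 1e9


def bytes_to_gib(num_bytes: float) -> float:
    """Binary gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


# Reconstructed from the rounded figure printed in the log above.
total_bytes = 42.47 * 1e9

print(f"{bytes_to_gb(total_bytes):.2f} GB")
print(f"{bytes_to_gib(total_bytes):.2f} GiB")
```

Either unit is fine for reporting, as long as the same one is used when comparing against peak memory later in the notebook.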


## 📊 Step 8: Review Runtime and GPU Memory Usage

After training completes, we’ll summarize key performance metrics — including total runtime and the maximum GPU memory reserved during fine-tuning.  
These stats help evaluate training efficiency and resource utilization.


In [None]:
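# Reuse `trainer_stats` returned by trainer.train() in Step 7.
# Calling trainer.train() again here would run another 3 epochs on the
# already fine-tuned weights and report only that second run's runtime.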

print("✅ Training completed successfully!")
used_mem = round(torch.cuda.max_memory_reserved() / 1e9, 3)
print(f"⏱ Runtime: {round(trainer_stats.metrics['train_runtime']/60, 2)} minutes")
print(f"💾 Peak reserved GPU memory: {used_mem} GB")

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 500 | Num Epochs = 3 | Total steps = 189
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 134,515,584 of 134,515,584 (100.00% trained)


| Step | Training Loss |
|------|---------------|
| 5 | 2.51 |
| 10 | 1.4536 |
| 15 | 1.1554 |
| 20 | 1.0787 |
| 25 | 1.2468 |
| 30 | 1.0734 |
| 35 | 1.2959 |
| 40 | 1.2381 |
| 45 | 1.0856 |
| 50 | 1.2075 |


Unsloth: Will smartly offload gradients to save VRAM!
✅ Training completed successfully!
⏱ Runtime: 2.61 minutes
💾 Peak reserved GPU memory: 3.02 GB
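As a rough sanity check on the peak memory figure, we can estimate the static cost of full fine-tuning from the parameter count in the banner. This is a back-of-the-envelope sketch, not a measurement: it assumes float32 weights and gradients plus two float32 AdamW moment buffers per parameter, and it ignores activations and allocator overhead. The function name `estimate_training_bytes` is hypothetical.

```python
def estimate_training_bytes(
    num_params: int,
    weight_bytes: int = 4,           # assumption: float32 weights
    grad_bytes: int = 4,             # assumption: float32 gradients
    optimizer_state_bytes: int = 8,  # assumption: two float32 AdamW moments
) -> int:
    """Static memory for weights, gradients, and optimizer state, in bytes."""
    return num_params * (weight_bytes + grad_bytes + optimizer_state_bytes)


trainable_params = 134_515_584  # from the Unsloth banner above

static_bytes = estimate_training_bytes(trainable_params)
print(f"Estimated static footprint: {static_bytes / 1e9:.2f} GB")
```

Under these assumptions the static footprint comes to roughly 2.15 GB, which leaves under 1 GB of the reported 3.02 GB peak for activations, temporary buffers, and caching. With bfloat16 weights or gradients the estimate would be lower.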


## 🧩 Step 9: Test the Fine-Tuned Model with a Sample Prompt

We’ll switch the model to inference mode and pass in a math-reasoning question formatted with the same prompt template used during training.  
The model’s generated response will then be decoded and displayed in Markdown for easier reading.


In [None]:
# Ensure consistent dtype for inference
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
from IPython.display import Markdown

# 1) Put the model in inference mode (applies Unsloth speedups)
FastLanguageModel.for_inference(model)

# 2) Pick a single dtype and move the model to it
inference_dtype = torch.bfloat16 if is_bfloat16_supported() else torch.float16
model = model.to(dtype=inference_dtype)

# 3) Build the prompt with the training template; empty inputs were written
#    as "N/A" during training (Step 5), so the same placeholder is used here
test_prompt = prompt_template.format(
    "If the system of equations 3x + y = a and 2x + 5y = 2a has a solution when x = 2, compute a.",
    "N/A",
    "",
)

# 4) Tokenize and move tensors to the same device as the model
inputs = tokenizer([test_prompt], return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# 5) Generate under autocast with the chosen dtype
#    (torch.autocast replaces the deprecated torch.cuda.amp.autocast)
with torch.autocast(device_type="cuda", dtype=inference_dtype):
    outputs = model.generate(**inputs, max_new_tokens=150)  # keep use_cache=True (default) for speed

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
response = decoded.split("### Response:")[1].strip() if "### Response:" in decoded else decoded
Markdown(response)


3x + 2y = 2a
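The generated text above restates an equation rather than giving a value for `a`, so it helps to have a reference answer to compare against. The sketch below solves the test problem exactly with Python's `fractions` module; the derivation in the comments is our own working, not model output.

```python
from fractions import Fraction

x = Fraction(2)

# From 3x + y = a we get y = a - 3x.
# Substituting into 2x + 5y = 2a:
#   2x + 5(a - 3x) = 2a  ->  3a = 13x  ->  a = 13x / 3
a = Fraction(13, 3) * x
y = a - 3 * x

# Verify both original equations hold exactly.
assert 3 * x + y == a
assert 2 * x + 5 * y == 2 * a

print(f"a = {a}, y = {y}")  # prints: a = 26/3, y = 8/3
```

A response can then be judged by whether it arrives at `a = 26/3`.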

## 💾 Step 10: Save and Upload the Fine-Tuned Model to Hugging Face

Finally, we’ll export the locally fine-tuned model and tokenizer, and then publish them to the Hugging Face Hub.  
Make sure you’ve already authenticated with your Hugging Face account (see Step 3).  




In [None]:
new_model_local = "SmolLM2-135M-Math"
new_model_online = "shubh7/SmolLM2-135M-Math"

model.save_pretrained(new_model_local)
tokenizer.save_pretrained(new_model_local)

model.push_to_hub(new_model_online)
tokenizer.push_to_hub(new_model_online)

print("✅ Model fine-tuned and uploaded successfully!")

Saved model to https://huggingface.co/shubh7/SmolLM2-135M-Math


✅ Model fine-tuned and uploaded successfully!
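Before pushing, it can be worth confirming that the local export directory is complete. The sketch below is a hypothetical pre-upload check: the helper name `missing_files` and the list of expected file names are assumptions based on typical `save_pretrained` output, so adjust them to what your export actually contains.

```python
from pathlib import Path

# Assumed file names; verify against your own export directory.
EXPECTED_FILES = ("config.json", "model.safetensors", "tokenizer_config.json")


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    absent = missing_files("SmolLM2-135M-Math")
    if absent:
        print("Missing before upload:", ", ".join(absent))
    else:
        print("Local export looks complete.")
```

Running the check right after `save_pretrained` and before `push_to_hub` catches an incomplete export before anything is published.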
