# (a) Full Fine-tuning: SmolLM2-135M (Transformers)
**Created:** 2025-11-10 02:42 UTC

This notebook performs *full-parameter fine-tuning* on the tiny **SmolLM2-135M** model using a miniature toy dataset so it runs fast on Kaggle.
We also install **Unsloth** (for later notebooks) and show how to apply a chat template, but the actual full fine-tuning here uses vanilla 🤗 Transformers since it's a very small model.

> **Checklist for your recording**  
> 1) Show that the GPU is enabled (Settings → Accelerator: GPU, Internet: On).  
> 2) Walk through the dataset format and preprocessing.  
> 3) Start training (just 120 steps in this demo).  
> 4) Show sample generations before/after.  
> 5) Save + download the model.

In [1]:
!pip -q install --upgrade pip
!pip -q install "transformers>=4.44.2" "datasets>=2.19.0" "accelerate>=0.33.0" "evaluate" "peft" "trl" "bitsandbytes" "unsloth>=2024.11.0"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.12.0 requires google-cloud-bigquery-storage<3.0.0,>=2.30.0, which is not installed.
pylibcudf-cu12 25.2.2 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
cudf-cu12 25.2.2 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
bigframes 2.12.0 requires rich<14,>=12.4.4, but you have rich 14.2.0 which is incompatible.
libcugraph-cu12 25.6.0 requires libraft-cu12==25.6.*, but you have libraft-cu12 25.2.0 which is incompatible.
torchaudio 2.6.0+cu124 requi

In [2]:
import torch, platform, os, json, random
print("Python:", platform.python_version())
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

Python: 3.11.13
PyTorch: 2.8.0+cu128
CUDA available: True
GPU: Tesla T4


## Build a tiny toy chat dataset
We keep the dataset tiny (four instruction/response pairs) so the demo finishes quickly; swap in your real dataset for a longer run. After a simple template is applied, each example becomes a single string field, `text`.

In [3]:
from datasets import Dataset

pairs = [
    {"instruction":"Write a Python function to add two numbers a and b.","response":"def add(a,b):\n    return a+b"},
    {"instruction":"Explain binary search in 2 sentences.","response":"Binary search repeatedly halves a sorted range to find a target. It runs in O(log n) time."},
    {"instruction":"Generate a short pep talk for learning algorithms.","response":"Keep tinkering. Mistakes are breadcrumbs toward understanding; follow them."},
    {"instruction":"Fix the bug: def f(x): return x*2 if x>10: return 0","response":"def f(x):\n    if x>10:\n        return 0\n    return x*2"},
]

def simple_template(example):
    prompt = f"<|system|>You are a helpful coding assistant.</s>\n<|user|>{example['instruction']}</s>\n<|assistant|>{example['response']}"
    return {"text": prompt}

raw_ds = Dataset.from_list(pairs)
ds = raw_ds.map(simple_template, remove_columns=raw_ds.column_names)
ds = ds.train_test_split(test_size=0.25, seed=42)
ds

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 3
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1
    })
})
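Because the template is written by hand rather than taken from the tokenizer, the prompt used at inference time has to reproduce it exactly, up to and including the `<|assistant|>` marker. One way to keep the two in sync is to build the training text from the inference prompt. This is a minimal sketch; the helper names `build_prompt` and `build_training_text` are hypothetical and not part of the notebook.

```python
SYSTEM_MESSAGE = "You are a helpful coding assistant."

def build_prompt(instruction: str, system: str = SYSTEM_MESSAGE) -> str:
    """Return the templated text up to and including the assistant marker."""
    return f"<|system|>{system}</s>\n<|user|>{instruction}</s>\n<|assistant|>"

def build_training_text(instruction: str, response: str) -> str:
    """Training text is the inference prompt followed by the target response."""
    return build_prompt(instruction) + response

example = build_training_text(
    "Explain binary search in 2 sentences.",
    "Binary search repeatedly halves a sorted range to find a target. It runs in O(log n) time.",
)
print(example)
```

With this split, `build_prompt` can be reused for generation, so the model sees the same prefix it was trained on.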

## Load **SmolLM2-135M** and tokenize

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "HuggingFaceTB/SmolLM2-135M"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tok(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
tokenized = ds.map(tok, batched=True, remove_columns=["text"])

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32)
model.resize_token_embeddings(len(tokenizer))
model.config.use_cache = False
model

`torch_dtype` is deprecated! Use `dtype` instead!

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((576,), eps=1e-05)
    (rotary_emb): Lla
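The layer shapes printed above are enough to estimate how many weights full fine-tuning updates. The arithmetic below is a rough sketch under two assumptions that the truncated printout does not confirm: the output head shares its weights with `embed_tokens`, and the rotary embedding has no trainable parameters.

```python
# Shapes read from the printed LlamaForCausalLM summary.
hidden, vocab, num_layers = 576, 49152, 30
kv_dim, mlp_dim = 192, 1536

# q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
attention = 2 * hidden * hidden + 2 * hidden * kv_dim
# gate_proj, up_proj and down_proj each hold hidden x mlp_dim weights.
mlp = 3 * hidden * mlp_dim
# Two RMSNorm weight vectors per decoder layer.
norms = 2 * hidden
per_layer = attention + mlp + norms

embedding = vocab * hidden
final_norm = hidden

# Assumption: the output head is tied to the embedding, so it adds nothing.
total = embedding + num_layers * per_layer + final_norm
print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Under those assumptions the total lands close to the 135M in the model name, with roughly a fifth of the weights sitting in the embedding table.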

## Train (full-parameter fine-tuning)
We keep the step count small so training completes quickly on Kaggle. For a real run, raise `max_steps`, or remove it and set `num_train_epochs` instead; when `max_steps` is set, it takes precedence.
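With only three training examples and a per-device batch size of 8, every optimizer step already covers the whole training set, so the step budget translates directly into epochs. The helper below is a rough estimate of that relationship, not the Trainer's exact bookkeeping; `epochs_for` is a hypothetical name.

```python
import math

def epochs_for(max_steps, num_examples, per_device_batch, grad_accum=1, num_devices=1):
    """Roughly estimate how many passes over the data a step budget implies."""
    examples_per_step = per_device_batch * grad_accum * num_devices
    # At least one step is needed per epoch, even when the data fits in one batch.
    steps_per_epoch = max(1, math.ceil(num_examples / examples_per_step))
    return max_steps / steps_per_epoch

# This notebook's settings: 3 training rows, batch size 8, 120 steps.
print(epochs_for(max_steps=120, num_examples=3, per_device_batch=8))

# The same step budget on a hypothetical 1,000-example dataset.
print(epochs_for(max_steps=120, num_examples=1000, per_device_batch=8))
```

On the toy data, 120 steps means 120 passes over the same three strings, which is why memorization is the expected outcome here.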

In [6]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
import transformers, torch
print("Transformers:", transformers.__version__)  # just to show in your recording

args = TrainingArguments(
    output_dir="/kaggle/working/smollm2_fullft",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    bf16=torch.cuda.is_available(),
    learning_rate=5e-4,
    warmup_steps=10,
    logging_steps=5,
    # `eval_strategy` is the current name for the older `evaluation_strategy` argument:
    eval_strategy="steps",
    save_strategy="steps",          # so save_steps takes effect
    logging_strategy="steps",       # optional but tidy
    eval_steps=20,
    save_steps=50,
    max_steps=120,
    report_to="none",
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
)
train_result = trainer.train()
train_result


Transformers: 4.57.1


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20   | 0.0359        | 2.921495        |
| 40   | 0.0195        | 3.050814        |
| 60   | 0.0192        | 3.084465        |
| 80   | 0.0192        | 3.09258         |
| 100  | 0.0192        | 3.109593        |
| 120  | 0.0192        | 3.093855        |


TrainOutput(global_step=120, training_loss=0.17629979513585567, metrics={'train_runtime': 36.8776, 'train_samples_per_second': 52.064, 'train_steps_per_second': 3.254, 'total_flos': 13993367362560.0, 'train_loss': 0.17629979513585567, 'epoch': 120.0})
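The logged losses are easier to read as perplexities. The sketch below assumes both columns are mean per-token cross-entropy in nats, so that `exp(loss)` is a perplexity; the numbers are copied from the log above.

```python
import math

# (step, training loss, validation loss) copied from the training log.
log = [
    (20, 0.0359, 2.921495),
    (40, 0.0195, 3.050814),
    (60, 0.0192, 3.084465),
    (80, 0.0192, 3.09258),
    (100, 0.0192, 3.109593),
    (120, 0.0192, 3.093855),
]

# Assumption: losses are mean token-level cross-entropy in nats.
perplexities = [
    (step, math.exp(train_loss), math.exp(val_loss))
    for step, train_loss, val_loss in log
]

for step, train_ppl, val_ppl in perplexities:
    print(f"step {step:>3}: train ppl {train_ppl:.3f}, val ppl {val_ppl:.2f}")
```

A training perplexity near 1 alongside a validation perplexity that stays much higher, and is no lower at step 120 than at step 20, is the usual signature of memorizing three examples rather than learning something that transfers to the held-out one.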

## Quick smoke test

In [7]:
from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
prompt = "<|system|>You are a helpful coding assistant.</s>\n<|user|>Write a Python function to compute factorial.</s>\n<|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    _ = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7, streamer=streamer)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


def factorial(n):
    return n * factorial(n-1)
<|assistant|>Write a Python function to add two numbers a and b.</s>
<|assistant|>def add(a,b):
    return a+b
<|assistant|>def add(a,b):
    return a+b
<|assistant|>def add(a,b):
    return a+b
<|assistant|>def add(a,b):
    return a+b
<|assistant|>def add(a,b
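The sample keeps going after the first answer, and the role markers survive even with `skip_special_tokens=True`, which suggests the tokenizer treats them as ordinary text. One workaround, sketched here as an assumption rather than as the notebook's method, is to decode the full output instead of streaming it and cut the string at the first marker. `trim_reply` and `STOP_MARKERS` are hypothetical names.

```python
# Markers taken from the hand-written template used in this notebook.
STOP_MARKERS = ("</s>", "<|system|>", "<|user|>", "<|assistant|>")

def trim_reply(generated: str, markers=STOP_MARKERS) -> str:
    """Keep only the text before the earliest template marker."""
    cut = len(generated)
    for marker in markers:
        idx = generated.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    # Drop surrounding whitespace left over from the template's newlines.
    return generated[:cut].strip()

sample = (
    "\ndef factorial(n):\n    return n * factorial(n-1)\n"
    "<|assistant|>Write a Python function to add two numbers a and b.</s>\n"
)
print(trim_reply(sample))
```

This only hides the run-on text after the fact. The training strings in this notebook end right after the response with no end-of-sequence marker, which is a likely reason the model does not stop on its own.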


In [8]:
trainer.save_model("/kaggle/working/smollm2_fullft")
tokenizer.save_pretrained("/kaggle/working/smollm2_fullft")
print("Saved to /kaggle/working/smollm2_fullft")

Saved to /kaggle/working/smollm2_fullft
