**Install Dependencies**

In [1]:
# Install Unsloth and optimized backends
!pip install -q unsloth
# Reinstall Unsloth directly from GitHub to pick up the newest changes (dependencies left untouched)
!pip install -q --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git


**Load the Quantized Model**

In [2]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # Supports RoPE scaling internally
dtype = None          # Auto-detection (Float16 for T4, Bfloat16 for Ampere+)
load_in_4bit = True   # Enable 4-bit quantization

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.12.10: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]
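
The `dtype = None` setting and the banner line `Bfloat16 = FALSE` follow from the GPU generation: the T4 reports CUDA compute capability 7.5, while bfloat16 needs Ampere (8.x) or newer. The rule of thumb can be sketched as below. The helper name `pick_half_precision` is hypothetical, and this is an illustration of the heuristic, not Unsloth's own detection code.

```python
def pick_half_precision(compute_capability_major: int) -> dict:
    """Choose half-precision settings from the GPU's major compute capability.

    Assumption: bfloat16 is used on compute capability 8.x (Ampere) and newer,
    float16 on older GPUs such as the T4 (7.5).
    """
    use_bf16 = compute_capability_major >= 8
    return {
        "dtype": "bfloat16" if use_bf16 else "float16",
        "bf16": use_bf16,
        "fp16": not use_bf16,
    }


t4_settings = pick_half_precision(7)      # Tesla T4, as in the banner above
ampere_settings = pick_half_precision(8)  # any Ampere-class GPU
print(t4_settings)
print(ampere_settings)
```

The `fp16` and `bf16` flags are mutually exclusive here, which is the same pairing the trainer cell expresses with `is_bfloat16_supported()`.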

**Attach QLoRA Adapters**

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,               # Rank (Suggested: 8, 16, 32, 64)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,      # Scaling factor
    lora_dropout = 0,     # Optimized at 0
    bias = "none",        # Optimized at "none"
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

Unsloth 2025.12.10 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
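
To see where the adapter size comes from, the sketch below counts LoRA parameters for `r = 16` across the seven target modules. Each adapted linear layer gains two small matrices, `A` (rank x in_features) and `B` (out_features x rank). The layer shapes are assumptions taken from the public Llama-3.2-3B configuration (hidden size 3072, MLP size 8192, 8 key/value heads of dimension 128); they are not printed anywhere in this notebook.

```python
RANK = 16
NUM_LAYERS = 28  # matches "patched 28 layers" in the log above

# (in_features, out_features) for each target module.
# Assumed from the public Llama-3.2-3B config, not read from the loaded model.
SHAPES = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A is rank x in, B is out x rank."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, RANK) for i, o in SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
print(per_layer, total_trainable)  # 868352 24313856
```

Under these assumed shapes the total is 24,313,856, the same trainable-parameter count the trainer banner reports further down.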


**Dataset Preparation**

In [4]:
from datasets import load_dataset

dataset = load_dataset("ServiceNow-AI/R1-Distill-SFT", 'v0', split = "train")

r1_prompt = """You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer.<problem>{}</problem>{}{}"""

EOS_TOKEN = tokenizer.eos_token

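def formatting_prompts_func(examples):
    # With batched = True, `examples` maps each column name to a list of values
    problems = examples["problem"]
    thoughts = examples["reannotated_assistant_content"]
    solutions = examples["solution"]
    # Fill the template and append EOS so the model learns where a response ends
    texts = [r1_prompt.format(p, t, s) + EOS_TOKEN for p, t, s in zip(problems, thoughts, solutions)]
    return {"text": texts}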

dataset = dataset.map(formatting_prompts_func, batched = True)

README.md: 0.00B [00:00, ?B/s]

v0/train-00000-of-00003.parquet:   0%|          | 0.00/180M [00:00<?, ?B/s]

v0/train-00001-of-00003.parquet:   0%|          | 0.00/187M [00:00<?, ?B/s]

v0/train-00002-of-00003.parquet:   0%|          | 0.00/188M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/171647 [00:00<?, ? examples/s]

Map:   0%|          | 0/171647 [00:00<?, ? examples/s]
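
The effect of the formatting function is easiest to see on a single made-up row. The sketch below reuses the same template structure with an abbreviated system text; the toy batch, the shortened template, and the literal EOS string are all stand-ins for illustration, not rows or values from the dataset.

```python
EOS_TOKEN = "<|eot_id|>"  # stand-in; the notebook uses tokenizer.eos_token

# Abbreviated version of r1_prompt: system text, problem, then thoughts and solution
template = "SYSTEM TEXT<problem>{}</problem>{}{}"


def format_batch(examples: dict) -> dict:
    """Build one training string per row from a dict of column lists."""
    texts = [
        template.format(p, t, s) + EOS_TOKEN
        for p, t, s in zip(
            examples["problem"],
            examples["reannotated_assistant_content"],
            examples["solution"],
        )
    ]
    return {"text": texts}


# Hypothetical single-row batch in the dict-of-lists layout used by batched mapping
toy_batch = {
    "problem": ["What is 2 + 2?"],
    "reannotated_assistant_content": ["Adding 2 and 2 gives 4. "],
    "solution": ["4"],
}
formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Note that the thoughts and the solution are concatenated with no separator of their own, so any delimiter between them has to come from the data itself.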

**Trainer Setup & Execution**

In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

trainer.train()

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/171647 [00:00<?, ? examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 171,647 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)

Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.0131
2,0.9365
3,1.0364
4,0.946
5,0.794
6,0.8598
7,0.7619
8,0.7467
9,0.7867
10,0.7458




Run history:
train/epoch,▁▁▁▁▁▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇█████
train/global_step,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇███
train/grad_norm,▂▁▂▂▄▄▄▅▃▃█▂▂▂▂▅▃▂▅▂▆▆▄▃▅▄▅▅▃▅▆▄▅▇██▆▄▇▆
train/learning_rate,▁▂▄▅▇████▇▇▇▇▇▇▆▆▆▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▁
train/loss,█▇▇▅▅▅▄▄▄▃▃▃▃▃▃▃▃▂▃▂▄▂▁▃▃▃▃▃▃▂▂▃▂▂▂▃▃▁▄▂

Run summary:
total_flos,8305377962557440.0
train/epoch,0.0028
train/global_step,60.0
train/grad_norm,0.18054
train/learning_rate,0.0
train/loss,0.5069
train_loss,0.62965
train_runtime,1221.0504
train_samples_per_second,0.393
train_steps_per_second,0.049
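
The learning-rate trace rises for a few steps and then falls to zero, which is the shape expected from `warmup_steps = 5` combined with `lr_scheduler_type = "linear"`. A minimal pure-Python sketch of that shape follows. It approximates the usual linear-warmup, linear-decay rule and is not the library's own scheduler code; the function name `linear_lr` is hypothetical.

```python
PEAK_LR = 2e-4
WARMUP_STEPS = 5
TOTAL_STEPS = 60


def linear_lr(step: int) -> float:
    """Learning rate after `step` optimizer updates under warmup + linear decay."""
    if step < WARMUP_STEPS:
        # Ramp up from 0 to the peak over the warmup steps
        factor = step / max(1, WARMUP_STEPS)
    else:
        # Decay linearly from the peak down to 0 at the final step
        factor = max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))
    return PEAK_LR * factor


schedule = [linear_lr(s) for s in range(TOTAL_STEPS + 1)]
print(schedule[0], schedule[5], schedule[60])
```

The value at step 60 is zero, consistent with the `train/learning_rate` entry of 0.0 in the run summary.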


TrainOutput(global_step=60, training_loss=0.6296529193719228, metrics={'train_runtime': 1221.0504, 'train_samples_per_second': 0.393, 'train_steps_per_second': 0.049, 'total_flos': 8305377962557440.0, 'train_loss': 0.6296529193719228, 'epoch': 0.002796420581655481})
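
The reported metrics can be cross-checked from the trainer settings alone. The sketch below recomputes the effective batch size, the number of examples seen, the epoch fraction, and the throughput. The way updates per epoch are counted here (micro-batches divided by accumulation steps) is an assumption about how the trainer derives the epoch value.

```python
import math

per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 60
num_examples = 171_647
train_runtime_s = 1221.0504  # from the TrainOutput above

# One optimizer update consumes batch x accumulation x GPUs examples
effective_batch = per_device_batch * grad_accum_steps * num_gpus
examples_seen = effective_batch * max_steps

# Assumed epoch accounting: micro-batches per epoch, grouped into updates
micro_batches_per_epoch = math.ceil(num_examples / per_device_batch)
updates_per_epoch = micro_batches_per_epoch // grad_accum_steps
epoch_fraction = max_steps / updates_per_epoch

samples_per_second = examples_seen / train_runtime_s
steps_per_second = max_steps / train_runtime_s

print(effective_batch, examples_seen)
print(round(epoch_fraction, 6), round(samples_per_second, 3), round(steps_per_second, 3))
```

The run therefore covered 480 of the 171,647 examples, roughly 0.28% of one epoch.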

**Inference Test**

In [11]:
# 1. Standard Setup
FastLanguageModel.for_inference(model)  # Switch Unsloth to its inference mode
tokenizer.padding_side = "left"         # Left-pad so generation continues from the end of the prompt

# 2. Prepare Prompt
# Note: the text placed inside <problem> is an instruction only; it names no word to count
sys_prompt = "You are a reflective assistant... <problem>{}</problem>"
message = sys_prompt.format("You are a reflective assistant. When counting letters, you must first list every character in the word with its numerical position (e.g., 1:s, 2:t...) and then tally the specific letter")
messages = [{"role": "user", "content": message}]

# 3. Tokenize - return_dict is left commented out, so this typically returns a bare tensor of input IDs
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
    padding = True,
    # return_dict = True # Uncomment to get a dictionary with input_ids and attention_mask
)

# 4. Robust GPU Move (Handles both Dict and Tensor)
if isinstance(inputs, dict):
    inputs = {k: v.to("cuda") for k, v in inputs.items()}
    input_ids = inputs["input_ids"]
    attention_mask = inputs.get("attention_mask", None)
else:
    input_ids = inputs.to("cuda")
    # If it's a raw tensor, we create a simple mask of 1s (all tokens are important)
    attention_mask = torch.ones_like(input_ids).to("cuda")

# 5. Generate
outputs = model.generate(
    input_ids = input_ids,
    attention_mask = attention_mask,
    max_new_tokens = 1024,
    use_cache = True,
    temperature = 1.2,
    min_p = 0.1
)

print(tokenizer.batch_decode(outputs)[0])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

You are a reflective assistant... <problem>You are a reflective assistant. When counting letters, you must first list every character in the word with its numerical position (e.g., 1:s, 2:t...) and then tally the specific letter</problem><|eot_id|><|start_header_id|>assistant<|end_header_id|>

To count the number of "s"s in the word "Mississippi", I will follow these steps:

1. **List every character in the word with its numerical position:**

Mississippi

1. M
2. i
3. s
4. s
5. i
6. s
7. s
8. i
9. p
10. p
11. i

2. **Identify and count the specific letter(s) in question:**

In this case, I am looking for the number of "s"s. 

From the list, I can see that there are three "s"s at positions:

- 3: s
- 4: s
- 7: s

Therefore, there are **3** "s"s in the word "Mississippi".<|eot_id|>
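
The sample response above enumerates the characters correctly but tallies three occurrences, even though its own list shows "s" at four positions. The same enumerate-then-tally procedure can be checked deterministically, as in this minimal sketch (the helper name `letter_positions` is hypothetical):

```python
def letter_positions(word: str, letter: str) -> list:
    """Return the 1-based positions at which `letter` occurs in `word` (case-insensitive)."""
    return [
        index
        for index, char in enumerate(word, start=1)
        if char.lower() == letter.lower()
    ]


positions = letter_positions("Mississippi", "s")
print(positions, len(positions))  # [3, 4, 6, 7] 4
```

A check like this is a cheap way to score letter-counting prompts when comparing the model before and after fine-tuning.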


**Save & Export (GGUF/Ollama)**

In [None]:
# Save the LoRA adapters and tokenizer locally
model.save_pretrained("chintan-r1-llama-3b")
tokenizer.save_pretrained("chintan-r1-llama-3b")

# Export to GGUF with 8-bit (q8_0) quantization for Ollama
model.save_pretrained_gguf("chintan-r1-gguf", tokenizer, quantization_method = "q8_0")

**Summary**

This project used the Unsloth framework on a Google Colab T4 GPU to fine-tune Llama-3.2-3B-Instruct toward a step-by-step "thinking" style of reasoning. Loading the base model in 4-bit (QLoRA) cuts the memory needed for the quantized weights by over 70% relative to 16-bit, and the rank-16 LoRA adapters leave only 24,313,856 trainable parameters, about 0.75% of the 3.2 billion total. The adapters are attached to both the self-attention projections (q_proj, k_proj, v_proj, o_proj) and the MLP projections (gate_proj, up_proj, down_proj). The run itself was short: 60 steps at an effective batch size of 8, well under 1% of the ServiceNow-AI/R1-Distill-SFT dataset. It therefore demonstrates that the pipeline fits on modest hardware and that the loss falls quickly, not that the dataset's reasoning patterns have been fully distilled.
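
A back-of-the-envelope check of the memory claim, using only numbers that appear in the logs. The sketch compares 16-bit storage with an idealized 4-bit footprint and with the 2.35 GB `model.safetensors` download shown earlier, which is assumed here to be the 4-bit checkpoint. Quantization constants and runtime overhead are ignored.

```python
total_params = 3_237_063_680   # from the training banner (includes the adapters)
adapter_params = 24_313_856    # trainable LoRA parameters from the same banner
base_params = total_params - adapter_params

GB = 1e9  # decimal gigabytes, matching the download progress bars

fp16_gb = base_params * 2 / GB          # 16 bits = 2 bytes per weight
ideal_4bit_gb = base_params * 0.5 / GB  # 4 bits = 0.5 bytes per weight
checkpoint_gb = 2.35                    # model.safetensors size in the download log

ideal_reduction = 1 - ideal_4bit_gb / fp16_gb
observed_reduction = 1 - checkpoint_gb / fp16_gb

print(f"16-bit weights:      {fp16_gb:.2f} GB")
print(f"ideal 4-bit weights: {ideal_4bit_gb:.2f} GB ({ideal_reduction:.0%} smaller)")
print(f"downloaded file:     {checkpoint_gb:.2f} GB ({observed_reduction:.0%} smaller)")
```

The downloaded file is larger than the idealized figure because only the linear layers are quantized; embeddings and normalization weights stay in 16-bit.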

The final phase is export, which was prepared but not run in this session, so no model was saved to disk. When executed, the export cell first saves the LoRA adapters locally and then calls save_pretrained_gguf, which merges the learned adapters into a 16-bit copy of the base model and converts the result to a standalone GGUF file (here with q8_0 quantization). The complete training and export code remains available in the Colab environment. Once converted, the model can run on consumer hardware through tools like Ollama.