# Fine-tuning LLMs with AMD Strix Halo and Unsloth

Tutorial on fine-tuning gpt-oss-20b (and others) on AMD Strix Halo using Unsloth. This mirrors the DGX Spark example with Strix Halo specifics.

## Start with Unsloth Image for Strix Halo

You can use the prebuilt toolbox image or build locally.

Option A (recommended): prebuilt toolbox image

```bash
toolbox create strix-halo-llm-finetuning \
  --image docker.io/kyuz0/amd-strix-halo-llm-finetuning:latest \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --security-opt seccomp=unconfined

toolbox enter strix-halo-llm-finetuning
```

Option B: local Docker build from this repo

```bash
docker build -f Dockerfile -t unsloth-strix-halo .
docker run -it --device /dev/kfd --device /dev/dri \
  --group-add=render --group-add=video -p 8888:8888 \
  -v "$(pwd)":/work -w /work unsloth-strix-halo
```

## Start Jupyter and Run Notebooks

Inside the container:

```bash
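# Bind to all interfaces so the port published with -p 8888:8888 is reachable from the host
jupyter lab --ip 0.0.0.0 --no-browser --notebook-dir /work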
```

If using the toolbox image (see README.md):

```bash
mkdir -p ~/finetuning-workspace/
cp -r /opt/workspace ~/finetuning-workspace/
jupyter lab --notebook-dir ~/finetuning-workspace/
```

The Dockerfile is available at https://github.com/unslothai/notebooks/blob/main/Dockerfile_Strix_Halo

In [1]:
# Quick environment check (ROCm + Unsloth)
import os, torch

print('PyTorch:', torch.__version__)
print('ROCm:', getattr(torch.version, 'hip', None))
print('CUDA (None on ROCm expected):', torch.version.cuda)
print('torch.cuda.is_available():', torch.cuda.is_available())

if torch.cuda.is_available():
    try:
        print('Device:', torch.cuda.get_device_name(0))
        props = torch.cuda.get_device_properties(0)
        print('Total VRAM/Unified (GiB):', round(props.total_memory/1024**3, 2))
    except Exception as e:
        print('Device info error:', e)

try:
    import unsloth, transformers, trl
    print('Unsloth:', getattr(unsloth, '__version__', 'unknown'))
    print('Transformers:', transformers.__version__)
    print('TRL:', trl.__version__)
except Exception as e:
    print('Package import error:', e)

for k in ('UNSLOTH_FA2_COMPUTE_DTYPE','UNSLOTH_ROPE_IMPL','UNSLOTH_DISABLE_TRITON_RMSNORM'):
    print(f'Env {k} =', os.environ.get(k))


PyTorch: 2.7.1+git99ccf24
ROCm: 6.4.43484-123eb5128
CUDA (None on ROCm expected): None
torch.cuda.is_available(): True
Device: AMD Radeon Graphics
Total VRAM/Unified (GiB): 128.0
g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: 2025.11.3
Transformers: 4.57.1
TRL: 0.24.0
Env UNSLOTH_FA2_COMPUTE_DTYPE = None
Env UNSLOTH_ROPE_IMPL = None
Env UNSLOTH_DISABLE_TRITON_RMSNORM = None


## Configuration and Hyperparameters

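This cell sets the model name, sequence length, dtype, the 4-bit loading flag, and the Unsloth ROCm tuning environment variables. The 4-bit flag is requested here, but Unsloth disables 4-bit bitsandbytes on AMD at load time and falls back to 16-bit LoRA, as the load logs further down show.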

In [2]:
import torch
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

from unsloth import FastLanguageModel
from unsloth.chat_templates import (
    standardize_sharegpt,
    train_on_responses_only,
)
from transformers import TextStreamer

# Common hyperparameters
MODEL_NAME = "unsloth/gpt-oss-20b"   # or "unsloth/gpt-oss-20b-unsloth-bnb-4bit"
max_seq_length = 2048
dtype = None         # let Unsloth auto-detect
load_in_4bit = True  # 4bit for memory
LR = 2e-4
EPOCHS = 1           # or use max_steps if you prefer
BATCH_SIZE = 1       # you can crank this up if memory allows

# Set ROCm logging / Unsloth preferences
import os
# os.environ['PYTORCH_ROCM_LOG_LEVEL'] = 'DEBUG'
os.environ['UNSLOTH_FA2_COMPUTE_DTYPE'] = 'float16'
os.environ['UNSLOTH_ROPE_IMPL'] = 'slow'
os.environ['UNSLOTH_DISABLE_TRITON_RMSNORM'] = '1'


## Quick Model Smoke Test (optional)

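Verifies that the base model loads and accepts LoRA adapters under ROCm. The check itself is optional, but the dataset preparation cell below reuses the `tokenizer` created here, so if you skip this cell you need to load the tokenizer some other way before preparing the dataset.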

In [3]:
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

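# NOTE: this overwrites the max_seq_length = 2048 set in the configuration cell,
# so later cells see 1024 unless the configuration cell is re-run.
max_seq_length = 1024
dtype = None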

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name       = "unsloth/gpt-oss-20b",
    dtype            = dtype,
    max_seq_length   = max_seq_length,
    load_in_4bit     = True,
    full_finetuning  = False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r                         = 8,
    target_modules            = ["q_proj", "k_proj", "v_proj", "o_proj",
                                 "gate_proj", "up_proj", "down_proj"],
    lora_alpha                = 16,
    lora_dropout              = 0,
    bias                      = "none",
    use_gradient_checkpointing= "unsloth",
    random_state              = 3407,
    use_rslora                = False,
    loftq_config              = None,
)


Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
==((====))==  Unsloth 2025.11.3: Fast Gpt_Oss patching. Transformers: 4.57.1.
   \\   /|    AMD Radeon Graphics. Num GPUs = 1. Max memory: 128.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+git99ccf24. ROCm Toolkit: 6.4.43484-123eb5128. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30+13c93f39.d20251112. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


MXFP4 quantization requires Triton and kernels installed: CUDA requires Triton >= 3.4.0, XPU requires Triton >= 3.5.0, we will default to dequantizing the model to bf16


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Unsloth: Making `model.base_model.model.model` require gradients


## Dataset Preparation

Loads a small quotes dataset, converts each row to chat messages, and renders Harmony-format text with the tokenizer's chat template.

In [4]:
from datasets import load_dataset

# 1) Load subset of quotes
quotes_ds = (
    load_dataset("Abirate/english_quotes", split="train")
    .shuffle(seed=42)
    .select(range(1000))
)

# 2) Turn each row into chat messages
def build_quotes_messages(example):
    return {
        "messages": [
            {
                "role": "user",
                "content": f"Give me a quote about: {example['tags']}",
            },
            {
                "role": "assistant",
                "content": f"{example['quote']} - {example['author']}",
            },
        ]
    }

quotes_ds = quotes_ds.map(
    build_quotes_messages,
    remove_columns=quotes_ds.column_names,
)

# 3) Convert messages → Harmony text using the *existing* tokenizer
def quotes_to_text(batch):
    convos = batch["messages"]
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize=False,
            add_generation_prompt=False,
        )
        for convo in convos
    ]
    return {"text": texts}

quotes_ds_text = quotes_ds.map(
    quotes_to_text,
    batched=True,
    remove_columns=["messages"],   # we only keep "text"
)

# 4) Train / test split
quotes_ds_split = quotes_ds_text.train_test_split(test_size=0.2, seed=42)

print(quotes_ds_split)
print(quotes_ds_split["train"][0])


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text'],
        num_rows: 200
    })
})
{'text': "<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-11-13\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Give me a quote about: ['books', 'humor']<|end|><|start|>assistant<|channel|>final<|message|>“There are two motives for reading a book; one, that you enjoy it; the other, that you can boast about it.” - Bertrand Russell<|end|>"}
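
The sample above shows that interpolating `example['tags']` directly puts the Python list literal (`['books', 'humor']`) into the user prompt. If a plainer prompt is preferred, the tags can be joined before formatting. The following is a minimal sketch; `format_tags` and `build_quotes_messages_joined` are hypothetical names that are not part of the original notebook, and the `"anything"` fallback for empty tag lists is an assumption.

```python
def format_tags(tags):
    """Join a list of tag strings into a comma-separated phrase."""
    # Drop empty or whitespace-only tags; fall back to a generic topic
    # when nothing is left (assumed behavior, not from the notebook).
    cleaned = [t.strip() for t in (tags or []) if t and t.strip()]
    return ", ".join(cleaned) if cleaned else "anything"


def build_quotes_messages_joined(example):
    """Variant of build_quotes_messages that joins tags instead of printing the list."""
    return {
        "messages": [
            {
                "role": "user",
                "content": f"Give me a quote about: {format_tags(example['tags'])}",
            },
            {
                "role": "assistant",
                "content": f"{example['quote']} - {example['author']}",
            },
        ]
    }


row = {"tags": ["books", "humor"], "quote": "A quote.", "author": "Someone"}
print(build_quotes_messages_joined(row)["messages"][0]["content"])
# Give me a quote about: books, humor
```

Passing this function to `quotes_ds.map(...)` in place of `build_quotes_messages` would leave the rest of the pipeline unchanged.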


## Load Model and Apply LoRA

Loads the Unsloth-optimized model and attaches LoRA adapters for memory-efficient fine-tuning.

In [5]:
# ==== Load model + tokenizer from Unsloth ====
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name       = MODEL_NAME,
    max_seq_length   = max_seq_length,
    dtype            = dtype,
    load_in_4bit     = load_in_4bit,
    full_finetuning  = False,   # we want LoRA, not full finetune
)

# Attach LoRA via Unsloth
model = FastLanguageModel.get_peft_model(
    model,
    r                       = 8,
    target_modules          = ["q_proj", "k_proj", "v_proj", "o_proj",
                               "gate_proj", "up_proj", "down_proj"],
    lora_alpha              = 16,
    lora_dropout            = 0.0,
    bias                    = "none",
    use_gradient_checkpointing = "unsloth",
    random_state            = 3407,
    use_rslora              = False,
    loftq_config            = None,
)

model.print_trainable_parameters()


Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
==((====))==  Unsloth 2025.11.3: Fast Gpt_Oss patching. Transformers: 4.57.1.
   \\   /|    AMD Radeon Graphics. Num GPUs = 1. Max memory: 128.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+git99ccf24. ROCm Toolkit: 6.4.43484-123eb5128. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30+13c93f39.d20251112. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Unsloth: Making `model.base_model.model.model` require gradients
trainable params: 3,981,312 || all params: 20,918,738,496 || trainable%: 0.0190
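
The trainable-parameter count reported above can be sanity-checked with a short calculation. A rank-`r` LoRA adapter on a linear layer of shape `d_in x d_out` adds `r * (d_in + d_out)` parameters. The sketch below assumes the gpt-oss-20b dimensions shown in the comments (hidden size, layer count, head counts, head size); these values are not stated anywhere in this notebook and should be checked against the model's config. Under those assumptions, the four attention projections alone add up to the reported figure. One plausible reading, which the notebook does not confirm, is that adapters were attached only to the attention projections.

```python
# Assumed gpt-oss-20b dimensions (NOT stated in this notebook; verify in config.json)
HIDDEN = 2880
N_LAYERS = 24
N_HEADS = 64
N_KV_HEADS = 8
HEAD_DIM = 64
R = 8  # LoRA rank used in get_peft_model above


def lora_params(d_in, d_out, r=R):
    """A rank-r adapter adds matrices A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


q_dim = N_HEADS * HEAD_DIM       # 4096
kv_dim = N_KV_HEADS * HEAD_DIM   # 512

per_layer = (
    lora_params(HIDDEN, q_dim)     # q_proj
    + lora_params(HIDDEN, kv_dim)  # k_proj
    + lora_params(HIDDEN, kv_dim)  # v_proj
    + lora_params(q_dim, HIDDEN)   # o_proj
)
total = per_layer * N_LAYERS

print(per_layer)                                # 165888
print(total)                                    # 3981312
print(f"{100 * total / 20_918_738_496:.4f}%")   # 0.0190%
```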


## Train

Configures TRL SFTTrainer and fine-tunes for a small number of steps for validation. Increase steps/epochs for real training.

In [None]:
from trl import SFTConfig, SFTTrainer

quotes_args = SFTConfig(
    output_dir                  = "outputs-quotes",
    dataset_text_field          = "text",
    packing                     = False,
    num_train_epochs            = EPOCHS,
    per_device_train_batch_size = BATCH_SIZE,
    gradient_accumulation_steps = 4,
    warmup_steps                = 5,
    max_steps                   = 30,       # overrides num_train_epochs; remove it (default -1) to train by epochs
    learning_rate               = LR,
    logging_steps               = 1,
    optim                       = "adamw_8bit",
    weight_decay                = 0.001,
    lr_scheduler_type           = "linear",
    seed                        = 3407,
    report_to                   = "none",
)

quotes_trainer = SFTTrainer(
    model            = model,
    args             = quotes_args,
    train_dataset    = quotes_ds_split["train"],
    eval_dataset     = quotes_ds_split["test"],
    processing_class = tokenizer,
    dataset_num_proc   = 2,
)

quotes_stats = quotes_trainer.train()
quotes_trainer.save_model("finetuned_quotes")


Unsloth: Tokenizing ["text"] (num_proc=36):   0%|          | 0/800 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=36):   0%|          | 0/200 [00:00<?, ? examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 800 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 3,981,312 of 20,918,738,496 (0.02% trained)


Step,Training Loss
1,4.7033
2,4.7108
3,4.7781
4,4.4171
5,3.9997
6,3.7618
7,3.4995
8,3.1457
9,2.8234
10,2.5249
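
The training banner reports a total batch size of 4 and 30 total steps. A short calculation shows how much of the 800-example training split such a run covers. This is plain arithmetic on the values configured above, with variable names chosen here for illustration.

```python
import math

num_examples = 800      # size of the training split
per_device_batch = 1    # BATCH_SIZE
grad_accum = 4          # gradient_accumulation_steps
num_devices = 1
max_steps = 30

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_devices
# Optimizer steps needed for one full pass over the training split
steps_per_epoch = math.ceil(num_examples / effective_batch)
# Examples actually seen when training stops at max_steps
examples_seen = max_steps * effective_batch
fraction_of_epoch = examples_seen / num_examples

print(effective_batch)     # 4
print(steps_per_epoch)     # 200
print(examples_seen)       # 120
print(fraction_of_epoch)   # 0.15
```

So the 30-step validation run sees 15% of one epoch; a full epoch at these settings would take 200 steps.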


Unsloth: Will smartly offload gradients to save VRAM!
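
With `lr_scheduler_type = "linear"`, `warmup_steps = 5`, and `max_steps = 30`, the learning rate ramps up linearly to `LR` and then decays linearly to zero. The sketch below reproduces the usual linear-warmup, linear-decay rule as a standalone function; `linear_schedule_lr` is a hypothetical helper written for illustration, and the values the trainer logs at a given step may be offset by one step depending on when the scheduler is advanced.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=30):
    """Learning rate after `step` scheduler updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Ramp from 0 up to base_lr over the warmup steps
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr at the end of warmup down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 10, 30):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

With only 30 steps in total, the rate is at its peak of 2e-4 for a single step and is already down to 1.6e-4 by step 10.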


## Inference Test

Run a quick generation to verify the fine-tuned model responds as expected.

In [7]:
from transformers import TextStreamer
import torch

# Prepare a simple prompt
messages = [
    {"role": "user", "content": "Give me a short inspiring quote about persistence."}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# Switch to inference optimizations
model = FastLanguageModel.for_inference(model)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs.get("attention_mask"),
        max_new_tokens=60,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))


systemYou are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-13

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.userGive me a short inspiring quote about persistence.assistantfinal"Success is not final, failure is not fatal: It is the courage to continue that counts." — Winston Churchillassistantfinal"Keep your face always toward the sunshine—and shadows will fall behind you." — Walt Whitman
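
Because the output above is decoded with `skip_special_tokens=True`, the Harmony markers are stripped and the system prompt, user turn, and two consecutive assistant turns run together. One way to isolate the first answer is to decode with `skip_special_tokens=False` and cut on the markers. The helper below is a minimal sketch: `first_final_message` is a hypothetical name, the `<|channel|>final<|message|>` and `<|end|>` markers are taken from the training sample printed earlier, and `<|return|>` as an additional stop marker is an assumption.

```python
FINAL_MARKER = "<|channel|>final<|message|>"
# "<|end|>" appears in the training sample; "<|return|>" is an assumed extra stop marker.
STOP_MARKERS = ("<|end|>", "<|return|>", "<|start|>")


def first_final_message(decoded):
    """Return the text of the first final-channel message, or None if there is none."""
    start = decoded.find(FINAL_MARKER)
    if start == -1:
        return None
    start += len(FINAL_MARKER)
    # Cut at the earliest stop marker that follows the start of the message
    stops = [i for i in (decoded.find(m, start) for m in STOP_MARKERS) if i != -1]
    end = min(stops) if stops else len(decoded)
    return decoded[start:end].strip()


sample = (
    "<|start|>user<|message|>Give me a short inspiring quote about persistence.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>First quote.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>Second quote.<|end|>"
)
print(first_final_message(sample))
# First quote.
```

In the notebook this would be applied to `tokenizer.decode(outputs[0], skip_special_tokens=False)`.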


## Unified Memory Usage

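On Strix Halo with unified memory, LoRA fine-tuning of the 20B model fits comfortably. Note that in this run Unsloth disabled 4-bit bitsandbytes on AMD and fell back to 16-bit LoRA, so the figures below reflect 16-bit weights rather than 4-bit ones. Use the cell below to inspect memory usage.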

In [8]:
import torch
if torch.cuda.is_available():
    print(f'Memory allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GiB')
    print(f'Max memory reserved: {torch.cuda.max_memory_reserved()/1024**3:.2f} GiB')
else:
    print('CUDA (ROCm) device not available in this environment.')


Memory allocated: 39.03 GiB
Max memory reserved: 84.04 GiB
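
As a rough cross-check of the allocated figure, the 16-bit weights alone can be estimated from the parameter count printed earlier. This is a back-of-the-envelope sketch that ignores activations, optimizer state, LoRA adapters, and allocator caching; the 2 bytes per parameter follows from the 16-bit fallback reported in the load logs.

```python
GIB = 1024 ** 3

total_params = 20_918_738_496   # "all params" from print_trainable_parameters()
bytes_per_param = 2             # bf16 / fp16 weights

weights_gib = total_params * bytes_per_param / GIB
print(f"16-bit weights: {weights_gib:.2f} GiB")   # 16-bit weights: 38.96 GiB

# Headroom relative to the 128 GiB reported for the device and the peak reserved above
total_gib = 128.0
max_reserved_gib = 84.04
print(f"Headroom at peak: {total_gib - max_reserved_gib:.2f} GiB")   # Headroom at peak: 43.96 GiB
```

The estimate of about 39 GiB is close to the 39.03 GiB reported as allocated, which is consistent with the weights dominating steady-state usage.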


## Troubleshooting

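- GPU not visible: pass `--device /dev/kfd --device /dev/dri` to the container and add your user to the `render` and `video` groups.
- OOM or slow: reduce `max_seq_length`, keep the per-device batch size small, and increase gradient accumulation. 4-bit loading is not a fallback here: Unsloth currently disables 4-bit bitsandbytes on AMD and switches to 16-bit LoRA, as the load logs show.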
- Kernel params for unified memory (see README): `amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432`.
- If FA2/RMSNorm issues: set `UNSLOTH_FA2_COMPUTE_DTYPE=float16` and `UNSLOTH_DISABLE_TRITON_RMSNORM=1`.
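
The two numeric kernel parameters in the list above both work out to 128 GiB, matching the unified memory size reported by the environment check. The sketch below shows the conversion, assuming `amdgpu.gttsize` is expressed in MiB and `ttm.pages_limit` counts 4 KiB pages; on a system with a different page size the second figure would change.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
PAGE_SIZE = 4096            # bytes; assumes 4 KiB pages

gttsize_mib = 131072        # amdgpu.gttsize, interpreted as MiB
pages_limit = 33554432      # ttm.pages_limit, a page count

gtt_gib = gttsize_mib * MIB / GIB
ttm_gib = pages_limit * PAGE_SIZE / GIB

print(gtt_gib)   # 128.0
print(ttm_gib)   # 128.0
```

To target a different memory size, the same arithmetic can be run in reverse: multiply the desired GiB by 1024 for `gttsize` and by 262144 for `pages_limit` (again assuming 4 KiB pages).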


## Credits

Special thanks to kyuz0 for their Transformers fine-tuning notebook and Dockerfile for setting up ROCm and gfx1151 drivers on Strix Halo:
- https://github.com/kyuz0/amd-strix-halo-llm-finetuning