Raw data, harness scripts, and methodology for the benchmark in Vulkan/RADV vs ROCm 6.4 on Strix Halo: What 128 Benchmark Runs Actually Showed.
If you want the narrative and the verdict, read the post. This repo is for the people who want to verify our numbers or re-run the bench on their own hardware.
- On Qwen 3.6-35B-A3B MoE: Vulkan/RADV is 25–32 % faster on generation than ROCm 6.4.4 across all prompt sizes. Vulkan also wins prefill by 6–8 % once context grows past trivial.
- On Gemma 4 31B Dense: Vulkan runs cleanly at ~6 t/s. ROCm fails 48 of 48 runs in three distinct failure modes (hipGraphInstantiate OOM at production context size, degenerate output at smaller context, no tested workaround restores correct output).
- ROCm failures reproduce identically on `llama.cpp` master (dbe7901ca, May 14 2026), so this is not a stale-build issue.
| File / Directory | What it is |
|---|---|
| `methodology.md` | Full setup, build flags, run procedure, decision log |
| `results.csv` | 128 raw runs, one row per inference call |
| `summary.csv` | Aggregated stats per (backend, model, cache_k, prompt) cell |
| `garbage-samples.txt` | Forensic captures of ROCm × Gemma degenerate output with token-ID decomposition |
| `anomalies.log` | Verbatim crash logs and failure traces |
| `prompts/` | The four fixed prompts used across the bench |
| `scripts/` | Harness (server lifecycle, VRAM polling, streaming TTFT capture) |
- Bosgame M5
- AMD Ryzen AI MAX+ 395 ("Strix Halo")
- gfx1151 iGPU, 96 GB BIOS-allocated VRAM
- 128 GB unified LPDDR5X-8000 @ ~256 GB/s
- Fedora Server 43, headless
Kernel command line:
amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856
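As a sanity check on those two values, the arithmetic below converts both to GiB. It assumes `amdgpu.gttsize` is expressed in MiB and `ttm.pages_limit` in 4 KiB pages (the usual x86-64 page size); under those assumptions both settings describe the same 124 GiB ceiling. The function names are illustrative and not part of the harness.

```python
MIB_PER_GIB = 1024
PAGE_SIZE_BYTES = 4096  # assumption: 4 KiB pages, the x86-64 default


def gttsize_to_gib(gttsize_mib: int) -> float:
    """Convert an amdgpu.gttsize value (assumed to be MiB) to GiB."""
    return gttsize_mib / MIB_PER_GIB


def pages_limit_to_gib(pages: int, page_size: int = PAGE_SIZE_BYTES) -> float:
    """Convert a ttm.pages_limit value (a page count) to GiB."""
    return pages * page_size / 1024**3


if __name__ == "__main__":
    print(gttsize_to_gib(126976))        # amdgpu.gttsize=126976
    print(pages_limit_to_gib(32505856))  # ttm.pages_limit=32505856
```

If the two numbers disagree on your machine, one of the unit assumptions above probably does not hold for your kernel.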
| Component | Version |
|---|---|
| `llama.cpp` (primary) | b9016 (846262d78), May 4 2026 |
| `llama.cpp` (verification) | master (dbe7901ca), May 14 2026 |
| Vulkan/RADV | Mesa 25.3.6 (default Fedora 43 stack) |
| ROCm/HIP | 6.4.4 (host-native, dnf install rocm-hip-runtime rocm-llvm hip-runtime-amd) |
| rocWMMA | 6.4.0-3.fc43 (does not enumerate gfx1151 — bench built with GGML_HIP_ROCWMMA_FATTN=OFF) |
Build flags, both backends:
-DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON
Backend-specific:
build-vulkan: -DGGML_VULKAN=ON -DGGML_HIP=OFF
build-rocm: -DGGML_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1151 -DGGML_HIP_ROCWMMA_FATTN=OFF
| Model | File | Size |
|---|---|---|
| Qwen 3.6-35B-A3B (MoE) | `Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf` (Unsloth Dynamic Q5) | ~25 GB |
| Gemma 4 31B (Dense) | `gemma-4-31B-it-Q8_0.gguf` | ~33 GB |
See methodology.md for the full procedure. Short version:
- Two parallel `llama.cpp` builds at b9016, installed to `/opt/llamacpp/vulkan/bin/` and `/opt/llamacpp/rocm/bin/`.
- Stop any other inference processes; the bench needs an idle GPU.
- Server runs on port 9090 (separate from any production endpoints).
- Run the harness in `scripts/run_bench.sh`.
- Each cell: 5 runs, drop run 1 as warmup, statistics on runs 2–5.
- Thermal cooldown to GPU edge temperature < 60 °C between runs.
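The warmup rule above (five runs per cell, first run discarded) can be sketched as follows. This is a minimal illustration, not the harness code: the function name is hypothetical, and the choice of mean plus sample standard deviation is an assumption, since `methodology.md` is the authority on which statistics `summary.csv` actually records.

```python
from statistics import mean, stdev


def cell_stats(runs, warmup=1):
    """Summarise one cell: drop the leading warmup runs, then aggregate."""
    measured = list(runs)[warmup:]
    if len(measured) < 2:
        raise ValueError("need at least two measured runs for a standard deviation")
    return {
        "n": len(measured),
        "mean": mean(measured),
        "stdev": stdev(measured),  # sample standard deviation (n - 1)
    }


if __name__ == "__main__":
    # Made-up tokens/s values for illustration only; not taken from results.csv.
    example = [10.0, 20.0, 22.0, 24.0, 26.0]
    print(cell_stats(example))
```

Dropping run 1 keeps one-off costs such as model load and cache warmup out of the per-cell figures.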
Expect roughly 6–8 hours wall-clock for the full 128-run sweep with thermal cooldowns. The Gemma × ROCm cells fail fast (~10–30 s each) so they don't dominate runtime.
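The cooldown gate between runs (GPU edge temperature below 60 °C) accounts for much of that wall-clock time. A minimal sketch of such a gate is shown here as a polling loop. Only the 60 °C threshold comes from this README; the poll interval, the timeout, and the `read_temp_c` callback are hypothetical, and how the real harness reads the sensor is not shown.

```python
import time


def wait_for_cooldown(read_temp_c, threshold_c=60.0, poll_s=5.0,
                      timeout_s=900.0, sleep=time.sleep):
    """Block until read_temp_c() reports a value below threshold_c.

    read_temp_c is any zero-argument callable returning degrees Celsius.
    Returns the first reading below the threshold, or raises TimeoutError
    once timeout_s seconds of polling have elapsed without cooling down.
    """
    waited = 0.0
    while True:
        temp = read_temp_c()
        if temp < threshold_c:
            return temp
        if waited >= timeout_s:
            raise TimeoutError(f"GPU still at {temp} C after {waited} s")
        sleep(poll_s)
        waited += poll_s


if __name__ == "__main__":
    # Simulated sensor readings so the sketch runs without GPU access.
    readings = iter([71.0, 64.5, 59.0])
    print(wait_for_cooldown(lambda: next(readings), sleep=lambda s: None))
```

Passing the sensor reader and the sleep function as arguments keeps the loop testable on a machine without the GPU.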
If you can reproduce these results on similar hardware — or, more
interestingly, can't — open an issue. Include hardware details, ROCm
and Mesa versions, llama.cpp commit, and what you saw.
If you find a methodology mistake, also open an issue. Data and the post both get corrected with attribution.
| What | License |
|---|---|
| Data (CSVs, logs, samples) | CC0 1.0 — public domain, attribution appreciated but not required |
| Scripts | MIT |
- The post: thefrontierlab.ai
- Background — what 96 GB of VRAM actually gets you on Strix Halo: Post 1