thefrontierlab/post2-bench

post2-bench

Raw data, harness scripts, and methodology for the benchmark in the post "Vulkan/RADV vs ROCm 6.4 on Strix Halo: What 128 Benchmark Runs Actually Showed".

If you want the narrative and the verdict, read the post. This repo is for the people who want to verify our numbers or re-run the bench on their own hardware.

Key findings

  • On Qwen 3.6-35B-A3B MoE: Vulkan/RADV is 25–32 % faster on generation than ROCm 6.4.4 across all prompt sizes. Vulkan also wins prefill by 6–8 % once the context grows past trivial length.
  • On Gemma 4 31B Dense: Vulkan runs cleanly at ~6 t/s. ROCm fails 48 of 48 runs in three distinct failure modes (hipGraphInstantiate OOM at production context size; degenerate output at smaller context; incorrect output under every workaround tested).
  • ROCm failures reproduce identically on llama.cpp master (dbe7901ca, May 14 2026), so this is not a stale-build issue.
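
For readers checking the percentages against results.csv, here is a minimal sketch of how a "faster by N %" figure is commonly derived from two throughput numbers. It assumes the slower backend (ROCm) is the baseline, which this README does not state explicitly. The function name is hypothetical and the throughput values are placeholders, not measurements from this bench.

```python
def percent_faster(candidate_tps: float, baseline_tps: float) -> float:
    """Relative throughput gain of candidate over baseline, in percent."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate_tps / baseline_tps - 1.0) * 100.0


# Placeholder numbers for illustration only (NOT taken from results.csv).
example = percent_faster(candidate_tps=50.0, baseline_tps=40.0)
print(f"{example:.1f} % faster")
```

If the post used the faster backend as the baseline instead, the same data would yield a smaller percentage, so it is worth confirming the convention in methodology.md before comparing.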

What's here

| File / Directory | What it is |
| --- | --- |
| methodology.md | Full setup, build flags, run procedure, decision log |
| results.csv | 128 raw runs, one row per inference call |
| summary.csv | Aggregated stats per (backend, model, cache_k, prompt) cell |
| garbage-samples.txt | Forensic captures of ROCm × Gemma degenerate output with token-ID decomposition |
| anomalies.log | Verbatim crash logs and failure traces |
| prompts/ | The four fixed prompts used across the bench |
| scripts/ | Harness (server lifecycle, VRAM polling, streaming TTFT capture) |
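
To relate results.csv to summary.csv, the raw rows can be grouped by cell. The sketch below uses a tiny inline stand-in for the CSV. The key columns follow the (backend, model, cache_k, prompt) tuple named above, but the exact header spelling, the `run` and `gen_tps` columns, and all values are assumptions for illustration; check the real file header first.

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for results.csv. Column names and values are assumed,
# not copied from the repository.
SAMPLE = """backend,model,cache_k,prompt,run,gen_tps
vulkan,qwen,f16,short,1,10.0
vulkan,qwen,f16,short,2,12.0
rocm,qwen,f16,short,1,9.0
"""


def group_by_cell(csv_text: str) -> dict:
    """Map each (backend, model, cache_k, prompt) cell to its list of rows."""
    cells = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["backend"], row["model"], row["cache_k"], row["prompt"])
        cells[key].append(row)
    return dict(cells)


cells = group_by_cell(SAMPLE)
for key, rows in cells.items():
    print(key, len(rows))
```

Swapping `io.StringIO(csv_text)` for an open file handle on results.csv should work the same way, provided the column names match.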

Hardware

  • Bosgame M5
  • AMD Ryzen AI MAX+ 395 ("Strix Halo")
  • gfx1151 iGPU, 96 GB BIOS-allocated VRAM
  • 128 GB unified LPDDR5X-8000 @ ~256 GB/s
  • Fedora Server 43, headless

Kernel command line:

amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856
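
The two memory parameters appear to describe the same ceiling. A quick check, assuming amdgpu.gttsize is expressed in MiB and ttm.pages_limit counts 4 KiB pages (both are assumptions about the kernel parameters, not statements from this README):

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
PAGE_SIZE = 4096  # bytes; assumed x86-64 base page size

gttsize_mib = 126976    # amdgpu.gttsize from the command line above (assumed MiB)
pages_limit = 32505856  # ttm.pages_limit from the command line above (pages)

gtt_gib = gttsize_mib * MIB / GIB
ttm_gib = pages_limit * PAGE_SIZE / GIB
print(gtt_gib, ttm_gib)  # 124.0 124.0
```

Under those assumptions both settings come out to 124 GiB.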

Software

| Component | Version |
| --- | --- |
| llama.cpp (primary) | b9016 (846262d78), May 4 2026 |
| llama.cpp (verification) | master (dbe7901ca), May 14 2026 |
| Vulkan/RADV | Mesa 25.3.6 (default Fedora 43 stack) |
| ROCm/HIP | 6.4.4 (host-native, dnf install rocm-hip-runtime rocm-llvm hip-runtime-amd) |
| rocWMMA | 6.4.0-3.fc43 (does not enumerate gfx1151; bench built with GGML_HIP_ROCWMMA_FATTN=OFF) |

Build flags, both backends:

-DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON

Backend-specific:

build-vulkan: -DGGML_VULKAN=ON -DGGML_HIP=OFF
build-rocm:   -DGGML_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1151 -DGGML_HIP_ROCWMMA_FATTN=OFF
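
As a rough illustration of how the common and backend-specific flags combine, the sketch below assembles one CMake configure command per build directory. Only the -D flags come from this README; the `cmake -S/-B` invocation shape, the source directory, and the helper name are assumptions, and methodology.md remains the authoritative procedure.

```python
import shlex

# Flags shared by both builds, as listed above.
COMMON = ["-DCMAKE_BUILD_TYPE=Release", "-DGGML_NATIVE=ON"]

# Backend-specific flags, keyed by build directory name.
BACKENDS = {
    "build-vulkan": ["-DGGML_VULKAN=ON", "-DGGML_HIP=OFF"],
    "build-rocm": [
        "-DGGML_HIP=ON",
        "-DCMAKE_HIP_ARCHITECTURES=gfx1151",
        "-DGGML_HIP_ROCWMMA_FATTN=OFF",
    ],
}


def configure_command(build_dir: str, source_dir: str = ".") -> list:
    """Return the cmake configure argv for one backend (hypothetical helper)."""
    return ["cmake", "-S", source_dir, "-B", build_dir, *COMMON, *BACKENDS[build_dir]]


for name in BACKENDS:
    print(shlex.join(configure_command(name)))
```

The script only prints the commands; any compiler or environment setup the HIP build needs is not covered here.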

Models

| Model | File | Size |
| --- | --- | --- |
| Qwen 3.6-35B-A3B (MoE) | Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf (Unsloth Dynamic Q5) | ~25 GB |
| Gemma 4 31B (Dense) | gemma-4-31B-it-Q8_0.gguf | ~33 GB |
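
A back-of-the-envelope sanity check for the dense model: if decoding is memory-bound and every generated token reads all weights once, throughput cannot exceed bandwidth divided by weight size. The sketch below applies that simplification to the ~256 GB/s and ~33 GB figures from this README. It ignores KV-cache traffic, compute limits, and achievable versus nominal bandwidth, so treat the result as a loose upper bound rather than a prediction. It does not apply to the MoE model, which reads only a subset of weights per token.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s when each token streams all weights once."""
    if weights_gb <= 0:
        raise ValueError("weight size must be positive")
    return bandwidth_gb_s / weights_gb


# Nominal figures from the Hardware and Models sections of this README.
ceiling = decode_ceiling_tps(bandwidth_gb_s=256.0, weights_gb=33.0)
print(f"{ceiling:.1f} t/s")
```

The ~6 t/s reported for Vulkan on Gemma 4 31B sits below this bound, which is at least consistent with a bandwidth-limited decode.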

Reproducing

See methodology.md for the full procedure. Short version:

  1. Build llama.cpp twice at b9016, once per backend, and install the builds to /opt/llamacpp/vulkan/bin/ and /opt/llamacpp/rocm/bin/.
  2. Stop any other inference processes; the bench needs an idle GPU.
  3. The server runs on port 9090 (separate from any production endpoints).
  4. Run the harness in scripts/run_bench.sh.
  5. Each cell is 5 runs: run 1 is dropped as warmup, and statistics cover runs 2–5.
  6. Between runs, wait for thermal cooldown to a GPU edge temperature < 60 °C.
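
The warmup rule in step 5 can be sketched as follows. The README does not say which statistics summary.csv reports, so mean, median, and sample standard deviation are assumptions here, as are the function name and the placeholder values.

```python
import statistics


def cell_stats(run_values: list, warmup_runs: int = 1) -> dict:
    """Drop the leading warmup run(s) and summarize the remaining runs."""
    measured = run_values[warmup_runs:]
    if len(measured) < 2:
        raise ValueError("need at least two measured runs for a standard deviation")
    return {
        "n": len(measured),
        "mean": statistics.mean(measured),
        "median": statistics.median(measured),
        "stdev": statistics.stdev(measured),  # sample standard deviation
    }


# Placeholder tokens/s values for one cell (NOT from results.csv);
# the first entry plays the role of the discarded warmup run.
runs = [8.0, 10.0, 12.0, 10.0, 12.0]
stats = cell_stats(runs)
print(stats)
```

With 5 runs per cell and one warmup, each cell contributes 4 measured values.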

Expect roughly 6–8 hours wall-clock for the full 128-run sweep with thermal cooldowns. The Gemma × ROCm cells fail fast (~10–30 s each) so they don't dominate runtime.
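
The cooldown gate between runs might look like the sketch below. It is not the harness code: the function name, polling interval, and timeout are hypothetical, and the temperature source is injected as a callable because the actual sensor path depends on the system and is not given in this README.

```python
import time


def wait_for_cooldown(read_temp_c, threshold_c=60.0, poll_s=5.0,
                      timeout_s=1800.0, sleep=time.sleep, clock=time.monotonic):
    """Block until read_temp_c() reports a value below threshold_c.

    Returns the first reading under the threshold, or raises TimeoutError
    if the GPU has not cooled down within timeout_s seconds.
    """
    deadline = clock() + timeout_s
    while True:
        temp = read_temp_c()
        if temp < threshold_c:
            return temp
        if clock() >= deadline:
            raise TimeoutError(f"GPU still at {temp} °C after {timeout_s} s")
        sleep(poll_s)


# Demo with fake readings and a no-op sleep, so it runs instantly anywhere.
readings = iter([71.0, 64.5, 59.0])
result = wait_for_cooldown(lambda: next(readings), sleep=lambda seconds: None)
print(result)  # 59.0
```

In a real run, `read_temp_c` would wrap whatever reports the GPU edge temperature on the machine, and the default `time.sleep` would be left in place.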

Verifying, extending, correcting

If you can reproduce these results on similar hardware — or, more interestingly, can't — open an issue. Include hardware details, ROCm and Mesa versions, llama.cpp commit, and what you saw.

If you find a methodology mistake, also open an issue. Data and the post both get corrected with attribution.

License

| What | License |
| --- | --- |
| Data (CSVs, logs, samples) | CC0 1.0 (public domain; attribution appreciated but not required) |
| Scripts | MIT |

About

128-run Vulkan/RADV vs ROCm 6.4 benchmark on Strix Halo · companion to thefrontierlab.ai Post 2
