Run llama.cpp with Vulkan GPU inference on a Raspberry Pi 5 with an AMD Radeon GPU.
Hardware: Arasaka KAI-7 - a cyberpunk-inspired, RPi 5-based small-form-factor PC with an AMD Radeon RX 5600 XT GPU
Flash Raspberry Pi OS Trixie Lite (64-bit) to your SD card, boot, and run:
```
wget -qO- https://raw.githubusercontent.com/stylesuxx/roast/master/install.sh | sudo bash
```

A reboot is required after the kernel installation. Re-run `sudo roast-setup` after the reboot to complete the setup.
If you prefer building a custom kernel from source (e.g. for a specific kernel version), use the --coreforge flag. This builds the Coreforge GPU-enabled kernel and a patched mesa radv driver. Takes 1-2 hours.
```
sudo roast-setup --coreforge
```

- GPU-accelerated LLM inference on a Raspberry Pi 5 via Vulkan
- ~48 tok/s generation, ~365 tok/s prompt processing with a 7B Q4_K_M model on an RX 5600 XT
- Multiple models can be managed as systemd services on different ports
- Web UI via Open WebUI (optional, Docker-based)
- No Ollama needed - llama-server provides an OpenAI-compatible API directly
- Setup takes about 10 minutes (default) or 1-2 hours (with `--coreforge`)
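Because llama-server speaks the OpenAI chat-completions protocol, any standard client works against it. A minimal sketch of a request payload; the hostname `ai01`, the port, and the model name are placeholders for your own setup:

```python
import json

# llama-server's OpenAI-compatible endpoint (host/port are examples)
url = "http://ai01:8080/v1/chat/completions"

payload = {
    # With a single served model the name mostly matters to client tooling,
    # but the field should be present
    "model": "qwen2.5-7b-instruct-q4_k_m.gguf",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 64,
}

body = json.dumps(payload)
# Send with e.g.:
#   requests.post(url, data=body, headers={"Content-Type": "application/json"})
print(body)
```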
Default (rpi-update) method:
- Fixes the locale to `en_US.UTF-8`
- Runs a full system upgrade
- Removes armhf multiarch (incompatible with Pi 5's 16K page kernel)
- Installs kernel with amdgpu support via rpi-update PR #7113
- Enables PCIe Gen 3
- Installs AMD firmware and Vulkan drivers
- Builds llama.cpp with Vulkan backend
- Installs the `roast` CLI globally
- Optionally installs Docker and Open WebUI
Coreforge method (--coreforge) does the same but replaces step 4 with a kernel build from source, and adds a patched mesa radv driver + memcpy fix for 16K page compatibility.
- Raspberry Pi 5 (aarch64)
- AMD Radeon GPU connected via PCIe (tested with RX 5700 XT / Navi 10)
- External power supply for the GPU
The Pi 5 has a single PCIe x1 Gen 3 slot (~1 GB/s). This limits model load time but not inference speed - once the model is in VRAM, compute happens entirely on the GPU.
What matters most:
- VRAM - determines max model size and context window
- Memory bus width - determines text generation (tg) speed. This is the bottleneck for interactive chat. Wider = faster token output.
- Compute speed - determines prompt processing (pp) speed. Faster GPU = faster context ingestion.
- amdgpu driver support - GCN 5 / Polaris and newer
Important: Text generation is memory-bandwidth limited, not compute limited. A GPU with a wider memory bus (e.g. 192-bit or 256-bit) will generate tokens faster than a GPU with a narrower bus (128-bit), even if the narrower bus GPU has more raw compute power. For example, the RX 5600 XT (192-bit) generates tokens ~50% faster than the RX 9060 XT (128-bit) despite being much slower at prompt processing.
| GPU | VRAM | Bus | Gen | tg speed | Best for |
|---|---|---|---|---|---|
| RX 5600 XT | 6 GB | 192-bit | RDNA1 | ~47 tok/s | Budget, fast tg, small models |
| RX 6600 | 8 GB | 128-bit | RDNA2 | ~32 tok/s | More VRAM, slower tg |
| RX 6700 XT | 12 GB | 192-bit | RDNA2 | ~47 tok/s | Great balance of VRAM + tg |
| RX 6800 | 16 GB | 256-bit | RDNA2 | ~60 tok/s | Best tg + most VRAM |
| RX 7600 | 8 GB | 128-bit | RDNA3 | ~35 tok/s | Faster compute, slower tg |
| RX 7700 XT | 12 GB | 192-bit | RDNA3 | ~50 tok/s | Good balance |
| RX 9060 XT | 16 GB | 128-bit | RDNA4 | ~32 tok/s | Fastest pp (2400 tok/s), 16 GB VRAM, slow tg (default method only, --coreforge not yet supported) |
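The tg numbers in the table track memory bandwidth closely. A back-of-envelope sanity check for the RX 5600 XT, using its spec-sheet bandwidth (each generated token reads roughly all the model weights once, so peak tg ≈ bandwidth / model size):

```python
# RX 5600 XT spec-sheet bandwidth: 192-bit bus x 12 Gbps GDDR6 / 8 bits
bandwidth_gbs = 192 * 12 / 8   # = 288 GB/s
model_gb = 4.4                 # 7B Q4_K_M weights, read once per token

tg_ceiling = bandwidth_gbs / model_gb
print(f"theoretical tg ceiling: ~{tg_ceiling:.0f} tok/s")  # ~65 tok/s
```

The measured ~47 tok/s lands at about 70% of that ceiling, with the gap going to KV-cache reads and kernel overhead.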
- Fresh Raspberry Pi OS Trixie Lite (64-bit / arm64)
- Internet connection
- Time and patience for the kernel build
The roast CLI is installed globally by the setup script.
```
sudo roast add Qwen/Qwen2.5-7B-Instruct-GGUF qwen2.5-7b-instruct-q4_k_m.gguf --port 8080 --enable
```

```
sudo roast add TheBloke/Mistral-7B-GGUF mistral-7b.Q4_K_M.gguf --port 8080 --enable
sudo roast add TheBloke/Codellama-7B-GGUF codellama-7b.Q4_K_M.gguf --port 8081 --enable
```

```
sudo roast list              # List all models and their status
sudo roast status            # Full status (GPU, Vulkan, models, disk)
sudo roast enable <name>     # Start a model service
sudo roast disable <name>    # Stop a model service
sudo roast remove <name>     # Remove service and optionally delete the model file
sudo roast config <name> ... # Change context size, GPU layers, port, etc.
sudo roast bench <name>      # Run llama-bench on a model
```

To run llama-server directly, without a service:

```
/opt/llama.cpp/build/bin/llama-server -m /opt/llama.cpp/models/your-model.gguf --port 8080 -ngl 99
```

After adding a model, run `sudo roast bench <model-name>` to verify performance. The bench command automatically uses the same context size and GPU layer settings as your running service, so the results are representative of real-world performance.
```
sudo roast bench qwen2.5-coder-7b-instruct-q4_k_m
```

Expected results for a 7B Q4_K_M model on an RX 5600 XT (16K context, all layers on GPU):
| Metric | Expected | Problem if lower |
|---|---|---|
| pp (prompt processing) | ~375 tok/s | Layers or KV cache spilling to CPU |
| tg (text generation) | ~46 tok/s | Layers or KV cache spilling to CPU |
The Pi 5 connects to the GPU via a single PCIe x1 Gen 3 lane (~1 GB/s). Once the model is loaded into VRAM, inference happens entirely on the GPU at full speed. But if any part of the model or its KV cache spills to system RAM, every token requires data to cross the PCIe bus, dropping performance by 10-20x.
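The size of the penalty follows from simple bandwidth arithmetic: whatever fraction of the weights lives in system RAM must cross the ~1 GB/s PCIe link on every generated token. An illustrative estimate (assumed 10% spill of a 7B Q4_K_M model; the exact fraction depends on your context and VRAM):

```python
model_gb = 4.4          # 7B Q4_K_M weights
pcie_gbs = 0.9          # usable PCIe x1 Gen 3 throughput (~1 GB/s raw)
spill_fraction = 0.10   # assume 10% of the weights evicted to system RAM

spilled_gb = model_gb * spill_fraction      # crosses PCIe on every token
tg_ceiling = pcie_gbs / spilled_gb
print(f"tg ceiling with 10% spilled: ~{tg_ceiling:.0f} tok/s")  # ~2 tok/s
```

Even a small spill collapses generation from ~46 tok/s to a couple of tokens per second, which is why keeping everything resident in VRAM matters so much.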
For full speed, the model weights + KV cache + compute buffers must all fit in VRAM:
| VRAM | Model | Max context (approximate) |
|---|---|---|
| 6 GB | 7B Q4_K_M (4.4 GB) | ~16K |
| 8 GB | 7B Q4_K_M (4.4 GB) | ~48K |
| 12 GB | 7B Q4_K_M (4.4 GB) | ~128K |
| 12 GB | 13B Q4_K_M (7.9 GB) | ~32K |
| 16 GB | 13B Q4_K_M (7.9 GB) | ~64K |
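For planning, the KV-cache cost can be estimated from the model's architecture: 2 (K and V) × layers × KV heads × head dim × bytes per element, per token of context. The figures below assume Qwen2.5-7B's GQA layout and an fp16 cache; check your own model's `config.json`, since these values vary between model families:

```python
# Assumed Qwen2.5-7B architecture (GQA): verify against the model's config.json
n_layers, n_kv_heads, head_dim = 28, 4, 128
bytes_per_elem = 2        # fp16 KV cache
context = 16384
parallel = 1              # each --parallel slot gets its own KV cache

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V
kv_gib = per_token * context * parallel / 2**30
print(f"KV cache: {kv_gib:.2f} GiB per slot")  # ~0.88 GiB at 16K context
```

Together with the 4.4 GB of weights and a few hundred MB of compute buffers, this is why ~16K context is the practical limit on a 6 GB card, and why doubling `--parallel` doubles the cache cost.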
By default, each model runs with --parallel 1 (single user). This gives one user the full context window. If you need multiple concurrent users (e.g. Open WebUI + aider at the same time), increase it:
```
sudo roast config <model-name> --parallel 2
```

Each slot gets its own KV cache, so the VRAM cost multiplies: 2 slots = 2x the KV cache. On limited VRAM, reduce the context size when increasing parallel slots.
If bench results are significantly below expected (e.g. <20 tok/s), reduce context size, parallel slots, or ensure all layers are on GPU:
```
sudo roast config <model-name> --gpu-layers 99 --context-size 16384 --parallel 1
sudo roast bench <model-name>
```

- Do not run `apt full-upgrade` without checking - it can overwrite the rpi-update kernel. The setup script pins kernel packages to prevent this, but be cautious.
- Do not add `memcpy.so` to `/etc/ld.so.preload` unless you hit alignment errors at runtime
- Do not install `linux-image-arm64` (the Debian generic kernel) - the Pi 5 will not boot
- Do not enable armhf multiarch - the 16K page kernel breaks 32-bit ARM libs
- AMD Ubuntu repos are x86_64 only - not useful on Pi 5
- SearXNG (if installed) has no authentication by default - it's accessible to anyone on your network. Only run it on a trusted local network, or add firewall rules to restrict access.
```
nvtop  # included in the setup
```

Once R.O.A.S.T. is running, you have an OpenAI-compatible API at `http://<hostname>:8080/v1`. Here are some tools that can use it:
- aider - CLI coding agent that can edit files, run commands, and work with git. Great for pair programming from the terminal.

  ```
  pip install aider-chat
  aider --openai-api-base http://ai01:8080/v1 --openai-api-key unused \
    --model openai/qwen2.5-coder-7b-instruct-q4_k_m.gguf --no-auto-commits
  ```

- OpenCode - Terminal-based AI coding agent with file editing, search, and shell command tools. Requires a model with native tool calling support and 32K+ context. Works best with 16 GB+ VRAM (e.g. Qwen 2.5 14B Instruct). See the OpenCode docs for OpenAI-compatible provider configuration.

- Continue - VS Code / JetBrains extension for code completion and chat. Point it at your local API for a self-hosted Copilot alternative.

- Open Interpreter - CLI agent that can run code, manage files, and control your computer via natural language.

  ```
  pip install open-interpreter
  interpreter --api-base http://ai01:8080/v1 --model qwen2.5-coder-7b-instruct-q4_k_m.gguf
  ```

- Open WebUI - Web-based chat interface (included in the setup script). Supports RAG, document upload, and multi-user access.

- LibreChat - Another web chat interface with plugin support and conversation branching.

- n8n - Workflow automation platform. Connect your local LLM to email, Slack, databases, and other services.

- LiteLLM - API proxy that lets you expose your local model as if it were any provider (Anthropic, OpenAI, etc.). Useful for tools that don't support custom endpoints directly.
- Coreforge Linux - GPU-enabled RPi kernel fork
- Coreforge memcpy patch
- RPi kernel build docs
- llama.cpp
- Open WebUI