R.O.A.S.T.

Radeon On ARM, Serving Tokens

Run llama.cpp with Vulkan GPU inference on a Raspberry Pi 5 with an AMD Radeon GPU.

Arasaka KAI-7

Hardware: Arasaka KAI-7 - a cyberpunk-inspired, Raspberry Pi 5-based small-form-factor PC with an AMD Radeon RX 5600 XT GPU

Quick Install

Flash Raspberry Pi OS Trixie Lite (64-bit) to your SD card, boot, and run:

wget -qO- https://raw.githubusercontent.com/stylesuxx/roast/master/install.sh | sudo bash
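
If you prefer to review the script before executing it (always sensible when piping to a shell), the equivalent two-step form is:

wget -qO install.sh https://raw.githubusercontent.com/stylesuxx/roast/master/install.sh
less install.sh    # read through what will be installed
sudo bash install.sh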

A reboot is required after the kernel installation; once the Pi is back up, re-run sudo roast-setup to complete the setup.

Coreforge method (alternative)

If you prefer building a custom kernel from source (e.g. for a specific kernel version), use the --coreforge flag. This builds the Coreforge GPU-enabled kernel and a patched Mesa RADV driver, and takes 1-2 hours.

sudo roast-setup --coreforge

What to Expect

  • GPU-accelerated LLM inference on a Raspberry Pi 5 via Vulkan
  • ~48 tok/s generation, ~365 tok/s prompt processing with a 7B Q4_K_M model on an RX 5600 XT
  • Multiple models can be managed as systemd services on different ports
  • Web UI via Open WebUI (optional, Docker-based)
  • No Ollama needed - llama-server provides an OpenAI-compatible API directly
  • Setup takes about 10 minutes (default) or 1-2 hours (with --coreforge)

What It Does

Default (rpi-update) method:

  1. Fixes locale to en_US.UTF-8
  2. Runs a full system upgrade
  3. Removes armhf multiarch (incompatible with Pi 5's 16K page kernel)
  4. Installs kernel with amdgpu support via rpi-update PR #7113
  5. Enables PCIe Gen 3
  6. Installs AMD firmware and Vulkan drivers
  7. Builds llama.cpp with Vulkan backend
  8. Installs the roast CLI globally
  9. Optionally installs Docker and Open WebUI

The Coreforge method (--coreforge) does the same, but replaces step 4 with a kernel build from source and adds a patched Mesa RADV driver plus a memcpy fix for 16K-page compatibility.

Hardware

  • Raspberry Pi 5 (aarch64)
  • AMD Radeon GPU connected via PCIe (tested with RX 5700 XT / Navi 10)
  • External power supply for the GPU

GPU Selection Guide

The Pi 5 has a single PCIe x1 Gen 3 slot (~1 GB/s). This limits model load time but not inference speed - once the model is in VRAM, compute happens entirely on the GPU. (At ~1 GB/s, loading a 4.4 GB 7B Q4_K_M model takes on the order of five seconds.)

What matters most:

  • VRAM - determines max model size and context window
  • Memory bus width - determines text generation (tg) speed. This is the bottleneck for interactive chat. Wider = faster token output.
  • Compute speed - determines prompt processing (pp) speed. Faster GPU = faster context ingestion.
  • amdgpu driver support - GCN 5 / Polaris and newer

Important: Text generation is memory-bandwidth limited, not compute limited. A GPU with a wider memory bus (e.g. 192-bit or 256-bit) will generate tokens faster than a GPU with a narrower bus (128-bit), even if the narrower bus GPU has more raw compute power. For example, the RX 5600 XT (192-bit) generates tokens ~50% faster than the RX 9060 XT (128-bit) despite being much slower at prompt processing.
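
As a rough sanity check (assuming ~288 GB/s for the RX 5600 XT, i.e. 12 Gbps GDDR6 on a 192-bit bus): every generated token streams the full weight set from VRAM, so bandwidth divided by model size gives an upper bound on tg speed.

# tg upper bound ~= memory bandwidth / bytes read per token (~model size)
# RX 5600 XT: 288 GB/s / 4.4 GB (7B Q4_K_M) ~= 65 tok/s ceiling
echo "288 / 4.4" | bc -l    # observed ~47 tok/s sits below this due to overheads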

GPU          VRAM    Bus      Gen     tg speed    Best for
RX 5600 XT   6 GB    192-bit  RDNA1   ~47 tok/s   Budget, fast tg, small models
RX 6600      8 GB    128-bit  RDNA2   ~32 tok/s   More VRAM, slower tg
RX 6700 XT   12 GB   192-bit  RDNA2   ~47 tok/s   Great balance of VRAM + tg
RX 6800      16 GB   256-bit  RDNA2   ~60 tok/s   Best tg + most VRAM
RX 7600      8 GB    128-bit  RDNA3   ~35 tok/s   Faster compute, slower tg
RX 7700 XT   12 GB   192-bit  RDNA3   ~50 tok/s   Good balance
RX 9060 XT   16 GB   128-bit  RDNA4   ~32 tok/s   Fastest pp (~2400 tok/s), 16 GB VRAM, slow tg (default method only; --coreforge not yet supported)

Requirements

  • Fresh Raspberry Pi OS Trixie Lite (64-bit / arm64)
  • Internet connection
  • Time and patience for the kernel build (--coreforge only; the default method takes about 10 minutes)

Model Manager

The roast CLI is installed globally by the setup script.

Add a model

sudo roast add Qwen/Qwen2.5-7B-Instruct-GGUF qwen2.5-7b-instruct-q4_k_m.gguf --port 8080 --enable

Multiple models on different ports

sudo roast add TheBloke/Mistral-7B-GGUF mistral-7b.Q4_K_M.gguf --port 8080 --enable
sudo roast add TheBloke/Codellama-7B-GGUF codellama-7b.Q4_K_M.gguf --port 8081 --enable

Manage models

sudo roast list              # List all models and their status
sudo roast status            # Full status (GPU, Vulkan, models, disk)
sudo roast enable <name>     # Start a model service
sudo roast disable <name>    # Stop a model service
sudo roast remove <name>     # Remove service and optionally delete the model file
sudo roast config <name> ... # Change context size, GPU layers, port, etc.
sudo roast bench <name>      # Run llama-bench on a model

Manual run

/opt/llama.cpp/build/bin/llama-server -m /opt/llama.cpp/models/your-model.gguf --port 8080 -ngl 99
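
Here -ngl 99 tells llama-server to offload all model layers to the GPU; any value at or above the model's layer count offloads everything.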

Performance Tuning

After adding a model, run sudo roast bench <model-name> to verify performance. The bench command automatically uses the same context size and GPU layer settings as your running service, so the results are representative of real-world performance.

sudo roast bench qwen2.5-coder-7b-instruct-q4_k_m

Expected results for a 7B Q4_K_M model on an RX 5600 XT (16K context, all layers on GPU):

Metric                  Expected     Problem if lower
pp (prompt processing)  ~375 tok/s   Layers or KV cache spilling to CPU
tg (text generation)    ~46 tok/s    Layers or KV cache spilling to CPU

Why performance can be slow

The Pi 5 connects to the GPU via a single PCIe x1 Gen 3 lane (~1 GB/s). Once the model is loaded into VRAM, inference happens entirely on the GPU at full speed. But if any part of the model or its KV cache spills to system RAM, every token requires data to cross the PCIe bus, dropping performance by 10-20x.

For full speed, the model weights + KV cache + compute buffers must all fit in VRAM:

VRAM    Model                 Max context (approximate)
6 GB    7B Q4_K_M (4.4 GB)    ~16K
8 GB    7B Q4_K_M (4.4 GB)    ~48K
12 GB   7B Q4_K_M (4.4 GB)    ~128K
12 GB   13B Q4_K_M (7.9 GB)   ~32K
16 GB   13B Q4_K_M (7.9 GB)   ~64K
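
To estimate the KV cache term yourself, a sketch (assuming Qwen2.5-7B's layout: 28 layers, 4 KV heads under GQA, head dim 128, f16 cache; other models differ):

# KV cache per token = 2 (K+V) x layers x kv_heads x head_dim x 2 bytes (f16)
# Qwen2.5-7B: 2 x 28 x 4 x 128 x 2 = 57,344 bytes ~= 56 KiB per token
echo "16384 * 57344 / 2^30" | bc -l    # ~0.875 GiB of KV cache at 16K context

Adding that to 4.4 GB of weights plus compute buffers nearly fills a 6 GB card, which matches the ~16K limit in the table.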

Parallel slots

By default, each model runs with --parallel 1 (single user). This gives one user the full context window. If you need multiple concurrent users (e.g. Open WebUI + aider at the same time), increase it:

sudo roast config <model-name> --parallel 2

Each slot gets its own KV cache, so the VRAM cost multiplies: 2 slots = 2x KV cache. On limited VRAM, reduce context size when increasing parallel slots.
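
For example (hypothetical values - halving the context keeps the total KV cache roughly constant while doubling slots):

sudo roast config <model-name> --parallel 2 --context-size 8192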

Fixing slow performance

If bench results are significantly below expected (e.g. <20 tok/s), reduce context size, parallel slots, or ensure all layers are on GPU:

sudo roast config <model-name> --gpu-layers 99 --context-size 16384 --parallel 1
sudo roast bench <model-name>

Things to Avoid

  • Do not run apt full-upgrade without checking - it can overwrite the rpi-update kernel. The setup script pins kernel packages to prevent this, but be cautious (see the check after this list).
  • Do not add memcpy.so to /etc/ld.so.preload unless you hit alignment errors at runtime
  • Do not install linux-image-arm64 (Debian generic kernel) - Pi 5 will not boot
  • Do not enable armhf multiarch - 16K page kernel breaks 32-bit ARM libs
  • AMD Ubuntu repos are x86_64 only - not useful on Pi 5
  • SearXNG (if installed) has no authentication by default - it's accessible to anyone on your network. Only run it on a trusted local network, or add firewall rules to restrict access.
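
To see how the kernel packages are protected before upgrading, a quick check (a sketch assuming the script uses apt holds; it may use pin files under /etc/apt/preferences.d instead):

apt-mark showhold              # packages held back from upgrades, if holds are used
ls /etc/apt/preferences.d/     # pin files, if the script installs any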

GPU Monitoring

nvtop    # included in the setup

Next Steps

Once R.O.A.S.T. is running, you have an OpenAI-compatible API at http://<hostname>:8080/v1. Here are some tools that can use it:
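
A quick smoke test from any machine on your network (the hostname ai01 and the model name are taken from the examples below; substitute your own):

curl http://ai01:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-coder-7b-instruct-q4_k_m.gguf", "messages": [{"role": "user", "content": "Hello!"}]}'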

Coding Assistants

  • aider - CLI coding agent that can edit files, run commands, and work with git. Great for pair programming from the terminal.

    pip install aider-chat
    aider --openai-api-base http://ai01:8080/v1 --openai-api-key unused \
          --model openai/qwen2.5-coder-7b-instruct-q4_k_m.gguf --no-auto-commits
  • OpenCode - Terminal-based AI coding agent with file editing, search, and shell command tools. Requires a model with native tool calling support and 32K+ context. Works best with 16GB+ VRAM (e.g. Qwen 2.5 14B Instruct). See the OpenCode docs for OpenAI-compatible provider configuration.

  • Continue - VS Code / JetBrains extension for code completion and chat. Point it at your local API for a self-hosted Copilot alternative.

  • Open Interpreter - CLI agent that can run code, manage files, and control your computer via natural language.

    pip install open-interpreter
    interpreter --api-base http://ai01:8080/v1 --model qwen2.5-coder-7b-instruct-q4_k_m.gguf

Chat and Knowledge

  • Open WebUI - Web-based chat interface (included in the setup script). Supports RAG, document upload, and multi-user access.

  • LibreChat - Another web chat interface with plugin support and conversation branching.

Automation

  • n8n - Workflow automation platform. Connect your local LLM to email, Slack, databases, and other services.

  • LiteLLM - API proxy that lets you expose your local model as if it were any provider (Anthropic, OpenAI, etc). Useful for tools that don't support custom endpoints directly.
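
A minimal proxy invocation, assuming LiteLLM's --model and --api_base CLI flags (check the LiteLLM docs for your version):

pip install 'litellm[proxy]'
litellm --model openai/qwen2.5-coder-7b-instruct-q4_k_m.gguf --api_base http://ai01:8080/v1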

License

MIT
