# Running Large Language Models


## Goals

* Learn to use `llama.cpp` to run inference on the AMD GPU


# LLMs with llama.cpp

In this notebook we will use `llama.cpp` to run LLMs (Large Language Models). `llama.cpp` supports model loading and inference on a variety of CPU and GPU platforms, including Ryzen AI through its ROCm and Vulkan backends.

To launch `llama.cpp`, open a separate terminal and run the commands below. `llama-cli` will download the model from Hugging Face (the `-hf` flag), detect the AMD GPU, and start a chat dialog. Try asking the model a question.

![](images/new_terminal.png)

```bash
unset HSA_OVERRIDE_GFX_VERSION
export PATH=/ryzers/llamacpp/build/bin/:$PATH
llama-cli -hf unsloth/Qwen3-1.7B-GGUF:Q4_K_M
```
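If the shell reports that `llama-cli` is not found, the `PATH` export probably did not take effect in that terminal. A minimal sketch of a quick check follows; `require_tools` is a hypothetical helper name used here for illustration and is not part of `llama.cpp`:

```bash
# Hypothetical helper: report which of the named commands are missing from PATH.
# Returns 0 if every command is found, 1 otherwise.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

After defining it, `require_tools llama-cli llama-bench` should print nothing when both binaries are reachable.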

`llama.cpp` also supports a server/client architecture and can serve models from a variety of model zoos. We will cover this in the next notebook.



## Benchmarking models and runtime selection

On this platform, `llama.cpp` is compiled with both the ROCm and Vulkan backends. `llama-bench` is a utility that benchmarks a model on one or more backends. Run the command below in a terminal to compare model execution under ROCm and Vulkan.

```bash
llama-bench -m /ryzers/.cache/llamacpp/unsloth_Llama-3.2-3B-Instruct-GGUF_Llama-3.2-3B-Instruct-Q4_K_M.gguf -dev ROCm0,Vulkan0
```

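You should see output similar to the following. The sample was captured with a different model (a 1B Q8_0 Llama model), so the model name, size, and t/s figures will differ for the 3B model used in the command above: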

```text
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | dev          |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------ | --------------: | -------------------: |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | ROCm,Vulkan |  99 | ROCm0        |           pp512 |      4413.60 ± 38.18 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | ROCm,Vulkan |  99 | ROCm0        |           tg128 |        135.61 ± 0.78 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | ROCm,Vulkan |  99 | Vulkan0      |           pp512 |      4251.58 ± 45.03 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | ROCm,Vulkan |  99 | Vulkan0      |           tg128 |        138.42 ± 0.29 |
```



The output reports, for each device (ROCm0 or Vulkan0), throughput in tokens per second (**t/s**) as a mean ± standard deviation over repeated runs. The two tests are **pp512** and **tg128**. **pp512** is a prompt processing test: it measures how quickly the model processes a 512-token prompt. **tg128** is a token generation test: it measures how quickly the model generates 128 tokens. The **ngl** column is the number of model layers offloaded to the GPU.
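To compare backends at a glance, the table can be reduced to one line per device and test. The following is a minimal sketch, assuming the column layout of the sample output above (device in the sixth column, test in the seventh, t/s in the eighth); `summarize_bench` is a hypothetical helper name, not part of `llama.cpp`:

```bash
# Hypothetical helper: read llama-bench markdown output on stdin and print
# "<device> <test> <mean t/s>" for every result row.
summarize_bench() {
  awk -F'|' '
    # Keep only table rows (they start with "|") whose t/s field contains a
    # number; this skips the header row, the separator row, and log lines.
    substr($0, 1, 1) == "|" && $9 ~ /[0-9]/ {
      dev = $7; test = $8; tps = $9
      gsub(/^ +| +$/, "", dev)    # trim padding around the device name
      gsub(/^ +| +$/, "", test)   # trim padding around the test name
      split(tps, parts, " ")      # "mean ± stddev": parts[1] is the mean
      printf "%s %s %s\n", dev, test, parts[1]
    }'
}
```

One way to use it is to pipe the benchmark into the function, for example `llama-bench -m <model.gguf> -dev ROCm0,Vulkan0 | summarize_bench`, where `<model.gguf>` stands for the model path used earlier.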



## References

* [llama.cpp](https://github.com/ggml-org/llama.cpp)




---
Copyright© 2025 AMD, Inc SPDX-License-Identifier: MIT