# LLM Inference with AMD Radeon™ and Instinct™ GPUs
In this tutorial, we’ll explore how to use Hugging Face Transformers, Text Generation Inference (TGI), and vLLM to serve and test LLMs on AMD hardware. You’ll learn how to validate a ROCm setup for AMD Radeon™ and Instinct™ GPUs, set environment variables for multi-GPU setups (e.g., `HIP_VISIBLE_DEVICES`), and launch models in reduced precision (such as FP16 or BF16) to balance performance and memory usage. We’ll also walk through containerizing your workflow with Docker to get a reproducible setup for model deployment. By following these steps, you’ll be able to serve and query LLMs in a ROCm-accelerated environment on AMD GPUs.

## Prerequisites
### 1. Hardware Requirements
* AMD GPUs supported by ROCm (e.g., Instinct™ MI210, Instinct™ MI300X, Radeon™ RX 7900 XT)
* A system that meets the ROCm system requirements, including ROCm 6.0+ and Ubuntu 22.04

### 2. Software
* **ROCm installed** and verified on your system
* **Docker installed** on your system. Refer to the [Docker installation guide](https://docs.docker.com/get-docker/) if needed

### 3. System Configuration
* **NUMA auto-balancing** disabled for optimal performance
```bash
(shell)sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
(shell)cat /proc/sys/kernel/numa_balancing
--> output should be 0
```
* **ROCm** environment validated using rocm-smi
```bash
(shell)rocm-smi
--> you should see all available GPUs, as in the table below (MI300X case):

============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp        Power     Partitions          SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%  
              (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)                                                  
==========================================================================================================================
0       2     0x74a1,   28851  41.0°C      133.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%    
1       3     0x74a1,   51499  37.0°C      133.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%    
2       4     0x74a1,   57603  38.0°C      136.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%    
3       5     0x74a1,   22683  34.0°C      133.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%    
4       6     0x74a1,   53458  38.0°C      133.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%    
5       7     0x74a1,   26954  35.0°C      132.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%    
6       8     0x74a1,   16738  39.0°C      134.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%    
7       9     0x74a1,   63738  37.0°C      131.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%    
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================
```
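
The two checks above can also be scripted. The following is a minimal sketch, not an official ROCm utility: the helper names `numa_balancing_disabled` and `count_gpus` are hypothetical, and the row-matching pattern simply assumes the concise `rocm-smi` table layout shown above (a device index, a node index, then a hexadecimal device ID followed by a comma).

```python
import re


def numa_balancing_disabled(proc_value: str) -> bool:
    """Return True when the text read from /proc/sys/kernel/numa_balancing is '0'."""
    return proc_value.strip() == "0"


# Assumed row shape, taken from the table above: "<device>  <node>  0x<DID>, ..."
_GPU_ROW = re.compile(r"^\s*\d+\s+\d+\s+0x[0-9a-fA-F]+,")


def count_gpus(smi_output: str) -> int:
    """Count the device rows in the concise rocm-smi table."""
    return sum(1 for line in smi_output.splitlines() if _GPU_ROW.match(line))


# Two rows copied (and shortened) from the sample table above.
sample = """\
Device  Node  IDs              Temp        Power
0       2     0x74a1,   28851  41.0°C      133.0W
1       3     0x74a1,   51499  37.0°C      133.0W
"""

print(numa_balancing_disabled("0\n"))  # True
print(count_gpus(sample))              # 2
```

On a real system, the inputs would come from reading `/proc/sys/kernel/numa_balancing` and from capturing the output of `rocm-smi`, for example with `subprocess.run(["rocm-smi"], capture_output=True, text=True).stdout`.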
### 4. Hugging Face API Access
* Obtain an API token from Hugging Face for downloading models
* Ensure you have a Hugging Face API token with the necessary permissions and approval to access [Meta's LLaMA checkpoints](https://huggingface.co/meta-llama)
* The access token starts with `hf_` (for example, `hf_xxxxxx`). You will need it for the rest of the tutorial, so save it somewhere safe, such as a separate text file

## Inference on Hugging Face Transformers
Hugging Face Transformers is a popular open-source library that provides an easy-to-use interface for working with state-of-the-art language models, such as BERT, GPT, and Llama variants. These models can be fine-tuned or used off-the-shelf for tasks like text generation, question answering, and sentiment analysis.   

In this tutorial, we’ll demonstrate how to run inference on Hugging Face Transformers models using AMD Radeon™ and Instinct™ GPUs. We will cover launching a ROCm-enabled container, installing the necessary libraries, and running an LLM (`meta-llama/Meta-Llama-3.1-8B-Instruct`) in a containerized environment.

### Prepare Inference Environment
#### Step 1: Launch the Docker Container
Run the following command in your terminal to pull the prebuilt Docker image containing all necessary dependencies and launch the Docker container with proper configuration:
```bash
(shell)docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 1G --security-opt seccomp=unconfined --security-opt apparmor=unconfined -p 8888:8888 -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace rocm/pytorch:latest
--> if docker is launched it will look like root@xxx:
```
#### Step 2: Install Necessary Libraries
This step sets up the essential dependencies needed for working with large language models and running your workflows in a notebook interface:
```bash
(docker)pip install accelerate transformers jupyter
```
#### Step 3: Provide Your Hugging Face API Token
A Hugging Face token can be generated by signing into your account at **[Hugging Face Tokens](https://huggingface.co/settings/tokens)**.
Ensure the token has the necessary permissions for your tasks (e.g., "read" or "write"). Tokens typically start with "hf_".
```bash
(docker)cd /workspace
(docker)export HF_TOKEN="Your Hugging Face token to access gated models"
--> example: (docker)export HF_TOKEN=hf_xxxx
```
**Note:** The `-v $(pwd):/workspace` option in the `docker run` command from Step 1 mounts the current host directory to **/workspace** in the container, allowing files to be shared between the host and the container.
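
Before downloading a gated model, it can help to confirm that the token is actually visible to the process. Here is a minimal sketch, assuming only what is stated above (the variable is named `HF_TOKEN` and tokens typically start with `hf_`). The helper `looks_like_hf_token` is hypothetical and does not contact Hugging Face, so it cannot tell whether the token is valid or has been granted access to the model.

```python
import os


def looks_like_hf_token(value):
    """Cheap sanity check: a string that starts with 'hf_' and has more after it."""
    return isinstance(value, str) and value.startswith("hf_") and len(value) > len("hf_")


# Read the variable exported in the step above; None if it was never set.
token = os.environ.get("HF_TOKEN")
if looks_like_hf_token(token):
    print("HF_TOKEN is set and has the expected prefix")
else:
    print("HF_TOKEN is missing or does not start with 'hf_'")
```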

### Run LLM Inference using Hugging Face Transformers 
Inside the Docker container, run the following command to launch the Jupyter Notebook server:
```bash
(docker)jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root
```
**Note:** Save the token or URL printed in the terminal output; you need it to access the notebook from your host machine.

Now switch to the Jupyter notebook and execute the following Python code:

```python
import torch
import transformers

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Load the model in bfloat16 and let device_map="auto" place it on the available GPU(s)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

query = "Explain the concept of AI to me."
messages = [
    {"role": "system", "content": "You are an expert in the field of AI. Make sure to provide an explanation in few sentences."},
    {"role": "user", "content": query},
]

# Generate a reply of at most 512 new tokens
outputs = pipeline(
    messages,
    max_new_tokens=512,
    top_p=0.7,
    temperature=0.2,
)

# The last message in the returned conversation is the assistant's reply
response = outputs[0]["generated_text"][-1]["content"]
print("-------------------------------")
print("Query:\n", query)
print("-------------------------------")
print("Response:\n", response)
```
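
When the pipeline is given a list of chat messages, `generated_text` comes back as the same conversation with the model's reply appended, which is why the code above indexes `[-1]` and then `content`. The sketch below illustrates that indexing on a hand-written stand-in for the pipeline output. The `mock_outputs` value and the `last_assistant_reply` helper are illustrative only, not real model output or part of the Transformers API.

```python
def last_assistant_reply(outputs):
    """Return the content of the final message in the first generated conversation."""
    conversation = outputs[0]["generated_text"]
    return conversation[-1]["content"]


# Hand-written stand-in with the same shape as the pipeline's return value.
mock_outputs = [
    {
        "generated_text": [
            {"role": "system", "content": "You are an expert in the field of AI."},
            {"role": "user", "content": "Explain the concept of AI to me."},
            {"role": "assistant", "content": "(model reply would appear here)"},
        ]
    }
]

print(last_assistant_reply(mock_outputs))
```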

## Inference on Hugging Face TGI
Hugging Face Text Generation Inference (TGI) is a high-performance, low-latency solution for serving advanced language models in production. It streamlines the process of text generation, enabling developers to deploy and scale language models for tasks like summarization, conversational AI, and content creation.

In this tutorial, we’ll demonstrate how to configure and run TGI using AMD Radeon™ and Instinct™ GPUs, leveraging the ROCm software stack for accelerated performance. You’ll learn how to set up your environment, containerize your workflow, and test your inference server by sending customized queries. 

### Prepare Inference Environment
#### Step 1: Launch the Docker Container
Run the following command in your terminal to pull the prebuilt Docker image containing all necessary dependencies and launch the Docker container with proper configuration:
```bash
(shell)docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 1G --security-opt seccomp=unconfined --security-opt apparmor=unconfined -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --ipc=host --net host --entrypoint /bin/bash ghcr.io/huggingface/text-generation-inference:latest-rocm
--> if docker is launched it will look like root@xxx:
```
#### Step 2: Provide Your Hugging Face API Token
A Hugging Face token can be generated by signing into your account at **[Hugging Face Tokens](https://huggingface.co/settings/tokens)**.
Ensure the token has the necessary permissions for your tasks (e.g., "read" or "write"). Tokens typically start with "hf_".
```bash
(docker)cd /workspace
(docker)export HF_TOKEN="Your Hugging Face token to access gated models"
--> example: (docker)export HF_TOKEN=hf_xxxx
```
**Note:** The `-v $(pwd):/workspace` option in the `docker run` command from Step 1 mounts the current host directory to **/workspace** in the container, allowing files to be shared between the host and the container.

### Deploying LLM using Hugging Face TGI 
Start deploying the LLM (`meta-llama/Llama-3.1-8B-Instruct`) using Hugging Face TGI:
```bash
(docker)HIP_VISIBLE_DEVICES=4 text-generation-launcher --model-id  meta-llama/Llama-3.1-8B-Instruct --num-shard 1 --cuda-graphs 1 --max-batch-prefill-tokens 131072 --max-batch-total-tokens 139264 --dtype float16 --port 8000 --trust-remote-code
--> after successful connection it will show: INFO text_generation_router::server: router/src/server.rs:2402: Connected
```
**Note:** In a multi-GPU environment, it is recommended to set **HIP_VISIBLE_DEVICES=x** to deploy the LLM on the user’s preferred GPU.
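
Setting the variable inline, as in the command above, affects only that one process. If you start the server from a script instead, a similar effect can be had by passing a modified environment to the child process. The following is a sketch using only the Python standard library. `env_for_gpu` is a hypothetical helper, and the child command is a placeholder that just prints the variable rather than launching a server.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index, base_env=None):
    """Copy an environment and restrict HIP to a single GPU index."""
    env = dict(os.environ if base_env is None else base_env)
    env["HIP_VISIBLE_DEVICES"] = str(gpu_index)
    return env


# Placeholder child process: prints the value it inherited from the parent.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['HIP_VISIBLE_DEVICES'])"],
    env=env_for_gpu(4),
    capture_output=True,
    text=True,
    check=True,
)
print(child.stdout.strip())  # 4
```

Copying the environment instead of mutating `os.environ` keeps the restriction local to the child, so other processes started by the same script still see every GPU.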

The LLM running in the Docker container acts as a server. To test it, open a new shell and send a query to the server.
```bash
(shell) curl -X POST http://localhost:8000/generate \
     -H "Content-Type: application/json" \
     -d '{
       "inputs": "System: You are an expert in the field of AI. Make sure to provide an explanation in few sentences.\nUser: Explain the concept of AI to me.\nAssistant:",
       "parameters": {
         "max_new_tokens": 128,
         "do_sample": false
       }
     }'
```
**Note:** The port in the request URL (http://localhost:**8000**) must match the `--port` value passed to `text-generation-launcher` (**8000**). If that port is already used by another application, choose a different number in both places.

If the connection is successful the output will be:
```bash
{"generated_text":" AI, or Artificial Intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence, such as learning, problem-solving, ...}
```
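
The same request can be sent from Python. The following sketch uses only the standard library and mirrors the `curl` call above (same `/generate` path, same JSON fields). The function names and the default base URL are assumptions for illustration, and `generate` needs the TGI server from the previous step to be running, so it is defined here but not called.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=128, base_url="http://localhost:8000"):
    """Build the POST request for the /generate endpoint without sending it."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": False},
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the 'generated_text' field of the reply."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.load(resp)["generated_text"]


# Build (but do not send) a request, then inspect what would go over the wire.
request = build_generate_request("User: Explain the concept of AI to me.\nAssistant:")
print(request.full_url)
print(request.data.decode("utf-8"))
```

With the server up, `print(generate("User: Explain the concept of AI to me.\nAssistant:"))` would print the generated text.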

## Inference on vLLM
vLLM is an open-source library designed to deliver high throughput and low latency for large language model inference. It optimizes text generation workloads by efficiently batching requests and making full use of GPU resources, enabling developers to handle demanding tasks such as summarization, code generation, and conversational AI at scale.

In this tutorial, we’ll guide you through setting up and running vLLM on AMD Radeon™ and Instinct™ GPUs using the ROCm software stack. You’ll learn how to configure your environment, containerize your workflow, and send test queries to the inference server supported by vLLM.  

### Prepare Inference Environment
#### Step 1: Launch the Docker Container
Run the following command in your terminal to pull the prebuilt Docker image containing all necessary dependencies and launch the Docker container with proper configuration:
```bash
(shell)docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 1G --security-opt seccomp=unconfined --security-opt apparmor=unconfined -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --ipc=host --net host --entrypoint /bin/bash rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
--> if docker is launched it will look like root@xxx:
```
#### Step 2: Provide Your Hugging Face API Token
A Hugging Face token can be generated by signing into your account at **[Hugging Face Tokens](https://huggingface.co/settings/tokens)**.
Ensure the token has the necessary permissions for your tasks (e.g., "read" or "write"). Tokens typically start with "hf_".
```bash
(docker)cd /workspace
(docker)export HF_TOKEN="Your Hugging Face token to access gated models"
--> example: (docker)export HF_TOKEN=hf_xxxx
```
**Note:** The `-v $(pwd):/workspace` option in the `docker run` command from Step 1 mounts the current host directory to **/workspace** in the container, allowing files to be shared between the host and the container.

### Deploying LLM using vLLM
Start deploying the LLM (`meta-llama/Meta-Llama-3.1-8B-Instruct`) using vLLM:
```bash
(docker)HIP_VISIBLE_DEVICES=2 python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Meta-Llama-3.1-8B-Instruct \
        --gpu-memory-utilization 0.9 \
        --swap-space 16 \
        --disable-log-requests \
        --dtype float16 \
        --max-model-len 131072 \
        --tensor-parallel-size 1 \
        --host 0.0.0.0 \
        --port 3000 \
        --num-scheduler-steps 10 \
        --enable-chunked-prefill False \
        --max-num-seqs 128 \
        --max-num-batched-tokens 131072 \
        --distributed-executor-backend "mp"
--> after successful connection it will show: INFO: Uvicorn running on socket ('0.0.0.0', 3000) (Press CTRL+C to quit)
```
**Note:** In a multi-GPU environment, it is recommended to set **HIP_VISIBLE_DEVICES=x** to deploy the LLM on the user’s preferred GPU.

The LLM running in the Docker container acts as a server. To test it, open a new shell and send a query to the server.
```bash
(shell) curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are an expert in the field of AI. Make sure to provide an explanation in few sentences."
    },
    {
      "role": "user",
      "content": "Explain the concept of AI."
    }
  ],
  "stream": false,
  "max_tokens": 128  

}' \
    -H 'Content-Type: application/json'
```
**Note:** The port in the request URL (http://localhost:**3000**) must match the `--port` value passed to the vLLM server (**3000**). If that port is already used by another application, choose a different number in both places.

If the connection is successful the output will be:
```bash
{"id":"chat-xx","object":"chat.completion","created":1736494622,"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Artificial Intelligence (AI) is a field of computer science ...}
```
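
A Python client for the same endpoint can be sketched with the standard library alone. The request fields below mirror the `curl` call above, and the reply is read from `choices[0].message.content`, following the response shown. The helper names, the default base URL, and the trimmed `sample_response` are assumptions for illustration; nothing is sent over the network in this snippet.

```python
import json
import urllib.request


def build_chat_request(messages, model, max_tokens=128, base_url="http://localhost:3000"):
    """Build the POST request for /v1/chat/completions without sending it."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response):
    """Pull the assistant text out of a chat.completion response dict."""
    return response["choices"][0]["message"]["content"]


messages = [
    {"role": "system", "content": "You are an expert in the field of AI."},
    {"role": "user", "content": "Explain the concept of AI."},
]
request = build_chat_request(messages, "meta-llama/Meta-Llama-3.1-8B-Instruct")
print(request.full_url)

# Trimmed, hand-written stand-in shaped like the response shown above.
sample_response = {
    "object": "chat.completion",
    "choices": [{"index": 0, "message": {"role": "assistant", "content": "(reply text)"}}],
}
print(first_reply(sample_response))
```

With the server running, the request can be sent with `urllib.request.urlopen(request)` and the decoded JSON body passed to `first_reply`.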


## Conclusion
By following this tutorial, you’ve learned how to set up your environment for AMD Radeon™ or Instinct™ GPUs using ROCm, install and configure Hugging Face Transformers, TGI, and vLLM, and serve large language models for high-performance inference. You also discovered how to containerize your workflow and test the deployment by sending text-generation queries to the inference server. 