## Inference on vLLM
vLLM is an open-source library for high-throughput, low-latency large language model (LLM) inference. It speeds up text generation by batching requests efficiently and making full use of GPU resources, which makes it well suited to demanding workloads such as code generation and large-scale conversational AI.

In this tutorial, we’ll guide you through setting up and running vLLM on AMD Radeon™ and Instinct™ GPUs using the ROCm software stack. You’ll learn how to configure your environment, run your workflow in a container, and send test queries to the inference server that vLLM provides.

### Prepare Inference Environment
#### 1. Launch the Docker Container
Run the following command in your terminal to pull the prebuilt Docker image, which contains all necessary dependencies, and launch the container with the required configuration:
```bash
docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 1G --security-opt seccomp=unconfined --security-opt apparmor=unconfined -v "$(pwd)":/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --ipc=host --net host --entrypoint /bin/bash rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
```
Once the container has started, the shell prompt changes to something like `root@xxx:`. Then switch to the mounted workspace directory inside the container:
```bash
cd /workspace
```
**Note:** The `-v "$(pwd)":/workspace` option mounts the current host directory to **/workspace** in the container, allowing files to be shared between the host and the container.

#### 2. Install and Launch Jupyter
Inside the Docker container, install and launch Jupyter using the following commands:
```bash
pip install --upgrade pip setuptools wheel
pip install jupyter
jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root
```
**Note:** Save the token or URL printed in the terminal output. You need it to access the notebook from your host machine.

#### 3. Provide Your Hugging Face Token
You will need a Hugging Face API token to access meta-llama/Llama-3.1-8B-Instruct. Tokens typically start with "hf_". Generate a token on the Hugging Face Tokens page and request access to [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

Run the following interactive block in your Jupyter notebook to set up the token.

**Note:** Please uncheck the "Add token as Git credential?" option.

```python
from huggingface_hub import notebook_login, HfApi

# Prompt the user to log in
notebook_login()
```

Verify that your token was captured correctly:

```python
# Validate the token
try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Token validated successfully! Logged in as: {user_info['name']}")
except Exception as e:
    print(f"Token validation failed. Error: {e}")
```
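Because tokens typically start with "hf_", a quick format check can catch copy-paste mistakes before you attempt to log in. The sketch below is only a heuristic: the helper name `looks_like_hf_token` is hypothetical, and passing the check does not mean the token is valid or has access to the model.

```python
def looks_like_hf_token(token: str) -> bool:
    """Heuristic format check; it does not contact Hugging Face."""
    prefix = "hf_"
    # Reject tokens that contain whitespace, e.g. a stray space or newline
    # picked up while copying.
    if any(ch.isspace() for ch in token):
        return False
    # Require the usual prefix plus at least one more character.
    return token.startswith(prefix) and len(token) > len(prefix)


# The value below is a made-up placeholder, not a real token.
print(looks_like_hf_token("hf_exampleToken123"))
```

Use the `api.whoami()` cell above for the authoritative check.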

### Deploying the LLM Using vLLM
Run the following notebook cell to launch the vLLM OpenAI-compatible API server and serve meta-llama/Llama-3.1-8B-Instruct:

```python
!HIP_VISIBLE_DEVICES=2 python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Meta-Llama-3.1-8B-Instruct \
        --gpu-memory-utilization 0.9 \
        --swap-space 16 \
        --disable-log-requests \
        --dtype float16 \
        --max-model-len 131072 \
        --tensor-parallel-size 1 \
        --host 0.0.0.0 \
        --port 3000 \
        --num-scheduler-steps 10 \
        --enable-chunked-prefill False \
        --max-num-seqs 128 \
        --max-num-batched-tokens 131072 \
        --distributed-executor-backend "mp"
```

Once the server has started successfully, the log shows a line such as `INFO: Uvicorn running on socket ('0.0.0.0', 3000) (Press CTRL+C to quit)`.

**Note:** In a multi-GPU environment, it is recommended to set **HIP_VISIBLE_DEVICES=x**, where x is the GPU index, to deploy the LLM on your preferred GPU. The command above uses GPU 2.
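Loading the model can take a while, so it may help to poll the server from a second notebook until it responds instead of watching the log. The following is a minimal sketch, not part of the original workflow. It assumes the server exposes the OpenAI-style `/v1/models` route on port 3000; the helper names `server_is_up` and `wait_for_server` and the default timings are illustrative choices.

```python
import http.client
import time
import urllib.request


def server_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False


def wait_for_server(url, max_wait=600.0, interval=5.0,
                    probe=server_is_up, sleep=time.sleep):
    """Poll `probe(url)` until it succeeds or `max_wait` seconds pass."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Example call (assumed endpoint; adjust the port to your --port value):
# wait_for_server("http://localhost:3000/v1/models")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.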

The LLM running in the Docker container acts as a server. To test it, open a new notebook and send a query to the server.

```python
import requests

url = "http://localhost:3000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are an expert in the field of AI. Make sure to provide an explanation in few sentences."
        },
        {
            "role": "user",
            "content": "Explain the concept of AI."
        }
    ],
    "stream": False,
    "max_tokens": 128
}

response = requests.post(url, headers=headers, json=data)
print(response.json())
```


**Note:** The port in the request URL (http://localhost:**3000**) must match the `--port` value passed to the vLLM server (**3000**). If that port is already used by another application, choose a different number and change it in both places.
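One way to keep the port and model name from drifting apart is to define them once and derive the URL and request body from them. The sketch below assumes the values used in this tutorial; the names `VLLM_PORT`, `MODEL_ID`, `chat_url`, and `chat_payload` are hypothetical helpers, not part of vLLM.

```python
VLLM_PORT = 3000  # must equal the --port value passed to the vLLM server
MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # must equal --model


def chat_url(port: int = VLLM_PORT, host: str = "localhost") -> str:
    """Build the chat completions URL for the given host and port."""
    return f"http://{host}:{port}/v1/chat/completions"


def chat_payload(user_prompt, system_prompt=None,
                 model=MODEL_ID, max_tokens=128):
    """Build a non-streaming chat request body like the one shown above."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "max_tokens": max_tokens,
    }


print(chat_url())
```

The result of `chat_payload(...)` can be passed as the `json=` argument of `requests.post`, with `chat_url()` as the URL.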

If the request succeeds, the query prints a response similar to the following:
```bash
{"id":"chat-xx","object":"chat.completion","created":1736494622,"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Artificial Intelligence (AI) is a field of computer science ...}
```
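In most cases only the assistant's text is of interest, not the full response object. Based on the structure shown above, it sits under `choices[0].message.content`. The helper below is a small sketch; the name `extract_reply` and the sample dictionary are illustrative, with the sample content made up for the example.

```python
def extract_reply(completion: dict) -> str:
    """Return the assistant's text from a chat completion response."""
    choices = completion.get("choices") or []
    if not choices:
        # Error responses typically carry no "choices" list.
        raise ValueError(f"No choices in response: {completion}")
    return choices[0]["message"]["content"]


# Hypothetical response with the same shape as the output above.
sample = {
    "object": "chat.completion",
    "choices": [
        {"index": 0,
         "message": {"role": "assistant", "content": "Example reply."}}
    ],
}
print(extract_reply(sample))
```

With the earlier query cell, this would be called as `extract_reply(response.json())`.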