# Deploying Llama-3.1 8B using vLLM

vLLM is an open-source library designed to deliver high throughput and low latency for large language model (LLM) inference. It optimizes text generation workloads by efficiently batching requests and making full use of GPU resources, empowering developers to manage complex tasks like code generation and large-scale conversational AI.

This tutorial guides you through setting up and running vLLM on AMD Instinct™ GPUs using the ROCm software stack. Learn how to configure your environment, containerize your workflow, and send test queries to the vLLM-supported inference server.

**Note**: Accessing Meta models generally requires a Hugging Face API token and approval from Meta. For your convenience, the Llama-3.1-8B-Instruct model has already been downloaded in this environment. The model download and environment preparation steps below are included for reference only; you can skip them and continue to the "Deploying the LLM using vLLM" section.

## Llama Model Download (Optional)

### Hugging Face API access

* Obtain an API token from [Hugging Face](https://huggingface.co) for downloading models.
* Ensure the Hugging Face API token has the necessary permissions and approval to access [Meta's Llama checkpoints](https://huggingface.co/meta-llama/Llama-3.1-8B).
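
Once access has been approved, the checkpoint can be fetched programmatically. The following is a minimal sketch, assuming `snapshot_download` from the `huggingface_hub` package; the helper names and the `<base_dir>/<repo_id>` directory layout are hypothetical and are not the layout of the pre-downloaded copy used later in this tutorial.

```python
from pathlib import Path

REPO_ID = "meta-llama/Llama-3.1-8B-Instruct"


def local_model_dir(base_dir: str, repo_id: str = REPO_ID) -> Path:
    """Return the (hypothetical) directory where the checkpoint is stored."""
    return Path(base_dir) / repo_id


def download_model(base_dir: str, repo_id: str = REPO_ID) -> Path:
    """Download every file of the repo into <base_dir>/<repo_id>.

    Requires a logged-in Hugging Face account with approved access.
    """
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(base_dir, repo_id)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_model("./Models")` would place the files under `./Models/meta-llama/Llama-3.1-8B-Instruct`; the resulting path is what you would then pass to `vllm serve`.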



## Prepare the inference environment

Follow these steps to get the inference environment ready for use.

### Provide your Hugging Face token

You need a Hugging Face API token to access meta-llama/Llama-3.1-8B-Instruct. Generate a token at [Hugging Face Tokens](https://huggingface.co/settings/tokens) and request access to [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct). Tokens typically start with "hf_".

Run the following interactive block in your Jupyter notebook to set up the token:

**Note**: Uncheck the "Add token as Git credential?" option.

```python
from huggingface_hub import notebook_login, HfApi

# Prompt the user to log in
notebook_login()
```

Verify that your token was accepted correctly:

```python
# Validate the token
try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Token validated successfully! Logged in as: {user_info['name']}")
except Exception as e:
    print(f"Token validation failed. Error: {e}")
```
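
If you run outside an interactive notebook, the widget-based login is not available. The sketch below shows one possible alternative, assuming the token is exported in an environment variable (the name `HF_TOKEN` and both helper functions are assumptions made for illustration) and using `huggingface_hub.login`. The prefix check only reflects the "hf_" convention mentioned above; it does not prove the token is valid.

```python
import os


def looks_like_hf_token(token: str) -> bool:
    """Cheap sanity check: tokens typically start with 'hf_'."""
    return token.startswith("hf_") and len(token) > len("hf_")


def login_from_env(var_name: str = "HF_TOKEN") -> bool:
    """Log in with a token read from the environment; return True on success."""
    token = os.environ.get(var_name, "").strip()
    if not looks_like_hf_token(token):
        print(f"{var_name} is unset or does not look like a Hugging Face token.")
        return False
    # Imported lazily so the format check works without huggingface_hub installed.
    from huggingface_hub import login

    # Mirrors the note above: do not store the token as a Git credential.
    login(token=token, add_to_git_credential=False)
    return True
```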

## Deploying the LLM using vLLM

Start deploying the LLM (meta-llama/Llama-3.1-8B-Instruct) using vLLM in the Jupyter notebook:

### Start the vLLM server 

Open a new tab in this Jupyter server, click the terminal icon to open a new terminal, then copy and run the following command to launch the vLLM server:

```bash
HIP_VISIBLE_DEVICES=0 vllm serve /home/user/Models/meta-llama/Meta-Llama-3.1-8B-Instruct \
        --gpu-memory-utilization 0.3 \
        --swap-space 16 \
        --disable-log-requests \
        --dtype float16 \
        --max-model-len 2048 \
        --tensor-parallel-size 1 \
        --host 0.0.0.0 \
        --port 4000 \
        --num-scheduler-steps 10 \
        --max-num-seqs 128 \
        --max-num-batched-tokens 2048 \
        --distributed-executor-backend "mp"
```

Once the server has started successfully, the terminal displays `INFO:     Application startup complete.`.

**Note**: In a multi-GPU environment, set `HIP_VISIBLE_DEVICES=x` to deploy the LLM on your preferred GPU. The command above uses GPU 0.
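
Loading the model can take a while, so instead of watching the terminal you can poll the server from the notebook until it responds. The following is a rough sketch, assuming the server exposes a `/health` route on the port chosen above; the function names, timeout, and polling interval are illustrative choices, not part of vLLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(url: str = "http://localhost:4000/health") -> bool:
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(probe: Callable[[], bool] = probe_health,
                     timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` before sending the first request avoids connection errors while the weights are still loading. The probe is passed in as a parameter so that a different readiness check can be substituted.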

### Start the client

After successfully running the server, as described above, run the following code to start your client:

```python
import requests

url = "http://localhost:4000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "/home/user/Models/meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are an expert in the field of AI. Make sure to provide an explanation in few sentences."
        },
        {
            "role": "user",
            "content": "Explain the concept of AI Agent."
        }
    ],
    "stream": False,
    "max_tokens": 128
}

response = requests.post(url, headers=headers, json=data)
print(response.json())
```


**Note**: The port in the URL must match the `--port` value passed to `vllm serve` (**4000** in this tutorial), for instance, http://localhost:**4000**. If that port is already used by another application, choose a different number and update both places.
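
One way to keep the port and the model name consistent is to define the port once and ask the server which model names it accepts. This is a small sketch under the assumption that the server exposes the OpenAI-style `/v1/models` route; `endpoint` and `served_model_ids` are hypothetical helper names.

```python
import json
import urllib.request

HOST = "localhost"
PORT = 4000  # must equal the --port value passed to `vllm serve`


def endpoint(path: str, host: str = HOST, port: int = PORT) -> str:
    """Build a server URL so the port is defined in exactly one place."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def served_model_ids(host: str = HOST, port: int = PORT) -> list[str]:
    """Ask the running server which model names it accepts."""
    with urllib.request.urlopen(endpoint("/v1/models", host, port), timeout=10) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload["data"]]
```

With the server running, `served_model_ids()` should list the name to use in the `"model"` field of the request, and `endpoint("/v1/chat/completions")` gives the URL to post to.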

If the request succeeds, the output resembles the following (truncated here):

```text
{'id': 'chatcmpl-6d31bdc194b74446919bde59f5aa4bb2', 'object': 'chat.completion', 'created': 1751457265, 'model': '/home/user/Models/meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'reasoning_content': None, 'content': 'An AI Agent is a software program or a system that perceives its environment and takes actions to achieve a specific goal or set of goals. It\'s a core concept in Artificial Intelligence (AI) and is often referred to as a "rational agent" ...}
```
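
The generated text is nested inside the `choices` list of the response. A minimal sketch for pulling it out is shown below; `extract_reply` is a hypothetical helper, and the `sample` dictionary is a shortened stand-in that only mirrors the structure of the output above.

```python
def extract_reply(completion: dict) -> str:
    """Return the assistant text of the first choice in a chat completion."""
    choices = completion.get("choices") or []
    if not choices:
        raise ValueError(f"No choices in response: {completion}")
    return choices[0]["message"]["content"]


# Hypothetical, shortened response used only to illustrate the structure.
sample = {
    "object": "chat.completion",
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "An AI Agent is ..."}}
    ],
}
print(extract_reply(sample))
```

In the client cell, the equivalent call would be `extract_reply(response.json())`. Raising on an empty `choices` list surfaces error payloads early instead of failing later with a `KeyError`.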