# Inference on Hugging Face Transformers   
The Transformers library from Hugging Face provides a framework for running inference on pre-trained models. On AMD GPUs such as the MI300X, the library runs through the ROCm build of PyTorch, which provides GPU acceleration for tasks such as text generation, classification, and translation.

## Prerequisites
### 1. Hardware Requirements
- An AMD GPU supported by ROCm (e.g., MI210, MI300X).
- A system that meets the ROCm system requirements, including ROCm 6.0 or later and Ubuntu 22.04.
### 2. Docker
- Install Docker with GPU support.
- Ensure your user has permission to access the GPU. You can verify access by running `rocm-smi` in a container:

```bash
docker run --rm --device=/dev/kfd --device=/dev/dri rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2 rocm-smi
```
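The device nodes that the command above passes into the container can also be checked from Python before any container is started. This is a minimal sketch and not part of the original guide: the helper name `missing_device_nodes` is hypothetical, and it only tests that the paths exist, not that the ROCm driver works.

```python
import os

# Device nodes that the docker command above passes into the container.
ROCM_DEVICE_NODES = ("/dev/kfd", "/dev/dri")


def missing_device_nodes(paths=ROCM_DEVICE_NODES):
    """Return the device paths that do not exist on this machine."""
    return [path for path in paths if not os.path.exists(path)]


if __name__ == "__main__":
    missing = missing_device_nodes()
    if missing:
        print("Missing device nodes:", ", ".join(missing))
    else:
        print("All expected ROCm device nodes are present.")
```

An empty list means both nodes exist. Whether the current user may open them still depends on group membership, which the `rocm-smi` run above exercises.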
### 3. Hugging Face API Access
- Obtain an API token from Hugging Face for downloading models.
- Ensure the token has the necessary permissions and that your account has been approved to access Meta’s Llama checkpoints.

## Prepare Inference Environment
### 1. Pull the Docker Image
```bash
# Host machine
docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 1G --security-opt seccomp=unconfined --security-opt apparmor=unconfined -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace rocm/pytorch:latest

# Inside the container 
cd /workspace 
export HF_TOKEN="Your hugging face token to access gated models" 
pip install accelerate transformers 
```
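Because a missing or placeholder token only surfaces later as a failed download of the gated model, it can help to check `HF_TOKEN` first. The following is a minimal sketch under stated assumptions: the helper name `read_hf_token` is hypothetical, and the placeholder check relies on the assumption that a real token contains no spaces.

```python
import os


def read_hf_token(env=os.environ):
    """Return the token stored in HF_TOKEN, or raise if it is unusable."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before downloading gated models.")
    if " " in token:
        # Assumption: the placeholder sentence from the guide contains spaces, a real token does not.
        raise RuntimeError("HF_TOKEN still looks like placeholder text.")
    return token
```

Calling `read_hf_token()` inside the container before loading a model fails early with a readable message instead of a download error.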

### 2. Install and Launch Jupyter
Inside the Docker container, install Jupyter using the following command:
```bash
pip install --upgrade pip setuptools wheel
pip install jupyter
```
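Start the Jupyter server. The `docker run` command above does not publish port 8888, so to reach Jupyter from outside the container, add `-p 8888:8888` (or `--net host`) when starting the container: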
```bash
jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
```
### 3. Run a Sample LLM
Create a file named `hf_transformers.py` inside the Docker container:

```python
# hf_transformers.py
import torch
import transformers

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Build a text-generation pipeline; device_map="auto" places the model on the available GPU(s).
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Chat-style input: a system prompt followed by the user's question.
messages = [
    {"role": "system", "content": "You are a chatbot in the online shopping mall!"},
    {"role": "user", "content": "How can I get a refund of this product?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=10,  # cap the length of the generated reply
)

# The last entry of "generated_text" is the assistant's reply.
print(outputs[0]["generated_text"][-1])

# Run with: python hf_transformers.py
# Example output:
# {'role': 'assistant', 'content': "I'd be happy to help you with the refund"}
```
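The printed result is the last message of the conversation returned by the pipeline. To use only the reply text, the structure can be unpacked as in the sketch below. The helper name `last_assistant_reply` is hypothetical, and `sample_outputs` is a hand-written stand-in shaped like the output shown above, so no model is loaded.

```python
def last_assistant_reply(outputs):
    """Return the text of the final assistant message in a chat pipeline result."""
    conversation = outputs[0]["generated_text"]
    final_message = conversation[-1]
    if final_message.get("role") != "assistant":
        raise ValueError("the last message was not produced by the assistant")
    return final_message["content"]


# Stand-in data with the same shape as the pipeline output (assumption for illustration).
sample_outputs = [
    {
        "generated_text": [
            {"role": "system", "content": "You are a chatbot in the online shopping mall!"},
            {"role": "user", "content": "How can I get a refund of this product?"},
            {"role": "assistant", "content": "I'd be happy to help you with the refund"},
        ]
    }
]

print(last_assistant_reply(sample_outputs))
```

In the real script, the same call would be `last_assistant_reply(outputs)` after the pipeline has run.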

# Inference on Hugging Face TGI 
Text Generation Inference (TGI) from Hugging Face is a high-performance library optimized for serving large language models. With features such as model sharding, multi-GPU inference, and low-latency decoding, TGI is designed to make use of the AMD MI300X’s high compute density and memory bandwidth.

## Prerequisites
### 1. Hardware Requirements
- An AMD GPU supported by ROCm (e.g., MI210, MI300X).
- A system that meets the ROCm system requirements, including ROCm 6.0 or later and Ubuntu 22.04.
### 2. Docker
- Install Docker with GPU support.
- Ensure your user has permission to access the GPU. You can verify access by running `rocm-smi` in a container:

```bash
docker run --rm --device=/dev/kfd --device=/dev/dri rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2 rocm-smi
```
### 3. Hugging Face API Access
- Obtain an API token from Hugging Face for downloading models.
- Ensure the token has the necessary permissions and that your account has been approved to access Meta’s Llama checkpoints.

## Prepare Inference Environment
### 1. Pull the Docker Image
```bash
# Host machine
docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 1G --security-opt seccomp=unconfined --security-opt apparmor=unconfined -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --ipc=host --net host --entrypoint /bin/bash ghcr.io/huggingface/text-generation-inference:latest-rocm 

# Inside the container
cd /workspace 
export HF_TOKEN="Your hugging face token to access gated models" 
```

### 2. Install and Launch Jupyter
Inside the Docker container, install Jupyter using the following command:
```bash
pip install --upgrade pip setuptools wheel
pip install jupyter
```
Start the Jupyter server:
```bash
jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
```
### 3. Running the LLM on a Local Server
```bash
# Launch LLM backend server 

nohup text-generation-launcher --model-id meta-llama/Meta-Llama-3.1-8B-Instruct --num-shard 1 --cuda-graphs 1 --max-batch-prefill-tokens 131072 --max-batch-total-tokens 139264 --dtype float16 --port 88 & 

# Check the server status; the backend is ready once "Connected" appears in the log
tail nohup.out

# Once the model has loaded, the log shows lines such as:
# router/src/server.rs:2015: Invalid hostname, defaulting to 0.0.0.0
# 2025-01-08T03:25:41.801732Z  INFO text_generation_router::server: router/src/server.rs:2402: Connected
```
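Instead of running `tail` repeatedly, the log can be polled until the readiness line appears. This is a minimal sketch, assuming the launcher writes to `nohup.out` in the current directory as in the command above. The function name `wait_for_log_marker` and its timeout values are hypothetical choices, not part of TGI.

```python
import time
from pathlib import Path


def wait_for_log_marker(log_path, marker, timeout_s=300.0, poll_s=2.0):
    """Poll a log file until `marker` appears; return True if it is found before the timeout."""
    path = Path(log_path)
    deadline = time.monotonic() + timeout_s
    while True:
        # The file may not exist yet right after the server is launched.
        if path.exists() and marker in path.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

For the TGI launch above, `wait_for_log_marker("nohup.out", "Connected")` returns `True` once the router reports that it is connected, or `False` if that does not happen within five minutes.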

### 4. Accessing the LLM Server from the Client
```bash
curl localhost:88/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a chatbot in the online shopping mall!"
    },
    {
      "role": "user",
      "content": "How can I get a refund of this product?"
    }
  ],
  "stream": false,
  "max_tokens": 10
}' \
    -H 'Content-Type: application/json'

# Example response:
# {"object":"chat.completion","id":"","created":1736307160,"model":"meta-llama/Llama-3.1-8B-Instruct","system_fingerprint":"2.4.2-dev0-sha-9f5c9a5-rocm","choices":[{"index":0,"message":{"role":"assistant","content":"You can initiate the refund process by logging into your"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":56,"completion_tokens":10,"total_tokens":66}}
```
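The same request can be sent from Python with only the standard library. This is a sketch rather than an official client: the function names `build_chat_payload`, `post_chat`, and `reply_text` are hypothetical, and the endpoint path and JSON fields are copied from the `curl` call and response above.

```python
import json
import urllib.request


def build_chat_payload(system_prompt, user_prompt, model="tgi", max_tokens=10):
    """Build the JSON body used by the curl example above."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "stream": False,
        "max_tokens": max_tokens,
    }


def post_chat(base_url, payload, timeout_s=60):
    """POST the payload to <base_url>/v1/chat/completions and return the decoded JSON."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


def reply_text(completion):
    """Extract the assistant's text from a chat completion response."""
    return completion["choices"][0]["message"]["content"]
```

With the server from the previous step running, `reply_text(post_chat("http://localhost:88", build_chat_payload("You are a chatbot in the online shopping mall!", "How can I get a refund of this product?")))` returns the generated reply as a string.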

# Inference on Hugging Face Models with vLLM
vLLM is a specialized inference engine that prioritizes fast decoding for large language models and runs on the AMD MI300X. It uses techniques such as continuous batching and efficient memory management to maximize throughput, which makes it well suited to real-time inference and high concurrency with large-scale LLMs.

## Prerequisites
### 1. Hardware Requirements
- An AMD GPU supported by ROCm (e.g., MI210, MI300X).
- A system that meets the ROCm system requirements, including ROCm 6.0 or later and Ubuntu 22.04.
### 2. Docker
- Install Docker with GPU support.
- Ensure your user has permission to access the GPU. You can verify access by running `rocm-smi` in a container:

```bash
docker run --rm --device=/dev/kfd --device=/dev/dri rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2 rocm-smi
```
### 3. Hugging Face API Access
- Obtain an API token from Hugging Face for downloading models.
- Ensure the token has the necessary permissions and that your account has been approved to access Meta’s Llama checkpoints.

## Prepare Inference Environment
### 1. Pull the Docker Image
```bash
# Host machine 
docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 1G --security-opt seccomp=unconfined --security-opt apparmor=unconfined -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --ipc=host --net host --entrypoint /bin/bash rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4  

# Inside the container 
cd /workspace 
export HF_TOKEN="Your hugging face token to access gated models"  
```

### 2. Install and Launch Jupyter
Inside the Docker container, install Jupyter using the following command:
```bash
pip install --upgrade pip setuptools wheel
pip install jupyter
```
Start the Jupyter server:
```bash
jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
```
### 3. Running the LLM on a Local Server
```bash
# Launch the LLM backend server
req=128
max_num_batched_tokens=131072
model=meta-llama/Meta-Llama-3.1-8B-Instruct
nohup python -m vllm.entrypoints.openai.api_server \
       --model "$model" \
       --gpu-memory-utilization 0.9 \
       --swap-space 16 \
       --disable-log-requests \
       --dtype float16 \
       --max-model-len $max_num_batched_tokens \
       --tensor-parallel-size 1 \
       --host 0.0.0.0 \
       --port 88 \
       --num-scheduler-steps 10 \
       --enable-chunked-prefill False \
       --max-num-seqs $req \
       --max-num-batched-tokens $max_num_batched_tokens \
       --distributed-executor-backend "mp" &

# Check the server status; the backend is ready once "Application startup complete." appears in the log
tail nohup.out

# Once the model has loaded, the log shows lines such as:
# INFO:     Application startup complete.
# INFO:     Uvicorn running on socket ('0.0.0.0', 88) (Press CTRL+C to quit)
```
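Long shell commands with line continuations are easy to break, for example with a stray space after a backslash or a flag passed twice. One alternative is to assemble the argument list in Python. The sketch below is an illustration only: the helper name `build_vllm_command` is hypothetical, and it includes just a subset of the flags from the command above, with the same values as defaults.

```python
import shlex


def build_vllm_command(model, port=88, max_model_len=131072, max_num_seqs=128):
    """Assemble the vLLM server command as an argument list (subset of the flags above)."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--dtype", "float16",
        "--host", "0.0.0.0",
        "--port", str(port),
        "--tensor-parallel-size", "1",
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        # The guide sets the batched-token budget equal to the maximum model length.
        "--max-num-batched-tokens", str(max_model_len),
    ]


command = build_vllm_command("meta-llama/Meta-Llama-3.1-8B-Instruct")
print(shlex.join(command))  # shell-safe rendering of the command for review
```

The list can be passed unchanged to `subprocess.Popen(command)`, which avoids shell quoting and continuation issues altogether.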

### 4. Accessing the LLM Server from the Client
```bash
# Send a chat completion request to the vLLM server
curl localhost:88/v1/chat/completions \
    -X POST \
    -d '{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a chatbot in the online shopping mall!"
    },
    {
      "role": "user",
      "content": "How can I get a refund of this product?"
    }
  ],
  "stream": false,
  "max_tokens": 10
}' \
    -H 'Content-Type: application/json'

# Example response:
# {"id":"chat-31ea193149064b0cb0401311c4cff2a4","object":"chat.completion","created":1736307940,"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"I'd be happy to help you with the refund","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":56,"total_tokens":66,"completion_tokens":10},"prompt_logprobs":null}
```
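In the response above, `"finish_reason":"length"` indicates that generation stopped because it reached the `max_tokens` limit of 10, which is why the reply ends mid-sentence. The sketch below reads those fields from a decoded response. The helper name `summarize_completion` is hypothetical, and `sample_response` is a trimmed copy of the JSON shown above.

```python
def summarize_completion(completion):
    """Collect the reply text, truncation status, and token counts from a chat completion."""
    choice = completion["choices"][0]
    usage = completion["usage"]
    return {
        "text": choice["message"]["content"],
        # "length" means the reply was cut off by the max_tokens limit.
        "truncated": choice["finish_reason"] == "length",
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }


# Trimmed copy of the example response above.
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "I'd be happy to help you with the refund"},
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 56, "total_tokens": 66, "completion_tokens": 10},
}

summary = summarize_completion(sample_response)
print(summary["truncated"], summary["total_tokens"])
```

When `truncated` is `True`, raising `max_tokens` in the request lets the model finish its answer.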