## Inference on Hugging Face TGI
Hugging Face Text Generation Inference (TGI) is a high-performance, low-latency solution for serving advanced language models in production. It streamlines text generation, letting developers deploy and scale language models for tasks such as conversational AI and content creation.

In this tutorial, we’ll demonstrate how to configure and run TGI using AMD Radeon™ and Instinct™ GPUs, leveraging the ROCm software stack for accelerated performance. You’ll learn how to set up your environment, containerize your workflow, and test your inference server by sending customized queries. 

### Prepare Inference Environment
#### 1. Launch the Docker Container
Run the following command in your terminal to pull the prebuilt Docker image containing all necessary dependencies and launch the Docker container with proper configuration:
```bash
# On the host shell:
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri --group-add video \
  --shm-size 1G \
  --security-opt seccomp=unconfined --security-opt apparmor=unconfined \
  -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace \
  --ipc=host --net host --entrypoint /bin/bash \
  ghcr.io/huggingface/text-generation-inference:latest-rocm
# Once the container starts, the prompt changes to something like root@xxx:
```
```bash
# Inside the container:
cd /workspace
```
**Note:** The `-v $(pwd):/workspace` option mounts the current host directory to **/workspace** in the container, so files can be shared between the host and the container.
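
Before going further, it can help to confirm from inside the container that the GPU device nodes passed with `--device` and the mounted `/workspace` directory are actually visible. The following is a minimal sketch using only the Python standard library; the helper name `check_paths` is hypothetical and not part of TGI or ROCm.

```python
import os

# Paths taken from the docker run command above; adjust if yours differ.
EXPECTED_PATHS = ["/dev/kfd", "/dev/dri", "/workspace"]


def check_paths(paths):
    """Return a dict mapping each path to True if it exists, else False."""
    return {path: os.path.exists(path) for path in paths}


if __name__ == "__main__":
    for path, present in check_paths(EXPECTED_PATHS).items():
        print(f"{path}: {'found' if present else 'MISSING'}")
```

A missing `/dev/kfd` or `/dev/dri` entry usually suggests the `--device` flags were dropped from the `docker run` command.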

#### 2. Install and Launch Jupyter
Inside the Docker container, install Jupyter using the following command:
```bash
# Inside the container:
pip install --upgrade pip setuptools wheel
pip install jupyter
jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root
```
**Note:** Save the token or URL provided in the terminal output to access the notebook from your host machine.

#### 3. Provide Your Hugging Face Token
You will need a Hugging Face API token to access meta-llama/Llama-3.1-8B-Instruct. Tokens typically start with "hf_". Generate your token at [Hugging Face Tokens](https://huggingface.co/settings/tokens) and request access to [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

Run the following interactive block in your Jupyter notebook to set up the token:
**Note:** Please uncheck the "Add token as Git credential?" option.


```python
from huggingface_hub import notebook_login, HfApi

# Prompt the user to log in (notebook_login() shows a widget and returns nothing)
notebook_login()
```

Verify that your token was captured correctly:

```python
# Validate the token
try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Token validated successfully! Logged in as: {user_info['name']}")
except Exception as e:
    print(f"Token validation failed. Error: {e}")
```
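
In non-interactive settings (for example, when the notebook widget is unavailable), the token is often supplied through an environment variable instead. The sketch below assumes the token is exported as `HF_TOKEN`, which recent `huggingface_hub` releases read automatically. The helper `token_from_env` is a hypothetical name, and the `hf_` prefix check only mirrors the "typically start with" hint above, so it is a sanity check rather than real validation.

```python
import os


def token_from_env(env=None, var_name="HF_TOKEN"):
    """Return the token stored in `var_name`, or None if it is absent or oddly shaped.

    The "hf_" prefix test is a heuristic only; the Hub decides actual validity.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token.startswith("hf_"):
        return None
    return token


if __name__ == "__main__":
    found = token_from_env()
    print("Token found in environment." if found else "No usable HF_TOKEN set.")
```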

### Deploy the LLM Using Hugging Face TGI
Launch the TGI server for `meta-llama/Llama-3.1-8B-Instruct`:

```bash
# Notebook cell: the leading "!" runs the command in the container's shell
!HIP_VISIBLE_DEVICES=4 \
text-generation-launcher \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --num-shard 1 \
    --cuda-graphs 1 \
    --max-batch-prefill-tokens 131072 \
    --max-batch-total-tokens 139264 \
    --dtype float16 \
    --port 8000 \
    --trust-remote-code
```

**Note:** In a multi-GPU environment, set **HIP_VISIBLE_DEVICES=x** to choose the GPU that hosts the LLM, where x is the device index (4 in the command above).
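
Running the launcher with `!` keeps that notebook cell busy for as long as the server runs. As one possible alternative, the sketch below builds the same command as a Python list, starts it in the background with `subprocess.Popen`, and polls the server until it responds. It assumes TGI's `/health` route is reachable on the chosen port. The function names (`build_launch_command`, `launch_env`, `http_ok`, `wait_until_ready`, `start_server`) and the 600-second timeout are illustrative choices, not part of TGI.

```python
import os
import subprocess
import time
import urllib.error
import urllib.request


def build_launch_command(model_id, port, dtype="float16", num_shard=1):
    """Assemble the text-generation-launcher arguments used in this tutorial."""
    return [
        "text-generation-launcher",
        "--model-id", model_id,
        "--num-shard", str(num_shard),
        "--cuda-graphs", "1",
        "--max-batch-prefill-tokens", "131072",
        "--max-batch-total-tokens", "139264",
        "--dtype", dtype,
        "--port", str(port),
        "--trust-remote-code",
    ]


def launch_env(gpu_index, base_env=None):
    """Copy the environment and pin the process to one GPU via HIP_VISIBLE_DEVICES."""
    env = dict(os.environ if base_env is None else base_env)
    env["HIP_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def http_ok(url):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, timeout_s=600, interval_s=5, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `timeout_s` seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        sleep(interval_s)
    return False


def start_server(model_id="meta-llama/Llama-3.1-8B-Instruct", port=8000, gpu_index=4):
    """Start TGI in the background and return the Popen handle."""
    command = build_launch_command(model_id, port)
    return subprocess.Popen(command, env=launch_env(gpu_index))
```

With these helpers, `server = start_server()` followed by `wait_until_ready("http://localhost:8000/health")` would return once the server answers, and `server.terminate()` stops it.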

The LLM running in the Docker container acts as a server. To test it, open a new notebook and send a query to the server.

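```python
import requests

# The port must match the --port value passed to text-generation-launcher
url = "http://localhost:8000/generate"
headers = {
    "Content-Type": "application/json"
}
data = {
    "inputs": "System: You are an expert in the field of AI. Make sure to provide an explanation in few sentences.\nUser: Explain the concept of AI.\nAssistant:",
    "parameters": {
        "max_new_tokens": 128,
        "do_sample": False
    }
}

# The timeout keeps the cell from hanging indefinitely if the server is unreachable
response = requests.post(url, headers=headers, json=data, timeout=120)
response.raise_for_status()  # raise on HTTP errors instead of printing an error body
print(response.json())
```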

**Note:** The port in the request URL (http://localhost:**8000**) must match the `--port` value passed to `text-generation-launcher` (**8000** here). If another application is already using that port, choose a different number in both places.
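
To see whether something is already listening on a port before picking it, a quick check with the standard `socket` module is one option. This sketch is not part of TGI; `port_in_use` is a hypothetical helper, and it only detects listeners reachable on the given host.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if a TCP connection to host:port succeeds, i.e. something is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) == 0


PORT = 8000  # keep this in sync with the launcher's --port value
URL = f"http://localhost:{PORT}/generate"

if __name__ == "__main__":
    state = "in use" if port_in_use(PORT) else "free"
    print(f"Port {PORT} is {state}; requests would go to {URL}")
```

Deriving the URL from a single `PORT` constant is one way to keep the client and the launcher from drifting apart.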

If the connection succeeds, the output resembles the following:
```text
{"generated_text":" AI, or Artificial Intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence, such as learning, problem-solving, ...}
```
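
For repeated queries, the request body and the response handling can be wrapped in small helpers. The sketch below reuses the `inputs`/`parameters` layout and the plain-text `System:`/`User:`/`Assistant:` prompt from the example request. The function names are hypothetical, and the prompt layout is simply the one used in this tutorial, not necessarily the model's official chat template.

```python
def build_payload(system_prompt, user_prompt, max_new_tokens=128):
    """Build a /generate request body in the same shape as the example request."""
    prompt = f"System: {system_prompt}\nUser: {user_prompt}\nAssistant:"
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": False},
    }


def extract_text(response_json):
    """Return the stripped `generated_text` field, or raise if it is missing."""
    if "generated_text" not in response_json:
        raise ValueError(f"Unexpected response: {response_json}")
    return response_json["generated_text"].strip()


if __name__ == "__main__":
    payload = build_payload(
        "You are an expert in the field of AI.",
        "Explain the concept of AI.",
    )
    print(payload["inputs"])
    # With the server running, the payload could be sent with, for example:
    #   requests.post("http://localhost:8000/generate", json=payload, timeout=120)
    # The dict below is a placeholder standing in for a real server reply.
    print(extract_text({"generated_text": " placeholder reply "}))
```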