# Getting started with TensorRT-LLM and Triton Inference Server

This hands-on tutorial is based on the TensorRT-LLM demo from ai-PULSE by Scaleway, which can be found here: https://github.com/scaleway/ai-pulse-nvidia-trt-llm/tree/main

In this tutorial, we will cover
- How to convert llama 2 models to TensorRT-LLM format
- Set-up Triton Inference Server with llama 2 models optimized using TensorRT-LLM
- Benchmark the inference performance of Triton + TensorRT-LLM pipeline vs vanilla Python HuggingFace pipeline 

In [8]:
!nvidia-smi

Tue Jun 18 11:52:30 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:DA:00.0 Off |                    0 |
| N/A   37C    P0              63W / 400W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## 1. Setup the environment

First let's clone the TensorRT-LLM github repo and be sure to use the correct version for this demo.

In [1]:
!git clone https://github.com/NVIDIA/TensorRT-LLM.git 
!git config --global --add safe.directory /workspace/notebooks/tensorrt-llm/TensorRT-LLM
!cd TensorRT-LLM && git checkout v0.5.0

fatal: destination path 'TensorRT-LLM' already exists and is not an empty directory.
fatal: detected dubious ownership in repository at '/workspace/nv-inference-demo/notebooks/tensorrt-llm/TensorRT-LLM'
To add an exception for this directory, call:

	git config --global --add safe.directory /workspace/nv-inference-demo/notebooks/tensorrt-llm/TensorRT-LLM


Next, let's download the llama 2 models, if it is not already done yet. 

For this you need to go to the models [website](https://llama.meta.com/llama-downloads), register, then an email with a custom URL will be sent to you allowing you to download the llama models.

To proceed with the download, first clone the llama repo, then launch the download script. When prompt with URL, just enter the URL that you received via email before. For this tutorial, we will need to download 1 model: the 7B-chat. Put the downloaded model inside `./llama-models` folder.

Note: **the download could take some time.**

In [2]:
!git clone https://github.com/facebookresearch/llama.git

fatal: destination path 'llama' already exists and is not an empty directory.


**Note: run commands in the following cell in a separate terminal (without the prepending !)**

In [3]:
!mkdir -p llama-models
!cd llama-models && ../llama/download.sh

Enter the URL from email: ^C


We also need to clone the huggingface transformers repo, to be able to use the conversion script to convert llama 2 models checkpoint format to huggingface's Transformers format.

In [4]:
!ls -lah --color llama-models/

total 524K
drwxr-xr-x  3 99 99 4.0K Mar  4 15:51 [0m[01;34m.[0m
drwxr-xr-x 15 99 99 4.0K Jun 17 08:52 [01;34m..[0m
-rw-r--r--  1 99 99 6.9K Jul 15  2023 LICENSE
-rw-r--r--  1 99 99 4.7K Jul 15  2023 USE_POLICY.md
drwxr-xr-x  2 99 99 4.0K Feb 27 23:12 [01;34mllama-2-7b-chat[0m
-rw-r--r--  1 99 99 489K Jul 13  2023 tokenizer.model
-rw-r--r--  1 99 99   50 Jul 13  2023 tokenizer_checklist.chk


In [5]:
!git clone https://github.com/huggingface/transformers
!cd transformers && git checkout v4.39.0

fatal: destination path 'transformers' already exists and is not an empty directory.
fatal: detected dubious ownership in repository at '/workspace/nv-inference-demo/notebooks/tensorrt-llm/transformers'
To add an exception for this directory, call:

	git config --global --add safe.directory /workspace/nv-inference-demo/notebooks/tensorrt-llm/transformers


Now convert meta checkpoint weights to huggingface format

In [6]:
!cp llama-models/tokenizer.model llama-models/llama-2-7b-chat/.
!mkdir -p hf-weights
!python ./transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ./llama-models/llama-2-7b-chat --model_size 7B --output_dir ./hf-weights/7B-chat


You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
Fetching all parameters from the checkpoint at ./llama-models/llama-2-7b-chat.
Loading the checkpoint in a Llama model.
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00,  4.95it/s]
Saving in the Transformers format.


## 2. Compile llama 2 models to TensorRT-LLM engines

Just like TensorRT, TensorRT-LLM provides APIs to compile / convert build LLMs to TensorRT engines. In this example, the conversion steps were implemented already in the `TensorRT-LLM/examples/llama/build.py` script provided by TensorRT-LLM repo. We can analyze the script to see how TensorRT-LLM APIs were used to build the LLM model and load the trained weights.

The TensorRT-LLM team is working on high-level APIs to make the conversion steps easier.

In [7]:
!python TensorRT-LLM/examples/llama/build.py \
    --model_dir ./hf-weights/7B-chat  \
    --dtype float16 \
    --use_gpt_attention_plugin float16  \
    --paged_kv_cache \
    --remove_input_padding \
    --use_gemm_plugin float16  \
    --output_dir "./trt-engines/llama_7b/fp16/1-gpu"  \
    --max_input_len 2048 --max_output_len 512 \
    --use_rmsnorm_plugin float16  \
    --enable_context_fmha \
    --use_inflight_batching

[06/18/2024-11:50:22] [TRT-LLM] [I] Serially build TensorRT engines.
[06/18/2024-11:50:22] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 118, GPU 423 (MiB)
[06/18/2024-11:50:27] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1974, GPU +350, now: CPU 2228, GPU 773 (MiB)
[06/18/2024-11:50:27] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[06/18/2024-11:50:32] [TRT-LLM] [I] Loading HF LLaMA ... from ./hf-weights/7B-chat
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:03<00:00,  1.99s/it]
[06/18/2024-11:50:54] [TRT-LLM] [I] HF LLaMA loaded. Total time: 00:00:22
[06/18/2024-11:50:54] [TRT-LLM] [I] Loading weights from HF LLaMA...
[06/18/2024-11:51:30] [TRT-LLM] [I] Weights loaded. Total time: 00:00:35
[06/18/2024-11:51:30] [TRT-LLM] [I] Context FMHA Enabled
[06/18/2024-11:51:30] [TRT-LLM] [I] Remove Padding Enabled
[06/18/2024-11:51:30] [TRT-LLM] [I] Paged KV Cache Enabled
[06/18/2024-11:51:30] [TRT-LLM] [I] Build TensorRT engine l

Now let's run the inference of the llama-2-7b-chat model. Similarly, TensorRT-LLM provides APIs to do that. In this example, the inference script `TensorRT-LLM/examples/llama/run.py` is provided by TensorRT-LLM repo.

In [9]:
!python TensorRT-LLM/examples/llama/run.py \
    --engine_dir=./trt-engines/llama_7b/fp16/1-gpu \
    --max_output_len 500 \
    --tokenizer_dir "llama-models" \
    --input_text "Let me explain what DNA is. DNA, which stands for"

Running the float16 engine ...
Input: "Let me explain what DNA is. DNA, which stands for"
Output: " deoxyribonucleic acid, is a molecule that contains the genetic instructions used in the development and function of all living organisms. It is a long, complex molecule that is made up of four different chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). These bases are arranged in a specific sequence, creating a unique code that determines the characteristics of an organism.
The sequence of these bases along a DNA molecule determines the genetic information encoded in the DNA. This genetic information is used in the development and function of cells, tissues, and organs, and it is passed from one generation to the next through the replication of DNA.
DNA is found in the nucleus of eukaryotic cells (such as humans) and in the cytoplasm of prokaryotic cells (such as bacteria). It is a double-stranded molecule, meaning that there are two complementary strands of nucleo

## 3. Setup Triton Inference Server for LLM inference

To start with Triton, a model repository with certain structure and configuration files should be prepared first. For ease of simplicity, all is setup already in the `triton_model_repo` folder in this example.

Here we will setup 2 LLM inference pipeline: the vanilla Pytorch pipeline with optimization, and the optimized TensorRT-LLM pipeline for the llama-2-7b-chat model.
- The Python pipeline uses huggingface APIs. The model repo is located at `./triton_model_repo/llama_7b/python/llama-huggingface`
- The TensorRT-LLM pipeline contains multiple separated components under `./triton_model_repo/llama_7b/python`: `preprocessing`, `tensorrt_llm` and `postprocessing`. Here we created an `ensemble` folder which encapsulates the `preprocessing`, `postprocessing` and `tensorrt_llm` steps in the same folder.

In [10]:
!apt update && apt install tree

Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1581 B]
Get:2 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]      [0m[33m
Get:3 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]                [0m
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]        [33m
Get:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [927 kB]
Get:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]      [0m[33m[33m
Get:7 http://archive.ubuntu.com/ubuntu jammy/restricted amd64 Packages [164 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages [1792 kB]
Get:9 http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages [17.5 MB]
Get:10 http://archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages [266 kB]
Get:11 http://archive.ubuntu.com/ubuntu jammy-updates/multiverse amd64 Packages [51.8 kB]
Get:12 http://archive.ubuntu.com/ubuntu ja

In [11]:
!mkdir -p triton_model_repo/llama_7b/python/ensemble/1
!tree ./triton_model_repo/

[01;34m./triton_model_repo/[0m
└── [01;34mllama_7b[0m
    └── [01;34mpython[0m
        ├── [01;34mensemble[0m
        │   ├── [01;34m1[0m
        │   └── [01;32mconfig.pbtxt[0m
        ├── [01;34mllama_huggingface[0m
        │   ├── [01;34m1[0m
        │   │   ├── [01;34m__pycache__[0m
        │   │   │   └── [00mmodel.cpython-310.pyc[0m
        │   │   └── [00mmodel.py[0m
        │   └── [00mconfig.pbtxt[0m
        ├── [01;34mpostprocessing[0m
        │   ├── [01;34m1[0m
        │   │   ├── [01;34m__pycache__[0m
        │   │   │   └── [00mmodel.cpython-310.pyc[0m
        │   │   └── [00mmodel.py[0m
        │   └── [01;32mconfig.pbtxt[0m
        ├── [01;34mpreprocessing[0m
        │   ├── [01;34m1[0m
        │   │   ├── [01;34m__pycache__[0m
        │   │   │   └── [00mmodel.cpython-310.pyc[0m
        │   │   └── [00mmodel.py[0m
        │   └── [00mconfig.pbtxt[0m
        └── [01;34mtensorrt_llm[0m
            ├── [01;34m1[0m
       

Feel free to look at the `config.pbtxt` files in each component folder to understand how Triton configures the inference pipeline.

Now we can start the Triton server to serve the 2 pipelines

Note: 
- **Start a separate terminal and run the following commands in terminal**
- **Make sure that you do not have `.ipynb_checkpoints` under `triton_model_repo/llama_7b/python`, this folder can be auto-generated by jupyter and can mess up the launching of Triton.**

In [12]:
## Remove ./triton_model_repo/llama_7b/python/.ipynb_checkpoints, which can mess up launch of triton
!rm -rf ./triton_model_repo/llama_7b/python/.ipynb_checkpoints
!find . -type d -name ".ipynb_checkpoints" -exec rm -rf {} \;

find: ‘./.ipynb_checkpoints’: No such file or directory


**Note: launch the command in the following cell in a separate terminal - this server command needs to be kept alive**

In [15]:
!tritonserver --model-repository=/workspace/nv-inference-demo/notebooks/tensorrt-llm/triton_model_repo/llama_7b/python # --log-verbose 5

I0610 13:45:52.600096 2331 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7fa74c000000' with size 268435456
I0610 13:45:52.600852 2331 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0610 13:45:52.625729 2331 model_lifecycle.cc:461] loading: postprocessing:1
I0610 13:45:52.626264 2331 model_lifecycle.cc:461] loading: preprocessing:1
I0610 13:45:52.626789 2331 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0610 13:45:52.627317 2331 model_lifecycle.cc:461] loading: llama_huggingface:1
E0610 13:45:52.690018 2331 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
E0610 13:45:52.690081 2331 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.

You can verify that the triton server has successfully launch when you see terminal output such as below:
```

I0304 16:11:54.555571 7346 server.cc:662] 
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| ensemble          | 1       | READY  |
| llama_huggingface | 1       | READY  |
| postprocessing    | 1       | READY  |
| preprocessing     | 1       | READY  |
| tensorrt_llm      | 1       | READY  |
+-------------------+---------+--------+

I0304 16:11:54.593570 7346 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A100-SXM4-80GB
I0304 16:11:54.594500 7346 metrics.cc:710] Collecting CPU metrics
I0304 16:11:54.594653 7346 tritonserver.cc:2458] 
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.39.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /workspace/notebooks/tensorrt-llm/triton_model_repo/llama_7b/python                                                                                                                                             |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 0                                                                                                                                                                                                               |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0304 16:11:54.596304 7346 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I0304 16:11:54.596487 7346 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I0304 16:11:54.637582 7346 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
```

Now let's send inference requests to the triton server via triton client. To send an inflight inference request to Triton server, launch the following command using the provided client side script.

Here we are sending request to the Python pipeline `llama_huggingface`, feel free to change `--model_name` to `ensemble` to send request to tensorRT-LLM pipeline.

In [20]:
!python ./end_to_end_streaming_client.py -u localhost:8001 --model_name llama_huggingface --max_tokens 500  --prompt "Here is an explanation of what DNA is:"

b'Here is an explanation of what DNA is:\n\nDNA (Deoxyribonucleic acid) is a molecule that contains the genetic instructions used in the development and function of all living organisms. DNA is a long, double-stranded helix made up of nucleotides, which are the building blocks of DNA. Each nucleotide is composed of a sugar molecule called deoxyribose, a phosphate group, and one of four nitrogenous bases - adenine (A), guanine (G), cytosine (C), and thymine (T). The sequence of these nitrogenous bases along the DNA molecule determines the genetic information encoded in the DNA.\nDNA is found in the nucleus of eukaryotic cells (such as humans) and in prokaryotic cells (such as bacteria). It is the primary source of genetic information that is passed from one generation to the next, and it plays a central role in the development and function of all living organisms.\nIn summary, DNA is a molecule that contains the genetic instructions used in the development and function of all living org

## 3. Benchmark Python pipeline vs TensorRT-LLM pipeline

Now we are ready to benchmark the performance of TensorRT-LLM for llama-2-7b-chat inference vs the Python pipeline. A benchmark script `identity_test_python_vs_trtllm.py` is provided.

Run the following command to benchmark the throughput of the huggingface Python pipeline

In [16]:
!python ./identity_test_python_vs_trtllm.py \
    -u localhost:8001 \
    --max_input_len 100 \
    --dataset /workspace/nv-inference-demo/notebooks/tensorrt-llm/datasets/mini_cnn_eval.json \
    -i grpc \
    --model_name "llama_huggingface"

[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 4 prompts.
[INFO] Total Latency: 3764.447 ms


Run the following command to benchmark the throughput of the tensorRT-LLM pipeline

In [17]:
!python ./identity_test_python_vs_trtllm.py \
    -u localhost:8001 \
    --max_input_len 100 \
    --dataset /workspace/nv-inference-demo/notebooks/tensorrt-llm/datasets/mini_cnn_eval.json \
    -i grpc \
    --model_name "ensemble"

[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 4 prompts.
[INFO] Total Latency: 722.567 ms


Depending on the type of GPU you are using, you can observe a different factor of speed-up for the latency measurement, typically around 4 - 5x.

## 4. Going further

We are not finished with TensorRT optimization yet, we can further push the optimization much further with techniques such as:
- Parallelisation: pipeline and tensor
- Inflight dynamic batching
- Model quantization

We will not cover these in this tutorial, but feel free to explore & test these optimizations by referring to the original demo [here](https://github.com/scaleway/ai-pulse-nvidia-trt-llm/tree/main/docs).