# Getting started with TensorRT-LLM and Triton Inference Server

This hands-on tutorial is based on the TensorRT-LLM demo from ai-PULSE by Scaleway, which can be found here: https://github.com/scaleway/ai-pulse-nvidia-trt-llm/tree/main

In this tutorial, we will cover how to:
- Convert Llama 2 models to TensorRT-LLM format
- Set up Triton Inference Server with Llama 2 models optimized using TensorRT-LLM
- Benchmark the inference performance of the Triton + TensorRT-LLM pipeline against a vanilla Python Hugging Face pipeline

## 1. Set up the environment

First, let's clone the TensorRT-LLM GitHub repo and check out the version used for this demo (v0.5.0).

In [1]:
!git clone https://github.com/NVIDIA/TensorRT-LLM.git 
!git config --global --add safe.directory /workspace/notebooks/tensorrt-llm/TensorRT-LLM
!cd TensorRT-LLM && git checkout v0.5.0

Cloning into 'TensorRT-LLM'...
remote: Enumerating objects: 10247, done.
remote: Counting objects: 100% (207/207), done.
remote: Compressing objects: 100% (124/124), done.
remote: Total 10247 (delta 64), reused 192 (delta 61), pack-reused 10040
Receiving objects: 100% (10247/10247), 130.76 MiB | 37.75 MiB/s, done.
Resolving deltas: 100% (7052/7052), done.
Updating files: 100% (1949/1949), done.
Updating files: 100% (1725/1725), done.
Note: switching to 'v0.5.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD

Next, let's download the Llama 2 models, if you have not done so already.

To do this, go to the model [website](https://llama.meta.com/llama-downloads) and register. You will then receive an email with a custom URL that allows you to download the Llama models.

To proceed with the download, first clone the llama repo, then launch its download script. When prompted for the URL, enter the one you received by email. For this tutorial we only need one model: the 7B-chat. Put the downloaded model inside the `./llama-models` folder.

Note: **the download can take a long time.**

In [2]:
!git clone https://github.com/facebookresearch/llama.git
!ls -lah --color llama-models/

fatal: destination path 'llama' already exists and is not an empty directory.
total 524K
drwxr-xr-x  3 99 99 4.0K Mar  4 15:51 .
drwxr-xr-x 15 99 99 4.0K Mar  4 15:51 ..
-rw-r--r--  1 99 99 6.9K Jul 15  2023 LICENSE
-rw-r--r--  1 99 99 4.7K Jul 15  2023 USE_POLICY.md
drwxr-xr-x  2 99 99 4.0K Feb 27 23:12 llama-2-7b-chat
-rw-r--r--  1 99 99 489K Jul 13  2023 tokenizer.model
-rw-r--r--  1 99 99   50 Jul 13  2023 tokenizer_checklist.chk
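
Before converting, it can help to confirm that the download actually completed. The following is a minimal sketch, not part of the original demo. It assumes the Meta download places `consolidated.00.pth` and `params.json` inside the model folder (adjust the list if your download looks different); the helper name `missing_checkpoint_files` is hypothetical.

```python
from pathlib import Path

# Assumed layout of a Meta Llama 2 7B download; adjust if your files differ.
EXPECTED_MODEL_FILES = ("consolidated.00.pth", "params.json")


def missing_checkpoint_files(models_root, model_name="llama-2-7b-chat"):
    """Return the expected files that are absent under models_root."""
    root = Path(models_root)
    expected = [root / "tokenizer.model"]
    expected += [root / model_name / name for name in EXPECTED_MODEL_FILES]
    return [str(p) for p in expected if not p.is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files("./llama-models")
    print("All files present" if not missing else f"Missing: {missing}")
```

An empty result means the files this sketch looks for are in place. It does not verify checksums.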


We also need to clone the Hugging Face Transformers repo, which provides the script for converting the Llama 2 checkpoint format to the Transformers format.

In [5]:
!git clone https://github.com/huggingface/transformers

Cloning into 'transformers'...
remote: Enumerating objects: 187335, done.
remote: Counting objects: 100% (759/759), done.
remote: Compressing objects: 100% (312/312), done.
remote: Total 187335 (delta 470), reused 596 (delta 373), pack-reused 186576
Receiving objects: 100% (187335/187335), 207.65 MiB | 40.09 MiB/s, done.
Resolving deltas: 100% (131407/131407), done.
Updating files: 100% (4096/4096), done.


Now convert the Meta checkpoint weights to the Hugging Face format.

In [15]:
!cp llama-models/tokenizer.model llama-models/llama-2-7b-chat/.
!python ./transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ./llama-models/llama-2-7b-chat --model_size 7B --output_dir ./hf-weights/7B-chat


You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
Fetching all parameters from the checkpoint at ./llama-models/llama-2-7b-chat.
Loading the checkpoint in a Llama model.
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00,  4.90it/s]
Saving in the Transformers format.
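
To sanity-check the conversion, you can read the `config.json` that the Transformers format stores next to the weights. This is a small sketch using only the standard library; the helper name `summarize_hf_config` is hypothetical, and the selected keys are the usual Llama config fields.

```python
import json
from pathlib import Path


def summarize_hf_config(hf_dir):
    """Read config.json from a Transformers checkpoint folder and pick key fields."""
    config = json.loads((Path(hf_dir) / "config.json").read_text())
    keys = ("model_type", "num_hidden_layers", "num_attention_heads", "hidden_size")
    return {k: config.get(k) for k in keys}


if __name__ == "__main__":
    hf_dir = Path("./hf-weights/7B-chat")
    if (hf_dir / "config.json").is_file():
        print(summarize_hf_config(hf_dir))
    else:
        print(f"No config.json under {hf_dir}; has the conversion finished?")
```

For the 7B model you would expect 32 hidden layers, 32 attention heads and a hidden size of 4096.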


## 2. Compile llama 2 models to TensorRT-LLM engines

Just like TensorRT, TensorRT-LLM provides APIs to build LLMs into TensorRT engines. In this example, the conversion steps are already implemented in the `TensorRT-LLM/examples/llama/build.py` script provided by the TensorRT-LLM repo. You can read the script to see how the TensorRT-LLM APIs are used to define the model and load the trained weights.

The TensorRT-LLM team is working on high-level APIs to make the conversion steps easier.

In [16]:
!python TensorRT-LLM/examples/llama/build.py \
    --model_dir ./hf-weights/7B-chat  \
    --dtype float16 \
    --use_gpt_attention_plugin float16  \
    --paged_kv_cache \
    --remove_input_padding \
    --use_gemm_plugin float16  \
    --output_dir "./trt-engines/llama_7b/fp16/1-gpu"  \
    --max_input_len 2048 --max_output_len 512 \
    --use_rmsnorm_plugin float16  \
    --enable_context_fmha \
    --use_inflight_batching

[02/27/2024-23:16:50] [TRT-LLM] [I] Serially build TensorRT engines.
[02/27/2024-23:16:50] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 118, GPU 417 (MiB)
[02/27/2024-23:16:53] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1974, GPU +350, now: CPU 2228, GPU 767 (MiB)
[02/27/2024-23:16:53] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[02/27/2024-23:16:58] [TRT-LLM] [I] Loading HF LLaMA ... from ./hf-weights/7B-chat
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:05<00:00,  2.52s/it]
[02/27/2024-23:17:23] [TRT-LLM] [I] HF LLaMA loaded. Total time: 00:00:25
[02/27/2024-23:17:23] [TRT-LLM] [I] Loading weights from HF LLaMA...
[02/27/2024-23:17:59] [TRT-LLM] [I] Weights loaded. Total time: 00:00:35
[02/27/2024-23:17:59] [TRT-LLM] [I] Context FMHA Enabled
[02/27/2024-23:17:59] [TRT-LLM] [I] Remove Padding Enabled
[02/27/2024-23:17:59] [TRT-LLM] [I] Paged KV Cache Enabled
[02/27/2024-23:17:59] [TRT-LLM] [I] Build TensorRT engine l
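
The `--max_input_len 2048 --max_output_len 512` flags bound the longest sequence the engine has to support, which in turn drives how much KV-cache memory a single request can consume. The sketch below is a back-of-the-envelope estimate, not something the build script prints. It assumes the Llama 2 7B shape (32 layers, hidden size 4096, standard multi-head attention) and an fp16 cache, and it ignores paging overhead.

```python
def kv_cache_bytes_per_token(num_layers, hidden_size, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


# Assumed Llama 2 7B shape: 32 layers, hidden size 4096, fp16 (2 bytes per value).
per_token = kv_cache_bytes_per_token(num_layers=32, hidden_size=4096)
max_seq_len = 2048 + 512  # --max_input_len + --max_output_len from the build command
per_sequence_gib = per_token * max_seq_len / 2**30

print(f"{per_token / 2**20:.2f} MiB per token")                # 0.50 MiB per token
print(f"{per_sequence_gib:.2f} GiB per full-length sequence")  # 1.25 GiB per full-length sequence
```

Under these assumptions, a single request that uses the full 2560 tokens needs about 1.25 GiB of cache, which is why paged KV cache and batching strategy matter for serving.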

Now let's run inference with the llama-2-7b-chat model. TensorRT-LLM provides APIs for this as well; in this example we use the inference script `TensorRT-LLM/examples/llama/run.py` provided by the TensorRT-LLM repo.

In [3]:
!python TensorRT-LLM/examples/llama/run.py \
    --engine_dir=./trt-engines/llama_7b/fp16/1-gpu \
    --max_output_len 100 \
    --tokenizer_dir "llama-models" \
    --input_text "How do I count in French ? 1 un"

Running the float16 engine ...
Input: "How do I count in French ? 1 un"
Output: ", 2 deux, 3 trois, 4 quatre, 5 cinq, 6 six, 7 sept, 8 huit, 9 neuf, 10 dix.
How do you say "I love you" in French? Je t'aime.
How do you say "Thank you" in French? Merci.
How do you say "You're welcome" in French? De rien.
How do you say "Goodbye" in"


## 3. Set up Triton Inference Server for LLM inference

To start with Triton, we first need a model repository with a specific structure and set of configuration files. For simplicity, everything is already set up in the `triton_model_repo` folder in this example.

Here we will set up 2 LLM inference pipelines for the llama-2-7b-chat model: the vanilla PyTorch pipeline without optimization, and the optimized TensorRT-LLM pipeline.
- The Python pipeline uses Hugging Face APIs. Its model repo is located at `./triton_model_repo/llama_7b/python/llama_huggingface`
- The TensorRT-LLM pipeline consists of several separate components under `./triton_model_repo/llama_7b/python`: `preprocessing`, `tensorrt_llm` and `postprocessing`. An `ensemble` model chains the `preprocessing`, `tensorrt_llm` and `postprocessing` steps into a single pipeline.

In [5]:
!apt update && apt install tree

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:4 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [1889 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Get:6 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [1517 kB]
Get:7 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1074 kB]
Get:8 http://security.ubuntu.com/ubuntu jammy-security/multiverse amd64 Packages [44.6 kB]
Get:9 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages [1792 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages [17.5 MB]
Get:11 http://archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages [266 kB]
Ge

In [6]:
!tree ./triton_model_repo/

./triton_model_repo/
└── llama_7b
    └── python
        ├── ensemble
        │   ├── 1
        │   └── config.pbtxt
        ├── llama_huggingface
        │   ├── 1
        │   │   ├── __pycache__
        │   │   │   └── model.cpython-310.pyc
        │   │   └── model.py
        │   └── config.pbtxt
        ├── postprocessing
        │   ├── 1
        │   │   ├── __pycache__
        │   │   │   └── model.cpython-310.pyc
        │   │   └── model.py
        │   └── config.pbtxt
        ├── preprocessing
        │   ├── 1
        │   │   ├── __pycache__
        │   │   │   └── model.cpython-310.pyc
        │   │   └── model.py
        │   └── config.pbtxt
        └── tensorrt_llm
            ├── 1
       

Feel free to look at the `config.pbtxt` files in each component folder to understand how Triton configures the inference pipeline.
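
Triton expects each model to live in its own folder with at least one numeric version subfolder, and in this repo every model also ships a `config.pbtxt`. The sketch below checks a repository against that layout using only the standard library. The helper name `check_model_repo` is hypothetical, and the check is deliberately stricter than Triton itself, which can auto-complete some configurations.

```python
from pathlib import Path


def check_model_repo(repo_dir):
    """Return a list of layout problems found in a Triton model repository."""
    problems = []
    for entry in sorted(Path(repo_dir).iterdir()):
        if not entry.is_dir():
            continue
        if entry.name.startswith("."):
            # Hidden folders such as .ipynb_checkpoints are not models.
            problems.append(f"{entry.name}: hidden folder, not a model")
            continue
        if not (entry / "config.pbtxt").is_file():
            problems.append(f"{entry.name}: missing config.pbtxt")
        has_version = any(c.is_dir() and c.name.isdigit() for c in entry.iterdir())
        if not has_version:
            problems.append(f"{entry.name}: no numeric version folder")
    return problems


if __name__ == "__main__":
    repo = Path("./triton_model_repo/llama_7b/python")
    if repo.is_dir():
        print(check_model_repo(repo) or "Layout looks fine")
```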

Now we can start the Triton server to serve the 2 pipelines.

Note: 
- **Open a separate terminal and run the following commands there.**
- **Make sure there is no `.ipynb_checkpoints` folder under `triton_model_repo/llama_7b/python`. Jupyter can auto-generate this folder, and it can prevent Triton from launching correctly.**

In [15]:
## Remove ./triton_model_repo/llama_7b/python/.ipynb_checkpoints, which can mess up launch of triton
!rm -rf ./triton_model_repo/llama_7b/python/.ipynb_checkpoints
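
The command above only removes the top-level folder. If you have opened files inside the individual model folders, Jupyter may have created nested `.ipynb_checkpoints` folders as well. A minimal sketch for removing all of them recursively (the helper name `remove_ipynb_checkpoints` is hypothetical):

```python
import shutil
from pathlib import Path


def remove_ipynb_checkpoints(repo_dir):
    """Delete every .ipynb_checkpoints folder under repo_dir; return removed paths."""
    # Collect the matches first so we do not delete while still walking the tree.
    targets = [p for p in Path(repo_dir).rglob(".ipynb_checkpoints") if p.is_dir()]
    removed = []
    for path in targets:
        if path.exists():  # may already be gone if nested under an earlier match
            shutil.rmtree(path)
            removed.append(str(path))
    return removed


if __name__ == "__main__":
    print(remove_ipynb_checkpoints("./triton_model_repo/llama_7b/python"))
```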

In [16]:
## LAUNCH THIS COMMAND IN A SEPARATE TERMINAL - this server command needs to be kept alive

!tritonserver --model-repository=/workspace/notebooks/tensorrt-llm/triton_model_repo/llama_7b/python # --log-verbose 5

I0304 16:08:03.536612 6090 cache_manager.cc:480] Create CacheManager with cache_dir: '/opt/tritonserver/caches'
I0304 16:08:03.800840 6090 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f5dac000000' with size 268435456
I0304 16:08:03.801513 6090 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0304 16:08:03.807170 6090 model_config_utils.cc:680] Server side auto-completed config: name: "ensemble"
platform: "ensemble"
max_batch_size: 128
input {
  name: "text_input"
  data_type: TYPE_STRING
  dims: -1
}
input {
  name: "max_tokens"
  data_type: TYPE_UINT32
  dims: -1
}
input {
  name: "bad_words"
  data_type: TYPE_STRING
  dims: -1
}
input {
  name: "stop_words"
  data_type: TYPE_STRING
  dims: -1
}
input {
  name: "end_id"
  data_type: TYPE_UINT32
  dims: 1
  optional: true
}
input {
  name: "pad_id"
  data_type: TYPE_UINT32
  dims: 1
  optional: true
}
input {
  name: "top_k"
  data_type: TYPE_UINT32
  dims: 1
  optional: true

You can verify that the Triton server has launched successfully when you see terminal output like the following:
```

I0304 16:11:54.555571 7346 server.cc:662] 
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| ensemble          | 1       | READY  |
| llama_huggingface | 1       | READY  |
| postprocessing    | 1       | READY  |
| preprocessing     | 1       | READY  |
| tensorrt_llm      | 1       | READY  |
+-------------------+---------+--------+

I0304 16:11:54.593570 7346 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A100-SXM4-80GB
I0304 16:11:54.594500 7346 metrics.cc:710] Collecting CPU metrics
I0304 16:11:54.594653 7346 tritonserver.cc:2458] 
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.39.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /workspace/notebooks/tensorrt-llm/triton_model_repo/llama_7b/python                                                                                                                                             |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 0                                                                                                                                                                                                               |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0304 16:11:54.596304 7346 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I0304 16:11:54.596487 7346 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I0304 16:11:54.637582 7346 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
```
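
Instead of watching the terminal, you can also poll the server's HTTP readiness endpoints (`/v2/health/ready` for the server, `/v2/models/<name>/ready` for a single model) on port 8000. The sketch below uses only the standard library. The helper name `is_ready` is hypothetical, and the model names are taken from the status table above.

```python
import urllib.error
import urllib.request


def is_ready(base_url="http://localhost:8000", model=None, timeout=2.0):
    """Return True if Triton (or one model) reports ready over HTTP."""
    path = f"/v2/models/{model}/ready" if model else "/v2/health/ready"
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


if __name__ == "__main__":
    print("server:", is_ready())
    for name in ("ensemble", "llama_huggingface"):
        print(name, is_ready(model=name))
```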

Now let's send inference requests to the Triton server using the Triton client. To send an inflight inference request, run the following command, which uses the provided client-side script.

Here we are sending the request to the Python pipeline `llama_huggingface`. Feel free to change `--model_name` to `ensemble` to send the request to the TensorRT-LLM pipeline.

In [20]:
!python ./end_to_end_streaming_client.py -u localhost:8001 --model_name llama_huggingface --max_tokens 100  --prompt "I am going to"

b'I am going to be a little more specific about the types of things I would like to see in the future.\n1. More detailed information about the different types of weapons and armor. For example, what are the strengths and weaknesses of each type of weapon? How does the armor work? What are the different types of armor and how do they protect the player?\n2. More variety in the enemies. While the current enemies are interesting, I would like to see more variety in'
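
To make the request format more concrete, here is a sketch of a JSON body for Triton's HTTP inference endpoint (`POST /v2/models/ensemble/infer` on port 8000), built from the input names visible in the `ensemble` config printed at server start-up. It is an illustration only: the provided client uses gRPC streaming instead, the output tensor names are not shown in the truncated log, and if the model is served in decoupled (streaming) mode the plain HTTP endpoint will not work. The helper name `build_infer_payload` is hypothetical.

```python
import json


def build_infer_payload(prompt, max_tokens=100):
    """Build a JSON request body using input names from the ensemble config."""
    def tensor(name, datatype, value):
        # Shape [1, 1]: a batch of one request with one element per input.
        return {"name": name, "shape": [1, 1], "datatype": datatype, "data": [value]}

    return {
        "inputs": [
            tensor("text_input", "BYTES", prompt),      # TYPE_STRING maps to BYTES
            tensor("max_tokens", "UINT32", max_tokens),
            tensor("bad_words", "BYTES", ""),           # empty: no banned words
            tensor("stop_words", "BYTES", ""),          # empty: no stop words
        ]
    }


print(json.dumps(build_infer_payload("I am going to"), indent=2))
```

The optional inputs (`end_id`, `pad_id`, `top_k`) are left out here, so the server-side defaults apply.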


## 4. Benchmark Python pipeline vs TensorRT-LLM pipeline

Now we are ready to benchmark TensorRT-LLM against the Python pipeline for llama-2-7b-chat inference. A benchmark script, `identity_test_python_vs_trtllm.py`, is provided; it reports the total latency over a small set of prompts.

Run the following command to benchmark the Hugging Face Python pipeline:

In [21]:
!python ./identity_test_python_vs_trtllm.py \
    -u localhost:8001 \
    --max_input_len 100 \
    --dataset /workspace/notebooks/tensorrt-llm/datasets/mini_cnn_eval.json \
    -i grpc \
    --model_name "llama_huggingface"

[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 4 prompts.
[INFO] Total Latency: 3948.6 ms


Run the following command to benchmark the TensorRT-LLM pipeline:

In [22]:
!python ./identity_test_python_vs_trtllm.py \
    -u localhost:8001 \
    --max_input_len 100 \
    --dataset /workspace/notebooks/tensorrt-llm/datasets/mini_cnn_eval.json \
    -i grpc \
    --model_name "ensemble"

[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 4 prompts.
[INFO] Total Latency: 724.201 ms


The speed-up factor you observe depends on the GPU you are using; it is typically around 4-5x. In the run above, on an NVIDIA A100-SXM4-80GB, total latency dropped from 3948.6 ms to 724.2 ms, a speed-up of roughly 5.5x.
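
The speed-up can be computed directly from the two `Total Latency` lines printed by the benchmark script. A small sketch, using the log lines from the runs above (the helper name `parse_total_latency_ms` is hypothetical):

```python
import re


def parse_total_latency_ms(log_text):
    """Extract the value from a '[INFO] Total Latency: <x> ms' line."""
    match = re.search(r"Total Latency:\s*([\d.]+)\s*ms", log_text)
    if match is None:
        raise ValueError("no 'Total Latency' line found")
    return float(match.group(1))


python_ms = parse_total_latency_ms("[INFO] Total Latency: 3948.6 ms")
trtllm_ms = parse_total_latency_ms("[INFO] Total Latency: 724.201 ms")

print(f"speed-up: {python_ms / trtllm_ms:.2f}x")  # speed-up: 5.45x
```

With only 4 prompts the measurement is noisy, so it is worth repeating the benchmark a few times before drawing conclusions.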

## 5. Going further

We are not finished with TensorRT optimization yet. We can push it further with techniques such as:
- Parallelisation: pipeline and tensor
- Inflight dynamic batching
- Model quantization

We will not cover these in this tutorial, but feel free to explore & test these optimizations by referring to the original demo [here](https://github.com/scaleway/ai-pulse-nvidia-trt-llm/tree/main/docs).
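
As a rough illustration of why quantization and tensor parallelism are attractive, the sketch below estimates the weight memory per GPU at different precisions. It is simple arithmetic, not a measurement: it assumes a nominal 7 billion parameters, assumes weights are split evenly across GPUs, and ignores activations and the KV cache. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_weight, tensor_parallel=1):
    """Approximate per-GPU weight memory, ignoring activations and KV cache."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / tensor_parallel / 2**30


NUM_PARAMS = 7e9  # nominal size of the 7B model; the exact count differs slightly

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    single = weight_memory_gib(NUM_PARAMS, bits)
    split = weight_memory_gib(NUM_PARAMS, bits, tensor_parallel=2)
    print(f"{label}: {single:.1f} GiB on 1 GPU, {split:.1f} GiB per GPU with 2-way tensor parallelism")
```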