# TensorRT-LLM
The objective of this notebook is to demonstrate how to use TensorRT-LLM to optimize Llama-3.1-8B-Instruct, run inference, and explore several advanced optimization techniques.

## Overview of TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server. Models built with TensorRT-LLM can be executed on a wide range of configurations going from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism).

In [None]:
%%bash

git clone https://github.com/NVIDIA/TensorRT-LLM.git -b v0.16.0 --single-branch

## 1. Download the model from Hugging Face

In [None]:
%%bash

huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir Llama-3.1-8B-Instruct --local-dir-use-symlinks=False

## 2. Building TensorRT-LLM engine(s) for Llama-3.1-8B-Instruct

This section shows how to build TensorRT engine(s) from a Hugging Face model.
Before building an engine, note the support matrix for Llama-3 listed below:

- FP16
- FP8
- INT8 & INT4 Weight-Only
- SmoothQuant
- Groupwise quantization (AWQ/GPTQ)
- FP8 KV cache
- INT8 KV cache (+ AWQ/per-channel weight-only)
- Tensor Parallel

### 2.1 Build TensorRT-LLM engines - BF16

**TensorRT-LLM** builds TensorRT engine(s) from a Hugging Face (HF) checkpoint. First, we use the `convert_checkpoint.py` script to convert Llama-3.1-8B-Instruct into the TensorRT-LLM checkpoint format. Then we use the `trtllm-build` command to build the TensorRT engine.

The `trtllm-build` command builds TensorRT-LLM engines from TensorRT-LLM checkpoints. The checkpoint directory provides the model's weights and architecture configuration. The number of engine files equals the number of GPUs used to run inference.
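As a quick sanity check on the "one engine file per GPU" rule, the sketch below lists the files one would expect in a checkpoint or engine directory. The file names (`config.json`, `rank<N>.safetensors`, `rank<N>.engine`) are assumptions based on the layout TensorRT-LLM typically produces, and the helper names are hypothetical; adjust them if your version writes something different.

```python
from pathlib import Path
import tempfile


def expected_files(num_gpus: int, kind: str) -> list[str]:
    """Return the file names expected in a checkpoint or engine directory.

    Assumption: one `rank<N>` file per GPU plus a shared `config.json`.
    """
    suffix = {"checkpoint": "safetensors", "engine": "engine"}[kind]
    return ["config.json"] + [f"rank{r}.{suffix}" for r in range(num_gpus)]


def missing_files(directory: Path, num_gpus: int, kind: str) -> list[str]:
    """List the expected files that are absent from `directory`."""
    return [n for n in expected_files(num_gpus, kind) if not (directory / n).exists()]


# Demo on a throwaway directory that mimics a single-GPU engine folder.
with tempfile.TemporaryDirectory() as tmp:
    engine_dir = Path(tmp)
    (engine_dir / "config.json").write_text("{}")
    (engine_dir / "rank0.engine").write_bytes(b"")
    print(missing_files(engine_dir, num_gpus=1, kind="engine"))  # []
    print(missing_files(engine_dir, num_gpus=2, kind="engine"))  # ['rank1.engine']
```

Pointing `missing_files` at a real output directory such as `llama31/bf16` after a build is a cheap way to confirm the build finished for every rank.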

The `trtllm-build` command has a variety of options. In particular, the plugin-related options fall into two categories:

- Plugin options that require a data type (e.g., `gpt_attention_plugin`). You can
    - explicitly specify `float16`/`bfloat16`/`float32`, so that the plugin is enabled with the specified precision;
    - specify `auto`, so that the plugin is enabled with the precision inferred from the model dtype (i.e., the dtype specified during weight conversion); or
    - disable the plugin with `disable`.

- Other features that require a boolean (e.g., `context_fmha`, `paged_kv_cache`, `remove_input_padding`). You can enable or disable the feature by specifying `enable`/`disable`.
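The two option categories can be made concrete with a small helper that assembles a `trtllm-build` argument list and rejects values outside each category. This is only an illustrative sketch: `build_command` is a hypothetical helper, and the accepted values are taken from the description above rather than from the CLI's full argument parser.

```python
import shlex

# Values accepted by plugin options that take a data type (per the list above).
DTYPE_CHOICES = {"auto", "float16", "bfloat16", "float32", "disable"}
# Values accepted by boolean-style feature options.
TOGGLE_CHOICES = {"enable", "disable"}


def build_command(checkpoint_dir, output_dir, dtype_plugins=None, toggles=None):
    """Assemble a `trtllm-build` argument list, validating option values."""
    cmd = ["trtllm-build", "--checkpoint_dir", checkpoint_dir, "--output_dir", output_dir]
    for name, value in (dtype_plugins or {}).items():
        if value not in DTYPE_CHOICES:
            raise ValueError(f"{name}: expected one of {sorted(DTYPE_CHOICES)}, got {value!r}")
        cmd += [f"--{name}", value]
    for name, value in (toggles or {}).items():
        if value not in TOGGLE_CHOICES:
            raise ValueError(f"{name}: expected 'enable' or 'disable', got {value!r}")
        cmd += [f"--{name}", value]
    return cmd


cmd = build_command(
    "ckpt/bf16",
    "llama31/bf16",
    dtype_plugins={"gpt_attention_plugin": "auto", "gemm_plugin": "bfloat16"},
    toggles={"context_fmha": "enable", "paged_kv_cache": "enable"},
)
print(shlex.join(cmd))
# To actually run the build: subprocess.run(cmd, check=True)
```

Keeping the command as a list (rather than a single string) avoids shell-quoting problems when it is later passed to `subprocess.run`.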

Normally `trtllm-build` requires only a single GPU, but if you already have all the GPUs needed for inference, you can speed up engine building by enabling parallel builds with the `--workers` argument. Note that the workers feature currently supports only a single node.

The last step is to run inference using the `run.py` and `summarize.py` scripts.

In [1]:
%%bash

# Define model weight path, output checkpoint path and output engine path
HF_MODEL=Llama-3.1-8B-Instruct
CKPT_PATH=ckpt/bf16
ENGINE_PATH=llama31/bf16

python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir $HF_MODEL \
    --output_dir $CKPT_PATH \
    --dtype bfloat16 \
    --tp_size 1

trtllm-build \
    --checkpoint_dir $CKPT_PATH \
    --output_dir $ENGINE_PATH \
    --gemm_plugin auto

[TensorRT-LLM] TensorRT-LLM version: 0.16.0
0.16.0
[01/22/2025-17:33:04] [TRT-LLM] [W] Implicitly setting LLaMAConfig.tie_word_embeddings = False


230it [00:07, 29.60it/s]


Total time of reading and converting: 7.903 s
Total time of saving checkpoint: 23.163 s
Total time of converting checkpoints: 00:00:31
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[01/22/2025-17:33:40] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[01/22/2025-17:33:40] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[01/22/2025-17:33:40] [TRT-LLM] [I] Set gemm_plugin to auto.
[01/22/2025-17:33:40] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[01/22/2025-17:33:40] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[01/22/2025-17:33:40] [TRT-LLM] [I] Set nccl_plugin to auto.
[01/22/2025-17:33:40] [TRT-LLM] [I] Set lora_plugin to None.
[01/22/2025-17:33:40] [TRT-LLM] [I] Set moe_plugin to auto.
[01/22/2025-17:33:40] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[01/22/2025-17:33:40] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[01/22/2025-17:33:40] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[01/22/2025-17:33:40] [TRT-LLM] [I] Set context_fmha to True.
[01/22/2025-

#### Flag descriptions for `convert_checkpoint.py`:
- `model_dir`: path to the Hugging Face model directory
- `output_dir`: path to the directory in which to store the TensorRT-LLM checkpoint
- `dtype`: data type to use when converting the model to the TensorRT-LLM checkpoint format
- `tp_size`: tensor-parallel size, i.e., the number of GPUs the model weights are split across

#### Flag descriptions for `trtllm-build`:
- `checkpoint_dir`: path to the directory containing the TensorRT-LLM checkpoint used to build the TensorRT engine
- `output_dir`: path to the directory in which to store the built TensorRT engine
- `gemm_plugin`: data type for the GEMM plugin (`auto` infers it from the model dtype); enabling this plugin is required to prevent accuracy issues

### 2.2 Build TensorRT-LLM engines - INT8 KV cache + per-channel weight-only quantization
To maximize performance and reduce memory footprint, TensorRT-LLM allows models to be executed using different quantization modes. TensorRT-LLM supports INT4 or INT8 weights with FP16 activations (a.k.a. INT4/INT8 weight-only) as well as a complete implementation of the SmoothQuant technique. The cell below combines INT8 weight-only quantization (`--use_weight_only --weight_only_precision int8`) with an INT8 KV cache (`--int8_kv_cache`).
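To get a feel for the memory savings, the following back-of-the-envelope sketch estimates weight storage at different precisions. It assumes a rounded count of 8 billion parameters and ignores quantization scales, any layers left unquantized, activations, and the KV cache, so treat the numbers as rough lower bounds rather than measured footprints.

```python
def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring scales and activations."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 8e9  # assumption: rounded parameter count for an 8B model

for label, bits in [("bf16/fp16", 16), ("int8 weight-only", 8), ("int4 weight-only", 4)]:
    print(f"{label:>18}: {weight_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Halving the bits per weight halves the estimate, which is why INT8 weight-only roughly halves the weight footprint relative to BF16 and INT4 roughly quarters it.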

In [None]:
%%bash

pip install datasets==2.19

# Define model weight path, output checkpoint path and output engine path
HF_MODEL=Llama-3.1-8B-Instruct
CKPT_PATH=ckpt/int8
ENGINE_PATH=llama31/int8

python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir $HF_MODEL \
    --output_dir $CKPT_PATH \
    --dtype bfloat16 \
    --tp_size 1 \
    --int8_kv_cache \
    --use_weight_only \
    --weight_only_precision int8

trtllm-build \
    --checkpoint_dir $CKPT_PATH \
    --output_dir $ENGINE_PATH \
    --gemm_plugin auto

### 2.3 Build TensorRT-LLM engines - FP8 Post-Training Quantization [Optional]

The example below uses the NVIDIA Modelopt toolkit (TensorRT Model Optimizer, formerly AMMO: AlgorithMic Model Optimization) for the model quantization process. Although the V100 does not support the FP8 data type, we have included this section as a reference.
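Whether a GPU can run an FP8 engine depends on its CUDA compute capability: FP8 is available from 8.9 (Ada) upward, while the V100 reports 7.0. The sketch below encodes that rule; `supports_fp8` is a hypothetical helper, and the capability table lists only a few well-known GPUs for illustration.

```python
def supports_fp8(compute_capability: tuple[int, int]) -> bool:
    """Return True if the (major, minor) compute capability supports FP8.

    FP8 is available from compute capability 8.9 (Ada) upward.
    """
    return compute_capability >= (8, 9)


# A few well-known GPUs and their compute capabilities.
KNOWN_GPUS = {"V100": (7, 0), "A100": (8, 0), "L40S": (8, 9), "H100": (9, 0)}

for name, capability in KNOWN_GPUS.items():
    print(f"{name}: FP8 supported = {supports_fp8(capability)}")

# On a live machine with PyTorch installed, the capability can be queried with
# torch.cuda.get_device_capability(), which returns a (major, minor) tuple.
```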

In [None]:
%%bash

# Define model weight path, output checkpoint path and output engine path
HF_MODEL=Llama-3.1-8B-Instruct
CKPT_PATH=ckpt/fp8
ENGINE_PATH=llama31/fp8

python3 TensorRT-LLM/examples/quantization/quantize.py \
    --model_dir $HF_MODEL \
    --dtype bfloat16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir $CKPT_PATH \
    --calib_size 512 \
    --tp_size 1


trtllm-build --checkpoint_dir $CKPT_PATH \
             --output_dir $ENGINE_PATH \
             --gemm_plugin auto

### 2.4 Build TensorRT-LLM engines - Groupwise quantization (AWQ/GPTQ)
AWQ/GPTQ INT4 weight-only quantization is applied when the checkpoint is produced; `trtllm-build` then builds the engine from the quantized checkpoint. The NVIDIA Modelopt toolkit is used for AWQ weight quantization. Please see [examples/quantization/README.md](TensorRT-LLM/examples/quantization/README.md) for Modelopt installation instructions.
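To illustrate what "groupwise" means, here is a minimal sketch of symmetric round-to-nearest INT4 quantization with one scale per group of weights. It is not AWQ or GPTQ themselves (AWQ additionally searches for activation-aware scales and GPTQ compensates rounding error); it only shows the per-group scaling that a group size such as 128 refers to. The function names and toy weights are illustrative.

```python
def quantize_groupwise_int4(weights, group_size):
    """Symmetric round-to-nearest INT4 quantization with one scale per group."""
    if len(weights) % group_size:
        raise ValueError("len(weights) must be a multiple of group_size")
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group)
        scale = amax / 7 if amax > 0 else 1.0  # map the largest magnitude to +/-7
        scales.append(scale)
        # Clamp to the signed 4-bit range [-8, 7].
        q_values += [max(-8, min(7, round(w / scale))) for w in group]
    return q_values, scales


def dequantize(q_values, scales, group_size):
    """Reconstruct approximate weights from INT4 values and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


# Toy weights: four small values followed by four large ones.
weights = [0.02, -0.07, 0.01, 0.035, 1.4, -0.7, 0.35, -1.05]

q, s = quantize_groupwise_int4(weights, group_size=4)
restored = dequantize(q, s, group_size=4)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("groupwise  :", q, "max error:", max_err)

# With a single scale for all eight weights, the small values collapse to zero.
q_single, _ = quantize_groupwise_int4(weights, group_size=8)
print("single scale:", q_single)
```

Smaller groups track local weight magnitudes more closely at the cost of storing more scales, which is the trade-off the group size controls.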

In [None]:
%%bash

# Define model weight path, output checkpoint path and output engine path
HF_MODEL=Llama-3.1-8B-Instruct
CKPT_PATH=ckpt/int4_awq
ENGINE_PATH=llama31/int4_awq

# Quantize HF LLaMA 8B checkpoint into INT4 AWQ format
python3 TensorRT-LLM/examples/quantization/quantize.py \
    --model_dir $HF_MODEL \
    --dtype bfloat16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir $CKPT_PATH \
    --calib_size 4

trtllm-build --checkpoint_dir $CKPT_PATH \
             --output_dir $ENGINE_PATH \
             --gemm_plugin auto

## 3. Launch Inference Server

Open a terminal and run the following commands:

- Navigate to the working directory that contains the model and engine folders:

```bash
cd /workspace/
```

- Start the inference server with `trtllm-serve`, which exposes an OpenAI-compatible HTTP API:

```bash
HF_MODEL=Llama-3.1-8B-Instruct
ENGINE_PATH=llama31/bf16

trtllm-serve $ENGINE_PATH \
--tokenizer $HF_MODEL
```

In [5]:
%%bash

curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "engine",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Where is New York?"}
        ],
        "max_tokens": 16,
        "temperature": 0
    }' | jq -r '.choices[0].message.content'

New York is a state located in the northeastern United States. It is one of
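
The same request can be sent from Python using only the standard library. The sketch below mirrors the `curl` call above (same endpoint, payload, and response field); `build_chat_request` and `chat` are hypothetical helper names, and the base URL assumes the server is listening on `localhost:8000` as in the `curl` example.

```python
import json
import urllib.request


def build_chat_request(user_message, base_url="http://localhost:8000", max_tokens=16):
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "model": "engine",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(user_message, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Requires the server from the previous step to be running:
# print(chat("Where is New York?"))
```

Because `max_tokens` is 16, the reply is cut off mid-sentence, as in the output above; raise it to get a complete answer.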
