# TensorRT for LLMs

This TIR image comes pre-packaged with:
- TensorRT-LLM scripts (/app/scripts)
- Triton server with the TensorRT-LLM backend (/app/scripts/launch_triton_server.py)
- TensorRT-LLM examples (/app/tensorrt_llm/examples/llama)

### Download model weights from HF

Before we proceed, let's upgrade huggingface_hub and download the model weights.

In [None]:
!pip install --upgrade huggingface_hub

In [None]:
!huggingface-cli download sarvamai/OpenHathi-7B-Hi-v0.1-Base

The model is downloaded to the $HOME/.cache folder. Run the following to find the snapshot directory that holds the base model (OpenHathi):

In [9]:
!ls  $HOME/.cache/huggingface/hub/models--sarvamai--OpenHathi-7B-Hi-v0.1-Base/snapshots

2cb5807b852028defa07c56c96a7ff5c11f8df0e


Assign the full path of that directory to BASE_MODEL_PATH (the last path component is the snapshot name printed above).

In [18]:
%env BASE_MODEL_PATH=/home/jovyan/.cache/huggingface/hub/models--sarvamai--OpenHathi-7B-Hi-v0.1-Base/snapshots/2cb5807b852028defa07c56c96a7ff5c11f8df0e

env: BASE_MODEL_PATH=/home/jovyan/.cache/huggingface/hub/models--sarvamai--OpenHathi-7B-Hi-v0.1-Base/snapshots/2cb5807b852028defa07c56c96a7ff5c11f8df0e


In [19]:
!echo $BASE_MODEL_PATH

/home/jovyan/.cache/huggingface/hub/models--sarvamai--OpenHathi-7B-Hi-v0.1-Base/snapshots/2cb5807b852028defa07c56c96a7ff5c11f8df0e
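The lookup and assignment above can also be done in one Python cell. The following is a minimal sketch, assuming the Hugging Face cache layout seen above (`models--<org>--<name>/snapshots/<revision>`). The helper name `latest_snapshot` is hypothetical, and it simply picks the most recently modified snapshot directory.

```python
import os
from pathlib import Path


def latest_snapshot(repo_id, cache_root=None):
    """Return the newest snapshot directory for repo_id in the HF hub cache.

    Assumes the layout <cache_root>/models--<org>--<name>/snapshots/<revision>.
    `latest_snapshot` is an illustrative helper, not part of huggingface_hub.
    """
    if cache_root is None:
        cache_root = Path.home() / ".cache" / "huggingface" / "hub"
    snapshots = Path(cache_root) / ("models--" + repo_id.replace("/", "--")) / "snapshots"
    candidates = [p for p in snapshots.iterdir() if p.is_dir()]
    if not candidates:
        raise FileNotFoundError(f"no snapshots found under {snapshots}")
    # If several revisions were downloaded, take the most recently modified one.
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Usage in the notebook (run after the download has finished):
# os.environ["BASE_MODEL_PATH"] = str(latest_snapshot("sarvamai/OpenHathi-7B-Hi-v0.1-Base"))
```

Variables set through `os.environ` in the kernel are visible to later `!` commands, so `$BASE_MODEL_PATH` can be used in the same way as with `%env`.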


### Build TensorRT LLM engine for Base Model

TensorRT-LLM requires the Hugging Face weights to be converted first into a checkpoint format that the TensorRT-LLM library understands. The command below generates checkpoint files in the /home/jovyan/ckpt folder.

In [None]:
!python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
--model_dir /home/jovyan/.cache/huggingface/hub/models--sarvamai--OpenHathi-7B-Hi-v0.1-Base/snapshots/2cb5807b852028defa07c56c96a7ff5c11f8df0e  \
--output_dir /home/jovyan/ckpt --dtype float16

Now we can build an engine from the checkpoint directory created above. The build produces an engine optimised for a specific GPU architecture, so the engine can only run on the same type of GPU it was built on. For example, an engine built on an A100 can only be used for inference on A100 cards.
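Because an engine is tied to the GPU it was built on, one way to avoid mixing engines up is to put the GPU name into the output directory name. This is a minimal sketch and not part of TensorRT-LLM: the helper names `current_gpu_name` and `engine_dir_for` are hypothetical, and the first one assumes `nvidia-smi` is available on the PATH.

```python
import re
import subprocess


def current_gpu_name():
    """Ask nvidia-smi for the name of the first GPU (needs the NVIDIA driver installed)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()[0].strip()


def engine_dir_for(base_dir, gpu_name):
    """Append a filesystem-friendly GPU tag to base_dir (illustrative naming scheme)."""
    tag = re.sub(r"[^a-z0-9]+", "-", gpu_name.lower()).strip("-")
    return f"{base_dir}-{tag}"


# Usage on a GPU machine:
# output_dir = engine_dir_for("/home/jovyan/base-engine", current_gpu_name())
```

The resulting path can then be passed as `--output_dir` to `trtllm-build`, so that engines built on different cards end up in different folders.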

The following command creates a plain TensorRT-LLM engine, with neither LoRA nor quantization. We will cover those topics in later sections.

In [None]:
!trtllm-build --checkpoint_dir /home/jovyan/ckpt \
            --output_dir /home/jovyan/base-engine \
            --gemm_plugin float16 


If the above command runs successfully, you will find the engine files in /home/jovyan/base-engine. The next step is to test the engine.
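Before running inference, it can help to confirm that the build actually wrote its output. The check below is a minimal sketch. It assumes that the build writes a `config.json` next to one or more `*.engine` files (for example `rank0.engine`), which may differ between TensorRT-LLM versions. The helper name `check_engine_dir` is hypothetical.

```python
from pathlib import Path


def check_engine_dir(engine_dir):
    """Return the sorted names of *.engine files, or raise if the output looks incomplete.

    Assumption: a finished build contains config.json plus at least one *.engine file.
    """
    engine_dir = Path(engine_dir)
    engines = sorted(p.name for p in engine_dir.glob("*.engine"))
    if not engines:
        raise FileNotFoundError(f"no .engine files in {engine_dir}")
    if not (engine_dir / "config.json").is_file():
        raise FileNotFoundError(f"config.json missing in {engine_dir}")
    return engines


# Usage:
# check_engine_dir("/home/jovyan/base-engine")
```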

In [None]:
!python /app/tensorrt_llm/examples/run.py --run_profiling --engine_dir "/home/jovyan/base-engine" \
              --max_output_len 125 \
              --tokenizer_dir "sarvamai/OpenHathi-7B-Hi-v0.1-Base" \
              --input_text "मैं एक अच्छा हाथी हूँ"

Note that a specific tokenizer is passed in the run command above. The script runs inference on the given engine using that tokenizer. If you have a custom tokenizer, you can pass the directory where it is stored instead.
The above command also reports the latency for batch size 1.

### Build TensorRT LLM engine for your custom model

If you have a fully fine-tuned model, the steps are the same as for the base model (shown above), except that the checkpoint is created from the directory where your custom model is stored. Note that this still refers to a full model, not LoRA/QLoRA adapters.

In [None]:
!python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
--model_dir /home/jovyan/custom-model-dir \
--output_dir /home/jovyan/ckpt --dtype float16

### Use TensorRT LLM engine with LoRA weights

When working with LoRA we have two options:
- merge the base model with the LoRA weights (adapters) and build an engine for the merged model
- build an engine with the base model and pass the LoRA weights (adapters) at run time

As it turns out, the second option offers the most flexibility as well as quality, so we will follow it below.
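To make the two options concrete, the toy example below applies the standard LoRA formulation, `W + (alpha / r) * B @ A`, in both ways and shows that they give the same output in full precision. This is an illustrative sketch only: all matrices and the scaling value are made up, and plain Python lists stand in for tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every element of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


# Toy sizes: hidden size 2, LoRA rank 1 (made-up values, exact in floating point).
W = [[1.0, 2.0], [3.0, 4.0]]   # base weight, shape (2, 2)
A = [[0.5, -1.0]]              # LoRA down-projection, shape (1, 2)
B = [[2.0], [4.0]]             # LoRA up-projection, shape (2, 1)
scaling = 0.5                  # alpha / rank
x = [[1.0], [2.0]]             # one input column vector

# Option 1: merge the adapter into the base weight, then apply the merged weight.
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

# Option 2: keep the base weight untouched and add the adapter output at run time.
y_runtime = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

print(y_merged, y_runtime)
```

Since the outputs match, the choice is mostly operational: with the second option the base engine stays unchanged and the adapter can be switched on or off per request.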

Here we still work with the full base model in FP16, and the LoRA weights are also FP16. In a later section, we will look at a more memory-efficient configuration (base model in INT4 + LoRA weights in FP16).

The advantage of quantization is a lower GPU memory requirement at inference time. It can, however, affect output quality, so it is recommended to test the quantized model. TensorRT-LLM also offers an easy way to test this. More on this in later sections.
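A rough back-of-the-envelope calculation shows where the memory saving comes from. The sketch below assumes a nominal 7 billion parameters and counts weights only. It ignores the KV cache, activations, and any quantization scales, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


params = 7e9  # nominal parameter count of a 7B model (assumption)
fp16_gib = weight_memory_gib(params, 16)
int4_gib = weight_memory_gib(params, 4)
print(f"FP16 weights: {fp16_gib:.1f} GiB, INT4 weights: {int4_gib:.1f} GiB")
```

Under these assumptions the INT4 weights take a quarter of the space of the FP16 weights.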

Let's see how to build an engine with LoRA.

In [None]:
# Update the lora_dir parameter to wherever your LoRA weights are stored. The engine will be created in the engine-lora folder.

!trtllm-build --checkpoint_dir /home/jovyan/ckpt \
            --output_dir /home/jovyan/engine-lora \
            --gemm_plugin float16 \
            --lora_plugin float16 \
            --lora_dir  "/home/jovyan/trained-lora" 

Test the engine with the LoRA weights and tokenizer.

In [None]:
!python /app/tensorrt_llm/examples/run.py --engine_dir "/home/jovyan/engine-lora" \
              --max_output_len 125 \
              --tokenizer_dir "sarvamai/OpenHathi-7B-Hi-v0.1-Base" \
              --input_text "मैं एक अच्छा हाथी हूँ" \
              --lora_dir "/home/jovyan/trained-lora/" \
              --lora_task_uids 0 \
              --no_add_special_tokens \
              --use_py_session

To test the same engine without LoRA, pass --lora_task_uids -1. This gives a good idea of how the model performs with and without LoRA, and lets you compare the quality of the results.

In [None]:
!python /app/tensorrt_llm/examples/run.py --engine_dir "/home/jovyan/engine-lora" \
              --max_output_len 125 \
              --tokenizer_dir "sarvamai/OpenHathi-7B-Hi-v0.1-Base" \
              --input_text "मैं एक अच्छा हाथी हूँ"  \
              --lora_dir "/home/jovyan/trained-lora/" \
              --lora_task_uids -1 \
              --no_add_special_tokens \
              --use_py_session
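The two runs above differ only in `--lora_task_uids`, so the comparison can be scripted. The following is a minimal sketch that reuses the flags from the commands above. The helper names `run_cmd` and `compare` are hypothetical, and the raw stdout of `run.py` is returned without any parsing.

```python
import subprocess


def run_cmd(engine_dir, lora_dir, task_uid, prompt,
            tokenizer="sarvamai/OpenHathi-7B-Hi-v0.1-Base", max_output_len=125):
    """Build the run.py argument list used above for a given LoRA task uid."""
    return [
        "python", "/app/tensorrt_llm/examples/run.py",
        "--engine_dir", engine_dir,
        "--max_output_len", str(max_output_len),
        "--tokenizer_dir", tokenizer,
        "--input_text", prompt,
        "--lora_dir", lora_dir,
        "--lora_task_uids", str(task_uid),
        "--no_add_special_tokens",
        "--use_py_session",
    ]


def compare(engine_dir, lora_dir, prompt):
    """Run with the adapter (uid 0) and without it (uid -1); return both raw outputs."""
    outputs = {}
    for label, uid in (("with_lora", 0), ("without_lora", -1)):
        result = subprocess.run(run_cmd(engine_dir, lora_dir, uid, prompt),
                                capture_output=True, text=True, check=True)
        outputs[label] = result.stdout
    return outputs


# Usage:
# results = compare("/home/jovyan/engine-lora", "/home/jovyan/trained-lora/", "मैं एक अच्छा हाथी हूँ")
```

Passing the arguments as a list avoids shell quoting problems with non-ASCII prompts.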

### Use TensorRT LLM engine with INT4 quantization and LoRA weights

The Hugging Face method of quantization with bitsandbytes does not work out of the box with TensorRT-LLM. Instead, we need to run a post-training quantization script to generate an INT4-quantized version (or any other variant) of the base model.

Note: you may change model_dir if you store the base model or custom model in another directory. This is expected to be a full model, not LoRA (adapter) weights.


In [None]:

!python /app/tensorrt_llm/examples/quantization/quantize.py --model_dir /home/jovyan/.cache/huggingface/hub/models--sarvamai--OpenHathi-7B-Hi-v0.1-Base/snapshots/2cb5807b852028defa07c56c96a7ff5c11f8df0e  \
                                   --output_dir /home/jovyan/int4-weights \
                                   --dtype float16 \
                                   --qformat int4  

We can now use the quantized version of the base model (from /home/jovyan/int4-weights) to build a TensorRT-LLM engine with LoRA.

In [None]:
# Update the lora_dir parameter to wherever your LoRA weights are stored. The engine will be created in the engine-int4 folder.

!trtllm-build --checkpoint_dir /home/jovyan/int4-weights \
            --output_dir /home/jovyan/engine-int4 \
            --gemm_plugin float16 \
            --lora_plugin float16 \
            --lora_dir  "/home/jovyan/trained-lora" 

Note: INT4 quantization is available in the latest TensorRT-LLM release (v0.11.0).

Once we have an INT4-quantized engine, we can run the test again with LoRA.

In [None]:
!python /app/tensorrt_llm/examples/run.py --engine_dir "/home/jovyan/engine-int4" \
              --max_output_len 125 \
              --tokenizer_dir "sarvamai/OpenHathi-7B-Hi-v0.1-Base" \
              --input_text "मैं एक अच्छा हाथी हूँ" \
              --lora_dir "/home/jovyan/trained-lora/" \
              --lora_task_uids 0 \
              --no_add_special_tokens \
              --use_py_session

### References
You can find more references and methods for further optimization here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md
