
Deployment of TensorRT-LLM Model on Triton Server #379


Description

@jasonngap1

Hi, I am trying to deploy a mistral-7b-instruct model on the Triton server, but I have run into difficulties. I have successfully converted my Mistral model using trtllm-build following the llama example in the TensorRT-LLM repo, but I am not sure how to deploy it on the Triton server. There seem to be many ways to do so; I have tried creating a tensorrt_llm backend and an ensemble backend, but neither works. Could you advise on what I should do? I would like to create an endpoint so that I can pass a prompt to the Mistral model on the Triton server and get generated text back.
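
For reference, the kind of request I am hoping to end up with is roughly the sketch below (assuming the ensemble model from the tensorrtllm_backend repo's all_models/inflight_batcher_llm templates is deployed under the name ensemble and Triton's HTTP generate endpoint is used; the prompt and token count are just placeholders):

curl -X POST localhost:8000/v2/models/ensemble/generate \
     -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'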

Here are the steps I have taken:
After pulling the Mistral model weights, I converted them into the TensorRT-LLM checkpoint format:

python convert_checkpoint.py --model_dir Mistral-7B-Instruct-v0.2 \
                             --output_dir Mistral-7B-Instruct-TensorRT/ \
                             --dtype float16 \
                             --weight_only_precision int8

I then built the engine (this produces a config.json and a rank0.engine file):

trtllm-build --checkpoint_dir Mistral-7B-Instruct-TensorRT/ \
             --output_dir Mistral-7B-Instruct-compiled/ \
             --gpt_attention_plugin float16 \
             --gemm_plugin float16 \
             --max_input_len 32256
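
To deploy this on Triton, my understanding is that the engine has to sit inside a model repository built from the templates in the tensorrtllm_backend repo. The commands below are a rough sketch of what I mean (the repository directory name and target paths are assumptions on my part), with the engine files copied into the tensorrt_llm model's version directory:

git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
# copy the template model repository (ensemble, preprocessing, postprocessing, tensorrt_llm)
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm triton_model_repo
# engine output from trtllm-build (config.json, rank0.engine) goes into the
# version directory of the tensorrt_llm model
cp Mistral-7B-Instruct-compiled/* triton_model_repo/tensorrt_llm/1/

The ${...} placeholders in each config.pbtxt (tokenizer path, batch size, engine_dir, and so on) then still need to be filled in, e.g. with the repo's tools/fill_template.py.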

I then pulled the latest Triton server version, 24.02, and tried to deploy the tensorrt_llm model, but I got the error: UNAVAILABLE: Invalid argument: unable to find backend library for backend 'tensorrtllm', try specifying runtime on the model configuration
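
My suspicion is that the plain 24.02 tritonserver image does not ship libtriton_tensorrtllm.so at all, and that the tensorrtllm backend is only included in the -trtllm- flavour of the container (or has to be built from the tensorrtllm_backend repo). If that is right, the launch would look roughly like the sketch below (mount paths are from my setup and just an assumption):

docker run --rm -it --gpus all --net host \
    -v $(pwd)/triton_model_repo:/models \
    nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 \
    tritonserver --model-repository=/models

with backend: "tensorrtllm" kept in triton_model_repo/tensorrt_llm/config.pbtxt. Is that the intended way, or should the runtime be specified differently in the model configuration?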
