Description
Hi, I am trying to deploy a mistral-7b-instruct model on the Triton server but have run into difficulties. I have successfully converted my Mistral model with trtllm-build, following the llama example in the TensorRT-LLM repo, but I am not sure how to deploy it on the Triton server. There seem to be many ways to do so; I have tried creating a tensorrt_llm backend and an ensemble backend, but neither works. Could you advise on what I should do? I would like to create an endpoint so that I can pass a prompt to the Mistral model on the Triton server and get generated text back.
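To make the goal concrete, this is roughly how I hope to be able to query the server once it is up (just a sketch; I am assuming the generate endpoint and the text_input/max_tokens input names from the tensorrtllm_backend examples, which may not be right for my setup):
curl -X POST localhost:8000/v2/models/ensemble/generate \
     -d '{"text_input": "What is machine learning?", "max_tokens": 64}'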
Here are the steps I have taken:
After pulling the Mistral model weights, I converted the raw weights into the TensorRT-LLM checkpoint format:
python convert_checkpoint.py --model_dir Mistral-7B-Instruct-v0.2 \
--output_dir Mistral-7B-Instruct-TensorRT/ \
--dtype float16 \
--weight_only_precision int8
I then built the engine (this produces a config.json and a rank0.engine file):
trtllm-build --checkpoint_dir Mistral-7B-Instruct-TensorRT/ \
--output_dir Mistral-7B-Instruct-compiled/ \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--max_input_len 32256
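To deploy, I set up a model repository roughly like this (a sketch of what I attempted; the directory layout is my own, based on my reading of the tensorrtllm_backend examples, so it may not be what the backend expects):
# copy the built engine and its config into a version directory
mkdir -p triton_model_repo/tensorrt_llm/1
cp Mistral-7B-Instruct-compiled/rank0.engine triton_model_repo/tensorrt_llm/1/
cp Mistral-7B-Instruct-compiled/config.json triton_model_repo/tensorrt_llm/1/
# config.pbtxt for the model sits one level up:
# triton_model_repo/tensorrt_llm/config.pbtxt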
I then pulled the latest Triton server image (version 24.02) and tried to deploy the tensorrt_llm model from this repository, but hit the following error:
UNAVAILABLE: Invalid argument: unable to find backend library for backend 'tensorrtllm', try specifying runtime on the model configuration
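For reference, the config.pbtxt I used for the tensorrt_llm model looks roughly like this (trimmed to the relevant parts; the parameter names are copied from my reading of the tensorrtllm_backend templates, so they are assumptions on my part). I have not set the runtime field that the error message mentions, and I am not sure what it should point to (the backend shared library, perhaps?):
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 1
parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}
parameters: {
  key: "gpt_model_path"
  value: { string_value: "/models/tensorrt_llm/1" }
}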