Description
Hello, I built the Triton 23.12 container with the TensorRT-LLM 0.7.1 backend using the third build option in the Triton TensorRT-LLM guide, and deployed two models: Mistral 7B Instruct and Phind CodeLlama v2 34B. When I ask them simple questions that don't need long answers, like "what is 1+1" with max tokens set to 50, the output contains the correct answer, but the model keeps generating: it repeats the question pattern with "2+2 is 4", "3+3 is 6", and so on, until it reaches the max token limit.

I tried sending temperature 0 and stop_words, but it kept happening. I also tried prompt engineering to tell the model to stop after giving the answer, but that didn't work either. This happens with both generate and generate_stream; I send the requests through tensorrt_llm_bls. When I load the same two models in vLLM, they work just fine.

I was wondering if this is a feature or a bug, or if there is a way to stop it in config.pbtxt or somewhere else. Thanks in advance! 🙏
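
For reference, here is roughly what my request looks like. This is a minimal sketch: the host/port and the stop strings are placeholders from my setup, and I'm using the generate-endpoint field names as I understand them for the tensorrt_llm_bls model.

```python
# Sketch of the request I send to the Triton generate endpoint.
# "localhost:8000" and the stop_words values are placeholders from my deployment.
import requests

payload = {
    "text_input": "what is 1+1",
    "max_tokens": 50,        # the model generates up to this limit instead of stopping
    "temperature": 0.0,      # tried greedy decoding; repetition still happens
    "stop_words": ["</s>"],  # also tried plain-text stop strings
    "bad_words": [],
}

resp = requests.post(
    "http://localhost:8000/v2/models/tensorrt_llm_bls/generate",
    json=payload,
)
print(resp.json()["text_output"])
```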