I tested Baichuan using the TRT-LLM In-flight Triton Server and found many cases of repetition in the test dataset. When I reran those repetitive cases with offline inference and set repetition_penalty, some of the inputs stopped repeating at the end.
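For context, the offline test was along these lines. This is a minimal sketch assuming the ModelRunner API from the TRT-LLM examples; the engine directory and Baichuan checkpoint name are placeholders:

```python
# Minimal offline sketch, assuming the ModelRunner API from the TRT-LLM
# examples; engine dir and model name below are placeholders.
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained(
    "baichuan-inc/Baichuan2-13B-Chat", trust_remote_code=True)
runner = ModelRunner.from_dir(engine_dir="./baichuan_engine")

# One of the prompts that repeats when no penalty is set.
batch_input_ids = [tokenizer("<repetitive prompt>", return_tensors="pt").input_ids[0]]

output_ids = runner.generate(
    batch_input_ids,
    max_new_tokens=512,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
    repetition_penalty=1.3,  # values > 1.0 penalize already-generated tokens
)
# Output shape is [batch, beam, seq]; decode the first beam of the first request.
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))
```

With repetition_penalty > 1.0 here, the tail-end repetition disappears for some of the problematic inputs.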

However, during inference on the In-flight Triton Server, the repeated output does not change no matter how I adjust repetition_penalty.
In my tests, the top_k, top_p, and temperature sampling parameters all take effect, while presence_penalty and repetition_penalty do not.
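For reproduction, a request along these lines exercises the parameter. This is a minimal sketch using tritonclient against the tensorrtllm_backend ensemble model; the tensor names (text_input, max_tokens, repetition_penalty, presence_penalty, text_output) are assumptions taken from that backend's sample configs and may need adjusting to your config.pbtxt:

```python
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

def make_tensor(name, arr):
    # Wrap a numpy array as a Triton input tensor.
    t = grpcclient.InferInput(name, list(arr.shape), np_to_triton_dtype(arr.dtype))
    t.set_data_from_numpy(arr)
    return t

client = grpcclient.InferenceServerClient(url="localhost:8001")

inputs = [
    make_tensor("text_input", np.array([["<repetitive prompt>"]], dtype=object)),
    make_tensor("max_tokens", np.array([[512]], dtype=np.int32)),
    # Changing this value has no visible effect on the repeated output.
    make_tensor("repetition_penalty", np.array([[1.3]], dtype=np.float32)),
    make_tensor("presence_penalty", np.array([[0.5]], dtype=np.float32)),
]

result = client.infer("ensemble", inputs)
print(result.as_numpy("text_output"))
```

Sending different repetition_penalty values through this request produces identical repeated output, whereas varying top_k, top_p, or temperature changes the result as expected.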
Could you try to reproduce this issue? You can verify whether repetition_penalty takes effect in the TRT-LLM in-flight batching backend with the Baichuan model.