Closed
Labels: bug
Description
System Info
A100
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I run this request:
curl -X POST my_ip:8000/v2/models/ensemble/generate_stream -d '{"text_input": "hello", "max_tokens":250, "temperature":0.00001, "top_p":0.95, "top_k":1, "repetition_penalty":1.2, "stream":true, "end_id":128009, "random_seed":1}'
But the stream is not received smoothly: instead of arriving token by token, roughly 100 tokens arrive at once every 3 seconds.
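One thing worth ruling out first is client-side buffering: curl buffers its own output unless invoked with `-N`/`--no-buffer`, which can make a perfectly smooth stream look bursty. To measure where the gaps actually occur, a minimal client-side sketch like the one below can print the inter-chunk arrival times. It assumes the endpoint emits Server-Sent-Events-style `data:` lines carrying JSON with a `text_output` field; adjust the field name to whatever your ensemble actually returns.

```python
import json
import time


def parse_sse_data(line: str):
    """Extract the JSON payload from one Server-Sent-Events line.

    Returns the decoded object, or None for non-data lines
    (blank keep-alives, comments, etc.).
    """
    prefix = "data: "
    if not line.startswith(prefix):
        return None
    return json.loads(line[len(prefix):])


def time_stream(url: str, payload: dict):
    """POST a streaming request and print the gap before each chunk,
    so server-side bursts are visible independent of curl's buffering."""
    import requests  # third-party: pip install requests

    last = time.monotonic()
    with requests.post(url, json=payload, stream=True) as resp:
        for raw in resp.iter_lines(decode_unicode=True):
            event = parse_sse_data(raw) if raw else None
            if event is not None:
                now = time.monotonic()
                # A smooth stream shows small, even gaps; bursts show
                # one large gap followed by many near-zero gaps.
                print(f"+{now - last:6.3f}s  {event.get('text_output', '')!r}")
                last = now


# Usage (hypothetical endpoint, same body as the curl call above):
# time_stream("http://my_ip:8000/v2/models/ensemble/generate_stream",
#             {"text_input": "hello", "max_tokens": 250, "stream": True})
```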
Expected behavior
The stream is received smoothly, token by token.
Actual behavior
The stream arrives in large bursts rather than smoothly.
Additional notes
If I remove dynamic_batching in:
https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.14.0/all_models/inflight_batcher_llm/postprocessing/config.pbtxt
the bursting problem is solved, but generation is still slow.
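For reference, "removing dynamic_batching" here means deleting (or tuning) the `dynamic_batching` block in the postprocessing model's config.pbtxt. A sketch of the relevant block, with an illustrative (not authoritative) queue-delay value; the actual contents in the v0.14.0 file may differ:

```
# Triton model config (config.pbtxt) fragment.
# With this block present, the postprocessing model waits to batch
# requests, which can delay individual stream chunks; deleting it
# makes each response pass through immediately.
dynamic_batching {
  max_queue_delay_microseconds: 100  # example value, check your config
}
```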