Question about sampler. It takes too much time #249

sleepwalker2017 · 2023-06-26T03:36:57Z

I noticed that the sampler stage launches a lot of repeated CUDA kernels. It looks like sampling is done in a for loop, with each kernel launched once per sequence. Why is that?
BTW, have you compared the performance with FasterTransformer? I didn't see anything about this.
Thank you!

Below is my code:

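import time

import nvtx
from vllm import LLM, SamplingParams

path = '/data/llm/hf-llama-7b/'
llm = LLM(model=path)
# max_tokens=1: generate a single token per prompt
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1)
# input_ids is defined elsewhere: a list of token-id lists (test inputs attached as input_ids.txt)
cnt = 1
start = time.time()
for i in range(cnt):
    with nvtx.annotate("generate", color="red"):
        outputs = llm.generate(prompt_token_ids=input_ids, sampling_params=sampling_params)
end = time.time()
# average wall-clock seconds per generate() call
prefill_ticks = (end - start) / cnt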


WoosukKwon · 2023-06-26T17:21:17Z

@sleepwalker2017 Thanks for trying out vLLM and reporting the performance issue! Yes, our sampler is indeed not well optimized yet. In particular, vLLM performs sampling for one request at a time, because each request can have different sampling parameters. For example, request A may use top-p sampling while request B in the same batch may use beam search with beam width 6. In such a case, it's not possible to process the sampling operations for the two requests simultaneously. Instead, vLLM processes one request at a time. This can incur non-negligible latency overhead when you run small models.

That being said, your profiling result is very weird. Could you provide more information about the input_ids you used (e.g., number of sequences, sequence length)?
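
For readers following along, the per-request sampling described in this reply can be sketched in plain Python. This is only an illustration, not vLLM's sampler: `Params`, `top_p_sample`, `sample_per_request`, and `group_by_params` are hypothetical names, and grouping requests with identical parameters is shown as one possible way to find requests that could share a sampling call, not as something vLLM does.

```python
import math
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    """Hypothetical stand-in for per-request sampling parameters."""
    temperature: float = 1.0  # must be > 0 in this sketch
    top_p: float = 1.0


def top_p_sample(logits, params, rng):
    """Sample one token id using temperature scaling and top-p filtering."""
    scaled = [x / params.temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= params.top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def sample_per_request(batch_logits, batch_params, rng):
    """One sampling call per request, like the loop described above."""
    return [top_p_sample(l, p, rng) for l, p in zip(batch_logits, batch_params)]


def group_by_params(batch_params):
    """Map each distinct parameter set to the request indices that use it."""
    groups = defaultdict(list)
    for idx, p in enumerate(batch_params):
        groups[p].append(idx)
    return dict(groups)
```

On a GPU, each call inside `sample_per_request` would correspond to its own set of kernel launches, which is why the cost grows with the number of requests in the batch. When every request uses the same parameters, as in the script above, `group_by_params` returns a single group.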

zhuohan123 · 2023-06-26T20:41:24Z

Please refer to #264 for the comparison with FasterTransformer.

sleepwalker2017 · 2023-06-27T02:03:33Z

Of course, I can provide the input_ids.

It's nothing special, actually. I use batch = 128, seq_len = 32.
I have uploaded my test inputs:
input_ids.txt
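
The attached file is not reproduced here. As a rough stand-in, a synthetic batch with the same shape (128 sequences of 32 tokens) could be built as below. The helper name `make_dummy_input_ids`, the use of random token ids, and the vocabulary size of 32000 (the LLaMA tokenizer size) are assumptions for illustration, not the contents of input_ids.txt.

```python
import random

VOCAB_SIZE = 32000  # assumed: LLaMA tokenizer vocabulary size


def make_dummy_input_ids(batch=128, seq_len=32, seed=0):
    """Build `batch` lists of `seq_len` random token ids (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        [rng.randrange(VOCAB_SIZE) for _ in range(seq_len)]
        for _ in range(batch)
    ]


# A list of token-id lists, the shape passed as prompt_token_ids above.
input_ids = make_dummy_input_ids()
```

Random ids do not form meaningful text, but with max_tokens = 1 only the shape of the batch should matter for timing.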

hmellor · 2024-03-08T10:29:28Z

Closing this issue as stale as there has been no discussion in the past 3 months.

If you are still experiencing the issue you describe, feel free to re-open this issue.

sleepwalker2017 changed the title ~~Question about sampler. It costs too much time!~~ Question about sampler. It takes too much time Jun 26, 2023

zhuohan123 mentioned this issue Jun 27, 2023

[Deprecated] vLLM Development Roadmap #244

Closed

76 tasks

AmoghM mentioned this issue Sep 4, 2023

Integrate Speculative decoding to speed up inferences #942

Closed

hmellor closed this as completed Mar 8, 2024
