Minor fix in prefill cache example #2494
Merged
In offline_inference_with_prefix.py, we pass a batch of prompts together with prefix_pos to the llm.generate call. However, llm.generate batches all prompts and, if resources allow, sends them in a single batch, and the prefix is only cached after the first batch has been processed. We therefore need one generate call to compute the prefix and cache it, followed by a subsequent call that actually leverages the cached prefix. A sketch of the two-call pattern follows below.
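For illustration, a minimal sketch of the warm-up pattern described above. The model name, prompt contents, and the tokenizer access path are assumptions for the sake of a self-contained example; the prefix_pos plumbing mirrors the example script:

```python
from vllm import LLM, SamplingParams

# Shared prefix plus per-request suffixes (illustrative values).
prefix = "You are a helpful assistant. Answer the following question: "
prompts = [prefix + q for q in ["What is 2 + 2?", "Name a prime number."]]

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # assumed model choice
sampling_params = SamplingParams(temperature=0.0)

# -1 because the last prefix token may change once a suffix is appended.
prefix_pos = len(llm.llm_engine.tokenizer.encode(prefix)) - 1

# First call: computes the prefix attention and populates the cache.
llm.generate(prompts[0], sampling_params, prefix_pos=[prefix_pos])

# Subsequent call: the whole batch reuses the cached prefix.
outputs = llm.generate(prompts, sampling_params,
                       prefix_pos=[prefix_pos] * len(prompts))
for output in outputs:
    print(output.outputs[0].text)
```

The warm-up call only needs a single prompt: once its prefix blocks are cached, every later request sharing the same prefix_pos can reuse them.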
Note: This issue was identified while attempting to use the prefix cache with Mistral-7B, whose sliding-window attention is not supported by prefix caching. Nevertheless, the warm-up call will succeed, because only the initial prefix attention computation is executed.
Test
Tested with the Llama-7B model.