Minor fix in prefill cache example #2494

Merged

Conversation

@JasonZhu1313 (Contributor) commented Jan 18, 2024

In offline_inference_with_prefix.py, we pass a batch of prompts with prefix_pos to the llm.generate call. However, llm.generate batches all of the prompts and sends them at once if resources allow, and the prefix is only cached after that first batch has been processed. We therefore need one generate call up front to compute and cache the prefix, so that a subsequent call can actually leverage the cached prefix.

Note: This issue was identified while attempting to use prefix caching with Mistral-7B, which is not supported because the model uses sliding-window attention. Nevertheless, the call succeeds, because only the initial prefix attention computation is executed.

Test

Tested with the Llama-7B model.

@JasonZhu1313 (Contributor, Author) commented

@zhuohan123 @DouHappy @caoshiyi Thanks for adding the prefill cache capability. Could you help review this PR for a minor fix?

@zhuohan123 (Collaborator) left a comment

LGTM! Thanks for the fix!

@zhuohan123 zhuohan123 merged commit 5d80a91 into vllm-project:main Jan 18, 2024
9 of 16 checks passed
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Jan 18, 2024
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024