
Add option to completion API to truncate prompt tokens #3144

Merged: 13 commits merged into vllm-project:main on Apr 5, 2024

Conversation

@tdoublep (Contributor) commented Mar 1, 2024:

Hey vLLM team - thanks for the awesome project.

This PR adds an additional option to the OpenAI completion API that allows one to truncate the number of input tokens for an individual request. We find this feature extremely useful for benchmarking and performance evaluation.

Without this option, if we want precise control over the number of input tokens, we need to implement tokenization on the client side (e.g., in our load-test environment), which introduces a bunch of dependencies. In this sense we can live without this feature, but it is super convenient to be able to do the truncation on the server side.

I have tried to keep the changes to a minimum, but if there is interest I have also implemented this for the AsyncLLMEngine and LLMEngine.
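For illustration, a request exercising this option might look like the sketch below (assuming a vLLM OpenAI-compatible server running on localhost:8000 and the final parameter name truncate_prompt_tokens; the model name is a placeholder):

```python
# Sketch only: send a completion request with the truncation option to a
# locally running vLLM OpenAI-compatible server. Model name and port are
# placeholders, not something prescribed by this PR.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",
        "prompt": "a very long benchmark prompt " * 500,
        "max_tokens": 16,
        # Keep only the last 512 prompt tokens (left truncation).
        "truncate_prompt_tokens": 512,
    },
)
print(resp.json())
```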

@tdoublep changed the title from "Add option to completions API to truncate input tokens for each request" to "Add option to completion API to truncate input tokens for each request" on Mar 1, 2024
@njhill (Collaborator) left a comment:

Thanks @tdoublep! I'm wondering whether, to align with existing vLLM parlance, it might be better to rename this to truncate_prompt_tokens?

Review comments on vllm/entrypoints/openai/serving_engine.py (outdated; resolved)
tdoublep and others added 3 commits March 1, 2024 19:52
tokenizer_kwargs: more efficient allocation

Co-authored-by: Nick Hill <nickhill@us.ibm.com>
@tdoublep changed the title from "Add option to completion API to truncate input tokens for each request" to "Add option to completion API to truncate prompt tokens" on Mar 1, 2024
njhill added a commit to njhill/vllm that referenced this pull request Mar 19, 2024
vllm-project#2879 added support for using ray to offload tokenization from the asyncio event loop.

This PR extends that to support using a thread pool instead of ray, and makes that the default, with the default pool size determined based on the number of available CPU cores and the tensor parallel size.

The main thing to note is that separate tokenizer instances are used per thread. This is because officially the HF tokenizers are not thread-safe. In practice I think they are unless you're making use of padding/truncation, which we aren't currently but may want to soon (see for example vllm-project#3144).

Also includes some type hint additions to related parts of the code.

This replaces the original PR vllm-project#3206 from before vllm-project#2879 was reworked and merged.
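As a rough illustration of the per-thread tokenizer idea (a sketch, not vLLM's actual implementation; the model name is a placeholder):

```python
# Sketch: one tokenizer instance per worker thread, since HF fast tokenizers
# are not officially thread-safe, particularly once padding/truncation state
# is involved.
import threading
from concurrent.futures import ThreadPoolExecutor

from transformers import AutoTokenizer

_local = threading.local()

def _get_tokenizer(model_name: str = "facebook/opt-125m"):
    # Lazily create a tokenizer the first time each thread needs one.
    if not hasattr(_local, "tokenizer"):
        _local.tokenizer = AutoTokenizer.from_pretrained(model_name)
    return _local.tokenizer

def encode(prompt: str) -> list:
    return _get_tokenizer().encode(prompt)

with ThreadPoolExecutor(max_workers=4) as pool:
    token_id_lists = list(pool.map(encode, [f"prompt {i}" for i in range(8)]))
```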
@simon-mo (Collaborator) left a comment:

LGTM! I believe there are merge conflicts, since documentation was added to protocol.py.

@@ -183,6 +183,7 @@ class CompletionRequest(BaseModel):
     guided_json: Optional[Union[str, dict, BaseModel]] = None
     guided_regex: Optional[str] = None
     guided_choice: Optional[List[str]] = None
+    truncate_prompt_tokens: Optional[int] = None
A Collaborator commented on this diff:

nit: can you make this a constrained integer? https://docs.pydantic.dev/2.3/api/types/#pydantic.types.conint

@tdoublep (Contributor, Author) replied:

done
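For reference, the constrained field suggested above might look roughly like this (a sketch using pydantic's conint; the exact annotation merged into protocol.py may differ):

```python
# Sketch of a constrained integer field with pydantic; not necessarily the
# exact form merged into vllm/entrypoints/openai/protocol.py.
from typing import Optional

from pydantic import BaseModel, conint

class CompletionRequest(BaseModel):
    # ... other sampling / guided-decoding fields ...
    truncate_prompt_tokens: Optional[conint(ge=1)] = None  # must be >= 1 if set
```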

@tdoublep (Contributor, Author) commented:
CI errors do not look related to these changes

@njhill mentioned this pull request on Mar 27, 2024
@diego898 commented:
Apologies @tdoublep - it seems I submitted a simple comment as a review. That was not my intention.

I think this is great and would be very excited to have this merged, as we have also had to implement tokenization client-side to prevent vLLM from rejecting requests that are too long.

My previous comment is about the truncation side: for various reasons/formats we'd want to be able to trim from either the left or the right, and since it's already a parameter that has to be set, it may make sense to add both together.

Thanks!

@simon-mo (Collaborator) commented:

> My previous comment is about the truncation side: for various reasons/formats we'd want to be able to trim from either the left or the right, and since it's already a parameter that has to be set, it may make sense to add both together.

If this is a valid use case, maybe let's support it in this PR as well?

@diego898 commented:

For reference, these are the HF tokenizer docs around truncation_side. This PR hard-codes that to "left".
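For context, left-sided truncation with an HF tokenizer looks roughly like the sketch below (illustrative only, not the exact code in this PR; the model name is a placeholder):

```python
# Sketch of left-sided truncation with a HF tokenizer: keep only the last
# max_length tokens of the prompt. Illustrative, not the exact PR code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
tokenizer.truncation_side = "left"  # drop tokens from the beginning

ids = tokenizer(
    "a very long prompt " * 200,
    truncation=True,
    max_length=128,  # analogous to truncate_prompt_tokens
).input_ids
assert len(ids) <= 128
```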

@tdoublep (Contributor, Author) commented Apr 3, 2024:

@diego898 @simon-mo I made the truncation_side configurable via a new option in the ModelConfig. Please take a look and let me know what you think.

@njhill (Collaborator) commented Apr 3, 2024:

> My previous comment is about the truncation side: for various reasons/formats we'd want to be able to trim from either the left or the right, and since it's already a parameter that has to be set, it may make sense to add both together.

@diego898 @simon-mo could you give an example of a case where it would make sense to truncate on the right for autoregressive text generation? I can't think of one, but that could be a failure of imagination.

I don't think the fact that HF tokenizers support this is in itself a good reason to support it here; they are much more general and are used in many other contexts, such as data prep and training, and with different kinds of models.

@diego898 commented Apr 3, 2024:

@njhill - great point. I guess I was thinking it would depend on how someone structures their context window. For example, for RAG:

|--[----chat history----]-[System prompt]--[----------------Context docs------------------]-[current question]-|

You may want to trim only from the left. But if instead you did:

|[System prompt]--[----------------Context docs------------------]-[current question]---[----chat history----]-|

You may want to trim from the right? Or

|[System prompt]-[----chat history----]---[current question]--[----------------Context docs------------------]|

again from the right?

I'm not really suggesting any of these layouts are better/worse; several wouldn't even make sense... Just relaying what I was thinking when suggesting to make it configurable.

But if vLLM decides left-side truncation is the norm/default, that would also be fine, and end users can structure their context windows accordingly!

@njhill (Collaborator) commented Apr 4, 2024:

Thanks @diego898. TBH I don't think it would make sense to use this truncate option at all in conjunction with a system prompt. Some other form of truncation would need to be used, i.e. when applying the chat template, to retain the system prompt and then exclude the beginning of the user prompt or conversation.

And I don't think that right-truncation would be sensible either for any of the examples you gave. In a chat the last thing you would want to exclude is the most recent messages, and if you truncated in the middle of some arbitrary context docs, the LM would just attempt to continue writing that particular doc.

I would still vote for keeping this left-truncated only. If a concrete need arose in future it would still be easy to add the option to configure the side.
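To make the distinction concrete, the kind of chat-level truncation described above might look roughly like this (purely illustrative, not part of this PR; assumes the tokenizer has a chat template):

```python
# Sketch of chat-level truncation: keep the system prompt and drop the oldest
# non-system messages until the rendered prompt fits the token budget.
# Purely illustrative; not part of this PR.
def truncate_chat(messages, tokenizer, max_prompt_tokens):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest:
        ids = tokenizer.apply_chat_template(system + rest, tokenize=True)
        if len(ids) <= max_prompt_tokens:
            break
        rest = rest[1:]  # drop the oldest non-system message first
    return system + rest
```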

@tdoublep (Contributor, Author) commented Apr 4, 2024:

@simon-mo Do you agree with @njhill that it makes sense to hard-code truncation_side to left? If so, I will revert this branch to the earlier commit.

@diego898 commented Apr 4, 2024:

I apologize for the confusion my request caused! I 100% agree with you @njhill - that is in fact what we do - truncate the chat history and not either side.

I stretched my brain to try and describe a situation where right truncation may make sense, but didn't convince even myself!

I confess, my initial request was based solely on the fact that HF has it configurable.

I apologize @tdoublep for the extra work this caused you!

@simon-mo (Collaborator) commented Apr 4, 2024:

I trust @njhill to decide and merge.

@tdoublep (Contributor, Author) commented Apr 5, 2024:

@njhill I've reverted back to the version with the truncation side fixed to left and resolved some minor conflicts with changes on the main branch. I think it is ready to merge.

@tdoublep (Contributor, Author) commented Apr 5, 2024:
IDK why ruff is failing - the formatting checks are passing for me locally:

$ bash format.sh 
vLLM yapf: Done
vLLM codespell: Done
vLLM ruff:
vLLM isort: Done

@njhill (Collaborator) left a comment:

Thanks for the contribution @tdoublep!

@njhill merged commit 1d7c940 into vllm-project:main on Apr 5, 2024
35 checks passed
@satpalsr commented Apr 9, 2024:

Why not also add it for the Chat Completions API?
