
feat: implement the min_tokens sampling parameter #3124

Merged: 14 commits, Mar 25, 2024

Conversation

@tjohnson31415 (Contributor) commented on Feb 29, 2024

Adds the min_tokens sampling parameter to ensure a minimum number of generated tokens.

The implementation here is meant to align with https://github.com/IBM/text-generation-inference (IBM's fork of HF TGI). In particular, we want to ignore stop sequences and penalize the EOS token until min_tokens have been generated. Stop sequences can still be generated within the first min_tokens tokens, but generation will not terminate on them. stop_token_ids are treated like the EOS token and penalized so that they are not generated until min_tokens tokens have been generated.

Related PR that stalled: #1945
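For reference, a minimal usage sketch of the behavior described above, assuming the parameter lands on SamplingParams as in this PR (the model name, prompt, and stop settings are placeholders):

```python
from vllm import LLM, SamplingParams

# Placeholder model; any vLLM-supported model behaves the same way here.
llm = LLM(model="facebook/opt-125m")

params = SamplingParams(
    min_tokens=16,           # suppress EOS/stop tokens until 16 tokens exist
    max_tokens=64,
    stop=["\n\n"],           # stop sequences do not terminate generation early
    stop_token_ids=[50118],  # treated like EOS and penalized until min_tokens
)

outputs = llm.generate(["Write a haiku about tokens:"], params)
print(outputs[0].outputs[0].text)
```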

This can be used to prevent the EOS token (and other stop tokens) from being generated by the model when using min_tokens.

Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
@tjohnson31415 (Contributor, Author) commented on Feb 29, 2024

@simon-mo I am unable to add you as a reviewer but tagging you RE: #1945 (comment).

I made the change to use a logits processor to penalize the tokens, but the changes go beyond the processor itself: the stop-sequence check is also skipped while fewer than min_tokens tokens have been generated. Let me know if you had something else in mind for the processor. I'm also looking for a good way to inject the MinNewTokensProcessor automatically when min_tokens is specified in the sampling params.
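For context, a rough sketch of the per-request processor being discussed; the class name MinNewTokensProcessor comes from the comment above, and the (generated_token_ids, logits) callable signature is an assumption about how such a processor would be plugged in:

```python
import torch
from typing import List, Set


class MinNewTokensProcessor:
    """Hypothetical per-request processor: mask out EOS/stop token logits
    until at least min_tokens have been generated."""

    def __init__(self, min_tokens: int, stop_token_ids: Set[int]):
        self.min_tokens = min_tokens
        self.stop_token_ids = stop_token_ids

    def __call__(self, generated_token_ids: List[int],
                 logits: torch.Tensor) -> torch.Tensor:
        if len(generated_token_ids) < self.min_tokens:
            # Make every stop token unselectable for this decoding step.
            logits[list(self.stop_token_ids)] = -float("inf")
        return logits
```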

@simon-mo self-assigned this on Feb 29, 2024
@njhill (Collaborator) commented on Feb 29, 2024

@tjohnson31415 IMHO, since an explicit parameter is being introduced for this, it would be best for the EOS token suppression to be tied to that parameter without having to additionally pass a LogitsProcessor. There is already an ignore_eos parameter, which is similar but in practice probably only useful for performance testing; min_tokens wouldn't have much utility unless used in conjunction with the MinTokensProcessor anyhow.

An additional advantage is that it would then be possible to have a vectorized implementation (not necessarily in this first PR).

One question in this case is whether other provided stop tokens (if any) should be suppressed in addition to EOS. I'd lean towards yes, but it would be good to get input from others on that too.
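A tiny sketch of the suppression set being discussed, with both options included; the helper name is illustrative, not from the PR:

```python
from typing import List, Set


def tokens_to_penalize(eos_token_id: int,
                       stop_token_ids: List[int]) -> Set[int]:
    # Until min_tokens is reached, the model's EOS token and any
    # user-provided stop_token_ids are all masked out of the logits.
    return {eos_token_id, *stop_token_ids}
```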

@@ -139,6 +142,35 @@ def _get_bin_counts_and_mask(
return bin_counts, mask


def _apply_min_tokens_penalty(
A Collaborator commented on this hunk:

This looks great @tjohnson31415. But I think technically the token_ids_to_penalize should be determined per seq_group (i.e. also within the loop), since they may be different per seq group. The indexing gets a bit trickier, but I think it might be possible with scatter_ with src=-torch.inf. Or else the sequences that share the same list of tokens to penalize could be grouped.

The Contributor Author replied:

Heh, yup. Thanks for pointing that out. I still need to write some tests for this 😅. I pushed a fix that builds a list of coordinates to penalize within the loop, so the stop ids are per seq_group.

I was trying to use scatter initially, but couldn't figure out how to get it to work. In particular, scatter operates on a rectangular tensor and doesn't seem to have a way to "skip" rows we don't want to scatter into. A gather-modify-scatter (where we gather across all sequences and stop token ids) would work, but we'd still need to index into the gathered tensor to set the -inf values.
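An illustrative sketch of the coordinate-list approach described above; the loop structure, field names, and the seq_groups bookkeeping are assumptions rather than the PR's exact code:

```python
import torch
from typing import Dict, List


def apply_min_tokens_penalty(logits: torch.Tensor,
                             seq_groups: List[Dict]) -> torch.Tensor:
    """Mask stop-token logits for sequences that have not yet produced
    min_tokens. Each seq_group dict is assumed to carry its row index
    into `logits`, its generated-token count, its min_tokens value, and
    its own set of token ids to penalize."""
    rows: List[int] = []
    cols: List[int] = []
    for group in seq_groups:
        if group["num_generated"] < group["min_tokens"]:
            for token_id in group["tokens_to_penalize"]:
                rows.append(group["row"])
                cols.append(token_id)
    if rows:
        # Advanced indexing handles the ragged case that a plain scatter_
        # would not: each row may penalize a different set of token ids.
        logits[rows, cols] = -float("inf")
    return logits
```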

@njhill (Collaborator) left a comment:

Thanks @tjohnson31415, LGTM!

We should add a test for this too, though perhaps let's wait for confirmation from the maintainers that this would be accepted.

vllm/model_executor/layers/sampler.py (review thread outdated, resolved)
vllm/model_executor/logits_processors.py (review thread outdated, resolved)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
@simon-mo (Collaborator) commented:

Thank you for the contribution, and thanks to Nick for the review.

My original intention was that min_tokens could be implemented using a built-in logits processor so that the interface stays cleaner, but the current approach is fine as well.

There are some readability issues with _apply_min_tokens_penalty; please rework the list comprehension so it is easier for future devs to understand. And please add this to the OpenAI-compatible server as well (see protocols.py in the entrypoints/openai directory).
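A rough sketch of what threading min_tokens through the OpenAI-compatible request schema might look like; the class name, field defaults, and surrounding fields are assumptions about the entrypoints code, not its actual contents:

```python
from typing import Optional

from pydantic import BaseModel


class CompletionRequest(BaseModel):
    # ...existing OpenAI-compatible fields elided...
    prompt: str
    max_tokens: Optional[int] = 16
    # Extension: minimum number of tokens to generate before the EOS token
    # and stop sequences are allowed to end the completion
    # (forwarded to SamplingParams.min_tokens).
    min_tokens: Optional[int] = 0
```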

tjohnson31415 and others added 3 commits March 19, 2024 22:15
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
@tjohnson31415 marked this pull request as ready for review on March 20, 2024
* upstream/main:
  [Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config (vllm-project#3551)
  [Misc][Log] Add log for tokenizer length not equal to vocabulary size (vllm-project#3500)
  [🚀 Ready to be merged] Added support for Jais models (vllm-project#3183)
  Fix 1D query issue from `_prune_hidden_states` (vllm-project#3539)
  [PREFIX CACHING FOLLOW UP] OrderedDict-based evictor (vllm-project#3431)
  [BugFix] Hot fix in setup.py for neuron build (vllm-project#3537)
  Migrate `logits` computation and gather to `model_runner` (vllm-project#3233)
  [1/n][Chunked Prefill] Refactor input query shapes (vllm-project#3236)
  [1/n] Triton sampling kernel (vllm-project#3186)
  [Bugfix] Fix ROCm support in CMakeLists.txt (vllm-project#3534)
@njhill (Collaborator) commented on Mar 21, 2024

@simon-mo this should be ready now!

vllm/engine/llm_engine.py (review thread outdated, resolved)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
@simon-mo merged commit c13ad1b into vllm-project:main on Mar 25, 2024 (32 checks passed).
@tjohnson31415 deleted the min-new-tokens branch on March 25, 2024 at 17:16.
@njhill mentioned this pull request on Mar 25, 2024.
xjpang pushed a commit to xjpang/vllm that referenced this pull request Mar 31, 2024
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>