add dtype and max_num_batched_tokens to classify example so that it can be run out of the box #24965
base: main
Conversation
Signed-off-by: chenfengjin <1871653365@qq.com>
Code Review
This pull request addresses an issue in the classify.py example where it produced NaN probabilities. The fix, which involves setting dtype="bfloat16", is appropriate and resolves the problem. However, the change also introduces a hardcoded max_num_batched_tokens value of 131072. This is unrelated to the primary fix and could lead to out-of-memory errors on systems with less VRAM, undermining the goal of making the example work out-of-the-box. I recommend removing this line to ensure broader compatibility.
model="jason9693/Qwen2.5-1.5B-apeach", | ||
runner="pooling", | ||
enforce_eager=True, | ||
max_num_batched_tokens=131072, |
While adding dtype="bfloat16" is the correct fix for the NaN issue, this line introducing max_num_batched_tokens seems unrelated to the main purpose of the PR. Setting this to a large value like 131072 could cause out-of-memory (OOM) errors for users with less VRAM, which would prevent the example from running 'out of the box'. It's better to let vLLM use its default value, which is dynamically determined based on the hardware. To keep this change focused and ensure broader compatibility, please consider removing this line.
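For illustration, a minimal sketch of what this suggestion would look like, assuming the example constructs its LLM roughly as in the diff above (the code and values here are illustrative, not the actual PR change):

```python
# Sketch of the reviewer's suggestion: keep only the dtype fix and let vLLM
# pick max_num_batched_tokens from its hardware-dependent default.
from vllm import LLM

llm = LLM(
    model="jason9693/Qwen2.5-1.5B-apeach",
    runner="pooling",
    enforce_eager=True,
    dtype="bfloat16",  # the fix for the NaN probabilities reported in this PR
)
```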
max_num_batched_tokens is also necessary, as the default of 4096 is less than max_model_len, which results in errors:
Value error, max_num_batched_tokens (4096) is smaller than max_model_len (131072). This effectively limits the maximum sequence length to max_num_batched_tokens and makes vLLM reject longer sequences. Please increase max_num_batched_tokens or decrease max_model_len. [type=value_error, input_value=ArgsKwargs((), {'runner_t...ync_scheduling': False}), input_type=ArgsKwargs]
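As a hedged sketch, the error message itself names two ways to satisfy this check; only one would be used in practice, and the values below are illustrative:

```python
from vllm import LLM

# Option A (what this PR does): raise max_num_batched_tokens to the model's
# 131072-token context window, as shown in the diff above.
# Option B (sketched here): cap max_model_len instead, keeping the default
# batching limit and reducing memory pressure on smaller GPUs.
llm = LLM(
    model="jason9693/Qwen2.5-1.5B-apeach",
    runner="pooling",
    enforce_eager=True,
    dtype="bfloat16",
    max_model_len=4096,  # illustrative value that satisfies the validation check
)
```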
This example demonstrates the classify API, and internlm/internlm2-1_8b-reward is a reward model, so the current jason9693/Qwen2.5-1.5B-apeach should be the most suitable model. (Almost all classification-related APIs in CI use jason9693/Qwen2.5-1.5B-apeach. To be honest, I don't know why it's used, but it seems to have become a tradition in vLLM.) I did not encounter NaN when running this example locally: python examples/offline_inference/basic/classify.py

Almost all classification-related APIs in CI use jason9693/Qwen2.5-1.5B-apeach, and neither CI nor I have ever encountered NaN results ??? @chenfengjin jason9693/Qwen2.5-1.5B-apeach defaults torch_dtype to float32. From my experience, using bfloat16 inference for a float32 pooling model leads to a significant drop in numerical precision, while float16 inference has almost no loss of precision.
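A minimal sketch of the alternative raised here, running the float32 checkpoint with float16 instead of bfloat16 (illustrative only; this is not part of the PR diff):

```python
from vllm import LLM

# The reviewer's point: jason9693/Qwen2.5-1.5B-apeach ships float32 weights, and
# float16 tends to preserve classification precision better than bfloat16.
llm = LLM(
    model="jason9693/Qwen2.5-1.5B-apeach",
    runner="pooling",
    enforce_eager=True,
    dtype="float16",
)
```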
Purpose
The example basic/classify.py failed to generate probabilities with jason9693/Qwen2.5-1.5B-apeach.
It works after adding the following parameters to the LLM args:
dtype="bfloat16"
max_num_batched_tokens=131072
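For context, a rough sketch of how the modified example would be exercised, assuming the offline classify API used by examples/offline_inference/basic/classify.py (prompts and output handling are illustrative):

```python
from vllm import LLM

prompts = [
    "Hello, my name is",
    "The capital of France is",
]

llm = LLM(
    model="jason9693/Qwen2.5-1.5B-apeach",
    runner="pooling",
    enforce_eager=True,
    dtype="bfloat16",
    max_num_batched_tokens=131072,
)

outputs = llm.classify(prompts)
for prompt, output in zip(prompts, outputs):
    probs = output.outputs.probs  # class probabilities; should no longer be NaN
    print(f"{prompt!r} -> {probs}")
```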
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.