
Conversation

@chenfengjin (Contributor) commented Sep 16, 2025

Purpose

The example examples/offline_inference/basic/classify.py fails to generate probabilities and prints NaN values:

Processed prompts: 100%|█████████████████████| 4/4 [00:00<00:00, 11.52it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt: 'Hello, my name is' 
Class Probabilities: [nan, nan] (size=2)
------------------------------------------------------------
Prompt: 'The president of the United States is' 
Class Probabilities: [nan, nan] (size=2)
------------------------------------------------------------
Prompt: 'The capital of France is' 
Class Probabilities: [nan, nan] (size=2)
------------------------------------------------------------
Prompt: 'The future of AI is' 
Class Probabilities: [nan, nan] (size=2)
------------------------------------------------------------

It works after adding two parameters to the LLM args (see the sketch after this list):

  • add dtype="bfloat16"
  • add max_num_batched_tokens=131072, which matches the max_model_len of the model used in the example, jason9693/Qwen2.5-1.5B-apeach
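
For context, here is a minimal sketch of the adjusted setup in classify.py; the non-added arguments are taken from the existing example, and the classify() call / print loop are simplified:

  # Sketch of the adjusted classify.py setup; only the two marked arguments are new.
  from vllm import LLM

  llm = LLM(
      model="jason9693/Qwen2.5-1.5B-apeach",
      runner="pooling",
      enforce_eager=True,
      dtype="bfloat16",               # added: avoids the NaN probabilities shown above
      max_num_batched_tokens=131072,  # added: matches the model's max_model_len
  )

  prompts = ["Hello, my name is", "The future of AI is"]
  for prompt, output in zip(prompts, llm.classify(prompts)):
      probs = output.outputs.probs
      print(f"Prompt: {prompt!r}\nClass Probabilities: {probs} (size={len(probs)})")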

Test Plan

Test Result



Signed-off-by: chenfengjin <1871653365@qq.com>
@mergify bot added the documentation label (Improvements or additions to documentation) on Sep 16, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request addresses an issue in the classify.py example where it produced NaN probabilities. The fix, which involves setting dtype="bfloat16", is appropriate and resolves the problem. However, the change also introduces a hardcoded max_num_batched_tokens value of 131072. This is unrelated to the primary fix and could lead to out-of-memory errors on systems with less VRAM, undermining the goal of making the example work out-of-the-box. I recommend removing this line to ensure broader compatibility.

model="jason9693/Qwen2.5-1.5B-apeach",
runner="pooling",
enforce_eager=True,
max_num_batched_tokens=131072,
Severity: high

While adding dtype="bfloat16" is the correct fix for the NaN issue, this line introducing max_num_batched_tokens seems unrelated to the main purpose of the PR. Setting this to a large value like 131072 could cause out-of-memory (OOM) errors for users with less VRAM, which would prevent the example from running 'out of the box'. It's better to let vLLM use its default value, which is dynamically determined based on the hardware. To keep this change focused and ensure broader compatibility, please consider removing this line.

@chenfengjin (Author) commented Sep 16, 2025


max_num_batched_tokens is also necessary, because the default of 4096 is smaller than the model's max_model_len, which results in the following error:

  Value error, max_num_batched_tokens (4096) is smaller than max_model_len (131072). This effectively limits the maximum sequence length to max_num_batched_tokens and makes vLLM reject longer sequences. Please increase max_num_batched_tokens or decrease max_model_len. [type=value_error, input_value=ArgsKwargs((), {'runner_t...ync_scheduling': False}), input_type=ArgsKwargs]
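
For reference, a possible alternative (a sketch only, not what this PR does) would be to cap max_model_len instead of raising max_num_batched_tokens, keeping the default batching budget; max_model_len is a standard vLLM LLM argument and the value below is illustrative:

  # Alternative sketch: limit the context length rather than enlarging the
  # batch token budget. The value 4096 is illustrative, not from the PR.
  from vllm import LLM

  llm = LLM(
      model="jason9693/Qwen2.5-1.5B-apeach",
      runner="pooling",
      enforce_eager=True,
      dtype="bfloat16",
      max_model_len=4096,  # keep <= the default max_num_batched_tokens (4096)
  )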

@chenfengjin changed the title from "add dtype to classify example so that it can be run out of box" to "add dtype and max_num_batched_tokens to classify example so that it can be run out of box" on Sep 16, 2025
@DarkLight1337 (Member) left a comment


Let's actually change the model to internlm/internlm2-1_8b-reward, like in #24853

cc @noooop

@noooop (Contributor) commented Sep 16, 2025

> Let's actually change the model to internlm/internlm2-1_8b-reward, like in #24853

This example demonstrates the classify API, and internlm/internlm2-1_8b-reward is a reward model.

The current jason9693/Qwen2.5-1.5B-apeach should be the most suitable model. (Almost all classification-related APIs in CI use jason9693/Qwen2.5-1.5B-apeach; to be honest, I don't know why it's used, but it seems to have become a tradition in vLLM.)


I did not encounter NaN when running this example locally:

python examples/offline_inference/basic/classify.py

Generated Outputs:
------------------------------------------------------------
Prompt: 'Hello, my name is'
Class Probabilities: [0.2699090242385864, 0.7300909757614136] (size=2)
------------------------------------------------------------
Prompt: 'The president of the United States is'
Class Probabilities: [0.002644370775669813, 0.9973556995391846] (size=2)
------------------------------------------------------------
Prompt: 'The capital of France is'
Class Probabilities: [0.0021786594297736883, 0.9978213310241699] (size=2)
------------------------------------------------------------
Prompt: 'The future of AI is'
Class Probabilities: [0.03444216027855873, 0.9655578136444092] (size=2)
------------------------------------------------------------

Almost all classification-related APIs in CI use jason9693/Qwen2.5-1.5B-apeach, and neither CI nor I have ever encountered NaN results.

@chenfengjin Are you using a device other than NVIDIA, AMD, or CPU? If this model had NaN issues, CI should definitely report an error.


jason9693/Qwen2.5-1.5B-apeach defaults torch_dtype to float32.

From my experience, using bfloat16 inference with a float32 pooling model leads to a significant drop in numerical precision. Using float16 inference has almost no loss of precision.
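
A minimal sketch of that suggestion, assuming the same example setup as above; only the dtype argument changes:

  # Sketch: run the float32 pooling model with float16 instead of bfloat16,
  # which (per the observation above) should lose far less precision.
  from vllm import LLM

  llm = LLM(
      model="jason9693/Qwen2.5-1.5B-apeach",
      runner="pooling",
      enforce_eager=True,
      dtype="float16",
  )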
