2 changes: 2 additions & 0 deletions examples/offline_inference/basic/classify.py
@@ -15,6 +15,8 @@ def parse_args():
model="jason9693/Qwen2.5-1.5B-apeach",
runner="pooling",
enforce_eager=True,
max_num_batched_tokens=131072,
Contributor

high

While adding dtype="bfloat16" is the correct fix for the NaN issue, this line introducing max_num_batched_tokens seems unrelated to the main purpose of the PR. Setting this to a large value like 131072 could cause out-of-memory (OOM) errors for users with less VRAM, which would prevent the example from running 'out of the box'. It's better to let vLLM use its default value, which is dynamically determined based on the hardware. To keep this change focused and ensure broader compatibility, please consider removing this line.

Contributor Author

@chenfengjin Sep 16, 2025


max_num_batched_tokens is also necessary, as the default of 4096 is smaller than max_model_len, which results in the following error:

  Value error, max_num_batched_tokens (4096) is smaller than max_model_len (131072). This effectively limits the maximum sequence length to max_num_batched_tokens and makes vLLM reject longer sequences. Please increase max_num_batched_tokens or decrease max_model_len. [type=value_error, input_value=ArgsKwargs((), {'runner_t...ync_scheduling': False}), input_type=ArgsKwargs]

dtype="bfloat16",
)
return parser.parse_args()
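
For readers weighing the two comments above, here is a minimal sketch of the other option the quoted error itself suggests: capping max_model_len instead of raising max_num_batched_tokens. The cap of 4096 is an assumed value (matching vLLM's default token budget), not something proposed in this PR, and the direct LLM(...) call is shown only for illustration.

```python
# Hypothetical alternative to the diff above (not part of this PR):
# cap max_model_len so the default max_num_batched_tokens (4096) already
# satisfies the validation check, which also bounds memory use on GPUs
# with less VRAM. Classification prompts are usually short, so a
# 4096-token context should be enough for this example.
from vllm import LLM

llm = LLM(
    model="jason9693/Qwen2.5-1.5B-apeach",
    runner="pooling",
    enforce_eager=True,
    max_model_len=4096,   # assumed cap instead of max_num_batched_tokens=131072
    dtype="bfloat16",     # the NaN fix this PR is actually about
)
```

Either direction satisfies the check quoted above; the trade-off is the maximum accepted sequence length versus peak memory use.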
