Skip to content

Conversation

mgoin
Copy link
Member

@mgoin mgoin commented Sep 22, 2025

Purpose

"Distributed Tests (B200)" has been failing due to this assert on the output being contiguous. This doesn't seem to be required and if I remove the assert the test is green

https://buildkite.com/vllm/ci/builds/31718/steps/canvas?sid=01996a6e-b6da-4752-9ec9-dfb3968b9a2e

Test Plan

Test Result

lm_eval --model local-completions --model_args model=deepseek-ai/DeepSeek-V2-Lite-Chat,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=500,tokenized_requests=False --tasks gsm8k --num_fewshot 5

# TP=2
vllm serve deepseek-ai/DeepSeek-V2-Lite-Chat -tp 2  
...
(EngineCore_DP0 pid=1740038) INFO 09-22 15:19:08 [kv_cache_utils.py:1087] GPU KV cache size: 4,924,288 tokens
...
Requesting API: 100%|███████████| 1319/1319 [00:21<00:00, 60.35it/s]
local-completions (model=deepseek-ai/DeepSeek-V2-Lite-Chat,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=500,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6490|±  |0.0131|
|     |       |strict-match    |     5|exact_match|↑  |0.6422|±  |0.0132|

# TP=2, DCP=2
vllm serve deepseek-ai/DeepSeek-V2-Lite-Chat -dcp 2 -tp 2
...
(EngineCore_DP0 pid=1723671) INFO 09-22 15:10:41 [kv_cache_utils.py:1083] Multiplying the GPU KV cache size by the dcp_world_size 2.
(EngineCore_DP0 pid=1723671) INFO 09-22 15:10:41 [kv_cache_utils.py:1087] GPU KV cache size: 9,809,408 tokens
...
Requesting API: 100%|███████████| 1319/1319 [00:25<00:00, 51.92it/s]
local-completions (model=deepseek-ai/DeepSeek-V2-Lite-Chat,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=500,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6573|±  |0.0131|
|     |       |strict-match    |     5|exact_match|↑  |0.6520|±  |0.0131|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Michael Goin <mgoin64@gmail.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes a failing distributed test by removing an assertion that checks for tensor contiguity. While this makes the test pass, it might introduce a silent correctness or performance issue in the subsequent reduce_scatter operation, which often relies on contiguous tensors. I've suggested ensuring the tensor is contiguous instead of just removing the check, which is a safer approach.

@mgoin mgoin added the ci-failure Issue about an unexpected test failure in CI label Sep 22, 2025
@mgoin mgoin added bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed labels Sep 22, 2025
@vllm-bot vllm-bot merged commit 78237e4 into main Sep 23, 2025
56 checks passed
@vllm-bot vllm-bot deleted the fix-test_context_parallel branch September 23, 2025 03:26
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
…project#25414)

Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: charlifu <charlifu@amd.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working ci-failure Issue about an unexpected test failure in CI ready ONLY add when PR is ready to merge/full CI is needed
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants