[CI] Add Decode Context Parallelism (DCP) test to CI #24487
Conversation
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: Ming Yang <minos.future@gmail.com>
Code Review
This pull request adds CI tests for Decode Context Parallelism (DCP) on Hopper and Blackwell architectures. The changes in the CI configuration and the test file are mostly correct and achieve this goal. I've found one potential issue in the test file that could make the test framework fragile for future changes and have suggested a fix.
```diff
 [
     params for model_id, settings in CP_TEXT_GENERATION_MODELS.items()
-    for params in settings.iter_params(model_id)
+    for setting in settings for params in setting.iter_params(model_id)
```
The list comprehension here assumes that the values in `CP_TEXT_GENERATION_MODELS` are always lists of `CPTestSettings`. This could be fragile if a new model is added with a single `CPTestSettings` object instead of a list, which would cause test parameter generation to fail. To make this more robust, you could handle both cases by ensuring you always iterate over a list.
Suggested change:
```diff
-    for setting in settings for params in setting.iter_params(model_id)
+    for setting in (settings if isinstance(settings, list) else [settings]) for params in setting.iter_params(model_id)
```
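To make the suggestion concrete, here is a self-contained sketch of the normalized comprehension. `CPTestSettings` and `CP_TEXT_GENERATION_MODELS` below are simplified stand-ins for illustration, not the real fixtures from `tests/distributed/test_context_parallel.py`.

```python
# Simplified stand-ins: the real CPTestSettings expands into full pytest params.
from dataclasses import dataclass


@dataclass
class CPTestSettings:
    tp_size: int
    dcp_size: int

    def iter_params(self, model_id: str):
        # Yield one parameter tuple per configuration for this model.
        yield (model_id, self.tp_size, self.dcp_size)


CP_TEXT_GENERATION_MODELS = {
    "model-a": [CPTestSettings(2, 1), CPTestSettings(2, 2)],  # list of settings
    "model-b": CPTestSettings(2, 2),                          # single settings object
}

# Normalize single objects to one-element lists so both shapes are handled.
params = [
    p
    for model_id, settings in CP_TEXT_GENERATION_MODELS.items()
    for setting in (settings if isinstance(settings, list) else [settings])
    for p in setting.iter_params(model_id)
]
print(params)
# [('model-a', 2, 1), ('model-a', 2, 2), ('model-b', 2, 2)]
```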
```yaml
##### H200 test #####
- label: Qwen MoE EP Test  # optional
  gpu: h200
  optional: true
  num_gpus: 2
  commands:
    - CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 /vllm-workspace/examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
```
```yaml
- label: Hopper Decode Context Parallelism Test  # optional
  gpu: h200
  optional: true
  num_gpus: 2
  commands:
    - pytest -v -s tests/distributed/test_context_parallel.py
```
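For context, the sketch below shows roughly the kind of behavior this CI step guards: generating with decode context parallelism enabled and comparing against a plain tensor-parallel baseline. The `decode_context_parallel_size` kwarg name, the model choice, and the exact-match assertion are assumptions for illustration; the real test in `tests/distributed/test_context_parallel.py` may structure the comparison differently (e.g. comparing logprobs within a tolerance).

```python
# Conceptual sketch only, not the actual CI test. Assumes a 2-GPU node and that
# the engine accepts a decode_context_parallel_size argument (assumption).
from vllm import LLM, SamplingParams

prompts = ["The capital of France is", "Context parallelism splits the KV cache"]
greedy = SamplingParams(temperature=0.0, max_tokens=32)

# Baseline: tensor parallelism only.
baseline = LLM(model="deepseek-ai/DeepSeek-V2-Lite-Chat",
               tensor_parallel_size=2, trust_remote_code=True)
ref = [o.outputs[0].text for o in baseline.generate(prompts, greedy)]
del baseline

# Same model with decode context parallelism enabled (kwarg name assumed).
dcp = LLM(model="deepseek-ai/DeepSeek-V2-Lite-Chat",
          tensor_parallel_size=2, decode_context_parallel_size=2,
          trust_remote_code=True)
out = [o.outputs[0].text for o in dcp.generate(prompts, greedy)]

# Real tests typically compare logprobs within a tolerance rather than exact text.
assert out == ref, "DCP output should match the TP-only baseline under greedy decoding"
```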
Merge this into one Hopper distributed tests step?
done
Signed-off-by: Ming Yang <minos.future@gmail.com>
Can you check that they actually run in CI by unblocking them?
Signed-off-by: Ming Yang <minos.future@gmail.com>
Head branch was pushed to by a user without write access
Signed-off-by: Ming Yang <minos.future@gmail.com>
Unblocked the added CI tests and they've passed. But the existing Qwen MoE test has been failing, and I can reproduce that locally. I wonder if this Qwen MoE test was ever executed as part of CI.
Sounds good, let's merge it first.
Purpose
Guard both Hopper and Blackwell support for DCP.
Test Plan
Local test:
CUDA_VISIBLE_DEVICES=0,1 pytest -v -s tests/distributed/test_context_parallel.py
Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.