-
-
Notifications
You must be signed in to change notification settings - Fork 11.9k
[ROCm][CI] Skip NVIDIA-Only Prime-RL Test in AMD CI #29420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
The pull request effectively addresses the issue of running NVIDIA-only Prime-RL tests in AMD CI environments by introducing a conditional skip. The logic to detect AMD GPUs using rocm-smi or rocminfo and exit early is clear and appropriate for the intended purpose. The change is well-placed within the script to prevent unnecessary setup for unsupported environments. No high or critical issues were identified in this change.
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
mgoin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this test run on AMD CI at all? I don't see any mention of AMD/ROCm in the test-pipeline.yaml
- label: Prime-RL Integration Test # 15min
timeout_in_minutes: 30
optional: true
num_gpus: 2
working_dir: "/vllm-workspace"
source_file_dependencies:
- vllm/
- .buildkite/scripts/run-prime-rl-test.sh
commands:
- bash .buildkite/scripts/run-prime-rl-test.sh
Yeah fair question. So AMD CI is based on |
|
@micah-wil There are AMD support for prime-RL since PR PrimeIntellect-ai/prime-rl#365 . They released docker images at https://hub.docker.com/r/primeintellect/prime-rl-rocm/tags . . Just that it seems they have stopped releasing more images. However, we should skip for now as it requires additional effort to understand the state of AMD support in that repository. Regarding to the model updates, is it possible for us to get permission for the HF account that we used for CI? iirc for our CI, the HF access tokens are provided to make sure CIs can download the gated models. |
|
@tjtanaa Thanks for the additional context, I did see at one point they had built AMD dockers. However, in a later PR, they removed the ROCm docker build and updated their README to remove any promise of ROCm support: PrimeIntellect-ai/prime-rl#630. I gave an attempt at getting it to run on an MI325 machine, to no avail. Let me try again to be absolutely certain that we need to skip this test. A proper solution will likely involve a PR to Prime-RL to at least add ROCm installation steps. For the model updates, I would love to be able to add model permissions to our HF token, however I have not yet figured out how to do that. If someone knows I would really appreciate some help with that (or a pointer to someone that can help). |
|
@micah-wil If you want to come back to this issue later, we can first get current changes into main to get the CI green. |
|
@tjtanaa Yeah let's go ahead and do that please. Are the changes fine as-is? |
|
@micah-wil I tried to find if the @zhewenl Do you know if this test is run, and can the AMD CI access the gated model like |
|
@tjtanaa Here is an error log from our last full CI run for |
This reverts commit c62724b. Signed-off-by: Micah Williamson <micah.williamson@amd.com>
|
@tjtanaa Does everything look good here now? |
tjtanaa
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Signed-off-by: Micah Williamson <micah.williamson@amd.com> Signed-off-by: mayoohee <yiweiii.fang@gmail.com>
Currently, the test
bash .buildkite/scripts/run-prime-rl-test.shfails in AMD CI. Looking at the Prime-RL repo https://github.com/PrimeIntellect-ai/prime-rl, it seems clear that it is not expected to run on ROCm as it claimsCurrently, you need at least one NVIDIA GPU to use PRIME-RL.There is a GH issue where AMD support will be tracked: PrimeIntellect-ai/prime-rl#961. In the meantime, this test should not run in AMD CI, so we skip it in this PR.