
Conversation

@micah-wil (Contributor) commented Nov 25, 2025

Currently, the test bash .buildkite/scripts/run-prime-rl-test.sh fails in AMD CI. Looking at the Prime-RL repo (https://github.com/PrimeIntellect-ai/prime-rl), it seems clear that it is not expected to run on ROCm, as it claims: "Currently, you need at least one NVIDIA GPU to use PRIME-RL." There is a GH issue where AMD support will be tracked: PrimeIntellect-ai/prime-rl#961. In the meantime, this test should not run in AMD CI, so we skip it in this PR.

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@mergify mergify bot added the ci/build, nvidia, and rocm (Related to AMD ROCm) labels Nov 25, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request effectively addresses the issue of running NVIDIA-only Prime-RL tests in AMD CI environments by introducing a conditional skip. The logic to detect AMD GPUs using rocm-smi or rocminfo and exit early is clear and appropriate for the intended purpose. The change is well-placed within the script to prevent unnecessary setup for unsupported environments. No high or critical issues were identified in this change.
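
For reference, a minimal sketch of the kind of guard described above, assuming the rocm-smi/rocminfo detection the review mentions; the exact message and placement are illustrative, not the PR's actual diff:

# Sketch: exit early in run-prime-rl-test.sh when ROCm tooling is present,
# treating that as an AMD CI environment (illustrative, not the exact change).
if command -v rocm-smi >/dev/null 2>&1 || command -v rocminfo >/dev/null 2>&1; then
  echo "ROCm platform detected; Prime-RL currently requires NVIDIA GPUs. Skipping test."
  exit 0
fi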

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@micah-wil micah-wil changed the title [ROCm][CI] Skip NVIDIA-Only Prime-RL Test in AMD CI [ROCm][CI] Skip NVIDIA-Only Prime-RL Test in AMD CI + Updated Models List For Weight Loading Test Nov 25, 2025
@mgoin (Member) left a comment

Why does this test run on AMD CI at all? I don't see any mention of AMD/ROCm in the test-pipeline.yaml

- label: Prime-RL Integration Test # 15min
  timeout_in_minutes: 30
  optional: true
  num_gpus: 2
  working_dir: "/vllm-workspace"
  source_file_dependencies:
  - vllm/
  - .buildkite/scripts/run-prime-rl-test.sh
  commands:
    - bash .buildkite/scripts/run-prime-rl-test.sh

@micah-wil (Contributor, Author)

> Why does this test run on AMD CI at all? I don't see any mention of AMD/ROCm in the test-pipeline.yaml

Yeah, fair question. AMD CI is based on test-amd.yaml, and we have had this Prime-RL test there: https://github.com/vllm-project/vllm/blob/main/.buildkite/test-amd.yaml#L1452. I had considered just removing the test there, but we decided to skip it instead, since there is an indication that the library will support AMD eventually. Alternatively, if you would prefer that I not modify the test script itself, I could replace the command in our yaml file with an echo stating the reason for skipping, as sketched below. Let me know your thoughts. Thank you for taking a look!
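
A sketch of what that yaml-only alternative could look like in test-amd.yaml (the surrounding fields mirror the existing entry; the echo text is an illustrative assumption, not a proposed final wording):

- label: Prime-RL Integration Test # 15min
  timeout_in_minutes: 30
  optional: true
  num_gpus: 2
  working_dir: "/vllm-workspace"
  commands:
    - echo "Skipping Prime-RL integration test on ROCm; upstream currently requires NVIDIA GPUs (see PrimeIntellect-ai/prime-rl#961)"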

@tjtanaa (Collaborator) commented Nov 26, 2025

@micah-wil There has been AMD support in Prime-RL since PR PrimeIntellect-ai/prime-rl#365. They released Docker images at https://hub.docker.com/r/primeintellect/prime-rl-rocm/tags, though it seems they have stopped releasing new images. Still, we should skip for now, as it requires additional effort to understand the state of AMD support in that repository.

Regarding the model updates, is it possible for us to get permission for the HF account that we use for CI? IIRC, the HF access tokens are provided in our CI to make sure the gated models can be downloaded.

@micah-wil (Contributor, Author)

@tjtanaa Thanks for the additional context; I did see at one point that they had built AMD Dockers. However, in a later PR they removed the ROCm Docker build and updated their README to remove any promise of ROCm support: PrimeIntellect-ai/prime-rl#630. I made an attempt at getting it to run on an MI325 machine, to no avail. Let me try again to be absolutely certain that we need to skip this test. A proper solution will likely involve a PR to Prime-RL to at least add ROCm installation steps.

For the model updates, I would love to be able to add model permissions to our HF token; however, I have not yet figured out how to do that. If someone knows how, I would really appreciate some help (or a pointer to someone who can help).

@tjtanaa (Collaborator) commented Nov 27, 2025

@micah-wil If you want to come back to this issue later, we can first get the current changes into main to get CI green.

@micah-wil (Contributor, Author)

@tjtanaa Yeah, let's go ahead and do that, please. Are the changes fine as-is?

@tjtanaa (Collaborator) commented Nov 27, 2025

@micah-wil I tried to find whether the Weight Loading Multiple GPU Test is running on AMD CI in Buildkite, but I can't seem to find it. Did you find that the AMD CI cannot access the model? Can you access the gated model locally?

@zhewenl Do you know if this test is run, and whether the AMD CI can access a gated model like amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV?

@micah-wil (Contributor, Author)

@tjtanaa Here is an error log from our last full CI run for Weight Loading Multiple GPU Test - Large Models: https://buildkite.com/vllm/amd-ci/builds/1127/steps/canvas?sid=019ac110-89ea-4c00-ac39-9d4749bd8457. You'll see that it fails the amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV test with "Cannot access gated repo for url https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV/resolve/main/config.json." The other models tested pass, and amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV passes locally with my personal token.
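
(For anyone reproducing the access check outside CI, a minimal sketch using the huggingface_hub CLI; the token value below is a placeholder:)

# Try fetching the gated config with a specific token; this fails with a
# "Cannot access gated repo" error if the token lacks access.
huggingface-cli download amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV config.json \
  --token hf_xxx_placeholder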

This reverts commit c62724b.

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@micah-wil micah-wil changed the title [ROCm][CI] Skip NVIDIA-Only Prime-RL Test in AMD CI + Updated Models List For Weight Loading Test [ROCm][CI] Skip NVIDIA-Only Prime-RL Test in AMD CI Dec 3, 2025
@micah-wil (Contributor, Author)

@tjtanaa Does everything look good here now?

@tjtanaa (Collaborator) left a comment

LGTM

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Dec 9, 2025
@tjtanaa tjtanaa added the ready label (ONLY add when PR is ready to merge/full CI is needed) Dec 9, 2025
@tjtanaa tjtanaa enabled auto-merge (squash) December 9, 2025 01:38
@tjtanaa tjtanaa merged commit 78c7503 into vllm-project:main Dec 9, 2025
18 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Dec 9, 2025
@micah-wil micah-wil deleted the micah/CI_skip_primeRL branch December 9, 2025 04:39
mayoohee pushed a commit to mayoohee/vllm that referenced this pull request Dec 9, 2025
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: mayoohee <yiweiii.fang@gmail.com>