
Conversation

@micah-wil (Contributor) commented Nov 25, 2025

Currently, the test bash .buildkite/scripts/run-prime-rl-test.sh fails in AMD CI. Looking at the Prime-RL repo (https://github.com/PrimeIntellect-ai/prime-rl), it seems clear that it is not expected to run on ROCm, as it claims: "Currently, you need at least one NVIDIA GPU to use PRIME-RL." There is a GH issue where AMD support will be tracked: PrimeIntellect-ai/prime-rl#961. In the meantime, this test should not run in AMD CI, so we skip it in this PR.

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@mergify mergify bot added the ci/build, nvidia, and rocm (Related to AMD ROCm) labels Nov 25, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request effectively addresses the issue of running NVIDIA-only Prime-RL tests in AMD CI environments by introducing a conditional skip. The logic to detect AMD GPUs using rocm-smi or rocminfo and exit early is clear and appropriate for the intended purpose. The change is well-placed within the script to prevent unnecessary setup for unsupported environments. No high or critical issues were identified in this change.
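
For reference, a minimal sketch of the kind of guard described above, assuming the rocm-smi/rocminfo detection the review mentions; the exact message and placement are illustrative, not the PR's actual diff:

# Sketch: exit early in run-prime-rl-test.sh when ROCm tooling is present,
# treating that as an AMD CI environment (illustrative, not the exact change).
if command -v rocm-smi >/dev/null 2>&1 || command -v rocminfo >/dev/null 2>&1; then
  echo "ROCm platform detected; Prime-RL currently requires NVIDIA GPUs. Skipping test."
  exit 0
fi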

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@micah-wil micah-wil changed the title [ROCm][CI] Skip NVIDIA-Only Prime-RL Test in AMD CI [ROCm][CI] Skip NVIDIA-Only Prime-RL Test in AMD CI + Updated Models List For Weight Loading Test Nov 25, 2025
@mgoin (Member) left a comment

Why does this test run on AMD CI at all? I don't see any mention of AMD/ROCm in the test-pipeline.yaml

- label: Prime-RL Integration Test # 15min
  timeout_in_minutes: 30
  optional: true
  num_gpus: 2
  working_dir: "/vllm-workspace"
  source_file_dependencies:
  - vllm/
  - .buildkite/scripts/run-prime-rl-test.sh
  commands:
    - bash .buildkite/scripts/run-prime-rl-test.sh

@micah-wil (Contributor, Author)

> Why does this test run on AMD CI at all? I don't see any mention of AMD/ROCm in the test-pipeline.yaml

Yeah, fair question. AMD CI is based on test-amd.yaml, and we have had this Prime-RL test there: https://github.com/vllm-project/vllm/blob/main/.buildkite/test-amd.yaml#L1452. I had considered just removing the test there, but we decided to skip it instead, since there is an indication that the library will support AMD eventually. Alternatively, if you would prefer that I not modify the test script itself, I could replace the command in our yaml file with an echo stating the reason for skipping, as sketched below. Let me know your thoughts. Thank you for taking a look!
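
A sketch of what that yaml-only alternative could look like in test-amd.yaml (the surrounding fields mirror the existing entry; the echo text is an illustrative assumption, not a proposed final wording):

- label: Prime-RL Integration Test # 15min
  timeout_in_minutes: 30
  optional: true
  num_gpus: 2
  working_dir: "/vllm-workspace"
  commands:
    - echo "Skipping Prime-RL integration test on ROCm; upstream currently requires NVIDIA GPUs (see PrimeIntellect-ai/prime-rl#961)"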

@tjtanaa (Collaborator) commented Nov 26, 2025

@micah-wil There has been AMD support in Prime-RL since PR PrimeIntellect-ai/prime-rl#365. They released Docker images at https://hub.docker.com/r/primeintellect/prime-rl-rocm/tags, though it seems they have stopped releasing new images. Still, we should skip for now, as it requires additional effort to understand the state of AMD support in that repository.

Regarding the model updates, is it possible for us to get permission for the HF account that we use for CI? IIRC, the HF access tokens are provided in our CI to make sure the gated models can be downloaded.

@micah-wil (Contributor, Author)

@tjtanaa Thanks for the additional context; I did see at one point that they had built AMD Dockers. However, in a later PR they removed the ROCm Docker build and updated their README to remove any promise of ROCm support: PrimeIntellect-ai/prime-rl#630. I made an attempt at getting it to run on an MI325 machine, to no avail. Let me try again to be absolutely certain that we need to skip this test. A proper solution will likely involve a PR to Prime-RL to at least add ROCm installation steps.

For the model updates, I would love to be able to add model permissions to our HF token; however, I have not yet figured out how to do that. If someone knows how, I would really appreciate some help (or a pointer to someone who can help).

@tjtanaa (Collaborator) commented Nov 27, 2025

@micah-wil If you want to come back to this issue later, we can first get the current changes into main to get CI green.

@micah-wil (Contributor, Author)

@tjtanaa Yeah, let's go ahead and do that, please. Are the changes fine as-is?

@tjtanaa (Collaborator) commented Nov 27, 2025

@micah-wil I tried to find whether the Weight Loading Multiple GPU Test is running on AMD CI in Buildkite, but I can't seem to find it. Did you find that the AMD CI cannot access the model? Can you access the gated model locally?

@zhewenl Do you know if this test is run, and whether the AMD CI can access a gated model like amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV?

@micah-wil (Contributor, Author)

@tjtanaa Here is an error log from our last full CI run for Weight Loading Multiple GPU Test - Large Models: https://buildkite.com/vllm/amd-ci/builds/1127/steps/canvas?sid=019ac110-89ea-4c00-ac39-9d4749bd8457. You'll see that it fails the amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV test with "Cannot access gated repo for url https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV/resolve/main/config.json." The other models tested pass, and amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV passes locally with my personal token.
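
(For anyone reproducing the access check outside CI, a minimal sketch using the huggingface_hub CLI; the token value below is a placeholder:)

# Try fetching the gated config with a specific token; this fails with a
# "Cannot access gated repo" error if the token lacks access.
huggingface-cli download amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV config.json \
  --token hf_xxx_placeholder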

This reverts commit c62724b.

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@micah-wil micah-wil changed the title [ROCm][CI] Skip NVIDIA-Only Prime-RL Test in AMD CI + Updated Models List For Weight Loading Test [ROCm][CI] Skip NVIDIA-Only Prime-RL Test in AMD CI Dec 3, 2025
@micah-wil (Contributor, Author)

@tjtanaa Does everything look good here now?

@tjtanaa (Collaborator) left a comment

LGTM

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Dec 9, 2025
@tjtanaa tjtanaa added the ready label (ONLY add when PR is ready to merge/full CI is needed) Dec 9, 2025
@tjtanaa tjtanaa enabled auto-merge (squash) December 9, 2025 01:38
@tjtanaa tjtanaa merged commit 78c7503 into vllm-project:main Dec 9, 2025
18 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Dec 9, 2025
@micah-wil micah-wil deleted the micah/CI_skip_primeRL branch December 9, 2025 04:39
mayoohee pushed a commit to mayoohee/vllm that referenced this pull request Dec 9, 2025
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: mayoohee <yiweiii.fang@gmail.com>