
Fix ROCM CI: skipped some tests, added checks, install torchvision#4218

Merged
Xia-Weiwen merged 3 commits into pytorch:main from Xia-Weiwen:skip_cpu_ut_on_rocm
Apr 2, 2026

Conversation

Xia-Weiwen (Collaborator) commented Apr 2, 2026

As the title says.
These CPU tests require the latest torch nightly. However, the ROCm CI uses an older torch nightly that lacks the changes these tests require.
An additional torch version check is also added to scaled_int_mm.
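The kind of torch version check mentioned for scaled_int_mm might look like the following sketch. This is not the PR's actual code: version_at_least and the sample version strings are illustrative assumptions.

```python
# Hedged sketch: compare an installed torch version string against a minimum,
# ignoring nightly/local build suffixes such as "+rocm6.2" or "+cu124".
def version_at_least(installed: str, minimum: str) -> bool:
    def key(v: str) -> list[int]:
        # Drop the local-build suffix, then keep only numeric content per part,
        # so "2.9.0.dev20251201+rocm6.2" compares as [2, 9, 0, 20251201].
        parts = []
        for piece in v.split("+")[0].split("."):
            digits = "".join(ch for ch in piece if ch.isdigit())
            parts.append(int(digits) if digits else 0)
        return parts

    return key(installed) >= key(minimum)
```

In a test file this could guard a feature that only newer nightlies provide, e.g. version_at_least(torch.__version__, "2.9.0") (the threshold here is made up for illustration).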

pytorch-bot Bot commented Apr 2, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4218

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 5e040d2 with merge base 6807454:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label Apr 2, 2026
pytorch-bot Bot commented Apr 2, 2026

Warning: Unknown label ciflow/rocm-mi300.
Currently recognized labels are

  • ciflow/benchmark
  • ciflow/tutorials
  • ciflow/rocm
  • ciflow/4xh100
  • ciflow/xpu

Please add the new label to .github/pytorch-probot.yml

Xia-Weiwen added the ciflow/rocm and module: not user facing labels and removed the CLA Signed, ciflow/rocm-mi300, and module: rocm labels Apr 2, 2026
@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 April 2, 2026 02:47
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review April 2, 2026 02:47
@Xia-Weiwen Xia-Weiwen requested a review from Copilot April 2, 2026 02:47
Xia-Weiwen (Collaborator, Author) commented:

@claude review

claude Bot commented Apr 2, 2026

Claude finished @Xia-Weiwen's task in 5m 44s.


Review of PR #4218: Fix CI: skip some CPU tests if torch is ROCM

  • Read PR diff and changed files
  • Analyze skip conditions and patterns
  • Post review feedback

Overall this is a reasonable approach to unblock ROCm CI. The changes are straightforward — skip CPU tests that depend on recent torch nightly features not available in the ROCm torch build. A few observations:

1. Misleading skip message on test_dynamic_quant_gpu_singleline

In test/quantization/test_quant_api.py:167-169, the test test_dynamic_quant_gpu_singleline is skipped with the message "Don't test CPU for ROCM version of torch", but the test name says "gpu". Looking at the implementation, the model defaults to CPU (device="cpu" in example_inputs), so the skip is functionally correct — the test does run on CPU despite the name. However, this is confusing. Consider either:

  • Using a more accurate skip message like "Skipping due to incompatibility with ROCm torch nightly", or
  • Noting that this test actually runs on CPU (it's not a "GPU" test in practice).

2. Inconsistent skip patterns across the PR

The PR uses two different styles for the same purpose:

  • Decorator style (test_integration.py:259): @unittest.skipIf(is_ROCM(), "Don't test CPU for ROCM version of torch")
  • Inline style (everywhere else): if device == "cpu" and is_ROCM(): self.skipTest(...)

The inline style makes sense for parametrized tests where the device varies, but test_per_token_linear_cpu uses the decorator style while test_dynamic_quant_gpu_singleline (also non-parametrized, always CPU) uses the inline style. It would be more consistent to use the decorator for both non-parametrized cases.

3. Existing skip_if_rocm decorator is available but unused

There's an existing skip_if_rocm decorator in torchao/testing/utils.py:74 (already imported in test_quant_api.py). For non-parametrized tests like test_per_token_linear_cpu and test_dynamic_quant_gpu_singleline, you could use @skip_if_rocm("CPU test requires latest torch nightly") for consistency with the rest of the codebase. However, skip_if_rocm uses pytest.skip() while the current code uses unittest.skipIf/self.skipTest — so this may be a deliberate choice for unittest compatibility.

4. Minor: Consider a comment explaining why

The skip message "Don't test CPU for ROCM version of torch" tells us what is happening but not why. The PR description explains it well (ROCm CI uses an older torch nightly), but a brief code comment would help future maintainers understand when these skips can be removed (e.g., # ROCm CI torch nightly doesn't include the latest CPU-related changes yet).

Summary

The changes are correct and will achieve the goal of unblocking ROCm CI. The issues above are minor style/consistency nits that don't affect correctness. The PR looks good to merge as-is, but addressing the inconsistencies would improve maintainability.
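The two skip styles the review contrasts can be sketched as follows. is_rocm_build() here is a hardcoded stand-in for the repo's is_ROCM() / skip_if_rocm helpers, not the actual torchao API, and the test names are illustrative.

```python
# Sketch of the two skip styles discussed above (decorator vs. inline).
import unittest


def is_rocm_build() -> bool:
    # Stand-in for a real build-level ROCm check (e.g. torch.version.hip);
    # hardcoded so this sketch runs without torch installed.
    return False


class ExampleTests(unittest.TestCase):
    # Decorator style: suits non-parametrized tests that always run on CPU.
    @unittest.skipIf(is_rocm_build(), "ROCm torch nightly lacks the latest CPU changes")
    def test_per_token_linear_cpu_style(self):
        self.assertTrue(True)

    # Inline style: needed when the device is parametrized at runtime.
    def test_linear_style(self, device="cpu"):
        if device == "cpu" and is_rocm_build():
            self.skipTest("ROCm torch nightly lacks the latest CPU changes")
        self.assertTrue(True)
```

Either style produces the same skip behavior; the decorator simply evaluates the condition at collection time instead of inside the test body.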


Copilot AI (Contributor) left a comment

Pull request overview

This PR adjusts the test suite to avoid running a subset of CPU-targeted quantization/kernel tests when the installed PyTorch build is ROCm, because ROCm CI is pinned to an older nightly that lacks required changes.

Changes:

  • Added ROCm-based skips for specific CPU test variants in quantization workflow tests.
  • Added ROCm-based skips for specific CPU test variants in kernel autotuner tests.
  • Added ROCm-based skips for specific CPU integration tests (including one decorator-based skip).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

  • test/quantization/test_quant_api.py: Skips a CPU-executed quantization test when running under ROCm.
  • test/quantization/quantize_/workflows/int8/test_int8_tensor.py: Skips multiple CPU parametrizations for int8 workflow tests under ROCm.
  • test/kernel/test_autotuner.py: Skips the CPU parametrization of an autotuner-related test under ROCm.
  • test/integration/test_integration.py: Skips specific CPU integration tests/paths under ROCm (decorator + runtime skip).


Comment on lines 167 to +169

    def test_dynamic_quant_gpu_singleline(self):
        if is_ROCM():
            self.skipTest("Don't test CPU for ROCM version of torch")

Copilot AI commented Apr 2, 2026

is_ROCM() currently requires torch.cuda.is_available(), so a ROCm-built torch running without a visible GPU would not be detected and this skip would not trigger. If the intent is to skip when torch is a ROCm build (as the PR description suggests), prefer checking torch.version.hip is not None or using the existing skip_if_rocm helper so the condition matches the message/reason.

Also consider making the skip reason more specific (e.g., depends on newer torch nightly) to avoid implying that CPU testing is unsupported on ROCm in general.

Suggested change:
  - def test_dynamic_quant_gpu_singleline(self):
  -     if is_ROCM():
  -         self.skipTest("Don't test CPU for ROCM version of torch")
  + @skip_if_rocm(
  +     "test_dynamic_quant_gpu_singleline not supported on ROCm builds (uses CUDA-only path)"
  + )
  + def test_dynamic_quant_gpu_singleline(self):
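The build-level check suggested here can be factored out as below. This is a sketch: the helper name is hypothetical, and the HIP/CUDA version strings on the stand-in objects are purely illustrative. It takes the version module as an argument so it can be demonstrated without torch installed; in real code you would pass torch.version (whose hip attribute is None on non-ROCm builds).

```python
# Sketch of a build-level ROCm check: inspect torch.version.hip rather than
# gating on torch.cuda.is_available(), so it also fires with no visible GPU.
from types import SimpleNamespace


def is_rocm_build(torch_version_module) -> bool:
    """True if this torch wheel is a ROCm (HIP) build, regardless of GPU visibility."""
    return getattr(torch_version_module, "hip", None) is not None


# Illustrative stand-ins for torch.version on different builds:
rocm_build = SimpleNamespace(hip="6.2.41133", cuda=None)
cuda_build = SimpleNamespace(hip=None, cuda="12.4")
```

In the test files themselves the call would be is_rocm_build(torch.version), or simply torch.version.hip is not None inline, as the review suggests.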
        sizes: tuple,
    ):
        """Test linear operation supports including shape and compile"""
        if device == "cpu" and is_ROCM():

Copilot AI commented Apr 2, 2026

Same concern as elsewhere: is_ROCM() is gated on torch.cuda.is_available(), so it won't skip on a ROCm build if no GPU is visible. If the goal is to skip CPU variants specifically for ROCm torch nightlies, consider using torch.version.hip is not None / skip_if_rocm for a build-based check, and keep the skip reason aligned with the actual failure mode (old ROCm nightly).

Suggested change:
  - if device == "cpu" and is_ROCM():
  + if device == "cpu" and torch.version.hip is not None:
        if device == "cuda" and not torch.cuda.is_available():
            self.skipTest(f"{device} not available")

        if device == "cpu" and is_ROCM():

Copilot AI commented Apr 2, 2026

is_ROCM() depends on torch.cuda.is_available(), so this won't skip on a ROCm build if the GPU is not visible to the process. If the skip is meant to apply to ROCm builds (per PR description), prefer torch.version.hip is not None or the existing skip_if_rocm helper for consistency and correctness.

Suggested change:
  - if device == "cpu" and is_ROCM():
  + if device == "cpu" and (torch.version.hip is not None):
        sqnr = compute_error(y_ref, y)
        self.assertTrue(sqnr >= 39.0, f"{sqnr=} too low")

    @unittest.skipIf(is_ROCM(), "Don't test CPU for ROCM version of torch")

Copilot AI commented Apr 2, 2026

Using @unittest.skipIf(is_ROCM(), ...) inherits is_ROCM()'s requirement that torch.cuda.is_available() is true. If this is intended to skip on ROCm torch builds (even when no GPU is visible), use a build-level check like torch.version.hip is not None or the repo's skip_if_rocm helper so the decorator matches the intended condition.

Suggested change:
  - @unittest.skipIf(is_ROCM(), "Don't test CPU for ROCM version of torch")
  + @skip_if_rocm("Don't test CPU for ROCM version of torch")
        ):
            self.skipTest("Need CUDA and SM80+ available.")

        if test_device == "cpu" and is_ROCM():

Copilot AI commented Apr 2, 2026

This skip uses is_ROCM(), which requires torch.cuda.is_available(). If the intent is to skip CPU export tests whenever torch is built with ROCm (regardless of GPU visibility), switch to a build-based condition such as torch.version.hip is not None / skip_if_rocm so the skip reliably triggers in all ROCm environments.

Suggested change:
  - if test_device == "cpu" and is_ROCM():
  + if test_device == "cpu" and getattr(torch.version, "hip", None) is not None:
jerryzh168 (Contributor) left a comment

I feel we should just have an allowlist of tests that run on ROCm; currently it runs all tests: https://github.com/pytorch/ao/blob/main/.github/workflows/regression_test_rocm.yml#L51

jerryzh168 (Contributor) left a comment

OK, we can land this to unblock CI first; we'll figure out the allowlist for ROCm a bit later.

meta-cla Bot added the CLA Signed label Apr 2, 2026

jerryzh168 (Contributor) commented Apr 2, 2026

Can you fix the torchvision issue in ROCm CI following da257b5 as well?

It should be fixed by adding torchvision here, I think: https://github.com/pytorch/ao/blob/main/.github/workflows/regression_test_rocm.yml#L25

Xia-Weiwen (Collaborator, Author) commented:

@jerryzh168 Sure.

@Xia-Weiwen Xia-Weiwen changed the title Fix CI: skip some CPU tests if torch is ROCM Fix ROCM CI: skipped some tests, added checks, install torchvision Apr 2, 2026
@Xia-Weiwen Xia-Weiwen merged commit c49a60e into pytorch:main Apr 2, 2026
20 checks passed

Labels

ciflow/rocm, ciflow/rocm-mi300, CLA Signed, module: not user facing

3 participants