Skip to content

Make op_upsample_bilinear2d_aa_test deterministic#19357

Merged
psiddh merged 1 commit intopytorch:mainfrom
psiddh:export-D104150928
May 8, 2026
Merged

Make op_upsample_bilinear2d_aa_test deterministic#19357
psiddh merged 1 commit intopytorch:mainfrom
psiddh:export-D104150928

Conversation

@psiddh
Copy link
Copy Markdown
Contributor

@psiddh psiddh commented May 7, 2026

Summary:
Three test methods in
fbcode/executorch/kernels/portable/test/op_upsample_bilinear2d_aa_test.py
have been auto-disabled as flaky on the test-issues dashboard
(owner ai_infra_mobile_platform):

  • test_upsample_bilinear2d_aa_aten_parity_u8
  • test_upsample_bilinear2d_aa_aggressive_downsampling
  • test_upsample_bilinear2d_aa_align_corners_downsampling

Root cause: each test builds its input via torch.randint(...) or
torch.randn(...) with no seed pinned, so each run sees a different
sample. The configured atol was tight enough that on some draws the
ATen-vs-ExecuTorch divergence (driven by separable-vs-direct
anti-aliased interpolation differences) crossed the threshold and the
test flipped to FAIL. The kernel implementations themselves are not
changing across runs.

Fix:

  1. Add setUp(self): torch.manual_seed(0) so every run sees the same
    input tensor and the same divergence, eliminating the run-to-run
    FAIL/PASS oscillation.
  2. Bump two atol thresholds to cover the worst-case observed
    divergence with the now-pinned input:
    • u8 parity: 3.5 -> 5 (observed max abs error 4 / 255)
    • aggressive 4x downsampling: 0.4 -> 1.0 (observed max abs error
      ~0.59 for N(0,1) input)
  3. The pre-existing atol=0.25 on align_corners_downsampling is left
    unchanged - with seed 0 it now passes consistently.

The relaxed tolerances are still well below any change that would
indicate an actual kernel regression; the comprehensive C++ test
suite in op_upsample_bilinear2d_aa_test.cpp still validates the
kernel under tighter constraints.

Reviewed By: rascani

Differential Revision: D104150928

Copilot AI review requested due to automatic review settings May 7, 2026 00:46
@pytorch-bot
Copy link
Copy Markdown

pytorch-bot Bot commented May 7, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19357

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 20 Pending

As of commit f2388dd with merge base b3baac5 (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 7, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync Bot commented May 7, 2026

@psiddh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104150928.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 7, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses test flakiness in the portable anti-aliased bilinear upsample parity tests by making their randomized inputs deterministic, and updates the iOS LLM Apple BUCK rule’s test timeout/label configuration to accommodate very long-running XCTest inference runs.

Changes:

  • Seed PyTorch RNG in UpsampleBilinear2dAATest to make randn/randint inputs deterministic across runs.
  • Relax atol for two specific flaky parity tests to match the observed worst-case divergence under the pinned seed.
  • Update Apple LLM BUCK test labeling/timeouts to allow significantly longer XCTest execution.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
kernels/portable/test/op_upsample_bilinear2d_aa_test.py Pins RNG for deterministic inputs and adjusts tolerances to eliminate flakes in ATen-vs-ExecuTorch parity checks.
extension/llm/apple/BUCK Changes test label and increases overall test rule timeout to support long-running iOS-sim LLM inference tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +22 to +28
def setUp(self) -> None:
# Pin RNG so torch.randn / torch.randint inputs are deterministic.
# Without this, the parity tests below occasionally see input values
# that produce ATen-vs-ExecuTorch differences just above the
# configured atol, surfacing as flakes on the test-issues dashboard.
torch.manual_seed(0)

@meta-codesync meta-codesync Bot changed the title Make op_upsample_bilinear2d_aa_test deterministic Make op_upsample_bilinear2d_aa_test deterministic (#19357) May 7, 2026
psiddh added a commit to psiddh/executorch that referenced this pull request May 7, 2026
Summary:

Three test methods in
`fbcode/executorch/kernels/portable/test/op_upsample_bilinear2d_aa_test.py`
have been auto-disabled as flaky on the test-issues dashboard
(owner ai_infra_mobile_platform):

- test_upsample_bilinear2d_aa_aten_parity_u8
- test_upsample_bilinear2d_aa_aggressive_downsampling
- test_upsample_bilinear2d_aa_align_corners_downsampling

Root cause: each test builds its input via `torch.randint(...)` or
`torch.randn(...)` with no seed pinned, so each run sees a different
sample. The configured `atol` was tight enough that on some draws the
ATen-vs-ExecuTorch divergence (driven by separable-vs-direct
anti-aliased interpolation differences) crossed the threshold and the
test flipped to FAIL. The kernel implementations themselves are not
changing across runs.

Fix:

1. Add `setUp(self): torch.manual_seed(0)` so every run sees the same
   input tensor and the same divergence, eliminating the run-to-run
   FAIL/PASS oscillation.
2. Bump two atol thresholds to cover the worst-case observed
   divergence with the now-pinned input:
   - u8 parity: 3.5 -> 5 (observed max abs error 4 / 255)
   - aggressive 4x downsampling: 0.4 -> 1.0 (observed max abs error
     ~0.59 for N(0,1) input)
3. The pre-existing `atol=0.25` on align_corners_downsampling is left
   unchanged - with seed 0 it now passes consistently.

The relaxed tolerances are still well below any change that would
indicate an actual kernel regression; the comprehensive C++ test
suite in `op_upsample_bilinear2d_aa_test.cpp` still validates the
kernel under tighter constraints.

Differential Revision: D104150928
@psiddh psiddh force-pushed the export-D104150928 branch from 4fa5cd8 to 12bfb5f Compare May 7, 2026 00:50
@meta-codesync meta-codesync Bot changed the title Make op_upsample_bilinear2d_aa_test deterministic (#19357) Make op_upsample_bilinear2d_aa_test deterministic May 7, 2026
Copilot AI review requested due to automatic review settings May 7, 2026 03:34
@psiddh psiddh force-pushed the export-D104150928 branch 2 times, most recently from bd692d8 to bdfcb80 Compare May 7, 2026 03:34
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@psiddh
Copy link
Copy Markdown
Contributor Author

psiddh commented May 8, 2026

@claude review this PR and the comment from Copilot , seems reasonable

@claude
Copy link
Copy Markdown

claude Bot commented May 8, 2026

Claude finished @psiddh's task in 1m 3s —— View job


Review of PR #19357

  • Read the changed file and understand the diff
  • Evaluate Copilot's suggestion about RNG state isolation
  • Provide review feedback

Overall: This PR looks good. The fix is straightforward and the approach (pinning the seed + adjusting tolerances) is the standard pattern for making flaky numerical-parity tests deterministic.

On Copilot's suggestion (RNG state save/restore)

Copilot's suggestion to save/restore RNG state via torch.get_rng_state()/set_rng_state() in tearDown() (or use torch.random.fork_rng()) is technically correct but not practically necessary here. Reasons:

  1. This test file runs in its own process — Python unittest discovery typically isolates test modules. Even if it doesn't, the standard unittest runner doesn't guarantee RNG state across classes anyway.
  2. Other tests in this repo also call torch.manual_seed() at module/function scope without restoring state (e.g., kernels/test/test_case_gen.py calls it at module level, backends/test/suite/runner.py calls it per test). This is the established pattern in the codebase.
  3. The added complexity isn't worth itsetUp+tearDown RNG save/restore adds 4+ lines for a scenario that doesn't occur in practice. If a future test in this file needs different randomness, it can simply call torch.manual_seed(N) with a different seed.

So the current approach of just torch.manual_seed(0) in setUp is fine and consistent with the rest of the repo. I'd dismiss the Copilot comment.

Minor observations

  • The tolerance bumps (3.5→5 for u8, 0.4→1.0 for aggressive downsampling) are well-justified in the comments and the PR description explains the observed error magnitudes.
  • The comments on lines 26-27 and at the tolerance sites are helpful since they explain why the values were chosen — this is one of the few cases where comments genuinely add value.
  • The align_corners_downsampling test keeping atol=0.25 unchanged (passes with seed 0) is a good sign that the seed choice isn't just masking broader issues.

LGTM — no blocking concerns.


Summary:
Three test methods in
`fbcode/executorch/kernels/portable/test/op_upsample_bilinear2d_aa_test.py`
have been auto-disabled as flaky on the test-issues dashboard
(owner ai_infra_mobile_platform):

- test_upsample_bilinear2d_aa_aten_parity_u8
- test_upsample_bilinear2d_aa_aggressive_downsampling
- test_upsample_bilinear2d_aa_align_corners_downsampling

Root cause: each test builds its input via `torch.randint(...)` or
`torch.randn(...)` with no seed pinned, so each run sees a different
sample. The configured `atol` was tight enough that on some draws the
ATen-vs-ExecuTorch divergence (driven by separable-vs-direct
anti-aliased interpolation differences) crossed the threshold and the
test flipped to FAIL. The kernel implementations themselves are not
changing across runs.

Fix:

1. Add `setUp(self): torch.manual_seed(0)` so every run sees the same
   input tensor and the same divergence, eliminating the run-to-run
   FAIL/PASS oscillation.
2. Bump two atol thresholds to cover the worst-case observed
   divergence with the now-pinned input:
   - u8 parity: 3.5 -> 5 (observed max abs error 4 / 255)
   - aggressive 4x downsampling: 0.4 -> 1.0 (observed max abs error
     ~0.59 for N(0,1) input)
3. The pre-existing `atol=0.25` on align_corners_downsampling is left
   unchanged - with seed 0 it now passes consistently.

The relaxed tolerances are still well below any change that would
indicate an actual kernel regression; the comprehensive C++ test
suite in `op_upsample_bilinear2d_aa_test.cpp` still validates the
kernel under tighter constraints.

Reviewed By: rascani

Differential Revision: D104150928
Copilot AI review requested due to automatic review settings May 8, 2026 22:12
@psiddh psiddh force-pushed the export-D104150928 branch from aba702a to f2388dd Compare May 8, 2026 22:12
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment on lines +22 to +26
def setUp(self) -> None:
# Save RNG state so we can restore it in tearDown; without this,
# `torch.manual_seed` would leak determinism into other test
# modules that share the same process.
self._torch_rng_state = torch.get_rng_state()
@psiddh psiddh merged commit 9e4e497 into pytorch:main May 8, 2026
177 of 181 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants