
@davidberard98 (Contributor) commented Apr 2, 2025

Summary:
Context: #150122 (D71982587 - let's call this "the WS diff") introduces "bc/fc-breaking" cache changes.

In particular, it introduces `num_consumer_groups` and adds it to the cached config. In versions of torch that include the WS diff, `num_consumer_groups` is treated as a class variable on a `triton.Config` object (i.e. `triton.Config({..kwargs..}, num_consumer_groups=num_consumer_groups, ...)`). In versions of torch that don't include the WS diff, this kwarg is not expected at all.

But if a program is run with WS-torch (i.e. torch with the WS diff) and the same program is later run with non-WS-torch, then non-WS-torch will find this autotune cache entry and interpret `num_consumer_groups` as a kwarg, because that version of torch has no special handling for `num_consumer_groups`. The program then crashes with a triton failure message.
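To make the failure mode concrete, here is a rough sketch; the cache-entry dict and the surrounding handling below are simplified stand-ins for illustration, not the actual autotune-cache format:

```python
# Hypothetical sketch of the failure mode; the cache-entry dict is a
# simplified stand-in for the real autotune cache entry.
import triton

# Entry written by a torch build that includes the WS diff:
cached_best_config = {
    "XBLOCK": 128,
    "num_warps": 4,
    "num_stages": 3,
    "num_consumer_groups": 2,  # WS-only field
}

# A torch build *without* the WS diff has no special handling for
# num_consumer_groups, so it stays in the plain kwargs dict...
kwargs = {
    k: v for k, v in cached_best_config.items()
    if k not in ("num_warps", "num_stages")
}
cfg = triton.Config(
    kwargs,
    num_warps=cached_best_config["num_warps"],
    num_stages=cached_best_config["num_stages"],
)
# ...and is eventually handed to the kernel as a launch kwarg it does not
# accept, which is where the triton failure message comes from.
```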

The fix: add the torch version / torch key into the hash, so that any changes to inductor will invalidate the cache (ensuring that other changes to triton_heuristics won't cause these bc/fc issues).
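A minimal sketch of the idea (`autotune_cache_key` is a made-up name for illustration; the real change lives in inductor's autotune cache / triton_heuristics code):

```python
# Sketch only: fold the torch key and torch version into the autotune cache
# key so that any change to inductor sources invalidates old cached entries.
import hashlib

import torch
from torch._inductor import codecache  # codecache.torch_key() hashes the inductor sources


def autotune_cache_key(configs_hash: str, backend_hash: str, salt: str = "") -> str:
    parts = [
        backend_hash,
        configs_hash,
        salt,
        torch.__version__,
        str(codecache.torch_key()),  # differs between WS-torch and non-WS-torch builds
    ]
    return hashlib.sha256("-".join(parts).encode()).hexdigest()
```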

Test Plan: D72285868 can repro the failure (an OSS version is at https://gist.github.com/davidberard98/2ea697eb550c94d0d1948fedb5c5c7d8, but it doesn't repro in OSS because this version of warp specialization is not available in OSS triton), and the failure is fixed after this PR is patched in.

Also, added a test in test/inductor/test_codecache.py which verifies that there's no cache hit if the torch_key changes (and verified that without the functional changes in this PR, the test fails).
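A hypothetical sketch of what such a test could look like, reusing the `autotune_cache_key` sketch above (the real test in test/inductor/test_codecache.py exercises the actual cache implementation):

```python
# Sketch only: mock torch_key to simulate a different torch build and check
# that the resulting cache key changes, i.e. the old entry would not be hit.
from unittest import mock


def test_no_cache_hit_when_torch_key_changes():
    key_a = autotune_cache_key(configs_hash="abc", backend_hash="cuda-hash")
    with mock.patch("torch._inductor.codecache.torch_key", return_value=b"different-key"):
        key_b = autotune_cache_key(configs_hash="abc", backend_hash="cuda-hash")
    # Same inputs, different torch key -> different cache key -> cache miss.
    assert key_a != key_b
```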

Differential Revision: D72285303

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

@pytorch-bot (bot) commented Apr 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150494

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 92a9c58 with merge base 0198e44:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D72285303

@oulgen (Contributor) left a comment:


Instead of introducing yet another place that does hashing of source code, I think you should do something similar to what we do with backend_hash here:

key = backend_hash + self.configs_hash + salt

backend_hash here is the triton device-related hashing. FWIW, I think the local version will also need this.

Also, please add a unit test using mocking.

davidberard98 added a commit to davidberard98/pytorch that referenced this pull request Apr 2, 2025
@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D72285303

davidberard98 added a commit to davidberard98/pytorch that referenced this pull request Apr 2, 2025
@facebook-github-bot (Contributor) commented:

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorch-bot (bot) added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Apr 2, 2025
@davidberard98 requested a review from oulgen on April 2, 2025 at 20:10
@davidberard98 (Contributor, Author) commented:

@pytorchbot rebase -b main

@pytorchmergebot (Collaborator) commented:

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot (Collaborator) commented:

Successfully rebased export-D72285303 onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout export-D72285303 && git pull --rebase)

pytorchmergebot pushed a commit to davidberard98/pytorch that referenced this pull request Apr 2, 2025
@facebook-github-bot (Contributor) commented:

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@davidberard98 (Contributor, Author) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

davidberard98 added a commit to davidberard98/pytorch that referenced this pull request Apr 3, 2025
@facebook-github-bot (Contributor) commented:

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment

@davidberard98 (Contributor, Author) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/re-exporting the PR!

Details for Dev Infra team: raised by workflow job

@davidberard98 (Contributor, Author) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

@davidberard98 (Contributor, Author) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
Pull Request resolved: pytorch#150494
Approved by: https://github.com/oulgen