[inductor][autotune cache] add torch_key() to configs hash #150494
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150494
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 unrelated failures) As of commit 92a9c58 with merge base 0198e44:
BROKEN TRUNK - The following job failed but was also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D72285303
Instead of introducing yet another place that hashes source code, I think you should do something similar to what we do with backend_hash here:
key = backend_hash + self.configs_hash + salt
backend_hash here is Triton device-related hashing. FWIW, I think the local version will also need this.
Also, please add a unit test using mocking.
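For illustration, a minimal sketch of the key composition the reviewer is pointing at. The helper name `build_autotune_cache_key` is hypothetical (not the actual helper in torch/_inductor), and it assumes `torch_key()` yields bytes:

```python
import hashlib


def build_autotune_cache_key(
    backend_hash: str, configs_hash: str, salt: str, torch_key_bytes: bytes
) -> str:
    # Existing composition: backend_hash (Triton/device-related hashing) +
    # the configs hash + a salt. The suggestion is to fold the torch source
    # hash in as well, so entries written by a different torch build can
    # never be picked up.
    key = backend_hash + configs_hash + salt + torch_key_bytes.hex()
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```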
Force-pushed from 63a8227 to de2903a.
Force-pushed from de2903a to f835735.
@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Successfully rebased.
Force-pushed from f835735 to 5116022.
@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
Force-pushed from 5116022 to 13dbc1e.
Force-pushed from 13dbc1e to 92a9c58.
@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
Merge failed. Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/re-exporting the PR! (Details for Dev Infra team: raised by workflow job.)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary:

Context: #150122 (D71982587 - let's call this "the WS diff") introduces bc/fc-breaking cache changes. In particular, it introduces `num_consumer_groups` and adds it to the cached config. In versions of torch that include the WS diff, `num_consumer_groups` is treated as a class variable on a triton.Config object (i.e. `triton.Config({..kwargs..}, num_consumer_groups=num_consumer_groups, ...)`); in versions of torch that don't include the WS diff, this kwarg is not expected at all.

But if a program is run with WS-torch (i.e. torch with the WS diff), and the same program is later run with non-WS-torch, then non-WS-torch will find this autotune cache entry and interpret `num_consumer_groups` as a kwarg, because that version of torch has no special handling for num_consumer_groups. The program then crashes with a Triton failure message.

The fix: add the torch version / torch key into the hash, so that any change to inductor invalidates the cache (ensuring that other changes to triton_heuristics won't cause these bc/fc issues).
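A minimal sketch of the shape of the fix, assuming `torch_key()` in `torch._inductor.codecache` returns a bytes digest of the local torch/inductor sources; `hash_configs` here is an illustrative stand-in, not the exact helper changed in this PR:

```python
import hashlib

from torch._inductor import codecache


def hash_configs(configs_blob: bytes) -> str:
    # Mix torch_key() -- a hash derived from the local torch/inductor
    # sources -- into the hash of the serialized autotune configs. A torch
    # build whose inductor sources differ (e.g. one that serializes
    # num_consumer_groups) then produces a different cache key, so it can
    # never pick up an entry written by an incompatible build.
    hasher = hashlib.sha256()
    hasher.update(codecache.torch_key())  # invalidates on any inductor change
    hasher.update(configs_blob)
    return hasher.hexdigest()
```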
Test Plan: D72285868 (or https://gist.github.com/davidberard98/2ea697eb550c94d0d1948fedb5c5c7d8, though this doesn't repro in OSS because this version of warp specialization is not available in OSS triton) reproduces the failure, and the failure is fixed once this PR is patched.
Also added a test in test/inductor/test_codecache.py which verifies that there is no cache hit if the torch_key changes (and verified that, without the functional changes in this PR, the test fails).
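A hedged sketch of such a mock-based test (the real one lives in test/inductor/test_codecache.py and drives the actual autotune cache; `hash_configs` is the illustrative stand-in from the sketch above, and the patched return value is arbitrary):

```python
import hashlib
import unittest
from unittest import mock

from torch._inductor import codecache


def hash_configs(configs_blob: bytes) -> str:
    # torch_key is looked up on the codecache module at call time,
    # which is what lets mock.patch below intercept it.
    hasher = hashlib.sha256()
    hasher.update(codecache.torch_key())
    hasher.update(configs_blob)
    return hasher.hexdigest()


class TestTorchKeyInvalidation(unittest.TestCase):
    def test_no_cache_hit_when_torch_key_changes(self):
        blob = b"serialized autotune configs"
        original_key = hash_configs(blob)
        with mock.patch(
            "torch._inductor.codecache.torch_key",
            return_value=b"pretend-torch-sources-changed",
        ):
            patched_key = hash_configs(blob)
        # A different torch_key must yield a different cache key,
        # so an entry written by another torch build can never hit.
        self.assertNotEqual(original_key, patched_key)


if __name__ == "__main__":
    unittest.main()
```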
Differential Revision: D72285303
Pull Request resolved: pytorch#150494
Approved by: https://github.com/oulgen
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov