
Conversation

@jamesjwu (Contributor) commented Apr 8, 2025

Stack from ghstack (oldest at bottom):

This bug was crazy hard to reproduce, so I haven't managed to write a unit test beyond the internal one I used for debugging.

Here's a short TLDR of the bug:

  • Due to D71983456 (OSS: cache loaded python modules, #149910), we cache CachingAutotuners in memory.
  • Importantly: saving things in PyCodeCache in memory is not semantically equivalent to writing them to disk. Because they stay in memory, CachingAutotuners do not reset their global state.
  • Through recompiles, it's possible for different dynamo frames to compile down to exactly the same inductor output code. This happens with models that run multiple times but differ very subtly, in ways that cause a dynamo guard failure without changing the inductor output code.
  • Because of this, we reuse a CachingAutotuner for a second compile (with different example inputs, but the same triton kernel code).
  • CachingAutotuners have a Coordinate Descent tuner on them, which has a persistent cache: https://fburl.com/code/4igrsams (OSS: self.cached_benchmark_results[triton_config_to_hashable(config)] = timing).
  • Because we cache these in memory and not on disk, this cache is not cleared between runs.
  • However, config2launcher is not saved on the class, and is reinitialized every time we do autotuning: https://fburl.com/code/n2o8tmje (OSS: config2launcher = {launcher.config: launcher}).
  • config2launcher is populated when we call benchmark_one_config, but on a CoorDesc cache hit we never call benchmark_one_config! So we end up returning None and erroring with (see the minimal sketch after this list):
AttributeError: 'NoneType' object has no attribute 'store_cubin'
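
For readers without access to the internal links, here is a minimal, self-contained sketch of the failure mode. The `Tuner`, `benchmark_one_config`, and `autotune` names are toy stand-ins for the real `CachingAutotuner`/`CoordinateDescentTuner` code linked above, not their actual APIs:

```python
class Tuner:
    def __init__(self):
        # Persists as long as the object lives. When the generated module (and
        # therefore this tuner) is cached in memory instead of reloaded from
        # disk, this survives into the next compile.
        self.cached_benchmark_results = {}

    def benchmark_one_config(self, config, config2launcher):
        # Compiling a launcher is what populates config2launcher; this only
        # happens on a coordinate-descent cache miss.
        config2launcher[config] = f"launcher-for-{config}"
        self.cached_benchmark_results[config] = 1.0
        return self.cached_benchmark_results[config]

    def autotune(self, configs):
        config2launcher = {}  # rebuilt from scratch on every autotune call
        best = None
        for config in configs:
            if config in self.cached_benchmark_results:
                # Cache hit: timing is reused, benchmark_one_config is skipped.
                timing = self.cached_benchmark_results[config]
            else:
                timing = self.benchmark_one_config(config, config2launcher)
            best = config  # (the real code picks the fastest; irrelevant here)
        # On a fully-cached second call, config2launcher is still empty,
        # so this returns None.
        return config2launcher.get(best)


tuner = Tuner()                    # stays alive because the module is cached in memory
print(tuner.autotune(["cfg_a"]))   # 'launcher-for-cfg_a'
print(tuner.autotune(["cfg_a"]))   # None -> later: AttributeError on .store_cubin
```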

This fixes the problem for now by just recompiling the launcher. Technically, we might be able to save config2launcher on the class to avoid this, but I don't want to risk another weird cache safety bug here, so I'm taking the simpler approach for now.
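
A rough sketch of the shape of that fix, continuing the toy example above; `compile_launcher` here is a hypothetical stand-in for Inductor's real launcher-compilation path, not its actual API:

```python
def compile_launcher(config):
    # Hypothetical stand-in for recompiling a launcher for this config.
    return f"launcher-for-{config}"

def launcher_for_best_config(best_config, config2launcher):
    launcher = config2launcher.get(best_config)
    if launcher is None:
        # A coordinate-descent cache hit skipped benchmark_one_config, so the
        # per-call dict never saw this config: recompile instead of returning None.
        launcher = compile_launcher(best_config)
        config2launcher[best_config] = launcher
    return launcher

print(launcher_for_best_config("cfg_a", {}))  # 'launcher-for-cfg_a', not None
```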

Note that this error only reproduces if all of the following hold (a rough repro shape is sketched after this list):

  • Neither AOTAutogradCache nor FXgraphCache hits on the second compile: otherwise, the CachingAutotuner goes through pickling and is not kept in memory
  • We haven't spawned parallel compile workers. If there are parallel compile workers, we pickle the autotuner on the way from the worker to the parent process, once again resetting the Autotuner.
  • The autotune cache doesn't already have the best config stored in it
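
For illustration, here is a hypothetical repro shape (not the internal test) that tries to satisfy these conditions using existing `torch._inductor.config` knobs. Whether this exact snippet triggers the error depends on autotuner and cache state, so treat it as a sketch only:

```python
import torch
import torch._inductor.config as inductor_config

# Mirror the conditions above (assumes a CUDA build with Triton):
inductor_config.coordinate_descent_tuning = True  # exercise the CoorDesc cache
inductor_config.fx_graph_cache = False            # force FX graph cache misses
inductor_config.autotune_local_cache = False      # no stored best config
inductor_config.compile_threads = 1               # no parallel compile workers

@torch.compile(mode="max-autotune")
def f(x, flag: bool):
    # `flag` only affects dynamo guards; both branches lower to the same
    # pointwise kernel, so the inductor output code is identical.
    return x * 2.0 if flag else x * 2.0

x = torch.randn(1024, device="cuda")
f(x, True)   # first frame: compiles, autotunes, warms the coordinate-descent cache
f(x, False)  # guard failure, same kernel code, reuses the in-memory CachingAutotuner
```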

So it was extraordinarily hard to debug/reproduce. Because of this, I have a complicated internal unit test but no OSS test that can trigger the exact problem. I'll work on a separate test later, but this needs to go in to fix a sev, so we're landing it based on an internal test only.

Differential Revision: D72655382

NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on Phabricator!

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov


pytorch-bot bot commented Apr 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150860

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 739dc07 with merge base cdf3b63:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jamesjwu added a commit that referenced this pull request Apr 8, 2025
@facebook-github-bot (Contributor)
This pull request was exported from Phabricator. Differential Revision: D72655382

@pytorch-bot bot added the `ciflow/trunk` label (Trigger trunk jobs on your pull request) Apr 8, 2025
jamesjwu added a commit that referenced this pull request Apr 8, 2025
@facebook-github-bot (Contributor)
This pull request was exported from Phabricator. Differential Revision: D72655382

@jamesjwu added the `topic: not user facing` label Apr 8, 2025
@facebook-github-bot (Contributor)
@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator)
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@jamesjwu (Contributor, Author) commented Apr 9, 2025

@pytorchbot merge -f


pytorch-bot bot commented Apr 9, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot merge: error: argument -f/--force: expected one argument

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Try @pytorchbot --help for more info.

@jamesjwu (Contributor, Author) commented Apr 9, 2025

@pytorchbot merge -f "Landed internally tests passing"

@pytorchmergebot (Collaborator)
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025
…te descent (pytorch#150860)

Pull Request resolved: pytorch#150860
Approved by: https://github.com/oulgen
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
@github-actions bot deleted the gh/jamesjwu/134/head branch May 15, 2025 02:17