Fill config2launcher with correct launchers during cache hit coordinate descent #150860
Conversation
Fill config2launcher with correct launchers during cache hit coordinate descent

This bug was crazy hard to reproduce, so I can't seem to get a unit test written that isn't the internal one I used for debugging.

Here's a short TLDR of the bug:

- Due to D71983456, we cache CachingAutotuners in memory.
- It's possible through recompiles for different dynamo frames to compile down to exactly the same inductor output code. This involves models that run multiple times but differ very subtly, or in ways that cause a dynamo guard failure but not different inductor output code.
- Because of this, we reuse CachingAutotuners for a second compile (with different example inputs, just the same triton kernel code).
- CachingAutotuners have a coordinate descent tuner class on them, which has a cache: https://fburl.com/code/4igrsams
- Because we are caching these in memory and not on disk, this cache is **not cleared** between runs.
- However, `config2launcher` is *not* saved on the class, and is reinitialized every time we do autotuning: https://fburl.com/code/n2o8tmje
- `config2launcher` is populated when we call `benchmark_one_config`, but on a CoorDesc *cache hit*, we never call `benchmark_one_config`! So we end up returning None and erroring with:

```
AttributeError: 'NoneType' object has no attribute 'store_cubin'
```

This fixes the problem for now by just recompiling the launcher. Technically, we might be able to save config2launcher on the class to avoid this, but I don't want to risk another weird cache safety bug here, so I'm taking the simpler approach for now.

Note that this error only reproduces if:

- None of AOTAutogradCache or FXGraphCache hit on the second entry: otherwise, the CachingAutotuner goes through a pickling round trip and is not reused from memory.
- We haven't spawned parallel compile workers. If there are parallel compile workers, we pickle the autotuner on the way from the worker to the parent process, once again resetting the autotuner.
- The autotune cache doesn't already have the best config stored in it.

So it was extraordinarily hard to debug/reproduce. Because of this, I have a complicated internal unit test but no OSS test that can trigger the exact problem. I'll work on a separate test later, but this needs to go in to fix a sev, so we're landing it based on an internal test only.

Differential Revision: [D72655382](https://our.internmc.facebook.com/intern/diff/D72655382/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72655382/)!

[ghstack-poisoned]
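To make the failure mode concrete, here is a minimal, runnable sketch of the shape of the bug. The class and attribute names (`CoordescTunerSketch`, `CachingAutotunerSketch`, `cache`) are hypothetical stand-ins rather than the real inductor code; the point is only that the tuner's best-config cache lives on a long-lived, in-memory-cached object, while `config2launcher` is a local dict that only the benchmark path ever fills.

```python
# Hypothetical stand-ins for CachingAutotuner / its coordinate descent tuner;
# not the real inductor classes, just the shape of the bug.

class CoordescTunerSketch:
    def __init__(self):
        # Best-config cache. It survives across compiles because the tuner
        # object itself is kept in an in-memory cache and reused.
        self.cache = {}

    def autotune(self, benchmark_fn, baseline_config, key):
        if key in self.cache:
            # Cache hit: return the previously found best config without
            # benchmarking anything.
            return self.cache[key]
        benchmark_fn(baseline_config)  # (coordinate descent search elided)
        self.cache[key] = baseline_config
        return baseline_config


class CachingAutotunerSketch:
    def __init__(self):
        self.coordesc_tuner = CoordescTunerSketch()

    def coordinate_descent_tuning(self, config, kernel_key):
        config2launcher = {}  # local: reinitialized on every autotune call

        def benchmark_one_config(cfg):
            # The only place a launcher is recorded is inside the benchmark.
            config2launcher[cfg] = f"launcher({cfg})"
            return 1.0  # fake timing

        best = self.coordesc_tuner.autotune(benchmark_one_config, config, kernel_key)
        # On a coordesc cache hit, benchmark_one_config never ran, so this
        # lookup returns None -- the analogue of the store_cubin AttributeError.
        return config2launcher.get(best)


autotuner = CachingAutotunerSketch()  # cached in memory and reused across compiles
print(autotuner.coordinate_descent_tuning("cfg", "kernel_src"))  # launcher(cfg)
print(autotuner.coordinate_descent_tuning("cfg", "kernel_src"))  # None -> the bug
```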
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150860
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 1 Pending as of commit 739dc07 with merge base cdf3b63.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D72655382
…it coordinate descent" This bug was crazy hard to reproduce, so I can't seem to get a unit test written that isn't the internal one I used for debugging. Here's a short TLDR of the bug: - Due to D71983456, we cache CachingAutotuners in memory. - It's possible through recompiles for different dynamo frames to compile down to exactly the same inductor output code. This involves models that run multiple times, but differ very subtley, or in ways that cause a dynamo guard failure but not a different inductor output code. - Because of this, we reuse CachingAutotuners for a second compile (with different example inputs, just the same triton kernel code) - CachingAutotuners have a Coordinate Descent class on them, which has a cache: https://fburl.com/code/4igrsams - Because we are caching these in memory and not on disk, this cache is **not cleared** between runs. - However, this variable is *not* saved on the class, and is reinitialized every time we do autotuning: https://fburl.com/code/n2o8tmje - `config2launcher` is added when we call `benchmark_one_config`, but on a CoorDesc *cache hit*, we never call `benchmark_one_config`! So we end up returning None, and erroring with: ``` AttributeError: 'NoneType' object has no attribute 'store_cubin' ``` This fixes the problem for now by just recompiling the launcher. Technically, we might be able to save config2launcher on the class to avoid this, but I don't want to risk another weird cache safety bug here, so taking the simpler approach for now. Note that this error only reproduces if: - None of AOTAutogradCache, FXgraphCache hit on the second entry: otherwise, the CachingAutotuner will go through a pickling and then not be saved in memory - We haven't spawned parallel compile workers. If there are parallel compile workers, we pickle the autotuner on the way from the worker to the parent process, once again resetting the Autotuner. - The autotune cache doesn't already have the best config stored in it So it was extraordinarily hard to debug/reproduce. Because of this, I have a complicated internal unit test but no OSS test that can trigger the exact problem. I'll work on a separate test later, but this needs to go in to fix a sev, so we're landing it based on an internal test only. Differential Revision: [D72655382](https://our.internmc.facebook.com/intern/diff/D72655382/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72655382/)! cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
…te descent Pull Request resolved: #150860 This bug was crazy hard to reproduce, so I can't seem to get a unit test written that isn't the internal one I used for debugging. Here's a short TLDR of the bug: - Due to D71983456, we cache CachingAutotuners in memory. - It's possible through recompiles for different dynamo frames to compile down to exactly the same inductor output code. This involves models that run multiple times, but differ very subtley, or in ways that cause a dynamo guard failure but not a different inductor output code. - Because of this, we reuse CachingAutotuners for a second compile (with different example inputs, just the same triton kernel code) - CachingAutotuners have a Coordinate Descent class on them, which has a cache: https://fburl.com/code/4igrsams - Because we are caching these in memory and not on disk, this cache is **not cleared** between runs. - However, this variable is *not* saved on the class, and is reinitialized every time we do autotuning: https://fburl.com/code/n2o8tmje - `config2launcher` is added when we call `benchmark_one_config`, but on a CoorDesc *cache hit*, we never call `benchmark_one_config`! So we end up returning None, and erroring with: ``` AttributeError: 'NoneType' object has no attribute 'store_cubin' ``` This fixes the problem for now by just recompiling the launcher. Technically, we might be able to save config2launcher on the class to avoid this, but I don't want to risk another weird cache safety bug here, so taking the simpler approach for now. Note that this error only reproduces if: - None of AOTAutogradCache, FXgraphCache hit on the second entry: otherwise, the CachingAutotuner will go through a pickling and then not be saved in memory - We haven't spawned parallel compile workers. If there are parallel compile workers, we pickle the autotuner on the way from the worker to the parent process, once again resetting the Autotuner. - The autotune cache doesn't already have the best config stored in it So it was extraordinarily hard to debug/reproduce. Because of this, I have a complicated internal unit test but no OSS test that can trigger the exact problem. I'll work on a separate test later, but this needs to go in to fix a sev, so we're landing it based on an internal test only. Differential Revision: [D72655382](https://our.internmc.facebook.com/intern/diff/D72655382/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72655382/)! ghstack-source-id: 276839777
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge -f
❌ 🤖 pytorchbot command failed:
Try `@pytorchbot --help` for more info.
@pytorchbot merge -f "Landed internally tests passing" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fill config2launcher with correct launchers during cache hit coordinate descent (pytorch#150860)

This bug was crazy hard to reproduce, so I can't seem to get a unit test written that isn't the internal one I used for debugging.

Here's a short TLDR of the bug:

- Due to D71983456 (OSS: pytorch#149910), we cache CachingAutotuners in memory.
- Importantly: **saving things in PyCodeCache in memory is not semantically equivalent to writing to disk**. By saving them in memory, CachingAutotuners do not reset global state.
- It's possible through recompiles for different dynamo frames to compile down to exactly the same inductor output code. This involves models that run multiple times but differ very subtly, or in ways that cause a dynamo guard failure but not different inductor output code.
- Because of this, we reuse CachingAutotuners for a second compile (with different example inputs, just the same triton kernel code).
- CachingAutotuners have a coordinate descent tuner class on them, which has a cache: https://fburl.com/code/4igrsams (OSS: https://github.com/pytorch/pytorch/blob/aafc4b6188b70cf808f756f23b1a05355bcb7696/torch/_inductor/runtime/coordinate_descent_tuner.py#L69)
- Because we are caching these in memory and not on disk, this cache is **not cleared** between runs.
- However, `config2launcher` is *not* saved on the class, and is reinitialized every time we do autotuning: https://fburl.com/code/n2o8tmje (OSS: https://github.com/pytorch/pytorch/blob/aafc4b6188b70cf808f756f23b1a05355bcb7696/torch/_inductor/runtime/triton_heuristics.py#L933)
- `config2launcher` is populated when we call `benchmark_one_config`, but on a CoorDesc *cache hit*, we never call `benchmark_one_config`! So we end up returning None and erroring with:

```
AttributeError: 'NoneType' object has no attribute 'store_cubin'
```

This fixes the problem for now by just recompiling the launcher. Technically, we might be able to save config2launcher on the class to avoid this, but I don't want to risk another weird cache safety bug here, so I'm taking the simpler approach for now.

Note that this error only reproduces if:

- None of AOTAutogradCache or FXGraphCache hit on the second entry: otherwise, the CachingAutotuner goes through a pickling round trip and is not reused from memory.
- We haven't spawned parallel compile workers. If there are parallel compile workers, we pickle the autotuner on the way from the worker to the parent process, once again resetting the autotuner.
- The autotune cache doesn't already have the best config stored in it.

So it was extraordinarily hard to debug/reproduce. Because of this, I have a complicated internal unit test but no OSS test that can trigger the exact problem. I'll work on a separate test later, but this needs to go in to fix a sev, so we're landing it based on an internal test only.

Differential Revision: [D72655382](https://our.internmc.facebook.com/intern/diff/D72655382/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72655382/)!

Pull Request resolved: pytorch#150860
Approved by: https://github.com/oulgen
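The "in memory is not semantically equivalent to disk" point above is the crux: an in-memory cache hands back the very same mutable object, while any pickling round trip (a disk cache hit, or shipping the autotuner back from a parallel compile worker) rebuilds it and drops the accumulated state. A small illustration, where `TunerState` is a hypothetical stand-in whose `__reduce__` mimics how pickling rebuilds the real autotuner from scratch:

```python
import pickle

class TunerState:
    """Hypothetical stand-in for tuner state hanging off a cached autotuner."""

    def __init__(self):
        self.best_config_cache = {}

    def __reduce__(self):
        # Assumption of this sketch: pickling rebuilds the object from scratch,
        # mimicking how a disk/worker round trip resets the real autotuner.
        return (TunerState, ())


in_memory_cache = {}
in_memory_cache["kernel_src"] = TunerState()

first = in_memory_cache["kernel_src"]
first.best_config_cache["key"] = "best_cfg"

# In-memory reuse: same object, so state from the first compile is still there.
second = in_memory_cache["kernel_src"]
assert second is first
assert second.best_config_cache == {"key": "best_cfg"}

# Pickling round trip: fresh object, accumulated state is gone, so the
# cache-hit path never triggers and the bug does not reproduce.
rebuilt = pickle.loads(pickle.dumps(first))
assert rebuilt.best_config_cache == {}
```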
Stack from ghstack (oldest at bottom):
This bug was crazy hard to reproduce, so I can't seem to get a unit test written that isn't the internal one I used for debugging.
Here's a short TLDR of the bug:
- Due to D71983456 (OSS: #149910), we cache CachingAutotuners in memory.
- It's possible through recompiles for different dynamo frames to compile down to exactly the same inductor output code. This involves models that run multiple times but differ very subtly, or in ways that cause a dynamo guard failure but not different inductor output code.
- Because of this, we reuse CachingAutotuners for a second compile (with different example inputs, just the same triton kernel code).
- CachingAutotuners have a coordinate descent tuner class on them, which has a cache: https://fburl.com/code/4igrsams (OSS: https://github.com/pytorch/pytorch/blob/aafc4b6188b70cf808f756f23b1a05355bcb7696/torch/_inductor/runtime/coordinate_descent_tuner.py#L69)
- Because we are caching these in memory and not on disk, this cache is **not cleared** between runs.
- However, `config2launcher` is *not* saved on the class, and is reinitialized every time we do autotuning: https://fburl.com/code/n2o8tmje (OSS: https://github.com/pytorch/pytorch/blob/aafc4b6188b70cf808f756f23b1a05355bcb7696/torch/_inductor/runtime/triton_heuristics.py#L933)
- `config2launcher` is populated when we call `benchmark_one_config`, but on a CoorDesc *cache hit*, we never call `benchmark_one_config`! So we end up returning None and erroring with:

```
AttributeError: 'NoneType' object has no attribute 'store_cubin'
```

This fixes the problem for now by just recompiling the launcher. Technically, we might be able to save config2launcher on the class to avoid this, but I don't want to risk another weird cache safety bug here, so I'm taking the simpler approach for now.
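A minimal sketch of the fallback described above, reusing the hypothetical names from the earlier sketch: on a coordesc cache hit the launcher is simply not in `config2launcher`, so compile it on demand instead of returning None. `compile_launcher` is an assumed stand-in for whatever the real recompilation path does.

```python
def lookup_or_recompile(config2launcher, best_config, compile_launcher):
    # Cache-hit path: benchmark_one_config never ran, so the launcher for the
    # cached best config may be missing; recompile it rather than return None.
    launcher = config2launcher.get(best_config)
    if launcher is None:
        launcher = compile_launcher(best_config)
        config2launcher[best_config] = launcher
    return launcher


launchers = {}  # empty, as it would be right after a coordesc cache hit
print(lookup_or_recompile(launchers, "best_cfg", lambda cfg: f"launcher({cfg})"))
```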
Note that this error only reproduces if:

- None of AOTAutogradCache or FXGraphCache hit on the second entry: otherwise, the CachingAutotuner goes through a pickling round trip and is not reused from memory.
- We haven't spawned parallel compile workers. If there are parallel compile workers, we pickle the autotuner on the way from the worker to the parent process, once again resetting the autotuner.
- The autotune cache doesn't already have the best config stored in it.
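For orientation, here is roughly the configuration those conditions describe for anyone attempting an OSS repro. The flag names are assumptions about the inductor/functorch configs of this era and may differ across PyTorch versions; this is a sketch of the setup, not a working reproducer.

```python
# Assumed flag names -- verify against your PyTorch version before relying on them.
import torch._functorch.config as functorch_config
import torch._inductor.config as inductor_config

inductor_config.coordinate_descent_tuning = True  # the coordinate descent tuner must be in play
inductor_config.fx_graph_cache = False            # keep FXGraphCache from hitting
functorch_config.enable_autograd_cache = False    # keep AOTAutogradCache from hitting
inductor_config.compile_threads = 1               # no parallel compile workers, so the
                                                  # autotuner is never pickled across processes
inductor_config.autotune_local_cache = False      # the autotune cache must not already
inductor_config.autotune_remote_cache = False     # hold the best config
```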
So it was extraordinarily hard to debug/reproduce. Because of this, I have a complicated internal unit test but no OSS test that can trigger the exact problem. I'll work on a separate test later, but this needs to go in to fix a sev, so we're landing it based on an internal test only.
Differential Revision: D72655382
NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on Phabricator!
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov