Add current cuda device index to FXGraphCache key #147464
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147464
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 2adfd4d with merge base f63db62.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch/_inductor/codecache.py
Outdated
# This device index is usually already encoded by the device of the inputs,
# but fx graphs don't necessarily have tensor inputs.
if torch.cuda.is_available():
    self.default_cuda_device_index = torch.cuda.current_device()
Partially a PSA / question for @albanD: to what extent do you expect people to be using torch.accelerator.current_device() + PT2 today (and I guess... changing their device index from run to run)? James and I talked about it offline a bit; we'll need to do something similar for other accelerators in the long run to avoid accelerator-specific warm-cache problems.
Yes, you should use torch.accelerator instead of cuda for these things. They lead to the same thing for the CUDA device, but you also get ROCm/MTIA/XPU support for free.
Is it guaranteed that:
if torch.cuda.is_available(), then torch.accelerator.is_available()?
And that torch.cuda.current_device() == torch.accelerator.current_device?
If so, happy to change.
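For what it's worth, both properties can be spot-checked at runtime with something like the following (a hypothetical snippet, not part of the PR; it assumes the installed build exposes torch.accelerator.is_available() and torch.accelerator.current_device_index(), whose exact name may differ across PyTorch versions):

import torch

# Sanity check (illustrative only): on a CUDA build, the generic accelerator API
# should report the same availability and the same current device index as the
# CUDA-specific API.
if torch.cuda.is_available():
    assert torch.accelerator.is_available()
    assert torch.cuda.current_device() == torch.accelerator.current_device_index()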
Seems like it works in my tests, will run with it!
sgtm! Although we should fix the dynamo guard issue too :)
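For reference, a backend-agnostic form of the codecache.py snippet above could look roughly like this (a sketch under the assumption that torch.accelerator.current_device_index() is the right accessor in the targeted PyTorch version; default_device_index is a hypothetical name, not necessarily the field used in this PR):

import torch

# Hypothetical sketch: record the current accelerator device index rather than the
# CUDA-specific one, so ROCm/MTIA/XPU builds get the same cache-key treatment.
default_device_index = None
if torch.accelerator.is_available():
    default_device_index = torch.accelerator.current_device_index()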
@jamesjwu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Rebase to rerun tests
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
This PR intends to fix the cache-related issues from #147405.
It does not handle the dynamo recompile case in-process, because it does not introduce any extra guards. For FXGraphCache and AOTAutogradCache, we simply have to include the device context in the cache key.
Note that for any function that accepts tensor inputs, the device context is naturally already included in the cache key by the metadata of example inputs. However, for functions that return constants or have no arguments, the device context still needs to be in the cache key.
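As a rough illustration of that case (a hypothetical repro sketch, not the actual unit test added in this PR), a compiled function with no tensor arguments can be run under two different CUDA device contexts; without the device index in the cache key, the artifact compiled while device 0 was current could be reused while device 1 is current:

import torch

# Hypothetical repro: the function takes no tensor inputs, so example-input metadata
# contributes no device information to the cache key.
@torch.compile
def make_ones():
    # "cuda" resolves to the *current* CUDA device when the graph is compiled.
    return torch.ones(4, device="cuda")

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    with torch.cuda.device(0):
        out0 = make_ones()   # compiled and cached while device 0 is current
    torch._dynamo.reset()    # force a fresh compile so the cache entry is consulted
    with torch.cuda.device(1):
        out1 = make_ones()
    # With the device index in the cache key, the second run misses the cache,
    # recompiles, and allocates on cuda:1 instead of reusing the cuda:0 artifact.
    assert out0.device.index == 0
    assert out1.device.index == 1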
A more robust fix for this would be to have inductor generate device guards that are dynamic, instead of specialized. This would also help us share more cache artifacts.
I've added unit tests for FXGraphCache and AOTAutogradCache, both of which would fail without this change.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov
Differential Revision: D69875939