[export] Skip guard propagation for export only. #112685
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112685
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV; if your PR is affected, please view it below.
✅ No failures as of commit cba8099 with merge base 2337d8d.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D50914819

Commit updates: df7129f → 0716db8 → 6d65663 → 83c41be
Commit updates (continued): 83c41be → 52d1ba5 → cba8099
Added some comments inline about why we skip guard propagation for export.
I think this will be made obsolete by #111726. Does that PR fix the performance issues you are seeing? If so, maybe we don't need to add an extra mode.
I will do a test with that PR.
Summary:
IIUC, from the beginning of torch.export to the current moment, we have never really found a way to meaningfully use the guards generated by torch dynamo, since export by design must produce an artifact that does not depend on Python. Therefore I think that, for the foreseeable future, we still don't need dynamo guards to be present.
Recently we ran into some slowness in dynamo for internal models and observed that a substantial amount of time is spent in the guard propagation logic, especially when dealing with list VTs. This PR detects whether we're exporting and, if so, skips the entire guard propagation logic.
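Roughly, the shape of the change is the following (a minimal, self-contained sketch; `TracingContext`, its `export` flag, and `wrap_value` are illustrative stand-ins, not the actual torch._dynamo internals):

```python
# Illustrative sketch only; names do not mirror torch._dynamo internals.

class TracingContext:
    def __init__(self, export: bool):
        self.export = export  # True when torch.export is driving tracing

class VT:  # stand-in for a VariableTracker
    def __init__(self):
        self.guards = set()

def wrap_value(name: str, ctx: TracingContext) -> VT:
    vt = VT()
    if not ctx.export:
        # Guards only matter when the compiled code is re-entered from
        # Python; export emits a standalone artifact, so this bookkeeping
        # can be skipped entirely.
        vt.guards.add(f"check({name})")
    return vt

# Exporting skips guard bookkeeping; regular compile keeps it.
assert wrap_value("x", TracingContext(export=True)).guards == set()
assert wrap_value("x", TracingContext(export=False)).guards == {"check(x)"}
```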
This doesn't resolve the fundamental issue of the O(n^2) algorithm used in VTs, but it greatly reduces its constant factor, which still gives us a large speedup in export time. We've discussed a proper resolution in the PT2 core meeting, but as a quick unblocker in the short term I think we could also skip guards to unblock several internal models.
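To make that cost concrete, here is a toy model (purely illustrative, not dynamo code) of why repeatedly unioning per-VT guard sets over a large list behaves quadratically:

```python
def guard_union_cost(n: int) -> int:
    """Toy model (not dynamo code): building a list VT of n elements by
    repeatedly unioning per-element guard sets does ~n*(n+1)/2 units of
    work, i.e. quadratic in n, even though each element adds one guard."""
    accumulated, work = set(), 0
    for i in range(n):
        accumulated |= {f"guard_{i}"}
        work += len(accumulated)  # proxy for the cost of touching the merged set
    return work

assert guard_union_cost(100) == 100 * 101 // 2  # 5050
```

Skipping the union entirely in export mode removes this term, which is why export time drops so much even though the underlying algorithm is unchanged.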
Some experiments run on the internal FM model:
local_fm (training): 15min -> 4min
ads launcher ir generation (training, D50847630): >1hr -> 30min
Test Plan: CI
Differential Revision: D50914819
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @aakhundov @kadeng