[DDP] multiple forward support for static graph #103487
Conversation
Adds support for multiple forward passes before the backward call when static_graph=True. There are two changes: 1) Track when to populate static-graph-related maps based on backward calls rather than forward iterations. 2) In DDP Python, don't rely on num_forward iterations == 1 to enqueue the delay allreduce; use a flag instead. Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/) [ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/103487
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures, 2 Unrelated Failures as of commit 1edd7d2.
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base 3f6f508. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This makes sense to me!
@pytorchbot merge -f "CI failure unrelated"
Merge started: Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "This causes distributed/test_c10d_gloo.py::DistributedDataParallelTest::test_ddp_checkpointing_weight_sharing_use_reentrant_False to fail on trunk starting this commit. See https://github.com/pytorch/pytorch/actions/runs/5269586345/jobs/9528524605 for more details" -c landrace
@pytorchbot successfully started a revert job. Check the current status here.
Can't revert PR that was landed via phabricator as D46673736. Please revert by going to the internal diff and clicking Unland. |
…)" This reverts commit 80139fc.
…103873) Per the discussion in #103629 (comment), I preemptively create this revert PR to revert all commits in the stack. This seems like a safer option than using the bot as the commit has already been in trunk since last week. Pull Request resolved: #103873 Approved by: https://github.com/rohan-varma
…03487)" (#103873)" (#103938) Differential Revision: [D46883396](https://our.internmc.facebook.com/intern/diff/D46883396/) Pull Request resolved: #103938 Approved by: https://github.com/awgu, https://github.com/fegin
This also fixes torch.compile with static_graph=True.
Resolves #93672. This was actually fixed by #103487, but I didn't realize at the time that that PR also fixes torch.compile. Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/) [ghstack-poisoned]
…#114621) Resolves #93672. This was actually fixed by #103487, but I didn't realize at the time that that PR also fixes torch.compile. Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/) Pull Request resolved: #114621 Approved by: https://github.com/wconstab
Stack from ghstack (oldest at bottom):
Adds support for multiple forward passes before the backward call when static_graph=True.
There are two changes:
1) Track when to populate static-graph-related maps based on backward calls rather than forward iterations.
2) In DDP Python, don't rely on num_forward iterations == 1 to enqueue the delay allreduce; use a flag instead.
Differential Revision: D46673736