Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DDP] multiple forward support for static graph #103487

Closed
wants to merge 5 commits into from

Conversation

rohan-varma
Copy link
Member

@rohan-varma rohan-varma commented Jun 13, 2023

Stack from ghstack (oldest at bottom):

Adds support for multiple forward before bwd call for
static_graph=True.

There are 2 changes:

  1. Change tracking of accounting of when to populate static grap related maps
    from relying on forward iteration to backward calls
  2. In DDP python, don't rely on num_forward iterations == 1 to enqueue the
    delay allreduce. Instead use a flag.

Differential Revision: D46673736

Adds support for multiple forward before bwd call for
static_graph=True.

There are 2 changes:
1) Change tracking of accounting of when to populate static grap related maps
from relying on forward iteration to backward calls
2) In DDP python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce. Instead use a flag.

Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Jun 13, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/103487

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 2 Unrelated Failures

As of commit 1edd7d2:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base 3f6f508:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

rohan-varma added a commit that referenced this pull request Jun 13, 2023
Adds support for multiple forward before bwd call for
static_graph=True.

There are 2 changes:
1) Change tracking of accounting of when to populate static grap related maps
from relying on forward iteration to backward calls
2) In DDP python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce. Instead use a flag.

Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)

ghstack-source-id: 191995486
Pull Request resolved: #103487
Adds support for multiple forward before bwd call for
static_graph=True.

There are 2 changes:
1) Change tracking of accounting of when to populate static grap related maps
from relying on forward iteration to backward calls
2) In DDP python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce. Instead use a flag.

Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Jun 13, 2023
Pull Request resolved: #103487

Adds support for multiple forward before bwd call for
static_graph=True.

There are 2 changes:
1) Change tracking of accounting of when to populate static grap related maps
from relying on forward iteration to backward calls
2) In DDP python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce. Instead use a flag.
ghstack-source-id: 192041100

Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)
Adds support for multiple forward before bwd call for
static_graph=True.

There are 2 changes:
1) Change tracking of accounting of when to populate static grap related maps
from relying on forward iteration to backward calls
2) In DDP python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce. Instead use a flag.

Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Jun 13, 2023
Pull Request resolved: #103487

Adds support for multiple forward before bwd call for
static_graph=True.

There are 2 changes:
1) Change tracking of accounting of when to populate static grap related maps
from relying on forward iteration to backward calls
2) In DDP python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce. Instead use a flag.
ghstack-source-id: 192115313

Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)
Copy link
Contributor

@awgu awgu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me!

torch/nn/parallel/distributed.py Outdated Show resolved Hide resolved
torch/csrc/distributed/c10d/reducer.cpp Show resolved Hide resolved
torch/testing/_internal/distributed/distributed_test.py Outdated Show resolved Hide resolved
torch/testing/_internal/distributed/distributed_test.py Outdated Show resolved Hide resolved
Adds support for multiple forward before bwd call for
static_graph=True.

There are 2 changes:
1) Change tracking of accounting of when to populate static grap related maps
from relying on forward iteration to backward calls
2) In DDP python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce. Instead use a flag.

Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Jun 14, 2023
Pull Request resolved: #103487

Adds support for multiple forward before bwd call for
static_graph=True.

There are 2 changes:
1) Change tracking of accounting of when to populate static grap related maps
from relying on forward iteration to backward calls
2) In DDP python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce. Instead use a flag.
ghstack-source-id: 192138853

Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)
Adds support for multiple forward before bwd call for
static_graph=True.

There are 2 changes:
1) Change tracking of accounting of when to populate static grap related maps
from relying on forward iteration to backward calls
2) In DDP python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce. Instead use a flag.

Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Jun 14, 2023
Pull Request resolved: #103487

Adds support for multiple forward before bwd call for
static_graph=True.

There are 2 changes:
1) Change tracking of accounting of when to populate static grap related maps
from relying on forward iteration to backward calls
2) In DDP python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce. Instead use a flag.
ghstack-source-id: 192139275

Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)
@rohan-varma
Copy link
Member Author

@pytorchbot merge -f "CI failure unrelated"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@osalpekar
Copy link
Member

@pytorchbot revert -m "This causes distributed/test_c10d_gloo.py::DistributedDataParallelTest::test_ddp_checkpointing_weight_sharing_use_reentrant_False to fail on trunk starting this commit. See https://github.com/pytorch/pytorch/actions/runs/5269586345/jobs/9528524605 for more details" -c landrace

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Collaborator

Can't revert PR that was landed via phabricator as D46673736. Please revert by going to the internal diff and clicking Unland.

@rohan-varma rohan-varma mentioned this pull request Jun 14, 2023
@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/707/head branch June 18, 2023 14:17
huydhn added a commit to huydhn/pytorch that referenced this pull request Jun 20, 2023
pytorchmergebot pushed a commit that referenced this pull request Jun 20, 2023
…103873)

Per the discussion in #103629 (comment), I preemptively create this revert PR to revert all commits in the stack.  This seems like a safer option than using the bot as the commit has already been in trunk since last week.
Pull Request resolved: #103873
Approved by: https://github.com/rohan-varma
rohan-varma added a commit that referenced this pull request Jun 21, 2023
rohan-varma added a commit that referenced this pull request Jun 21, 2023
@rohan-varma
Copy link
Member Author

This also fixes torch.compile w/staticgraph=True

rohan-varma added a commit that referenced this pull request Nov 27, 2023
Resolves #93672. This was
actually fixed by #103487 but I didn't
realize that PR also fixes torch compile at the time.

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 27, 2023
Resolves #93672. This was
actually fixed by #103487 but I didn't
realize that PR also fixes torch compile at the time.

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)

ghstack-source-id: 208386778
Pull Request resolved: #114621
rohan-varma added a commit that referenced this pull request Nov 27, 2023
…_graph=True"

Resolves #93672. This was
actually fixed by #103487 but I didn't
realize that PR also fixes torch compile at the time.

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab mrshenli zhaojuanmao kiukchung d4l3k lucasllc XilunWu tianyu-l

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 27, 2023
Pull Request resolved: #114621

Resolves #93672. This was
actually fixed by #103487 but I didn't
realize that PR also fixes torch compile at the time.
ghstack-source-id: 208389966
@exported-using-ghexport

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)
rohan-varma added a commit that referenced this pull request Nov 27, 2023
…_graph=True"

Resolves #93672. This was
actually fixed by #103487 but I didn't
realize that PR also fixes torch compile at the time.

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab mrshenli zhaojuanmao kiukchung d4l3k lucasllc XilunWu tianyu-l

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 27, 2023
Pull Request resolved: #114621

Resolves #93672. This was
actually fixed by #103487 but I didn't
realize that PR also fixes torch compile at the time.
ghstack-source-id: 208392149
@exported-using-ghexport

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)
rohan-varma added a commit that referenced this pull request Dec 1, 2023
Pull Request resolved: #114621

Resolves #93672. This was
actually fixed by #103487 but I didn't
realize that PR also fixes torch compile at the time.
ghstack-source-id: 208875875
@exported-using-ghexport

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)
rohan-varma added a commit that referenced this pull request Dec 1, 2023
…e works w/static_graph=True"

Resolves #93672. This was
actually fixed by #103487 but I didn't
realize that PR also fixes torch compile at the time.

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab mrshenli zhaojuanmao kiukchung d4l3k lucasllc XilunWu tianyu-l

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Dec 1, 2023
…_graph=True"

Resolves #93672. This was
actually fixed by #103487 but I didn't
realize that PR also fixes torch compile at the time.

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab mrshenli zhaojuanmao kiukchung d4l3k lucasllc XilunWu tianyu-l

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Dec 1, 2023
…#114621)

Resolves #93672. This was
actually fixed by #103487 but I didn't
realize that PR also fixes torch compile at the time.

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)

Pull Request resolved: #114621
Approved by: https://github.com/wconstab
hyperfraise pushed a commit to hyperfraise/pytorch that referenced this pull request Dec 21, 2023
…pytorch#114621)

Resolves pytorch#93672. This was
actually fixed by pytorch#103487 but I didn't
realize that PR also fixes torch compile at the time.

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)

Pull Request resolved: pytorch#114621
Approved by: https://github.com/wconstab
hyperfraise pushed a commit to hyperfraise/pytorch that referenced this pull request Dec 21, 2023
…pytorch#114621)

Resolves pytorch#93672. This was
actually fixed by pytorch#103487 but I didn't
realize that PR also fixes torch compile at the time.

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)

Pull Request resolved: pytorch#114621
Approved by: https://github.com/wconstab
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants