Expose get_active_ddp_module api for torchdynamo DDP #83333
Conversation
🔗 Helpful links
❌ 2 New Failures, 1 Flaky Failure as of commit e89d081 (more details on the Dr. CI page).
🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
@albanD looks like the PR sanity check is wrong.

The PR to add the check is here: #83295. It fails in a bunch of PRs, but passes in others. So I suspect that the

Yes, my bad, sorry. Here is the fix: #83344
torch/nn/parallel/distributed.py (outdated)

```python
# used to track whether the given thread is inside ddp forward for torchdynamo purposes
_tls_ctx = threading.local()
```
This doesn't really track or enforce anything; it just stores. I would instead either store thread-identifying info and fail if it stops matching, or, better yet, remove this and let the GIL handle this assumption safely: even if multiple threads modify `_tls_ctx`, they will do so only one at a time, right?
I guess I wasn't sure I wanted to enforce that users don't use more than one thread, since it could technically be valid. But at the same time I can't think of a good reason for it, and the normal use case should be one Python thread per process for DDP. So I may just delete the TLS thing altogether.
torch/nn/parallel/distributed.py (outdated)

```python
# see torchdynamo/eval_frame.py TorchPatcher.patch for more details
@contextmanager
def _inside_ddp_forward(self):
    assert DistributedDataParallel._active_ddp_module is None, "Only one thread should be running DDP at a time"
```
> Only one thread should be running DDP at a time

This restriction looks OK to me, as otherwise collectives might desync.
Is there any other reason you decided not to use thread_local for `_active_ddp_module`? Will Dynamo switch threads?
I reasoned that I did not expect anyone to use more than one thread to call into DDP, and therefore just avoided the complexity. I could put the TLS thing back; it shouldn't hurt.
I am seeing this test failure, though, with this assert enabled. I think the test would still fail in the same way if I were using TLS.
https://github.com/pytorch/pytorch/runs/8007501670?check_suite_focus=true
Reading the test case, it's not immediately clear to me why we are entering DDP twice. Any ideas?
@pytorchbot merge -g

@pytorchbot successfully started a merge job. Check the current status here.

Merge failed
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/83333. Note: links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 5a2ced8. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge -a

❌ 🤖 pytorchbot command failed: Try

@pytorchbot merge

@pytorchbot successfully started a merge job. Check the current status here.

Hey @wconstab.
Pairs up with torchdynamo PR pytorch/torchdynamo#628. Exposes a new API that lets torchdynamo know when it is compiling the 'forward' of a module that is inside a DDP module.
Pull Request resolved: #83333
Approved by: https://github.com/mrshenli