Expose get_active_ddp_module api for torchdynamo DDP #83333

wconstab · 2022-08-12T16:06:33Z

Pairs up with torchdynamo PR pytorch/torchdynamo#628

Exposes a new API that lets torchdynamo know when it is compiling the 'forward' of a module that is wrapped inside a DDP module.
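As a rough illustration of the idea (not the code in this PR), the sketch below uses a stand-in class instead of the real DistributedDataParallel. The names `ToyDDP`, `compile_hook`, and `seen` are hypothetical and exist only to show how a compiler-side hook could ask "am I currently inside a DDP forward?":

```python
from contextlib import contextmanager


class ToyDDP:
    """Stand-in for a DDP-style wrapper; not the real torch class."""

    # Class-level slot holding the wrapper whose forward is currently running.
    _active_ddp_module = None

    def __init__(self, module):
        self.module = module

    @classmethod
    def get_active_ddp_module(cls):
        # Returns the wrapper currently inside forward, or None.
        return cls._active_ddp_module

    @contextmanager
    def _inside_ddp_forward(self):
        ToyDDP._active_ddp_module = self
        try:
            yield
        finally:
            # Always clear the slot, even if the wrapped forward raises.
            ToyDDP._active_ddp_module = None

    def forward(self, x):
        with self._inside_ddp_forward():
            return self.module(x)


seen = []


def compile_hook(x):
    # Hypothetical compiler-side check, run while the inner forward executes.
    seen.append(ToyDDP.get_active_ddp_module())
    return x * 2


ddp = ToyDDP(compile_hook)
result = ddp.forward(3)
```

Inside the wrapped call the getter returns the wrapper itself; once forward returns, the slot is cleared and the getter returns None again.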

facebook-github-bot · 2022-08-12T16:06:42Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/83333
✖️ Python docs build was skipped
✖️ C++ docs build was skipped
❓Need help or want to give feedback on the CI? Visit our office hours
↩️ [fb-only] Re-run with SSH instructions

❌ 2 New Failures, 1 Flaky Failures

As of commit e89d081 (more details on the Dr. CI page):


2/3 failures introduced in this PR
1/3 tentatively recognized as flaky ❄️

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

pull / win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge) (1/1)

Step: "Setup Windows" (full log | diagnosis details)

2022-09-01T17:42:32.9957753Z ##[error]Process completed with exit code 127.

2022-09-01T17:42:32.9602439Z   PYTORCH_RETRY_TEST_CASES: 1
2022-09-01T17:42:32.9602717Z   PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
2022-09-01T17:42:32.9603013Z   SHA1: e89d081ee7addcad838feef7bde0f2c09750342f
2022-09-01T17:42:32.9603257Z   TAG: 
2022-09-01T17:42:32.9603433Z   WORKFLOW_ID: 2973492028
2022-09-01T17:42:32.9603916Z   GITHUB_TOKEN: ***
2022-09-01T17:42:32.9604142Z   GHA_WORKFLOW_JOB_ID: 
2022-09-01T17:42:32.9604338Z ##[endgroup]
2022-09-01T17:42:32.9836121Z + python3 -m pip install -r requirements.txt
2022-09-01T17:42:32.9930892Z C:\actions-runner\_work\_temp\614552af-927a-41a2-a6b8-45c6a1322324.sh: line 2: python3: command not found
2022-09-01T17:42:32.9957753Z ##[error]Process completed with exit code 127.
2022-09-01T17:42:33.0078284Z Prepare all required actions
2022-09-01T17:42:33.0113229Z ##[group]Run ./.github/actions/teardown-win
2022-09-01T17:42:33.0113435Z with:
2022-09-01T17:42:33.0113622Z env:
2022-09-01T17:42:33.0113839Z   GIT_DEFAULT_BRANCH: master
2022-09-01T17:42:33.0114020Z ##[endgroup]
2022-09-01T17:42:33.0227446Z ##[group]Run .github\scripts\wait_for_ssh_to_drain.ps1
2022-09-01T17:42:33.0227776Z .github\scripts\wait_for_ssh_to_drain.ps1
2022-09-01T17:42:33.0251747Z shell: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.EXE -command ". '{0}'"
2022-09-01T17:42:33.0252102Z env:

🕵️‍♀️ 1 failure not recognized by patterns:

The following CI failures may be due to changes from the PR

Job	Step
^build	^Unknown

❄️ 1 failure tentatively classified as flaky:

trunk / android-emulator-build-test / build-and-test (1/1)

Step: "Install dependencies" (full log | diagnosis details) ❄️

2022-09-01T16:59:31.5243829Z CondaHTTPError: HT.../linux-64/mkl-include-2022.1.0-h06a4308_223.conda>

2022-09-01T16:59:31.5237420Z An HTTP error occurred when trying to retrieve this URL.
2022-09-01T16:59:31.5238233Z HTTP errors are often intermittent, and a simple retry will get you on your way.
2022-09-01T16:59:31.5238676Z 
2022-09-01T16:59:31.5239598Z CondaHTTPError: HTTP 404 NOT FOUND for url <https://repo.anaconda.com/pkgs/main/linux-64/mkl-2022.1.0-hc2b9512_223.conda>
2022-09-01T16:59:31.5240240Z Elapsed: 00:00.035380
2022-09-01T16:59:31.5241009Z CF-RAY: 743f8e888b010b76-DFW
2022-09-01T16:59:31.5241381Z 
2022-09-01T16:59:31.5241930Z An HTTP error occurred when trying to retrieve this URL.
2022-09-01T16:59:31.5242529Z HTTP errors are often intermittent, and a simple retry will get you on your way.
2022-09-01T16:59:31.5243170Z 
2022-09-01T16:59:31.5243829Z CondaHTTPError: HTTP 404 NOT FOUND for url <https://repo.anaconda.com/pkgs/main/linux-64/mkl-include-2022.1.0-h06a4308_223.conda>
2022-09-01T16:59:31.5244683Z Elapsed: 00:00.025273
2022-09-01T16:59:31.5245222Z CF-RAY: 743f8e8cd9760b76-DFW
2022-09-01T16:59:31.5245803Z 
2022-09-01T16:59:31.5246154Z An HTTP error occurred when trying to retrieve this URL.
2022-09-01T16:59:31.5246965Z HTTP errors are often intermittent, and a simple retry will get you on your way.
2022-09-01T16:59:31.5247388Z 
2022-09-01T16:59:31.5247773Z 
2022-09-01T16:59:31.7399803Z ##[error]Process completed with exit code 1.
2022-09-01T16:59:31.7572407Z Cleaning up orphan processes

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.


janeyx99 · 2022-08-12T19:02:38Z

@albanD looks like the pr sanity check is wrong

huydhn · 2022-08-12T19:15:23Z

@albanD looks like the pr sanity check is wrong

The PR that adds the check is #83295. It fails in a bunch of PRs but passes in others, so I suspect that the HEAD: ${{ github.event.pull_request.head.sha }} commit doesn't have what we think it does. Thus the diff returns wrong results for those failing PRs.

albanD · 2022-08-12T19:20:34Z

Yes, my bad, sorry. Here is the fix: #83344

voznesenskym · 2022-08-24T19:16:12Z

torch/nn/parallel/distributed.py

+    # used to track whether the given thread is inside ddp forward for torchdynamo purposes
+    _tls_ctx = threading.local()


This doesn't really track or enforce anything; it just stores. I would instead either store thread-identifying info and fail if it stops matching or, better yet, remove this and let the GIL handle this assumption safely. Even if multiple threads modify the _tls_ctx, they will do it only one at a time, right?

wconstab · Aug 24, 2022 (reply)

I guess I wasn't sure I wanted to enforce that users don't use more than one thread, as it could technically be valid. At the same time, I can't think of a good reason for it, and the normal use case should be one Python thread per process for DDP. So I may just delete the TLS thing altogether.
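For reference, a small standalone sketch of the distinction under discussion: a value stored on `threading.local()` is visible only to the thread that set it, while an ordinary shared object is visible to every thread. The names `tls`, `shared`, `worker`, and `observed` are made up for this illustration and do not come from the PR:

```python
import threading

tls = threading.local()      # each thread sees its own attributes
shared = {"active": None}    # one value visible to every thread

observed = {}


def worker():
    # Runs in a second thread: record what it can see from the main thread.
    observed["tls"] = getattr(tls, "active", None)
    observed["shared"] = shared["active"]


# Set both slots from the main thread.
tls.active = "main-ddp"
shared["active"] = "main-ddp"

t = threading.Thread(target=worker)
t.start()
t.join()
```

The worker thread finds nothing in the thread-local slot but does see the shared value. Note that the GIL only serializes individual bytecode steps; it does not by itself stop two threads from interleaving a set and a clear of a shared slot.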


mrshenli · 2022-08-25T01:12:45Z

torch/nn/parallel/distributed.py

+    # see torchdynamo/eval_frame.py TorchPatcher.patch for more details
+    @contextmanager
+    def _inside_ddp_forward(self):
+        assert DistributedDataParallel._active_ddp_module is None, "Only one thread should be running DDP at a time"


Only one thread should be running DDP at a time

This restriction looks OK to me, as otherwise, collectives might desync.

Is there any other reason that you decided to not use thread_local for _active_ddp_module? Will Dynamo switch threads?

wconstab · Aug 30, 2022 (reply)

I reasoned that I did not expect anyone to use more than one thread calling into DDP, and therefore just avoided the complexity. I could put the TLS thing back; it shouldn't hurt.

I am seeing this test failure, though, with this assert enabled. I think the test would still fail in the same way if I were using TLS.
https://github.com/pytorch/pytorch/runs/8007501670?check_suite_focus=true

Reading the test case, it's not immediately clear to me why we are entering DDP twice. Any ideas?

wconstab · 2022-08-25T01:20:35Z

@pytorchbot merge -g

pytorchmergebot · 2022-08-25T01:21:51Z

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the green (-g) flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

pytorchmergebot · 2022-08-25T01:31:56Z

Merge failed
Reason: View failures on hud. Refusing to merge as mandatory check(s) pull failed for rule Distributed.
Raised by workflow job

pytorch-bot · 2022-09-07T20:39:26Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/83333

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5a2ced8:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab · 2022-09-17T00:17:45Z

@pytorchbot merge -a

pytorch-bot · 2022-09-17T00:17:46Z

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: -a

usage: @pytorchbot [-h] {merge,revert,rebase,label} ...

Try @pytorchbot --help for more info.

wconstab · 2022-09-17T00:18:05Z

@pytorchbot merge

pytorchmergebot · 2022-09-17T00:19:24Z

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

github-actions · 2022-09-17T02:11:01Z

Hey @wconstab.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

Pairs up with torchdynamo PR pytorch/torchdynamo#628

Exposes a new API that lets torchdynamo know when it is compiling the 'forward' of a module that is inside a DDPmodule.

Pull Request resolved: #83333
Approved by: https://github.com/mrshenli

wconstab requested a review from mrshenli August 12, 2022 16:06

wconstab requested review from H-Huang, awgu, mingzhe09088, pritamdamania87, rohan-varma and zhaojuanmao as code owners August 12, 2022 16:06

facebook-github-bot added the cla signed label Aug 12, 2022

wconstab removed the request for review from rohan-varma August 12, 2022 16:06

facebook-github-bot added the "oncall: distributed" label (Add this issue/PR to distributed oncall triage queue) Aug 12, 2022

wconstab removed request for H-Huang, awgu, mingzhe09088, pritamdamania87 and zhaojuanmao August 12, 2022 16:06

wconstab mentioned this pull request Aug 12, 2022

DDP optimization via graph-breaks in Dynamo pytorch/torchdynamo#628

Merged

huydhn mentioned this pull request Aug 12, 2022

[WIP] Add pr sanity check workflow #83295

Closed

voznesenskym reviewed Aug 24, 2022

torch/nn/parallel/distributed.py (resolved)

voznesenskym reviewed Aug 24, 2022

torch/nn/parallel/distributed.py (resolved)

wconstab commented Aug 24, 2022

torch/nn/parallel/distributed.py (outdated, resolved)

wconstab changed the title ~~(WIP) Expose get_active_ddp_module api for torchdynamo DDP~~ Expose get_active_ddp_module api for torchdynamo DDP Aug 25, 2022

mrshenli approved these changes Aug 25, 2022


mrshenli added the "ciflow/trunk" label (Trigger trunk jobs on your pull request) Aug 25, 2022

wconstab force-pushed the wconstab/ddp-dynamo branch 2 times, most recently from 0374e3d to e89d081 Compare September 1, 2022 16:58

wconstab force-pushed the wconstab/ddp-dynamo branch from e89d081 to c3e9841 Compare September 7, 2022 20:39

This was referenced Feb 1, 2023

[ddp] profiler::_record_function_enter() expected at most 2 argument(s)... #93668

Closed

[ddp] AssertionError: torch.* op returned non-Tensor MaskedLMOutput call_module self_model pytorch/torchdynamo#1236

Closed

Expose get_active_ddp_module api for torchdynamo DDP

5a2ced8

wconstab force-pushed the wconstab/ddp-dynamo branch from c3e9841 to 5a2ced8 Compare September 16, 2022 23:14

pytorchmergebot added the Merged label Sep 17, 2022

pytorchmergebot closed this in 32fc0b9 Sep 17, 2022

github-actions bot deleted the wconstab/ddp-dynamo branch March 18, 2024 01:50


8 participants