@wconstab
Contributor

@wconstab wconstab commented Aug 12, 2022

Pairs up with torchdynamo PR pytorch/torchdynamo#628

Exposes a new API that lets torchdynamo know when it is compiling the 'forward' of a module that is inside a DDP module.
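
A minimal sketch of the idea, not the code in this PR: a wrapper records itself in a class-level slot while its forward runs, and a compiler callback reads that slot through a getter. `ToyDDP`, `fake_compile_hook`, and the plain callable used as the inner "module" are hypothetical stand-ins. The names `_inside_ddp_forward` and `_active_ddp_module` appear in the review below; the getter name is taken from the PR title, and its exact form here is an assumption.

```python
from contextlib import contextmanager


class ToyDDP:
    """Hypothetical stand-in for a DDP wrapper; not the real class."""

    # Class-level slot: the wrapper whose forward is currently running.
    _active_ddp_module = None

    def __init__(self, module):
        self.module = module

    @contextmanager
    def _inside_ddp_forward(self):
        ToyDDP._active_ddp_module = self
        try:
            yield
        finally:
            # Always clear the slot, even if forward raises.
            ToyDDP._active_ddp_module = None

    @staticmethod
    def get_active_ddp_module():
        # Assumed getter name, taken from the PR title.
        return ToyDDP._active_ddp_module

    def forward(self, x):
        with self._inside_ddp_forward():
            return self.module(x)


seen = []


def fake_compile_hook(x):
    # Stands in for a compiler callback that runs while the inner
    # module's forward is being processed.
    seen.append(ToyDDP.get_active_ddp_module())
    return x * 2


ddp = ToyDDP(fake_compile_hook)
result = ddp.forward(3)
```

Inside the forward call the callback sees the wrapper; once forward returns, the getter yields `None` again.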

@facebook-github-bot
Contributor

facebook-github-bot commented Aug 12, 2022

❌ 2 New Failures, 1 Flaky Failures

As of commit e89d081 (more details on the Dr. CI page):

  • 2/3 failures introduced in this PR
  • 1/3 tentatively recognized as flaky ❄️

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge) (1/1)

Step: "Setup Windows"

2022-09-01T17:42:32.9602439Z   PYTORCH_RETRY_TEST_CASES: 1
2022-09-01T17:42:32.9602717Z   PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
2022-09-01T17:42:32.9603013Z   SHA1: e89d081ee7addcad838feef7bde0f2c09750342f
2022-09-01T17:42:32.9603257Z   TAG: 
2022-09-01T17:42:32.9603433Z   WORKFLOW_ID: 2973492028
2022-09-01T17:42:32.9603916Z   GITHUB_TOKEN: ***
2022-09-01T17:42:32.9604142Z   GHA_WORKFLOW_JOB_ID: 
2022-09-01T17:42:32.9604338Z ##[endgroup]
2022-09-01T17:42:32.9836121Z + python3 -m pip install -r requirements.txt
2022-09-01T17:42:32.9930892Z C:\actions-runner\_work\_temp\614552af-927a-41a2-a6b8-45c6a1322324.sh: line 2: python3: command not found
2022-09-01T17:42:32.9957753Z ##[error]Process completed with exit code 127.
2022-09-01T17:42:33.0078284Z Prepare all required actions
2022-09-01T17:42:33.0113229Z ##[group]Run ./.github/actions/teardown-win
2022-09-01T17:42:33.0113435Z with:
2022-09-01T17:42:33.0113622Z env:
2022-09-01T17:42:33.0113839Z   GIT_DEFAULT_BRANCH: master
2022-09-01T17:42:33.0114020Z ##[endgroup]
2022-09-01T17:42:33.0227446Z ##[group]Run .github\scripts\wait_for_ssh_to_drain.ps1
2022-09-01T17:42:33.0227776Z .github\scripts\wait_for_ssh_to_drain.ps1
2022-09-01T17:42:33.0251747Z shell: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.EXE -command ". '{0}'"
2022-09-01T17:42:33.0252102Z env:

🕵️‍♀️ 1 failure not recognized by patterns:

The following CI failures may be due to changes from the PR
Job | Step
CircleCI Checks build | Unknown

❄️ 1 failure tentatively classified as flaky:

See GitHub Actions build trunk / android-emulator-build-test / build-and-test (1/1)

Step: "Install dependencies" ❄️

2022-09-01T16:59:31.5237420Z An HTTP error occurred when trying to retrieve this URL.
2022-09-01T16:59:31.5238233Z HTTP errors are often intermittent, and a simple retry will get you on your way.
2022-09-01T16:59:31.5238676Z 
2022-09-01T16:59:31.5239598Z CondaHTTPError: HTTP 404 NOT FOUND for url <https://repo.anaconda.com/pkgs/main/linux-64/mkl-2022.1.0-hc2b9512_223.conda>
2022-09-01T16:59:31.5240240Z Elapsed: 00:00.035380
2022-09-01T16:59:31.5241009Z CF-RAY: 743f8e888b010b76-DFW
2022-09-01T16:59:31.5241381Z 
2022-09-01T16:59:31.5241930Z An HTTP error occurred when trying to retrieve this URL.
2022-09-01T16:59:31.5242529Z HTTP errors are often intermittent, and a simple retry will get you on your way.
2022-09-01T16:59:31.5243170Z 
2022-09-01T16:59:31.5243829Z CondaHTTPError: HTTP 404 NOT FOUND for url <https://repo.anaconda.com/pkgs/main/linux-64/mkl-include-2022.1.0-h06a4308_223.conda>
2022-09-01T16:59:31.5244683Z Elapsed: 00:00.025273
2022-09-01T16:59:31.5245222Z CF-RAY: 743f8e8cd9760b76-DFW
2022-09-01T16:59:31.5245803Z 
2022-09-01T16:59:31.5246154Z An HTTP error occurred when trying to retrieve this URL.
2022-09-01T16:59:31.5246965Z HTTP errors are often intermittent, and a simple retry will get you on your way.
2022-09-01T16:59:31.5247388Z 
2022-09-01T16:59:31.5247773Z 
2022-09-01T16:59:31.7399803Z ##[error]Process completed with exit code 1.
2022-09-01T16:59:31.7572407Z Cleaning up orphan processes


This comment was automatically generated by Dr. CI (expand for details).


@janeyx99
Contributor

@albanD looks like the pr sanity check is wrong

@huydhn
Contributor

huydhn commented Aug 12, 2022

> @albanD looks like the pr sanity check is wrong

The PR that adds the check is #83295. It fails in a bunch of PRs but passes in others, so I suspect that the HEAD: ${{ github.event.pull_request.head.sha }} commit doesn't contain what we think it does. As a result, the diff returns wrong results for the failing PRs.

@albanD
Collaborator

albanD commented Aug 12, 2022

Yes, my bad, sorry. Here is the fix: #83344

Comment on lines 525 to 526
# used to track whether the given thread is inside ddp forward for torchdynamo purposes
_tls_ctx = threading.local()
Collaborator

This doesn't really track or enforce anything; it just stores. I would instead either store thread-identifying info and fail if it stops matching or, better yet, remove this and let the GIL handle this assumption safely. Even if multiple threads modify _tls_ctx, they will do so only one at a time, right?

Contributor Author

I guess I wasn't sure I wanted to enforce that users don't use more than one thread, since it could technically be valid. At the same time, I can't think of a good reason for it, and the normal use case should be one Python thread per process for DDP. So I may just delete the TLS thing altogether.

@wconstab wconstab changed the title from "(WIP) Expose get_active_ddp_module api for torchdynamo DDP" to "Expose get_active_ddp_module api for torchdynamo DDP" Aug 25, 2022
# see torchdynamo/eval_frame.py TorchPatcher.patch for more details
@contextmanager
def _inside_ddp_forward(self):
    assert DistributedDataParallel._active_ddp_module is None, "Only one thread should be running DDP at a time"
Contributor

> Only one thread should be running DDP at a time

This restriction looks OK to me, as otherwise, collectives might desync.

Is there any other reason that you decided to not use thread_local for _active_ddp_module? Will Dynamo switch threads?

Contributor Author

I reasoned that I did not expect anyone to use more than one thread calling into DDP, so I just avoided the complexity. I could put the TLS thing back; it shouldn't hurt.

I am seeing this test failure with the assert enabled, though. I think the test would still fail in the same way if I were using TLS.
https://github.com/pytorch/pytorch/runs/8007501670?check_suite_focus=true

Reading the test case, it's not immediately clear to me why we are entering DDP twice. Any ideas?

@wconstab
Contributor Author

@pytorchbot merge -g

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the green (-g) flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@pytorchmergebot
Collaborator

Merge failed
Reason: View failures on hud. Refusing to merge as mandatory check(s) pull failed for rule Distributed.
Raised by workflow job

@mrshenli mrshenli added the ciflow/trunk label Aug 25, 2022
@wconstab wconstab force-pushed the wconstab/ddp-dynamo branch 2 times, most recently from 0374e3d to e89d081 Compare September 1, 2022 16:58
@wconstab wconstab force-pushed the wconstab/ddp-dynamo branch from e89d081 to c3e9841 Compare September 7, 2022 20:39
@pytorch-bot

pytorch-bot bot commented Sep 7, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/83333

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5a2ced8:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wconstab
Contributor Author

@pytorchbot merge -a

@pytorch-bot

pytorch-bot bot commented Sep 17, 2022

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: -a

usage: @pytorchbot [-h] {merge,revert,rebase,label} ...

Try @pytorchbot --help for more info.

@wconstab
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@github-actions
Contributor

Hey @wconstab.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

mehtanirav pushed a commit that referenced this pull request Oct 4, 2022
Pairs up with torchdynamo PR pytorch/torchdynamo#628

Exposes a new API that lets torchdynamo know when it is compiling the 'forward' of a module that is inside a DDPmodule.
Pull Request resolved: #83333
Approved by: https://github.com/mrshenli
@github-actions github-actions bot deleted the wconstab/ddp-dynamo branch March 18, 2024 01:50
Labels

ciflow/trunk (Trigger trunk jobs on your pull request), cla signed, Merged, oncall: distributed (Add this issue/PR to distributed oncall triage queue)

8 participants