[DPER] Introduce barrier operation to force synchronization of threads in async execution #49322

Closed
wants to merge 1 commit into from

kennyhorror

Summary:
In some cases async execution might lose dependencies (e.g. with Alias-like ops) or produce suboptimal scheduling when there is a choice of which parts to schedule first. An example of the latter can happen in ModelParallel training, where a copy can get lower priority than the rest of the execution on a given GPU, which causes other GPUs to starve.

This operator addresses these issues by introducing extra explicit dependencies between ops.
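
To illustrate the idea (this is a toy model, not the operator added in this diff): the sketch below assumes a scheduler that derives dependencies purely from which op last wrote each input blob. A barrier is modeled as a no-op that lists the same blobs as inputs and outputs, so every later reader of those blobs has to wait for it. The `Op` tuple, the `barrier` helper, and all op and blob names are hypothetical.

```python
from collections import namedtuple

# Hypothetical op description: a name plus the blobs it reads and writes.
Op = namedtuple("Op", ["name", "inputs", "outputs"])


def build_deps(ops):
    """Map each op name to the set of earlier ops that last wrote its inputs."""
    last_writer = {}
    deps = {}
    for op in ops:
        deps[op.name] = {last_writer[b] for b in op.inputs if b in last_writer}
        for blob in op.outputs:
            last_writer[blob] = op.name
    return deps


def barrier(name, blobs):
    """No-op that passes blobs through in place, becoming their last writer."""
    return Op(name, list(blobs), list(blobs))


def must_run_before(deps, first, second):
    """True if `second` transitively depends on `first`."""
    stack, seen = list(deps[second]), set()
    while stack:
        node = stack.pop()
        if node == first:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(deps[node])
    return False


forward = Op("forward", [], ["act", "local_in"])
copy = Op("copy", ["act"], ["act_remote"])  # feeds another GPU
local_work = Op("local_work", ["local_in"], ["local_out"])

# Without a barrier, local_work is free to be scheduled ahead of the copy.
plain = [forward, copy, local_work]
# With a barrier over both blobs, local_work can only start after the copy.
synced = [forward, copy, barrier("barrier", ["act_remote", "local_in"]), local_work]

print(must_run_before(build_deps(plain), "copy", "local_work"))
print(must_run_before(build_deps(synced), "copy", "local_work"))
```

In this model the barrier adds no computation; it only changes the dependency graph so that the copy cannot be deprioritized behind the local work.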

Test Plan:
Unit test.
E2E testing in future diffs.

Reviewed By: xianjiec

Differential Revision: D24933471

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D24933471

…s in async execution (pytorch#49322)

Summary:
Pull Request resolved: pytorch#49322

In some cases async execution might lose dependencies (e.g. with Alias-like ops) or produce suboptimal scheduling when there is a choice of which parts to schedule first. An example of the latter can happen in ModelParallel training, where a copy can get lower priority than the rest of the execution on a given GPU, which causes other GPUs to starve.

This operator addresses these issues by introducing extra explicit dependencies between ops.

Test Plan:
Unit test.
E2E testing in future diffs.

Reviewed By: xianjiec

Differential Revision: D24933471

fbshipit-source-id: 18e29c0899a97183115339528dc5c3c8b090205a
codecov bot commented Dec 15, 2020

Codecov Report

Merging #49322 (c6bb365) into master (5a5e576) will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #49322   +/-   ##
=======================================
  Coverage   80.56%   80.56%           
=======================================
  Files        1875     1875           
  Lines      202701   202701           
=======================================
+ Hits       163307   163309    +2     
+ Misses      39394    39392    -2     

@facebook-github-bot

This pull request has been merged in 46debe7.

hwangdeyu pushed a commit to hwangdeyu/pytorch that referenced this pull request Jan 6, 2021
…s in async execution (pytorch#49322)

Summary:
Pull Request resolved: pytorch#49322

In some cases async execution might lose dependencies (e.g. with Alias-like ops) or produce suboptimal scheduling when there is a choice of which parts to schedule first. An example of the latter can happen in ModelParallel training, where a copy can get lower priority than the rest of the execution on a given GPU, which causes other GPUs to starve.

This operator addresses these issues by introducing extra explicit dependencies between ops.

Test Plan:
Unit test.
E2E testing in future diffs.

Reviewed By: xianjiec

Differential Revision: D24933471

fbshipit-source-id: 1668994c7856d73926cde022378a99e1e8db3567