Split FutureNCCL's CUDA-specific parts from generic future logic #48504

Closed
wants to merge 9 commits into from

@lw (Contributor) commented Nov 26, 2020

Stack from ghstack:

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).


FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding many of its methods). That's brittle: whenever a new method is added to ivalue::Future there's a risk of forgetting to add it to FutureNCCL, in which case calling that method on a FutureNCCL would defer to the base class and give inconsistent results (e.g., the future reporting that it isn't completed when it actually is). This is already happening, for example with waitAndThrow and hasError, which are not implemented by FutureNCCL. In addition, this duplicates code between the two classes, which could lead to inconsistent behavior, bugs, and missing features.
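
To make the failure mode concrete, here is a minimal sketch of it. It is not the actual PyTorch code: `BaseFuture`, `ReimplementingFuture`, and `errorSeenBySubclass` are hypothetical stand-ins, and only the method names `markCompleted`, `completed`, `setError`, and `hasError` are meant to evoke the real API. The subclass keeps its own state and overrides the methods it knows about, so any method it forgets to override reads the base class's stale state.

```cpp
#include <cassert>

// Hypothetical stand-in for ivalue::Future: state lives in the base class.
class BaseFuture {
 public:
  virtual ~BaseFuture() = default;

  virtual void markCompleted() { completed_ = true; }
  virtual void setError() {
    completed_ = true;
    error_ = true;
  }
  virtual bool completed() const { return completed_; }
  // Imagine this accessor was added to the base class at a later date.
  virtual bool hasError() const { return error_; }

 private:
  bool completed_ = false;
  bool error_ = false;
};

// Hypothetical stand-in for a subclass that reimplements the base logic
// with its own private state instead of reusing the base class's state.
class ReimplementingFuture : public BaseFuture {
 public:
  void markCompleted() override { myCompleted_ = true; }
  void setError() override {
    myCompleted_ = true;
    myError_ = true;
  }
  bool completed() const override { return myCompleted_; }
  // hasError() was never overridden, so calls fall through to BaseFuture,
  // whose error_ flag is never updated by this subclass.
  bool errorSeenBySubclass() const { return myError_; }

 private:
  bool myCompleted_ = false;
  bool myError_ = false;
};
```

With these definitions, calling `setError()` on a `ReimplementingFuture` makes `completed()` return true while `hasError()` still returns false, which is the kind of inconsistency described above.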

The best solution would be to keep the core future logic in ivalue::Future, and have only the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In this commit, I'll split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In the next commit, I'll remove those other methods and invoke the hooks directly from ivalue::Future.
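
As a rough illustration of where the two steps end up, the sketch below shows a base class that owns all generic state and calls overridable hooks at fixed points, with a subclass that only fills in the hooks. Everything here is an assumption made for illustration: the class names `Future` and `DeviceAwareFuture`, the hook names `postMarkCompletedHook` and `postWaitHook`, the `int` payload, and the string log that stands in for real CUDA event and stream calls are not taken from the PR.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for ivalue::Future: owns the generic future logic.
class Future {
 public:
  virtual ~Future() = default;

  void markCompleted(int value) {
    value_ = value;
    completed_ = true;
    postMarkCompletedHook();  // device-specific work, if any
  }

  int wait() {
    // A real future would block here until completed_ becomes true.
    postWaitHook();  // device-specific work, if any
    return value_;
  }

  bool completed() const { return completed_; }

 protected:
  // Hooks are no-ops by default; subclasses override only these.
  virtual void postMarkCompletedHook() {}
  virtual void postWaitHook() {}

 private:
  int value_ = 0;
  bool completed_ = false;
};

// Hypothetical stand-in for the CUDA-aware subclass. It adds no state
// tracking of its own; the log replaces real event/stream operations.
class DeviceAwareFuture : public Future {
 public:
  const std::vector<std::string>& log() const { return log_; }

 protected:
  void postMarkCompletedHook() override { log_.push_back("record event"); }
  void postWaitHook() override { log_.push_back("sync stream with event"); }

 private:
  std::vector<std::string> log_;
};
```

Because `markCompleted`, `wait`, and `completed` are defined only once in the base class, a method added there later behaves consistently for the subclass without any extra override.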

Differential Revision: D25180534

dr-ci bot commented Nov 26, 2020

💊 CI failures summary and remediations

As of commit 1aa056d (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_bionic_rocm3_9_py3_6_build (1/1)

Step: "Build"

Dec 09 15:33:46 caused by: Connection refused (os error 111)
Dec 09 15:33:46 ++++ extract_trap_cmd 
Dec 09 15:33:46 ++++ printf '%s\n' '' 
Dec 09 15:33:46 +++ printf '%s\n' cleanup 
Dec 09 15:33:46 ++ trap -- ' 
Dec 09 15:33:46 cleanup' EXIT 
Dec 09 15:33:46 ++ [[ pytorch-linux-bionic-rocm3.9-py3.6-build != *pytorch-win-* ]] 
Dec 09 15:33:46 ++ which sccache 
Dec 09 15:33:46 ++ sccache --stop-server 
Dec 09 15:33:46 Stopping sccache server... 
Dec 09 15:33:46 error: couldn't connect to server 
Dec 09 15:33:46 caused by: Connection refused (os error 111) 
Dec 09 15:33:46 ++ true 
Dec 09 15:33:46 ++ rm /var/lib/jenkins/sccache_error.log 
Dec 09 15:33:46 rm: cannot remove '/var/lib/jenkins/sccache_error.log': No such file or directory 
Dec 09 15:33:46 ++ true 
Dec 09 15:33:46 ++ [[ pytorch-linux-bionic-rocm3.9-py3.6-build == *rocm* ]] 
Dec 09 15:33:46 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Dec 09 15:33:46 ++ SCCACHE_IDLE_TIMEOUT=0 
Dec 09 15:33:46 ++ sccache --start-server 
Dec 09 15:33:46 Starting sccache server... 
Dec 09 15:33:46 ++ sccache --zero-stats 


@mrshenli (Contributor) left a comment

LGTM

@facebook-github-bot (Contributor)

This pull request has been merged in 9078088.

Labels: cla signed, Merged, oncall: distributed
3 participants