
Conversation

@lw (Contributor) commented Nov 29, 2020

Stack from ghstack:

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).


In this commit I'm adding a few asserts to the constructors of FutureNCCL to make sure that what's passed in is what we expect (fun fact: until two commits ago that wasn't the case, as we were passed some empty events).

I'm also making the second constructor private, as it's only supposed to be used by the then() method.

Differential Revision: D25210333
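
The following standalone C++ analogue (illustrative names only, not the actual FutureNCCL code) sketches the shape of both changes: eager validation of the events in the public constructor, and an "empty" constructor that only then() can reach.

#include <cassert>
#include <memory>
#include <vector>

class MyFuture {
 public:
  // Public constructor: callers must pass events that are already created
  // and that all live on the expected device (mirroring the new asserts).
  MyFuture(int deviceIndex, std::vector<int> eventDevices)
      : deviceIndex_(deviceIndex), eventDevices_(std::move(eventDevices)) {
    assert(!eventDevices_.empty());  // reject the "empty events" case
    for (int dev : eventDevices_) {
      assert(dev == deviceIndex_);
    }
  }

  // then() is the only code path allowed to create a not-yet-populated
  // future; its value and events get filled in when the callback runs.
  std::shared_ptr<MyFuture> then() {
    return std::shared_ptr<MyFuture>(new MyFuture(deviceIndex_));
  }

 private:
  // Private: reachable only from then().
  explicit MyFuture(int deviceIndex) : deviceIndex_(deviceIndex) {}

  int deviceIndex_;
  std::vector<int> eventDevices_;
};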

@dr-ci bot commented Nov 29, 2020

💊 CI failures summary and remediations

As of commit 203376d (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



TORCH_INTERNAL_ASSERT(event.isCreated());
TORCH_INTERNAL_ASSERT(event.device_index() == deviceIndex_);
}
for (const at::DataPtr& data_ptr : extractDataPtrs(value_)) {
Contributor:
nit: (this should not block this PR, please ignore for now and fix if necessary later) can extractDataPtrs be expensive (e.g., using pickling)? If so, do we need to cache the extracted data ptrs?

Contributor Author (@lw):

My rationale was that the cost of "extracting" the data ptrs should be of the same order as the cost of "iterating" over them after extraction: extraction is linear in the size of the data, so combined with the linear-time iteration over the data ptrs it doesn't change the overall (asymptotic) complexity.

I do realize, though, that this only holds when the data contains only (or mostly) tensors; if the value is a big user class holding a single tensor, it won't. And indeed we've ended up extracting the data ptrs multiple times, so it makes sense to me to cache them. I'll add this as a separate diff.
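
A standalone sketch of the memoization described above (illustrative names, not the actual PyTorch code): extraction runs at most once, on first access, and every later use iterates over the cached result.

#include <optional>
#include <vector>

struct Value {
  std::vector<int> tensors;  // stand-in for the tensors inside an IValue
};

// Stand-in for extractDataPtrs: linear in the size of the value.
std::vector<const int*> extractDataPtrs(const Value& v) {
  std::vector<const int*> out;
  for (const int& t : v.tensors) {
    out.push_back(&t);
  }
  return out;
}

class CachingFuture {
 public:
  explicit CachingFuture(Value value) : value_(std::move(value)) {}

  // The extraction cost is paid once, even if the data ptrs are needed
  // several times later on.
  const std::vector<const int*>& dataPtrs() {
    if (!dataPtrs_) {
      dataPtrs_ = extractDataPtrs(value_);
    }
    return *dataPtrs_;
  }

 private:
  Value value_;
  std::optional<std::vector<const int*>> dataPtrs_;
};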

@@ -231,8 +231,16 @@ class ProcessGroupNCCL : public ProcessGroup {
TORCH_INTERNAL_ASSERT(
cudaEvents_->size() == 1,
"FutureNCCL only supports single-process single-device mode.");
for (const at::cuda::CUDAEvent& event : *cudaEvents_) {
TORCH_INTERNAL_ASSERT(event.isCreated());
TORCH_INTERNAL_ASSERT(event.device_index() == deviceIndex_);
Contributor:

I might be missing something. Why are all events on the same device? For multi-input collective calls, aren't we going to get one event per device?

Contributor Author (@lw):

At this stage there is in fact a single event, which is on deviceIndex_ (see the check just above). I had no real reason to use a for loop here rather than just checking the first (and only) element of cudaEvents_. If you want, I can change this.
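
For reference, the loop-free variant the author describes would be a sketch along these lines (using the names from the diff above; the single-device assert guarantees exactly one element):

// Check only the first (and, by the assert above, only) event.
const at::cuda::CUDAEvent& event = cudaEvents_->front();
TORCH_INTERNAL_ASSERT(event.isCreated());
TORCH_INTERNAL_ASSERT(event.device_index() == deviceIndex_);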

@mrshenli (Contributor) left a review:
LGTM!

@facebook-github-bot (Contributor): This pull request has been merged in 868a1a4.

Labels: cla signed · Merged · oncall: distributed
Projects: None yet
3 participants