Enable GPU-to-GPU comm in TensorPipeAgent #44418
Changes from 39 commits
Shouldn't we only create a context for devices involved in the operation? It looks like we get a stream from the pool for all available devices.
I agree with @pritamdamania87. As mentioned above, I think it would be good if we could be "lazy" and only access devices once we need them, to avoid initializing them uselessly.
I suspect one reason for getting a stream for all devices is in case the remote user function returns a tensor on a device that wasn't used by any of the arguments. In such a case, we would need to have set a current stream for that device before invoking the remote user function.
However, we've had to deal with a similar concern regarding CUDAFuture callbacks, and there it proved impossible to get a stream for all devices (it ended up causing a deadlock with NCCL), hence we resorted to just getting a stream for the devices used by the future's value, and declaring that if the callback uses another device, that's undefined behavior.
Hence, I think we could do the same thing for remote user functions (it would be consistent!), and maybe we can find a better solution that covers both use cases in a later release? (As a general API approach, I'd rather start with more restrictions and gradually lift them than start with a generic but problematic solution that cannot be changed without breaking backwards compatibility.)
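The lazy approach described above could, as a rough sketch, first collect the set of devices actually used by the future's value and acquire streams only for those. The helper below is hypothetical (it is not the actual CUDAFuture code) and uses CPU tensors so it runs anywhere:

```python
import torch

def devices_of(value):
    # Recursively collect the set of devices used by a (possibly nested)
    # value, so streams are acquired only for those devices.
    devs = set()
    if isinstance(value, torch.Tensor):
        devs.add(value.device)
    elif isinstance(value, (list, tuple)):
        for v in value:
            devs |= devices_of(v)
    elif isinstance(value, dict):
        for v in value.values():
            devs |= devices_of(v)
    return devs

# Streams would then be taken from the pool only for devices_of(result),
# leaving every other device untouched (and its use undefined behavior).
```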
I am not sure this is OK for RPC user functions. Suppose we are running pipeline parallelism on an 8-GPU machine with 2 processes. The first process would take input on `cuda:0` and produce output on `cuda:3`, and then send it to `cuda:4` on the other process. If we only use the devices listed in the input args, does it mean users will have to move the intermediate output to `cuda:0` before sending it?
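The scenario raised above can be sketched roughly as follows. The stage function and device strings are illustrative, and the snippet falls back to CPU so it runs without 4 GPUs:

```python
import torch

def stage_one(x):
    # First process of the pipeline: input arrives on cuda:0, but the
    # stage's output is produced on cuda:3, a device the input never used.
    dev_in = "cuda:0" if torch.cuda.is_available() else "cpu"
    dev_out = "cuda:3" if torch.cuda.device_count() >= 4 else dev_in
    hidden = x.to(dev_in) * 2     # stand-in for the stage's computation
    return hidden.to(dev_out)     # output device may differ from input device

out = stage_one(torch.ones(2))
# If streams are set up only for the devices of the *arguments*, then
# dev_out has no current stream when this function runs on the callee.
```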
Yep, let me make this change first.
Are we sure this is a common use case? Correct me if I'm wrong, but I believe that all CUDA kernels (i.e., the computations themselves) operate with inputs and outputs on the same device. Hence if a function returns an output on a different device than its inputs, it must be because the function has explicitly performed such a transfer, for example by calling `.to(idx)`. I could try to argue that this should be considered non-idiomatic for RPC, since RPC should be used for all data transfers, in order to provide consistency, performance, and "scalability" (as in resiliency to topology changes, since it allows easily switching from local GPUs to remote GPUs without code changes). Hence such a function should probably be split into two parts (each operating on a single device) connected through RPC calls.
Admittedly the code to do this might be a bit more convoluted than just using `.to(...)`, but I think we can always provide helpers and examples to make this easier to use. (For example, I believe #48790 might already help.)
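The proposed split might look something like the sketch below. The RPC chaining is shown in comments, since it would need an initialized agent, and the worker names are illustrative:

```python
import torch

# Instead of one function that calls .to() internally to hop devices,
# split the work into two single-device functions and let RPC move the
# tensor between them.

def stage_a(x):
    return x * 2          # operates entirely on x's device

def stage_b(h):
    return h + 1          # operates entirely on h's device

# With an initialized RPC agent the chain would be, e.g.:
#   h = rpc.rpc_sync("worker1", stage_a, args=(x,))
#   y = rpc.rpc_sync("worker2", stage_b, args=(h,))
# Run locally, the same chain is:
y = stage_b(stage_a(torch.ones(3)))
```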
This would mean we need an overhaul of our `torch.distributed.pipe` implementation. I am open to either revamping the existing one, or implementing a new distributed pipe where each process exclusively uses one device.
Could you fill me in on what `torch.distributed.pipe` does? Is it calling `.to()` explicitly?