[Core][Distributed] support both cpu and device tensor in broadcast tensor dict #4660
Prior to this PR, `broadcast_tensor_dict` only worked for CUDA tensors. This PR enables `broadcast_tensor_dict` to handle both CUDA and CPU tensors. This is useful when some metadata lives in CPU tensors, e.g. `blocks_to_swap_in` and `blocks_to_swap_out`, to be introduced in #4659.

Note: `blocks_to_copy` is still a CUDA tensor, because both the source and target of the copy live on the GPU and we have a dedicated copy kernel for it. `blocks_to_swap_in` and `blocks_to_swap_out` have to be CPU tensors, because they are kernel launch arguments.