
[core][experimental] Accelerated DAG NCCL-based p2p channels for torch.Tensors #45092

Merged: stephanie-wang merged 55 commits into ray-project:master on May 11, 2024

Conversation

@stephanie-wang (Contributor) commented on May 1, 2024

Why are these changes needed?

This adds an NCCL-based transport option for torch.Tensors. Here is an example of the API:

    with InputNode() as inp:
        dag = sender.send.bind(inp)
        dag = dag.with_type_hint(TorchTensorType(SHAPE, DTYPE, transport="nccl"))
        dag = receiver.recv.bind(dag)

    compiled_dag = dag.experimental_compile()

When transport="nccl" is specified, Ray initializes an NCCL group with the involved actors at compile time (experimental_compile()). The reading actor(s) then recv on the NCCL communicator instead of reading from the default shared-memory channel.
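
For context, here is a minimal end-to-end sketch of how the example above might be driven. The Sender/Receiver actor definitions, the concrete shape and dtype values, and the import paths are assumptions made for illustration and are not part of this PR; they may differ by Ray version.

```python
import ray
import torch

from ray.dag import InputNode
# Import path is assumed; it may differ across Ray versions.
from ray.experimental.channel.torch_tensor_type import TorchTensorType


# Hypothetical actors for illustration; this PR does not define them.
@ray.remote(num_gpus=1)
class Sender:
    def send(self, value: int) -> torch.Tensor:
        # Produce a CUDA tensor matching the declared shape/dtype hint.
        return torch.ones(16, dtype=torch.float16, device="cuda") * value


@ray.remote(num_gpus=1)
class Receiver:
    def recv(self, tensor: torch.Tensor) -> int:
        # The tensor arrives on this actor's GPU via the NCCL channel.
        return int(tensor[0].item())


sender = Sender.remote()
receiver = Receiver.remote()

with InputNode() as inp:
    dag = sender.send.bind(inp)
    dag = dag.with_type_hint(TorchTensorType((16,), torch.float16, transport="nccl"))
    dag = receiver.recv.bind(dag)

compiled_dag = dag.experimental_compile()
ref = compiled_dag.execute(2)
# How the output is read back (e.g. ray.get(ref)) depends on the Ray version.
```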

This PR also refactors channel types so that we now create ChannelInterfaces based on the type hints that appear in the DAG, either a TorchTensorType or the default SharedMemoryType.
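
As a rough illustration of that dispatch, below is a self-contained sketch using hypothetical stub names; Ray's actual ChannelInterface, TorchTensorType, and SharedMemoryType implementations are more involved, so treat this only as the shape of the idea:

```python
# Hypothetical stubs for illustration only; not Ray's real channel classes.

class SharedMemoryChannelStub:
    """Stands in for the default shared-memory channel."""
    transport = "shared_memory"


class NcclChannelStub:
    """Stands in for the NCCL-based GPU-to-GPU channel."""
    transport = "nccl"


class TorchTensorHintStub:
    def __init__(self, transport: str = "auto"):
        self.transport = transport


def create_channel(type_hint) -> object:
    # Only torch.Tensor hints that explicitly request NCCL get the GPU channel;
    # every other DAG edge falls back to the shared-memory default.
    if isinstance(type_hint, TorchTensorHintStub) and type_hint.transport == "nccl":
        return NcclChannelStub()
    return SharedMemoryChannelStub()


assert create_channel(TorchTensorHintStub(transport="nccl")).transport == "nccl"
assert create_channel(None).transport == "shared_memory"
```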

Current limitations:

  • p2p only, no collectives
  • Synchronizes the CUDA stream after receiving data. Kernels that follow the NCCL op have no guarantee that the op succeeded, so it is not safe to read the received buffer until we know it completed (see the sketch after this list).
  • Shape and dtype of the tensor must be declared at compile time.
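
To make the stream-synchronization point concrete, here is a minimal sketch of a receive path, assuming a torch.distributed process group with the NCCL backend as a stand-in for Ray's internal NCCL group; this is illustrative and not the PR's actual code:

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl", ...) has already been called and
# this process owns a CUDA device. Illustrative only, not Ray's internals.

def recv_tensor(shape, dtype, src_rank: int) -> torch.Tensor:
    # Shape and dtype must be known up front so the buffer can be
    # pre-allocated -- the same compile-time declaration required above.
    buf = torch.empty(shape, dtype=dtype, device="cuda")
    dist.recv(buf, src=src_rank)  # NCCL recv is enqueued on a CUDA stream
    # Until the stream is synchronized, we cannot tell whether the NCCL op
    # actually succeeded, so it is not yet safe to let later kernels or the
    # CPU read the received buffer.
    torch.cuda.current_stream().synchronize()
    return buf
```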

TODO: Add unit and release tests to this PR.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

stephanie-wang and others added 30 commits April 17, 2024 15:56
@jackhumphries (Contributor)

Is this ready for another pass?

stephanie-wang and others added 3 commits May 10, 2024 14:11
@stephanie-wang (Contributor, Author)

> Is this ready for another pass?

Yes, it's ready! I removed the multi-GPU CI tests; I will merge those in separately after #45253.

@stephanie-wang merged commit 79f3995 into ray-project:master on May 11, 2024
5 checks passed
@stephanie-wang deleted the dag-nccl branch on May 11, 2024 at 00:08
stephanie-wang added a commit that referenced this pull request May 13, 2024
Fixes a test that was broken due to a hidden merge conflict between #45092 and #45119.
Related issue number

Closes #45264.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
stephanie-wang added a commit that referenced this pull request May 14, 2024
Fix another broken test from #45092.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
ryanaoleary pushed commits to ryanaoleary/ray that referenced this pull request on Jun 6 and Jun 7, 2024, and GabeChurch pushed commits to GabeChurch/ray that referenced it on Jun 11, 2024. Each series cherry-picks this change (ray-project#45092) together with the two follow-up test fixes above.