
[core][experimental] Accelerated DAG NCCL-based p2p channels for torch.Tensors #45092

Merged: stephanie-wang merged 55 commits into ray-project:master on May 11, 2024

Conversation

@stephanie-wang (Contributor) commented on May 1, 2024

Why are these changes needed?

This adds an NCCL-based transport option for torch.Tensors. Here is an example of the API:

    with InputNode() as inp:
        dag = sender.send.bind(inp)
        dag = dag.with_type_hint(TorchTensorType(SHAPE, DTYPE, transport="nccl"))
        dag = receiver.recv.bind(dag)

    compiled_dag = dag.experimental_compile()

When transport="nccl" is specified, Ray initializes an NCCL group with the involved actors at compile time (experimental_compile()). The reading actor(s) then recv on the NCCL communicator instead of reading from the default shared-memory channel.
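
For context, here is a minimal end-to-end sketch of how the example above might be driven. The Sender/Receiver actor definitions, the concrete shape and dtype values, and the import paths are assumptions made for illustration and are not part of this PR; they may differ by Ray version.

```python
import ray
import torch

from ray.dag import InputNode
# Import path is assumed; it may differ across Ray versions.
from ray.experimental.channel.torch_tensor_type import TorchTensorType


# Hypothetical actors for illustration; this PR does not define them.
@ray.remote(num_gpus=1)
class Sender:
    def send(self, value: int) -> torch.Tensor:
        # Produce a CUDA tensor matching the declared shape/dtype hint.
        return torch.ones(16, dtype=torch.float16, device="cuda") * value


@ray.remote(num_gpus=1)
class Receiver:
    def recv(self, tensor: torch.Tensor) -> int:
        # The tensor arrives on this actor's GPU via the NCCL channel.
        return int(tensor[0].item())


sender = Sender.remote()
receiver = Receiver.remote()

with InputNode() as inp:
    dag = sender.send.bind(inp)
    dag = dag.with_type_hint(TorchTensorType((16,), torch.float16, transport="nccl"))
    dag = receiver.recv.bind(dag)

compiled_dag = dag.experimental_compile()
ref = compiled_dag.execute(2)
# How the output is read back (e.g. ray.get(ref)) depends on the Ray version.
```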

This PR also refactors channel types so that we now create ChannelInterfaces based on the type hints that appear in the DAG, either a TorchTensorType or the default SharedMemoryType.
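
As a rough illustration of that dispatch, below is a self-contained sketch using hypothetical stub names; Ray's actual ChannelInterface, TorchTensorType, and SharedMemoryType implementations are more involved, so treat this only as the shape of the idea:

```python
# Hypothetical stubs for illustration only; not Ray's real channel classes.

class SharedMemoryChannelStub:
    """Stands in for the default shared-memory channel."""
    transport = "shared_memory"


class NcclChannelStub:
    """Stands in for the NCCL-based GPU-to-GPU channel."""
    transport = "nccl"


class TorchTensorHintStub:
    def __init__(self, transport: str = "auto"):
        self.transport = transport


def create_channel(type_hint) -> object:
    # Only torch.Tensor hints that explicitly request NCCL get the GPU channel;
    # every other DAG edge falls back to the shared-memory default.
    if isinstance(type_hint, TorchTensorHintStub) and type_hint.transport == "nccl":
        return NcclChannelStub()
    return SharedMemoryChannelStub()


assert create_channel(TorchTensorHintStub(transport="nccl")).transport == "nccl"
assert create_channel(None).transport == "shared_memory"
```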

Current limitations:

  • p2p only, no collectives
  • Synchronizes the CUDA stream after receiving data. Kernels that follow the NCCL op have no guarantee that the op succeeded, so it is not safe to read the received buffer until we know it completed (see the sketch after this list).
  • Shape and dtype of the tensor must be declared at compile time.
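
To make the stream-synchronization point concrete, here is a minimal sketch of a receive path, assuming a torch.distributed process group with the NCCL backend as a stand-in for Ray's internal NCCL group; this is illustrative and not the PR's actual code:

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl", ...) has already been called and
# this process owns a CUDA device. Illustrative only, not Ray's internals.

def recv_tensor(shape, dtype, src_rank: int) -> torch.Tensor:
    # Shape and dtype must be known up front so the buffer can be
    # pre-allocated -- the same compile-time declaration required above.
    buf = torch.empty(shape, dtype=dtype, device="cuda")
    dist.recv(buf, src=src_rank)  # NCCL recv is enqueued on a CUDA stream
    # Until the stream is synchronized, we cannot tell whether the NCCL op
    # actually succeeded, so it is not yet safe to let later kernels or the
    # CPU read the received buffer.
    torch.cuda.current_stream().synchronize()
    return buf
```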

TODO: Add unit and release tests to this PR.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

stephanie-wang and others added 30 commits April 17, 2024 15:56
@jackhumphries (Contributor)

Is this ready for another pass?

stephanie-wang and others added 3 commits May 10, 2024 14:11
@stephanie-wang (Contributor, Author)

> Is this ready for another pass?

Yes, it's ready! I removed the multi-GPU CI tests; I will merge those in separately after #45253.

@stephanie-wang merged commit 79f3995 into ray-project:master on May 11, 2024
5 checks passed
@stephanie-wang deleted the dag-nccl branch on May 11, 2024 at 00:08
stephanie-wang added a commit that referenced this pull request May 13, 2024
Fixes a test that was broken due to a hidden merge conflict between #45092 and #45119.
Related issue number

Closes #45264.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
stephanie-wang added a commit that referenced this pull request May 14, 2024
Fix another broken test from #45092.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
ryanaoleary pushed commits to ryanaoleary/ray that referenced this pull request on Jun 6 and Jun 7, 2024, and GabeChurch pushed commits to GabeChurch/ray that referenced it on Jun 11, 2024. Each series cherry-picks this change (ray-project#45092) together with the two follow-up test fixes above.