
ray.util.collective support torch.bfloat16 #39845

Merged
merged 2 commits into ray-project:master on May 15, 2024

Conversation

wuxibin89
Contributor

Why are these changes needed?

[bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) is widely used in LLM training and inference since it can achieve higher throughput and is less prone to weight growth. ray.util.collective uses cupy.cuda.nccl for GPU communication, but cupy doesn't support bfloat16 yet (cupy/cupy#7527). So for the allgather/reducescatter operations, we should bypass cupy.array and use torch.Tensor directly.
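For context, a minimal sketch of what this change enables from user code: GPU actors all-gathering torch.bfloat16 tensors through ray.util.collective. The actor class, group name, tensor shape, and two-GPU setup below are illustrative assumptions, not code from this PR.

```python
import ray
import torch
import ray.util.collective as col


@ray.remote(num_gpus=1)
class Worker:
    def setup(self, world_size: int, rank: int):
        # Join an NCCL collective group; the group name "bf16" is illustrative.
        col.init_collective_group(world_size, rank, backend="nccl", group_name="bf16")
        self.world_size = world_size
        self.rank = rank
        return True

    def allgather_bf16(self):
        # bfloat16 torch tensors are handed to NCCL directly,
        # bypassing cupy.array (which has no bfloat16 dtype).
        local = torch.full((4,), float(self.rank), dtype=torch.bfloat16, device="cuda")
        gathered = [
            torch.empty(4, dtype=torch.bfloat16, device="cuda")
            for _ in range(self.world_size)
        ]
        col.allgather(gathered, local, group_name="bf16")
        return [t.float().cpu() for t in gathered]


# Assumes a cluster with at least 2 GPUs.
ray.init()
workers = [Worker.remote() for _ in range(2)]
ray.get([w.setup.remote(2, rank) for rank, w in enumerate(workers)])
print(ray.get([w.allgather_bf16.remote() for w in workers]))
```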

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: wuxibin <wuxibin89@163.com>
@anyscalesam
Collaborator

@stephanie-wang can you merge?

@anyscalesam added the triage (Needs triage, e.g. priority, bug/not-bug, and owning component) and core (Issues that should be addressed in Ray Core) labels on May 15, 2024
@stephanie-wang merged commit 04ad91c into ray-project:master on May 15, 2024
5 checks passed
@stephanie-wang
Contributor

Thanks for the contribution, @wuxibin89 !

ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
GabeChurch pushed a commit to GabeChurch/ray that referenced this pull request Jun 11, 2024