
ray.util.collective support torch.bfloat16 #39845

Merged
merged 2 commits into ray-project:master on May 15, 2024

Conversation

wuxibin89
Contributor

Why are these changes needed?

[bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) is widely used in LLM training and inference since it can achieve higher throughput and is less prone to weight growth. ray.util.collective uses cupy.cuda.nccl for GPU communication, but cupy doesn't support bfloat16 yet (cupy/cupy#7527). So for the allgather/reducescatter operations, we should bypass cupy.array and use torch.Tensor directly.
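For context, a minimal sketch of what this change enables from user code: GPU actors all-gathering torch.bfloat16 tensors through ray.util.collective. The actor class, group name, tensor shape, and two-GPU setup below are illustrative assumptions, not code from this PR.

```python
import ray
import torch
import ray.util.collective as col


@ray.remote(num_gpus=1)
class Worker:
    def setup(self, world_size: int, rank: int):
        # Join an NCCL collective group; the group name "bf16" is illustrative.
        col.init_collective_group(world_size, rank, backend="nccl", group_name="bf16")
        self.world_size = world_size
        self.rank = rank
        return True

    def allgather_bf16(self):
        # bfloat16 torch tensors are handed to NCCL directly,
        # bypassing cupy.array (which has no bfloat16 dtype).
        local = torch.full((4,), float(self.rank), dtype=torch.bfloat16, device="cuda")
        gathered = [
            torch.empty(4, dtype=torch.bfloat16, device="cuda")
            for _ in range(self.world_size)
        ]
        col.allgather(gathered, local, group_name="bf16")
        return [t.float().cpu() for t in gathered]


# Assumes a cluster with at least 2 GPUs.
ray.init()
workers = [Worker.remote() for _ in range(2)]
ray.get([w.setup.remote(2, rank) for rank, w in enumerate(workers)])
print(ray.get([w.allgather_bf16.remote() for w in workers]))
```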

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: wuxibin <wuxibin89@163.com>
@anyscalesam
Collaborator

@stephanie-wang can you merge?

@anyscalesam added the triage (Needs triage, e.g. priority, bug/not-bug, and owning component) and core (Issues that should be addressed in Ray Core) labels on May 15, 2024
@stephanie-wang merged commit 04ad91c into ray-project:master on May 15, 2024
5 checks passed
@stephanie-wang
Contributor

Thanks for the contribution, @wuxibin89 !

ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
GabeChurch pushed a commit to GabeChurch/ray that referenced this pull request Jun 11, 2024