
Conversation

@d4l3k (Member) commented May 13, 2025

This adds support for using torch dtypes in CUDA kernels when building PyTorch.

Tested with: pytorch/pytorch#153406
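
For illustration, a minimal sketch (not part of this PR) of what the dtype support is for, assuming the companion PyTorch change above is built in and a transport with CUDA support is configured:

```py
import torch
import torch.distributed as dist

dist.init_process_group("gloo")

# bfloat16 is a torch dtype rather than a native Gloo type; with this change
# the Gloo CUDA kernels can operate on it directly.
t = torch.ones(1024, dtype=torch.bfloat16, device="cuda")
dist.all_reduce(t)
```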

Test plan:

```py
import os
import time

transport = "TCP"
#transport = "IBVERBS"

os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)  # pin this rank to its own GPU

# pick the InfiniBand NIC (name:port) for this rank
ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")

device = "cpu"

iters = 10
warmup_iters = 2

for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
    t = torch.zeros(nelem, device=device)

    torch.cuda.current_stream().synchronize()
    for i in range(warmup_iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    start = time.perf_counter()

    for i in range(iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    dur = (time.perf_counter() - start)
    qps = iters/dur

    bandwidth_gb = t.nbytes * iters / dur / 1e9  # effective bandwidth in GB/s

    gb = t.nbytes / 1e9  # tensor size in GB

    if rank == 0:
        print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```

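The script takes its rank from the RANK environment variable, so it can be launched with a standard launcher such as torchrun (for example `torchrun --nproc_per_node=8 bench.py`, where bench.py is a hypothetical name for the snippet above); each rank then pins itself to its own GPU and IB NIC.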
@d4l3k force-pushed the d4l3k/torch_dtypes branch from 80cc076 to ff3d11a on May 14, 2025 17:56
@fduwjj (Contributor) left a comment


LGTM. Is there any other dtype we want to support aside from bf16?


```cmake
message(STATUS "GLOO_USE_TORCH_DTYPES : ${GLOO_USE_TORCH_DTYPES} ${GLOO_TORCH_DIR}")
if(GLOO_USE_TORCH_DTYPES)
  target_include_directories(gloo_hip PRIVATE ${GLOO_TORCH_DIR})
```
Contributor

Interesting... Gloo also supports ROCm?

Member Author

Yup -- it's a bit of a mess, but yes. It uses HIP the same way PyTorch does.

@d4l3k merged commit fe67c4b into main May 15, 2025
7 checks passed
@d4l3k deleted the d4l3k/torch_dtypes branch May 15, 2025 04:49
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request May 16, 2025
This enables Gloo CUDA when used with a backend that supports GPUDirect, which is currently only the IBVERBS backend.

This requires some changes to Gloo which are in pytorch/gloo#441

Since we're now depending on gloo_cuda, we need to split ProcessGroupGloo into two pieces: one with the CPU bits (in libtorch_cpu) and one with the CUDA kernels (in libtorch_cuda). This unfortunately requires some major refactoring, as some CPU code is shared across both.

The gloo submodule is updated to depend on the new Gloo changes.
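
Before the test plan below, a minimal sketch of exercising the GPU path described above; the environment variables mirror the ones in the test plan, the NIC name is only an example, and an IBVERBS-capable fabric is assumed:

```py
import os

# GPUDirect is only available on the IBVERBS transport
os.environ["GLOO_DEVICE_TRANSPORT"] = "IBVERBS"
os.environ["TORCH_GLOO_IBV_NAME"] = "mlx5_0"  # example NIC; pick the right one per rank
os.environ["TORCH_GLOO_IBV_PORT"] = "1"
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# with the ProcessGroupGloo split, CUDA tensors go through the gloo_cuda kernels
t = torch.ones(1_000_000, device="cuda")
dist.all_reduce(t)
```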

Test plan:

```py
import os
import time

transport = "TCP"
#transport = "IBVERBS"

os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)

ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")

device = "cpu"

iters = 10
warmup_iters = 2

for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
    t = torch.zeros(nelem, device=device)

    torch.cuda.current_stream().synchronize()
    for i in range(warmup_iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    start = time.perf_counter()

    for i in range(iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    dur = (time.perf_counter() - start)
    qps = iters/dur

    bandwidth_gb = t.nbytes * iters / dur / 1e9

    gb = t.nbytes / 1e9

    if rank == 0:
        print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```

Pull Request resolved: #153406
Approved by: https://github.com/fduwjj