Labels
module: c10d, oncall: distributed, triaged
Description
🐛 Describe the bug
import torch
import torch.distributed as dist
from torch.multiprocessing import Process
import numpy as np

def exec_op(rank):
    dist.init_process_group(backend='gloo', rank=rank, world_size=2,
                            init_method='tcp://127.0.0.1:40001')
    # Each rank draws a different float16 tensor from a seeded RNG.
    np.random.seed(1024 + rank)
    x = np.random.uniform(-65504, 65504, [m, k]).astype(np.float16)
    x = torch.from_numpy(x)
    print(f"rank:{rank} before all_reduce x[7205]:{x[7205]}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank:{rank} after all_reduce x[7205]:{x[7205]}")

if __name__ == '__main__':
    # m and k are inherited by the worker processes via fork.
    m, k = 24063328, 1
    p_list = []
    for g_rank in range(2):
        p = Process(target=exec_op, args=(g_rank,))
        p_list.append(p)
    for p in p_list:
        p.start()
    for p in p_list:
        p.join()
After the all_reduce, about 0.007% of the elements do not match the expected sums.
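Since both ranks' inputs are fully reproducible from their seeds, the mismatch rate can be measured inside the worker itself. Below is a minimal sketch (the helper name count_mismatches is illustrative, not part of the original repro), assuming the reference is the element-wise float16 sum computed on a single process:

def count_mismatches(reduced, rank, world_size=2, seed_base=1024):
    # Rebuild each rank's input from its known seed and accumulate the
    # element-wise float16 reference sum locally.
    expected = torch.zeros_like(reduced)
    for r in range(world_size):
        rng = np.random.RandomState(seed_base + r)
        peer = rng.uniform(-65504, 65504, list(reduced.shape)).astype(np.float16)
        expected += torch.from_numpy(peer)
    # Elements where the collective result disagrees with the local
    # reference; the report observes roughly 0.007% of them.
    bad = (reduced != expected).sum().item()
    print(f"rank:{rank} mismatches: {bad}/{reduced.numel()} "
          f"({bad / reduced.numel():.6%})")

Calling count_mismatches(x, rank) right after the all_reduce in exec_op quantifies the discrepancy. Note that this reference itself fixes one accumulation order in float16; if gloo sums in a different order or intermediate precision, some elements can round differently, which may be what the observed mismatch reflects.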
Versions
Python 3.8.5
torch 2.4.0
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k