
Unexpected behavior when using dist.all_reduce(x, op=dist.ReduceOp.SUM) #152300

@fhk357869050

Description

🐛 Describe the bug

import torch
import torch.distributed as dist
from torch.multiprocessing import Process
import numpy as np


def exec_op(rank, m, k):
    dist.init_process_group(backend='gloo', rank=rank, world_size=2,
                            init_method='tcp://127.0.0.1:40001')
    np.random.seed(1024 + rank)
    # fp16 inputs drawn from the full representable range [-65504, 65504]
    x = np.random.uniform(-65504, 65504, [m, k]).astype(np.float16)
    x = torch.from_numpy(x)
    print(f"rank:{rank} before all_reduce x[7205]:{x[7205]}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank:{rank} after all_reduce x[7205]:{x[7205]}")


if __name__ == '__main__':
    m, k = 24063328, 1
    p_list = []
    for g_rank in range(2):
        # pass m, k explicitly so the repro does not rely on fork-inherited globals
        p = Process(target=exec_op, args=(g_rank, m, k))
        p_list.append(p)
    for p in p_list:
        p.start()
    for p in p_list:
        p.join()
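
One property of the input range worth noting: float16 saturates at ±65504, so the element-wise sum of two values drawn from that range can overflow to inf on its own. A quick standalone check in plain PyTorch:

import torch

# float16's largest finite value is 65504; the sum of two in-range values
# can exceed it and round to inf.
a = torch.tensor([60000.0], dtype=torch.float16)
print(a + a)                           # tensor([inf], dtype=torch.float16)
print(torch.finfo(torch.float16).max)  # 65504.0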


About 0.007% of the elements in the reduced tensor did not match the expected element-wise sums.
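
For reference, a sketch of how that mismatch rate could be measured, assuming the expected result is the element-wise float16 sum of the two ranks' inputs regenerated from the same seeds (make_input and check_result are hypothetical helpers, not part of the original script):

import numpy as np
import torch


def make_input(rank, m, k):
    # Regenerate the tensor a given rank fed into all_reduce (same seed as above).
    rng = np.random.RandomState(1024 + rank)
    return torch.from_numpy(rng.uniform(-65504, 65504, [m, k]).astype(np.float16))


def check_result(reduced, m, k):
    # Compare the all_reduce output against the fp16 element-wise sum
    # of both ranks' inputs.
    expected = make_input(0, m, k) + make_input(1, m, k)
    mismatch = reduced != expected
    print(f"mismatched elements: {mismatch.sum().item()} "
          f"({mismatch.float().mean().item():.6%})")

Calling check_result(x, m, k) on either rank after the collective gives a concrete mismatch count; whether it matches the quoted 0.007% depends on how that figure was originally measured.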


Versions

Python 3.8.5
PyTorch 2.4.0

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

Labels

module: c10d, oncall: distributed, triaged
