
Unexpected behavior when using dist.all_reduce(x, op=dist.ReduceOp.SUM) #152300

@fhk357869050

Description

🐛 Describe the bug

import torch
import torch.distributed as dist
from torch.multiprocessing import Process
import numpy as np


def exec_op(rank, m, k):
    dist.init_process_group(backend='gloo', rank=rank, world_size=2,
                            init_method='tcp://127.0.0.1:40001')
    np.random.seed(1024 + rank)
    # fp16 inputs drawn from the full representable range [-65504, 65504]
    x = np.random.uniform(-65504, 65504, [m, k]).astype(np.float16)
    x = torch.from_numpy(x)
    print(f"rank:{rank} before all_reduce x[7205]:{x[7205]}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank:{rank} after all_reduce x[7205]:{x[7205]}")


if __name__ == '__main__':
    m, k = 24063328, 1
    p_list = []
    for g_rank in range(2):
        # pass m, k explicitly so the repro does not rely on fork-inherited globals
        p = Process(target=exec_op, args=(g_rank, m, k))
        p_list.append(p)
    for p in p_list:
        p.start()
    for p in p_list:
        p.join()
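
One property of the input range worth noting: float16 saturates at ±65504, so the element-wise sum of two values drawn from that range can overflow to inf on its own. A quick standalone check in plain PyTorch:

import torch

# float16's largest finite value is 65504; the sum of two in-range values
# can exceed it and round to inf.
a = torch.tensor([60000.0], dtype=torch.float16)
print(a + a)                           # tensor([inf], dtype=torch.float16)
print(torch.finfo(torch.float16).max)  # 65504.0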


About 0.007% of the elements in the reduced tensor did not match the expected element-wise sums.
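
For reference, a sketch of how that mismatch rate could be measured, assuming the expected result is the element-wise float16 sum of the two ranks' inputs regenerated from the same seeds (make_input and check_result are hypothetical helpers, not part of the original script):

import numpy as np
import torch


def make_input(rank, m, k):
    # Regenerate the tensor a given rank fed into all_reduce (same seed as above).
    rng = np.random.RandomState(1024 + rank)
    return torch.from_numpy(rng.uniform(-65504, 65504, [m, k]).astype(np.float16))


def check_result(reduced, m, k):
    # Compare the all_reduce output against the fp16 element-wise sum
    # of both ranks' inputs.
    expected = make_input(0, m, k) + make_input(1, m, k)
    mismatch = reduced != expected
    print(f"mismatched elements: {mismatch.sum().item()} "
          f"({mismatch.float().mean().item():.6%})")

Calling check_result(x, m, k) on either rank after the collective gives a concrete mismatch count; whether it matches the quoted 0.007% depends on how that figure was originally measured.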


Versions

Python 3.8.5
PyTorch 2.4.0

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

Labels

module: c10d, oncall: distributed, triaged
