Tensor parallel hangs on call to model #55
Original issue (@briandw):

I'm trying to run the TP code using this script. It runs until the first call to the model in prefill, where it stops and hangs. What is the known configuration that TP works under?

```
>>> torch.__version__
'2.2.0.dev20231213+cu121'

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
```

Debug output

cc @Chillee. Any help greatly appreciated.
Comments
@Chillee: This seems like a configuration issue with NCCL. Generally, communication collectives (and thus the process) hang when the ranks are unable to connect to each other in some manner. Not sure if the PyTorch distributed folks have any ideas offhand (cc: @yifuwang).
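An aside not from the thread: NCCL's own logging is usually the quickest way to see which transport each rank picks and where the connection stalls. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables; a minimal sketch, setting them in Python before the process group is created:

```python
# Sketch (not from the thread): enable NCCL's logging before process-group
# initialization so each rank prints the transports it negotiates.
import os

os.environ["NCCL_DEBUG"] = "INFO"             # standard NCCL env var
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # optional: limit log volume

import torch.distributed as dist

dist.init_process_group(backend="nccl")
```

The same variables can of course be set in the shell before launching instead.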
@briandw: @Chillee Do you have a version / git hash of PyTorch that works with TP?
@yifuwang: Hey @briandw, can you try this small script to see if the issue reproduces?

```python
import os

import torch
import torch.distributed as dist

if __name__ == "__main__":
    # torchrun (or another launcher) sets these environment variables.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    pid = os.getpid()

    def log(msg) -> None:
        print(f"[rank {rank}, pid {pid}] {msg}")

    torch.cuda.set_device(f"cuda:{local_rank}")

    log("Initializing process group...")
    dist.init_process_group(backend="nccl")
    log("Process group initialization completed")

    log("Testing all_reduce...")
    t = torch.full((8, 8), rank, device="cuda")
    dist.all_reduce(t)
    # The sum of ranks 0..world_size-1 is world_size * (world_size - 1) / 2.
    assert t.eq(world_size * (world_size - 1) // 2).all()
    log("All_reduce completed")
```

Run it with a two-process launcher.
@briandw: @yifuwang I ran the code and it hung. Here's the stack trace:
@yifuwang: @briandw Looks like it got past process-group initialization but got stuck in all_reduce. Curious, have you had success with NCCL before on your dual-4090 setup? I don't have access to such a setup, so I can only offer some ideas:
@briandw: Wow, looks like NCCL_P2P_DISABLE=1 did the trick! Thanks for your help, @yifuwang. BTW, I found a good discussion of this issue: NVIDIA/nccl-tests#117
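A likely explanation, not spelled out in the thread: consumer RTX 4090s don't support CUDA peer-to-peer access between GPUs, and if NCCL nevertheless selects a P2P transport, the collective can stall; `NCCL_P2P_DISABLE=1` forces NCCL onto shared-memory/socket paths instead. A sketch for checking what CUDA reports, using PyTorch's `torch.cuda.can_device_access_peer`:

```python
# Sketch: ask CUDA whether each pair of GPUs can address each other's
# memory directly. On dual RTX 4090s this typically prints False, which
# is consistent with needing NCCL_P2P_DISABLE=1.
import torch

n = torch.cuda.device_count()
for a in range(n):
    for b in range(n):
        if a != b:
            ok = torch.cuda.can_device_access_peer(a, b)
            print(f"cuda:{a} -> cuda:{b} peer access: {ok}")
```

With that confirmed, the repro script (and the TP code) can be launched with the workaround applied, e.g. `NCCL_P2P_DISABLE=1 torchrun --nproc-per-node=2 repro.py` (same placeholder filename as above).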