Tensor parallel hangs on call to model #55

Closed · briandw opened this issue Dec 15, 2023 · 6 comments
Comments

@briandw

briandw commented Dec 15, 2023

I'm trying to run the TP code, using this script.

export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
export OMP_NUM_THREADS=16
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO
#time torchrun --standalone --nproc_per_node=2 generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "def quicksort(arr):" --max_new_tokens 200 --num_samples 50 --temperature 0

python -m torch.distributed.launch --nproc_per_node=2 \
                                   --nnodes=1 \
                                   --node_rank=0 \
                                   --master_addr="127.0.0.1" \
                                   --master_port=29501 \
                                   generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "def quicksort(arr):" --max_new_tokens 200 --num_samples 50 --temperature 0

It runs until the first call to the model in prefill, then hangs there.

What is the known configuration under which TP works?

torch.__version__
'2.2.0.dev20231213+cu121'

~ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0 Off |                  Off |
|  0%   44C    P2              94W / 480W |   7508MiB / 24564MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:17:00.0 Off |                  Off |
| 30%   26C    P2             100W / 480W |   7508MiB / 24564MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Debug output

[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
/home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
[I socket.cpp:480] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:531] [c10d - debug] The server socket is attempting to listen on [::]:29501.
[I socket.cpp:605] [c10d] The server socket has started to listen on [::]:29501.
[I TCPStore.cpp:305] [c10d - debug] The server has started on port = 29501.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29501).
[I socket.cpp:796] [c10d - trace] The client socket is attempting to connect to [localhost]:29501.
[I socket.cpp:299] [c10d - debug] The server socket on [::]:29501 has accepted a connection from [localhost]:35774.
[I socket.cpp:884] [c10d] The client socket has connected to [localhost]:29501 on [localhost]:35774.
[I TCPStore.cpp:342] [c10d - debug] TCP client connected to host 127.0.0.1:29501
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
/home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29501).
[I socket.cpp:796] [c10d - trace] The client socket is attempting to connect to [localhost]:29501.
[I socket.cpp:299] [c10d - debug] The server socket on [::]:29501 has accepted a connection from [localhost]:35790.
[I socket.cpp:884] [c10d] The client socket has connected to [localhost]:29501 on [localhost]:35790.
[I TCPStore.cpp:342] [c10d - debug] TCP client connected to host 127.0.0.1:29501
[I ProcessGroupNCCL.cpp:785] [Rank 1] ProcessGroupNCCL initialization options: NCCL version: 2.19.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 120, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, NCCL_DEBUG: OFF, ID=94527839849424
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29501).
[I socket.cpp:796] [c10d - trace] The client socket is attempting to connect to [localhost]:29501.
[I socket.cpp:299] [c10d - debug] The server socket on [::]:29501 has accepted a connection from [localhost]:35804.
[I socket.cpp:884] [c10d] The client socket has connected to [localhost]:29501 on [localhost]:35804.
[I TCPStore.cpp:342] [c10d - debug] TCP client connected to host 127.0.0.1:29501
[I ProcessGroupNCCL.cpp:785] [Rank 0] ProcessGroupNCCL initialization options: NCCL version: 2.19.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 120, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, NCCL_DEBUG: OFF, ID=94408215489568
Loading model ...
Applying tensor parallel to model ...
Time to load model: 2.36 seconds
no compile
Generating sample 1 of 50
generate
cache setup done
prefill
[rank0]:[I ProcessGroupWrapper.cpp:587] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=ALLREDUCE, TensorShape=[7, 4096], TensorDtypes=BFloat16, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank1]:[I ProcessGroupWrapper.cpp:587] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=ALLREDUCE, TensorShape=[7, 4096], TensorDtypes=BFloat16, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank0]:[I ProcessGroupNCCL.cpp:1854] NCCL_DEBUG: N/A

@Chillee Any help greatly appreciated.

@Chillee
Contributor

Chillee commented Dec 17, 2023

This seems like a configuration issue with NCCL. Generally, communication collectives (and thus the process) hang when the ranks are unable to connect to each other in some manner.

Not sure if the PyTorch distributed folks have any ideas offhand (cc @yifuwang).
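
For reference, one way to get more signal at the hang point is NCCL's own debug logging, enabled before the process group is created so each rank reports how it sets up its transports right up to the collective that blocks. A minimal sketch, assuming the same launcher-provided environment as above; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables:

import os

# NCCL reads these environment variables when the communicator is created
# (lazily, at the first collective), so they must be set before that point.
os.environ.setdefault("NCCL_DEBUG", "INFO")             # per-rank NCCL log lines
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")  # focus on connection setup

import torch.distributed as dist

dist.init_process_group(backend="nccl")
# ... rest of the program; the NCCL log now shows which transport each rank
# selected (e.g. P2P, SHM, NET) before the all-reduce that hangs.

Exporting the same two variables in the launch script before torchrun / torch.distributed.launch has the same effect.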

@briandw
Author

briandw commented Dec 17, 2023

@Chillee Do you have a version / git hash of PyTorch that works with TP?
I've also filed an issue with PyTorch: pytorch/pytorch#115964

@yifuwang
Contributor

Hey @briandw, can you try this small script to see if the issue reproduces?

import os

import torch
import torch.distributed as dist


if __name__ == "__main__":
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    pid = os.getpid()

    def log(msg) -> None:
        print(f"[rank {rank}, pid {pid}] {msg}")

    torch.cuda.set_device(f"cuda:{local_rank}")

    log("Initializing process group...")
    dist.init_process_group(backend="nccl")
    log("Process group initialization completed")

    log("Testing all_reduce...")
    t = torch.full((8, 8), rank, device="cuda")
    dist.all_reduce(t)
    assert t.eq(world_size * (world_size - 1) // 2).all()
    log("All_reduce completed")

Run it with torchrun --nproc_per_node=2 --monitor-interval=1 [name].py. If it hangs, it would be very helpful if you could provide the stack trace: find the pids in the log and run gdb -p [pid], then bt.
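
A related sketch, assuming the same torchrun-provided RANK/LOCAL_RANK variables: Python's faulthandler module can dump every thread's Python stack automatically if a rank is still blocked after a timeout, which complements attaching gdb.

import faulthandler
import os

import torch
import torch.distributed as dist

if __name__ == "__main__":
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(f"cuda:{local_rank}")

    # If this rank has not finished within 120 s, print every thread's Python
    # stack to stderr, but keep the process alive so gdb can still be attached.
    faulthandler.dump_traceback_later(120, exit=False)

    dist.init_process_group(backend="nccl")
    t = torch.full((8, 8), rank, device="cuda")
    dist.all_reduce(t)
    torch.cuda.synchronize()

    faulthandler.cancel_dump_traceback_later()
    print(f"[rank {rank}] all_reduce completed")

Note that the CUDA-level frames (as in the gdb backtrace below) are only visible from gdb; the faulthandler dump shows where each rank's Python code is waiting.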

@briandw
Author

briandw commented Dec 17, 2023

@yifuwang
Thanks for having a look at this.

I ran the code and it hung.

Here's the stack trace:

GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 136014
[New LWP 136019]
[New LWP 136020]
[New LWP 136021]
[New LWP 136039]
[New LWP 136040]
[New LWP 136041]
[New LWP 136042]
[New LWP 136049]
[New LWP 136052]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ffe1c5ae6e8 in ?? ()
(gdb) bt
#0  0x00007ffe1c5ae6e8 in ?? ()
#1  0x00007ffe1c5ae84a in ?? ()
#2  0x00007fc91c8e566d in __GI___clock_gettime (clock_id=<optimized out>, tp=<optimized out>) at ../sysdeps/unix/sysv/linux/clock_gettime.c:42
#3  0x00007fc8718b6e24 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fc87177cf56 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fc871b01e8a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007fc87188cb66 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007fc8718747be in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00007fc871877140 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#9  0x00007fc8718d8d24 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#10 0x00007fc91ba37c5d in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#11 0x00007fc91ba383a0 in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#12 0x00007fc91ba383ff in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#13 0x00007fc91ba3af84 in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#14 0x00007fc91ba14930 in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#15 0x00007fc91ba6bf5e in cudaLaunchKernel ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#16 0x00007fc8d1ae2f7b in void at::native::gpu_kernel_impl_nocast<at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > >(at::TensorIteratorBase&, at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#17 0x00007fc8d1ae3575 in void at::native::gpu_kernel_impl<at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > >(at::TensorIteratorBase&, at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#18 0x00007fc8d1ae3b0b in void at::native::gpu_kernel<at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > >(at::TensorIteratorBase&, at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > const&) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#19 0x00007fc8d1ae3c69 in void at::native::opmath_symmetric_gpu_kernel_with_scalars<long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> >(at::TensorIteratorBase&, at::native::(anonymous namespace)::CompareEqFunctor<long> const&) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#20 0x00007fc8d1ab22a9 in at::native::compare_eq_ne_kernel(at::TensorIteratorBase&, at::native::(anonymous namespace)::EqOpType) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#21 0x00007fc8d34a3ddb in at::(anonymous namespace)::wrapper_CUDA_eq_Scalar(at::Tensor const&, c10::Scalar const&) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#22 0x00007fc8d34a3e70 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::Scalar const&), &at::(anonymous namespace)::wrapper_CUDA_eq_Scalar>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::Scalar const&> >, at::Tensor (at::Tensor const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#23 0x00007fc904c0a3ce in at::_ops::eq_Scalar::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#24 0x00007fc906aab99a in torch::autograd::VariableType::(anonymous namespace)::eq_Scalar(c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#25 0x00007fc906aab9e3 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::eq_Scalar>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#26 0x00007fc904c5ecc1 in at::_ops::eq_Scalar::call(at::Tensor const&, c10::Scalar const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#27 0x00007fc91a2c2850 in torch::autograd::THPVariable_eq(_object*, _object*, _object*) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#28 0x000056486a93bdb0 in method_vectorcall_VARARGS_KEYWORDS (func=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.0/Objects/descrobject.c:364
#29 0x000056486a927e91 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7fc91bd9ee80, tstate=0x56486acb1d98 <_PyRuntime+166328>)
    at /usr/local/src/conda/python-3.11.0/Include/internal/pycore_call.h:92
#30 PyObject_Vectorcall (callable=0x7fc91bd9ee80, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.0/Objects/call.c:299
#31 0x000056486a91ac62 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at /usr/local/src/conda/python-3.11.0/Python/ceval.c:4772
#32 0x000056486a9d805e in _PyEval_EvalFrame (throwflag=0, frame=0x7fc91cc0b020, tstate=0x56486acb1d98 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.0/Include/internal/pycore_ceval.h:73
#33 _PyEval_Vector (tstate=0x56486acb1d98 <_PyRuntime+166328>, func=0x7fc91c1d1f80, locals=0x7fc91c1f2580, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.11.0/Python/ceval.c:6428
#34 0x000056486a9d75ef in PyEval_EvalCode (co=<optimized out>, globals=0x7fc91c1f2580, locals=<optimized out>) at /usr/local/src/conda/python-3.11.0/Python/ceval.c:1154
#35 0x000056486a9fa12c in run_eval_code_obj (tstate=0x56486acb1d98 <_PyRuntime+166328>, co=0x7fc91c0ea670, globals=0x7fc91c1f2580, locals=0x7fc91c1f2580) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:1714
#36 0x000056486a9f63a4 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7fc91c1f2580, locals=0x7fc91c1f2580, flags=<optimized out>, arena=<optimized out>)
    at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:1735
#37 0x000056486aa0b372 in pyrun_file (fp=fp@entry=0x56486ca57030, filename=filename@entry=0x7fc91c01eba0, start=start@entry=257, globals=globals@entry=0x7fc91c1f2580, locals=locals@entry=0x7fc91c1f2580, closeit=closeit@entry=1, 
    flags=0x7ffe1c4698a8) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:1630
#38 0x000056486aa0aca5 in _PyRun_SimpleFileObject (fp=0x56486ca57030, filename=0x7fc91c01eba0, closeit=1, flags=0x7ffe1c4698a8) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:440
#39 0x000056486aa0aa73 in _PyRun_AnyFileObject (fp=0x56486ca57030, filename=0x7fc91c01eba0, closeit=1, flags=0x7ffe1c4698a8) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:79
#40 0x000056486aa04b76 in pymain_run_file_obj (skip_source_first_line=0, filename=0x7fc91c01eba0, program_name=0x7fc91c0d7b10) at /usr/local/src/conda/python-3.11.0/Modules/main.c:360
#41 pymain_run_file (config=0x56486ac97de0 <_PyRuntime+59904>) at /usr/local/src/conda/python-3.11.0/Modules/main.c:379
#42 pymain_run_python (exitcode=0x7ffe1c4698a0) at /usr/local/src/conda/python-3.11.0/Modules/main.c:601
#43 Py_RunMain () at /usr/local/src/conda/python-3.11.0/Modules/main.c:680
#44 0x000056486a9c5e19 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.11.0/Modules/main.c:734
#45 0x00007fc91c829d90 in __libc_start_call_main (main=main@entry=0x56486a9c5d70 <main>, argc=argc@entry=3, argv=argv@entry=0x7ffe1c469af8) at ../sysdeps/nptl/libc_start_call_main.h:58
#46 0x00007fc91c829e40 in __libc_start_main_impl (main=0x56486a9c5d70 <main>, argc=3, argv=0x7ffe1c469af8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe1c469ae8) at ../csu/libc-start.c:392
#47 0x000056486a9c5cb1 in _start ()

@yifuwang
Contributor

yifuwang commented Dec 18, 2023

@briandw Looks like it got past process group initialization but got stuck in all_reduce. Curious: have you had success with NCCL on your dual-4090 setup before?

I don't have access to such a setup so I can only provide some random ideas:

  • Try setting NCCL_P2P_DISABLE=1 (IIRC P2P access is locked on 4090s, and I'm not sure NCCL handles that correctly for them); see the sketch after this list
  • Check nvidia-smi topo -m to see how the cards are connected
  • Poke around nccl-tests and make sure it works
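
A minimal sketch of testing the first suggestion from Python itself, assuming the same torchrun launch as the repro above: NCCL_P2P_DISABLE is read when the NCCL communicator is created, so setting it before the first collective makes NCCL fall back from peer-to-peer GPU copies to shared-memory/host transfers between the two cards.

import os

# Must be set before the NCCL communicator is created (i.e. before the first
# collective); disables CUDA peer-to-peer transport between the GPUs.
os.environ["NCCL_P2P_DISABLE"] = "1"

import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(f"cuda:{local_rank}")

dist.init_process_group(backend="nccl")
t = torch.full((8, 8), rank, device="cuda")
dist.all_reduce(t)
torch.cuda.synchronize()
print(f"[rank {rank}] all_reduce with NCCL_P2P_DISABLE=1 completed")

If this version completes while the original repro hangs, P2P between the two 4090s is the likely culprit.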

@briandw
Author

briandw commented Dec 18, 2023

Wow looks like NCCL_P2P_DISABLE=1 did the trick!
Result for the test code:

All_reduce completed
[rank 1, pid 151015] All_reduce completed
The tp_example.py also works now.

Thanks for your help @yifuwang

BTW I found good discussion on this issue: NVIDIA/nccl-tests#117
and
https://forums.developer.nvidia.com/t/standard-nvidia-cuda-tests-fail-with-dual-rtx-4090-linux-box/233202/34
