vLLM running on a Ray Cluster Hanging on Initializing #2826

Closed
Kaotic3 opened this issue Feb 9, 2024 · 12 comments · Fixed by #3037

Comments

@Kaotic3

Kaotic3 commented Feb 9, 2024

It isn't clear what is at fault here, whether it is vLLM or Ray.

There is a thread on the Ray forums that outlines the issue; it is 16 days old and has no replies:

https://discuss.ray.io/t/running-vllm-script-on-multi-node-cluster/13533

The log below is taken from that thread, but it is identical to what I see:

2024-01-24 13:57:17,308 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: HOST_IP_ADDRESS...
2024-01-24 13:57:17,317 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
INFO 01-24 13:57:39 llm_engine.py:70] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer='mistralai/Mistral-7B-Instruct-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)

But after that it hangs, and eventually quits.

I have exactly the same problem. The thread details the other points: "ray status" seems to show the nodes working and communicating, and it stays like this for an age before eventually crashing with some error messages. Everything in that thread is identical to what is happening for me.

Unfortunately, the Ray forums probably don't want to engage because it is vLLM, and I am concerned that vLLM won't want to engage because it is Ray.
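
For reference, a minimal sketch of the kind of script that hangs for me (an illustration, not my exact code; the model and tensor_parallel_size match the log above, and it assumes the Ray cluster is already up):

import ray
from vllm import LLM, SamplingParams

# Attach to the Ray cluster that was started separately with `ray start`.
ray.init(address="auto")

# This is where it hangs: nothing appears after
# "Initializing an LLM engine with config: ..."
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    tensor_parallel_size=2,
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs)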

@valentinp72

Hi,
I just started using vLLM two hours ago and had exactly the same issue.
I managed to make it work by disabling NCCL peer-to-peer transfers; for that, I exported NCCL_P2P_DISABLE=1.

Let me know if this solves your issue as well :)
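
In case it helps, this is roughly how I apply it from Python (just a sketch; exporting the variable in the shell before launching works the same, as long as it is set before NCCL initializes, and the model name here is only a placeholder):

import os

# Disable NCCL peer-to-peer transfers. Must be set before the vLLM engine
# (and therefore NCCL) is initialized.
os.environ["NCCL_P2P_DISABLE"] = "1"

from vllm import LLM

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel_size=2)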

@Kaotic3
Author

Kaotic3 commented Feb 9, 2024

Thanks for the idea. I did try it, but it didn't work for me.

Same hanging issue but I went off for dinner and came back to this message:

(RayWorkerVllm pid=7722, ip=.123) [E socket.cpp:922] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.1.1, 55251)

I think this is a new error message compared to the thread I linked, but googling didn't provide me with any great insight into fixing it.

Your search - "RayWorkerVllm" The client socket has timed out after 1800s while trying to connect - did not match any documents.

Which is always a little impressive, to be honest.
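
A hedged guess on my part: the 127.0.1.1 in that timeout message looks suspicious. On Debian/Ubuntu the machine's hostname is often mapped to 127.0.1.1 in /etc/hosts, so the remote workers may be told to connect back to a loopback address they can never reach. A quick check of what the hostname resolves to:

import socket

# If this prints 127.0.1.1 (or 127.0.0.1), other nodes cannot reach this
# address, which would explain an 1800s connect timeout like the one above.
print(socket.gethostbyname(socket.gethostname()))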

@davidsyoung

davidsyoung commented Feb 10, 2024

I am experiencing the same issue at the moment. For me, it happens with GPTQ quantisation with tp=4.

I have tried the following settings / combinations of settings without any luck (applied roughly as in the sketch below):

NCCL_P2P_DISABLE=1
disable_custom_all_reduce=True
enforce_eager=True
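
For clarity, this is roughly how they were applied (a sketch, not the exact script: disable_custom_all_reduce and enforce_eager as LLM constructor arguments, NCCL_P2P_DISABLE as an environment variable; the model path is a placeholder):

import os

os.environ["NCCL_P2P_DISABLE"] = "1"  # set before the engine/NCCL comes up

from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-GPTQ",  # placeholder GPTQ 70B checkpoint
    quantization="gptq",
    tensor_parallel_size=4,
    disable_custom_all_reduce=True,
    enforce_eager=True,
)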

Latest vLLM, compiled from source. It hangs at approx. 12995 MB of VRAM on each card across 4x RTX 3090s, with a Llama 2 70B model.

It finally crashed with this after approx. 1 hour:

[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49, OpType=ALLREDUCE, NumelIn=268435456, NumelOut=268435456, Timeout(ms)=1800000) ran for 3747540 milliseconds before timing out.
(RayWorkerVllm pid=1486) [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48, OpType=ALLREDUCE, NumelIn=268435456, NumelOut=268435456, Timeout(ms)=1800000) ran for 3747549 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49, OpType=ALLREDUCE, NumelIn=268435456, NumelOut=268435456, Timeout(ms)=1800000) ran for 3747540 milliseconds before timing out.
[2024-02-10 00:06:27,464 E 13 1719] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49, OpType=ALLREDUCE, NumelIn=268435456, NumelOut=268435456, Timeout(ms)=1800000) ran for 3747540 milliseconds before timing out.
[2024-02-10 00:06:27,501 E 13 1719] logging.cc:104: Stack trace: 
 /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0xfebb5a) [0x14af5a1ebb5a] ray::operator<<()
/opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0xfee298) [0x14af5a1ee298] ray::TerminateHandler()
/opt/conda/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xb135a) [0x14afcccb135a] __cxxabiv1::__terminate()
/opt/conda/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xb13c5) [0x14afcccb13c5]
/opt/conda/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xb134f) [0x14afcccb134f]
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xcc860b) [0x14af804c860b] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/opt/conda/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xdbbf4) [0x14afcccdbbf4] execute_native_thread_routine
/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x14b00f5e4609] start_thread
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x14b00f3af353] __clone

*** SIGABRT received at time=1707523587 on cpu 16 ***
PC: @     0x14b00f2d300b  (unknown)  raise
    @     0x14b00f5f0420       3792  (unknown)
    @     0x14afcccb135a  (unknown)  __cxxabiv1::__terminate()
    @     0x14afcccb1070  (unknown)  (unknown)
[2024-02-10 00:06:27,502 E 13 1719] logging.cc:361: *** SIGABRT received at time=1707523587 on cpu 16 ***
[2024-02-10 00:06:27,502 E 13 1719] logging.cc:361: PC: @     0x14b00f2d300b  (unknown)  raise
[2024-02-10 00:06:27,502 E 13 1719] logging.cc:361:     @     0x14b00f5f0420       3792  (unknown)
[2024-02-10 00:06:27,502 E 13 1719] logging.cc:361:     @     0x14afcccb135a  (unknown)  __cxxabiv1::__terminate()
[2024-02-10 00:06:27,502 E 13 1719] logging.cc:361:     @     0x14afcccb1070  (unknown)  (unknown)
Fatal Python error: Aborted


Extension modules: mkl._mklinit, mkl._py_mkl_service, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, gmpy2.gmpy2, regex._regex, scipy._lib._ccallback_c, yaml._yaml, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, _brotli, markupsafe._speedups, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, zstandard.backend_c, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.optimize._direct (total: 98)


@ffolkes1911

ffolkes1911 commented Feb 14, 2024

I tried the same options as above, and using Ray, but it did not help.

What did work was using a GPTQ model; it seems that only AWQ models hang (I only tried those two formats on multi-GPU).
EDIT: tested on TheBloke/Llama-2-13B-chat-AWQ and the GPTQ equivalent.
EDIT 2: it seems this issue is about a Ray cluster, whereas I was just adding --tensor-parallel-size to vLLM, so mine might be a different issue.

@BilalKHA95

BilalKHA95 commented Feb 22, 2024

I have the same issue. Did you find a solution? @ffolkes1911

@jony0113

I have a similar issue, but it eventually works after about 40 minutes. I have described the details in #2959.

@Kaotic3
Author

Kaotic3 commented Feb 27, 2024

Hey WoosukKwon.

I just cloned the repo, built it, started Ray on two machines, and then initialized vLLM with tensor_parallel_size=4.

The result is that vLLM is hanging and not moving past the "Initializing an LLM engine with config:...."

While I think that PR no doubt fixed some problem, it doesn't appear to have fixed this one: using a Ray cluster across two different machines still results in vLLM hanging and not starting.
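
For completeness, the setup I mean is roughly this (a sketch with placeholder addresses, assuming two GPUs per machine):

# On the head node:   ray start --head --port=6379
# On the second node: ray start --address='<HEAD_NODE_IP>:6379'
# Then, from the head node:

import ray
from vllm import LLM

ray.init(address="auto")  # attach to the existing two-node cluster

# Hangs here, never getting past
# "Initializing an LLM engine with config: ..."
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    tensor_parallel_size=4,  # split across the GPUs on both machines
)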

@viewv

viewv commented Mar 11, 2024

[Quoting Kaotic3's Feb 27 comment above about vLLM hanging at "Initializing an LLM engine with config: ..." when using a Ray cluster across two machines.]

I have this issue too, and I don't know how to fix it.
Fixed: #2826 (comment)

@thelongestusernameofall

thelongestusernameofall commented Mar 21, 2024

[Quoting valentinp72 above:] "Hi, just started using vLLM two hours ago, and I had exactly the same issue. I managed to make it work by disabling NCCL_P2P. For that, I exported NCCL_P2P_DISABLE=1. Let me know if this solves your issue as well :)"

export NCCL_P2P_DISABLE=1
worked for me. I'm using 8x A6000; loading the model with vLLM was hanging, followed by a core dump after a very long wait.

Thanks very much.

@viewv

viewv commented Mar 22, 2024

[Quoting the NCCL_P2P_DISABLE=1 suggestion from valentinp72 and thelongestusernameofall's confirmation above.]

Thank you very much. I have fixed the problem: I have multiple network cards, so I used NCCL_SOCKET_IFNAME=eth0 to select the correct one, and that fixed it.
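
In case it helps others with multiple NICs, this is roughly what I do now (a sketch; eth0 is the interface name on my machines, so check yours with ip addr, and the model name is only a placeholder):

import os

# Pin NCCL's socket traffic to the NIC that the Ray nodes actually use to
# reach each other. Replace "eth0" with the correct interface on your hosts.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

from vllm import LLM

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel_size=2)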

@huiyeruzhou

Try the "ray stop" command; it works for me.

@ChristineSeven

ChristineSeven commented May 30, 2024

[Quoting davidsyoung's Feb 10 comment above: GPTQ with tp=4 hanging, followed by the NCCL watchdog collective-operation timeout and the stack trace shown there.]

How did you fix this? I got the same issue with vLLM version 0.3.3 on 2x A100 cards. Thanks in advance.
