
[Bug]: NCCL timed out during inference #4653

Open

enkiid opened this issue May 7, 2024 · 7 comments


enkiid commented May 7, 2024

Your current environment

Using:

  • vllm 0.4.1
  • nccl 2.18.1
  • pytorch 2.2.1

🐛 Describe the bug

During inference I sometimes get this error:

(RayWorkerWrapper pid=2376582) [rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50404, OpType=GATHER, NumelIn=8000, NumelOut=0, Timeout(ms)=600000) ran for 600327 milliseconds before timing out.

Haven't seen this in earlier versions of vLLM; any thoughts?
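
A minimal sketch of the kind of setup this happens in (the model name and tensor_parallel_size below are placeholders, not my exact values):

```python
# Hypothetical sketch -- model and parallel size are placeholders.
from vllm import LLM, SamplingParams

# With tensor_parallel_size > 1, vLLM 0.4.1 launches Ray workers (the RayWorkerWrapper
# in the log above); the timed-out GATHER is one of the cross-rank collectives issued
# while generating.
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=2)

outputs = llm.generate(
    ["Explain the NCCL watchdog in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```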

enkiid added the bug label May 7, 2024

Ch3ngY1 commented May 8, 2024

Same issue here; it occurs randomly on my dataset.
vllm 0.4.1
torch 2.2.0+cu118

DefTruth (Contributor) commented May 8, 2024

I have encountered the same issue; using --disable-custom-all-reduce and --enforce-eager worked for me.
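
If you're using the offline Python API rather than the server CLI, the rough equivalent of those two flags looks like this (model and TP size below are just placeholders):

```python
from vllm import LLM

# In vLLM 0.4.x the CLI flags map to the same-named engine arguments.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder model
    tensor_parallel_size=2,             # placeholder TP size
    disable_custom_all_reduce=True,     # --disable-custom-all-reduce: use NCCL all-reduce instead of the custom kernel
    enforce_eager=True,                 # --enforce-eager: skip CUDA graph capture
)
```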

changyuanzhangchina commented

Please refer to #4430

  1. --disable-custom-all-reduce (i.e. disable_custom_all_reduce = True)
  2. --enforce-eager (i.e. enforce_eager = True; may be unnecessary)
  3. update to include [Core] Ignore infeasible swap requests. #4557

These three changes solved the watchdog problem for me. Before this, the NCCL watchdog error happened several times per day; now it works well.

yunfeng-scale (Contributor) commented

We're seeing this on 0.4.2 as well, with Mixtral 8x22B. --disable-custom-all-reduce resolves the problem.

yunfeng-scale (Contributor) commented

Can we disable custom all-reduce by default again?


syr-cn commented Jun 15, 2024

> I have encountered the same issue; using --disable-custom-all-reduce and --enforce-eager worked for me.

Works for me! Thanks a lot!


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Oct 27, 2024