
[Bug]: NCCL timed out during inference #4653

Open

enkiid opened this issue May 7, 2024 · 7 comments


enkiid commented May 7, 2024

Your current environment

Using:

  • vllm 0.4.1
  • nccl 2.18.1
  • pytorch 2.2.1

🐛 Describe the bug

During inference I sometimes get this error:

(RayWorkerWrapper pid=2376582) [rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50404, OpType=GATHER, NumelIn=8000, NumelOut=0, Timeout(ms)=600000) ran for 600327 milliseconds before timing out.

Haven't seen this in earlier versions of vLLM; any thoughts?
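
A minimal sketch of the kind of setup this happens in (the model name and tensor_parallel_size below are placeholders, not my exact values):

```python
# Hypothetical sketch -- model and parallel size are placeholders.
from vllm import LLM, SamplingParams

# With tensor_parallel_size > 1, vLLM 0.4.1 launches Ray workers (the RayWorkerWrapper
# in the log above); the timed-out GATHER is one of the cross-rank collectives issued
# while generating.
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=2)

outputs = llm.generate(
    ["Explain the NCCL watchdog in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```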

enkiid added the bug label May 7, 2024

Ch3ngY1 commented May 8, 2024

Same issue here; it occurs randomly on my dataset.
vllm 0.4.1
torch 2.2.0+cu118

DefTruth (Contributor) commented May 8, 2024

I have encountered the same issue; using --disable-custom-all-reduce and --enforce-eager worked for me.
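
If you're using the offline Python API rather than the server CLI, the rough equivalent of those two flags looks like this (model and TP size below are just placeholders):

```python
from vllm import LLM

# In vLLM 0.4.x the CLI flags map to the same-named engine arguments.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder model
    tensor_parallel_size=2,             # placeholder TP size
    disable_custom_all_reduce=True,     # --disable-custom-all-reduce: use NCCL all-reduce instead of the custom kernel
    enforce_eager=True,                 # --enforce-eager: skip CUDA graph capture
)
```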

changyuanzhangchina commented

Please refer to #4430

  1. --disable-custom-all-reduce (i.e. disable_custom_all_reduce = True)
  2. --enforce-eager (i.e. enforce_eager = True; may be unnecessary)
  3. update to include [Core] Ignore infeasible swap requests. #4557

These three changes solved the watchdog problem for me. Before this, the NCCL watchdog error happened several times per day; now it works well.

yunfeng-scale (Contributor) commented

We're seeing this on 0.4.2 as well, with Mixtral 8x22B. --disable-custom-all-reduce resolves the problem.

yunfeng-scale (Contributor) commented

Can we disable custom all-reduce by default again?


syr-cn commented Jun 15, 2024

> I have encountered the same issue; using --disable-custom-all-reduce and --enforce-eager worked for me.

Works for me! Thanks a lot!


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Oct 27, 2024