[Bug]: TRACKING ISSUE: AsyncEngineDeadError
#5901
[Bug] No available block found in 60 second in shm

Your current environment:

/path/vllm/vllm/usage/usage_lib.py:19: RuntimeWarning: Failed to read commit hash: No module named 'vllm.commit_id'
  from vllm.version import __version__ as VLLM_VERSION
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35
Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-425.3.1.el8.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0-7: NVIDIA A800-SXM4-80GB (8 GPUs)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Virtualization: VT-x
Caches: L1d 3 MiB (64 instances), L1i 2 MiB (64 instances), L2 80 MiB (64 instances), L3 96 MiB (2 instances)
NUMA node(s): 1; NUMA node0 CPU(s): 0-127
Vulnerabilities: Itlb multihit, L1tf, Mds, Meltdown, Retbleed, Srbds, Tsx async abort: Not affected; Mmio stale data: Vulnerable (Clear CPU buffers attempted, no microcode; SMT vulnerable); Spec store bypass: Mitigation (Speculative Store Bypass disabled via prctl); Spectre v1: Mitigation (usercopy/swapgs barriers and __user pointer sanitization); Spectre v2: Mitigation (Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence)
Versions of relevant libraries:
[pip3] numpy==1.26.4, nvidia-nccl-cu12==2.20.5, torch==2.3.0, torchvision==0.18.0, transformers==4.42.4, triton==2.3.0
[conda] numpy 1.26.4, nvidia-nccl-cu12 2.20.5, torch 2.3.0, torchvision 0.18.0, transformers 4.42.4, triton 2.3.0 (all pypi)
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.2
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology: all eight GPUs are connected to one another via NV8; each NIC (mlx5_0, mlx5_1, mlx5_4, mlx5_5) is attached via PXB to one GPU pair (GPU0-1, GPU2-3, GPU4-5, GPU6-7 respectively), SYS otherwise; CPU affinity 0-127, NUMA affinity N/A.

To support our model, I built it from source code.

🐛 Describe the bug

I printed out the rank information at the warning and found that one of the GPUs was stuck (a total of 4 GPUs for tensor parallelism).
This bug is highly reproducible, especially when running models above 70B (like Qwen2) and handling a large number of requests.
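For anyone attempting the same diagnosis, below is a minimal sketch of logging the distributed rank when such a wait times out, assuming torch.distributed has already been initialized by the engine. `poll_fn` is a hypothetical callable standing in for the block-availability check; this is not vLLM's actual code.

```python
import logging
import time

import torch.distributed as dist

logger = logging.getLogger(__name__)


def wait_for_block(poll_fn, timeout_s: float = 60.0, poll_interval_s: float = 0.5) -> bool:
    """Poll `poll_fn` until it returns True or `timeout_s` elapses.

    On timeout, log the tensor-parallel rank so a stuck worker can be identified.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if poll_fn():
            return True
        time.sleep(poll_interval_s)
    rank = dist.get_rank() if dist.is_initialized() else -1
    logger.warning("No available block found in %.0f seconds (rank=%d)", timeout_s, rank)
    return False
```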
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi

Launch command: CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m vllm.entrypoints.openai.api_server
Here is my vLLM serve command:

The input:

The error:
The crash seems to happen after a cold start or a long pause before the next generation. Below is an example of the engine working, then failing after 30 minutes of multi-user (parallel requests) usage followed by a short pause. vLLM version: 0.5.3.post1

INFO 08-05 16:38:20 async_llm_engine.py:140] Finished request d37eacc1e8fb411e99362648eab38666.
INFO:root:Generated 745 tokens in 45.47s
INFO:root:Finished request d37eacc1e8fb411e99362648eab38666
DEBUG:asyncio:Using selector: EpollSelector
INFO:root:Begin reading the output for request 84785d7df09b4128b11e2113fed6cde7
DEBUG:root:SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=False, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
INFO:root:Begin generation for request 84785d7df09b4128b11e2113fed6cde7
INFO:root:Begin iteration for request 84785d7df09b4128b11e2113fed6cde7
INFO 08-05 16:38:23 async_llm_engine.py:173] Added request 84785d7df09b4128b11e2113fed6cde7.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:25 metrics.py:396] Avg prompt throughput: 676.5 tokens/s, Avg generation throughput: 6.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.0%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:30 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.1%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:35 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.1%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:40 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.2%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:45 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.3%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:50 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.4%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:56 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.4%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:39:01 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.5%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:39:06 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 17.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.6%, CPU KV cache usage: 0.0%.
INFO 08-05 16:39:06 async_llm_engine.py:140] Finished request 84785d7df09b4128b11e2113fed6cde7.
INFO:root:Generated 711 tokens in 42.73s
INFO:root:Finished request 84785d7df09b4128b11e2113fed6cde7
DEBUG:asyncio:Using selector: EpollSelector
INFO:root:Begin reading the output for request 6476b49a79714e4d8bd6e5c65107b586
DEBUG:root:SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=False, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
INFO:root:Begin generation for request 6476b49a79714e4d8bd6e5c65107b586
INFO:root:Begin iteration for request 6476b49a79714e4d8bd6e5c65107b586
INFO 08-05 16:39:53 async_llm_engine.py:173] Added request 6476b49a79714e4d8bd6e5c65107b586.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:39:54 metrics.py:396] Avg prompt throughput: 1.2 tokens/s, Avg generation throughput: 0.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:39:59 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 17.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:40:04 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 17.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-05 16:40:05 async_llm_engine.py:140] Finished request 6476b49a79714e4d8bd6e5c65107b586.
INFO:root:Generated 177 tokens in 10.35s
INFO:root:Finished request 6476b49a79714e4d8bd6e5c65107b586
DEBUG:asyncio:Using selector: EpollSelector
INFO:root:Begin reading the output for request 512aeb24c57f4f1eba974ecaba0e9522
DEBUG:root:SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=False, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
INFO:root:Begin generation for request 512aeb24c57f4f1eba974ecaba0e9522
INFO:root:Begin iteration for request 512aeb24c57f4f1eba974ecaba0e9522
INFO 08-05 16:51:50 async_llm_engine.py:173] Added request 512aeb24c57f4f1eba974ecaba0e9522.
ERROR 08-05 16:51:50 async_llm_engine.py:658] Engine iteration timed out. This should never happen!
ERROR 08-05 16:51:50 async_llm_engine.py:56] Engine background task failed
ERROR 08-05 16:51:50 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
ERROR 08-05 16:51:50 async_llm_engine.py:56] await asyncio.sleep(0)
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/tasks.py", line 640, in sleep
ERROR 08-05 16:51:50 async_llm_engine.py:56] await __sleep0()
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/tasks.py", line 634, in __sleep0
ERROR 08-05 16:51:50 async_llm_engine.py:56] yield
ERROR 08-05 16:51:50 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 08-05 16:51:50 async_llm_engine.py:56]
ERROR 08-05 16:51:50 async_llm_engine.py:56] The above exception was the direct cause of the following exception:
ERROR 08-05 16:51:50 async_llm_engine.py:56]
ERROR 08-05 16:51:50 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 08-05 16:51:50 async_llm_engine.py:56] return_value = task.result()
ERROR 08-05 16:51:50 async_llm_engine.py:56] ^^^^^^^^^^^^^
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
ERROR 08-05 16:51:50 async_llm_engine.py:56] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
ERROR 08-05 16:51:50 async_llm_engine.py:56] raise TimeoutError from exc_val
ERROR 08-05 16:51:50 async_llm_engine.py:56] TimeoutError
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=>)() at /home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:36
handle: >)() at /home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
await asyncio.sleep(0)
File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/tasks.py", line 640, in sleep
await __sleep0()
File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/tasks.py", line 634, in __sleep0
yield
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
return_value = task.result()
^^^^^^^^^^^^^
File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
raise TimeoutError from exc_val
TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
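For readers tracing the stack above: the engine loop wraps each iteration in an asyncio timeout, and when an iteration hangs the resulting cancellation is surfaced by the task-completion callback as a dead-engine error. Below is a stripped-down sketch of that pattern; it is illustrative only and not vLLM's actual implementation (names and the timeout value are placeholders).

```python
import asyncio

ENGINE_ITERATION_TIMEOUT_S = 60  # illustrative value


class EngineDeadError(RuntimeError):
    """Raised when the background engine loop exits unexpectedly."""


async def engine_step() -> None:
    # Stand-in for one scheduling + model-execution step.
    await asyncio.sleep(0.01)


async def run_engine_loop() -> None:
    while True:
        # If one iteration hangs (e.g. a stuck collective on one GPU),
        # the timeout cancels it and the loop task fails.
        # asyncio.timeout requires Python 3.11+.
        async with asyncio.timeout(ENGINE_ITERATION_TIMEOUT_S):
            await engine_step()


def log_task_completion(task: asyncio.Task) -> None:
    try:
        task.result()
    except Exception as exc:
        # Any failure of the background task is converted into a
        # "dead engine" error so callers stop submitting requests.
        raise EngineDeadError("Engine loop finished unexpectedly") from exc
```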
This error occurs randomly while my code is running. I was running the Llama 3.1 model, and the responses to the previous few requests were normal. My environment is:
Hi, I've been having this issue when using Llama 3.1 70B in bf16 (no quantization) on an 8xL4 node. I am using guided decoding for all my requests. It seems to happen when the server is overloaded with many parallel requests: I was making 64 calls using the async openai package, and in the error output I noticed a CUDA OOM error alongside this error. Hope this helps debug what's going on.
As this was a controlled environment it worked, but in production I can't estimate how many parallel calls will be made, so this error might surface in real-world use. Is there a way to test for this and artificially limit the queue when using guided decoding so that the server doesn't go OOM?
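One client-side mitigation is to cap the number of in-flight requests before they ever reach the server. Below is a minimal sketch assuming the `openai` Python client; the base URL, model name, and concurrency limit are placeholders for your deployment, not values from this thread.

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
semaphore = asyncio.Semaphore(8)  # allow at most 8 requests in flight


async def guarded_call(prompt: str) -> str:
    # The semaphore blocks extra callers until a slot frees up,
    # so the server never sees more than 8 concurrent requests.
    async with semaphore:
        resp = await client.chat.completions.create(
            model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # placeholder
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512,
        )
        return resp.choices[0].message.content


async def main(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*(guarded_call(p) for p in prompts))

# Example: asyncio.run(main(["Hello"] * 64))
```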
I think it's because there are requests still in the queue, but since the process is stuck, no tokens have been generated, which eventually caused the server to crash.
Got the same issue across multiple versions of the official vLLM Docker images. I first encountered it with v0.4.3, then 0.5.0, 0.5.3.post1, and now 0.5.4.

Envs and launch commands

My environment info:
My launch command is: (part of a k8s yaml file)
This error seems to ALWAYS occur under a continuous inference workload. Before v0.5.4, the service could recognize its own error state and restart automatically, but with v0.5.4 the auto-restart no longer works.

Error log
Then the service does not restart; it just gets stuck here.

Question
Thanks!
I attempted to identify the issue by capturing system errors and found that this problem exists across different NVIDIA driver versions, though the frequency has significantly decreased. Under high concurrency it is barely usable.

Environment: 8x A800, NVIDIA driver 535.129.01 -> 535.169.07

The following errors can be observed through dmesg:
Environment: 8x A800, NVIDIA driver 535.129.01 -> 535.169.07 or 535.183.06
Environment: 8x A100-80GB per job. Batch sizes of around 512, with max_num_seqs = 256, max output sequence length = 1024, and average input sequence length ≈ 2000. I ran 40 jobs independently, of which 5 ran to completion (no errors in the engine) and 35 ran into this error after varying amounts of completions (500-8000 samples generated before crashing).
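For context, here is a rough sketch of an offline batch job with the settings described above; the model name and prompts are placeholders, not the actual script that hit the error.

```python
from vllm import LLM, SamplingParams

# Placeholder model; the point is the parallelism and scheduling settings.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=8,   # 8x A100-80GB per job
    max_num_seqs=256,         # max sequences scheduled per iteration
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=1024)

# ~512 prompts per batch, each averaging ~2000 input tokens in the report above.
prompts = ["<long prompt placeholder>"] * 512
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text[:80])
```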
I experienced the issue too. My setup:
The issue seems to happen when I send around 30 parallel requests every few seconds. The prompt is about 1,400 tokens long.
vLLM 0.5.3.post1+cu118; the error info is:
A new case:
Any update?
To everyone commenting on this issue: AsyncEngineDeadError occurs when we get into a bad state. It can be caused by anything, but especially if we hit an illegal memory access or something of the sort. We are trying to remove as many bugs as possible. So please, as per the original comment: “When reporting an issue, please include a sample request that causes the issue so we can reproduce on our side.”
Your current environment
🐛 Describe the bug
Recently, we have seen reports of AsyncEngineDeadError. If you see something like the following, please report here:
Key areas we are looking into include:
When reporting an issue, please include a sample request that causes the issue so we can reproduce on our side.
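For reference, a minimal template for such a sample request against an OpenAI-compatible vLLM server is shown below; the base URL and model name are placeholders, and the prompt should be replaced with the exact input that triggers the failure.

```python
from openai import OpenAI

# Placeholder endpoint and model; substitute your own deployment details.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "<the exact prompt that triggers the hang>"}],
    max_tokens=4096,
    temperature=0.0,
)
print(response.choices[0].message.content)
```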