[Bug]: TRACKING ISSUE: AsyncEngineDeadError #5901

Open · 2 of 18 tasks
robertgshaw2-neuralmagic opened this issue Jun 27, 2024 · 18 comments
Labels: bug (Something isn't working)

Comments

robertgshaw2-neuralmagic (Collaborator) commented Jun 27, 2024

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

Recently, we have seen several reports of AsyncEngineDeadError.

If you see something like the following, please report here:

2024-06-25 12:27:29.905   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 84, in health
2024-06-25 12:27:29.905     await openai_serving_chat.engine.check_health()
2024-06-25 12:27:29.905   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 839, in check_health
2024-06-25 12:27:29.905     raise AsyncEngineDeadError("Background loop is stopped.")
2024-06-25 12:27:29.905 vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.

Key areas we are looking into include:

  • logprob usage
  • guided regex usage

When reporting an issue, please include a sample request that causes the issue so we can reproduce on our side.
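For example, a short self-contained script along these lines is ideal. This is only a sketch: it assumes an OpenAI-compatible vLLM server at http://localhost:8000/v1, a build that accepts the `guided_regex` extra_body field, and a placeholder served-model name.

```python
# Hypothetical reproduction snippet: exercises both the logprob and the
# guided-decoding paths that are under investigation.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="my-served-model",                  # placeholder: use your --served-model-name
    messages=[{"role": "user", "content": "Reply with a 4-digit PIN."}],
    logprobs=True,                            # logprob usage
    top_logprobs=5,
    extra_body={"guided_regex": r"\d{4}"},    # guided regex usage (vLLM extension)
)
print(resp.choices[0].message.content)
```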

G-z-w commented Jul 23, 2024

[Bug] No available block found in 60 second in shm

Your current environment

/path/vllm/vllm/usage/usage_lib.py:19: RuntimeWarning: Failed to read commit hash: No module named 'vllm.commit_id'
  from vllm.version import __version__ as VLLM_VERSION
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35
Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-425.3.1.el8.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB
GPU 2: NVIDIA A800-SXM4-80GB
GPU 3: NVIDIA A800-SXM4-80GB
GPU 4: NVIDIA A800-SXM4-80GB
GPU 5: NVIDIA A800-SXM4-80GB
GPU 6: NVIDIA A800-SXM4-80GB
GPU 7: NVIDIA A800-SXM4-80GB
Nvidia driver version: 525.60.13
cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5

HIP runtime version: N/A

MIOpen runtime version: N/A

Is XNNPACK available: True
CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       46 bits physical, 57 bits virtual
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Vendor ID:           GenuineIntel
Model name:          Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
CPU family:          6
Model:               106
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           2
Stepping:            6
CPU max MHz:         3400.0000
CPU min MHz:         800.0000
BogoMIPS:            5200.00

Flags:
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities

Virtualization:                  VT-x
L1d cache:                       3 MiB (64 instances)
L1i cache:                       2 MiB (64 instances)
L2 cache:                        80 MiB (64 instances)
L3 cache:                        96 MiB (2 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.4
[pip3] triton==2.3.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] torch                     2.3.0                    pypi_0    pypi
[conda] torchvision               0.18.0                   pypi_0    pypi
[conda] transformers              4.42.4                   pypi_0    pypi
[conda] triton                    2.3.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.2
vLLM Build Flags:

CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

GPU Topology:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_4  mlx5_5  CPU Affinity    NUMA Affinity
GPU0     X      NV8     NV8     NV8     NV8     NV8     NV8     NV8     PXB     SYS     SYS     SYS     0-127           N/A
GPU1    NV8      X      NV8     NV8     NV8     NV8     NV8     NV8     PXB     SYS     SYS     SYS     0-127           N/A
GPU2    NV8     NV8      X      NV8     NV8     NV8     NV8     NV8     SYS     PXB     SYS     SYS     0-127           N/A
GPU3    NV8     NV8     NV8      X      NV8     NV8     NV8     NV8     SYS     PXB     SYS     SYS     0-127           N/A
GPU4    NV8     NV8     NV8     NV8      X      NV8     NV8     NV8     SYS     SYS     PXB     SYS     0-127           N/A
GPU5    NV8     NV8     NV8     NV8     NV8      X      NV8     NV8     SYS     SYS     PXB     SYS     0-127           N/A
GPU6    NV8     NV8     NV8     NV8     NV8     NV8      X      NV8     SYS     SYS     SYS     PXB     0-127           N/A
GPU7    NV8     NV8     NV8     NV8     NV8     NV8     NV8      X      SYS     SYS     SYS     PXB     0-127           N/A
mlx5_0  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
mlx5_1  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS
mlx5_4  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS      X      SYS
mlx5_5  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS      X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

To support our model, I built vLLM from source.

🐛 Describe the bug

#6614.

I printed out the rank information at the warning and found that one of the GPUs was stuck (out of the 4 GPUs used for tensor parallelism). This bug is highly reproducible, especially when running models larger than 70B (like Qwen2) under a large number of requests. (A generic sketch for locating the stuck rank follows the screenshots below.)

[screenshots]
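Not a vLLM feature, just a generic debugging sketch for pinning down which tensor-parallel rank is stuck: make every worker process dump all of its thread stacks when it receives SIGUSR1, then run `kill -USR1 <worker pid>` once the warning fires. One (assumed) way to load this into the workers is a `sitecustomize.py` on PYTHONPATH.

```python
# sitecustomize.py (hypothetical placement; any module imported by every
# worker works). Dumps the stacks of all threads to stderr on SIGUSR1,
# which shows where a stuck rank is blocked. Unix only.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```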

@LIUKAI0815

Traceback (most recent call last):
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 604, in run_engine_loop
done, _ = await asyncio.wait(
File "/home/ubuntu/miniconda3/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/home/ubuntu/miniconda3/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError

File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in call
return await self.app(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in call
await self.app(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/routing.py", line 756, in call
await self.middleware_stack(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 124, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 305, in create_chat_completion
return await self.chat_completion_full_generator(
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 505, in chat_completion_full_generator
async for res in result_generator:
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 765, in generate
async for output in self._process_request(
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 881, in _process_request
raise e
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 877, in _process_request
async for request_output in stream:
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 91, in anext
raise result
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 44, in _log_task_completion
return_value = task.result()
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 603, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in aexit
self._do_exit(exc_type)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m vllm.entrypoints.openai.api_server \
    --model /data/qwen/ \
    --port 3004 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --served-model-name Qwen2-72B-Instruct-awq

@etiennebonnafoux

Here is mine

vLLM was served with the following command:

export CUDA_VISIBLE_DEVICES=0

python -m vllm.entrypoints.openai.api_server \
        --port 31002 \
        --model <some_path_on_my_computer>/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/ \
        --served-model-name llama3 \
        --gpu-memory-utilization 0.4 > worker.out &

The Input

[screenshot of the input]

The error

    if sampling_params.seed is not None:
AttributeError: 'NoneType' object has no attribute 'seed'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/middleware/cors.py", line 93, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/middleware/cors.py", line 148, in simple_response
    await self.app(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 158, in create_embedding
    generator = await openai_serving_embedding.create_embedding(
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_embedding.py", line 146, in create_embedding
    async for i, res in result_generator:
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/utils.py", line 329, in consumer
    raise e
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/utils.py", line 320, in consumer
    raise item
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/utils.py", line 304, in producer
    async for item in iterator:
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 850, in encode
    async for output in self._process_request(
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 873, in _process_request
    stream = await self.add_request(
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 676, in add_request
    self.start_background_loop()
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 516, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

justinthelaw commented Aug 5, 2024

The crash seems to happen after a cold start or after a long pause before the next generation. Below is an example of the engine working, then failing after 30 minutes of multi-user (parallel request) usage followed by a short pause; a simple external health-check sketch follows the log.

vLLM version: 0.5.3.post1
OS/Machine: System76 Adder WS, Ubuntu 22.04 Desktop, eGPU NVIDIA GeForce RTX 4090 (24 GB VRAM)
Isolated virtual env, Python version : 3.11.6
Base LLM: Mistral-7b Instruct v0.3

INFO 08-05 16:38:20 async_llm_engine.py:140] Finished request d37eacc1e8fb411e99362648eab38666.
INFO:root:Generated 745 tokens in 45.47s
INFO:root:Finished request d37eacc1e8fb411e99362648eab38666
DEBUG:asyncio:Using selector: EpollSelector
INFO:root:Begin reading the output for request 84785d7df09b4128b11e2113fed6cde7
DEBUG:root:SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=False, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
INFO:root:Begin generation for request 84785d7df09b4128b11e2113fed6cde7
INFO:root:Begin iteration for request 84785d7df09b4128b11e2113fed6cde7
INFO 08-05 16:38:23 async_llm_engine.py:173] Added request 84785d7df09b4128b11e2113fed6cde7.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:25 metrics.py:396] Avg prompt throughput: 676.5 tokens/s, Avg generation throughput: 6.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.0%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:30 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.1%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:35 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.1%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:40 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.2%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:45 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.3%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:50 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.4%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:56 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.4%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:39:01 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.5%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:39:06 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 17.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.6%, CPU KV cache usage: 0.0%.
INFO 08-05 16:39:06 async_llm_engine.py:140] Finished request 84785d7df09b4128b11e2113fed6cde7.
INFO:root:Generated 711 tokens in 42.73s
INFO:root:Finished request 84785d7df09b4128b11e2113fed6cde7
DEBUG:asyncio:Using selector: EpollSelector
INFO:root:Begin reading the output for request 6476b49a79714e4d8bd6e5c65107b586
DEBUG:root:SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=False, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
INFO:root:Begin generation for request 6476b49a79714e4d8bd6e5c65107b586
INFO:root:Begin iteration for request 6476b49a79714e4d8bd6e5c65107b586
INFO 08-05 16:39:53 async_llm_engine.py:173] Added request 6476b49a79714e4d8bd6e5c65107b586.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:39:54 metrics.py:396] Avg prompt throughput: 1.2 tokens/s, Avg generation throughput: 0.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:39:59 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 17.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:40:04 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 17.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-05 16:40:05 async_llm_engine.py:140] Finished request 6476b49a79714e4d8bd6e5c65107b586.
INFO:root:Generated 177 tokens in 10.35s
INFO:root:Finished request 6476b49a79714e4d8bd6e5c65107b586
DEBUG:asyncio:Using selector: EpollSelector
INFO:root:Begin reading the output for request 512aeb24c57f4f1eba974ecaba0e9522
DEBUG:root:SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=False, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
INFO:root:Begin generation for request 512aeb24c57f4f1eba974ecaba0e9522
INFO:root:Begin iteration for request 512aeb24c57f4f1eba974ecaba0e9522
INFO 08-05 16:51:50 async_llm_engine.py:173] Added request 512aeb24c57f4f1eba974ecaba0e9522.
ERROR 08-05 16:51:50 async_llm_engine.py:658] Engine iteration timed out. This should never happen!
ERROR 08-05 16:51:50 async_llm_engine.py:56] Engine background task failed
ERROR 08-05 16:51:50 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
ERROR 08-05 16:51:50 async_llm_engine.py:56] await asyncio.sleep(0)
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/tasks.py", line 640, in sleep
ERROR 08-05 16:51:50 async_llm_engine.py:56] await __sleep0()
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/tasks.py", line 634, in __sleep0
ERROR 08-05 16:51:50 async_llm_engine.py:56] yield
ERROR 08-05 16:51:50 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 08-05 16:51:50 async_llm_engine.py:56]
ERROR 08-05 16:51:50 async_llm_engine.py:56] The above exception was the direct cause of the following exception:
ERROR 08-05 16:51:50 async_llm_engine.py:56]
ERROR 08-05 16:51:50 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 08-05 16:51:50 async_llm_engine.py:56] return_value = task.result()
ERROR 08-05 16:51:50 async_llm_engine.py:56] ^^^^^^^^^^^^^
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
ERROR 08-05 16:51:50 async_llm_engine.py:56] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
ERROR 08-05 16:51:50 async_llm_engine.py:56] raise TimeoutError from exc_val
ERROR 08-05 16:51:50 async_llm_engine.py:56] TimeoutError
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=<bound method ...>)(<Task finished ...>) at /home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=<bound method ...>)(<Task finished ...>) at /home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
  File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
    await asyncio.sleep(0)
  File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/tasks.py", line 640, in sleep
    await __sleep0()
  File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/tasks.py", line 634, in __sleep0
    yield
asyncio.exceptions.CancelledError
 
The above exception was the direct cause of the following exception:
 
Traceback (most recent call last):
  File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
 
The above exception was the direct cause of the following exception:
 
Traceback (most recent call last):
  File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
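Since a dead engine surfaces as the `/health` endpoint raising AsyncEngineDeadError (see the traceback at the top of this issue), an external watchdog can at least detect the state quickly and trigger a restart. A minimal sketch, assuming the server listens on localhost:8000 and that the actual restart is handled by whatever supervisor you use (systemd, Kubernetes, etc.):

```python
# Poll vLLM's /health endpoint; treat any non-200 answer (or no answer)
# as a dead engine. Assumes the API server runs on localhost:8000.
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8000/health"

def engine_alive(timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        # A dead engine returns 500 (AsyncEngineDeadError) or stops answering.
        return False

while True:
    if not engine_alive():
        print("vLLM engine unhealthy; restart the server/container here")
        break
    time.sleep(30)
```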

@yitianlian

This error occurs at random points while my code is running. I was running the Llama 3.1 model, and the previous few requests returned normal responses. My environment is:

Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.19.91-014-kangaroo.2.10.13.5c249cdaf.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 470.199.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          96
On-line CPU(s) list:             0-95
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Processor @ 2.90GHz
CPU family:                      6
Model:                           106
Thread(s) per core:              1
Core(s) per socket:              96
Socket(s):                       1
Stepping:                        6
BogoMIPS:                        5800.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd avx512vbmi umip pku avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear arch_capabilities
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       4.5 MiB (96 instances)
L1i cache:                       3 MiB (96 instances)
L2 cache:                        120 MiB (96 instances)
L3 cache:                        48 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-95
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.3
[pip3] triton==2.3.1
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] torch                     2.3.1                    pypi_0    pypi
[conda] torchvision               0.18.1                   pypi_0    pypi
[conda] transformers              4.43.3                   pypi_0    pypi
[conda] triton                    2.3.1                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95           N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95           N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95           N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95           N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95           N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    PHB     PHB     PHB     PHB     0-95           N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    PHB     PHB     PHB     PHB     0-95           N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      PHB     PHB     PHB     PHB     0-95           N/A
mlx5_0  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X      PHB     PHB     PHB
mlx5_1  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X      PHB     PHB
mlx5_2  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X      PHB
mlx5_3  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks


@nivibilla

Hi,

I've been having this issue when using Llama 3.1 70B bf16 (no quantization) on an 8xL4 node. I am using guided decoding for all my requests.

What I noticed is that it seems to happen when the server is overloaded with lots of parallel requests. I was making 64 calls using the async openai package, and in the error output I also saw a CUDA OOM error along with this one.
I reduced my parallel call size to 8 and it seems to be working fine now. I also reduced my max-model-len to 4096.
The same code works fine with the 8B model at batch size 64, so for me it seems to be a GPU memory issue.

Hope this helps debug what's going on.

@nivibilla

As this was a controlled environment it worked fine, but in production I can't estimate how many parallel calls will be made, so this error might resurface in real-world use. Is there a way to test and artificially limit the queue when using guided decoding so that the server doesn't go OOM? (A client-side sketch of one option is below.)
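Not a server-side answer, but one workaround is to cap concurrency on the client so only a bounded number of guided-decoding requests is ever in flight. A minimal sketch, assuming the AsyncOpenAI client, a vLLM OpenAI-compatible endpoint at localhost:8000, and placeholder model name and regex; 8 matches the level that was stable in the testing above:

```python
# Client-side concurrency cap: excess requests queue in the client instead
# of piling up inside the engine.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MAX_IN_FLIGHT = 8
sem = asyncio.Semaphore(MAX_IN_FLIGHT)

async def guarded_call(prompt: str) -> str:
    async with sem:
        resp = await client.chat.completions.create(
            model="llama-3.1-70b",                     # placeholder served-model name
            messages=[{"role": "user", "content": prompt}],
            extra_body={"guided_regex": r"(yes|no)"},  # guided decoding (vLLM extension)
        )
        return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Question {i}: answer yes or no." for i in range(64)]
    results = await asyncio.gather(*(guarded_call(p) for p in prompts))
    print(results[:3])

asyncio.run(main())
```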

@doYrobot

echo "Activating conda environment..."
source activate vllm_infer || { echo "Failed to activate conda environment"; exit 1; }

# Define variables
HOST="0.0.0.0"
PORT=11434
MODEL_PATH="xx/OpenGVLab/InternVL2-40B"
LOG_FILE="xx/internvl40b_autotriage-agent-server-backup.log"
TENSOR_PARALLEL_SIZE=4
MAX_NUM_SEQS=16
SERVE_MODEL_NAME="InternVL2-40B"
DISTRIBUTED_EXECUTOR_BACKEND="mp"

# Launch command: redirect output to the specified log file and run in the background
echo "Starting API server..."
nohup python -m vllm.entrypoints.openai.api_server \
    --host $HOST \
    --port $PORT \
    --model $MODEL_PATH \
    --tensor_parallel_size $TENSOR_PARALLEL_SIZE \
    --trust_remote_code \
    --max-num-seqs $MAX_NUM_SEQS \
    --distributed-executor-backend $DISTRIBUTED_EXECUTOR_BACKEND \
    --served-model-name $SERVE_MODEL_NAME \
    --gpu-memory-utilization 0.9 \
    --swap-space 20 \
    --block-size 16 \
    --use-v2-block-manager \
    --preemption-mode "recompute" \
    > $LOG_FILE 2>&1 &

echo "API server started. Logs are being written to $LOG_FILE"

[screenshot]

I think it's because there are requests still in the queue, but since the process is stuck, no tokens have been generated, which eventually caused the server to crash.

After the server died:
[screenshot]

changshivek commented Aug 15, 2024

Got the same issue across multiple versions of official vllm docker images. I first encountered this issue with version v0.4.3, then 0.5.0, 0.5.3.post1, now 0.5.4.

envs and launch commands

My environment info:

PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr  6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.45.1.el7.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB
GPU 2: NVIDIA A800-SXM4-80GB
GPU 3: NVIDIA A800-SXM4-80GB
GPU 4: NVIDIA A800-SXM4-80GB
GPU 5: NVIDIA A800-SXM4-80GB
GPU 6: NVIDIA A800-SXM4-80GB
GPU 7: NVIDIA A800-SXM4-80GB

Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 57 bits virtual
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  1
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Stepping:            6
Frequency boost:     enabled
CPU MHz:             3400.000
CPU max MHz:         3400.0000
CPU min MHz:         800.0000
BogoMIPS:            5200.00
Virtualization:      VT-x
L1d cache:           3 MiB
L1i cache:           2 MiB
L2 cache:            80 MiB
L3 cache:            96 MiB
NUMA node0 CPU(s):   0-31
NUMA node1 CPU(s):   32-63
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 invpcid_single intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear pconfig spec_ctrl intel_stibp flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] flashinfer==0.1.2+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.1.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.43.4
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4@
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    CPU Affinity NUMA Affinity    GPU NUMA ID
GPU0     X      NV8     NV8     NV8     NV8     NV8     NV8     NV8     PXB     NODE    NODE    SYS     SYS     NODE    0-31    0       N/A
GPU1    NV8      X      NV8     NV8     NV8     NV8     NV8     NV8     PXB     NODE    NODE    SYS     SYS     NODE    0-31    0       N/A
GPU2    NV8     NV8      X      NV8     NV8     NV8     NV8     NV8     NODE    PXB     NODE    SYS     SYS     PXB     0-31    0       N/A
GPU3    NV8     NV8     NV8      X      NV8     NV8     NV8     NV8     NODE    PXB     NODE    SYS     SYS     PXB     0-31    0       N/A
GPU4    NV8     NV8     NV8     NV8      X      NV8     NV8     NV8     SYS     SYS     SYS     PXB     NODE    SYS     32-63   1       N/A
GPU5    NV8     NV8     NV8     NV8     NV8      X      NV8     NV8     SYS     SYS     SYS     PXB     NODE    SYS     32-63   1       N/A
GPU6    NV8     NV8     NV8     NV8     NV8     NV8      X      NV8     SYS     SYS     SYS     NODE    PXB     SYS     32-63   1       N/A
GPU7    NV8     NV8     NV8     NV8     NV8     NV8     NV8      X      SYS     SYS     SYS     NODE    PXB     SYS     32-63   1       N/A
NIC0    PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    SYS     SYS     NODE
NIC1    NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE     X      NODE    SYS     SYS     PIX
NIC2    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      SYS     SYS     NODE
NIC3    SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE    SYS     SYS     SYS      X      NODE    SYS
NIC4    SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB     SYS     SYS     SYS     NODE     X      SYS
NIC5    NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE    PIX     NODE    SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_4
  NIC3: mlx5_5
  NIC4: mlx5_6
  NIC5: mlx5_bond_0

My launch command is: (part of a k8s yaml file)

        command: ["/bin/bash", "-c"]
        args: [
        # "sudo sed -i '175,+2s/\"dns.google\"/\"8.8.8.8\"/g' /workspace/vllm/utils.py && \
        "nvidia-smi;python3 -m vllm.entrypoints.openai.api_server \
        --host 0.0.0.0 \
        --model /fl/nlp/common/plms/qwen2/Qwen2-72B-Instruct \
        --trust-remote-code \
        --enforce-eager \
        --max-model-len 32768 \
        --gpu-memory-utilization 0.9 \
        --served-model-name qwen2-72bc \
        --tensor-parallel-size 8"
         ]

This error seems to ALWAYS occur under a continuous inference workload. Before v0.5.4 the service could detect its own error state and restart automatically, but with v0.5.4 the auto-restart no longer works.

error log

  • First, this error log appears:
INFO 08-14 14:25:42 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
INFO 08-14 14:25:52 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
ERROR 08-14 14:25:53 async_llm_engine.py:663] Engine iteration timed out. This should never happen!
ERROR 08-14 14:25:53 async_llm_engine.py:57] Engine background task failed
ERROR 08-14 14:25:53 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
ERROR 08-14 14:25:53 async_llm_engine.py:57]     done, _ = await asyncio.wait(
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 08-14 14:25:53 async_llm_engine.py:57]     return await _wait(fs, timeout, return_when, loop)
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 08-14 14:25:53 async_llm_engine.py:57]     await waiter
ERROR 08-14 14:25:53 async_llm_engine.py:57] asyncio.exceptions.CancelledError
ERROR 08-14 14:25:53 async_llm_engine.py:57]
ERROR 08-14 14:25:53 async_llm_engine.py:57] During handling of the above exception, another exception occurred:
ERROR 08-14 14:25:53 async_llm_engine.py:57]
ERROR 08-14 14:25:53 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
ERROR 08-14 14:25:53 async_llm_engine.py:57]     return_value = task.result()
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
ERROR 08-14 14:25:53 async_llm_engine.py:57]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 08-14 14:25:53 async_llm_engine.py:57]     self._do_exit(exc_type)
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 08-14 14:25:53 async_llm_engine.py:57]     raise asyncio.TimeoutError
ERROR 08-14 14:25:53 async_llm_engine.py:57] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=<bound method...7f0a5d6569b0>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37
handle: <Handle _log_task_completion(error_callback=<bound method...7f0a5d6569b0>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
INFO 08-14 14:25:53 async_llm_engine.py:181] Aborted request chat-afa7a52c064a466c952a2eaf29c376a9.
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 59, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 08-14 14:25:53 async_llm_engine.py:181] Aborted request chat-fe65c4670df04192becd6af726e294ca.
INFO:     10.233.99.0:48827 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 189, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 185, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 436, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
    raise request_output
asyncio.exceptions.TimeoutError
INFO:     10.233.99.0:2747 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 189, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 185, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 436, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
    raise request_output
asyncio.exceptions.TimeoutError
(VllmWorkerProcess pid=143) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=146) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=149) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=145) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=147) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=144) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=148) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
  • Then, when a new request was posted, this error log appeared:
INFO:     10.233.99.0:24460 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 189, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 185, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 436, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
    raise request_output
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
  • After a long wait with no new requests, this happens:
[rank0]:[F814 14:47:39.862277231 ProcessGroupNCCL.cpp:1224] [PG 3 Rank 0] [PG 3 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 10
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Then the service does not restart; it just stays stuck here.

question

  • Since this issue was first raised in June, with similar issues reported even earlier, is there any recent update on this error? Is it still being tracked?
  • What can I do to avoid this error? Has anyone found any effective practices?

Thanks!
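One stopgap that does not fix the root cause but keeps a deployment serving: run the API server under an external watchdog that polls the OpenAI server's /health endpoint and restarts the process once the engine loop has died. The sketch below is only an illustration of that idea; BASE_URL, LAUNCH_CMD, the poll interval, and the startup grace period are placeholder assumptions, not values from this report.

# Hypothetical watchdog sketch: restart the vLLM OpenAI server when /health
# stops returning 200 (the handler raises AsyncEngineDeadError once the
# background loop is stopped, so the endpoint stops reporting healthy).
import subprocess
import time

import requests

BASE_URL = "http://127.0.0.1:8000"          # assumed host/port
STARTUP_GRACE_S = 300                        # assumed model-loading grace period
LAUNCH_CMD = [
    "python3", "-m", "vllm.entrypoints.openai.api_server",
    "--host", "0.0.0.0", "--model", "YOUR_MODEL",   # placeholders
    "--tensor-parallel-size", "8",
]

def start() -> subprocess.Popen:
    proc = subprocess.Popen(LAUNCH_CMD)
    time.sleep(STARTUP_GRACE_S)              # give the server time to load weights
    return proc

server = start()
while True:
    time.sleep(10)
    healthy = False
    try:
        healthy = requests.get(f"{BASE_URL}/health", timeout=5).status_code == 200
    except requests.RequestException:
        pass
    if not healthy or server.poll() is not None:
        # Engine loop is dead or the process exited; kill leftover workers and
        # start fresh instead of serving 500s indefinitely.
        server.kill()
        server.wait()
        server = start()

The same idea maps onto a Kubernetes livenessProbe against /health, which is probably the cleaner option for the k8s deployments described later in this thread.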

@fengyizhu
Copy link

fengyizhu commented Aug 23, 2024

I attempted to identify the issue by capturing system errors and found that this problem occurs across different Nvidia driver versions, though the frequency has decreased significantly. Under high concurrency, the service is barely usable.

Environment: A800 *8 Nvidia driver 535.129.01 -> 535.169.07

The following errors can be observed through dmesg:

[93297.029837] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x52ef30=0x0 0x52ef34=0x20 0x52ef28=0xc81eb60 0x52ef2c=0x1174
[93297.031618] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x52efb0=0x0 0x52efb4=0x20 0x52efa8=0xc81eb60 0x52efac=0x1174
[93297.040286] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x52f730=0x0 0x52f734=0x20 0x52f728=0xc81eb60 0x52f72c=0x1174
[93297.042086] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x52f7b0=0x0 0x52f7b4=0x20 0x52f7a8=0xc81eb60 0x52f7ac=0x1174
[93297.043905] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x52ff30=0x0 0x52ff34=0x20 0x52ff28=0xc81eb60 0x52ff2c=0x1174
[93297.045662] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x52ffb0=0x0 0x52ffb4=0x20 0x52ffa8=0xc81eb60 0x52ffac=0x1174
[93297.047477] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x534730=0x0 0x534734=0x20 0x534728=0xc81eb60 0x53472c=0x1174
[93297.049427] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x5347b0=0x0 0x5347b4=0x20 0x5347a8=0xc81eb60 0x5347ac=0x1174
[93297.051266] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x534f30=0x0 0x534f34=0x20 0x534f28=0xc81eb60 0x534f2c=0x1174
[93297.053045] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 1, SM 1): Out Of Range Address
[93297.054854] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 6, TPC 1, SM 1): Multiple Warp Errors
[93297.060872] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x534fb0=0x104000e 0x534fb4=0x4 0x534fa8=0xc81eb60 0x534fac=0x1174
[93297.066414] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 2, SM 0): Out Of Range Address
[93297.068051] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 6, TPC 2, SM 0): Multiple Warp Errors
[93297.069644] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x535730=0x103000e 0x535734=0x4 0x535728=0xc81eb60 0x53572c=0x1174
[93297.071464] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 2, SM 1): Out Of Range Address
[93297.073052] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 6, TPC 2, SM 1): Multiple Warp Errors
[93297.074659] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x5357b0=0x106000e 0x5357b4=0x4 0x5357a8=0xc81eb60 0x5357ac=0x1174
[93297.076505] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x535f30=0x0 0x535f34=0x20 0x535f28=0xc81eb60 0x535f2c=0x1174
[93297.078298] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 3, SM 1): Out Of Range Address
[93297.079859] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 6, TPC 3, SM 1): Multiple Warp Errors
[93297.085779] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x535fb0=0x100000e 0x535fb4=0x4 0x535fa8=0xc81eb60 0x535fac=0x1174
[93297.091801] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 4, SM 0): Out Of Range Address
[93297.093387] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 6, TPC 4, SM 0): Multiple Warp Errors
[93297.094970] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x536730=0x104000e 0x536734=0x4 0x536728=0xc81eb60 0x53672c=0x1174
[93297.096772] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 4, SM 1): Out Of Range Address
[93297.098342] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 6, TPC 4, SM 1): Multiple Warp Errors
[93297.099913] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x5367b0=0x104000e 0x5367b4=0x4 0x5367a8=0xc81eb60 0x5367ac=0x1174
[93297.101766] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x536f30=0x0 0x536f34=0x20 0x536f28=0xc81eb60 0x536f2c=0x1174
[93297.103557] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x536fb0=0x0 0x536fb4=0x20 0x536fa8=0xc81eb60 0x536fac=0x1174
[93297.105377] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 6, SM 0): Out Of Range Address
[93297.106933] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 6, TPC 6, SM 0): Multiple Warp Errors
[93297.108510] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x537730=0x106000e 0x537734=0x4 0x537728=0xc81eb60 0x53772c=0x1174
[93297.110512] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 6, SM 1): Out Of Range Address
[93297.112097] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 6, TPC 6, SM 1): Multiple Warp Errors
[93297.113678] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x5377b0=0x104000e 0x5377b4=0x4 0x5377a8=0xc81eb60 0x5377ac=0x1174
[93297.115536] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 7, SM 0): Out Of Range Address
[93297.117251] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 6, TPC 7, SM 0): Multiple Warp Errors
[93297.118825] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x537f30=0x102000e 0x537f34=0x4 0x537f28=0xc81eb60 0x537f2c=0x1174
[93297.120644] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 7, SM 1): Out Of Range Address
[93297.122213] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 6, TPC 7, SM 1): Multiple Warp Errors
[93297.123823] NVRM: Xid (PCI:0000:43:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x537fb0=0x104000e 0x537fb4=0x4 0x537fa8=0xc81eb60 0x537fac=0x1174
[93297.125640] NVRM: Xid (PCI:0000:43:00): 13, pid=8044, name=ray::RayWorkerW, Graphics Exception: ChID 0008, Class 0000c6c0, Offset 00000000, Data 00000000
[93297.563110] NVRM: Xid (PCI:0000:43:00): 31, pid=8044, name=ray::RayWorkerW, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x69f8_e04c8000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

@fengyizhu
Copy link

Got the same issue across multiple versions of the official vLLM Docker images. I first encountered it with version v0.4.3, then 0.5.0, 0.5.3.post1, and now 0.5.4.

envs and launch commands

My environment info:

PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr  6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.45.1.el7.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB
GPU 2: NVIDIA A800-SXM4-80GB
GPU 3: NVIDIA A800-SXM4-80GB
GPU 4: NVIDIA A800-SXM4-80GB
GPU 5: NVIDIA A800-SXM4-80GB
GPU 6: NVIDIA A800-SXM4-80GB
GPU 7: NVIDIA A800-SXM4-80GB

Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 57 bits virtual
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  1
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Stepping:            6
Frequency boost:     enabled
CPU MHz:             3400.000
CPU max MHz:         3400.0000
CPU min MHz:         800.0000
BogoMIPS:            5200.00
Virtualization:      VT-x
L1d cache:           3 MiB
L1i cache:           2 MiB
L2 cache:            80 MiB
L3 cache:            96 MiB
NUMA node0 CPU(s):   0-31
NUMA node1 CPU(s):   32-63
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 invpcid_single intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear pconfig spec_ctrl intel_stibp flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] flashinfer==0.1.2+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.1.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.43.4
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4@
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0     X      NV8     NV8     NV8     NV8     NV8     NV8     NV8     PXB     NODE    NODE    SYS     SYS     NODE    0-31            0                N/A
GPU1    NV8      X      NV8     NV8     NV8     NV8     NV8     NV8     PXB     NODE    NODE    SYS     SYS     NODE    0-31            0                N/A
GPU2    NV8     NV8      X      NV8     NV8     NV8     NV8     NV8     NODE    PXB     NODE    SYS     SYS     PXB     0-31            0                N/A
GPU3    NV8     NV8     NV8      X      NV8     NV8     NV8     NV8     NODE    PXB     NODE    SYS     SYS     PXB     0-31            0                N/A
GPU4    NV8     NV8     NV8     NV8      X      NV8     NV8     NV8     SYS     SYS     SYS     PXB     NODE    SYS     32-63           1                N/A
GPU5    NV8     NV8     NV8     NV8     NV8      X      NV8     NV8     SYS     SYS     SYS     PXB     NODE    SYS     32-63           1                N/A
GPU6    NV8     NV8     NV8     NV8     NV8     NV8      X      NV8     SYS     SYS     SYS     NODE    PXB     SYS     32-63           1                N/A
GPU7    NV8     NV8     NV8     NV8     NV8     NV8     NV8      X      SYS     SYS     SYS     NODE    PXB     SYS     32-63           1                N/A
NIC0    PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    SYS     SYS     NODE
NIC1    NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE     X      NODE    SYS     SYS     PIX
NIC2    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      SYS     SYS     NODE
NIC3    SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE    SYS     SYS     SYS      X      NODE    SYS
NIC4    SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB     SYS     SYS     SYS     NODE     X      SYS
NIC5    NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE    PIX     NODE    SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_4
  NIC3: mlx5_5
  NIC4: mlx5_6
  NIC5: mlx5_bond_0

My launch command is: (part of a k8s yaml file)

        command: ["/bin/bash", "-c"]
        args: [
        # "sudo sed -i '175,+2s/\"dns.google\"/\"8.8.8.8\"/g' /workspace/vllm/utils.py && \
        "nvidia-smi;python3 -m vllm.entrypoints.openai.api_server \
        --host 0.0.0.0 \
        --model /fl/nlp/common/plms/qwen2/Qwen2-72B-Instruct \
        --trust-remote-code \
        --enforce-eager \
        --max-model-len 32768 \
        --gpu-memory-utilization 0.9 \
        --served-model-name qwen2-72bc \
        --tensor-parallel-size 8"
         ]

This error seems to ALWAYS occur under a continuous inference workload. Before v0.5.4, the service could detect its own error state and restart automatically, but with v0.5.4 the automatic restart no longer works.

error log

  • First, this error log appears:
INFO 08-14 14:25:42 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
INFO 08-14 14:25:52 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
ERROR 08-14 14:25:53 async_llm_engine.py:663] Engine iteration timed out. This should never happen!
ERROR 08-14 14:25:53 async_llm_engine.py:57] Engine background task failed
ERROR 08-14 14:25:53 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
ERROR 08-14 14:25:53 async_llm_engine.py:57]     done, _ = await asyncio.wait(
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 08-14 14:25:53 async_llm_engine.py:57]     return await _wait(fs, timeout, return_when, loop)
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 08-14 14:25:53 async_llm_engine.py:57]     await waiter
ERROR 08-14 14:25:53 async_llm_engine.py:57] asyncio.exceptions.CancelledError
ERROR 08-14 14:25:53 async_llm_engine.py:57]
ERROR 08-14 14:25:53 async_llm_engine.py:57] During handling of the above exception, another exception occurred:
ERROR 08-14 14:25:53 async_llm_engine.py:57]
ERROR 08-14 14:25:53 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
ERROR 08-14 14:25:53 async_llm_engine.py:57]     return_value = task.result()
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
ERROR 08-14 14:25:53 async_llm_engine.py:57]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 08-14 14:25:53 async_llm_engine.py:57]     self._do_exit(exc_type)
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 08-14 14:25:53 async_llm_engine.py:57]     raise asyncio.TimeoutError
ERROR 08-14 14:25:53 async_llm_engine.py:57] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=<bound method...7f0a5d6569b0>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37
handle: <Handle _log_task_completion(error_callback=<bound method...7f0a5d6569b0>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
INFO 08-14 14:25:53 async_llm_engine.py:181] Aborted request chat-afa7a52c064a466c952a2eaf29c376a9.
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 59, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 08-14 14:25:53 async_llm_engine.py:181] Aborted request chat-fe65c4670df04192becd6af726e294ca.
INFO:     10.233.99.0:48827 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 189, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 185, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 436, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
    raise request_output
asyncio.exceptions.TimeoutError
INFO:     10.233.99.0:2747 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 189, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 185, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 436, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
    raise request_output
asyncio.exceptions.TimeoutError
(VllmWorkerProcess pid=143) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=146) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=149) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=145) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=147) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=144) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=148) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
  • Then, when a new request was posted, this error log appeared:
INFO:     10.233.99.0:24460 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 189, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 185, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 436, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
    raise request_output
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
  • After a long wait with no new requests, this happens:
[rank0]:[F814 14:47:39.862277231 ProcessGroupNCCL.cpp:1224] [PG 3 Rank 0] [PG 3 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 10
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Then the service does not restart; it just stays stuck here.

question

  • Since this issue was first raised in June, with similar issues reported even earlier, is there any recent update on this error? Is it still being tracked?
  • What can I do to avoid this error? Has anyone found any effective practices?

Thanks!

Environment: A800 *8 Nvidia driver 535.129.01 -> 535.169.07 or 535.183.06


@titu1994
Copy link

titu1994 commented Aug 29, 2024

Environment: 8x A100 80GB for each job
PyTorch 2.3.0
vLLM version: 0.5.3.post1

Batch sizes of around 512, with max_num_seqs = 256, max output sequence length = 1024, and average input sequence length of ~2000.
Model = Mixtral 8x22B, bf16 execution, enforce_eager=True

Ran it on 40 jobs independently, of which 5 ran to completion (no errors in the engine) and 35 ran into this error after varying numbers of completions (500-8000 samples generated before crashing).

INFO 08-29 02:09:39 metrics.py:406] Avg prompt throughput: 612.2 tokens/s, Avg generation throughput: 116.8 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 442 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-29 02:09:44 metrics.py:406] Avg prompt throughput: 7407.4 tokens/s, Avg generation throughput: 524.4 tokens/s, Running: 65 reqs, Swapped: 0 reqs, Pending: 383 reqs, GPU KV cache usage: 3.3%, CPU KV cache usage: 0.0%.
INFO 08-29 02:09:49 metrics.py:406] Avg prompt throughput: 5921.8 tokens/s, Avg generation throughput: 1209.4 tokens/s, Running: 109 reqs, Swapped: 0 reqs, Pending: 337 reqs, GPU KV cache usage: 5.9%, CPU KV cache usage: 0.0%.
INFO 08-29 02:09:54 metrics.py:406] Avg prompt throughput: 4725.1 tokens/s, Avg generation throughput: 1554.2 tokens/s, Running: 145 reqs, Swapped: 0 reqs, Pending: 296 reqs, GPU KV cache usage: 8.0%, CPU KV cache usage: 0.0%.
INFO 08-29 02:09:59 metrics.py:406] Avg prompt throughput: 3872.3 tokens/s, Avg generation throughput: 1691.0 tokens/s, Running: 167 reqs, Swapped: 0 reqs, Pending: 264 reqs, GPU KV cache usage: 9.6%, CPU KV cache usage: 0.0%.
INFO 08-29 02:10:04 metrics.py:406] Avg prompt throughput: 3495.4 tokens/s, Avg generation throughput: 1822.9 tokens/s, Running: 182 reqs, Swapped: 0 reqs, Pending: 237 reqs, GPU KV cache usage: 10.9%, CPU KV cache usage: 0.0%.
INFO 08-29 02:10:09 metrics.py:406] Avg prompt throughput: 3470.7 tokens/s, Avg generation throughput: 1969.9 tokens/s, Running: 188 reqs, Swapped: 0 reqs, Pending: 211 reqs, GPU KV cache usage: 11.8%, CPU KV cache usage: 0.0%.
INFO 08-29 02:10:14 metrics.py:406] Avg prompt throughput: 3389.8 tokens/s, Avg generation throughput: 1995.6 tokens/s, Running: 199 reqs, Swapped: 0 reqs, Pending: 180 reqs, GPU KV cache usage: 12.4%, CPU KV cache usage: 0.0%.
INFO 08-29 02:10:19 metrics.py:406] Avg prompt throughput: 3037.5 tokens/s, Avg generation throughput: 1923.4 tokens/s, Running: 202 reqs, Swapped: 0 reqs, Pending: 154 reqs, GPU KV cache usage: 12.8%, CPU KV cache usage: 0.0%.
INFO 08-29 02:10:24 metrics.py:406] Avg prompt throughput: 2964.9 tokens/s, Avg generation throughput: 1932.5 tokens/s, Running: 206 reqs, Swapped: 0 reqs, Pending: 130 reqs, GPU KV cache usage: 13.3%, CPU KV cache usage: 0.0%.
INFO 08-29 02:10:29 metrics.py:406] Avg prompt throughput: 2919.2 tokens/s, Avg generation throughput: 1925.4 tokens/s, Running: 205 reqs, Swapped: 0 reqs, Pending: 106 reqs, GPU KV cache usage: 13.5%, CPU KV cache usage: 0.0%.
INFO 08-29 02:10:35 metrics.py:406] Avg prompt throughput: 2895.4 tokens/s, Avg generation throughput: 1962.6 tokens/s, Running: 209 reqs, Swapped: 0 reqs, Pending: 78 reqs, GPU KV cache usage: 13.9%, CPU KV cache usage: 0.0%.
INFO 08-29 02:10:40 metrics.py:406] Avg prompt throughput: 2880.0 tokens/s, Avg generation throughput: 1954.9 tokens/s, Running: 209 reqs, Swapped: 0 reqs, Pending: 55 reqs, GPU KV cache usage: 14.2%, CPU KV cache usage: 0.0%.
INFO 08-29 02:10:45 metrics.py:406] Avg prompt throughput: 2834.2 tokens/s, Avg generation throughput: 1966.7 tokens/s, Running: 210 reqs, Swapped: 0 reqs, Pending: 28 reqs, GPU KV cache usage: 14.0%, CPU KV cache usage: 0.0%.
INFO 08-29 02:10:50 metrics.py:406] Avg prompt throughput: 2847.6 tokens/s, Avg generation throughput: 1954.2 tokens/s, Running: 209 reqs, Swapped: 0 reqs, Pending: 4 reqs, GPU KV cache usage: 14.1%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.1:54038 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 08-29 02:10:55 metrics.py:406] Avg prompt throughput: 480.1 tokens/s, Avg generation throughput: 2030.4 tokens/s, Running: 183 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 12.7%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.1:54054 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 08-29 02:11:00 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1850.8 tokens/s, Running: 154 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.5%, CPU KV cache usage: 0.0%.
INFO 08-29 02:11:05 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1751.2 tokens/s, Running: 134 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 10.6%, CPU KV cache usage: 0.0%.
INFO 08-29 02:11:10 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1479.8 tokens/s, Running: 100 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 8.7%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.1:54058 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 08-29 02:11:15 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1221.1 tokens/s, Running: 71 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 6.8%, CPU KV cache usage: 0.0%.
INFO 08-29 02:11:20 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 841.5 tokens/s, Running: 38 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.9%, CPU KV cache usage: 0.0%.
INFO 08-29 02:11:32 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 96.1 tokens/s, Running: 27 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.9%, CPU KV cache usage: 0.0%.
INFO 08-29 02:12:02 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 27 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.9%, CPU KV cache usage: 0.0%.
INFO 08-29 02:12:12 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 27 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.9%, CPU KV cache usage: 0.0%.
INFO 08-29 02:12:22 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 27 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.9%, CPU KV cache usage: 0.0%.
ERROR 08-29 02:12:22 async_llm_engine.py:658] Engine iteration timed out. This should never happen!
...
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
                                                                                                                                          5341,1         0%
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 145, in create_completion
    generator = await openai_serving_completion.create_completion(
File "/workspace/vllm/vllm/entrypoints/openai/serving_completion.py", line 175, in create_completion
    async for i, res in result_generator:
  File "/workspace/vllm/vllm/utils.py", line 329, in consumer
    raise e
  File "/workspace/vllm/vllm/utils.py", line 320, in consumer
    raise item
  File "/workspace/vllm/vllm/utils.py", line 304, in producer
    async for item in iterator:
  File "/workspace/vllm/vllm/engine/async_llm_engine.py", line 772, in generate
    async for output in self._process_request(
  File "/workspace/vllm/vllm/engine/async_llm_engine.py", line 873, in _process_request
    stream = await self.add_request(
  File "/workspace/vllm/vllm/engine/async_llm_engine.py", line 676, in add_request
    self.start_background_loop()
  File "/workspace/vllm/vllm/engine/async_llm_engine.py", line 516, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
INFO:     127.0.0.1:30584 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application

@mku-wedoai
Copy link

I experienced the issue too.

My setup:

  • 2 x A100 80 GB, run in Kubernetes cluster,
  • vLLM version: 0.5.4,
  • model: Mixtral-8x22B-Instruct-v0.1-GPTQ-4bit,
  • start arguments:
  - "--tensor-parallel-size"  - "2"
  - "--distributed-executor-backend"  - "ray"
  - "--gpu_memory_utilization"  - "0.98"
  - "--enable-chunked-prefill"  - "False"
  - "--disable-custom-all-reduce"  - "True"

The issue seems to happen when I send around 30 parallel requests every few seconds. The prompt is about 1,400 tokens long.
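For reference, this kind of load pattern can be scripted against the OpenAI-compatible endpoint to try to reproduce the crash. The sketch below is only an approximation of the setup described here; the endpoint URL, model name, prompt length, concurrency, and interval are assumptions, not the exact values used.

# Hypothetical reproduction sketch: ~30 parallel chat requests every few
# seconds with a long prompt, watching for the point where 200s become 500s.
import asyncio

import aiohttp  # assumed available; httpx or requests+threads would also work

BASE_URL = "http://127.0.0.1:8000/v1/chat/completions"   # placeholder
MODEL = "your-served-model-name"                          # placeholder
PROMPT = "word " * 1400                                   # rough 1400-token stand-in

async def one_request(session: aiohttp.ClientSession) -> int:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 256,
    }
    async with session.post(BASE_URL, json=payload) as resp:
        await resp.text()
        return resp.status

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        while True:
            statuses = await asyncio.gather(
                *[one_request(session) for _ in range(30)]
            )
            print(statuses)  # 500s start appearing once the engine loop dies
            await asyncio.sleep(3)

if __name__ == "__main__":
    asyncio.run(main())

Logging when the responses flip to 500 (and whether the shm_broadcast warnings show up at the same moment) makes it easier to attach a concrete request pattern to this tracking issue.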

@zhaotyer
Copy link
Contributor

zhaotyer commented Sep 2, 2024

vLLM 0.5.3.post1+cu118; the error info is:

INFO 08-28 14:15:50 model.py:1691] prompt is:<|im_start|>system
Please read the following text carefully. Before answering, first think about what the text's main content and core points are. Then describe your thinking process coherently in Chinese, and on that basis summarize the main idea of the text. Please make sure your summary is both accurate and includes all key information.<|im_end|>
<|im_start|>user
82<|im_end|>
<|im_start|>assistant

(VllmWorkerProcess pid=340358) WARNING 08-28 14:16:50 shm_broadcast.py:404] No available block found in 60 second.
I0828 14:15:59.390105 339898 grpc_server.cc:4196] Process for ModelStreamInferHandler, rpc_ok=1, context 0, 0 step START
INFO 08-28 14:15:50 async_llm_engine.py:173] Added request c14e8b5485c540cb89fbf93f65c6255c.
I0828 14:17:18.014911 339898 grpc_server.cc:4189] New request handler for ModelStreamInferHandler, 0
I0828 14:17:18.015064 339898 grpc_server.cc:4196] Process for ModelStreamInferHandler, rpc_ok=1, context 0, 0 step READ
I0828 14:17:18.015154 339898 infer_request.cc:729] [request id: cr7j27oud5nvlek8se5g] prepared: [0x0x7fdf4402f7d0] request id: cr7j27oud5nvlek8se5g, model: atom, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7fdf4402de58] input: JSON, type: BYTES, original shape: [1], batch + shape: [1], shape: [1]
override inputs:
inputs:
[0x0x7fdf4402de58] input: JSON, type: BYTES, original shape: [1], batch + shape: [1], shape: [1]
original requested outputs:
RESULT
requested outputs:
RESULT

I0828 14:17:18.015272 339898 grpc_server.cc:4196] Process for ModelStreamInferHandler, rpc_ok=0, context 0, 0 step READ
I0828 14:17:18.015293 339898 grpc_server.cc:2710] Done for ModelStreamInferHandler, 0
(VllmWorkerProcess pid=340357) WARNING 08-28 14:16:50 shm_broadcast.py:404] No available block found in 60 second.
I0828 14:17:18.015400 339898 python_be.cc:1028] model atom, instance atom_0, executing 1 requests
(VllmWorkerProcess pid=340359) WARNING 08-28 14:16:50 shm_broadcast.py:404] No available block found in 60 second.
I0828 14:17:18.016570 339898 grpc_server.cc:3518] ModelInferHandler::InferRequestComplete
I0828 14:17:18.016631 339898 python_be.cc:1978] TRITONBACKEND_ModelInstanceExecute: model instance name atom_0 released 1 requests
INFO 08-28 14:17:18 metrics.py:396] Avg prompt throughput: 4.2 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 9 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
ERROR 08-28 14:17:18 async_llm_engine.py:658] Engine iteration timed out. This should never happen!
INFO 08-28 14:17:18 model.py:1665] input query is:hi
INFO 08-28 14:17:18 model.py:1691] prompt is:<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
hi<|im_end|>
<|im_start|>assistant

ERROR 08-28 14:17:18 model.py:1656] Capture error:Background loop has errored already.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/usr/lib/python3.8/asyncio/tasks.py", line 426, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 534, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
    await asyncio.sleep(0)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/models/atom/1/model.py", line 1601, in vllm_response_thread
    async for request_output in results_generator:
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 772, in generate
    async for output in self._process_request(
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 873, in _process_request
    stream = await self.add_request(
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 676, in add_request
    self.start_background_loop()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 516, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
I0828 14:17:18.044533 339898 grpc_server.cc:4568] ModelStreamInferHandler::StreamInferComplete, context 0, 0 step ISSUED, callback index 1, flags 0
I0828 14:17:18.044585 339898 grpc_server.cc:4620] Failed for ID: cr7j27oud5nvlek8se5g

I0828 14:17:18.045161 339898 grpc_server.cc:4568] ModelStreamInferHandler::StreamInferComplete, context 0, 0 step WRITEREADY, callback index 2, flags 1
I0828 14:17:18.045316 339898 grpc_server.cc:4196] Process for ModelStreamInferHandler, rpc_ok=1, context 0, 0 step WRITEREADY
I0828 14:17:18.045532 339898 grpc_server.cc:4196] Process for ModelStreamInferHandler, rpc_ok=1, context 0, 0 step WRITTEN
I0828 14:17:18.045599 339898 grpc_server.cc:4196] Process for ModelStreamInferHandler, rpc_ok=1, context 0, 0 step COMPLETE
I0828 14:17:18.045613 339898 grpc_server.cc:2710] Done for ModelStreamInferHandler, 0
ERROR 08-28 14:17:18 async_llm_engine.py:56] Engine background task failed
ERROR 08-28 14:17:18 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 08-28 14:17:18 async_llm_engine.py:56]   File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
ERROR 08-28 14:17:18 async_llm_engine.py:56]     done, _ = await asyncio.wait(
ERROR 08-28 14:17:18 async_llm_engine.py:56]   File "/usr/lib/python3.8/asyncio/tasks.py", line 426, in wait
ERROR 08-28 14:17:18 async_llm_engine.py:56]     return await _wait(fs, timeout, return_when, loop)
ERROR 08-28 14:17:18 async_llm_engine.py:56]   File "/usr/lib/python3.8/asyncio/tasks.py", line 534, in _wait
ERROR 08-28 14:17:18 async_llm_engine.py:56]     await waiter
ERROR 08-28 14:17:18 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 08-28 14:17:18 async_llm_engine.py:56] 
ERROR 08-28 14:17:18 async_llm_engine.py:56] During handling of the above exception, another exception occurred:
ERROR 08-28 14:17:18 async_llm_engine.py:56] 
ERROR 08-28 14:17:18 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 08-28 14:17:18 async_llm_engine.py:56]   File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 08-28 14:17:18 async_llm_engine.py:56]     return_value = task.result()
ERROR 08-28 14:17:18 async_llm_engine.py:56]   File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
ERROR 08-28 14:17:18 async_llm_engine.py:56]     await asyncio.sleep(0)
ERROR 08-28 14:17:18 async_llm_engine.py:56]   File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 08-28 14:17:18 async_llm_engine.py:56]     self._do_exit(exc_type)
ERROR 08-28 14:17:18 async_llm_engine.py:56]   File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 08-28 14:17:18 async_llm_engine.py:56]     raise asyncio.TimeoutError
ERROR 08-28 14:17:18 async_llm_engine.py:56] asyncio.exceptions.TimeoutError

/usr/local/lib/python3.8/dist-packages/transformers/generation/configuration_utils.py:615: UserWarning: `num_beams` is set to 1. However, `early_stopping` is set to `True` -- this flag is only used in beam-based generation modes. You should set `num_beams>1` or unset `early_stopping`.
  warnings.warn(
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=<bound method...7fb7a08af280>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=<bound method...7fb7a08af280>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/usr/lib/python3.8/asyncio/tasks.py", line 426, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 534, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
    await asyncio.sleep(0)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
I0828 14:17:18.095796 339898 grpc_server.cc:4568] ModelStreamInferHandler::StreamInferComplete, context 0, 0 step ISSUED, callback index 1, flags 0
I0828 14:17:18.095858 339898 grpc_server.cc:4620] Failed for ID: cr7j238ud5nvlek8sdug

I0828 14:17:18.096197 339898 grpc_server.cc:4196] Process for ModelStreamInferHandler, rpc_ok=1, context 0, 0 step WRITEREADY
I0828 14:17:18.096310 339898 grpc_server.cc:4568] ModelStreamInferHandler::StreamInferComplete, context 0, 0 step WRITTEN, callback index 2, flags 1
I0828 14:17:18.096412 339898 grpc_server.cc:4196] Process for ModelStreamInferHandler, rpc_ok=1, context 0, 0 step WRITTEN
I0828 14:17:18.096492 339898 grpc_server.cc:4196] Process for ModelStreamInferHandler, rpc_ok=1, context 0, 0 step COMPLETE
I0828 14:17:18.096505 339898 grpc_server.cc:2710] Done for ModelStreamInferHandler, 0
INFO 08-28 14:17:18 async_llm_engine.py:180] Aborted request de36db22e82e463986ab6367270e4a78.
ERROR 08-28 14:17:18 model.py:1656] Capture error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/usr/lib/python3.8/asyncio/tasks.py", line 426, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 534, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/models/atom/1/model.py", line 1601, in vllm_response_thread
    async for request_output in results_generator:
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 772, in generate
    async for output in self._process_request(
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 888, in _process_request
    raise e
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 884, in _process_request
    async for request_output in stream:
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 93, in __anext__
    raise result
  File "/models/atom/1/model.py", line 1601, in vllm_response_thread
    async for request_output in results_generator:
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 772, in generate
    async for output in self._process_request(
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 888, in _process_request
    raise e
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 884, in _process_request
    async for request_output in stream:
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 93, in __anext__
    raise result
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
    await asyncio.sleep(0)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
I0828 14:17:18.098339 339898 grpc_server.cc:4568] ModelStreamInferHandler::StreamInferComplete, context 0, 0 step ISSUED, callback index 1, flags 0
I0828 14:17:18.098404 339898 grpc_server.cc:4620] Failed for ID: cr7j23gud5nvlek8sdvg

I0828 14:17:18.099036 339898 grpc_server.cc:4568] ModelStreamInferHandler::StreamInferComplete, context 0, 0 step WRITEREADY, callback index 2, flags 1
I0828 14:17:18.099245 339898 grpc_server.cc:4196] Process for ModelStreamInferHandler, rpc_ok=1, context 0, 0 step WRITEREADY
I0828 14:17:18.099403 339898 grpc_server.cc:4196] Process for ModelStreamInferHandler, rpc_ok=1, context 0, 0 step WRITTEN
INFO 08-28 14:17:18 async_llm_engine.py:180] Aborted request d15c7b7a14ae4f10b82eedd6066de46c.
I0828 14:17:18.099460 339898 grpc_server.cc:4196] Process for ModelStreamInferHandler, rpc_ok=1, context 0, 0 step COMPLETE
I0828 14:17:18.131106 339898 grpc_server.cc:2710] Done for ModelStreamInferHandler, 0

@eyuansu62

A new case:

Collecting environment information...
WARNING 09-13 11:18:12 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.6
Libc version: glibc-2.35

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-126-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.128
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB
GPU 2: NVIDIA A800-SXM4-80GB
GPU 3: NVIDIA A800-SXM4-80GB
GPU 4: NVIDIA A800-SXM4-80GB
GPU 5: NVIDIA A800-SXM4-80GB
GPU 6: NVIDIA A800-SXM4-80GB
GPU 7: NVIDIA A800-SXM4-80GB

Nvidia driver version: 470.141.10
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   43 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          256
On-line CPU(s) list:             0-255
Vendor ID:                       AuthenticAMD
Model name:                      AMD EPYC 7742 64-Core Processor
CPU family:                      23
Model:                           49
Thread(s) per core:              2
Core(s) per socket:              64
Socket(s):                       2
Stepping:                        0
Frequency boost:                 enabled
CPU max MHz:                     2250.0000
CPU min MHz:                     1500.0000
BogoMIPS:                        4491.36
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
Virtualization:                  AMD-V
L1d cache:                       4 MiB (128 instances)
L1i cache:                       4 MiB (128 instances)
L2 cache:                        64 MiB (128 instances)
L3 cache:                        512 MiB (32 instances)
NUMA node(s):                    8
NUMA node0 CPU(s):               0-15,128-143
NUMA node1 CPU(s):               16-31,144-159
NUMA node2 CPU(s):               32-47,160-175
NUMA node3 CPU(s):               48-63,176-191
NUMA node4 CPU(s):               64-79,192-207
NUMA node5 CPU(s):               80-95,208-223
NUMA node6 CPU(s):               96-111,224-239
NUMA node7 CPU(s):               112-127,240-255
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu11==11.10.3.66
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu11==11.7.101
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu11==11.7.99
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu11==11.7.99
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu11==8.5.0.96
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu11==10.9.0.58
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu11==10.2.10.91
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu11==11.4.0.1
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu11==11.7.4.91
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.535.133
[pip3] nvidia-nccl-cu11==2.14.3
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.3.101
[pip3] nvidia-nvtx-cu11==11.7.91
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pynvml==11.5.0
[pip3] pyzmq==26.2.0
[pip3] sentence-transformers==2.2.2
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] transformers-stream-generator==0.0.4
[pip3] triton==3.0.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu11        11.10.3.66               pypi_0    pypi
[conda] nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu11    11.7.101                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu11    11.7.99                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu11  11.7.99                  pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
[conda] nvidia-cudnn-cu11         8.5.0.96                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu11         10.9.0.58                pypi_0    pypi
[conda] nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
[conda] nvidia-curand-cu11        10.2.10.91               pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
[conda] nvidia-cusolver-cu11      11.4.0.1                 pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
[conda] nvidia-cusparse-cu11      11.7.4.91                pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
[conda] nvidia-ml-py              12.535.133               pypi_0    pypi
[conda] nvidia-nccl-cu11          2.14.3                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.3.101                 pypi_0    pypi
[conda] nvidia-nvtx-cu11          11.7.91                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
[conda] pynvml                    11.5.0                   pypi_0    pypi
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] sentence-transformers     2.2.2                    pypi_0    pypi
[conda] torch                     2.4.0                    pypi_0    pypi
[conda] torchvision               0.19.0                   pypi_0    pypi
[conda] transformers              4.44.2                   pypi_0    pypi
[conda] transformers-stream-generator 0.0.4                    pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.0@32e7db25365415841ebc7c4215851743fbb1bad1
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx5_0	mlx5_1	mlx5_2	mlx5_3	mlx5_4	mlx5_5	mlx5_6	mlx5_7	mlx5_8	mlx5_9	CPU Affinity	NUMA Affinity
GPU0	 X 	NV8	NV8	NV8	NV8	NV8	NV8	NV8	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU1	NV8	 X 	NV8	NV8	NV8	NV8	NV8	NV8	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU2	NV8	NV8	 X 	NV8	NV8	NV8	NV8	NV8	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU3	NV8	NV8	NV8	 X 	NV8	NV8	NV8	NV8	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU4	NV8	NV8	NV8	NV8	 X 	NV8	NV8	NV8	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	112-127,240-255	7
GPU5	NV8	NV8	NV8	NV8	NV8	 X 	NV8	NV8	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	112-127,240-255	7
GPU6	NV8	NV8	NV8	NV8	NV8	NV8	 X 	NV8	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	80-95,208-223	5
GPU7	NV8	NV8	NV8	NV8	NV8	NV8	NV8	 X 	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	80-95,208-223	5
mlx5_0	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS
mlx5_1	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	PXB	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS
mlx5_2	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PXB	SYS	SYS	SYS	SYS	SYS	SYS
mlx5_3	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	PXB	 X 	SYS	SYS	SYS	SYS	SYS	SYS
mlx5_4	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PXB	SYS	SYS	SYS	SYS
mlx5_5	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	PXB	 X 	SYS	SYS	SYS	SYS
mlx5_6	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PXB	SYS	SYS
mlx5_7	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	PXB	 X 	SYS	SYS
mlx5_8	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX
mlx5_9	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

@tuobay

tuobay commented Sep 25, 2024

any update?

@robertgshaw2-neuralmagic
Collaborator Author

To everyone commenting on this issue: AsyncEngineDeadError occurs when the engine gets into a bad state. It can be caused by almost anything, but especially by an illegal memory access or something of the sort.

We are trying to remove as many bugs as possible. So please, as per the original comment:

“When reporting an issue, please include a sample request that causes the issue so we can reproduce on our side.”
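For anyone preparing a report, a self-contained snippet like the sketch below is the easiest thing to replay. It assumes an OpenAI-compatible vLLM server on `localhost:8000` and a placeholder model name `my-model`; substitute the exact endpoint, model, prompt, and sampling parameters from your failing deployment.

```python
# Minimal reproduction sketch (assumptions: an OpenAI-compatible vLLM server on
# localhost:8000 and a placeholder model name "my-model" -- replace both with
# the values from your own setup).
import json
import requests

payload = {
    "model": "my-model",   # placeholder; use the model name your server serves
    "prompt": "hi",        # replace with the exact prompt that triggers the error
    "max_tokens": 64,
    "stream": False,
}

# Print the payload so it can be pasted verbatim into the bug report.
print(json.dumps(payload, indent=2))

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json=payload,
    timeout=120,
)
print(resp.status_code, resp.text)
```

Attaching the printed payload together with the server-side traceback (like the ones above) gives the maintainers a case they can replay directly.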
