
Conversation

@gshtras gshtras commented Sep 24, 2025

Fixing an issue from #24532.
ncclCommWindowRegister and ncclCommWindowDeregister only exist in NCCL >= 2.27.03.
On ROCm, the unhandled exception raised while trying to import these symbols in that part of the code crashes graph capturing with the `HIP error: operation not permitted when stream is capturing` error.
I suppose the least invasive solution would be to silently swallow the import error on ROCm for these two symbols.
cc @Amir-19
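
A minimal sketch of the idea (illustrative names only, not the actual vLLM pynccl wrapper API): symbols introduced in NCCL 2.27.03 are skipped on ROCm when the loaded librccl.so does not export them, instead of letting the exception propagate.

```python
# Hedged sketch -- function and variable names are illustrative.
import ctypes

OPTIONAL_SYMBOLS = {"ncclCommWindowRegister", "ncclCommWindowDeregister"}

def load_symbols(lib: ctypes.CDLL, names: list[str], is_rocm: bool) -> dict:
    funcs = {}
    for name in names:
        try:
            funcs[name] = getattr(lib, name)
        except AttributeError:
            # librccl.so may not export the NCCL >= 2.27.03 window
            # registration symbols; tolerate that silently on ROCm.
            if is_rocm and name in OPTIONAL_SYMBOLS:
                continue
            raise
    return funcs
```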

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@gshtras gshtras requested a review from youkaichao September 24, 2025 20:50
@mergify mergify bot added the rocm Related to AMD ROCm label Sep 24, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix a crash on ROCm when certain NCCL symbols are not present by silently ignoring them during library initialization. However, the current implementation appears to be flawed. It checks if the function pointer is None, whereas ctypes is expected to raise an AttributeError for missing symbols. This suggests the fix may not work as intended. I have provided a suggestion to correctly handle missing symbols using a try...except block, which should make the fix more robust.
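
To illustrate the point (a standalone sketch, not the PR's actual code): `ctypes` raises `AttributeError` when a shared library does not export a symbol, so an `is None` check never fires; the lookup has to be wrapped in `try...except` instead.

```python
import ctypes

lib = ctypes.CDLL("librccl.so.1")  # library path is illustrative

# Flawed pattern: getattr either returns a function pointer or raises,
# so this branch can never be taken for a missing symbol.
# fn = getattr(lib, "ncclCommWindowRegister")
# if fn is None:
#     ...

# Robust pattern: catch the AttributeError raised for a missing export.
try:
    fn = getattr(lib, "ncclCommWindowRegister")
except AttributeError:
    fn = None  # symbol absent in this NCCL/RCCL build
```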


Amir-19 commented Sep 24, 2025

I'm not sure whether it is better to exclude NCCL-specific functions, as you did here, or to include them as in #25608.

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

gshtras commented Sep 24, 2025

I'm not sure whether it is better to exclude NCCL-specific functions, as you did here, or to include them as in #25608.

That's also possible, but the condition should ideally be the NCCL version rather than CUDA vs. ROCm.
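
For example, one could gate on the runtime NCCL/RCCL version instead of the platform. A sketch under the assumption that `ncclGetVersion` is reachable through the ctypes handle; for 2.x releases the version code is encoded as major*10000 + minor*100 + patch, so 2.27.3 becomes 22703.

```python
import ctypes

def nccl_version_code(lib: ctypes.CDLL) -> int:
    version = ctypes.c_int()
    # ncclGetVersion returns ncclSuccess (0) on success.
    assert lib.ncclGetVersion(ctypes.byref(version)) == 0
    return version.value

def supports_window_registration(lib: ctypes.CDLL) -> bool:
    # ncclCommWindowRegister/Deregister were added in NCCL 2.27.03.
    return nccl_version_code(lib) >= 22703
```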

@gshtras gshtras added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 25, 2025
"ncclCommWindowDeregister"
]:
# These symbols require NCCL >= 2.27.03, and having
# an exception here on ROCm platform is not allowed

Can you add a logger.warn_once here that mentions that these symbols failed to import and that users should update their nccl/rccl libraries if they want to use them?
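
Something along these lines could work (a sketch only; stdlib `logging` stands in for vLLM's logger and its once-only warning helper, and the loop structure is assumed from the diff context):

```python
import ctypes
import logging

logger = logging.getLogger(__name__)  # stand-in for vLLM's logger

def load_symbols(lib: ctypes.CDLL, names: list[str]) -> dict:
    funcs = {}
    for name in names:
        try:
            funcs[name] = getattr(lib, name)
        except AttributeError:
            # The suggestion is a once-only variant of this warning.
            logger.warning(
                "Failed to import NCCL symbol %s; it requires "
                "NCCL >= 2.27.03. Update your NCCL/RCCL library to use it.",
                name)
            continue
    return funcs
```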

…nabled

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

JartX commented Sep 28, 2025

I confirm that this is necessary to run multi-GPU models on ROCm; it fixes the following error:

vllm1-1  | INFO 09-28 07:27:46 [__init__.py:1383] Found nccl from library librccl.so.1
vllm1-1  | INFO 09-28 07:27:46 [__init__.py:1383] Found nccl from library librccl.so.1
vllm1-1  | INFO 09-28 07:27:46 [parallel_state.py:1208] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
vllm1-1  | INFO 09-28 07:27:46 [parallel_state.py:1208] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
vllm1-1  | INFO 09-28 07:27:46 [__init__.py:1383] Found nccl from library librccl.so.1
vllm1-1  | INFO 09-28 07:27:46 [parallel_state.py:1208] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
vllm1-1  | INFO 09-28 07:27:46 [parallel_state.py:1208] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
vllm1-1  | (Worker_TP3 pid=241) INFO 09-28 07:27:47 [gpu_model_runner.py:2647] Starting to load model /models/Qwen3-Next-80B-A3B-Instruct-w4g128...
vllm1-1  | (Worker_TP1 pid=239) INFO 09-28 07:27:47 [gpu_model_runner.py:2647] Starting to load model /models/Qwen3-Next-80B-A3B-Instruct-w4g128...
vllm1-1  | (Worker_TP2 pid=240) INFO 09-28 07:27:47 [gpu_model_runner.py:2647] Starting to load model /models/Qwen3-Next-80B-A3B-Instruct-w4g128...
vllm1-1  | (Worker_TP0 pid=238) INFO 09-28 07:27:47 [gpu_model_runner.py:2647] Starting to load model /models/Qwen3-Next-80B-A3B-Instruct-w4g128...
vllm1-1  | (Worker_TP1 pid=239) INFO 09-28 07:27:47 [gpu_model_runner.py:2679] Loading model from scratch...
vllm1-1  | (Worker_TP2 pid=240) INFO 09-28 07:27:47 [gpu_model_runner.py:2679] Loading model from scratch...
vllm1-1  | (Worker_TP3 pid=241) INFO 09-28 07:27:47 [gpu_model_runner.py:2679] Loading model from scratch...
vllm1-1  | (Worker_TP0 pid=238) INFO 09-28 07:27:47 [gpu_model_runner.py:2679] Loading model from scratch...
vllm1-1  | (Worker_TP1 pid=239) `torch_dtype` is deprecated! Use `dtype` instead!
vllm1-1  | (Worker_TP2 pid=240) `torch_dtype` is deprecated! Use `dtype` instead!
vllm1-1  | (Worker_TP3 pid=241) `torch_dtype` is deprecated! Use `dtype` instead!
vllm1-1  | (Worker_TP0 pid=238) `torch_dtype` is deprecated! Use `dtype` instead!
vllm1-1  | (Worker_TP2 pid=240) INFO 09-28 07:27:47 [rocm.py:245] Using Rocm/Aiter Attention backend on V1 engine.
vllm1-1  | (Worker_TP3 pid=241) INFO 09-28 07:27:47 [rocm.py:245] Using Rocm/Aiter Attention backend on V1 engine.
vllm1-1  | (Worker_TP1 pid=239) INFO 09-28 07:27:47 [rocm.py:245] Using Rocm/Aiter Attention backend on V1 engine.
vllm1-1  | (Worker_TP0 pid=238) INFO 09-28 07:27:47 [rocm.py:245] Using Rocm/Aiter Attention backend on V1 engine.
Loading safetensors checkpoint shards: 100% 9/9 [00:22<00:00,  2.54s/it]
vllm1-1  | (Worker_TP0 pid=238) INFO 09-28 07:28:11 [default_loader.py:267] Loading weights took 22.95 seconds
vllm1-1  | (Worker_TP0 pid=238) INFO 09-28 07:28:11 [gpu_model_runner.py:2698] Model loading took 9.8945 GiB and 23.623914 seconds
vllm1-1  | (Worker_TP3 pid=241) INFO 09-28 07:28:15 [default_loader.py:267] Loading weights took 27.80 seconds
vllm1-1  | (Worker_TP3 pid=241) INFO 09-28 07:28:16 [gpu_model_runner.py:2698] Model loading took 9.8945 GiB and 28.500534 seconds
vllm1-1  | (Worker_TP1 pid=239) INFO 09-28 07:28:16 [default_loader.py:267] Loading weights took 28.14 seconds
vllm1-1  | (Worker_TP2 pid=240) INFO 09-28 07:28:16 [default_loader.py:267] Loading weights took 28.14 seconds
vllm1-1  | (Worker_TP1 pid=239) INFO 09-28 07:28:16 [gpu_model_runner.py:2698] Model loading took 9.8945 GiB and 28.843373 seconds
vllm1-1  | (Worker_TP2 pid=240) INFO 09-28 07:28:16 [gpu_model_runner.py:2698] Model loading took 9.8945 GiB and 28.843287 seconds
vllm1-1  | (Worker_TP1 pid=239) INFO 09-28 07:28:20 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/9dfdfbbf8f/rank_1_0/backbone for vLLM's torch.compile
vllm1-1  | (Worker_TP2 pid=240) INFO 09-28 07:28:20 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/9dfdfbbf8f/rank_2_0/backbone for vLLM's torch.compile
vllm1-1  | (Worker_TP0 pid=238) INFO 09-28 07:28:20 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/9dfdfbbf8f/rank_0_0/backbone for vLLM's torch.compile
vllm1-1  | (Worker_TP3 pid=241) INFO 09-28 07:28:20 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/9dfdfbbf8f/rank_3_0/backbone for vLLM's torch.compile
vllm1-1  | (Worker_TP1 pid=239) INFO 09-28 07:28:20 [backends.py:559] Dynamo bytecode transform time: 3.72 s
vllm1-1  | (Worker_TP2 pid=240) INFO 09-28 07:28:20 [backends.py:559] Dynamo bytecode transform time: 3.73 s
vllm1-1  | (Worker_TP0 pid=238) INFO 09-28 07:28:20 [backends.py:559] Dynamo bytecode transform time: 3.72 s
vllm1-1  | (Worker_TP3 pid=241) INFO 09-28 07:28:20 [backends.py:559] Dynamo bytecode transform time: 3.71 s
vllm1-1  | (Worker_TP0 pid=238) INFO 09-28 07:28:21 [backends.py:197] Cache the graph for dynamic shape for later use
vllm1-1  | (Worker_TP3 pid=241) INFO 09-28 07:28:21 [backends.py:197] Cache the graph for dynamic shape for later use
vllm1-1  | (Worker_TP2 pid=240) INFO 09-28 07:28:21 [backends.py:197] Cache the graph for dynamic shape for later use
vllm1-1  | (Worker_TP1 pid=239) INFO 09-28 07:28:21 [backends.py:197] Cache the graph for dynamic shape for later use
vllm1-1  | (Worker_TP2 pid=240) INFO 09-28 07:28:41 [backends.py:218] Compiling a graph for dynamic shape takes 20.51 s
vllm1-1  | (Worker_TP1 pid=239) INFO 09-28 07:28:41 [backends.py:218] Compiling a graph for dynamic shape takes 20.59 s
vllm1-1  | (Worker_TP0 pid=238) INFO 09-28 07:28:41 [backends.py:218] Compiling a graph for dynamic shape takes 20.60 s
vllm1-1  | (Worker_TP3 pid=241) INFO 09-28 07:28:41 [backends.py:218] Compiling a graph for dynamic shape takes 20.70 s
vllm1-1  | (Worker_TP0 pid=238) INFO 09-28 07:28:46 [fused_moe.py:788] Using configuration from /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=0x744c,dtype=int4_w4a16.json for MoE layer.
vllm1-1  | (Worker_TP3 pid=241) INFO 09-28 07:28:46 [fused_moe.py:788] Using configuration from /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=0x744c,dtype=int4_w4a16.json for MoE layer.
vllm1-1  | (Worker_TP2 pid=240) INFO 09-28 07:28:46 [fused_moe.py:788] Using configuration from /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=0x744c,dtype=int4_w4a16.json for MoE layer.
vllm1-1  | (Worker_TP1 pid=239) INFO 09-28 07:28:46 [fused_moe.py:788] Using configuration from /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=0x744c,dtype=int4_w4a16.json for MoE layer.
vllm1-1  | (Worker_TP0 pid=238) INFO 09-28 07:28:48 [monitor.py:34] torch.compile takes 24.32 s in total
vllm1-1  | (Worker_TP3 pid=241) INFO 09-28 07:28:48 [monitor.py:34] torch.compile takes 24.41 s in total
vllm1-1  | (Worker_TP1 pid=239) INFO 09-28 07:28:48 [monitor.py:34] torch.compile takes 24.31 s in total
vllm1-1  | (Worker_TP2 pid=240) INFO 09-28 07:28:48 [monitor.py:34] torch.compile takes 24.24 s in total
vllm1-1  | (Worker_TP2 pid=240) INFO 09-28 07:28:49 [gpu_worker.py:298] Available KV cache memory: 10.91 GiB
vllm1-1  | (Worker_TP1 pid=239) INFO 09-28 07:28:49 [gpu_worker.py:298] Available KV cache memory: 10.91 GiB
vllm1-1  | (Worker_TP3 pid=241) INFO 09-28 07:28:49 [gpu_worker.py:298] Available KV cache memory: 10.91 GiB
vllm1-1  | (Worker_TP0 pid=238) INFO 09-28 07:28:49 [gpu_worker.py:298] Available KV cache memory: 10.91 GiB
vllm1-1  | (EngineCore_DP0 pid=166) INFO 09-28 07:28:49 [kv_cache_utils.py:1087] GPU KV cache size: 238,272 tokens
vllm1-1  | (EngineCore_DP0 pid=166) INFO 09-28 07:28:49 [kv_cache_utils.py:1091] Maximum concurrency for 262,144 tokens per request: 3.62x
vllm1-1  | (EngineCore_DP0 pid=166) INFO 09-28 07:28:49 [kv_cache_utils.py:1087] GPU KV cache size: 238,272 tokens
vllm1-1  | (EngineCore_DP0 pid=166) INFO 09-28 07:28:49 [kv_cache_utils.py:1091] Maximum concurrency for 262,144 tokens per request: 3.62x
vllm1-1  | (EngineCore_DP0 pid=166) INFO 09-28 07:28:49 [kv_cache_utils.py:1087] GPU KV cache size: 238,272 tokens
vllm1-1  | (EngineCore_DP0 pid=166) INFO 09-28 07:28:49 [kv_cache_utils.py:1091] Maximum concurrency for 262,144 tokens per request: 3.62x
vllm1-1  | (EngineCore_DP0 pid=166) INFO 09-28 07:28:49 [kv_cache_utils.py:1087] GPU KV cache size: 238,272 tokens
vllm1-1  | (EngineCore_DP0 pid=166) INFO 09-28 07:28:49 [kv_cache_utils.py:1091] Maximum concurrency for 262,144 tokens per request: 3.62x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0% 0/67 [00:00<?, ?it/s]
vllm1-1  | [rank3]:[E928 07:28:50.788602763 ProcessGroupNCCL.cpp:2068] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: HIP error: operation not permitted when stream is capturing
vllm1-1  | HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
vllm1-1  | For debugging consider passing AMD_SERIALIZE_KERNEL=3
vllm1-1  | Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
vllm1-1  | 
vllm1-1  | Exception raised from c10_hip_check_implementation at /app/pytorch/c10/hip/HIPException.cpp:43 (most recent call first):
vllm1-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7fd11a09a6ac in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
vllm1-1  | frame #1: <unknown function> + 0x52eb1 (0x7fd150f7eeb1 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #2: c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) + 0x1b2 (0x7fd150f7eb52 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fd153989ece in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7fd15399a370 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x9de (0x7fd15399d9de in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0xeb (0x7fd15399fc9b in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #7: <unknown function> + 0xdc253 (0x7fd118576253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
vllm1-1  | frame #8: <unknown function> + 0x94ac3 (0x7fd16c228ac3 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | frame #9: <unknown function> + 0x126850 (0x7fd16c2ba850 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | 
vllm1-1  | [rank0]:[E928 07:28:50.788605239 ProcessGroupNCCL.cpp:2068] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: HIP error: operation not permitted when stream is capturing
vllm1-1  | HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
vllm1-1  | For debugging consider passing AMD_SERIALIZE_KERNEL=3
vllm1-1  | Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
vllm1-1  | 
vllm1-1  | Exception raised from c10_hip_check_implementation at /app/pytorch/c10/hip/HIPException.cpp:43 (most recent call first):
vllm1-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7f454a2df6ac in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
vllm1-1  | frame #1: <unknown function> + 0x52eb1 (0x7f45811c3eb1 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #2: c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) + 0x1b2 (0x7f45811c3b52 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7f4583bceece in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7f4583bdf370 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x9de (0x7f4583be29de in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0xeb (0x7f4583be4c9b in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #7: <unknown function> + 0xdc253 (0x7f45487bb253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
vllm1-1  | frame #8: <unknown function> + 0x94ac3 (0x7f459c46dac3 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | frame #9: <unknown function> + 0x126850 (0x7f459c4ff850 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | 
vllm1-1  | terminate called after throwing an instance of 'c10::DistBackendErrorterminate called after throwing an instance of ''
vllm1-1  | c10::DistBackendError'
vllm1-1  |   what():    what():  [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: HIP error: operation not permitted when stream is capturing
vllm1-1  | HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
vllm1-1  | For debugging consider passing AMD_SERIALIZE_KERNEL=3
vllm1-1  | Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
vllm1-1  | 
vllm1-1  | Exception raised from c10_hip_check_implementation at /app/pytorch/c10/hip/HIPException.cpp:43 (most recent call first):
vllm1-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7fd11a09a6ac in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
vllm1-1  | frame #1: <unknown function> + 0x52eb1 (0x7fd150f7eeb1 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #2: c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) + 0x1b2 (0x7fd150f7eb52 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fd153989ece in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7fd15399a370 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x9de (0x7fd15399d9de in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0xeb (0x7fd15399fc9b in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #7: <unknown function> + 0xdc253 (0x7fd118576253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
vllm1-1  | frame #8: <unknown function> + 0x94ac3 (0x7fd16c228ac3 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | frame #9: <unknown function> + 0x126850 (0x7fd16c2ba850 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | 
vllm1-1  | Exception raised from run at /app/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2074 (most recent call first):
vllm1-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7fd11a09a6ac in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
vllm1-1  | frame #1: <unknown function> + 0x28fc222 (0x7fd153976222 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #2: <unknown function> + 0x855634 (0x7fd1518cf634 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #3: <unknown function> + 0xdc253 (0x7fd118576253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
vllm1-1  | frame #4: <unknown function> + 0x94ac3 (0x7fd16c228ac3 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | frame #5: <unknown function> + 0x126850 (0x7fd16c2ba850 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: HIP error: operation not permitted when stream is capturing
vllm1-1  | HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
vllm1-1  | For debugging consider passing AMD_SERIALIZE_KERNEL=3
vllm1-1  | Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
vllm1-1  | 
vllm1-1  | Exception raised from c10_hip_check_implementation at /app/pytorch/c10/hip/HIPException.cpp:43 (most recent call first):
vllm1-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7f454a2df6ac in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
vllm1-1  | frame #1: <unknown function> + 0x52eb1 (0x7f45811c3eb1 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #2: c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) + 0x1b2 (0x7f45811c3b52 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7f4583bceece in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7f4583bdf370 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x9de (0x7f4583be29de in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0xeb (0x7f4583be4c9b in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #7: <unknown function> + 0xdc253 (0x7f45487bb253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
vllm1-1  | frame #8: <unknown function> + 0x94ac3 (0x7f459c46dac3 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | frame #9: <unknown function> + 0x126850 (0x7f459c4ff850 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | 
vllm1-1  | Exception raised from run at /app/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2074 (most recent call first):
vllm1-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7f454a2df6ac in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
vllm1-1  | frame #1: <unknown function> + 0x28fc222 (0x7f4583bbb222 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #2: <unknown function> + 0x855634 (0x7f4581b14634 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #3: <unknown function> + 0xdc253 (0x7f45487bb253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
vllm1-1  | frame #4: <unknown function> + 0x94ac3 (0x7f459c46dac3 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | frame #5: <unknown function> + 0x126850 (0x7f459c4ff850 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | 
vllm1-1  | 
vllm1-1  | [rank1]:[E928 07:28:50.793438616 ProcessGroupNCCL.cpp:2068] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: HIP error: operation not permitted when stream is capturing
vllm1-1  | HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
vllm1-1  | For debugging consider passing AMD_SERIALIZE_KERNEL=3
vllm1-1  | Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
vllm1-1  | 
vllm1-1  | Exception raised from c10_hip_check_implementation at /app/pytorch/c10/hip/HIPException.cpp:43 (most recent call first):
vllm1-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7fe97ea496ac in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
vllm1-1  | frame #1: <unknown function> + 0x52eb1 (0x7fe9b592deb1 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #2: c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) + 0x1b2 (0x7fe9b592db52 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fe9b8338ece in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7fe9b8349370 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x9de (0x7fe9b834c9de in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0xeb (0x7fe9b834ec9b in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #7: <unknown function> + 0xdc253 (0x7fe97cf25253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
vllm1-1  | frame #8: <unknown function> + 0x94ac3 (0x7fe9d0bd7ac3 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | frame #9: <unknown function> + 0x126850 (0x7fe9d0c69850 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | 
vllm1-1  | terminate called after throwing an instance of 'c10::DistBackendError'
vllm1-1  |   what():  [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: HIP error: operation not permitted when stream is capturing
vllm1-1  | HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
vllm1-1  | For debugging consider passing AMD_SERIALIZE_KERNEL=3
vllm1-1  | Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
vllm1-1  | 
vllm1-1  | Exception raised from c10_hip_check_implementation at /app/pytorch/c10/hip/HIPException.cpp:43 (most recent call first):
vllm1-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7fe97ea496ac in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
vllm1-1  | frame #1: <unknown function> + 0x52eb1 (0x7fe9b592deb1 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #2: c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) + 0x1b2 (0x7fe9b592db52 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fe9b8338ece in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7fe9b8349370 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x9de (0x7fe9b834c9de in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0xeb (0x7fe9b834ec9b in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #7: <unknown function> + 0xdc253 (0x7fe97cf25253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
vllm1-1  | frame #8: <unknown function> + 0x94ac3 (0x7fe9d0bd7ac3 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | frame #9: <unknown function> + 0x126850 (0x7fe9d0c69850 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | 
vllm1-1  | Exception raised from run at /app/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2074 (most recent call first):
vllm1-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7fe97ea496ac in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
vllm1-1  | frame #1: <unknown function> + 0x28fc222 (0x7fe9b8325222 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #2: <unknown function> + 0x855634 (0x7fe9b627e634 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #3: <unknown function> + 0xdc253 (0x7fe97cf25253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
vllm1-1  | frame #4: <unknown function> + 0x94ac3 (0x7fe9d0bd7ac3 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | frame #5: <unknown function> + 0x126850 (0x7fe9d0c69850 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | 
vllm1-1  | [rank2]:[E928 07:28:50.856274153 ProcessGroupNCCL.cpp:2068] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: HIP error: operation not permitted when stream is capturing
vllm1-1  | HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
vllm1-1  | For debugging consider passing AMD_SERIALIZE_KERNEL=3
vllm1-1  | Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
vllm1-1  | 
vllm1-1  | Exception raised from c10_hip_check_implementation at /app/pytorch/c10/hip/HIPException.cpp:43 (most recent call first):
vllm1-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7fc47dfc16ac in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
vllm1-1  | frame #1: <unknown function> + 0x52eb1 (0x7fc4b4ea5eb1 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #2: c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) + 0x1b2 (0x7fc4b4ea5b52 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fc4b78b0ece in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7fc4b78c1370 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x9de (0x7fc4b78c49de in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0xeb (0x7fc4b78c6c9b in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #7: <unknown function> + 0xdc253 (0x7fc47c49d253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
vllm1-1  | frame #8: <unknown function> + 0x94ac3 (0x7fc4d014fac3 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | frame #9: <unknown function> + 0x126850 (0x7fc4d01e1850 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | 
vllm1-1  | terminate called after throwing an instance of 'c10::DistBackendError'
vllm1-1  |   what():  [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: HIP error: operation not permitted when stream is capturing
vllm1-1  | HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
vllm1-1  | For debugging consider passing AMD_SERIALIZE_KERNEL=3
vllm1-1  | Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
vllm1-1  | 
vllm1-1  | Exception raised from c10_hip_check_implementation at /app/pytorch/c10/hip/HIPException.cpp:43 (most recent call first):
vllm1-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7fc47dfc16ac in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
vllm1-1  | frame #1: <unknown function> + 0x52eb1 (0x7fc4b4ea5eb1 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #2: c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) + 0x1b2 (0x7fc4b4ea5b52 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
vllm1-1  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fc4b78b0ece in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7fc4b78c1370 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x9de (0x7fc4b78c49de in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0xeb (0x7fc4b78c6c9b in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #7: <unknown function> + 0xdc253 (0x7fc47c49d253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
vllm1-1  | frame #8: <unknown function> + 0x94ac3 (0x7fc4d014fac3 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | frame #9: <unknown function> + 0x126850 (0x7fc4d01e1850 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | 
vllm1-1  | Exception raised from run at /app/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2074 (most recent call first):
vllm1-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7fc47dfc16ac in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
vllm1-1  | frame #1: <unknown function> + 0x28fc222 (0x7fc4b789d222 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #2: <unknown function> + 0x855634 (0x7fc4b57f6634 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
vllm1-1  | frame #3: <unknown function> + 0xdc253 (0x7fc47c49d253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
vllm1-1  | frame #4: <unknown function> + 0x94ac3 (0x7fc4d014fac3 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | frame #5: <unknown function> + 0x126850 (0x7fc4d01e1850 in /lib/x86_64-linux-gnu/libc.so.6)
vllm1-1  | 
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [multiproc_executor.py:154] Worker proc VllmWorker-3 died unexpectedly, shutting down executor.
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708] EngineCore failed to start.
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708] Traceback (most recent call last):
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]     engine_core = EngineCoreProc(*args, **kwargs)
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 498, in __init__
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]     super().__init__(vllm_config, executor_class, log_stats,
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 92, in __init__
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]     self._initialize_kv_caches(vllm_config)
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 207, in _initialize_kv_caches
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]     self.model_executor.initialize_from_config(kv_cache_configs)
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 75, in initialize_from_config
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]     self.collective_rpc("compile_or_warm_up_model")
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 264, in collective_rpc
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]     result = get_response(w, dequeue_timeout,
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 244, in get_response
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]     status, result = w.worker_response_mq.dequeue(
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 511, in dequeue
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]     with self.acquire_read(timeout, cancel, indefinite) as buf:
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]     return next(self.gen)
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]            ^^^^^^^^^^^^^^
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 455, in acquire_read
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708]     raise RuntimeError("cancelled")
vllm1-1  | (EngineCore_DP0 pid=166) ERROR 09-28 07:28:57 [core.py:708] RuntimeError: cancelled
vllm1-1  | (EngineCore_DP0 pid=166) Process EngineCore_DP0:
vllm1-1  | (EngineCore_DP0 pid=166) Traceback (most recent call last):
vllm1-1  | (EngineCore_DP0 pid=166)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
vllm1-1  | (EngineCore_DP0 pid=166)     self.run()
vllm1-1  | (EngineCore_DP0 pid=166)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
vllm1-1  | (EngineCore_DP0 pid=166)     self._target(*self._args, **self._kwargs)
vllm1-1  | (EngineCore_DP0 pid=166)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
vllm1-1  | (EngineCore_DP0 pid=166)     raise e
vllm1-1  | (EngineCore_DP0 pid=166)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
vllm1-1  | (EngineCore_DP0 pid=166)     engine_core = EngineCoreProc(*args, **kwargs)
vllm1-1  | (EngineCore_DP0 pid=166)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (EngineCore_DP0 pid=166)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 498, in __init__
vllm1-1  | (EngineCore_DP0 pid=166)     super().__init__(vllm_config, executor_class, log_stats,
vllm1-1  | (EngineCore_DP0 pid=166)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 92, in __init__
vllm1-1  | (EngineCore_DP0 pid=166)     self._initialize_kv_caches(vllm_config)
vllm1-1  | (EngineCore_DP0 pid=166)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 207, in _initialize_kv_caches
vllm1-1  | (EngineCore_DP0 pid=166)     self.model_executor.initialize_from_config(kv_cache_configs)
vllm1-1  | (EngineCore_DP0 pid=166)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 75, in initialize_from_config
vllm1-1  | (EngineCore_DP0 pid=166)     self.collective_rpc("compile_or_warm_up_model")
vllm1-1  | (EngineCore_DP0 pid=166)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 264, in collective_rpc
vllm1-1  | (EngineCore_DP0 pid=166)     result = get_response(w, dequeue_timeout,
vllm1-1  | (EngineCore_DP0 pid=166)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (EngineCore_DP0 pid=166)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 244, in get_response
vllm1-1  | (EngineCore_DP0 pid=166)     status, result = w.worker_response_mq.dequeue(
vllm1-1  | (EngineCore_DP0 pid=166)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (EngineCore_DP0 pid=166)   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 511, in dequeue
vllm1-1  | (EngineCore_DP0 pid=166)     with self.acquire_read(timeout, cancel, indefinite) as buf:
vllm1-1  | (EngineCore_DP0 pid=166)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (EngineCore_DP0 pid=166)   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
vllm1-1  | (EngineCore_DP0 pid=166)     return next(self.gen)
vllm1-1  | (EngineCore_DP0 pid=166)            ^^^^^^^^^^^^^^
vllm1-1  | (EngineCore_DP0 pid=166)   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 455, in acquire_read
vllm1-1  | (EngineCore_DP0 pid=166)     raise RuntimeError("cancelled")
vllm1-1  | (EngineCore_DP0 pid=166) RuntimeError: cancelled
vllm1-1  | (APIServer pid=1) Traceback (most recent call last):
vllm1-1  | (APIServer pid=1)   File "/usr/local/bin/vllm", line 7, in <module>
vllm1-1  | (APIServer pid=1)     sys.exit(main())
vllm1-1  | (APIServer pid=1)              ^^^^^^
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 54, in main
vllm1-1  | (APIServer pid=1)     args.dispatch_function(args)
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 57, in cmd
vllm1-1  | (APIServer pid=1)     uvloop.run(run_server(args))
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
vllm1-1  | (APIServer pid=1)     return __asyncio.run(
vllm1-1  | (APIServer pid=1)            ^^^^^^^^^^^^^^
vllm1-1  | (APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
vllm1-1  | (APIServer pid=1)     return runner.run(main)
vllm1-1  | (APIServer pid=1)            ^^^^^^^^^^^^^^^^
vllm1-1  | (APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
vllm1-1  | (APIServer pid=1)     return self._loop.run_until_complete(task)
vllm1-1  | (APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
vllm1-1  | (APIServer pid=1)     return await main
vllm1-1  | (APIServer pid=1)            ^^^^^^^^^^
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1884, in run_server
vllm1-1  | (APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1902, in run_server_worker
vllm1-1  | (APIServer pid=1)     async with build_async_engine_client(
vllm1-1  | (APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
vllm1-1  | (APIServer pid=1)     return await anext(self.gen)
vllm1-1  | (APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 180, in build_async_engine_client
vllm1-1  | (APIServer pid=1)     async with build_async_engine_client_from_engine_args(
vllm1-1  | (APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
vllm1-1  | (APIServer pid=1)     return await anext(self.gen)
vllm1-1  | (APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 225, in build_async_engine_client_from_engine_args
vllm1-1  | (APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
vllm1-1  | (APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 1571, in inner
vllm1-1  | (APIServer pid=1)     return fn(*args, **kwargs)
vllm1-1  | (APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
vllm1-1  | (APIServer pid=1)     return cls(
vllm1-1  | (APIServer pid=1)            ^^^^
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
vllm1-1  | (APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
vllm1-1  | (APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
vllm1-1  | (APIServer pid=1)     return AsyncMPClient(*client_args)
vllm1-1  | (APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 769, in __init__
vllm1-1  | (APIServer pid=1)     super().__init__(
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 448, in __init__
vllm1-1  | (APIServer pid=1)     with launch_core_engines(vllm_config, executor_class,
vllm1-1  | (APIServer pid=1)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
vllm1-1  | (APIServer pid=1)     next(self.gen)
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 732, in launch_core_engines
vllm1-1  | (APIServer pid=1)     wait_for_engine_startup(
vllm1-1  | (APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
vllm1-1  | (APIServer pid=1)     raise RuntimeError("Engine core initialization failed. "
vllm1-1  | (APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
vllm1-1  | /usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
vllm1-1  |   warnings.warn('resource_tracker: There appear to be %d '
vllm1-1  | /usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 5 leaked shared_memory objects to clean up at shutdown
vllm1-1  |   warnings.warn('resource_tracker: There appear to be %d '
vllm1-1 exited with code 0

@tlrmchlsmth tlrmchlsmth merged commit 61a3431 into vllm-project:main Sep 29, 2025
47 of 48 checks passed
@gshtras gshtras deleted the nccl_symbols_fix branch September 29, 2025 21:45
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
…ccl.so (vllm-project#25605)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
…ccl.so (#25605)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>