Name of failing test
`TP_SIZE=1 DP_SIZE=2 pytest -s -v "v1/test_async_llm_dp.py::test_load[ray-RequestOutputKind.DELTA]"`
Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
🧪 Describe the failing test
The error ends up looking like a Triton bug, with `AttributeError: module 'triton.language' has no attribute 'bfloat16'` reported. However, very early in the logs you can see the following:
```
INFO 06-17 07:32:31 [utils.py:384] Creating placement groups for data parallel
(pid=3893316) INFO 06-17 07:32:33 [importing.py:27] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
(pid=3893316) INFO 06-17 07:32:33 [importing.py:47] Triton not installed or not compatible; certain GPU-related functions will not be available.
(pid=3893316) WARNING 06-17 07:32:33 [importing.py:59] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
```
This is strange because Triton is fully installed in my environment, as usual.
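As a side note, the driver probe that those `importing.py` lines refer to can be reproduced in isolation. Here is a minimal sketch of my own (not part of the test), assuming Triton 3.x, where each discovered backend exposes a driver with an `is_active()` check, run inside a single-GPU Ray task the same way the engine actors are scheduled:

```python
# Hedged diagnostic sketch: count the Triton drivers that report active,
# from inside a Ray task (assumes Triton 3.x and a local Ray cluster).
import ray


@ray.remote(num_gpus=1)
def count_active_triton_drivers() -> int:
    # Triton 3.x discovers backends at import time; each backend bundles
    # a compiler and a driver, and the driver reports whether its runtime
    # (e.g. CUDA) is actually usable on this machine.
    from triton.backends import backends
    return sum(1 for backend in backends.values() if backend.driver.is_active())


ray.init()
# The importing.py warning fires when this comes back as 0 even though
# the task was scheduled onto a GPU.
print(ray.get(count_active_triton_drivers.remote()))
```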
Here is the full command and traceback:
```
TP_SIZE=1 DP_SIZE=2 pytest -s -v "v1/test_async_llm_dp.py::test_load[ray-RequestOutputKind.DELTA]"
INFO 06-17 07:32:14 [__init__.py:244] Automatically detected platform cuda.
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
============================================================================ test session starts =============================================================================
platform linux -- Python 3.12.4, pytest-8.3.3, pluggy-1.5.0 -- /home/mgoin/venvs/vllm/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/home/mgoin/code/vllm/tests/.hypothesis/examples'))
rootdir: /home/mgoin/code/vllm
configfile: pyproject.toml
plugins: forked-1.6.0, subtests-0.14.1, asyncio-0.24.0, shard-0.1.2, buildkite-test-collector-0.1.9, timeout-2.3.1, schemathesis-3.39.15, anyio-4.6.2.post1, mock-3.14.0, hypothesis-6.131.0, rerunfailures-14.0
asyncio: mode=Mode.STRICT, default_loop_scope=None
collecting ... INFO 06-17 07:32:25 [config.py:831] This model supports multiple tasks: {'generate', 'score', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 06-17 07:32:25 [config.py:3270] Downcasting torch.float32 to torch.bfloat16.
INFO 06-17 07:32:25 [config.py:1444] Using max model len 4096
WARNING 06-17 07:32:25 [interface.py:503] Current platform cuda does not have '_pytestfixturefunction' attribute.
WARNING 06-17 07:32:26 [interface.py:503] Current platform cuda does not have '__test__' attribute.
WARNING 06-17 07:32:26 [interface.py:503] Current platform cuda does not have '__bases__' attribute.
WARNING 06-17 07:32:26 [interface.py:503] Current platform cuda does not have '__test__' attribute.
WARNING 06-17 07:32:26 [interface.py:503] Current platform cuda does not have '_schemathesis_test' attribute.
collected 1 item
Running 1 items in this shard: tests/v1/test_async_llm_dp.py::test_load[ray-RequestOutputKind.DELTA]
v1/test_async_llm_dp.py::test_load[ray-RequestOutputKind.DELTA] INFO 06-17 07:32:26 [config.py:831] This model supports multiple tasks: {'generate', 'score', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 06-17 07:32:26 [config.py:3270] Downcasting torch.float32 to torch.bfloat16.
INFO 06-17 07:32:26 [config.py:1444] Using max model len 4096
INFO 06-17 07:32:26 [arg_utils.py:1095] Using host IP 216.81.245.69 as ray-based data parallel address
INFO 06-17 07:32:26 [config.py:2197] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 06-17 07:32:26 [cuda.py:91] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
2025-06-17 07:32:29,783 INFO worker.py:1832 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
INFO 06-17 07:32:31 [utils.py:384] Creating placement groups for data parallel
(pid=3893316) INFO 06-17 07:32:33 [importing.py:27] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
(pid=3893316) INFO 06-17 07:32:33 [importing.py:47] Triton not installed or not compatible; certain GPU-related functions will not be available.
(pid=3893316) WARNING 06-17 07:32:33 [importing.py:59] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
(pid=3893316) INFO 06-17 07:32:34 [__init__.py:244] Automatically detected platform cuda.
(DPEngineCoreActor pid=3893316) INFO 06-17 07:32:38 [core.py:70] Initializing a V1 LLM engine (v0.9.1.dev287+g89b1388d8) with config: model='ibm-research/PowerMoE-3b', speculative_config=None, tokenizer='ibm-research/PowerMoE-3b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=ibm-research/PowerMoE-3b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
(pid=3893317) INFO 06-17 07:32:33 [importing.py:27] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
(pid=3893317) INFO 06-17 07:32:33 [importing.py:47] Triton not installed or not compatible; certain GPU-related functions will not be available.
(pid=3893317) WARNING 06-17 07:32:33 [importing.py:59] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
(DPEngineCoreActor pid=3893317) WARNING 06-17 07:32:39 [utils.py:2756] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x717cd7814710>
(DPEngineCoreActor pid=3893316) INFO 06-17 07:32:40 [parallel_state.py:934] Adjusting world_size=2 rank=0 distributed_init_method=tcp://216.81.245.69:57446 for DP
(pid=3893317) INFO 06-17 07:32:35 [__init__.py:244] Automatically detected platform cuda.
(DPEngineCoreActor pid=3893316) INFO 06-17 07:32:40 [utils.py:1136] Found nccl from library libnccl.so.2
(DPEngineCoreActor pid=3893316) INFO 06-17 07:32:40 [pynccl.py:70] vLLM is using nccl==2.26.2
(DPEngineCoreActor pid=3893316) INFO 06-17 07:32:41 [cuda_communicator.py:65] Using naive all2all manager.
(DPEngineCoreActor pid=3893316) INFO 06-17 07:32:41 [parallel_state.py:1065] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(DPEngineCoreActor pid=3893316) WARNING 06-17 07:32:41 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(DPEngineCoreActor pid=3893316) INFO 06-17 07:32:41 [gpu_model_runner.py:1627] Starting to load model ibm-research/PowerMoE-3b...
(DPEngineCoreActor pid=3893316) INFO 06-17 07:32:42 [gpu_model_runner.py:1632] Loading model from scratch...
(DPEngineCoreActor pid=3893316) INFO 06-17 07:32:42 [cuda.py:259] Using Flash Attention backend on V1 engine.
(DPEngineCoreActor pid=3893317) INFO 06-17 07:32:42 [weight_utils.py:292] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:00<00:00, 55.38it/s]
(DPEngineCoreActor pid=3893316)
(DPEngineCoreActor pid=3893316) INFO 06-17 07:32:59 [default_loader.py:272] Loading weights took 16.47 seconds
(DPEngineCoreActor pid=3893317) INFO 06-17 07:32:38 [core.py:70] Initializing a V1 LLM engine (v0.9.1.dev287+g89b1388d8) with config: model='ibm-research/PowerMoE-3b', speculative_config=None, tokenizer='ibm-research/PowerMoE-3b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=ibm-research/PowerMoE-3b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
(DPEngineCoreActor pid=3893316) WARNING 06-17 07:32:39 [utils.py:2756] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x78f88f5bcb30>
(DPEngineCoreActor pid=3893317) INFO 06-17 07:32:40 [parallel_state.py:934] Adjusting world_size=2 rank=1 distributed_init_method=tcp://216.81.245.69:57446 for DP
(DPEngineCoreActor pid=3893317) INFO 06-17 07:32:40 [utils.py:1136] Found nccl from library libnccl.so.2
(DPEngineCoreActor pid=3893317) INFO 06-17 07:32:40 [pynccl.py:70] vLLM is using nccl==2.26.2
(DPEngineCoreActor pid=3893317) INFO 06-17 07:32:41 [cuda_communicator.py:65] Using naive all2all manager.
(DPEngineCoreActor pid=3893317) INFO 06-17 07:32:41 [parallel_state.py:1065] rank 1 in world size 2 is assigned as DP rank 1, PP rank 0, TP rank 0, EP rank 1
(DPEngineCoreActor pid=3893317) WARNING 06-17 07:32:41 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(DPEngineCoreActor pid=3893317) INFO 06-17 07:32:41 [gpu_model_runner.py:1627] Starting to load model ibm-research/PowerMoE-3b...
(DPEngineCoreActor pid=3893317) INFO 06-17 07:32:42 [gpu_model_runner.py:1632] Loading model from scratch...
(DPEngineCoreActor pid=3893317) INFO 06-17 07:32:42 [cuda.py:259] Using Flash Attention backend on V1 engine.
(DPEngineCoreActor pid=3893316) INFO 06-17 07:32:42 [weight_utils.py:292] Using model weights format ['*.safetensors']
(DPEngineCoreActor pid=3893316) INFO 06-17 07:33:00 [gpu_model_runner.py:1656] Model loading took 3.3375 GiB and 17.195956 seconds
(DPEngineCoreActor pid=3893316) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::DPEngineCoreActor.__init__() (pid=3893316, ip=216.81.245.69, actor_id=fa979d0e0af9f039b9196c4b01000000, repr=<vllm.v1.engine.core.DPEngineCoreActor object at 0x78f88f5a64e0>)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 927, in __init__
(DPEngineCoreActor pid=3893316) super().__init__(vllm_config, on_head_node, "", executor_class,
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 769, in __init__
(DPEngineCoreActor pid=3893316) super().__init__(vllm_config, on_head_node, handshake_address,
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 395, in __init__
(DPEngineCoreActor pid=3893316) super().__init__(vllm_config, executor_class, log_stats,
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 83, in __init__
(DPEngineCoreActor pid=3893316) self._initialize_kv_caches(vllm_config)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 143, in _initialize_kv_caches
(DPEngineCoreActor pid=3893316) available_gpu_memory = self.model_executor.determine_available_memory()
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
(DPEngineCoreActor pid=3893316) output = self.collective_rpc("determine_available_memory")
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
(DPEngineCoreActor pid=3893316) answer = run_method(self.driver_worker, method, args, kwargs)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/utils.py", line 2690, in run_method
(DPEngineCoreActor pid=3893316) return func(*args, **kwargs)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(DPEngineCoreActor pid=3893316) return func(*args, **kwargs)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_worker.py", line 210, in determine_available_memory
(DPEngineCoreActor pid=3893316) self.model_runner.profile_run()
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 2045, in profile_run
(DPEngineCoreActor pid=3893316) hidden_states = self._dummy_run(self.max_num_tokens)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(DPEngineCoreActor pid=3893316) return func(*args, **kwargs)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1880, in _dummy_run
(DPEngineCoreActor pid=3893316) outputs = model(
(DPEngineCoreActor pid=3893316) ^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(DPEngineCoreActor pid=3893316) return self._call_impl(*args, **kwargs)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(DPEngineCoreActor pid=3893316) return forward_call(*args, **kwargs)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/model_executor/models/granitemoe.py", line 405, in forward
(DPEngineCoreActor pid=3893316) hidden_states = self.model(input_ids, positions, intermediate_tensors,
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/compilation/decorators.py", line 173, in __call__
(DPEngineCoreActor pid=3893316) return self.forward(*args, **kwargs)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/model_executor/models/granitemoe.py", line 301, in forward
(DPEngineCoreActor pid=3893316) hidden_states = layer(positions, hidden_states)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(DPEngineCoreActor pid=3893316) return self._call_impl(*args, **kwargs)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(DPEngineCoreActor pid=3893316) return forward_call(*args, **kwargs)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/model_executor/models/granitemoe.py", line 239, in forward
(DPEngineCoreActor pid=3893316) hidden_states = self.block_sparse_moe(hidden_states)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(DPEngineCoreActor pid=3893316) return self._call_impl(*args, **kwargs)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(DPEngineCoreActor pid=3893316) return forward_call(*args, **kwargs)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/model_executor/models/granitemoe.py", line 101, in forward
(DPEngineCoreActor pid=3893316) final_hidden_states = self.experts(hidden_states, router_logits)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(DPEngineCoreActor pid=3893316) return self._call_impl(*args, **kwargs)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(DPEngineCoreActor pid=3893316) return forward_call(*args, **kwargs)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1359, in forward
(DPEngineCoreActor pid=3893316) return torch.ops.vllm.moe_forward(hidden_states, router_logits,
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
(DPEngineCoreActor pid=3893316) return self._op(*args, **(kwargs or {}))
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1522, in moe_forward
(DPEngineCoreActor pid=3893316) return self.forward_impl(hidden_states, router_logits)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1449, in forward_impl
(DPEngineCoreActor pid=3893316) final_hidden_states = self.quant_method.apply(
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 568, in apply
(DPEngineCoreActor pid=3893316) return self.forward(
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/model_executor/custom_op.py", line 24, in forward
(DPEngineCoreActor pid=3893316) return self._forward_method(*args, **kwargs)
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 628, in forward_cuda
(DPEngineCoreActor pid=3893316) return self.fused_experts(
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1186, in fused_experts
(DPEngineCoreActor pid=3893316) return dispatch_fused_experts_func(inplace)(
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1125, in torch_vllm_inplace_fused_experts
(DPEngineCoreActor pid=3893316) torch.ops.vllm.inplace_fused_experts(**kwargs)
(DPEngineCoreActor pid=3893316) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
(DPEngineCoreActor pid=3893316) return self._op(*args, **(kwargs or {}))
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1018, in inplace_fused_experts
(DPEngineCoreActor pid=3893316) fused_experts_impl(hidden_states, w1, w2, topk_weights, topk_ids, True,
(DPEngineCoreActor pid=3893316) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1295, in fused_experts_impl
(DPEngineCoreActor pid=3893316) compute_type = tl.bfloat16
(DPEngineCoreActor pid=3893316) ^^^^^^^^^^^
(DPEngineCoreActor pid=3893316) AttributeError: module 'triton.language' has no attribute 'bfloat16'
(DPEngineCoreActor pid=3893316) WARNING 06-17 07:33:00 [fused_moe.py:683] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/configs/E=40,N=256,device_name=NVIDIA_H100_80GB_HBM3.json
FAILED
================================================================================== FAILURES ==================================================================================
___________________________________________________________________ test_load[ray-RequestOutputKind.DELTA] ___________________________________________________________________
output_kind = <RequestOutputKind.DELTA: 1>, data_parallel_backend = 'ray'
@pytest.mark.parametrize(
"output_kind",
[
RequestOutputKind.DELTA,
RequestOutputKind.FINAL_ONLY,
],
)
@pytest.mark.parametrize("data_parallel_backend", ["mp", "ray"])
@pytest.mark.asyncio
async def test_load(output_kind: RequestOutputKind,
data_parallel_backend: str):
with ExitStack() as after:
prompt = "This is a test of data parallel"
engine_args.data_parallel_backend = data_parallel_backend
> engine = AsyncLLM.from_engine_args(engine_args)
v1/test_async_llm_dp.py:82:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../vllm/v1/engine/async_llm.py:189: in from_engine_args
return cls(
../vllm/v1/engine/async_llm.py:124: in __init__
self.engine_core = EngineCoreClient.make_async_mp_client(
../vllm/v1/engine/core_client.py:89: in make_async_mp_client
return RayDPClient(vllm_config, executor_class, log_stats,
../vllm/v1/engine/core_client.py:1099: in __init__
super().__init__(vllm_config, executor_class, log_stats,
../vllm/v1/engine/core_client.py:919: in __init__
super().__init__(vllm_config, executor_class, log_stats,
../vllm/v1/engine/core_client.py:716: in __init__
super().__init__(
../vllm/v1/engine/core_client.py:422: in __init__
self._init_engines_direct(vllm_config, local_only,
../vllm/v1/engine/core_client.py:1125: in _init_engines_direct
self.resources.engine_manager = CoreEngineActorManager(
../vllm/v1/utils.py:370: in __init__
ray.get(refs)
../../../venvs/vllm/lib/python3.12/site-packages/ray/_private/auto_init_hook.py:21: in auto_init_wrapper
return fn(*args, **kwargs)
../../../venvs/vllm/lib/python3.12/site-packages/ray/_private/client_mode_hook.py:103: in wrapper
return func(*args, **kwargs)
../../../venvs/vllm/lib/python3.12/site-packages/ray/_private/worker.py:2771: in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <ray._private.worker.Worker object at 0x729848397e00>
object_refs = [ObjectRef(16310a0f0a45af5cfa979d0e0af9f039b9196c4b0100000001000000), ObjectRef(32d950ec0ccf9d2a298429e99af9b3dd80f746b40100000001000000)], timeout = None
return_exceptions = False, skip_deserialization = False
def get_objects(
self,
object_refs: list,
timeout: Optional[float] = None,
return_exceptions: bool = False,
skip_deserialization: bool = False,
):
"""Get the values in the object store associated with the IDs.
Return the values from the local object store for object_refs. This
will block until all the values for object_refs have been written to
the local object store.
Args:
object_refs: A list of the object refs
whose values should be retrieved.
timeout: The maximum amount of time in
seconds to wait before returning.
return_exceptions: If any of the objects deserialize to an
Exception object, whether to return them as values in the
returned list. If False, then the first found exception will be
raised.
skip_deserialization: If true, only the buffer will be released and
the object associated with the buffer will not be deserailized.
Returns:
list: List of deserialized objects or None if skip_deserialization is True.
bytes: UUID of the debugger breakpoint we should drop
into or b"" if there is no breakpoint.
"""
# Make sure that the values are object refs.
for object_ref in object_refs:
if not isinstance(object_ref, ObjectRef):
raise TypeError(
f"Attempting to call `get` on the value {object_ref}, "
"which is not an ray.ObjectRef."
)
timeout_ms = (
int(timeout * 1000) if timeout is not None and timeout != -1 else -1
)
data_metadata_pairs: List[
Tuple[ray._raylet.Buffer, bytes]
] = self.core_worker.get_objects(
object_refs,
timeout_ms,
)
debugger_breakpoint = b""
for data, metadata in data_metadata_pairs:
if metadata:
metadata_fields = metadata.split(b",")
if len(metadata_fields) >= 2 and metadata_fields[1].startswith(
ray_constants.OBJECT_METADATA_DEBUG_PREFIX
):
debugger_breakpoint = metadata_fields[1][
len(ray_constants.OBJECT_METADATA_DEBUG_PREFIX) :
]
if skip_deserialization:
return None, debugger_breakpoint
values = self.deserialize_objects(data_metadata_pairs, object_refs)
if not return_exceptions:
# Raise exceptions instead of returning them to the user.
for i, value in enumerate(values):
if isinstance(value, RayError):
if isinstance(value, ray.exceptions.ObjectLostError):
global_worker.core_worker.dump_object_store_memory_usage()
if isinstance(value, RayTaskError):
raise value.as_instanceof_cause()
else:
> raise value
E ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::DPEngineCoreActor.__init__() (pid=3893316, ip=216.81.245.69, actor_id=fa979d0e0af9f039b9196c4b01000000, repr=<vllm.v1.engine.core.DPEngineCoreActor object at 0x78f88f5a64e0>)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 927, in __init__
E super().__init__(vllm_config, on_head_node, "", executor_class,
E File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 769, in __init__
E super().__init__(vllm_config, on_head_node, handshake_address,
E File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 395, in __init__
E super().__init__(vllm_config, executor_class, log_stats,
E File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 83, in __init__
E self._initialize_kv_caches(vllm_config)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 143, in _initialize_kv_caches
E available_gpu_memory = self.model_executor.determine_available_memory()
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
E output = self.collective_rpc("determine_available_memory")
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
E answer = run_method(self.driver_worker, method, args, kwargs)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/utils.py", line 2690, in run_method
E return func(*args, **kwargs)
E ^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
E return func(*args, **kwargs)
E ^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_worker.py", line 210, in determine_available_memory
E self.model_runner.profile_run()
E File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 2045, in profile_run
E hidden_states = self._dummy_run(self.max_num_tokens)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
E return func(*args, **kwargs)
E ^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1880, in _dummy_run
E outputs = model(
E ^^^^^^
E File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
E return self._call_impl(*args, **kwargs)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
E return forward_call(*args, **kwargs)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/model_executor/models/granitemoe.py", line 405, in forward
E hidden_states = self.model(input_ids, positions, intermediate_tensors,
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/compilation/decorators.py", line 173, in __call__
E return self.forward(*args, **kwargs)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/model_executor/models/granitemoe.py", line 301, in forward
E hidden_states = layer(positions, hidden_states)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
E return self._call_impl(*args, **kwargs)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
E return forward_call(*args, **kwargs)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/model_executor/models/granitemoe.py", line 239, in forward
E hidden_states = self.block_sparse_moe(hidden_states)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
E return self._call_impl(*args, **kwargs)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
E return forward_call(*args, **kwargs)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/model_executor/models/granitemoe.py", line 101, in forward
E final_hidden_states = self.experts(hidden_states, router_logits)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
E return self._call_impl(*args, **kwargs)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
E return forward_call(*args, **kwargs)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1359, in forward
E return torch.ops.vllm.moe_forward(hidden_states, router_logits,
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
E return self._op(*args, **(kwargs or {}))
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1522, in moe_forward
E return self.forward_impl(hidden_states, router_logits)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1449, in forward_impl
E final_hidden_states = self.quant_method.apply(
E ^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 568, in apply
E return self.forward(
E ^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/model_executor/custom_op.py", line 24, in forward
E return self._forward_method(*args, **kwargs)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 628, in forward_cuda
E return self.fused_experts(
E ^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1186, in fused_experts
E return dispatch_fused_experts_func(inplace)(
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1125, in torch_vllm_inplace_fused_experts
E torch.ops.vllm.inplace_fused_experts(**kwargs)
E File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
E return self._op(*args, **(kwargs or {}))
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1018, in inplace_fused_experts
E fused_experts_impl(hidden_states, w1, w2, topk_weights, topk_ids, True,
E File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1295, in fused_experts_impl
E compute_type = tl.bfloat16
E ^^^^^^^^^^^
E AttributeError: module 'triton.language' has no attribute 'bfloat16'
../../../venvs/vllm/lib/python3.12/site-packages/ray/_private/worker.py:921: ActorDiedError
============================================================================== warnings summary ==============================================================================
../../../venvs/vllm/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
ref_error: type[Exception] = jsonschema.RefResolutionError,
tests/v1/test_async_llm_dp.py::test_load[ray-RequestOutputKind.DELTA]
/home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=3891956) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================== short test summary info ===========================================================================
FAILED v1/test_async_llm_dp.py::test_load[ray-RequestOutputKind.DELTA] - ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::DPEngineCoreActor.__init__() (pid=3893316, ip=216.81.245.69, actor_id=fa979d0e0af9f039b9196c4b01000000, repr=<vllm.v1.engine.core.DPEngineCoreActor object at 0x78f88f5a64e0>)
======================================================================= 1 failed, 2 warnings in 45.37s =======================================================================
(DPEngineCoreActor pid=3893316) [rank0]:[W617 07:33:01.422487841 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(DPEngineCoreActor pid=3893317) INFO 06-17 07:32:59 [default_loader.py:272] Loading weights took 16.77 seconds
(DPEngineCoreActor pid=3893317) INFO 06-17 07:33:00 [gpu_model_runner.py:1656] Model loading took 3.3375 GiB and 17.309669 seconds
(DPEngineCoreActor pid=3893317) WARNING 06-17 07:33:00 [fused_moe.py:683] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/configs/E=40,N=256,device_name=NVIDIA_H100_80GB_HBM3.json
(DPEngineCoreActor pid=3893317) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::DPEngineCoreActor.__init__() (pid=3893317, ip=216.81.245.69, actor_id=298429e99af9b3dd80f746b401000000, repr=<vllm.v1.engine.core.DPEngineCoreActor object at 0x717cd778df70>)
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 3x across cluster]
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 83, in __init__ [repeated 4x across cluster]
(DPEngineCoreActor pid=3893317) super().__init__(vllm_config, on_head_node, "", executor_class,
(DPEngineCoreActor pid=3893317) super().__init__(vllm_config, on_head_node, handshake_address,
(DPEngineCoreActor pid=3893317) super().__init__(vllm_config, executor_class, log_stats,
(DPEngineCoreActor pid=3893317) self._initialize_kv_caches(vllm_config)
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 143, in _initialize_kv_caches
(DPEngineCoreActor pid=3893317) available_gpu_memory = self.model_executor.determine_available_memory()
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_worker.py", line 210, in determine_available_memory [repeated 2x across cluster]
(DPEngineCoreActor pid=3893317) output = self.collective_rpc("determine_available_memory")
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
(DPEngineCoreActor pid=3893317) answer = run_method(self.driver_worker, method, args, kwargs)
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/utils.py", line 2690, in run_method
(DPEngineCoreActor pid=3893317) return func(*args, **kwargs) [repeated 3x across cluster]
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^ [repeated 3x across cluster]
(DPEngineCoreActor pid=3893317) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context [repeated 2x across cluster]
(DPEngineCoreActor pid=3893317) self.model_runner.profile_run()
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 2045, in profile_run
(DPEngineCoreActor pid=3893317) hidden_states = self._dummy_run(self.max_num_tokens)
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1880, in _dummy_run
(DPEngineCoreActor pid=3893317) outputs = model(
(DPEngineCoreActor pid=3893317) ^^^^^^
(DPEngineCoreActor pid=3893317) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl [repeated 4x across cluster]
(DPEngineCoreActor pid=3893317) return self._call_impl(*args, **kwargs) [repeated 4x across cluster]
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 4x across cluster]
(DPEngineCoreActor pid=3893317) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl [repeated 4x across cluster]
(DPEngineCoreActor pid=3893317) return forward_call(*args, **kwargs) [repeated 4x across cluster]
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 5x across cluster]
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/model_executor/models/granitemoe.py", line 101, in forward [repeated 4x across cluster]
(DPEngineCoreActor pid=3893317) hidden_states = self.model(input_ids, positions, intermediate_tensors,
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/compilation/decorators.py", line 173, in __call__
(DPEngineCoreActor pid=3893317) return self.forward(*args, **kwargs)
(DPEngineCoreActor pid=3893317) hidden_states = layer(positions, hidden_states)
(DPEngineCoreActor pid=3893317) hidden_states = self.block_sparse_moe(hidden_states)
(DPEngineCoreActor pid=3893317) final_hidden_states = self.experts(hidden_states, router_logits)
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1359, in forward
(DPEngineCoreActor pid=3893317) return torch.ops.vllm.moe_forward(hidden_states, router_logits,
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893317) File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__ [repeated 2x across cluster]
(DPEngineCoreActor pid=3893317) return self._op(*args, **(kwargs or {})) [repeated 2x across cluster]
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1522, in moe_forward
(DPEngineCoreActor pid=3893317) return self.forward_impl(hidden_states, router_logits)
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1449, in forward_impl
(DPEngineCoreActor pid=3893317) final_hidden_states = self.quant_method.apply(
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 568, in apply
(DPEngineCoreActor pid=3893317) return self.forward(
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/model_executor/custom_op.py", line 24, in forward
(DPEngineCoreActor pid=3893317) return self._forward_method(*args, **kwargs)
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 628, in forward_cuda
(DPEngineCoreActor pid=3893317) return self.fused_experts(
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^^^^^^^^^
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1186, in fused_experts
(DPEngineCoreActor pid=3893317) return dispatch_fused_experts_func(inplace)(
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1125, in torch_vllm_inplace_fused_experts
(DPEngineCoreActor pid=3893317) torch.ops.vllm.inplace_fused_experts(**kwargs)
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1018, in inplace_fused_experts
(DPEngineCoreActor pid=3893317) fused_experts_impl(hidden_states, w1, w2, topk_weights, topk_ids, True,
(DPEngineCoreActor pid=3893317) File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1295, in fused_experts_impl
(DPEngineCoreActor pid=3893317) compute_type = tl.bfloat16
(DPEngineCoreActor pid=3893317) ^^^^^^^^^^^
(DPEngineCoreActor pid=3893317) AttributeError: module 'triton.language' has no attribute 'bfloat16'
(DPEngineCoreActor pid=3893317) [rank1]:[W617 07:33:02.549104798 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
```
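Putting those log lines together, the `AttributeError` looks like a downstream symptom rather than the root cause: once the probe in the actor reports zero active drivers, Triton support is disabled and dummy decorators are substituted, but `fused_moe.py:1295` still dereferences `tl.bfloat16` directly, which the placeholder does not provide. A minimal illustrative sketch of that failure shape (the placeholder class and names here are hypothetical, not vLLM's actual code):

```python
# Illustrative sketch of the failure mode, with hypothetical names: a
# placeholder standing in for triton.language supplies dummy decorators
# but no dtype attributes, so direct attribute access blows up later.
import types


class _TritonLanguagePlaceholder(types.ModuleType):
    def __init__(self) -> None:
        super().__init__("triton.language")

    def jit(self, fn=None, **kwargs):
        # Dummy decorator: leave the function uncompiled but callable.
        if fn is None:
            return lambda f: f
        return fn


tl = _TritonLanguagePlaceholder()


@tl.jit  # decorating still "works" with the dummy decorator...
def kernel():
    pass


compute_type = tl.bfloat16  # ...but this raises exactly the error in the logs:
# AttributeError: module 'triton.language' has no attribute 'bfloat16'
```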
📝 History of failing test
It looks like the test started failing on June 15th: https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests/88965c05-73cc-8bc4-bca5-902200fc81b7?period=28days&tags=scm.branch%3Amain

CC List.