
Conversation


@wasertech wasertech commented Dec 14, 2025

Purpose

Ensure inputs_embeds is cast to the correct dtype. This should help with #29349 and the following error:

RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half

(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843] EngineCore failed to start.
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843] Traceback (most recent call last):
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in run_engine_core
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 610, in __init__
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     super().__init__(
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 109, in __init__
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 235, in _initialize_kv_caches
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 479, in run_method
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return func(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return func(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 324, in determine_available_memory
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     self.model_runner.profile_run()
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4357, in profile_run
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                                         ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return func(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4071, in _dummy_run
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     outputs = self.model(
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]               ^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/apertus.py", line 552, in forward
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     model_output = self.model(
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                    ^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 360, in __call__
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return self.forward(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/apertus.py", line 395, in forward
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     hidden_states, residual = layer(positions, hidden_states, residual)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/apertus.py", line 313, in forward
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     hidden_states = self.mlp(hidden_states)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                     ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/apertus.py", line 110, in forward
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     x, _ = self.down_proj(x)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 1405, in forward
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     output_parallel = self.quant_method.apply(self, input_parallel, bias_)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 240, in apply
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return dispatch_unquantized_gemm()(layer, x, layer.weight, bias)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/utils.py", line 105, in default_unquantized_gemm
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return torch.nn.functional.linear(x, weight, bias)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/parameter.py", line 126, in __torch_function__
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return super().__torch_function__(func, types, args, kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843] RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half
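
For reference, a minimal standalone repro of this class of error in plain PyTorch (not the vLLM call path; shapes are arbitrary): a float32 activation hitting a half-precision weight in torch.nn.functional.linear fails with the same mat1/mat2 message (exact wording can vary slightly by device/backend).

import torch

x = torch.randn(4, 8, dtype=torch.float32)    # activations left in float32
w = torch.randn(16, 8, dtype=torch.float16)   # weights loaded in half precision

# Raises RuntimeError: expected mat1 and mat2 to have the same dtype,
# but got: float != c10::Half (message as seen on CUDA; CPU wording may differ)
y = torch.nn.functional.linear(x, w)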

Test Plan

WIP

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the v1 label Dec 14, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a RuntimeError caused by a dtype mismatch for inputs_embeds during profile_run. The fix involves casting inputs_embeds to the correct model dtype within the _dummy_run method. The change is correct and effectively resolves the reported issue. However, I've noted that a similar dtype mismatch could potentially occur during regular inference, as the execute_model path lacks a similar safeguard. I've left a comment suggesting a more comprehensive fix in a follow-up to ensure robustness across all execution paths. Overall, this is a good fix for the immediate problem.

@wasertech

This comment was marked as outdated.

@wasertech

This comment was marked as resolved.

@wasertech wasertech marked this pull request as ready for review December 14, 2025 09:29
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

Member

@DarkLight1337 DarkLight1337 left a comment


Hmm, how can this happen? The buffer should be in the correct dtype according to this code:

        self.inputs_embeds = self._make_buffer(
            self.max_num_tokens, self.inputs_embeds_size, dtype=self.dtype, numpy=False
        )

@wasertech
Author

wasertech commented Dec 14, 2025

https://gist.github.com/wasertech/fd579f5b09b2e9e0206dc5dac092791b#file-vllm-err-dtype-txt

These are the full logs with eager mode enforced on Turing. No matter which dtype I pass via the parameter flags, I still get this error since v0.10.x (the first release compatible with this particular model, Apertus).

@mergify

mergify bot commented Dec 14, 2025

Hi @wasertech, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

This change consolidates the fix for handling runtime dtype mismatches for inputs_embeds, specifically observed in Docker environments with Apertus models. It includes a guard clause to cast inputs_embeds to self.dtype if they diverge, ensuring robustness against such edge cases.

Signed-off-by: wasertech <danny@waser.tech>
@wasertech
Author

Hmm, how can this happen? The buffer should be in the correct dtype according to this code:

        self.inputs_embeds = self._make_buffer(
            self.max_num_tokens, self.inputs_embeds_size, dtype=self.dtype, numpy=False
        )

After deep investigation, here is the most technically plausible explanation for the dtype mismatch:

The Mechanism of Divergence:

Initialization: The GPUModelRunner is initialized. self.dtype is set (e.g., to float32) based on the initial model_config. The self.inputs_embeds buffer is created using this initial self.dtype (so it is a float32 buffer).

Configuration Update / Divergence: Sometime later (likely during update_config or the model loading process, triggered by specific environments such as Docker, specific models such as Apertus, or, in my case, the GPU architecture, Turing), the effective self.dtype expected by the runner logic seems to shift (e.g., to float16), OR the buffer initialization inferred a default (such as float32 for auto) that differs from the final enforced self.dtype.

The Mismatch: inputs_embeds (passed to _model_forward) is a slice of that persistent self.inputs_embeds buffer. Thus, it carries the buffer's dtype (e.g., float32).

The Crash: The _model_forward method checks self.dtype (now float16) against inputs_embeds.dtype (float32). Without the fix, this mismatch propagates to the model, causing the error.

The Fix: The check catches this exact divergence (inputs_embeds.dtype != self.dtype) and proactively casts the tensor to match self.dtype before execution, resolving the stale-state conflict.

Crucially, inputs_embeds is almost always a slice of self.inputs_embeds, so the mismatch proves that the buffer's persistent dtype desynchronized from the runner's current dtype property. This is a classic "stale state" issue in persistent buffers.
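
For context, the guard this PR adds boils down to a cast along these lines (a minimal sketch of the idea, not the exact diff; the surrounding _model_forward plumbing is elided and the variable names are only assumed to match):

# Sketch: just before handing the batch to the model.
# inputs_embeds is the slice of the persistent buffer; self.dtype is the
# runner's current model dtype.
if inputs_embeds is not None and inputs_embeds.dtype != self.dtype:
    # Heal a stale buffer dtype by casting the slice to what the weights expect.
    inputs_embeds = inputs_embeds.to(self.dtype)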

Thanks Gemini for sniffing around and trying to explain this madness.

@DarkLight1337
Member

DarkLight1337 commented Dec 14, 2025

Also, which model triggered this problem? Perhaps the problem is in how the dummy batch is created as well.

@wasertech
Author

This one:
swiss-ai/Apertus-8B-Instruct-2509

Member

@DarkLight1337 DarkLight1337 left a comment


Can you check if this problem happens to any other model?

@DarkLight1337
Member

DarkLight1337 commented Dec 14, 2025

Also it would be nice to try to print out self.dtype in various places inside _dummy_run to see at which point it gets changed

@wasertech
Author

Can you check if this problem happens to any other model?

I can serve my own Mistral fine-tuned export in AWQ just fine (MistralForCausalLM), but it's not an MoE, so it basically works natively on Turing (no config change after init). I can also run granite-h-micro (GraniteMoeHybridForCausalLM) fine (and there is a cast from bf16 to f16). Most of the other architectures I've downloaded also work, so you might be on the right track @DarkLight1337; it might be an issue with the Apertus architecture definition.

Also it would be nice to try to print out self.dtype in various places inside _dummy_run to see at which point it gets changed

I agree, let me add those and produce a nice log.
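
Something like this, sprinkled at the top of _dummy_run and again right before the model call (a rough sketch of the temporary instrumentation, not committed code):

# Temporary debug print to see where self.dtype and the persistent buffer diverge.
print(
    f"DEBUG: Entering _dummy_run. "
    f"self.dtype={self.dtype}, "
    f"self.inputs_embeds.gpu.dtype={self.inputs_embeds.gpu.dtype}"
)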

@wasertech
Author

Just tested my changes locally. Not only does my fix not solve the issue, but the dtype is already aligned when entering the _dummy_run function:

(EngineCore_DP0 pid=200351) DEBUG: Entering _dummy_run. self.dtype=torch.float16, self.inputs_embeds.gpu.dtype=torch.float16

Link to the gist

I understand your reaction now ahah. I'll close this PR since it doesn't address the root of the issue (or even solve it, hihi). It's likely in transformers? I'll try to see exactly where.
