
Conversation


@wasertech wasertech commented Dec 14, 2025

Purpose

Ensure inputs_embeds is cast to the correct dtype. This should help with #29349 and the following error:

RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half

(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843] EngineCore failed to start.
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843] Traceback (most recent call last):
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in run_engine_core
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 610, in __init__
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     super().__init__(
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 109, in __init__
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 235, in _initialize_kv_caches
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 479, in run_method
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return func(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return func(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 324, in determine_available_memory
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     self.model_runner.profile_run()
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4357, in profile_run
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                                         ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return func(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4071, in _dummy_run
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     outputs = self.model(
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]               ^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/apertus.py", line 552, in forward
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     model_output = self.model(
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                    ^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 360, in __call__
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return self.forward(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/apertus.py", line 395, in forward
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     hidden_states, residual = layer(positions, hidden_states, residual)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/apertus.py", line 313, in forward
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     hidden_states = self.mlp(hidden_states)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                     ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/apertus.py", line 110, in forward
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     x, _ = self.down_proj(x)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 1405, in forward
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     output_parallel = self.quant_method.apply(self, input_parallel, bias_)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 240, in apply
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return dispatch_unquantized_gemm()(layer, x, layer.weight, bias)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/utils.py", line 105, in default_unquantized_gemm
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return torch.nn.functional.linear(x, weight, bias)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/parameter.py", line 126, in __torch_function__
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]     return super().__torch_function__(func, types, args, kwargs)
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=160) ERROR 12-13 22:37:33 [core.py:843] RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half
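
For reference, a minimal standalone repro of this class of error in plain PyTorch (not the vLLM call path; shapes are arbitrary): a float32 activation hitting a half-precision weight in torch.nn.functional.linear fails with the same mat1/mat2 message (exact wording can vary slightly by device/backend).

import torch

x = torch.randn(4, 8, dtype=torch.float32)    # activations left in float32
w = torch.randn(16, 8, dtype=torch.float16)   # weights loaded in half precision

# Raises RuntimeError: expected mat1 and mat2 to have the same dtype,
# but got: float != c10::Half (message as seen on CUDA; CPU wording may differ)
y = torch.nn.functional.linear(x, w)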

Test Plan

WIP

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the v1 label Dec 14, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a RuntimeError caused by a dtype mismatch for inputs_embeds during profile_run. The fix involves casting inputs_embeds to the correct model dtype within the _dummy_run method. The change is correct and effectively resolves the reported issue. However, I've noted that a similar dtype mismatch could potentially occur during regular inference, as the execute_model path lacks a similar safeguard. I've left a comment suggesting a more comprehensive fix in a follow-up to ensure robustness across all execution paths. Overall, this is a good fix for the immediate problem.

@wasertech

This comment was marked as outdated.

@wasertech

This comment was marked as resolved.

@wasertech wasertech marked this pull request as ready for review December 14, 2025 09:29
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

Member

@DarkLight1337 DarkLight1337 left a comment


Hmm, how can this happen? The buffer should be in the correct dtype according to this code:

        self.inputs_embeds = self._make_buffer(
            self.max_num_tokens, self.inputs_embeds_size, dtype=self.dtype, numpy=False
        )

@wasertech
Author

wasertech commented Dec 14, 2025

https://gist.github.com/wasertech/fd579f5b09b2e9e0206dc5dac092791b#file-vllm-err-dtype-txt

These are the full logs with eager mode enforced on Turing. No matter which dtype I pass via the parameter flags, I still get this error since v0.10.x (the first release compatible with this particular model, Apertus).

@mergify

mergify bot commented Dec 14, 2025

Hi @wasertech, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

This change consolidates the fix for handling runtime dtype mismatches for inputs_embeds, specifically observed in Docker environments with Apertus models. It includes a guard clause to cast inputs_embeds to self.dtype if they diverge, ensuring robustness against such edge cases.

Signed-off-by: wasertech <danny@waser.tech>
@wasertech
Author

Hmm, how can this happen? The buffer should be in the correct dtype according to this code:

        self.inputs_embeds = self._make_buffer(
            self.max_num_tokens, self.inputs_embeds_size, dtype=self.dtype, numpy=False
        )

After deep investigation, here is the most technically plausible explanation for the dtype mismatch:

The Mechanism of Divergence:

Initialization: The GPUModelRunner is initialized. self.dtype is set (e.g., to float32) based on the initial model_config. The self.inputs_embeds buffer is created using this initial self.dtype (so it is a float32 buffer).

Configuration Update / Divergence: Sometime later (likely during update_config or the model loading process, triggered by specific environments such as Docker, specific models such as Apertus, or, in my case, the GPU architecture, Turing), the effective self.dtype expected by the runner logic seems to shift (e.g., to float16), OR the buffer initialization inferred a default (such as float32 for auto) that differs from the final enforced self.dtype.

The Mismatch: inputs_embeds (passed to _model_forward) is a slice of that persistent self.inputs_embeds buffer. Thus, it carries the buffer's dtype (e.g., float32).

The Crash: The _model_forward method checks self.dtype (now float16) against inputs_embeds.dtype (float32). Without the fix, this mismatch propagates to the model, causing the error.

The Fix: The check catches this exact divergence (inputs_embeds.dtype != self.dtype) and proactively casts the tensor to match self.dtype before execution, resolving the stale-state conflict.

Crucially, inputs_embeds is almost always a slice of self.inputs_embeds, so the mismatch proves that the buffer's persistent dtype desynchronized from the runner's current dtype property. This is a classic "stale state" issue in persistent buffers.
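
For context, the guard this PR adds boils down to a cast along these lines (a minimal sketch of the idea, not the exact diff; the surrounding _model_forward plumbing is elided and the variable names are only assumed to match):

# Sketch: just before handing the batch to the model.
# inputs_embeds is the slice of the persistent buffer; self.dtype is the
# runner's current model dtype.
if inputs_embeds is not None and inputs_embeds.dtype != self.dtype:
    # Heal a stale buffer dtype by casting the slice to what the weights expect.
    inputs_embeds = inputs_embeds.to(self.dtype)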

Thanks Gemini for sniffing around and trying to explain this madness.

@DarkLight1337
Member

DarkLight1337 commented Dec 14, 2025

Also, which model triggered this problem? Perhaps the problem is in how the dummy batch is created as well.

@wasertech
Author

This one:
swiss-ai/Apertus-8B-Instruct-2509

Member

@DarkLight1337 DarkLight1337 left a comment


Can you check if this problem happens to any other model?

@DarkLight1337
Member

DarkLight1337 commented Dec 14, 2025

Also it would be nice to try to print out self.dtype in various places inside _dummy_run to see at which point it gets changed

@wasertech
Author

Can you check if this problem happens to any other model?

I can serve my own Mistral fine-tuned export in AWQ just fine (MistralForCausalLM), but it's not an MoE, so it basically works natively on Turing (no config change after init). I can also run granite-h-micro (GraniteMoeHybridForCausalLM) fine (and there is a cast from bf16 to f16). Most of the other architectures I've downloaded also work, so you might be on the right track @DarkLight1337; it might be an issue with the Apertus architecture definition.

Also it would be nice to try to print out self.dtype in various places inside _dummy_run to see at which point it gets changed

I agree, let me add those and produce a nice log.
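
Something like this, sprinkled at the top of _dummy_run and again right before the model call (a rough sketch of the temporary instrumentation, not committed code):

# Temporary debug print to see where self.dtype and the persistent buffer diverge.
print(
    f"DEBUG: Entering _dummy_run. "
    f"self.dtype={self.dtype}, "
    f"self.inputs_embeds.gpu.dtype={self.inputs_embeds.gpu.dtype}"
)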

@wasertech
Author

Just tested my changes locally. Not only does my fix not solve the issue, but the dtype is already aligned when entering the _dummy_run function:

(EngineCore_DP0 pid=200351) DEBUG: Entering _dummy_run. self.dtype=torch.float16, self.inputs_embeds.gpu.dtype=torch.float16

Link to the gist

I understand your reaction now ahah. I'll close this PR since it doesn't address the root of the issue (or even solve it, hihi). It's likely in transformers? I'll try to see exactly where.
