[WebGPU] Kernel "[GroupQueryAttention] /model/layers.0/attn/GroupQueryAttention" failed. Error: Input "key" is expected to have 3, 4, or 5 dimensions. #22987

@xenova
Describe the issue

The following error occurs when trying to run https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct on WebGPU.

(screenshot of the error in the browser console)

Note that the CPU implementation operates correctly, so this is indeed a bug in the WebGPU EP. Moreover, the zero-dimension tensor is by design: it represents the empty past key/value cache passed on the first generation step.
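For illustration, on step 0 the cache holds no past tokens, so its sequence dimension is 0. A minimal sketch of such a shape (the batch/head/head-dim values below are hypothetical, not taken from the actual model):

```javascript
// Hypothetical past-key shape on the first generation step:
// [batch, num_kv_heads, past_sequence_length, head_dim], with past_sequence_length = 0.
const pastKeyDims = [1, 32, 0, 64];

// A tensor with a zero-length dimension is valid and simply holds no elements.
const numElements = pastKeyDims.reduce((a, b) => a * b, 1);
console.log(numElements); // 0
```

The CPU EP accepts such an empty "key" input, while the WebGPU GroupQueryAttention kernel rejects it with the dimension-count error above.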

To reproduce

  1. Install and build Transformers.js from source (https://github.com/huggingface/transformers.js)
  2. Run the following code in-browser:
```js
import {
  AutoProcessor,
  AutoModelForVision2Seq,
  load_image,
} from "@huggingface/transformers";

// Initialize processor and model
const model_id = "HuggingFaceTB/SmolVLM-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
  dtype: {
    embed_tokens: "fp16", // "fp32", "fp16", "q8"
    vision_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
    decoder_model_merged: "q4", // "q8", "q4", "q4f16"
  },
  device: 'webgpu',
});

// Load images
const image1 = await load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg");
const image2 = await load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg");

// Create input messages
const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "image" },
      { type: "text", text: "Can you describe the two images?" },
    ],
  },
];

// Prepare inputs
const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
const inputs = await processor(text, [image1, image2], {
  // Set `do_image_splitting: true` to split images into multiple patches.
  // NOTE: This uses more memory, but can provide more accurate results.
  do_image_splitting: false,
});

// Generate outputs
const generated_ids = await model.generate({
  ...inputs,
  max_new_tokens: 500,
});
const generated_texts = processor.batch_decode(
  generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(generated_texts[0]);
// ' In the first image, there is a green statue of liberty on a pedestal in the middle of the water. The water is surrounded by trees and buildings in the background. In the second image, there are pink and red flowers with a bee on the pink flower.'
```

Urgency

This blocks SmolVLM usage in Transformers.js.

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.20.1

Execution Provider

'webgpu' (WebGPU)

Labels

    ep:WebGPU (ort-web webgpu provider)
    model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.)
    platform:web (issues related to ONNX Runtime web; typically submitted using template)
