Describe the issue
The following error occurs when trying to run https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct on WebGPU.
Note that the CPU implementation runs correctly, so this is a bug in the WebGPU EP. The zero-dimension tensor is by design: it is used for the first generation step.
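For readers unfamiliar with zero-dimension tensors, here is a minimal sketch of what one looks like. It is plain JavaScript and does not call ONNX Runtime or Transformers.js. The shapes and the names `elementCount`, `firstStepDims`, and `laterStepDims` are hypothetical and chosen for illustration only; the actual tensor that triggers the error is not identified here.

```javascript
// Number of elements in a tensor with the given shape (product of its dims).
function elementCount(dims) {
  return dims.reduce((acc, d) => acc * d, 1);
}

// Hypothetical shapes, for illustration only (not taken from the model).
// A 0 in any position means the tensor holds no data at all.
const firstStepDims = [1, 3, 0, 64];  // e.g. nothing accumulated yet on the first step
const laterStepDims = [1, 3, 17, 64]; // e.g. after some tokens have been processed

// The backing buffers an execution provider would have to handle.
const firstStepData = new Float32Array(elementCount(firstStepDims));
const laterStepData = new Float32Array(elementCount(laterStepDims));

console.log(firstStepData.length); // 0
console.log(laterStepData.length); // 3264
```

An execution provider has to accept the first case as a valid input with a zero-length buffer, rather than rejecting it or trying to allocate or dispatch work for it.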
To reproduce
- Install and build Transformers.js from source (https://github.com/huggingface/transformers.js)
- Run the following code in-browser:
import {
  AutoProcessor,
  AutoModelForVision2Seq,
  load_image,
} from "@huggingface/transformers";

// Initialize processor and model
const model_id = "HuggingFaceTB/SmolVLM-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
  dtype: {
    embed_tokens: "fp16", // "fp32", "fp16", "q8"
    vision_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
    decoder_model_merged: "q4", // "q8", "q4", "q4f16"
  },
  device: "webgpu",
});

// Load images
const image1 = await load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg");
const image2 = await load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg");

// Create input messages
const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "image" },
      { type: "text", text: "Can you describe the two images?" },
    ],
  },
];

// Prepare inputs
const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
const inputs = await processor(text, [image1, image2], {
  // Set `do_image_splitting: true` to split images into multiple patches.
  // NOTE: This uses more memory, but can provide more accurate results.
  do_image_splitting: false,
});

// Generate outputs
const generated_ids = await model.generate({
  ...inputs,
  max_new_tokens: 500,
});
const generated_texts = processor.batch_decode(
  generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(generated_texts[0]);
// ' In the first image, there is a green statue of liberty on a pedestal in the middle of the water. The water is surrounded by trees and buildings in the background. In the second image, there are pink and red flowers with a bee on the pink flower.'
Urgency
This blocks SmolVLM usage in Transformers.js.
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.20.1
Execution Provider
'webgpu' (WebGPU)