Describe the issue
I observe a significant difference in execution behavior when switching from the ONNX to the ORT model format with the WebGPU execution provider.
1. With ONNX format:
   - `run()` completes quickly, and most latency comes from `await device.queue.onSubmittedWorkDone()` (expected for GPU-bound work).
   - Profiler shows only GPU-related operations.
2. With ORT format:
   - The entire latency shifts to `await model.run()`, with no significant wait on GPU queue completion.
   - Profiler reveals many WASM operations instead of GPU-accelerated kernels.
I think that the ORT format should leverage the GPU execution providers (WebGPU/WebGL) similarly to the ONNX format, without introducing unexpected WASM-based CPU fallbacks.
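For context, this is roughly how I create the session (a simplified sketch; the model URL, input name, and shape are placeholders, and the import path may differ depending on the bundle used):

```js
import * as ort from 'onnxruntime-web/webgpu'; // WebGPU-enabled bundle of onnxruntime-web

// Session creation is identical for both formats; only the model file changes
// ('model.onnx' vs. 'model.ort').
const session = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['webgpu'],
});

// Placeholder input name and shape; the attached model uses its own.
const tensor = new ort.Tensor(
  'float32',
  new Float32Array(1 * 3 * 224 * 224),
  [1, 3, 224, 224]
);
const results = await session.run({ input: tensor });
```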
Profiling for the model attached in "To reproduce":
Profiling for another, larger model:
- Do conversion flags matter? Does the ORT format pre-optimize the model for CPU/WASM by default?
- Are there hidden constraints when using the ORT format in browsers?
To reproduce
- Load an ONNX model and run inference with:
  ```js
  await this._model.run({ input: tensor }, { output: this._outputTensor });
  await this._device.queue.onSubmittedWorkDone(); // Majority of latency here
  ```
- Convert the model to ORT format (using `convert_onnx_models_to_ort.py`).
- Load the ORT model and run the same inference code.
- Observe the profiler showing WASM ops and the latency moved to `await model.run()` (see the logging sketch after this list).
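A minimal sketch of how node-to-EP placement can be inspected from the JS side, assuming the `ort.env` logging switches behave as documented (the .ort path is a placeholder):

```js
import * as ort from 'onnxruntime-web/webgpu';

// Verbose logs should include graph-partitioning / node-placement messages,
// showing whether each node runs on the WebGPU EP or falls back to the
// CPU (WASM) EP.
ort.env.debug = true;
ort.env.logLevel = 'verbose';

const session = await ort.InferenceSession.create('model.ort', {
  executionProviders: ['webgpu'],
});
// ...run inference as in the first step above and inspect the console output.
```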
I attach example models in ONNX and ORT formats on which I face the issue: models.zip
I can reproduce it on 1.20.0 and d27fecd.
Urgency
Can you help me clarify:
- What are these operations and why do they appear?
- Does this mean that the model runs on the CPU?
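If it helps, I can also capture ONNX Runtime's built-in profile for both formats. A sketch of what I would run, assuming `enableProfiling` and `endProfiling` are supported by the web bindings:

```js
import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create('model.ort', {
  executionProviders: ['webgpu'],
  enableProfiling: true, // record per-node timings, including the EP used
});

// Placeholder input, as in the sketch above.
const tensor = new ort.Tensor('float32', new Float32Array(1 * 3 * 224 * 224), [1, 3, 224, 224]);
await session.run({ input: tensor });

// Stop profiling and flush the collected per-node records
// (console output in the web backend, if supported).
session.endProfiling();
```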
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.20.0 (also reproduced on commit d27fecd)
Execution Provider
'webgpu' (WebGPU)