Description
Describe the issue
We're using onnxruntime-web with the WebGPU backend on several platforms, and Electron is one of them. We observe unstable/inaccurate predictions from an ONNX segmentation model when running inference via ONNX Runtime Web in Electron on specific Intel integrated GPUs (Gen-12LP, Gen-11, Gen-9). The issue does not occur in Chrome on the same devices. The problem manifests as significant tensor value mismatches (e.g., abs/rel check failures) in convolution layers, leading to invalid segmentation masks.
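For context, our usage is roughly the following (a minimal sketch, not our exact code; the model path, the "input"/"output" names, and the dims are placeholders):
import * as ort from 'onnxruntime-web/webgpu';
async function runSegmentation(inputData: Float32Array) {
  // create the session on the WebGPU EP (same code path in Chrome and Electron)
  const session = await ort.InferenceSession.create('segmentation.onnx', {
    executionProviders: ['webgpu'],
  });
  // "input"/"output" names and the dims are placeholders for our real model I/O
  const feeds = { input: new ort.Tensor('float32', inputData, [1, 3, 320, 512]) };
  const results = await session.run(feeds);
  return results.output;
}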
On 1.20.1 we faced this problem mostly on Intel gen-12lp devices: i5-12400, i7-13700H, i7-11850H, i7-12700, i5-1235U, and a lot of others. I bisected released versions to find where the problem goes away for the devices above: it is broken up to 1.21.0-dev.20241107-6a295eb75b and fixed as of 1.21.0-dev.20241109-d3ad76b2cf. Based on that, I decided to use commit d27fecd3d3837864a268bc96f00f2b8dce294697, because everything seemed stable and the problem was solved for the devices above.
But after that we faced the problem on other devices. Examples:
- gen-12lp: i7-12700H (breaks after model reinitialization), i3-1215U, i5-1035G1
- gen-11: i5-11320H
- gen-9: i3-7100U, i5-7200U, i7-8565U
I have noticed similar problems elsewhere, for example predictions that differ too much from the reference (atol > 0.1) on Ampere and Turing GPUs in Chrome, and on many devices with fp16, but we hit those far less often. I also tried newer versions, but ran into similar-looking problems, for example on the i7-13700H.
To help sort out this problem I can provide more info (e.g., WebGPU adapter reports), list more affected devices, or try additional commits on these devices.
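For example, this is roughly the adapter info I can collect for such reports (a sketch assuming @webgpu/types for the navigator.gpu typings; the availability of adapter.info and its fields can vary by browser version):
// dump the WebGPU adapter identity (vendor / architecture / device / description)
async function dumpWebGpuAdapterInfo(): Promise<void> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    console.log('WebGPU adapter not available');
    return;
  }
  const info = adapter.info; // architecture is e.g. "gen-12lp" on the affected iGPUs
  console.log(JSON.stringify({
    vendor: info.vendor,
    architecture: info.architecture,
    device: info.device,
    description: info.description,
  }));
}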
To reproduce
I can reproduce on my own devices:
- Mac M1 (metal-3)
- NVIDIA GeForce RTX 3060 (ampere)
- i5-12400 (gen-12lp) - this is where I can see the problems
I attach some Conv nodes from my model on which the tests fail: test_examples.zip
Master (commit 4d03aeff0ef86a62dacf02d67624cf26050125fd):
git checkout 4d03aeff0ef86a62dacf02d67624cf26050125fd
cd onnxruntime/js
npm ci
cd common
npm ci
cd ../web
npm ci
npm run pull:wasm
npm run build
Move the test cases from above into onnxruntime/js/web/test/data/node/opset_20 (opset_20 is an arbitrary folder name that the testing scripts work with).
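For reference, I assume the standard ONNX node-test layout for these folders (a model plus protobuf input/output pairs), i.e. something like:
onnxruntime/js/web/test/data/node/opset_20/
  8_Conv/
    model.onnx
    test_data_set_0/
      input_0.pb
      output_0.pb
  21_Conv/
    ...
  31_Conv/
    ...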
Change onnxruntime/js/web/test/suite-test-list.jsonc to:
{
"webgpu": {
"onnx": [],
"node": ["8_Conv", "21_Conv", "31_Conv"],
"ops": []
}
}
After that I run the tests for these ops on all of my devices:
// gen-12lp
npm run test -- suite1 --backend webgpu --env electron
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS
// ampere
npm run test -- suite1 --backend webgpu --env electron
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS
// metal-3
npm run test -- suite1 --backend webgpu --env electron
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS
After that I check out 6a295eb75b:
git checkout 6a295eb75b
js\build_jsep.bat r
// building etc
// metal-3
npm run test -- suite1 --backend webgpu --env electron
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS
// gen-12lp
npm run test -- suite1 --backend webgpu --env electron
// FAIL
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS
// ampere
npm run test -- suite1 --backend webgpu --env electron
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS
I see the following mismatch on gen-12lp (I print here only the first 10 tensor values):
LOG: 'e Validator 2025-04-16T13:01:25.774Z|abs/rel check failed-- index:163839: actual=1.9458719491958618,expected=3.159862518310547'
LOG: 'e TestRunner 2025-04-16T13:01:25.774Z|Tensor mismatch:
ACTUAL: type=float32; dims=[1,16,80,128]; data=[-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426]
EXPECT: type=float32; dims=[1,16,80,128]; data=[0.6060669422149658,0.5686113834381104,0.5930850505828857,0.5984766483306885,0.5964930057525635,0.5918130874633789,0.5929081439971924,0.6105263233184814,0.6307907104492188,0.6446692943572998]'
LOG: 'e TestRunner 2025-04-16T13:01:25.774Z| Result: FAILED'
LOG: 'e TestRunner 2025-04-16T13:01:25.774Z|Failed to run test data from folder: test_data_set_0. Error: [AssertionError: tensor data should match: expected false to be true]'
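For clarity, the abs/rel check in the log is, as I understand it, the usual combined absolute/relative tolerance comparison, roughly like this (the thresholds here are illustrative, not the validator's exact defaults):
// element-wise combined absolute/relative tolerance check (illustrative thresholds)
function matches(actual: number, expected: number, atol = 1e-3, rtol = 1e-3): boolean {
  return Math.abs(actual - expected) <= atol + rtol * Math.abs(expected);
}
// the failing element from the log above:
matches(1.9458719491958618, 3.159862518310547); // false -> "abs/rel check failed"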
After that I check out d3ad76b2cf:
git checkout d3ad76b2cf
js\build_jsep.bat r
// building etc
// metal-3
npm run test -- suite1 --backend webgpu --env electron
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS
// gen-12lp
npm run test -- suite1 --backend webgpu --env electron
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS
// ampere
npm run test -- suite1 --backend webgpu --env electron
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS
So this fixes the issue for my device, but I assume that on the devices with incorrect predictions listed above we would hit the same kind of errors. It seems that convolutions are unstable in Electron on a lot of Intel devices.
Urgency
I'm working on a segmentation model, and on some devices I see weird model predictions, so this problem is very important to us, and I hit it a lot. As a workaround I developed some tests that I run at initialization, so I can turn the model off if it produces incorrect predictions (see the sketch below).
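The workaround is roughly the following (a sketch of the idea, not our production code; the "input"/"output" names, dims, and tolerance are placeholders):
import * as ort from 'onnxruntime-web/webgpu';
// Run one known input through the session at startup and compare against a
// precomputed reference output; if it deviates too much, disable the model.
async function producesSaneOutput(
  session: ort.InferenceSession,
  refInput: Float32Array,
  refOutput: Float32Array,
  dims: number[],
  atol = 1e-2,
): Promise<boolean> {
  const feeds = { input: new ort.Tensor('float32', refInput, dims) };
  const results = await session.run(feeds);
  const actual = results.output.data as Float32Array;
  for (let i = 0; i < refOutput.length; i++) {
    if (Math.abs(actual[i] - refOutput[i]) > atol) {
      return false;
    }
  }
  return true;
}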
Here is a picture of the incorrect convolution behaviour (it's not because the model was trained badly, it's 100% because of incorrect predictions):

So I think this problem is critical for onnxruntime-web usage in Electron.
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.20.1; also tested commits 4d03aeff0ef86a62dacf02d67624cf26050125fd (master), 6a295eb75b, d3ad76b2cf, and d27fecd3d3837864a268bc96f00f2b8dce294697 (see above)
Execution Provider
'webgpu' (WebGPU)