[Web] WebGPU Incorrect predictions in ONNX model when using Electron on Intel devices #24442

Open
@grazder

Description

Describe the issue

We're using onnxruntime-web with WebGPU backend on different platforms and Electron is one of them.

We observe unstable/inaccurate predictions from an ONNX segmentation model when running inference via ONNX Runtime Web in Electron on specific Intel integrated GPUs (Gen-12LP, Gen-9, Gen-11). The issue does not occur in Chrome on the same devices. The problem manifests as significant tensor value mismatches (e.g., abs/rel errors) in convolution layers, leading to invalid segmentation masks.
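For context, the sessions are created roughly like this (a minimal sketch; the model file name, input name, and shape are placeholders, not our exact setup):

import * as ort from 'onnxruntime-web';

// Create an inference session on the WebGPU backend.
const session = await ort.InferenceSession.create('segmentation.onnx', {
  executionProviders: ['webgpu'],
});

// Run the model on a float32 input tensor (hypothetical input name and shape).
const input = new ort.Tensor('float32', new Float32Array(1 * 3 * 320 * 512), [1, 3, 320, 512]);
const outputs = await session.run({ input });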

On 1.20.1 we faced this problem mostly on Intel Gen-12LP devices: i5-12400, i7-13700H, i7-11850H, i7-12700, i5-1235U, and many others.

I tried bisecting versions to find a fix for the devices above. I found that the issue persists up to 1.21.0-dev.20241107-6a295eb75b and is fixed starting from 1.21.0-dev.20241109-d3ad76b2cf.

After that I decided to use commit d27fecd3d3837864a268bc96f00f2b8dce294697, because everything seemed stable and the problem was solved for the devices above.

But after that we faced the problem on other devices. Examples:

  • gen-12lp: i7-12700H (breaks after model reinitialization), i3-1215U, i5-1035G1
  • gen-11: i5-11320H
  • gen-9: i3-7100U, i5-7200U, i7-8565U

I have also noticed similar problems elsewhere, for example model predictions that differ too much from the reference (atol > 0.1) on Ampere and Turing GPUs in Chrome, and on many devices when using fp16. But we face those problems much less often.

I also tried later versions, but faced similar problems, for example on the i7-13700H.

To help sort out this problem, I can provide more info such as WebGPU reports and more device examples, and can try additional commits on these devices.

To reproduce

I can reproduce on my own devices:

  • Mac M1 (metal-3)
  • NVIDIA GeForce RTX 3060 (ampere)
  • i5-12400 (gen-12lp) - this is where I can see the problems

I attach some Conv nodes from my model on which the tests fail: test_examples.zip

Master - 4d03aeff0ef86a62dacf02d67624cf26050125fd

git checkout 4d03aeff0ef86a62dacf02d67624cf26050125fd
cd onnxruntime/js
npm ci
cd common
npm ci
cd ../web
npm ci
npm run pull:wasm
npm run build

Move the test cases from above into onnxruntime/js/web/test/data/node/opset_20 (opset_20 is an arbitrary name that the testing scripts work with).
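For reference, each test case folder uses the standard ONNX node-test layout (the .pb file names below follow the usual convention; the actual contents come from test_examples.zip):

onnxruntime/js/web/test/data/node/opset_20/
  8_Conv/
    model.onnx
    test_data_set_0/
      input_0.pb
      output_0.pb
  21_Conv/
    ...
  31_Conv/
    ...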

Change onnxruntime/js/web/test/suite-test-list.jsonc to:

{
  "webgpu": {
    "onnx": [],
    "node": ["8_Conv", "21_Conv", "31_Conv"],
    "ops": []
  }
}

After that I run tests for these ops on all of my devices:

// gen-12lp
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// ampere
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// metal-3
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

After that I check out 6a295eb75b:

git checkout 6a295eb75b
js\build_jsep.bat r

// building etc

// metal-3
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// gen-12lp
npm run test -- suite1 --backend webgpu --env electron 
// FAIL
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// ampere
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

I see the following mismatch on gen-12lp (only the first 10 tensor values are printed here):

LOG: 'e Validator 2025-04-16T13:01:25.774Z|abs/rel check failed-- index:163839: actual=1.9458719491958618,expected=3.159862518310547'
LOG: 'e TestRunner 2025-04-16T13:01:25.774Z|Tensor mismatch:
ACTUAL: type=float32; dims=[1,16,80,128]; data=[-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426]
EXPECT: type=float32; dims=[1,16,80,128]; data=[0.6060669422149658,0.5686113834381104,0.5930850505828857,0.5984766483306885,0.5964930057525635,0.5918130874633789,0.5929081439971924,0.6105263233184814,0.6307907104492188,0.6446692943572998]'
LOG: 'e TestRunner 2025-04-16T13:01:25.774Z|  Result: FAILED'
LOG: 'e TestRunner 2025-04-16T13:01:25.774Z|Failed to run test data from folder: test_data_set_0. Error: [AssertionError: tensor data should match: expected false to be true]'
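For reference, the abs/rel check in the log corresponds to a tolerance comparison along these lines (a simplified TypeScript sketch, not the actual test-runner code; the atol/rtol defaults here are illustrative):

// Returns true if every element of `actual` is within absolute + relative tolerance of `expected`.
function tensorsMatch(actual: Float32Array, expected: Float32Array, atol = 1e-3, rtol = 1e-3): boolean {
  if (actual.length !== expected.length) return false;
  for (let i = 0; i < actual.length; i++) {
    const diff = Math.abs(actual[i] - expected[i]);
    // The failing case above: index 163839, actual=1.9459, expected=3.1599 -> diff ~1.21, far beyond tolerance.
    if (diff > atol + rtol * Math.abs(expected[i])) return false;
  }
  return true;
}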

After that I check out d3ad76b2cf:

git checkout d3ad76b2cf
js\build_jsep.bat r

// building etc

// metal-3
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// gen-12lp
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// ampere
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

So this fixes the issue for my device, but I assume that on the devices with incorrect predictions listed above we would face the same errors.

So it seems that convolutions are unstable in Electron on a lot of Intel devices.

Urgency

I'm working on a segmentation model, and on some devices I see weird model predictions, so this problem is very important, and I face it a lot. As a workaround I developed some tests that I run at initialization, so I can disable the model if it produces incorrect predictions.
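The workaround looks roughly like this (a minimal sketch; the reference input/output tensors and the 0.1 threshold are placeholders for the real test data we ship):

import * as ort from 'onnxruntime-web';

// Run a known input through the freshly created session and compare the result
// against a precomputed reference output; if the error is too large, we disable
// the model (or fall back to another backend) instead of serving bad masks.
async function webgpuLooksHealthy(
  session: ort.InferenceSession,
  refInput: ort.Tensor,
  refOutput: Float32Array,
  atol = 0.1,
): Promise<boolean> {
  const results = await session.run({ [session.inputNames[0]]: refInput });
  const actual = results[session.outputNames[0]].data as Float32Array;
  if (actual.length !== refOutput.length) return false;
  for (let i = 0; i < refOutput.length; i++) {
    if (Math.abs(actual[i] - refOutput[i]) > atol) return false;
  }
  return true;
}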

Here is a picture of the incorrect convolution behaviour (it's not because the model was trained badly; it's definitely caused by the incorrect predictions):

[Image: segmentation mask output showing the incorrect convolution behaviour]

So I think this problem is critical for onnxruntime-web usage on Electron.

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

d27fecd

Execution Provider

'webgpu' (WebGPU)

Metadata

Labels

ep:WebGPU (ort-web webgpu provider), platform:web (issues related to ONNX Runtime Web; typically submitted using template)
