[Question] whisper vs. ort-wasm-simd-threaded.wasm #161

Open · jozefchutka opened this issue Jun 22, 2023 · 27 comments
Labels: question (Further information is requested)

Comments

@jozefchutka

While looking into https://cdn.jsdelivr.net/npm/@xenova/transformers@2.2.0/dist/transformers.js I can see a reference to ort-wasm-simd-threaded.wasm, however that one never seems to be loaded for whisper/automatic-speech-recognition ( https://huggingface.co/spaces/Xenova/whisper-web ), while it always uses ort-wasm-simd.wasm. I wonder if there is a way to enable or enforce threaded wasm and so improve transcription speed?

jozefchutka added the question label on Jun 22, 2023
@xenova (Owner) commented Jun 22, 2023

I believe this is due to how default HF spaces are hosted (which blocks usage of SharedArrayBuffer). Here's another thread discussing this: microsoft/onnxruntime#9681 (comment). I would be interested in seeing what performance benefits we could get though. cc'ing @fs-eire @josephrocca for some help too.

@jozefchutka (Author)

good spot! Simply adding:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

...will indeed make it load ort-wasm-simd-threaded.wasm. However, I do not see multiple workers spawned (as I would expect), nor any performance improvement.
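
For reference, here's a minimal sketch of how those headers could be served for local testing (the server.js file name and port 8080 are just assumptions for this example, not anything from this repo):

// server.js: tiny Node.js static file server that adds the COOP/COEP headers,
// so the page becomes cross-origin isolated and SharedArrayBuffer is available
import { createServer } from "node:http";
import { readFile } from "node:fs/promises";
import { extname } from "node:path";

const types = {
	".html": "text/html",
	".js": "text/javascript",
	".wasm": "application/wasm",
	".pcm": "application/octet-stream"
};

createServer(async (req, res) => {
	try {
		const path = "." + new URL(req.url, "http://localhost").pathname;
		const body = await readFile(path);
		res.writeHead(200, {
			"Content-Type": types[extname(path)] ?? "application/octet-stream",
			// the two headers that unlock the threaded wasm backend:
			"Cross-Origin-Opener-Policy": "same-origin",
			"Cross-Origin-Embedder-Policy": "require-corp"
		});
		res.end(body);
	} catch {
		res.writeHead(404);
		res.end();
	}
}).listen(8080);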

@josephrocca (Contributor)

@jozefchutka So, to confirm, it loaded the wasm-simd-threaded.wasm file but didn't spawn any threads? Can you check in the console whether self.crossOriginIsolated is true?

If it is true, then can you check the value of env.backends.onnx.wasm.numThreads?
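
For example, something along these lines in the page console (assuming the transformers.js env object is reachable there; otherwise log it from your own script):

// both values matter: the first gates SharedArrayBuffer, the second picks the thread count
console.log(self.crossOriginIsolated);          // must be true for threading to be possible
console.log(env.backends.onnx.wasm.numThreads); // undefined means "let ORT pick a default"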

Regarding Hugging Face spaces, I've opened an issue here for COEP/COOP header support: huggingface/huggingface_hub#1525

And in the meantime you can use the service worker hack on Spaces, mentioned here: https://github.com/orgs/community/discussions/13309#discussioncomment-3844940
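
(The idea behind that hack, very roughly, is a service worker that re-serves every response with the missing headers added; the sketch below assumes a sw.js file registered from the page, and the linked discussion has the complete version:)

// sw.js: add the COOP/COEP headers to every response so the page
// becomes cross-origin isolated even if the host doesn't send them
self.addEventListener("fetch", (event) => {
	event.respondWith((async () => {
		const original = await fetch(event.request);
		// opaque (no-cors) responses can't be rewrapped, pass them through
		if (original.status === 0) return original;
		const headers = new Headers(original.headers);
		headers.set("Cross-Origin-Opener-Policy", "same-origin");
		headers.set("Cross-Origin-Embedder-Policy", "require-corp");
		return new Response(original.body, {
			status: original.status,
			statusText: original.statusText,
			headers
		});
	})());
});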

@fs-eire commented Jun 22, 2023

An easy way to check is to open DevTools on that page and check whether typeof SharedArrayBuffer is "undefined".
( OK, checking self.crossOriginIsolated is the better way, good learning for me :) )

If the multi-thread feature is available and <ORT_ENV_OBJ>.wasm.numThreads is not set, ort-web will spawn [CPU-core-count / 2] (up to 4) threads.

If <ORT_ENV_OBJ>.wasm.numThreads is set, that number is used to spawn workers (the main thread counts toward it). Setting it to 1 force-disables the multi-thread feature.
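
In other words, roughly (a sketch of the default described above, not ORT's exact source):

// with numThreads unset, ORT Web picks about half the logical cores, capped at 4
const cores = navigator.hardwareConcurrency || 1;
const defaultThreads = Math.min(4, Math.ceil(cores / 2));
console.log(defaultThreads); // e.g. 4 on an 8-core machine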

@jozefchutka (Author) commented Jun 23, 2023

@josephrocca, @fs-eire the following is printed:
self.crossOriginIsolated -> true
env.backends.onnx.wasm.numThreads -> undefined

I have also tried explicitly setting numThreads to 4, but with the same result.

Something interesting to mention:

  • I use DevTools / Sources / Threads to observe running threads, where I see my index.html and my worker.js (which imports transformers.js); nothing else is reported until...
  • ...the very last moment, when the transformers.js transcription finishes, and then I can see 3 more threads appearing. I believe these have something to do with transformers running in threads (because they do not appear when I set numThreads=0), but I wonder: why do they appear so late, and why is there no performance difference?

@fs-eire commented Jun 23, 2023

env.backends.onnx.wasm.numThreads -> undefined

for onnxruntime-web, it's env.wasm.numThreads:

import { env } from 'onnxruntime-web';
env.wasm.numThreads = 4;

For transformers.js, I believe it is exposed in a different way.

@xenova (Owner) commented Jun 23, 2023

I believe @jozefchutka was doing it correctly:

env.backends.onnx.wasm.numThreads

I.e., env.backends.onnx is the onnx env variable
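
So, in transformers.js terms, the equivalent of the onnxruntime-web snippet above would be something like (a sketch; set it before creating the pipeline):

import { env } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.4.1/dist/transformers.min.js";

// env.backends.onnx forwards to onnxruntime-web's env object,
// so this is the same setting as env.wasm.numThreads above
env.backends.onnx.wasm.numThreads = 4;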

@jozefchutka (Author)

The env.backends.onnx.wasm object exists on the env object... so I think it's the right one...
Please let me know if I can assist/debug any further.

@xenova (Owner) commented Jul 14, 2023

I've been able to do some more testing on this and I am not seeing any performance improvements either... 🤔 ort-wasm-simd-threaded.wasm is being loaded, but doesn't seem to be working correctly. @fs-eire am I correct in saying that if SharedArrayBuffer is detected, it will default to the number of threads available? So, even in this case, setting env.backends.onnx.wasm.numThreads (which is correct) would not be necessary, and we should be seeing performance improvements either way.

@fs-eire commented Jul 14, 2023

I've been able to do some more testing on this and I am not seeing any performance improvements either... 🤔 ort-wasm-simd-threaded.wasm is being loaded, but doesn't seem to be working correctly. @fs-eire am I correct in saying that if SharedArrayBuffer is detected, it will default to the number of threads available? So, even in this case, setting env.backends.onnx.wasm.numThreads (which is correct) would not be necessary, and we should be seeing performance improvements either way.

The code is here:
https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/backend-wasm.ts#L29-L32

This code runs only once, when the first inference session is created.

@xenova (Owner) commented Jul 14, 2023

Right, so we should be seeing a performance improvement by simply having loaded ort-wasm-simd-threaded.wasm?

@fs-eire commented Jul 14, 2023

Yes. If you see it is loaded but no worker threads are spawned, that is likely to be a bug.

@xenova (Owner) commented Jul 14, 2023

Yes, this is something mentioned above by @jozefchutka:

Something interesting to mention:

  • I use DevTools / Sources / Threads to observe running threads, where I see my index.html and my worker.js (which imports transformers.js); nothing else is reported until...
  • ...the very last moment, when the transformers.js transcription finishes, and then I can see 3 more threads appearing. I believe these have something to do with transformers running in threads (because they do not appear when I set numThreads=0), but I wonder: why do they appear so late, and why is there no performance difference?

transformers.js does not do anything extra when it comes to threading, so I do believe this is an issue with onnxruntime-web. Please let me know if there's anything I can do to help debug/test.

@josephrocca (Contributor) commented Jul 14, 2023

@xenova Unless I misunderstand, you or @jozefchutka might need to provide a minimal example here. I don't see the problem of worker threads appearing too late (i.e. after inference) in this fairly minimal demo, for example:

https://josephrocca.github.io/openai-clip-js/onnx-image-demo.html

That's using the latest ORT Web version, and has self.crossOriginIsolated === true [0]. I see ort-wasm-simd-threaded.wasm load in the network tab, and worker threads immediately appear, before inference happens.

Edit: Oooh, unless this is something that specifically occurs when ORT Web is loaded from within a web worker? I haven't tested that yet, since I've just been using the ort.env.wasm.proxy flag to get the model off the main thread "automatically".


[0] Just a heads-up: For some reason I had to manually refresh the page the first time I loaded it just now - the service worker that adds the COOP/COEP headers didn't refresh automatically like it's supposed to.

@fs-eire commented Jul 15, 2023

If you use the ort.env.wasm.proxy flag, the proxy worker is spawned immediately. This is a different worker from the workers created for multithreaded computation.

@xenova (Owner) commented Jul 16, 2023

Should we see performance improvements even if the batch size is 1? Could you maybe explain how work is divided among threads @fs-eire?

Regarding a demo, @jozefchutka would you mind sharing the code you were referring to above? My testing was done inside my whisper-web application, which is quite large and has a lot of bloat around it.

@jozefchutka (Author)

Here is a demo:

worker.js

import { env, pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.4.1/dist/transformers.min.js";

env.allowLocalModels = false;
//env.backends.onnx.wasm.numThreads = 4;

const file = "tos.pcm";
const model = "Xenova/whisper-base.en";
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model);

const t0 = performance.now();
const result = await pipe(buffer, {
	chunk_length_s: 30,
	stride_length_s: 5,
	return_timestamps: true});

for(let {text, timestamp} of result.chunks)
	console.log(`${timestamp[0]} -> ${timestamp[1]} ${text}`);

console.log(performance.now() - t0);

demo.html

<script>
new Worker("worker.js", {type:"module"});
</script>

A command to generate the .pcm file:

ffmpeg -i tos.mp4 -filter_complex [0:1]aformat=channel_layouts=mono,aresample=16000[aout] -map [aout] -c:a pcm_f32le -f data tos.pcm

Changing the value of env.backends.onnx.wasm.numThreads makes no difference in transcription performance for the tested 1-minute-long PCM.

@xenova (Owner) commented Aug 3, 2023

Yes. If you see it is loaded but no worker threads are spawned, that is likely to be a bug.

@fs-eire Any updates on this maybe? 😅 Is there perhaps an issue with spawning workers from a worker itself? Here's a 60-second audio file for testing, if you need it: ted_60.wav

@xenova (Owner) commented Aug 8, 2023

@jozefchutka Can you maybe test with @josephrocca's previous suggestion? env.backends.onnx.wasm.proxy=true

Edit: Oooh, unless this is something that specifically occurs when ORT Web is loaded from within a web worker? I haven't tested that yet, since I've just been using the ort.env.wasm.proxy flag to get the model off the main thread "automatically".

@jozefchutka (Author)

@xenova, I observe no difference in performance, nor any extra threads/workers running, when testing with env.backends.onnx.wasm.proxy=true.

@xenova (Owner) commented Aug 8, 2023

@jozefchutka Did you try not using a worker.js file, and just keeping all transformers.js logic in the UI thread (but still using proxy=true)?

@jozefchutka (Author) commented Aug 8, 2023

This is a version without my worker, test.html:

<script type="module">
import { env, pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.0/dist/transformers.min.js";

env.allowLocalModels = false;
env.backends.onnx.wasm.proxy = true;

const file = "tos.pcm";
const model = "Xenova/whisper-base.en";
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model);

const t0 = performance.now();
const result = await pipe(buffer, {
	chunk_length_s: 30,
	stride_length_s: 5,
	return_timestamps: true});

for(let {text, timestamp} of result.chunks)
	console.log(`${timestamp[0]} -> ${timestamp[1]} ${text}`);

console.log(performance.now() - t0);
</script>

With this script, I can see 4 workers opened, however await pipeline() never resolves and the script basically hangs on that line. Can you please have a look?

@xenova (Owner) commented Aug 8, 2023

await pipeline() is never resolved

Are you sure it's not just downloading the model? Can you check your network tab?

I'll test this though.

@xenova (Owner) commented Aug 14, 2023

I've done a bit of benchmarking and there does not seem to be any speedup when using threads. The URL https://xenova-whisper-testing.hf.space/ consistently takes 3.8 seconds. I do see the threads spawn, though.

Also, using the proxy just freezes everything after spawning 6 threads.

@jozefchutka am I missing something? Is this also what you see?
@fs-eire I am still using onnxruntime-web v1.14.0 - is this something which was fixed in a later release?

@jozefchutka (Author) commented Aug 15, 2023

@xenova that's the same as what I have observed.

@guschmue

I just tried this with a simple app and it works fine for me.
Let me try with transformers.js next.
As long as you see ort-wasm-simd-threaded.wasm loading, it should work.
For testing, you can add
--enable-features=SharedArrayBuffer
to the Chrome command line to rule out any COEP/COOP issue.

@xenova (Owner) commented Aug 15, 2023

I just tried this with a simple app and it works fine for me.

Do you see speedups too? 👀

As long as you see ort-wasm-simd-threaded.wasm loading, it should work.

@guschmue It does seem to load this file when running this demo, but there are no performance improvements (all runs take about 3.7 seconds).

(screenshot: network tab showing ort-wasm-simd-threaded.wasm being loaded)

I am still using v1.14.0, so if something changed since then, I can update and check
