[Question] whisper vs. ort-wasm-simd-threaded.wasm #161

Open · jozefchutka opened this issue Jun 22, 2023 · 27 comments
Labels: question (Further information is requested)

Comments

@jozefchutka

While looking into https://cdn.jsdelivr.net/npm/@xenova/transformers@2.2.0/dist/transformers.js I can see a reference to ort-wasm-simd-threaded.wasm, however that one never seems to be loaded for whisper/automatic-speech-recognition ( https://huggingface.co/spaces/Xenova/whisper-web ), while it always uses ort-wasm-simd.wasm. I wonder if there is a way to enable or enforce threaded wasm and so improve transcription speed?

jozefchutka added the question label on Jun 22, 2023
@xenova (Owner) commented Jun 22, 2023

I believe this is due to how default HF spaces are hosted (which blocks usage of SharedArrayBuffer). Here's another thread discussing this: microsoft/onnxruntime#9681 (comment). I would be interested in seeing what performance benefits we could get though. cc'ing @fs-eire @josephrocca for some help too.

@jozefchutka (Author)

good spot! Simply adding:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

...will indeed make it load ort-wasm-simd-threaded.wasm. However, I do not see multiple workers spawned (as I would expect), nor any performance improvement.
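
For reference, here's a minimal sketch of how those headers could be served for local testing (the server.js file name and port 8080 are just assumptions for this example, not anything from this repo):

// server.js: tiny Node.js static file server that adds the COOP/COEP headers,
// so the page becomes cross-origin isolated and SharedArrayBuffer is available
import { createServer } from "node:http";
import { readFile } from "node:fs/promises";
import { extname } from "node:path";

const types = {
	".html": "text/html",
	".js": "text/javascript",
	".wasm": "application/wasm",
	".pcm": "application/octet-stream"
};

createServer(async (req, res) => {
	try {
		const path = "." + new URL(req.url, "http://localhost").pathname;
		const body = await readFile(path);
		res.writeHead(200, {
			"Content-Type": types[extname(path)] ?? "application/octet-stream",
			// the two headers that unlock the threaded wasm backend:
			"Cross-Origin-Opener-Policy": "same-origin",
			"Cross-Origin-Embedder-Policy": "require-corp"
		});
		res.end(body);
	} catch {
		res.writeHead(404);
		res.end();
	}
}).listen(8080);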

@josephrocca (Contributor)

@jozefchutka So, to confirm, it loaded the wasm-simd-threaded.wasm file but didn't spawn any threads? Can you check in the console whether self.crossOriginIsolated is true?

If it is true, then can you check the value of env.backends.onnx.wasm.numThreads?
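
For example, something along these lines in the page console (assuming the transformers.js env object is reachable there; otherwise log it from your own script):

// both values matter: the first gates SharedArrayBuffer, the second picks the thread count
console.log(self.crossOriginIsolated);          // must be true for threading to be possible
console.log(env.backends.onnx.wasm.numThreads); // undefined means "let ORT pick a default"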

Regarding Hugging Face spaces, I've opened an issue here for COEP/COOP header support: huggingface/huggingface_hub#1525

And in the meantime you can use the service worker hack on Spaces, mentioned here: https://github.com/orgs/community/discussions/13309#discussioncomment-3844940
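
(The idea behind that hack, very roughly, is a service worker that re-serves every response with the missing headers added; the sketch below assumes a sw.js file registered from the page, and the linked discussion has the complete version:)

// sw.js: add the COOP/COEP headers to every response so the page
// becomes cross-origin isolated even if the host doesn't send them
self.addEventListener("fetch", (event) => {
	event.respondWith((async () => {
		const original = await fetch(event.request);
		// opaque (no-cors) responses can't be rewrapped, pass them through
		if (original.status === 0) return original;
		const headers = new Headers(original.headers);
		headers.set("Cross-Origin-Opener-Policy", "same-origin");
		headers.set("Cross-Origin-Embedder-Policy", "require-corp");
		return new Response(original.body, {
			status: original.status,
			statusText: original.statusText,
			headers
		});
	})());
});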

@fs-eire commented Jun 22, 2023

An easy way to check is to open DevTools on that page and check whether typeof SharedArrayBuffer is "undefined".
( OK, checking self.crossOriginIsolated is the better way, good learning for me :) )

If the multi-thread feature is available and <ORT_ENV_OBJ>.wasm.numThreads is not set, ort-web will spawn [CPU-core-count / 2] (up to 4) threads.

If <ORT_ENV_OBJ>.wasm.numThreads is set, that number is used to spawn workers (the main thread counts toward it). Setting it to 1 force-disables the multi-thread feature.
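
In other words, roughly (a sketch of the default described above, not ORT's exact source):

// with numThreads unset, ORT Web picks about half the logical cores, capped at 4
const cores = navigator.hardwareConcurrency || 1;
const defaultThreads = Math.min(4, Math.ceil(cores / 2));
console.log(defaultThreads); // e.g. 4 on an 8-core machine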

@jozefchutka (Author) commented Jun 23, 2023

@josephrocca, @fs-eire the following is printed:
self.crossOriginIsolated -> true
env.backends.onnx.wasm.numThreads -> undefined

I have also tried explicitly setting numThreads to 4, but with the same result.

Something interesting to mention:

  • I use DevTools / Sources / Threads to observe running threads, where I see my index.html and my worker.js (which imports transformers.js); nothing else is reported until...
  • ...the very last moment, when the transformers.js transcription finishes, and then I can see 3 more threads appearing. I believe these have something to do with transformers running in threads (because they do not appear when I set numThreads=0), but I wonder: why do they appear so late, and why is there no performance difference?

@fs-eire commented Jun 23, 2023

env.backends.onnx.wasm.numThreads -> undefined

for onnxruntime-web, it's env.wasm.numThreads:

import { env } from 'onnxruntime-web';
env.wasm.numThreads = 4;

For transformers.js, I believe it is exposed in a different way.

@xenova (Owner) commented Jun 23, 2023

I believe @jozefchutka was doing it correctly:

env.backends.onnx.wasm.numThreads

I.e., env.backends.onnx is the onnx env variable
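
So, in transformers.js terms, the equivalent of the onnxruntime-web snippet above would be something like (a sketch; set it before creating the pipeline):

import { env } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.4.1/dist/transformers.min.js";

// env.backends.onnx forwards to onnxruntime-web's env object,
// so this is the same setting as env.wasm.numThreads above
env.backends.onnx.wasm.numThreads = 4;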

@jozefchutka (Author)

The env.backends.onnx.wasm object exists on the env object... so I think it's the right one...
Please let me know if I can assist/debug any further.

@xenova (Owner) commented Jul 14, 2023

I've been able to do some more testing on this and I am not seeing any performance improvements either... 🤔 ort-wasm-simd-threaded.wasm is being loaded, but doesn't seem to be working correctly. @fs-eire am I correct in saying that if SharedArrayBuffer is detected, it will default to the number of threads available? So, even in this case, setting env.backends.onnx.wasm.numThreads (which is correct) would not be necessary, and we should be seeing performance improvements either way.

@fs-eire commented Jul 14, 2023

I've been able to do some more testing on this and I am not seeing any performance improvements either... 🤔 ort-wasm-simd-threaded.wasm is being loaded, but doesn't seem to be working correctly. @fs-eire am I correct in saying that if SharedArrayBuffer is detected, it will default to the number of threads available? So, even in this case, setting env.backends.onnx.wasm.numThreads (which is correct) would not be necessary, and we should be seeing performance improvements either way.

The code is here:
https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/backend-wasm.ts#L29-L32

This code runs only once, when the first inference session is created.

@xenova (Owner) commented Jul 14, 2023

Right, so we should be seeing a performance improvement by simply having loaded ort-wasm-simd-threaded.wasm?

@fs-eire commented Jul 14, 2023

Yes. If you see it is loaded but no worker threads are spawned, that is likely to be a bug.

@xenova (Owner) commented Jul 14, 2023

Yes, this is something mentioned above by @jozefchutka:

Something interesting to mention:

  • I use DevTools / Sources / Threads to observe running threads, where I see my index.html and my worker.js (which imports transformers.js); nothing else is reported until...
  • ...the very last moment, when the transformers.js transcription finishes, and then I can see 3 more threads appearing. I believe these have something to do with transformers running in threads (because they do not appear when I set numThreads=0), but I wonder: why do they appear so late, and why is there no performance difference?

transformers.js does not do anything extra when it comes to threading, so I do believe this is an issue with onnxruntime-web. Please let me know if there's anything I can do to help debug/test.

@josephrocca (Contributor) commented Jul 14, 2023

@xenova Unless I misunderstand, you or @jozefchutka might need to provide a minimal example here. I don't see the problem of worker threads appearing too late (i.e. after inference) in this fairly minimal demo, for example:

https://josephrocca.github.io/openai-clip-js/onnx-image-demo.html

That's using the latest ORT Web version, and has self.crossOriginIsolated === true [0]. I see ort-wasm-simd-threaded.wasm load in the network tab, and worker threads immediately appear, before inference happens.

Edit: Oooh, unless this is something that specifically occurs when ORT Web is loaded from within a web worker? I haven't tested that yet, since I've just been using the ort.env.wasm.proxy flag to get the model off the main thread "automatically".


[0] Just a heads-up: For some reason I had to manually refresh the page the first time I loaded it just now - the service worker that adds the COOP/COEP headers didn't refresh automatically like it's supposed to.

@fs-eire commented Jul 15, 2023

If you use the ort.env.wasm.proxy flag, the proxy worker is spawned immediately. This is a different worker from the workers created for multithreaded computation.

@xenova (Owner) commented Jul 16, 2023

Should we see performance improvements even if the batch size is 1? Could you maybe explain how work is divided among threads @fs-eire?

Regarding a demo, @jozefchutka would you mind sharing the code you were referring to above? My testing was done inside my whisper-web application, which is quite large and has a lot of bloat around it.

@jozefchutka (Author)

Here is a demo:

worker.js

import { env, pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.4.1/dist/transformers.min.js";

env.allowLocalModels = false;
//env.backends.onnx.wasm.numThreads = 4;

const file = "tos.pcm";
const model = "Xenova/whisper-base.en";
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model);

const t0 = performance.now();
const result = await pipe(buffer, {
	chunk_length_s: 30,
	stride_length_s: 5,
	return_timestamps: true});

for(let {text, timestamp} of result.chunks)
	console.log(`${timestamp[0]} -> ${timestamp[1]} ${text}`);

console.log(performance.now() - t0);

demo.html

<script>
new Worker("worker.js", {type:"module"});
</script>

A command to generate the .pcm file:

ffmpeg -i tos.mp4 -filter_complex [0:1]aformat=channel_layouts=mono,aresample=16000[aout] -map [aout] -c:a pcm_f32le -f data tos.pcm

Changing the value of env.backends.onnx.wasm.numThreads makes no difference in transcription performance for the tested 1-minute-long PCM.

@xenova (Owner) commented Aug 3, 2023

Yes. If you see it is loaded but no worker threads are spawned, that is likely to be a bug.

@fs-eire Any updates on this maybe? 😅 Is there perhaps an issue with spawning workers from a worker itself? Here's a 60-second audio file for testing, if you need it: ted_60.wav

@xenova (Owner) commented Aug 8, 2023

@jozefchutka Can you maybe test with @josephrocca's previous suggestion? env.backends.onnx.wasm.proxy=true

Edit: Oooh, unless this is something that specifically occurs when ORT Web is loaded from within a web worker? I haven't tested that yet, since I've just been using the ort.env.wasm.proxy flag to get the model off the main thread "automatically".

@jozefchutka (Author)

@xenova, I observe no difference in performance, nor any extra threads/workers running, when testing with env.backends.onnx.wasm.proxy=true.

@xenova (Owner) commented Aug 8, 2023

@jozefchutka Did you try not using a worker.js file, and just keeping all transformers.js logic in the UI thread (but still using proxy=true)?

@jozefchutka (Author) commented Aug 8, 2023

This is a version without my worker, test.html:

<script type="module">
import { env, pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.0/dist/transformers.min.js";

env.allowLocalModels = false;
env.backends.onnx.wasm.proxy = true;

const file = "tos.pcm";
const model = "Xenova/whisper-base.en";
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model);

const t0 = performance.now();
const result = await pipe(buffer, {
	chunk_length_s: 30,
	stride_length_s: 5,
	return_timestamps: true});

for(let {text, timestamp} of result.chunks)
	console.log(`${timestamp[0]} -> ${timestamp[1]} ${text}`);

console.log(performance.now() - t0);
</script>

With this script, I can see 4 workers opened, however await pipeline() never resolves and the script basically hangs on that line. Can you please have a look?

@xenova (Owner) commented Aug 8, 2023

await pipeline() is never resolved

Are you sure it's not just downloading the model? Can you check your network tab?

I'll test this though.

@xenova (Owner) commented Aug 14, 2023

I've done a bit of benchmarking and there does not seem to be any speedup when using threads. The URL https://xenova-whisper-testing.hf.space/ consistently takes 3.8 seconds. I do see the threads spawn, though.

Also, using the proxy just freezes everything after spawning 6 threads.

@jozefchutka am I missing something? Is this also what you see?
@fs-eire I am still using onnxruntime-web v1.14.0 - is this something which was fixed in a later release?

@jozefchutka (Author) commented Aug 15, 2023

@xenova that's the same as what I have observed.

@guschmue

I just tried this with a simple app and it works fine for me.
Let me try with transformers.js next.
As long as you see ort-wasm-simd-threaded.wasm loading, it should work.
For testing, you can add
--enable-features=SharedArrayBuffer
to the Chrome command line to rule out any COEP/COOP issue.

@xenova (Owner) commented Aug 15, 2023

I just tried this with a simple app and it works fine for me.

Do you see speedups too? 👀

As long as you see ort-wasm-simd-threaded.wasm loading, it should work.

@guschmue It does seem to load this file when running this demo, but there are no performance improvements (all runs take about 3.7 seconds).

(screenshot: network tab showing ort-wasm-simd-threaded.wasm being loaded)

I am still using v1.14.0, so if something changed since then, I can update and check
