
Allow to hint number of threads for CPU MLContext #436

Open

huningxin opened this issue Jun 28, 2023 · 9 comments

@huningxin
Contributor

Framework use cases

Multi-core architectures are widely available in modern CPUs, and ML frameworks commonly utilize them to parallelize operator computation when inferring a model.

However, the appropriate number of threads (degree of parallelism) depends on the usage scenario. For small models, for example, single-threaded execution may be preferable because the task-scheduling overhead can outweigh the speedup of parallel execution.

ML frameworks therefore usually allow users to control the number of threads according to their requirements. For example, ONNX Runtime allows configuring intra_op_num_threads for its CPU execution provider, and TensorFlow Lite provides a setNumThreads method on its interpreter.
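
For concreteness, a minimal sketch of the ONNX Runtime configuration, here via its Node.js binding ('model.onnx' is a placeholder path):

```ts
import * as ort from 'onnxruntime-node';

// For a small model, a single intra-op thread can avoid scheduling overhead.
// 'model.onnx' is a placeholder path, not a real model.
const session = await ort.InferenceSession.create('model.onnx', {
  intraOpNumThreads: 1,
});
```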

Native ML APIs

Native CPU ML APIs/libraries commonly employ a thread pool for thread-level parallelism, and usually allow configuring the number of threads in that pool. For example:

XNNPACK utilizes pthreadpool, which allows configuring threads_count when creating the thread pool.

MLAS utilizes onnxruntime::concurrency::ThreadPool, which constructs a thread pool that runs with degree_of_parallelism threads.

BNNS allows setting n_threads, which controls the number of worker threads used to execute a kernel.

Other references

The Model Loader API already extends MLContextOptions with numThreads, which allows JS code to set the number of threads to use when computing a model.
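
A sketch of that usage, based on my reading of the Model Loader explainer (numThreads belongs to that proposal, not to today's WebNN; the device option name here is assumed):

```ts
// Sketch per the Model Loader API explainer; numThreads is that proposal's
// extension and is not part of the current WebNN spec.
const context = await navigator.ml.createContext({
  deviceType: 'cpu',  // assumed: WebNN's device option at the time
  numThreads: 2,      // presumably 0 would leave the choice to the implementation
});
```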

Proposal

WebNN could adopt the MLContextOptions.numThreads extension and allow frameworks to hint the number of threads used to run operators in parallel for a CPU MLContext.
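
As a hypothetical TypeScript shape (not spec text), the extended dictionary might look like:

```ts
// Hypothetical shape of the extended options; numThreads is the proposed
// hint and does not exist in the WebNN spec.
interface MLContextOptions {
  deviceType?: 'cpu' | 'gpu';  // as in WebNN at the time of this issue
  numThreads?: number;         // hint only; implementations may clamp or ignore it
}
```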

/cc @pyu10055 @wacky6

@fdwr
Collaborator

fdwr commented Sep 9, 2023

MLContextOptions.threadCount seems a useful option to me, as an ignorable hint at least. Do JS users have enough information to set an appropriate value for it? Setting a higher thread count than actual cores would be useless (but maybe the API would just clamp to the actual count). The two most useful and most commonly set values would presumably be either 1 or the number of physical cores.

(for naming, I'd follow the "use whole words" identifier advice and avoid fragments like "num")

@dontcallmedom
Contributor

If it's a hint rather than a configuration, and if setting the exact number depends on information not exposed to developers (the number of cores), maybe a better approach would be an enum that hints towards single- or multi-threaded execution?

@wacky6

wacky6 commented Sep 11, 2023

navigator.hardwareConcurrency is available to developers, so I think the number of cores is a known value?
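
For instance, a sketch of deriving the hint from navigator.hardwareConcurrency (the small-model check and the cap of 4 are illustrative choices, and modelIsSmall is a hypothetical flag):

```ts
// Illustrative only: derive a thread-count hint from hardwareConcurrency.
const cores = navigator.hardwareConcurrency || 1;  // not exposed in every browser
const numThreads = modelIsSmall ? 1 : Math.min(cores, 4);  // modelIsSmall: hypothetical
```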

Single- vs. multi-threaded doesn't provide sufficient granularity.

@huningxin Is there public data we can share here? Like an X-Y plot of (thread count, performance) for the models we want to support?

@dontcallmedom
Contributor

My understanding is that hardwareConcurrency is not supported in Safari, so it may still be advantageous not to rely on having that number exposed; but I defer to actual experts on the level of granularity that would be needed for effective optimization.

@huningxin
Contributor Author

> @huningxin Is there public data we can share here? Like an X-Y plot of (thread count, performance) for the models we want to support?

Yes. We collected the inference latency of some MediaPipe models on the Chromium WebNN XNNPACK CPU prototype with different numbers of threads (1, 2 and 4).

In the current Chromium prototype implementation, the number of threads is capped at the minimum of 4 and the number of available cores. And because the parallel inference jobs are scheduled by Chromium's ThreadPool, there is no guarantee that the number of threads set by the user will actually be allocated.

In the following chart, the multi-threaded inference speedup is normalized to single-threaded (numThreads=1) performance. As the chart illustrates, for some models, such as SelfieSegmenter (landscape), MobileNetV3 (small_075), BlazeFace (short-range), Blendshape and FaceDetector, setting a higher number of threads doesn't help. These models are usually small.

[Chart: inference speedup vs. thread count (1, 2, 4) for the tested MediaPipe models, normalized to single-threaded performance]

@dontcallmedom
Contributor

If I read the chart correctly, there is only one case where setting the number of threads to something other than 1 or the max leads to better performance (GestureClassifier). Can anyone hint as to why 2 threads are optimal for that particular model?

@huningxin
Contributor Author

@dontcallmedom

> Can anyone hint as to why 2 threads are optimal for that particular model?

I suppose the context-switching / job-scheduling overhead would outweigh the inference-time reduction from adding two more threads / jobs for that particular model.

@fdwr
Collaborator

fdwr commented Sep 13, 2023

> Can anyone hint as to why 2 threads are optimal for that particular model?

🤔 It's also possible, due to graph topology, that an odd number of threads assigns nodes such that more sequential dependencies occur with 3 threads (edit: oops, you said 4 above), whereas with 2 threads more long-running operators happen to align nicely.

@huningxin : Would this new MLContextOptions.threadCount represent inter-operator threading or intra-operator threading? (Or really, however the backend chooses to interpret it?)

@huningxin
Contributor Author

@fdwr

> 🤔 It's also possible, due to graph topology, that an odd number of threads assigns nodes such that more sequential dependencies occur with 3 threads, whereas with 2 threads more long-running operators happen to align nicely.

This seems possible, although we didn't test with 3 threads.

> @huningxin : Would this new MLContextOptions.threadCount represent inter-operator threading or intra-operator threading? (Or really, however the backend chooses to interpret it?)

This is a good point. The current prototype implementation interprets it as intra-operator threading. Should we allow developers to hint inter-operator threading and intra-operator threading separately?
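
For illustration, a split hint might look like this hypothetical shape (the option names are borrowed from ONNX Runtime's intra-op/inter-op terminology; neither option is proposed spec text):

```ts
// Hypothetical: separate hints, mirroring ONNX Runtime's distinction.
const context = await navigator.ml.createContext({
  deviceType: 'cpu',
  intraOpNumThreads: 4,  // parallelism within a single operator
  interOpNumThreads: 2,  // operators that may execute concurrently
});
```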

aarongable pushed a commit to chromium/chromium that referenced this issue May 1, 2024
This is a temporary solution to close the perf gap between the TFLite
backend and the XNNPACK backend, which will allow us to delete the
XNNPACK backend

Long-term discussions of how to specify this behavior are happening on
webmachinelearning/webnn#436

Bug: 338162119
Change-Id: I42199744f4a8f3e685cc550dcd013183be65aeb3
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5506116
Reviewed-by: Reilly Grant <reillyg@chromium.org>
Reviewed-by: Alex Gough <ajgo@chromium.org>
Auto-Submit: Austin Sullivan <asully@chromium.org>
Commit-Queue: Alex Gough <ajgo@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1295088}