
Allow to hint number of threads for CPU MLContext #436

Open

huningxin opened this issue Jun 28, 2023 · 9 comments

@huningxin
Contributor

Framework use cases

Multi-core architectures are widely available in modern CPUs, and ML frameworks commonly utilize them to parallelize operator computation when inferring a model.

However, the appropriate number of threads (degree of parallelism) depends on the usage scenario. For small models, for example, single-threaded execution may be preferable because the task-scheduling overhead can outweigh the speedup of parallel execution.

ML frameworks therefore usually allow users to control the number of threads according to their requirements. For example, ONNX Runtime allows configuring intra_op_num_threads for its CPU execution provider, and TensorFlow Lite provides a setNumThreads method on its interpreter.
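
For concreteness, a minimal sketch of the ONNX Runtime configuration, here via its Node.js binding ('model.onnx' is a placeholder path):

```ts
import * as ort from 'onnxruntime-node';

// For a small model, a single intra-op thread can avoid scheduling overhead.
// 'model.onnx' is a placeholder path, not a real model.
const session = await ort.InferenceSession.create('model.onnx', {
  intraOpNumThreads: 1,
});
```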

Native ML APIs

Native CPU ML APIs/libraries commonly employ a thread pool for thread-level parallelism, and usually allow configuring the number of threads in that pool. For example:

XNNPACK utilizes pthreadpool, which allows configuring threads_count when creating the thread pool.

MLAS utilizes onnxruntime::concurrency::ThreadPool, which constructs a thread pool that runs with degree_of_parallelism threads.

BNNS allows setting n_threads, which controls the number of worker threads used to execute a kernel.

Other references

The Model Loader API already extends MLContextOptions with numThreads, which allows JS code to set the number of threads to use when computing a model.
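
A sketch of that usage, based on my reading of the Model Loader explainer (numThreads belongs to that proposal, not to today's WebNN; the device option name here is assumed):

```ts
// Sketch per the Model Loader API explainer; numThreads is that proposal's
// extension and is not part of the current WebNN spec.
const context = await navigator.ml.createContext({
  deviceType: 'cpu',  // assumed: WebNN's device option at the time
  numThreads: 2,      // presumably 0 would leave the choice to the implementation
});
```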

Proposal

WebNN could adopt the MLContextOptions.numThreads extension and allow frameworks to hint the number of threads used to run operators in parallel for a CPU MLContext.
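
As a hypothetical TypeScript shape (not spec text), the extended dictionary might look like:

```ts
// Hypothetical shape of the extended options; numThreads is the proposed
// hint and does not exist in the WebNN spec.
interface MLContextOptions {
  deviceType?: 'cpu' | 'gpu';  // as in WebNN at the time of this issue
  numThreads?: number;         // hint only; implementations may clamp or ignore it
}
```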

/cc @pyu10055 @wacky6

@fdwr
Collaborator

fdwr commented Sep 9, 2023

MLContextOptions.threadCount seems a useful option to me, as an ignorable hint at least. Do JS users have enough information to set an appropriate value for it? Setting a higher thread count than actual cores would be useless (but maybe the API would just clamp to the actual count). The two most useful and most commonly set values would presumably be either 1 or the number of physical cores.

(for naming, I'd follow the "use whole words" identifier advice and avoid fragments like "num")

@dontcallmedom
Contributor

If it's a hint rather than a configuration, and if setting the exact number depends on information not exposed to developers (the number of cores), maybe a better approach would be an enum that hints towards single- or multi-threaded execution?

@wacky6

wacky6 commented Sep 11, 2023

navigator.hardwareConcurrency is available to developers, so I think the number of cores is a known value?
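
For instance, a sketch of deriving the hint from navigator.hardwareConcurrency (the small-model check and the cap of 4 are illustrative choices, and modelIsSmall is a hypothetical flag):

```ts
// Illustrative only: derive a thread-count hint from hardwareConcurrency.
const cores = navigator.hardwareConcurrency || 1;  // not exposed in every browser
const numThreads = modelIsSmall ? 1 : Math.min(cores, 4);  // modelIsSmall: hypothetical
```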

Single- vs. multi-threaded doesn't provide sufficient granularity.

@huningxin Is there public data we can share here? Like an X-Y plot of (thread count, performance) for the models we want to support?

@dontcallmedom
Contributor

My understanding is that hardwareConcurrency is not supported in Safari, so it may still be advantageous not to rely on having that number exposed; but I defer to actual experts on the level of granularity that would be needed for effective optimization.

@huningxin
Contributor Author

> @huningxin Is there public data we can share here? Like an X-Y plot of (thread count, performance) for the models we want to support?

Yes. We collected the inference latency of some MediaPipe models on the Chromium WebNN XNNPACK CPU prototype with different numbers of threads (1, 2 and 4).

In the current Chromium prototype implementation, the number of threads is capped at the minimum of 4 and the number of available cores. And because the parallel inference jobs are scheduled by Chromium's ThreadPool, there is no guarantee that the number of threads set by the user will actually be allocated.

In the following chart, the multi-threaded inference speedup is normalized to single-threaded (numThreads=1) performance. As the chart illustrates, for some models, such as SelfieSegmenter (landscape), MobileNetV3 (small_075), BlazeFace (short-range), Blendshape and FaceDetector, setting a higher number of threads doesn't help. These models are usually small.

[Chart: inference speedup vs. thread count (1, 2, 4) for the tested MediaPipe models, normalized to single-threaded performance]

@dontcallmedom
Contributor

If I read the chart correctly, there is only one case where setting the number of threads to something other than 1 or the max leads to better performance (GestureClassifier). Can anyone hint as to why 2 threads are optimal for that particular model?

@huningxin
Contributor Author

@dontcallmedom

> Can anyone hint as to why 2 threads are optimal for that particular model?

I suppose the context-switching / job-scheduling overhead would outweigh the inference-time reduction from adding two more threads / jobs for that particular model.

@fdwr
Collaborator

fdwr commented Sep 13, 2023

> Can anyone hint as to why 2 threads are optimal for that particular model?

🤔 It's also possible, due to graph topology, that an odd number of threads assigns nodes such that more sequential dependencies occur with 3 threads (edit: oops, you said 4 above), whereas with 2 threads more long-running operators happen to align nicely.

@huningxin : Would this new MLContextOptions.threadCount represent inter-operator threading or intra-operator threading? (Or really, however the backend chooses to interpret it?)

@huningxin
Contributor Author

@fdwr

> 🤔 It's also possible, due to graph topology, that an odd number of threads assigns nodes such that more sequential dependencies occur with 3 threads, whereas with 2 threads more long-running operators happen to align nicely.

This seems possible, although we didn't test with 3 threads.

> @huningxin : Would this new MLContextOptions.threadCount represent inter-operator threading or intra-operator threading? (Or really, however the backend chooses to interpret it?)

This is a good point. The current prototype implementation interprets it as intra-operator threading. Should we allow developers to hint inter-operator threading and intra-operator threading separately?
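
For illustration, a split hint might look like this hypothetical shape (the option names are borrowed from ONNX Runtime's intra-op/inter-op terminology; neither option is proposed spec text):

```ts
// Hypothetical: separate hints, mirroring ONNX Runtime's distinction.
const context = await navigator.ml.createContext({
  deviceType: 'cpu',
  intraOpNumThreads: 4,  // parallelism within a single operator
  interOpNumThreads: 2,  // operators that may execute concurrently
});
```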

aarongable pushed a commit to chromium/chromium that referenced this issue May 1, 2024
This is a temporary solution to close the perf gap between the TFLite
backend and the XNNPACK backend, which will allow us to delete the
XNNPACK backend

Long-term discussions of how to specify this behavior are happening on
webmachinelearning/webnn#436

Bug: 338162119
Change-Id: I42199744f4a8f3e685cc550dcd013183be65aeb3
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5506116
Reviewed-by: Reilly Grant <reillyg@chromium.org>
Reviewed-by: Alex Gough <ajgo@chromium.org>
Auto-Submit: Austin Sullivan <asully@chromium.org>
Commit-Queue: Alex Gough <ajgo@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1295088}