Describe the issue
We have implemented a performance optimization in the MLAS backend of ONNX Runtime that improves CPU utilization for convolution workloads with multiple groups or large batch sizes. With this change, we observe near-linear performance scaling with increasing intra_op_num_threads in multi-group and large-batch convolution scenarios.
We would like to ask:
- Does this kind of optimization align with the upstream design goals for MLAS?
- If so, would the ONNX Runtime team be open to reviewing a PR that introduces this optimization?
- Are there any guidelines or preferred patterns for contributing such low-level performance improvements to MLAS?
To reproduce
This optimization targets CPU inference scenarios with either of the following (a minimal model sketch follows this list):
- Convolution models using group > 1, or
- Convolutions with large batch size (e.g., batch ≥ 32)
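For concreteness, here is a minimal sketch of how such a model can be constructed with the onnx helper API. The shapes (batch 32, 64 channels, 8 groups, 3×3 kernel) and the file name grouped_conv.onnx are illustrative choices, not taken from our actual workload:

```python
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

# Illustrative shapes (not our production model): batch 32, 64 channels, 8 groups.
batch, cin, cout, groups, h, w, k = 32, 64, 64, 8, 56, 56, 3

x = helper.make_tensor_value_info("x", TensorProto.FLOAT, [batch, cin, h, w])
y = helper.make_tensor_value_info("y", TensorProto.FLOAT, [batch, cout, h, w])

# Grouped convolution weights have shape [cout, cin / groups, k, k].
weights = numpy_helper.from_array(
    np.random.rand(cout, cin // groups, k, k).astype(np.float32), name="w"
)

conv = helper.make_node(
    "Conv", inputs=["x", "w"], outputs=["y"],
    group=groups, kernel_shape=[k, k], pads=[1, 1, 1, 1],
)

graph = helper.make_graph([conv], "grouped_conv", [x], [y], initializer=[weights])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model)
onnx.save(model, "grouped_conv.onnx")
```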
To observe the issue without the optimization (see the benchmark sketch after this list):
- Create an ort.SessionOptions(), set its intra_op_num_threads to N (e.g., 16 or 32), and pass it to the session to explicitly control the number of intra-op threads.
- Benchmark the execution time and per-thread utilization of convolution operators with multiple groups or large batch sizes.
- Measure the overall inference latency of such models and compare core utilization across threads.
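A minimal benchmark sketch along these lines, assuming the grouped_conv.onnx model from the sketch above (the thread counts and repetition count are arbitrary). Per-core utilization can be watched externally, e.g. with top -H or perf, while the loop runs:

```python
import time
import numpy as np
import onnxruntime as ort

x = np.random.rand(32, 64, 56, 56).astype(np.float32)  # matches the model sketch above

for n_threads in (1, 4, 16, 32):
    so = ort.SessionOptions()
    so.intra_op_num_threads = n_threads
    sess = ort.InferenceSession(
        "grouped_conv.onnx", sess_options=so, providers=["CPUExecutionProvider"]
    )

    sess.run(None, {"x": x})  # warm-up run
    runs = 50
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {"x": x})
    avg_ms = (time.perf_counter() - start) / runs * 1000
    print(f"intra_op_num_threads={n_threads}: {avg_ms:.2f} ms/run")
```

With the current MLAS code path we would expect the per-run latency to stop improving well before the thread count saturates the cores; with our optimization it scales close to linearly for these grouped/large-batch cases.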
Urgency
No response
Platform
Linux
OS Version
Ubuntu 20.04 (x86_64)
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
master
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Other / Unknown
Execution Provider Library Version
MLAS (default CPU execution provider)
Model File
No response
Is this a quantized model?
No