Describe the issue
We have implemented a performance optimization in the MLAS backend of ONNX Runtime that improves CPU utilization for convolution workloads with multiple groups or large batch sizes. With this change, we observe near-linear performance scaling with increasing intra_op_num_threads in multi-group and large-batch convolution scenarios.
We would like to ask:
- Does this kind of optimization align with the upstream design goals for MLAS?
- If so, would the ONNX Runtime team be open to reviewing a PR that introduces this optimization?
- Are there any guidelines or preferred patterns for contributing such low-level performance improvements to MLAS?
To reproduce
This optimization targets CPU inference scenarios with either of the following (a minimal model sketch follows this list):
- Convolution models using group > 1, or
- Convolutions with large batch size (e.g., batch ≥ 32)
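For concreteness, here is a minimal sketch of how such a model can be constructed with the onnx helper API. The shapes (batch 32, 64 channels, 8 groups, 3×3 kernel) and the file name grouped_conv.onnx are illustrative choices, not taken from our actual workload:

```python
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

# Illustrative shapes (not our production model): batch 32, 64 channels, 8 groups.
batch, cin, cout, groups, h, w, k = 32, 64, 64, 8, 56, 56, 3

x = helper.make_tensor_value_info("x", TensorProto.FLOAT, [batch, cin, h, w])
y = helper.make_tensor_value_info("y", TensorProto.FLOAT, [batch, cout, h, w])

# Grouped convolution weights have shape [cout, cin / groups, k, k].
weights = numpy_helper.from_array(
    np.random.rand(cout, cin // groups, k, k).astype(np.float32), name="w"
)

conv = helper.make_node(
    "Conv", inputs=["x", "w"], outputs=["y"],
    group=groups, kernel_shape=[k, k], pads=[1, 1, 1, 1],
)

graph = helper.make_graph([conv], "grouped_conv", [x], [y], initializer=[weights])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model)
onnx.save(model, "grouped_conv.onnx")
```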
To observe the issue without the optimization (see the benchmark sketch after this list):
- Create an ort.SessionOptions(), set its intra_op_num_threads to N (e.g., 16 or 32), and pass it to the session to explicitly control the number of intra-op threads.
- Benchmark the execution time and per-thread utilization of convolution operators with multiple groups or large batch sizes.
- Measure the overall inference latency of such models and compare core utilization across threads.
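A minimal benchmark sketch along these lines, assuming the grouped_conv.onnx model from the sketch above (the thread counts and repetition count are arbitrary). Per-core utilization can be watched externally, e.g. with top -H or perf, while the loop runs:

```python
import time
import numpy as np
import onnxruntime as ort

x = np.random.rand(32, 64, 56, 56).astype(np.float32)  # matches the model sketch above

for n_threads in (1, 4, 16, 32):
    so = ort.SessionOptions()
    so.intra_op_num_threads = n_threads
    sess = ort.InferenceSession(
        "grouped_conv.onnx", sess_options=so, providers=["CPUExecutionProvider"]
    )

    sess.run(None, {"x": x})  # warm-up run
    runs = 50
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {"x": x})
    avg_ms = (time.perf_counter() - start) / runs * 1000
    print(f"intra_op_num_threads={n_threads}: {avg_ms:.2f} ms/run")
```

With the current MLAS code path we would expect the per-run latency to stop improving well before the thread count saturates the cores; with our optimization it scales close to linearly for these grouped/large-batch cases.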
Urgency
No response
Platform
Linux
OS Version
Ubuntu 20.04 (x86_64)
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
master
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Other / Unknown
Execution Provider Library Version
MLAS (default CPU execution provider)
Model File
No response
Is this a quantized model?
No