
[Mlas] Unblock hardcoded matmul blocking size #23815

Merged: 2 commits into main, Feb 27, 2025
Conversation

fajin-corp
Contributor

@fajin-corp fajin-corp commented Feb 25, 2025

Description

In GemmBatch, target matrix is cut into blocks to dispatch to multiple threads for intra-op parallelism.

Currently, the block size is hard-coded to 16. If the CPU has more than 16 cores, the cores are not fully utilized by a single op.

This change removes the cap on the number of blocks in the various MatMul implementations, so the block count can scale with the number of available threads.

Benchmark results

Model: llmlingua-2-bert-base-multilingual-cased-meetingbank--add-force-token-100--max-seq-len-512-CPU-INT8.onnx
Setup: 96-core x86 Linux

Before:
Setting intra_op_num_threads to 64
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.485097 s
First inference time cost: 356 ms
Total inference time cost: 17.731 s
Total inference requests: 50
Average inference time cost: 354.619 ms
Total inference run time: 17.7312 s
Number of inferences per second: 2.81989
Avg CPU usage: 65 %
Peak working set size: 542265344 bytes

After:

Setting intra_op_num_threads to 32
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.523394 s
First inference time cost: 316 ms
Total inference time cost: 12.2739 s
Total inference requests: 50
Average inference time cost: 245.478 ms
Total inference run time: 12.2741 s
Number of inferences per second: 4.07362
Avg CPU usage: 33 %
Peak working set size: 611241984 bytes

Setting intra_op_num_threads to 64
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.497698 s
First inference time cost: 289 ms
Total inference time cost: 9.49205 s
Total inference requests: 50
Average inference time cost: 189.841 ms
Total inference run time: 9.49226 s
Number of inferences per second: 5.26745
Avg CPU usage: 65 %
Peak working set size: 548470784 bytes
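At equal thread counts (64 intra-op threads), the average latency drop from 354.619 ms to 189.841 ms works out to roughly a 1.87x speedup, consistent with the throughput ratio (5.26745 / 2.81989 inferences per second). A quick check:

```python
# Average inference time at 64 intra-op threads, from the logs above.
before_ms = 354.619
after_ms = 189.841

speedup = before_ms / after_ms
print(f"{speedup:.2f}x")  # -> 1.87x
```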

Motivation and Context

This issue was reported by the M365 research team.

@fajin-corp fajin-corp requested a review from a team as a code owner February 25, 2025 23:39
@fajin-corp fajin-corp merged commit c61a4b1 into main Feb 27, 2025
97 of 99 checks passed
@fajin-corp fajin-corp deleted the fajin/mlas-threading branch February 27, 2025 21:24
guschmue pushed a commit that referenced this pull request Mar 6, 2025