[intel-npu] refine Microsoft MatMulNBits op to fit the NPU mixed-precision pattern #29671
+23
−1
Details:
When MatMulNBits meets the NPU's requirements (symmetric INT4, channel-wise quantization), performance improves substantially. This PR optimizes the decomposition ops (Convert/Multiply/MatMul) so they match the NPU's optimized pattern. With the INT4 Phi-3 ONNX model running through NPUW:

| | 1st token | 2nd token | 3rd+ token (avg) |
|---|---|---|---|
| Before this PR | 8000 ms | 9040 ms | 4000 ms |
| With this PR | 3400 ms | 3400 ms | 250 ms |

That is roughly a 16× speedup (on 3rd+ tokens) in NPUW for the Phi-3 model.
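For context, below is a minimal NumPy sketch of the Convert/Multiply/MatMul decomposition this PR targets. The shapes, names, and values are illustrative assumptions for a symmetric INT4, channel-wise-quantized weight; this is not the actual OpenVINO transformation code.

```python
import numpy as np

# Hypothetical dimensions: K = reduction dim, N = output channels.
rng = np.random.default_rng(0)
K, N = 64, 32

# Stand-in for unpacked symmetric INT4 weights (range [-8, 7]),
# with one scale per output channel, i.e. channel-wise quantization.
w_int4 = rng.integers(-8, 8, size=(K, N)).astype(np.int8)
scales = rng.random((1, N)).astype(np.float16)

x = rng.random((1, K)).astype(np.float16)  # activation

# MatMulNBits expressed as the three decomposition ops the PR refines:
w_f16 = w_int4.astype(np.float16)   # Convert: INT4 -> f16
w_deq = w_f16 * scales              # Multiply: apply per-channel scales
y = x @ w_deq                       # MatMul: activation x dequantized weight

print(y.shape)  # (1, 32)
```

The point of the refinement, as described above, is that when this Convert/Multiply/MatMul sequence is laid out in the shape the NPU compiler recognizes (symmetric INT4 weights with per-channel scales), NPUW can fuse it into its fast mixed-precision path instead of executing the dequantization at full precision.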