[intel-npu] Refine Microsoft MatMulNBits op to fit NPU mixed-precision pattern #29671

Open: wants to merge 1 commit into base: master

bopeng1234 (Contributor)

Details:

  • When MatMulNBits meets the NPU requirements (symmetric INT4, channel-wise quantization), performance improves substantially.

      // For NPU optimization
      // Reference: https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/text_generation#npu-support
      //  1. The model must be exported with symmetric INT4 quantization.
      //  2. Quantized LLM MatMul ops should use channel-wise quantization (block_size = K) for better performance.
      // Under these constraints, the NPU vpux-plugin compiler can use a mixed-precision MatMul with an i4 weight input,
      // which improves NPU performance substantially.
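The two constraints above can be illustrated with a small reference computation. This is a sketch, not the code from this PR: `fits_npu_pattern`, `matmul_dequant_first` and `matmul_scale_after` are hypothetical names. It assumes the weights are already unpacked to signed int4 values in [-8, 7], one per `int8_t` (i.e. symmetric quantization, with MatMulNBits' default zero point of 8 for 4-bit weights already subtracted), and that there is exactly one scale per output channel (block_size = K). Under those assumptions the Convert → Multiply → MatMul chain computes the same result as a dot product over the raw int4 weights followed by a single per-channel Multiply, which is one way to see why a compiler can keep the weight input in i4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical check: does a MatMulNBits configuration match the NPU
// mixed-precision pattern (symmetric INT4, channel-wise: block_size == K)?
bool fits_npu_pattern(int bits, bool has_zero_points, int block_size, int K) {
    return bits == 4 && !has_zero_points && block_size == K;
}

// Validates shapes: X is M x K, Q is N x K (row-major), one scale per output channel.
void check_shapes(const std::vector<float>& X, const std::vector<int8_t>& Q,
                  const std::vector<float>& scale, int M, int N, int K) {
    assert(X.size() == static_cast<std::size_t>(M) * K);
    assert(Q.size() == static_cast<std::size_t>(N) * K);
    assert(scale.size() == static_cast<std::size_t>(N));
}

// Reference decomposition: Convert(i4 -> f32) -> Multiply(scale) -> MatMul.
// Each weight is dequantized before it enters the dot product.
std::vector<float> matmul_dequant_first(const std::vector<float>& X,
                                        const std::vector<int8_t>& Q,
                                        const std::vector<float>& scale,
                                        int M, int N, int K) {
    check_shapes(X, Q, scale, M, N, K);
    std::vector<float> Y(static_cast<std::size_t>(M) * N, 0.0f);
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                float w = static_cast<float>(Q[n * K + k]) * scale[n];  // Convert + Multiply
                acc += X[m * K + k] * w;                                 // MatMul accumulation
            }
            Y[m * N + n] = acc;
        }
    }
    return Y;
}

// Same math with the per-channel scale factored out of the K-sum: the dot
// product reads only raw int4 values, then one Multiply per output element.
// In exact arithmetic both functions agree; in floating point they may
// differ by rounding.
std::vector<float> matmul_scale_after(const std::vector<float>& X,
                                      const std::vector<int8_t>& Q,
                                      const std::vector<float>& scale,
                                      int M, int N, int K) {
    check_shapes(X, Q, scale, M, N, K);
    std::vector<float> Y(static_cast<std::size_t>(M) * N, 0.0f);
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += X[m * K + k] * static_cast<float>(Q[n * K + k]);
            }
            Y[m * N + n] = acc * scale[n];  // single per-channel scale
        }
    }
    return Y;
}
```

With block_size < K there are K / block_size scales per output channel, so the scale can no longer be pulled out of the whole K-sum; each block would need its own partial sum and Multiply, which is why the channel-wise layout is the convenient case.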
    

This PR optimizes the decomposition ops (Convert/Multiply/MatMul) so that they match the NPU optimization pattern. Before this PR, the INT4 Phi-3 ONNX model running on NPUW measured:
1st token 8000 ms, 2nd token 9040 ms, 3rd+ tokens avg 4000 ms

With this PR applied, the same model reaches:
1st token 3400 ms, 2nd token 3400 ms, 3rd+ tokens avg 250 ms

That is a 16x speedup in average 3rd+ token latency (from 4000 ms to 250 ms) for Phi-3 on NPUW, and roughly 2.4x and 2.7x on the 1st and 2nd tokens.
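The reported figures can be checked with simple arithmetic. The sketch below only hard-codes the latencies listed in this description; `TokenLatency`, `speedup` and `tokens_per_second` are illustrative helper names, not part of OpenVINO or NPUW.

```cpp
#include <cassert>
#include <cstdio>

// Per-token latencies (milliseconds) reported above for the INT4 Phi-3 ONNX model on NPUW.
struct TokenLatency {
    double first_ms;
    double second_ms;
    double rest_avg_ms;  // average latency of the 3rd and later tokens
};

constexpr TokenLatency kBefore{8000.0, 9040.0, 4000.0};  // without this PR
constexpr TokenLatency kAfter{3400.0, 3400.0, 250.0};    // with this PR

// Speedup factor: old latency divided by new latency.
double speedup(double before_ms, double after_ms) {
    assert(after_ms > 0.0);
    return before_ms / after_ms;
}

// Tokens generated per second for a given per-token latency.
double tokens_per_second(double latency_ms) {
    assert(latency_ms > 0.0);
    return 1000.0 / latency_ms;
}

// Prints the speedup for each latency reported in the PR description.
void print_speedups() {
    std::printf("1st token:   %.2fx\n", speedup(kBefore.first_ms, kAfter.first_ms));
    std::printf("2nd token:   %.2fx\n", speedup(kBefore.second_ms, kAfter.second_ms));
    std::printf("3rd+ tokens: %.2fx (%.2f -> %.2f tokens/s)\n",
                speedup(kBefore.rest_avg_ms, kAfter.rest_avg_ms),
                tokens_per_second(kBefore.rest_avg_ms),
                tokens_per_second(kAfter.rest_avg_ms));
}
```

In throughput terms, the average decode rate for the 3rd and later tokens goes from 0.25 to 4 tokens/s.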

bopeng1234 requested a review from a team as a code owner March 25, 2025 05:18
github-actions bot added the category: ONNX FE (OpenVINO ONNX FrontEnd) label Mar 25, 2025
sys-openvino-ci added the ExternalIntelPR (External contributor from Intel) label Mar 25, 2025