
get_rows & dequantize function implementation for repacked weights of type q4_0 (q4_0x8) #3223


Open · wants to merge 6 commits into master

Conversation

@swetha097

  • This implements the GGML_OP_GET_ROWS operation specifically for the repacked (block-interleaved) 4-bit quantized format (q4_0x8) defined within ggml-cpu-aarch64.cpp

  • The changes allow for increased usage of the GEMM function (ggml_gemm_q4_0_8x8_q8_0) for the q4_0 type defined within ggml-cpu-aarch64.cpp; the gains observed from this PR are shown below.

[screenshot: performance comparison between the master and development (get_rows) branches]

master branch commit - 82f461eaa4e6a1ba29fc0dbdaa415a9934ee8a1d
development (get_rows) branch commit - d6cc466bd48dd27474ecb00c3baba2e8a887f6c4
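
For context, here is a minimal sketch of the q4_0x8 layout and a scalar decode path. It assumes the 8-byte-chunk interleave used by ggml's repacking; unpack_q4_0x8 and dequant_q4_0 are illustrative names rather than the PR's actual functions, and fp16-to-fp32 conversion of the scales is left to the caller.

    #include <cstdint>
    #include <cstring>

    #define QK4_0 32

    typedef uint16_t ggml_half; // fp16 storage; conversion helpers omitted

    struct block_q4_0   { ggml_half d;    uint8_t qs[QK4_0 / 2]; }; // 1 scale + 32 nibbles
    struct block_q4_0x8 { ggml_half d[8]; uint8_t qs[QK4_0 * 4]; }; // 8 blocks, interleaved

    // Reverse the 8-byte-chunk interleave to recover the 8 original blocks.
    // (Depending on the ggml version, repacking may also XOR the nibble bytes
    // with 0x88; that detail is omitted here.)
    static void unpack_q4_0x8(const block_q4_0x8 * in, block_q4_0 out[8]) {
        const int chunk = 8;                          // interleave granularity in bytes
        const int n     = sizeof(in->qs) / chunk;     // 16 chunks per x8 block
        for (int i = 0; i < 8; ++i) {
            out[i].d = in->d[i];
        }
        for (int i = 0; i < n; ++i) {
            const int src_id     = i % 8;             // original block index
            const int src_offset = (i / 8) * chunk;   // offset inside that block
            memcpy(out[src_id].qs + src_offset, in->qs + i * chunk, chunk);
        }
    }

    // Standard q4_0 decode: low nibble -> element j, high nibble -> element j + 16.
    static void dequant_q4_0(const block_q4_0 * b, float d, float y[QK4_0]) {
        for (int j = 0; j < QK4_0 / 2; ++j) {
            y[j]             = ((b->qs[j] & 0x0F) - 8) * d;
            y[j + QK4_0 / 2] = ((b->qs[j] >>   4) - 8) * d;
        }
    }

With these two pieces, a get_rows implementation for q4_0x8 can locate the interleaved group containing a requested row, undo the interleave, and dequantize only the blocks it needs.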

The PR was tested on an AMD Ryzen 5 7600X (Raphael), which supports the following flags:

system_info: n_threads = 4 / 12 | WHISPER : COREML = 0 | OPENVINO = 0 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

The model for the performance tests was downloaded from https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-base.en.bin and quantized to q4_0.

This patch was also tested with the llama.cpp repository, and the perplexity of Q4_0 models was verified to be identical before and after the change:

Final estimate: PPL = 5.9625 +/- 0.03348

The model used for the perplexity test was quantized from https://huggingface.co/meta-llama/Llama-2-7b.

@swetha097
Author

@danbev Addressed the review comments

@ggerganov
Member

Let's merge this after the refactoring in ggml-org/llama.cpp#13892 to avoid resolving the conflicts.

@danbev
Collaborator

danbev commented Jun 16, 2025

@swetha097 Would you be able to take a look at resolving the conflict reported here?

@swetha097
Author

@danbev Resolved the conflicts.

  • Performance with the gcc compiler is stable and matches the previous results.
  • To see similar performance gains with the clang compiler, we still need to pull in the latest updates from llama.cpp's ggml library.

@danbev
Collaborator

danbev commented Jun 17, 2025

@swetha097 Thanks! I think there will be a sync of ggml/llama.cpp shortly; hopefully this won't cause conflicts, but I'm not sure.

@ggerganov
Member

@swetha097 Hm, I am a bit confused about why this would make a difference in performance. The only place where we use ggml_get_rows is here:

whisper.cpp/src/whisper.cpp

Lines 2543 to 2548 in 1e72e4b

    // token encoding + position encoding
    struct ggml_tensor * cur =
        ggml_add(ctx0,
                ggml_get_rows(ctx0, model.d_te, embd),
                ggml_get_rows(ctx0, model.d_pe, position));

How does this change lead to increased usage of ggml_gemm_q4_0_8x8_q8_0?

@swetha097
Author

swetha097 commented Jun 18, 2025

@ggerganov
As noted in whisper-arch.h:

    // Note: ASR_TENSOR_DEC_TOKEN_EMBD_WEIGHT is also used by GGML_OP_MAT_MUL. Need to figure out a way how to handle
    // weight tensors that are used by multiple different operators when extra_buffer_type implementations accelerate
    // more than just GGML_OP_MUL_MAT.
    {ASR_TENSOR_DEC_TOKEN_EMBD_WEIGHT, GGML_OP_GET_ROWS},

this tensor is used by both GGML_OP_MUL_MAT and GGML_OP_GET_ROWS. Because the original get_rows operation could not handle repacked weights, the tensor remained in the standard format, which prevented the GGML_OP_MUL_MAT operation from using its most optimized kernel (ggml_gemm_q4_0_8x8_q8_0); previously it fell back to the gemm 4xN kernel.
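
To make the mechanics concrete, here is a small self-contained sketch of that gating idea. It is not ggml's real API; the enum and function names are invented for illustration. The point is that a weight can only live in the repacked buffer if every operator that reads it is supported, so adding GET_ROWS support is what unlocks the repacked path (and with it the 8x8 GEMM kernel) for the shared embedding weight.

    #include <initializer_list>

    // Hypothetical op tags standing in for ggml's op enum.
    enum op_kind { OP_MUL_MAT, OP_GET_ROWS, OP_OTHER };

    // Ops the repacked q4_0x8 buffer type can service.
    static bool repack_supports(op_kind op) {
        return op == OP_MUL_MAT      // serviced by ggml_gemm_q4_0_8x8_q8_0
            || op == OP_GET_ROWS;    // newly serviced after this PR
    }

    // A weight may be repacked only if *all* of its consumers are supported.
    static bool can_repack(std::initializer_list<op_kind> consumers) {
        for (op_kind op : consumers) {
            if (!repack_supports(op)) {
                return false;
            }
        }
        return true;
    }

    int main() {
        // The decoder token-embedding weight is read by both the matmul and the
        // embedding lookup; with GET_ROWS handled, it can now be repacked.
        return can_repack({OP_MUL_MAT, OP_GET_ROWS}) ? 0 : 1;
    }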

@ggerganov
Member

@swetha097 Thanks, got it.

@eddnjjn Could you help review this change?
