get_rows & dequantize function implementation for repacked weights of type q4_0 (q4_0x8) #3223
base: master
Conversation
@danbev Addressed the review comments.
Let's merge this after the refactoring in ggml-org/llama.cpp#13892 to avoid having to resolve the conflicts.
@swetha097 Would you be able to take a look at resolving the conflict reported here?
@danbev Resolved the conflicts.
@swetha097 Thanks! I think there will be a sync of ggml/llama.cpp shortly; hopefully this won't cause conflicts, but I'm not sure.
@swetha097 Hm, I am a bit confused about why this would make a difference for the performance. The only place where we use this is here (lines 2543 to 2548 in 1e72e4b). How does this change lead to increased usage of the GEMM function (ggml_gemm_q4_0_8x8_q8_0)?
@ggerganov This tensor is used by both GGML_OP_MUL_MAT and GGML_OP_GET_ROWS. Because the original get_rows operation could not handle repacked weights, the tensor remained in the standard format, which prevented the GGML_OP_MUL_MAT operation from using its most optimized kernel (ggml_gemm_q4_0_8x8_q8_0); it was previously falling back to the 4xN GEMM kernel instead.
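For readers following the thread, here is a rough sketch of the two layouts involved. The block_q4_0 definition follows ggml's standard q4_0 block; the x8 struct and its comments are illustrative of the repacked format rather than copied from ggml-cpu-aarch64.cpp:

```cpp
#include <cstdint>

#define QK4_0 32
typedef uint16_t ggml_half; // fp16 bits stored as uint16_t

// Standard q4_0: one fp16 scale plus 32 packed 4-bit quants per block.
typedef struct {
    ggml_half d;              // per-block scale
    uint8_t   qs[QK4_0 / 2];  // 32 quants, two 4-bit values per byte
} block_q4_0;

// Repacked q4_0x8: the corresponding blocks of 8 consecutive rows,
// with their quant bytes interleaved so that ggml_gemm_q4_0_8x8_q8_0
// can load the same column chunk of all 8 rows in one contiguous read.
// A plain get_rows kernel cannot walk this layout, which is why the
// tensor previously had to stay in the standard format.
typedef struct {
    ggml_half d[8];               // scales of the 8 source blocks
    uint8_t   qs[QK4_0 / 2 * 8];  // interleaved quants of the 8 blocks
} block_q4_0x8;
```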
@swetha097 Thanks, got it. @eddnjjn Could you help review this change?
This implements the GGML_OP_GET_ROWS operation for the repacked (block-interleaved) 4-bit quantized format (q4_0x8) defined within ggml-cpu-aarch64.cpp.
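As a minimal sketch of what a get_rows-style dequantize for this format has to do (this is not the PR's actual code; the helper names half_to_float and dequantize_block_q4_0x8 are hypothetical, and it assumes plain 8-byte interleave chunks with no additional byte transforms, which the real repack logic in ggml-cpu-aarch64.cpp may apply):

```cpp
#include <cstdint>
#include <cstring>

#define QK4_0 32
typedef uint16_t ggml_half;

// Repeated from the layout sketch above, for self-containment.
typedef struct {
    ggml_half d[8];               // scales of the 8 source blocks
    uint8_t   qs[QK4_0 / 2 * 8];  // interleaved quants of the 8 blocks
} block_q4_0x8;

// Minimal fp16 -> fp32 conversion (denormals flushed to zero).
static float half_to_float(ggml_half h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t bits = sign;
    if (exp == 31)     bits |= 0x7F800000u | (mant << 13);           // inf/NaN
    else if (exp != 0) bits |= ((exp + 112u) << 23) | (mant << 13);  // normal
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// De-interleave and dequantize one superblock into 8 rows of QK4_0 floats.
// dst[r] points at the 32-float destination of source row r.
// Assumption: chunk i of 8 bytes belongs to source row i % 8.
static void dequantize_block_q4_0x8(const block_q4_0x8 *x, float *dst[8]) {
    const int interleave = 8;                        // bytes per chunk (assumption)
    const int nchunks = (int)sizeof(x->qs) / interleave;
    for (int c = 0; c < nchunks; ++c) {
        const int row = c % 8;                       // source row of this chunk
        const int off = (c / 8) * interleave;        // byte offset within that row's block
        const uint8_t *src = &x->qs[c * interleave];
        const float d = half_to_float(x->d[row]);
        for (int b = 0; b < interleave; ++b) {
            // q4_0 packing: low nibble is element j, high nibble is element j + 16
            const int j  = off + b;
            const int lo = (src[b] & 0x0F) - 8;
            const int hi = (src[b] >> 4)   - 8;
            dst[row][j]             = lo * d;
            dst[row][j + QK4_0 / 2] = hi * d;
        }
    }
}
```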
The following gains were observed from the changes made in this PR: they allow increased usage of the GEMM function (ggml_gemm_q4_0_8x8_q8_0) for the q4_0 type defined within ggml-cpu-aarch64.cpp.
master branch commit - 82f461eaa4e6a1ba29fc0dbdaa415a9934ee8a1d
development (get_rows) branch commit - d6cc466bd48dd27474ecb00c3baba2e8a887f6c4
The PR was tested on an AMD Raphael 7600X, which supports the following flags:
Model used for performance tests was downloaded from https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-base.en.bin and quantized to q4_0.
This patch was also tested with the llama.cpp repository, and the perplexity of Q4_0 models was verified to be the same before and after the changes:
Model used for the perplexity test was quantized from https://huggingface.co/meta-llama/Llama-2-7b.