get_rows & dequantize function implementation for repacked weights of type q4_0 (q4_0x8) #3223
base: master
Conversation
@danbev Addressed the review comments.
Let's merge this after the refactoring in ggml-org/llama.cpp#13892 to avoid having to resolve the conflicts.
@swetha097 Would you be able to take a look at resolving the conflict reported here?
@danbev Resolved the conflicts.
@swetha097 Thanks! I think there will be a sync of ggml/llama.cpp shortly; hopefully this won't cause conflicts, but I'm not sure.
@swetha097 Hm, I am a bit confused about why this would make a difference for the performance. The only place where we use this is here (lines 2543 to 2548 in 1e72e4b). How does this change lead to increased usage of the GEMM function (ggml_gemm_q4_0_8x8_q8_0)?
@ggerganov This tensor is used by both GGML_OP_MUL_MAT and GGML_OP_GET_ROWS. Because the original get_rows operation could not handle repacked weights, the tensor remained in the standard format, which prevented the GGML_OP_MUL_MAT operation from using its most optimized kernel (ggml_gemm_q4_0_8x8_q8_0); it was previously falling back to the 4xN GEMM kernel instead.
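For readers following the thread, here is a rough sketch of the two layouts involved. The block_q4_0 definition follows ggml's standard q4_0 block; the x8 struct and its comments are illustrative of the repacked format rather than copied from ggml-cpu-aarch64.cpp:

```cpp
#include <cstdint>

#define QK4_0 32
typedef uint16_t ggml_half; // fp16 bits stored as uint16_t

// Standard q4_0: one fp16 scale plus 32 packed 4-bit quants per block.
typedef struct {
    ggml_half d;              // per-block scale
    uint8_t   qs[QK4_0 / 2];  // 32 quants, two 4-bit values per byte
} block_q4_0;

// Repacked q4_0x8: the corresponding blocks of 8 consecutive rows,
// with their quant bytes interleaved so that ggml_gemm_q4_0_8x8_q8_0
// can load the same column chunk of all 8 rows in one contiguous read.
// A plain get_rows kernel cannot walk this layout, which is why the
// tensor previously had to stay in the standard format.
typedef struct {
    ggml_half d[8];               // scales of the 8 source blocks
    uint8_t   qs[QK4_0 / 2 * 8];  // interleaved quants of the 8 blocks
} block_q4_0x8;
```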
@swetha097 Thanks, got it. @eddnjjn Could you help review this change?
This implements the GGML_OP_GET_ROWS operation for the repacked (block-interleaved) 4-bit quantized format (q4_0x8) defined within ggml-cpu-aarch64.cpp.
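As a minimal sketch of what a get_rows-style dequantize for this format has to do (this is not the PR's actual code; the helper names half_to_float and dequantize_block_q4_0x8 are hypothetical, and it assumes plain 8-byte interleave chunks with no additional byte transforms, which the real repack logic in ggml-cpu-aarch64.cpp may apply):

```cpp
#include <cstdint>
#include <cstring>

#define QK4_0 32
typedef uint16_t ggml_half;

// Repeated from the layout sketch above, for self-containment.
typedef struct {
    ggml_half d[8];               // scales of the 8 source blocks
    uint8_t   qs[QK4_0 / 2 * 8];  // interleaved quants of the 8 blocks
} block_q4_0x8;

// Minimal fp16 -> fp32 conversion (denormals flushed to zero).
static float half_to_float(ggml_half h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t bits = sign;
    if (exp == 31)     bits |= 0x7F800000u | (mant << 13);           // inf/NaN
    else if (exp != 0) bits |= ((exp + 112u) << 23) | (mant << 13);  // normal
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// De-interleave and dequantize one superblock into 8 rows of QK4_0 floats.
// dst[r] points at the 32-float destination of source row r.
// Assumption: chunk i of 8 bytes belongs to source row i % 8.
static void dequantize_block_q4_0x8(const block_q4_0x8 *x, float *dst[8]) {
    const int interleave = 8;                        // bytes per chunk (assumption)
    const int nchunks = (int)sizeof(x->qs) / interleave;
    for (int c = 0; c < nchunks; ++c) {
        const int row = c % 8;                       // source row of this chunk
        const int off = (c / 8) * interleave;        // byte offset within that row's block
        const uint8_t *src = &x->qs[c * interleave];
        const float d = half_to_float(x->d[row]);
        for (int b = 0; b < interleave; ++b) {
            // q4_0 packing: low nibble is element j, high nibble is element j + 16
            const int j  = off + b;
            const int lo = (src[b] & 0x0F) - 8;
            const int hi = (src[b] >> 4)   - 8;
            dst[row][j]             = lo * d;
            dst[row][j + QK4_0 / 2] = hi * d;
        }
    }
}
```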
The following gains were observed from the changes made in this PR: they allow increased usage of the GEMM function (ggml_gemm_q4_0_8x8_q8_0) for the q4_0 type defined within ggml-cpu-aarch64.cpp.
master branch commit - 82f461eaa4e6a1ba29fc0dbdaa415a9934ee8a1d
development (get_rows) branch commit - d6cc466bd48dd27474ecb00c3baba2e8a887f6c4
The PR was tested on an AMD Raphael 7600X, which supports the following flags:
Model used for performance tests was downloaded from https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-base.en.bin and quantized to q4_0.
This patch was also tested with the llama.cpp repository, and the perplexity of Q4_0 models was verified to be the same before and after the changes:
Model used for the perplexity test was quantized from https://huggingface.co/meta-llama/Llama-2-7b.