quantize: Handle user-defined quantization levels for additional tensors #12511

EAddario · 2025-03-22T08:21:22Z

This PR adds the ability to quantize other tensors, beyond token-embedding and output-tensor. It handles most of the supported architectures. ~~except Mamba, RWKV6, RWKV6QWEN2 and T5 to avoid having too many command options, but can add as well if maintainers request it.~~

For full background on the PR, please see: Squeezing Tensor Bits: the quest for smaller LLMs

max-krasnyansky · 2025-03-25T03:19:30Z

How about we add a more generic --tensor-type tensor_name_pattern=type.
@slaren has the PR #11397 that overrides backend mapping per tensor.
Let's make this one similar (same patterns, etc.). That way we'll be able to override specific layers (if needed).
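For illustration, here is a minimal sketch of how an argument of the proposed `pattern=type` form could be split into its two halves. The struct and function names are hypothetical and not taken from this PR or from #11397, and the type name is kept as a plain string rather than validated against the real ggml type list.

```cpp
#include <string>

// Hypothetical holder for one "--tensor-type pattern=type" argument.
struct tensor_type_override {
    std::string pattern; // tensor name pattern, e.g. "attn_v"
    std::string type;    // quantization type name, e.g. "q6_k" (not validated here)
};

// Splits "pattern=type" at the LAST '=' (an assumption of this sketch).
// Returns false when there is no '=' or when either side is empty.
bool parse_tensor_type_arg(const std::string & arg, tensor_type_override & out) {
    const std::string::size_type sep = arg.rfind('=');
    if (sep == std::string::npos || sep == 0 || sep + 1 == arg.size()) {
        return false;
    }
    out.pattern = arg.substr(0, sep);
    out.type    = arg.substr(sep + 1);
    return true;
}
```

Splitting at the last `=` is a choice made here so that a pattern could itself contain `=`; a real implementation may well split differently.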

EAddario · 2025-03-25T09:24:05Z

That's an excellent idea! It'll allow adding all supported tensor types (50+) without creating a mess of parameters. Plus, it will give me something to do over the weekend 😆

jukofyork · 2025-03-27T13:05:51Z

> How about we add a more generic --tensor-type tensor_name_pattern=type. @slaren has the PR #11397 that overrides backend mapping per tensor. Let's make this one similar (same patterns, etc.). That way we'll be able to override specific layers (if needed).

Yeah, I think this is definitely the way to go - the regex support of that PR gives really good flexibility.

EAddario · 2025-03-29T14:30:20Z

Adding regex matching would be a relatively straightforward change, and I'm happy to do it, but I wanted to check how useful the feature would really be.

@slaren's use case is about choosing which tensors and/or which layers get processed where. For example, the user could opt to run all expert tensors on the CPU and the shared experts on the GPU, like so: -ot "_exp=CPU" -ot "_sexp=GPU", or, as in that PR's example, keep the experts of layers 20-99 on the CPU: -ot "[2-9][0-9]\.ffn_.*_exps\.=CPU"
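To make the regex behaviour concrete, the sketch below tests a tensor name against a list of `pattern=value` overrides and returns the value of the first match. It assumes unanchored search semantics (`std::regex_search`) and first-match-wins ordering; both are assumptions of this sketch rather than confirmed details of #11397. The helper name and the tensor names in the usage note are illustrative.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the value of the first override whose pattern
// is found anywhere in the tensor name, or an empty string when none match.
// An invalid pattern makes the std::regex constructor throw std::regex_error.
std::string match_override(const std::vector<std::pair<std::string, std::string>> & overrides,
                           const std::string & tensor_name) {
    for (const auto & ov : overrides) {
        const std::regex re(ov.first);
        if (std::regex_search(tensor_name, re)) {
            return ov.second;
        }
    }
    return "";
}
```

With the single override `[2-9][0-9]\.ffn_.*_exps\.` mapped to `CPU`, a name such as `blk.25.ffn_up_exps.weight` matches, while `blk.5.ffn_up_exps.weight` and `blk.25.attn_q.weight` do not. Because the search is unanchored, a three-digit layer such as `blk.120.` would also match through its last two digits.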

In my use case, I had assumed the user would want to choose which tensors get quantized at which level, excluding the ones already handled by the program (--leave-output-tensor, --output-tensor-type, --token-embedding-type and --pure), with said quantization applied to the selected tensor/s across all layers.

The changes I have just pushed implement exactly that, but following @max-krasnyansky's suggestion. A couple of things to note:

- I implemented the --tensor-type logic very similarly to --override-kv to keep the code consistent.
- ALLOWED_TENSOR_TYPE in quantize.cpp controls which tensors can be selected. I have excluded 1D tensors like norm, gamma, lerp, etc., and others that are usually too small to benefit from quantizing, but I'm happy to change this if I got it wrong.
- Rather than adding to llama.h, I'm duplicating the tensor_quantization struct in quantize.cpp and llama-quant.cpp. I think this is better than polluting the main header file with such a trivial and program-specific struct, but I can change it if that's not the recommended approach.
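As a rough sketch of the behaviour described above (an allow-list of selectable tensors, with the chosen type applied to that tensor in every layer), something along these lines could work. The allow-list contents, the struct's field names, and both helper functions are assumptions made for illustration; the real ALLOWED_TENSOR_TYPE and tensor_quantization definitions live in the PR's code, and the real struct would presumably hold a ggml type rather than a string.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

// Illustrative subset only; not the PR's actual ALLOWED_TENSOR_TYPE list.
static const std::array<std::string, 4> allowed_tensor_names = {
    "attn_q", "attn_k", "attn_v", "ffn_down"
};

// Stand-in for the PR's tensor_quantization struct; field names are assumptions.
struct tensor_quantization {
    std::string name;  // tensor kind, e.g. "attn_v"
    std::string quant; // quantization type name, e.g. "q6_k"
};

// True when the tensor kind may be selected with --tensor-type.
bool is_allowed_tensor(const std::string & name) {
    return std::find(allowed_tensor_names.begin(), allowed_tensor_names.end(), name)
        != allowed_tensor_names.end();
}

// Returns the user-selected type when a selected tensor kind appears in the
// full tensor name (so it applies to every layer), otherwise the fallback.
std::string pick_quant(const std::string & full_name,
                       const std::vector<tensor_quantization> & selected,
                       const std::string & fallback) {
    for (const auto & tq : selected) {
        if (full_name.find(tq.name) != std::string::npos) {
            return tq.quant;
        }
    }
    return fallback;
}
```

Under these assumptions, selecting `attn_v` with `q6_k` would affect `blk.0.attn_v.weight` and `blk.31.attn_v.weight` alike, while a 1D tensor kind such as `attn_norm` would be rejected by the allow-list check before it ever reaches `pick_quant`.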

Having said that, @jukofyork raises an interesting possibility reminiscent of Razvan-Gabriel Dumitru et al.'s Layer-Wise Quantization and Binrui Zeng et al.'s Layer-Specific Adaptive Quantization. I think it is worth exploring further, but I suspect the amount of testing needed to ensure nothing gets broken will be significant. Either way, one for another weekend 😁

In the meantime, will keep 🤞 for this PR to be merged.

EAddario added 24 commits March 13, 2025 18:54

Add llama_model_quantize_params parameters

09f716d

Add new quantize parameters parsing and validation

ac908af

Update usage

337d979

Add new parameters defaults

6f8d16d

Add new quantization parameters logic

71c9f93

Add llama_model_quantize_params parameters

8e18131

Add new quantize parameters parsing and validation

a77d947

Update usage

2414eaa

Add new parameters defaults

0dd66b8

Add new quantization parameters logic

1d841c6

Merge main changes into branch

120f71b

Merge branch 'master' into quantize

dbcc0b5

Minor refactoring as per the contributors' coding guidelines

d86de03

Update descriptions to match existing style

99bae5e

Merge branch 'master' into quantize

60b0a53

Merge branch 'master' into quantize

3e2063d

Merge branch 'master' into quantize

b99fa62

Add llama_model_quantize_params parameters

f97b693

Add new quantize parameters parsing and validation

f11e3da

Update usage

ad1e352

Add new parameters defaults

4e5c96a

Add new quantization parameters logic

9b3ccb5

Minor refactoring as per the contributors' guidelines

35f45f1

Merge branch 'master' into quantize

071e9ef

github-actions bot added the examples label Mar 22, 2025

EAddario changed the title ~~Handle user-defined quantization levels for additional tensors~~ quantize: Handle user-defined quantization levels for additional tensors Mar 22, 2025

jukofyork mentioned this pull request Mar 27, 2025

llama : add option to override model tensor buffers #11397

Draft

2 tasks

EAddario added 2 commits March 29, 2025 12:18

Implement general --tensor-type instead of tensor-specific command option

54e13cf

Merge branch 'master' into quantize

31d642c
