
quantize: Handle user-defined quantization levels for additional tensors #12511

Open · wants to merge 26 commits into master
Conversation


@EAddario commented Mar 22, 2025

This PR adds the ability to quantize tensors beyond token-embedding and output-tensor. It covers most of the supported architectures, except Mamba, RWKV6, RWKV6QWEN2 and T5, which were left out to avoid having too many command options; they can be added as well if the maintainers request it.

For full background on the PR, please see: Squeezing Tensor Bits: the quest for smaller LLMs

@EAddario changed the title from "Handle user-defined quantization levels for additional tensors" to "quantize: Handle user-defined quantization levels for additional tensors" on Mar 22, 2025
@max-krasnyansky
Collaborator

How about we add a more generic --tensor-type tensor_name_pattern=type?
@slaren has PR #11397, which overrides the backend mapping per tensor.
Let's make this one similar (same patterns, etc.). That way we'll be able to override specific layers (if needed).
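
A hypothetical invocation of the proposed flag could look like the following; the flag name and the pattern=type syntax follow the suggestion above, but the exact tensor names and final CLI shape are assumptions, not the merged behaviour:

```
# hypothetical usage of the proposed --tensor-type flag; exact syntax may differ
./llama-quantize --tensor-type "ffn_down=q6_k" --tensor-type "attn_v=q5_k" \
    model-f16.gguf model-q4_k_m.gguf q4_k_m
```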

@EAddario
Author

That's an excellent idea! It'll allow adding all supported tensor types (50+) without creating a mess of parameters. Plus, it will give me something to do over the weekend 😆

@jukofyork
Contributor

> How about we add a more generic --tensor-type tensor_name_pattern=type? @slaren has PR #11397, which overrides the backend mapping per tensor. Let's make this one similar (same patterns, etc.). That way we'll be able to override specific layers (if needed).

Yeah, I think this is definitely the way to go - the regex support of that PR gives really good flexibility.

@EAddario
Author

Adding the regex matching would be a relatively straightforward change, and I'm happy to do it, but I wanted to check how useful the feature would really be.

@slaren's use case is about choosing which tensors and/or which layers get processed where. For example, the user could opt to run all expert tensors on the CPU and the shared experts on the GPU, like so: -ot "_exp=CPU" -ot "_sexp=GPU", or, as in that PR's example, keep the experts of layers 20-99 on the CPU: -ot "[2-9][0-9]\.ffn_.*_exps\.=CPU"
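
For reference, a minimal sketch of how a tensor name could be matched against such a pattern; this assumes std::regex and is my illustration, not the actual code from either PR:

```cpp
// Match a tensor name against a user-supplied pattern, in the spirit of
// the -ot overrides above. Illustrative only, not the PRs' actual code.
#include <regex>
#include <string>

static bool tensor_matches(const std::string & tensor_name, const std::string & pattern) {
    // e.g. pattern "[2-9][0-9]\\.ffn_.*_exps\\." matches the expert tensors
    // of blk.20 through blk.99
    return std::regex_search(tensor_name, std::regex(pattern));
}
```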

In my use case, I had assumed the user would want to choose which tensors get quantized at which level, excluding the ones already handled by the program (--leave-output-tensor, --output-tensor-type, --token-embedding-type and --pure), with the chosen quantization applied to the selected tensor(s) across all layers.

The changes I have just pushed implement exactly that, following @max-krasnyansky's suggestion. A couple of things to note:

  • Implemented the --tensor-type logic very similarly to --override-kv, to keep the code consistent
  • ALLOWED_TENSOR_TYPE in quantize.cpp controls which tensors can be selected. I have excluded 1D tensors like norm, gamma, lerp, etc., and others that are usually too small to benefit from quantizing, but I'm happy to change this if I got it wrong
  • Rather than adding to llama.h, I'm duplicating tensor_quantization in quantize.cpp and llama-quant.cpp (see the sketch after this list). I think that's better than polluting the main header file with such a trivial, program-specific struct, but I can change it if that's not the recommended approach
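
To make the bullet points concrete, here is a self-contained sketch of the duplicated struct and the pattern=type parsing. The struct and array names follow the description above, but the member types, the entries in the allowed list, and the matching rule are my assumptions, not the PR's exact code:

```cpp
// Sketch of the duplicated tensor_quantization struct and --tensor-type
// parsing; details are assumptions, not the PR's exact implementation.
#include <cstdio>
#include <string>
#include <vector>

struct tensor_quantization {
    std::string name;  // tensor name (or pattern), e.g. "ffn_down"
    std::string type;  // target quantization level, e.g. "q6_k"
};

// Which tensors may be overridden; 1D tensors (norm, gamma, lerp, ...) are
// excluded, per the note above. Entries here are illustrative.
static const std::vector<std::string> ALLOWED_TENSOR_TYPE = {
    "attn_k", "attn_q", "attn_v", "ffn_down", "ffn_gate", "ffn_up",
};

// Parse a --tensor-type argument of the form "pattern=type",
// in the style of --override-kv's argument parsing.
static bool parse_tensor_type(const std::string & arg, std::vector<tensor_quantization> & out) {
    const size_t sep = arg.find('=');
    if (sep == std::string::npos) {
        fprintf(stderr, "error: expected pattern=type, got '%s'\n", arg.c_str());
        return false;
    }
    tensor_quantization tq;
    tq.name = arg.substr(0, sep);
    tq.type = arg.substr(sep + 1);

    // Reject tensors that are not in the allowed list.
    bool allowed = false;
    for (const auto & t : ALLOWED_TENSOR_TYPE) {
        if (tq.name.find(t) != std::string::npos) {
            allowed = true;
            break;
        }
    }
    if (!allowed) {
        fprintf(stderr, "error: tensor '%s' cannot be overridden\n", tq.name.c_str());
        return false;
    }
    out.push_back(tq);
    return true;
}
```

Duplicating this small struct in the two .cpp files keeps llama.h free of a program-specific detail, at the cost of keeping the two copies in sync.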

Having said that, @jukofyork raises an interesting possibility reminiscent of Razvan-Gabriel Dumitru et al.'s Layer-Wise Quantization and Binrui Zeng et al.'s Layer-Specific Adaptive Quantization. I think it's worth exploring further, but I suspect the amount of testing needed to ensure nothing gets broken will be significant. Either way, one for another weekend 😁

In the meantime, I'll keep 🤞 for this PR to be merged.
