ggml-quants : weighted rounding algorithms with cumulative search #12557
Conversation
Slightly faster than the previous method.
Weirdly, it seems like in practice replacing this instance is not better. This is probably because of its interaction with `make_qkx3_quants`.
There seem to be some problems with both Metal and Vulkan tests when copying […]. I think this may be caused by the new quantization algorithm for […]. I'm not sure how to fix this other than making the CPU quantization for […].
This looks really interesting and I will read through it this week if I get time - I'm still keen to see if we can find a better way to regularise the weights. You might find this interesting:
This bothered me as I couldn't see any good reason why it shouldn't be (after recentering) an odd function of the form […]. It's not as clear in this form, but it actually parameterises a whole family of odd functions: […] (which may be useful to make an extension to the […]). With some more manipulation you get […], which IIRC is related to a well-known approximation to the symmetric beta quantile function (inverse CDF) and has been discussed on John D. Cook's blog before: https://www.johndcook.com/blog/ (which is sadly so badly organised it's near impossible to find again lol).

Anyway, I thought it might be interesting since you are looking at the rounding - it may be that it now comes out as an odd function if you were to rerun the k-means clustering on the fixed rounding?
Can you explain how you created the plots in more detail? My father was a geography teacher and map projections were one of his favourite topics, but I still can't 100% see the relationship here! :) Are the "maximums" you are referring to the edges of the cube? I can see we could create a 2D heightmap of […].

This is really fascinating BTW!
@jukofyork The weighted cosine similarity only affects the color gradient of the plot.
@ikawrakow kindly explained where this came from here: […]

The reason this bothers me so much is because the formula doesn't act as a regulariser at the two extremes: […]
Then if you look at the experts in a MoE model, we should be weighting more or less towards the prior depending on the relative sample sizes, and so on. Or put another way: there should be a tunable […].

There are a multitude of different ways you can estimate the optimal […]; see […], or the textbooks by James E. Gentle, for an overview of this.

I'm going to dip out now as, like I said in the other thread, I've nothing to gain from this and may have come across badly, which certainly wasn't my intention! :) I think the work @ikawrakow did on the quants in […]
I think that the CPY operations that involve quantization of the source data should remain simple, because these are difficult to implement efficiently on the GPU and other devices. So using the fast shortcut-taking implementation during copy should be the better option here.
I did a quick perplexity test with a base Gemma 3 4B and observed an improvement for […].
Though I agree that KLD is a better metric to track, especially for tuned models. I think after we resolve the failing tests, we can proceed to merge. Great work on this @compilade!
This adds proper `imatrix` support to `TQ1_0` and `TQ2_0`, in addition to improving the rounding algorithm used for `Q3_K`, `IQ4_NL`, and `IQ4_XS` (both with and without `imatrix`), as well as when using `imatrix` with `Q4_0` and `Q5_0`.

This is backward and forward compatible with other versions of `llama.cpp`. Since this doesn't change the format of the types, only how the values are rounded when quantized, even previous (or current) versions of `llama.cpp` can use quants made with this PR.

## Affected types

When using `imatrix`, all the types mentioned in the table below are affected. When not using `imatrix`, a change was only made where "Yes" is in the table below.

| Type     | Changed without `imatrix` |
| -------- | ------------------------- |
| `TQ1_0`  | No  |
| `TQ2_0`  | No  |
| `Q3_K`   | Yes |
| `IQ4_NL` | Yes |
| `IQ4_XS` | Yes |
| `Q4_0`   | No  |
| `Q5_0`   | No  |
## KL-Divergence

The following tests were made with `wiki.test.raw` from `wikitext-2-raw`, using chunks of 512 tokens. Quantization was done using the `imatrix` files made by @bartowski1182. Since this doesn't affect how `imatrix` files are made, older ones can still be used for quantization.

**Important:** All the following tests use PURE quantization to avoid testing multiple changed types at once, to be sure that the changes are measured on their own.

```
$ ./bin/llama-quantize --imatrix <some-file.imatrix> --token-embedding-type q8_0 --output-tensor-type q8_0 --pure <source.gguf> <quant.gguf> <quant-type>
```
### Qwen2.5-Coder-3B-Instruct

With `imatrix` from https://huggingface.co/bartowski/Qwen2.5-Coder-3B-Instruct-GGUF/blob/main/Qwen2.5-Coder-3B-Instruct.imatrix:

KL-divergence (lower is better) was measured for `TQ1_0`*, `TQ2_0`*, `Q3_K`, `IQ4_NL`, `IQ4_XS`, `Q4_0`, and `Q5_0`.

*: `TQ1_0` and `TQ2_0` KL-divergence was calculated on the first 8 chunks.

Note how `Q3_K` was previously very broken for this model. There was a reddit thread about broken `Q3_K` for this model.

Full KL-Divergence results: collapsed logs for `TQ1_0`, `TQ2_0`, `Q3_K`, `IQ4_NL`, `IQ4_XS`, `Q4_0`, and `Q5_0` (master and PR).
Without `imatrix` (lower is better), KL-divergence was measured for `Q3_K`, `IQ4_NL`, and `IQ4_XS`. The other types were not changed.

Full KL-Divergence results: collapsed logs for `Q3_K`, `IQ4_NL`, and `IQ4_XS` (master and PR).
### Llama-3.1-8B-Instruct

Same tests, using `Llama-3.1-8B-Instruct`, with `imatrix` from https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct.imatrix

Again, note that the quantizations here are pure (without mixing types apart from `Q8_0` token embeddings and output tensor).

KL-divergence on `wiki.test.raw` (lower is better) was measured for `TQ1_0`*, `TQ2_0`*, `Q3_K`, `IQ4_NL`, `IQ4_XS`, `Q4_0`, and `Q5_0`.

*: `TQ1_0` and `TQ2_0` KL-divergence was calculated on the first 8 chunks.

Full KL-Divergence results: collapsed logs for `TQ1_0`, `TQ2_0`, `Q3_K`, `IQ4_NL`, `IQ4_XS`, `Q4_0`, and `Q5_0` (master and PR).
Without `imatrix`, KL-divergence was measured for `Q3_K`, `IQ4_NL`, and `IQ4_XS`.

Full KL-Divergence results: collapsed logs for `Q3_K`, `IQ4_NL`, and `IQ4_XS` (master and PR).
The improvements are more apparent with `imatrix`, where this is a strict improvement. Without `imatrix`, it's a bit less clear.

## What changed in the algorithms?
There's a neat way to visualize rounding algorithms with equirectangular projections of their errors in a particular 3D space.
Here's an equirectangular projection from the algorithm used in `Q4_0` (which uses integers between `-8` and `7`):

This plots the weighted cosine similarity between the quantized vectors and the full-precision vectors which correspond to each pixel of the projection. Less error is more yellow, while more error is more blue. Unless otherwise noted, the projections I'm including here always use […].

Note that this doesn't fully capture the behavior of more complex rounding algorithms at higher dimensions, since this is fundamentally a 3D view of the rounding space (which in practice is more like 16D, 32D, or even 256D), but it is enough to make some problems more easily identifiable. Non-ideal rounding algorithms have discontinuities in their weighted cosine similarity plots (for `Q4_0`, the bluer line is caused by how the max scale is handled since #729).
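To make the setup concrete, here is a minimal C sketch of the idea behind such a projection (an illustration only, not the `equirectangular.py` script from the rounding-experiments repo): each pixel is mapped to a unit vector on the sphere, that vector is quantized with a simple round-to-nearest `[-8, 7]` scheme, and the weighted cosine similarity between the original and the dequantized vector determines the pixel's color. The uniform weights `w[i] = 1` and the round-to-nearest quantizer are simplifying assumptions.

```c
#include <math.h>
#include <stdio.h>
#include <stdint.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

// Round-to-nearest quantization with integers in [-8, 7] (Q4_0-like):
// the scale is chosen so that the value with the largest magnitude maps to -8.
// This is a simplification for illustration, not ggml's exact code.
static float quantize_rtn(const float * x, int8_t * q, int n) {
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
    }
    const float d  = max / -8.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    for (int i = 0; i < n; ++i) {
        int v = (int) roundf(x[i] * id);
        q[i] = (int8_t) (v < -8 ? -8 : v > 7 ? 7 : v);
    }
    return d;
}

// Weighted cosine similarity between x[] and the dequantized vector d*q[].
static float weighted_cos_sim(const float * x, const int8_t * q, float d,
                              const float * w, int n) {
    float xy = 0.0f, xx = 0.0f, yy = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float y = d * q[i];
        xy += w[i] * x[i] * y;
        xx += w[i] * x[i] * x[i];
        yy += w[i] * y * y;
    }
    return (xx > 0.0f && yy > 0.0f) ? xy / sqrtf(xx * yy) : 0.0f;
}

int main(void) {
    const int width = 64, height = 32;       // resolution of the projection
    const float w[3] = { 1.0f, 1.0f, 1.0f }; // uniform weights (an assumption)
    for (int py = 0; py < height; ++py) {
        for (int px = 0; px < width; ++px) {
            // map the pixel to longitude/latitude, then to a 3D unit vector
            const float lon = ((px + 0.5f) / width  - 0.5f) * 2.0f * (float) M_PI;
            const float lat = (0.5f - (py + 0.5f) / height) * (float) M_PI;
            const float x[3] = { cosf(lat) * cosf(lon), cosf(lat) * sinf(lon), sinf(lat) };
            int8_t q[3];
            const float d = quantize_rtn(x, q, 3);
            // a real script would color this value; here it's just printed
            printf("%.3f ", weighted_cos_sim(x, q, d, w, 3));
        }
        printf("\n");
    }
    return 0;
}
```

Because only the direction of the vector matters for the scale-and-round step, the unit sphere covers every distinct rounding case in 3D, which is why the projection captures the whole behavior of the algorithm at that dimension.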
### Algorithms used on `master`

Let's start with what is used in the types on `master`, so that we have some baseline to compare with.

#### `make_q3_quants`

This algorithm is used only with `Q3_K` when there is no `imatrix` provided. It's a bit broken for some models, notably with `Qwen2.5-Coder-3B-Instruct`. It doesn't seem quite right (this will become clearer later when more ideal algorithms are illustrated). Notice how vectors with positive or negative maximums are handled completely differently.

In practice, the rounding weights it uses are the square of the vectors, which looks more like this:
#### `make_qx_quants`

This algorithm is used in a lot of types. In this example it's used with `[-8, 7]` as the range of integers:

I did not replace all of its uses yet because in some places it's good enough (e.g. `Q6_K`).
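For contrast with the cumulative search introduced later, here is a simplified sketch of the kind of weighted grid search that `make_qx_quants` performs: try a handful of candidate inverse scales around the naive "absolute max maps to `-nmax`" choice, round with each candidate, and keep the scale that maximizes the weighted projection `(Σ w·x·q)² / (Σ w·q²)`. The exact grid (±9 steps of 0.1) and other details here are assumptions for illustration, not a copy of the ggml implementation.

```c
#include <math.h>
#include <stdint.h>

// Weighted grid search in the spirit of make_qx_quants (illustrative only).
static float grid_search_quants(int nmax, int n, const float * x,
                                const float * w, int8_t * q) {
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
    }
    if (amax == 0.0f) {
        for (int i = 0; i < n; ++i) { q[i] = 0; }
        return 0.0f;
    }
    float best = 0.0f, best_scale = 0.0f;
    for (int is = -9; is <= 9; ++is) {               // the grid itself is an assumption
        const float id = (-nmax + 0.1f * is) / max;  // candidate inverse scale
        float sumlx = 0.0f, suml2 = 0.0f;
        for (int i = 0; i < n; ++i) {
            int l = (int) roundf(x[i] * id);
            l = l < -nmax ? -nmax : l > nmax - 1 ? nmax - 1 : l;
            sumlx += w[i] * x[i] * l;
            suml2 += w[i] * (float) (l * l);
        }
        // sumlx*sumlx/suml2 is the weighted quality of this candidate
        if (suml2 > 0.0f && sumlx * sumlx > best * suml2) {
            best       = sumlx * sumlx / suml2;
            best_scale = sumlx / suml2;
            for (int i = 0; i < n; ++i) {
                int l = (int) roundf(x[i] * id);
                q[i] = (int8_t) (l < -nmax ? -nmax : l > nmax - 1 ? nmax - 1 : l);
            }
        }
    }
    return best_scale;  // x[i] is approximated by q[i] * best_scale
}
```

For a `Q4_0`-style range, `nmax` would be 8, matching the `[-8, 7]` example above.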
#### `make_qp_quants`

This is almost like `make_qx_quants`, but assumes unsigned quantization (from 0 to `nmax`) with a positive scale. Since it only works with unsigned quantization, visualizing it is a bit different, because only the positive quadrant can be explored. Still, if we limit the viewing range to the positive quadrant of a face of a cube, here's what it looks like:

Note that the top left corner is `[1, 0, 0]`, while the bottom right corner is `[1, 1, 1]` in this cube face projection.
#### `quantize_row_iq4_nl_impl`

This is used in both `IQ4_NL` and `IQ4_XS`. Notice how there are many discontinuities, although the error is mostly small.
### Algorithms from this PR

The weighted vector rounding algorithms I'm introducing all share a similar theory. It's possible to use a cumulative sum to enumerate all weighted dot products for each distinct initial scale. This requires sorting the possible inverse scales so that each step changes only a single integer in the candidate quantized vector. In practice, using a max-heap of the scales seems to be faster than using `qsort`, which is why I've added `struct k_heap` (which is basically a binary max-heap).

I've been exploring this idea in https://github.com/compilade/rounding-experiments, which is also where the equirectangular visualization script comes from (it's `equirectangular.py` in that repo). I will eventually publish a more complete explanation of the algorithms, but there are still some unsolved problems, like how to generalize this to offset quantization types like `Q4_K` (which loosely have the form `q[i] * s - m`). If you'd like to help research this kind of quantization algorithm, or help formalize it, reach out.
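As a rough illustration of the cumulative search described above (and only that: the real `make_qkxh_quants` supports asymmetric ranges, uses the `struct k_heap` max-heap rather than `qsort`, and differs in details), consider a symmetric range `[-nmax, nmax]`. Each element `x[i]` changes its rounded integer at known thresholds of the inverse scale; if those thresholds are visited in sorted order, each step changes exactly one integer of the candidate vector, so the weighted sums needed to score a scale can be updated in constant time per step.

```c
#include <math.h>
#include <stdlib.h>
#include <stdint.h>

// One "event": at inverse scales >= t, the magnitude of element i grows by 1.
struct scale_event {
    float t;
    int   i;
};

static int cmp_event(const void * a, const void * b) {
    const float ta = ((const struct scale_event *) a)->t;
    const float tb = ((const struct scale_event *) b)->t;
    return ta < tb ? -1 : ta > tb ? 1 : 0;
}

// Cumulative search over all distinct rounding results for a symmetric range.
static float cumulative_search_quants(int nmax, int n, const float * x,
                                      const float * w, int8_t * q) {
    struct scale_event * ev = malloc((size_t) n * nmax * sizeof(*ev));
    int n_ev = 0;
    for (int i = 0; i < n; ++i) {
        const float ax = fabsf(x[i]);
        q[i] = 0;
        if (ax == 0.0f) { continue; }
        for (int m = 1; m <= nmax; ++m) {
            // round(ax * id) reaches m once id >= (m - 0.5) / ax
            ev[n_ev].t = (m - 0.5f) / ax;
            ev[n_ev].i = i;
            n_ev++;
        }
    }
    qsort(ev, (size_t) n_ev, sizeof(*ev), cmp_event);

    float sumlx = 0.0f, suml2 = 0.0f;  // sum(w*x*q) and sum(w*q*q), kept cumulatively
    float best = 0.0f, best_scale = 0.0f;
    int8_t * mag = calloc((size_t) n, sizeof(*mag));
    for (int e = 0; e < n_ev; ++e) {
        const int i = ev[e].i;
        sumlx += w[i] * fabsf(x[i]);                 // |q[i]| grew by 1
        suml2 += w[i] * (float) (2 * mag[i] + 1);    // (m+1)^2 - m^2 = 2m + 1
        mag[i] += 1;
        if (suml2 > 0.0f && sumlx * sumlx > best * suml2) {
            best       = sumlx * sumlx / suml2;
            best_scale = sumlx / suml2;
            for (int j = 0; j < n; ++j) {
                q[j] = (int8_t) (x[j] < 0.0f ? -mag[j] : mag[j]);
            }
        }
    }
    free(mag);
    free(ev);
    return best_scale;  // x[i] is approximated by q[i] * best_scale
}
```

Between two consecutive events, every candidate inverse scale rounds to the same integer vector, so visiting the sorted events effectively enumerates all distinct rounding results without relying on an arbitrary grid.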
#### `make_qkxh_quants`

This is very similar to `make_qx_quants`, but it does a more exhaustive cumulative search instead of a grid search (it's still not quite fully exhaustive, but close). It's general enough to be a replacement for both `make_qx_quants` and `make_qp_quants`, since it supports arbitrary min and max representable values, instead of assuming the negative side goes one further than the positive (in the case of `make_qx_quants`), or assuming the min is zero (for `make_qp_quants`). It does assume that zero is part of the representable range, though.
#### `make_qkxsh_quants`

This is almost the same as `make_qkxh_quants`, but it behaves differently for some distributions of `imatrix` weights where the best sign for the scale is not the sign of the absolute max value. For example, when the representable integer range is `[-2, 7]` and the weights are `[1, 8, 8]` instead of `[1, 1, 1]`, `make_qkxh_quants` shows some discontinuities at the boundaries where the max changes, but `make_qkxsh_quants` doesn't have this problem.

In practice, though, it doesn't seem to impact the quality of the quantization that much, except for very asymmetric types. This is used for `TQ2_0` with `imatrix`, since it's quite asymmetric, because it can store `{-1, 0, 1, 2}`.
#### `make_qkxh_nl_quants`

A more exhaustive general non-linear quantization function (which can technically be used for more than just the `IQ4_NL` kvalues if other non-linear types are introduced).

There are some variants. One doesn't assume the sign of the best scale; this is the slowest, but highest quality, and it is used when an `imatrix` file is provided. Another one assumes the sign of the best scale should make the absolute max value have the same sign as the absolute max `kvalue` of the non-linear mapping; this is used when no `imatrix` is provided, since it's faster than trying both signs.
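For readers unfamiliar with the non-linear types, here is a minimal sketch of the inner rounding step they share: instead of rounding `x[i] * id` to the nearest integer, it is mapped to the nearest entry of a small table of representable values (the kvalues). The table below is made up for illustration (it is not the actual `IQ4_NL` table), and the outer search over candidate scales, which is where `make_qkxh_nl_quants` differs from `quantize_row_iq4_nl_impl`, is not shown.

```c
#include <math.h>
#include <stdint.h>

// Illustrative, made-up table of non-linear levels (NOT the real IQ4_NL kvalues);
// the point is only that the representable levels are not equally spaced.
static const int8_t kvalues_example[8] = { -96, -64, -40, -20, 0, 20, 48, 88 };

// Index of the table entry closest to v.
static int best_index(float v, const int8_t * kvalues, int nk) {
    int   best   = 0;
    float best_d = fabsf(v - kvalues[0]);
    for (int k = 1; k < nk; ++k) {
        const float d = fabsf(v - kvalues[k]);
        if (d < best_d) { best_d = d; best = k; }
    }
    return best;
}

// For a fixed inverse scale id, quantize x[] by picking the nearest kvalue of
// x[i] * id; the search over candidate scales happens outside this function.
static void quantize_nl_fixed_scale(const float * x, uint8_t * q, int n, float id,
                                    const int8_t * kvalues, int nk) {
    for (int i = 0; i < n; ++i) {
        q[i] = (uint8_t) best_index(x[i] * id, kvalues, nk);
    }
}
```

Dequantization is then `d * kvalues[q[i]]`, with `d = 1 / id`.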
## Notes

- […] `qw[i] * (sigma2 + x[i] * x[i])`, which I think may be interesting for @jukofyork
- […] (notably `Q3_K` with `imatrix`), this may affect decisions regarding types chosen in mixes (@ddh0, @KerfuffleV2, @Nexesenex)
- `make_qkxh_nl_quants` is general enough to be useful in @ikawrakow's other non-linear types too (`IQ2_K`, `IQ3_K`, etc., in https://github.com/ikawrakow/ik_llama.cpp), although since they use multiple lookup tables for some types instead of only one, it might be more complicated than for `IQ4_NL` (and need some modifications).
- The variants of the algorithms which use `qsort` instead of a binary max-heap might be easier to understand, and were last in these lines from an older commit in this PR: `llama.cpp/ggml/src/ggml-quants.c`, lines 631 to 1107 in `0c9e442`
## TODO in future PRs

- Replace more uses of `make_qx_quants` and `make_qp_quants` with `make_qkxh_quants`
- […] (`IQ1_S`, `IQ1_M`, `IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS`, `IQ3_S`)
- […] `make_qkx3_quants` (`q[i] * s - m` quants)
- […] `qw[i] * (sigma2 + x[i] * x[i])` if possible
- […] `TQ2_0` in a quant mix (it's near `IQ1_S` quality-wise, but faster)