
Conversation

iseeyuan
Contributor

@iseeyuan commented Sep 4, 2024

@TiRune identified this option. When the absolute values of the quantized min and max differ, as with [-8, 7], we can compute the scale factor from the positive and negative sides individually and pick the larger one. This shows a perplexity improvement in llama-like 4-bit weight-quantized models.

before:
'word_perplexity,none': 24.198390005931635
after:
'word_perplexity,none': 23.25360136363946

This PR adds one new mapping type, SYMMETRIC_MAX_POS_NEG, to compute the group symmetric quantization scales as described above.

Please refer to the inline comments for the reasoning.
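For illustration, here is a minimal sketch of the per-group scale computation described above (a hedged sketch only; the function name `symmetric_scale` and the edge-case handling are assumptions, not the torchao implementation):

```python
import torch

def symmetric_scale(w: torch.Tensor, quant_min: int = -8, quant_max: int = 7) -> torch.Tensor:
    # Sketch (not the torchao code): derive a candidate scale from each side
    # of the range and keep the larger one, so neither side can clip.
    min_val_neg = torch.clamp(w.min(), max=0.0)  # most negative value (or 0)
    max_val_pos = torch.clamp(w.max(), min=0.0)  # most positive value (or 0)
    smin = min_val_neg / float(quant_min)        # scale that maps the min to quant_min
    smax = max_val_pos / float(quant_max)        # scale that maps the max to quant_max
    return torch.max(smin, smax)
```

For a [-8, 7] range this gives scale = max(|min| / 8, max / 7) instead of max(|min|, max) / 7.5.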

In the situation where min and max can be different, like [-8, 7]

pytorch-bot bot commented Sep 4, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/805

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 24c0873 with merge base 144445a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Sep 4, 2024
@cpuhrsch requested a review from jerryzh168 September 4, 2024 19:36
@@ -730,8 +730,10 @@ def _choose_qparams_affine(
    max_val_pos = max_val

    if mapping_type == MappingType.SYMMETRIC.name:
        max_val_pos = torch.max(-min_val_neg, max_val_pos)
        scale = max_val_pos / (float(quant_max - quant_min) / 2)
        smin = min_val_neg / float(quant_min)
Contributor

We discussed offline that @iseeyuan will create a new mapping_type for this, since it's not always better than the existing way of computing the scale. We can discuss the naming a bit later.


@TiRune Sep 5, 2024

In what scenarios would it be worse? I would argue this is always the intended behavior for any symmetric setting. Essentially, with min-max you never want your values to be clipped, which I believe can happen in the current implementation.

There's another issue not mentioned yet: with the current scheme, if you export the quantize-dequantized (QDQ) weights and then load them back in with the min-max quantizer, the results change every time you do this. With this PR, the weights stay the same even if you apply the min-max quantizer multiple times.

Contributor

@TiRune why would you never want to clip values? It's always a trade-off between clipping error and rounding error, I think.

For me, it's more about what's expected of a min-max quantizer. Sure, we always need to trade off clipping and rounding error, but that's what MSE-based range setting, HQQ, or those types of algorithms are for. The current symmetric choice of dividing by (q_max - q_min) / 2 is kind of arbitrary.

To me, a min-max range setter means 1) it always includes both the min and the max in the range, so there's no clipping error, and 2) applying it twice gives the same result, i.e., f(f(x)) = f(x). We use this, e.g., to export fake-quantized weights and load them into another library like ExecuTorch :D
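To make property (2) concrete, here is a small self-contained check (an illustrative sketch, not the torchao code path; `minmax_sym_qdq` and its defaults are assumptions):

```python
import torch

def minmax_sym_qdq(w, quant_min=-8, quant_max=7, no_clipping=True):
    # Sketch of symmetric min-max quantize-dequantize with the two scale
    # choices discussed in this thread (not the actual torchao implementation).
    min_val_neg = torch.clamp(w.min(), max=0.0)
    max_val_pos = torch.clamp(w.max(), min=0.0)
    if no_clipping:
        # this PR: pick the scale so that neither the min nor the max can clip
        scale = torch.max(min_val_neg / quant_min, max_val_pos / quant_max)
    else:
        # existing SYMMETRIC: divide the larger magnitude by (quant_max - quant_min) / 2
        scale = torch.max(-min_val_neg, max_val_pos) / ((quant_max - quant_min) / 2)
    q = torch.clamp(torch.round(w / scale), quant_min, quant_max)
    return q * scale

w = torch.randn(256)
once = minmax_sym_qdq(w)
twice = minmax_sym_qdq(once)
print(torch.allclose(once, twice))  # expected True: f(f(x)) == f(x) for the no-clipping scale
```

With the existing (quant_max - quant_min) / 2 divisor, the extreme value can be clipped on the first pass, so the second pass may land on a different scale and the same check generally does not hold.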

Another use case of the min-max quantizer: use it as the worst-case scenario for quantization, to seed HQQ/MSE-based range setting.

E.g., you start with min-max and then search over shrunken versions of this range; like in AWQ, you search over 0.99^N * the scale factor. In this kind of algorithm you also expect the min-max quantizer to start off with zero clipping error. A rough sketch of that seeding idea follows below.
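The sketch below is illustrative only: the helper names are hypothetical, and the shrink factor of 0.99 and step count are arbitrary example values, not AWQ's or HQQ's actual search settings.

```python
import torch

def qdq_error(w, scale, quant_min=-8, quant_max=7):
    # mean-squared quantize-dequantize error for a candidate scale
    q = torch.clamp(torch.round(w / scale), quant_min, quant_max)
    return ((q * scale - w) ** 2).mean()

def mse_search_from_minmax(w, quant_min=-8, quant_max=7, steps=100, shrink=0.99):
    # Seed with the min-max (no-clipping) scale, then sweep shrunken ranges
    # and keep whichever scale gives the lowest MSE.
    min_val_neg = torch.clamp(w.min(), max=0.0)
    max_val_pos = torch.clamp(w.max(), min=0.0)
    seed = torch.max(min_val_neg / quant_min, max_val_pos / quant_max)
    best_scale, best_err = seed, qdq_error(w, seed, quant_min, quant_max)
    for n in range(1, steps):
        s = seed * (shrink ** n)
        err = qdq_error(w, s, quant_min, quant_max)
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale
```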

Contributor Author

Thanks @TiRune and @jerryzh168! To play it safe in case there are existing uses of the old mapping, and to quickly unblock this use case (without fixing all the tests, especially the old tests that are already planned for updates), I added a new symmetric mapping type.

Later, if we find that the two symmetric mappings can be merged into one, we can do that in a follow-up PR.

Please review the code and let me know if it makes sense.

Contributor

@jerryzh168 Sep 6, 2024

@TiRune I see, yeah, that makes sense I think. As @iseeyuan mentioned, the current implementation is used by all the existing code, so changing its behavior would be BC-breaking; it's better to add a new mapping type.

@iseeyuan changed the title from "[test] Update the way scale is calculated for affine Symmetric" to "Update the way scale is calculated for affine Symmetric" Sep 5, 2024
@facebook-github-bot
Contributor

@iseeyuan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -41,12 +41,18 @@ class MappingType(Enum):
    we'll use (-10.2, 10.2) as the range for floating point and map that to (-8, 7)
    e.g. scale = (10.2 - (-10.2)) / (7 - (-8))

    SYMMETRIC_MAX_POS_NEG is a variant of symmetric mapping, where the scale is the max of smin
Contributor

How about SYMMETRIC_NO_CLIPPING_ERR? Also cc @TiRune in case there are suggestions for the name.

@jerryzh168
Contributor

Error is relevant, I think; maybe there are some changes to the QAT behavior of 8da4w.

@@ -1022,13 +1024,15 @@ def __init__(
        precision: torch.dtype = torch.float32,
        scales_precision: torch.dtype = torch.float32,
        device: torch.device = torch.device("cpu"),
        mapping_type: MappingType = MappingType.SYMMETRIC_MAX_POS_NEG
Contributor

Actually, why is this set to the new type? I think it would be safer to keep the old one.

@iseeyuan force-pushed the iseeyuan-patch-1 branch 2 times, most recently from 347f6d6 to c8675aa, September 7, 2024 00:05
@@ -41,12 +41,18 @@ class MappingType(Enum):
    we'll use (-10.2, 10.2) as the range for floating point and map that to (-8, 7)
    e.g. scale = (10.2 - (-10.2)) / (7 - (-8))

    SYMMETRIC_NO_CLIPPING_ERR is a variant of symmetric mapping, where the scale is the max of smin
Contributor

cc @gau-nernst wondering if this will help int8 training as well:

2. Calculate scale: AQT uses `input.abs().amax() / 127.5`, while `input.abs().amax() / 127` is

Collaborator

Possibly; we'll need to test. All the INT8 training papers I've seen use 127 for both +ve and -ve, though.

One potential concern I foresee is quantization speed: during training we keep quantizing, so having a fast implementation is crucial. Again, we'll need to try it out and profile to be sure 😄.

Contributor

Actually, it seems that even without this change, 127 can be achieved by setting quant_min/quant_max to (-127, 127).

Yeah, we could benchmark perf if we decide to use AQT for int8 training.
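For reference, a quick check of that arithmetic (plain torch; not AQT code):

```python
import torch

x = torch.randn(1024)
amax = x.abs().amax()
quant_min, quant_max = -127, 127
# the existing symmetric formula divides by (quant_max - quant_min) / 2 = 127,
# so with a (-127, 127) range it already yields amax / 127
scale = amax / (float(quant_max - quant_min) / 2)
assert torch.allclose(scale, amax / 127)
```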

@@ -212,7 +213,7 @@ def test_qat_8da4w_quantizer(self):
        m = M()
        m2 = copy.deepcopy(m)
        qat_quantizer = Int8DynActInt4WeightQATQuantizer(groupsize=group_size)
        ptq_quantizer = Int8DynActInt4WeightQuantizer(groupsize=group_size)
        ptq_quantizer = Int8DynActInt4WeightQuantizer(groupsize=group_size, mapping_type=MappingType.SYMMETRIC)
Contributor

nit: if this is default now, we can remove the arg here I think

@@ -164,7 +164,7 @@ def test_get_group_qparams_symmetric(self):
        scale_obs = scale_obs.reshape(weight.shape[0], -1)

        # assert that scales are identical
        (scale_ao, _) = get_group_qparams_symmetric(weight, n_bit, groupsize, precision=torch.float16)
        (scale_ao, _) = get_group_qparams_symmetric(weight, n_bit, groupsize, precision=torch.float16, mapping_type=MappingType.SYMMETRIC)
Contributor

Same for these.


        self.assertTrue(torch.equal(scale, scale_ref))
        self.assertTrue(torch.equal(zero_point, zp_ref))

    def test_choose_qparams_group_sym_pos_neg(self):
Contributor

nit: please update the name of the test as well

Contributor

@jerryzh168 left a comment

LGTM, thanks @iseeyuan and @TiRune

@jerryzh168 merged commit c6abf2b into main Sep 7, 2024
20 checks passed
@jerryzh168 deleted the iseeyuan-patch-1 branch September 7, 2024 06:17
facebook-github-bot pushed a commit to pytorch/executorch that referenced this pull request Sep 7, 2024
…perplexity (#5163)

Summary:
Refer to pytorch/ao#805 for the details.
With this change, the perplexity of a llama model is improved by ~4% on wikitext.


Differential Revision: D62342523

Pulled By: iseeyuan
jainapurva pushed a commit that referenced this pull request Sep 9, 2024
* [test] Update the way scale is calculated for affine Symmetric

In the situation where min and max can be different, like [-8, 7]

* Update quant_primitives.py

* Update test_qat.py

* Update test_quant_primitives.py

* Update test_quant_api.py
facebook-github-bot pushed a commit to pytorch/executorch that referenced this pull request Sep 11, 2024
…perplexity (#5163)

Summary:
Refer to pytorch/ao#805 for the details.
With this change, the perplexity of a llama model is improved by ~4% on wikitext.


Reviewed By: mergennachin, helunwencser

Differential Revision: D62342523

Pulled By: iseeyuan
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024