 From 38200912e616e130fdc2403dfb48df3a0e3b86d9 Mon Sep 17 00:00:00 2001 From: Vasiliy Kuznetsov Date: Mon, 13 Oct 2025 09:52:04 -0400 Subject: [PATCH] Simplify quantization README.md Move features which are not top of mind in 2025 to a section near the bottom where users can click to expand. For now, keep all the docs as they are; we can make further changes in future PRs. --- torchao/quantization/README.md | 295 +++++++++++++++++---------------- 1 file changed, 150 insertions(+), 145 deletions(-) diff --git a/torchao/quantization/README.md b/torchao/quantization/README.md index f53a6085c1..d4b972ce2c 100644 --- a/torchao/quantization/README.md +++ b/torchao/quantization/README.md @@ -71,56 +71,7 @@ lm_eval --model hf --model_args pretrained=${HF_USER}/${MODEL_ID} --tasks hellas Check out the lm-eval [usage docs](https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#basic-usage) for more details. -## Autoquantization - -Autoquantization is a tool to automatically determine the best way to apply quantization to your model by comparing the performance of each quantization technique to each layer for the input types and shapes you care about. - -```python -import torch -import torchao -from torchao.quantization import DEFAULT_INT4_AUTOQUANT_CLASS_LIST - -# Plug in your model and example input -model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16) -input = torch.randn(32,32, dtype=torch.bfloat16, device='cuda') -use_autoquant_default = True - -if use_autoquant_default: - # perform autoquantization and torch.compile with default settings - model = torchao.autoquant(torch.compile(model, mode='max-autotune')) -elif not use_autoquant_default: - # perform autoquantization and torch.compile with int4 support - model = torchao.autoquant(torch.compile(model, mode='max-autotune'), qtensor_class_list=DEFAULT_INT4_AUTOQUANT_CLASS_LIST) - -# pass in an input which is used in order to pick fastest quantization operations -# and apply torch compilation. -model(input) -``` - -When used as in the example above, when the `autoquant` api is called alongside torch.compile, autoquant sets up the model so that when its run on the next input, the autoquantization and torch.compile processes leave you with a heavily optimized model. - -When `model(input)` is called, (under the hood) the tool does a preliminary run with the input where each linear layer keeps track of the different shapes and types of activations that it sees. Once the preliminary run is complete, the next step is to check each linear layer and benchmark the tracked shapes for different types of quantization techniques in order to pick the fastest one, attempting to take into account fusions where possible. Finally once the best class is found for each layer, the next step is to apply the necessary quantization technique to each layer, before finally allowing the normal `torch.compile` process to occur on the now quantized model. By default the api only uses int8 techniques, i.e. it chooses between no quantization, int8 dynamic quantization and int8 weight only quantization for each layer, though there is also an option add int4 quantization which can be used for maximum performance or to avoid perf regressions from `Int4WeightOnlyConfig()` since for certain (compute bound) regimes, int4 weight only quantization can be very slow. - -Sometimes it is desirable to reuse a quantization plan that `autoquant` came up with. `torchao.quantization._AUTOQUANT_CACHE` is a dictionary holding autoquant's benchmark results. We can save it and restore it later, which will cause `autoquant` to choose the same quantization methods. 
 - -```python -import pickle -import torchao.quantization - -# After the first forward pass (when quantization was done) -from torchao.quantization.autoquant import _AUTOQUANT_CACHE -with open("quantization-cache.pkl", "wb") as f: - pickle.dump(_AUTOQUANT_CACHE, f) - -# On load -from torchao.quantization.autoquant import _AUTOQUANT_CACHE -with open("quantization-cache.pkl", "rb") as f: - _AUTOQUANT_CACHE.update(pickle.load(f)) -``` - ## Quantization Techniques -While the above `autoquant` api tries multiple quantization techniques to find the best combination for your model, the techniques themselves can -be applied individually. While there are a large variety of quantization apis, the following techniques have been thoroughly tested and perform well for the metrics they seek to optimize. Each are examples of affine quantization #### A16W4 WeightOnly Quantization @@ -218,7 +169,154 @@ quantize_( ) ``` -## Affine Quantization Details +#### Workaround with `unwrap_tensor_subclass` for `export`, `AOTI` and `torch.compile` + +If you are using PyTorch 2.6 or earlier, you need to call `unwrap_tensor_subclass` before `torch.export.export` and `aot_compile`: +```python +from torchao.utils import unwrap_tensor_subclass +m_unwrapped = unwrap_tensor_subclass(m) + +# export +m = torch.export.export(m_unwrapped, example_inputs).module() + +# aot_compile +torch._export.aot_compile(m_unwrapped, example_inputs) +``` + +If you are using PyTorch 2.4 or earlier, you also need to call `unwrap_tensor_subclass` before `torch.compile`. + +Note that the workaround is also required for `torch.compile` with `freezing` (`torch._inductor.config.freezing=True`) until https://github.com/pytorch/pytorch/pull/136265 is fixed. + +## Other Available Quantization Techniques + +### Sparse-Marlin + +Sparse-Marlin 2:4 is an optimized GPU kernel that extends the Mixed Auto-Regressive Linear (Marlin) dense kernel to support 4-bit quantized weights and 2:4 sparsity for extremely high performance. + +| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) | +| ----------- | ----------------------- | ------------- | ----------------------- | ---------------- | --------------- | +| Llama-3-8B | Base (bfloat16) | 95.64 | 1435.54 | 16.43 | 15.01 | +| | int8wo | 153.03 | 1150.80 | 10.42 | 7.52 | +| | int4wo-64 | 180.80 | 763.33 | 6.88 | 4.22 | +| | int4wo-64-sparse-marlin | 226.02 | 689.20 | 5.32 | 3.05 | + +More details can be found [here](../sparsity/README.md). + +### Marlin QQQ + +Marlin QQQ is an optimized GPU kernel that supports W4A8 mixed precision GEMM. For more details about Marlin QQQ, please refer to the [paper](https://arxiv.org/pdf/2406.09904). + +| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) | +| ----------- | ----------------------- | ------------- | ----------------------- | ---------------- | --------------- | +| Llama-2-7B | Base (float16) | 112.45 | 1486.00 | 13.93 | 13.21 | +| | w4a8 | 197.45 | 653.50 | 4.79 | 3.31 | +| | w4a8-g128 | 187.62 | 640.32 | 4.82 | 3.41 | + +### Gemlite Triton +Int4 and int8 quantization using the [Gemlite Triton](https://github.com/mobiusml/gemlite) kernels. You can try it out with the `quantize_` api as above alongside the config `GemliteUIntXWeightOnlyConfig`; a minimal sketch is shown below, and a full example can be found in `torchao/_models/llama/generate.py`. 
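 + +As a rough sketch of that flow (assuming a CUDA machine with the `gemlite` package installed, and using the config's default bit width and group size; see the config's docstring for the available options): + +```python +import torch +from torchao.quantization import quantize_, GemliteUIntXWeightOnlyConfig + +# toy stand-in for your model +model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16) + +# swap the weights of eligible linear layers for Gemlite-backed quantized tensors +quantize_(model, GemliteUIntXWeightOnlyConfig()) + +# compile as usual after quantization +model = torch.compile(model, mode='max-autotune') +out = model(torch.randn(16, 1024, dtype=torch.bfloat16, device='cuda')) +``` 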
 + +Note: we test on gemlite 0.4.1, but any later version should work; we recommend using the latest release to pick up the most recent performance improvements. + +### UINTx Quantization +We are developing kernels for low-bit intx quantization formats. While the current performance is not ideal, we plan to keep iterating on these kernels to improve it. + +| Model | Technique | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) | +| ----------- | ----------------------- | ------------------- | ------------- | ----------------------- | ---------------- | --------------- | +| Llama-2-7B | Base (bfloat16) | 12.212 | 107.38 | 1418.93 | 13.88 | 13.21 | +| | uintx-4-64-hqq | 12.775 | 50.99 | 200.08 | 6.29 | 3.92 | +| | uintx-2-8-hqq | 24.500 | 40.25 | 265.95 | 9.24 | 6.61 | +| Llama-3-8B | Base (bfloat16) | 7.441 | 95.64 | 1435.54 | 16.43 | 15.01 | +| | uintx-4-64-hqq | 8.124 | 47.85 | 213.24 | 11.85 | 4.46 | +| | uintx-2-8-hqq | 39.605 | 34.83 | 261.42 | 14.99 | 7.51 | + +You can try out these apis with the `quantize_` api as above alongside the config `UIntXWeightOnlyConfig`. An example can be found in `torchao/_models/llama/generate.py`. + +### Int8DynamicActivationIntxWeightConfig Quantization +We have kernels that do 8-bit dynamic quantization of activations and uintx groupwise quantization of weights. These kernels are experimental and can only be run on a device with an ARM CPU (e.g., a Mac computer with Apple silicon). The benchmarks below were run on an M1 Mac Pro with 8 performance cores, 2 efficiency cores, and 32GB of RAM. In all cases, torch.compile was used. + +| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) | +| ------------- | -------------------------------------------------| --------------| ------------------------| ---------------- | ----------------| +| Llama-3.1-8B | Base (bfloat16) | 1.24 | 18.62 | NA | 15.01 | +| | int8_dynamic_activation_intx_weight-4-256-false | 16.03 | 65.81 | NA | 4.11 | +| | int8_dynamic_activation_intx_weight-3-256-false | 18.94 | 59.97 | NA | 3.17 | + +You can try out these apis with the `quantize_` api as above alongside the config `Int8DynamicActivationIntxWeightConfig`. An example can be found in `torchao/_models/llama/generate.py`. + +### Codebook Quantization +The benchmarks below were run on a single NVIDIA A6000 GPU. + +| Model | Technique | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) | +| ----------- | ----------------------- | ------------------- | ------------- | ----------------------- | ---------------- | --------------- | +| Llama-3-8B | Base (bfloat16) | 7.590 | 32.36 | 485.71 | 16.19 | 15.01 | +| | codebook-4-64 | 9.533 | 1.73 | 8.62 | 23.11 | 4.98 | +| Llama-3.1-8B| Base (bfloat16) | 7.713 | 32.16 | 482.70 | 16.35 | 15.01 | +| | codebook-4-64 | 10.095 | 1.73 | 8.63 | 23.11 | 4.98 | + +You can try out these apis with the `quantize_` api as above alongside the config `CodebookWeightOnlyConfig`. An example can be found in `torchao/_models/llama/generate.py`. + +### GPTQ Quantization +We have a GPTQ quantization workflow that can be used to quantize a model to int4. More details can be found in [GPTQ](./GPTQ/README.md), +and an example can be found in `torchao/_models/llama/eval.py`. 
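 + +The configs in this section all plug into the same `quantize_` entry point used throughout this README. As a minimal sketch (not the exact recipe behind the benchmark tables above), the `uintx-4-64-hqq` rows roughly correspond to the following; the argument names `dtype`, `group_size` and `use_hqq` are assumptions, so check the `UIntXWeightOnlyConfig` docstring before relying on them: + +```python +import torch +from torchao.quantization import quantize_, UIntXWeightOnlyConfig + +# toy stand-in for your model +model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16) + +# 4-bit weights in groups of 64 with HQQ-based rounding (assumed parameter names) +quantize_(model, UIntXWeightOnlyConfig(dtype=torch.uint4, group_size=64, use_hqq=True)) + +model = torch.compile(model, mode='max-autotune') +out = model(torch.randn(16, 1024, dtype=torch.bfloat16, device='cuda')) +``` 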
 + +### Automatic Inductor Configuration + +:warning: This functionality is being migrated from the top level `quantize_` API to individual workflows; see https://github.com/pytorch/ao/issues/1715 for more details. + +The `quantize_` and `autoquant` apis now automatically use our recommended inductor configuration settings. You can replicate these settings for your own experiments by calling `torchao.quantization.utils.recommended_inductor_config_setter`. Alternatively, if you wish to disable the recommended settings, pass the keyword argument `set_inductor_config=False` to `quantize_` or `autoquant`. You can also overwrite these settings after they are assigned, as long as you do so before passing any inputs to the torch.compile'd model. This means that previous flows which manually set a variety of inductor configurations are now outdated, though continuing to set those same configurations manually is unlikely to cause any issues. + +
 + Expand to see more! + +### Autoquantization + +Autoquantization is a tool to automatically determine the best way to apply quantization to your model by comparing the performance of each quantization technique on each layer for the input types and shapes you care about. + +```python +import torch +import torchao +from torchao.quantization import DEFAULT_INT4_AUTOQUANT_CLASS_LIST + +# Plug in your model and example input +model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16) +input = torch.randn(32, 32, dtype=torch.bfloat16, device='cuda') +use_autoquant_default = True + +if use_autoquant_default: + # perform autoquantization and torch.compile with default settings + model = torchao.autoquant(torch.compile(model, mode='max-autotune')) +else: + # perform autoquantization and torch.compile with int4 support + model = torchao.autoquant(torch.compile(model, mode='max-autotune'), qtensor_class_list=DEFAULT_INT4_AUTOQUANT_CLASS_LIST) + +# pass in an input, which is used to pick the fastest quantization operations +# and apply torch compilation +model(input) +``` + +When used as in the example above, calling the `autoquant` api alongside torch.compile sets up the model so that when it is run on the next input, the autoquantization and torch.compile processes leave you with a heavily optimized model. + +When `model(input)` is called, the tool (under the hood) does a preliminary run with the input, during which each linear layer keeps track of the different shapes and types of activations that it sees. Once the preliminary run is complete, each linear layer is benchmarked on the tracked shapes against the different quantization techniques in order to pick the fastest one, attempting to take fusions into account where possible. Finally, once the best class is found for each layer, the chosen quantization technique is applied to each layer before the normal `torch.compile` process runs on the now-quantized model. By default the api only uses int8 techniques, i.e. it chooses between no quantization, int8 dynamic quantization, and int8 weight only quantization for each layer. There is also an option to add int4 quantization, which can be used for maximum performance or to avoid perf regressions from `Int4WeightOnlyConfig()`, since for certain (compute bound) regimes int4 weight only quantization can be very slow. + +Sometimes it is desirable to reuse a quantization plan that `autoquant` came up with. `torchao.quantization._AUTOQUANT_CACHE` is a dictionary holding autoquant's benchmark results. We can save it and restore it later, which will cause `autoquant` to choose the same quantization methods. 
+ +```python +import pickle +import torchao.quantization + +# After the first forward pass (when quantization was done) +from torchao.quantization.autoquant import _AUTOQUANT_CACHE +with open("quantization-cache.pkl", "wb") as f: + pickle.dump(_AUTOQUANT_CACHE, f) + +# On load +from torchao.quantization.autoquant import _AUTOQUANT_CACHE +with open("quantization-cache.pkl", "rb") as f: + _AUTOQUANT_CACHE.update(pickle.load(f)) +``` + +### Affine Quantization Details Affine quantization refers to the type of quantization that maps from high precision floating point numbers to quantized numbers (low precision integer or floating point dtypes) with an affine transformation, i.e.: `quantized_val = high_precision_float_val / scale + zero_point` where `scale` and `zero_point` are quantization parameters for some granularity and based on some data (also some dtypes may not require a `zero_point`). Each of the techniques in the above section qualify as Affine Quantization. ### Quantization Primitives @@ -320,106 +418,13 @@ for module, name in model.named_modules(): module.weight = nn.Parameter(to_linear_activation_quantized(module.weight, input_quant_func)) ``` -#### Workaround with `unwrap_tensor_subclass` for `export`, `AOTI` and `torch.compile` - -If you are using pytorch 2.6 or before, you need to call `unwrap_tensor_subclass` before `torch.export.export` and `aot_compile`: -``` -from torchao.utils import unwrap_tensor_subclass -m_unwrapped = unwrap_tensor_subclass(m) - - -# export -m = torch.export.export(m_unwrapped, example_inputs).module() - -# aot_compile -torch._export.aot_compile(m_unwrapped, example_inputs) -``` - -If you are using pytorch 2.4 or before, you'll also need `unwrap_tensor_subclass` before calling `torch.compile` as well. - -Note that the workaround is also required for `torch.compile` with `freezing` (`torch._inductor.config.freezing=True`) until https://github.com/pytorch/pytorch/pull/136265 is fixed. - -## Other Available Quantization Techniques - ### KV Cache Quantization We've added kv cache quantization and other features in order to enable long context length (and necessarily memory efficient) inference. In practice these features alongside int4 weight only quantization allow us to **reduce peak memory by ~55%**, meaning we can Llama3.1-8B inference with a **130k context length with only 18.9 GB of peak memory.** More details can be found [here](../../torchao/_models/llama/README.md#KV-Cache-Quantization-Memory-Efficient-Inference) -### Sparse-Marlin - -Sparse-Marlin 2:4 is an optimized GPU kernel that extends the Mixed Auto-Regressive Linear (Marlin) dense kernel to support 4-bit quantized weights and 2:4 sparsity for extremely high performance. - -| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) | -| ----------- | ----------------------- | ------------- | ----------------------- | ---------------- | --------------- | -| Llama-3-8B | Base (bfloat16) | 95.64 | 1435.54 | 16.43 | 15.01 | -| | int8wo | 153.03 | 1150.80 | 10.42 | 7.52 | -| | int4wo-64 | 180.80 | 763.33 | 6.88 | 4.22 | -| | int4wo-64-sparse-marlin | 226.02 | 689.20 | 5.32 | 3.05 | - -More details can be found [here](../sparsity/README.md) - -### Marlin QQQ - -Marlin QQQ is an optimized GPU kernel that supports W4A8 mixed precision GEMM. For more details about Marlin QQQ, please refer to [paper](https://arxiv.org/pdf/2406.09904). 
- -| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) | -| ----------- | ----------------------- | ------------- | ----------------------- | ---------------- | --------------- | -| Llama-2-7B | Base (float16) | 112.45 | 1486.00 | 13.93 | 13.21 | -| | w4a8 | 197.45 | 653.50 | 4.79 | 3.31 | -| | w4a8-g128 | 187.62 | 640.32 | 4.82 | 3.41 | - -### Gemlite Triton -Int4 and Int8 quantization using the [Gemlite Triton](https://github.com/mobiusml/gemlite) kernels. You can try it out with the `quantize_` api as above alongside the constructor `GemliteUIntXWeightOnlyConfig`. An example can be found in `torchao/_models/llama/generate.py`. - -Note: we test on gemlite 0.4.1, but should be able to use any version after that, we'd recommend to use the latest release to get the most recent performance improvements. - -### UINTx Quantization -We're trying to develop kernels for low bit quantization for intx quantization formats. While the current performance is not ideal, we're hoping to continue to iterate on these kernels to improve their performance. - -| Model | Technique | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) | -| ----------- | ----------------------- | ------------------- | ------------- | ----------------------- | ---------------- | --------------- | -| Llama-2-7B | Base (bfloat16) | 12.212 | 107.38 | 1418.93 | 13.88 | 13.21 | -| | uintx-4-64-hqq | 12.775 | 50.99 | 200.08 | 6.29 | 3.92 | -| | uintx-2-8-hqq | 24.500 | 40.25 | 265.95 | 9.24 | 6.61 | -| Llama-3-8B | Base (bfloat16) | 7.441 | 95.64 | 1435.54 | 16.43 | 15.01 | -| | uintx-4-64-hqq | 8.124 | 47.85 | 213.24 | 11.85 | 4.46 | -| | uintx-2-8-hqq | 39.605 | 34.83 | 261.42 | 14.99 | 7.51 | - -You try can out these apis with the `quantize_` api as above alongside the config `UIntXWeightOnlyConfig`. An example can be found in in `torchao/_models/llama/generate.py`. - -### Int8DynamicActivationIntxWeightConfig Quantization -We have kernels that do 8-bit dynamic quantization of activations and uintx groupwise quantization of weights. These kernels are experimental and can only be run on a device with an ARM CPU (e.g., a Mac computers with Apple silicon). The benchmarks below were run on an M1 Mac Pro, with 8 perf cores, and 2 efficiency cores, and 32GB of RAM. In all cases, torch.compile was used. - -| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) | -| ------------- | -------------------------------------------------| --------------| ------------------------| ---------------- | ----------------| -| Llama-3.1-8B | Base (bfloat16) | 1.24 | 18.62 | NA | 15.01 | -| | int8_dynamic_activation_intx_weight-4-256-false | 16.03 | 65.81 | NA | 4.11 | -| | int8_dynamic_activation_intx_weight-3-256-false | 18.94 | 59.97 | NA | 3.17 | - -You can try out these apis with the `quantize_` api as above alongside the config `Int8DynamicActivationIntxWeightConfig`. An example can be found in `torchao/_models/llama/generate.py`. - -### Codebook Quantization -The benchmarks below were run on a single NVIDIA-A6000 GPU. 
- -| Model | Technique | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) | -| ----------- | ----------------------- | ------------------- | ------------- | ----------------------- | ---------------- | --------------- | -| Llama-3-8B | Base (bfloat16) | 7.590 | 32.36 | 485.71 | 16.19 | 15.01 | -| | codebook-4-64 | 9.533 | 1.73 | 8.62 | 23.11 | 4.98 | -| Llama-3.1-8B| Base (bfloat16) | 7.713 | 32.16 | 482.70 | 16.35 | 15.01 | -| | codebook-4-64 | 10.095 | 1.73 | 8.63 | 23.11 | 4.98 | - -You try can out these apis with the `quantize_` api as above alongside the config `CodebookWeightOnlyConfig` an example can be found in in `torchao/_models/llama/generate.py`. - -### GPTQ Quantization -We have a GPTQ quantization workflow that can be used to quantize a model to int4. More details can be found in [GPTQ](./GPTQ/README.md), -an example can be found in `torchao/_models/llama/eval.py`. - -### Automatic Inductor Configuration - -:warning: This functionality is being migrated from the top level `quantize_` API to individual workflows, see https://github.com/pytorch/ao/issues/1715 for more details. - -The `quantize_` and `autoquant` apis now automatically use our recommended inductor configuration setings. You can mimic the same configuration settings for your own experiments by using the `torchao.quantization.utils.recommended_inductor_config_setter` to replicate our recommended configuration settings. Alternatively if you wish to disable these recommended settings, you can use the key word argument `set_inductor_config` and set it to false in the `quantize_` or `autoquant` apis to prevent assignment of those configuration settings. You can also overwrite these configuration settings after they are assigned if you so desire, as long as they are overwritten before passing any inputs to the torch.compiled model. This means that previous flows which referenced a variety of inductor configurations that needed to be set are now outdated, though continuing to manually set those same inductor configurations is unlikely to cause any issues. + +
## Notes