From 31608768bd784c775e04719f09640b99992f529f Mon Sep 17 00:00:00 2001
From: Vasiliy Kuznetsov
Date: Tue, 17 Sep 2024 10:17:16 -0700
Subject: [PATCH 1/2] move float8 inference README contents to prototype section

---
 torchao/quantization/README.md | 36 +++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/torchao/quantization/README.md b/torchao/quantization/README.md
index 433688e44b..db75071107 100644
--- a/torchao/quantization/README.md
+++ b/torchao/quantization/README.md
@@ -121,22 +121,6 @@ from torchao.quantization.quant_api import change_linear_weights_to_int8_dqtenso
 change_linear_weights_to_int8_dqtensors(model)
 ```
 
-#### A16W8 Float8 WeightOnly Quantization
-
-```python
-# for torch 2.5+
-from torchao.quantization import quantize_, float8_weight_only
-quantize_(model, float8_weight_only())
-```
-
-#### A16W8 Float8 Dynamic Quantization with Rowwise Scaling
-
-```python
-# for torch 2.5+
-from torchao.quantization.quant_api import quantize_, PerRow, float8_dynamic_activation_float8_weight
-quantize_(model, float8_dynamic_activation_float8_weight(granularity=PerRow()))
-```
-
 #### A16W6 Floating Point WeightOnly Quantization
 
 ```python
@@ -303,6 +287,26 @@ You can try out these apis with the `quantize_` api as above alongside the const
 ### Automatic Inductor Configuration
 The `quantize_` and `autoquant` apis now automatically use our recommended inductor configuration settings. You can replicate the same settings for your own experiments by calling `torchao.quantization.utils.recommended_inductor_config_setter`. Alternatively, if you wish to disable these recommended settings, set the keyword argument `set_inductor_config` to false in the `quantize_` or `autoquant` apis to prevent assignment of those configuration settings. You can also overwrite these configuration settings after they are assigned, as long as you do so before passing any inputs to the torch.compiled model. This means that previous flows which referenced a variety of inductor configurations that needed to be set are now outdated, though continuing to manually set those same inductor configurations is unlikely to cause any issues.
 
+### (prototype) A16W8 Float8 WeightOnly Quantization
+
+```python
+# for torch 2.5+
+from torchao.quantization import quantize_, float8_weight_only
+quantize_(model, float8_weight_only())
+```
+
+This API works today but has not been extensively tested and benchmarked yet.
+
+### (prototype) A16W8 Float8 Dynamic Quantization with Rowwise Scaling
+
+```python
+# for torch 2.5+
+from torchao.quantization.quant_api import quantize_, PerRow, float8_dynamic_activation_float8_weight
+quantize_(model, float8_dynamic_activation_float8_weight(granularity=PerRow()))
+```
+
+This API works today but has not been extensively tested and benchmarked yet.
+
 ## (To be moved to prototype) A16W4 WeightOnly Quantization with GPTQ
 
 ```python

From 4e0bdfdb802c0fd3f15cdc48c5c54ae2e13f02d4 Mon Sep 17 00:00:00 2001
From: Vasiliy Kuznetsov
Date: Tue, 17 Sep 2024 10:45:03 -0700
Subject: [PATCH 2/2] Update README.md

---
 torchao/quantization/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/torchao/quantization/README.md b/torchao/quantization/README.md
index db75071107..d21edf45e2 100644
--- a/torchao/quantization/README.md
+++ b/torchao/quantization/README.md
@@ -295,7 +295,7 @@ from torchao.quantization import quantize_, float8_weight_only
 quantize_(model, float8_weight_only())
 ```
 
-This API works today but has not been extensively tested and benchmarked yet.
+This API works today but has not been extensively tested and benchmarked yet. Hardware with CUDA compute capability 8.9 or greater is required.
 
 ### (prototype) A16W8 Float8 Dynamic Quantization with Rowwise Scaling
 
@@ -305,7 +305,7 @@ from torchao.quantization.quant_api import quantize_, PerRow, float8_dynamic_act
 quantize_(model, float8_dynamic_activation_float8_weight(granularity=PerRow()))
 ```
 
-This API works today but has not been extensively tested and benchmarked yet.
+This API works today but has not been extensively tested and benchmarked yet. Hardware with CUDA compute capability 8.9 or greater is required.
 
 ## (To be moved to prototype) A16W4 WeightOnly Quantization with GPTQ
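For illustration, a minimal sketch combining the prototype float8 API with the inductor-configuration behavior described in the "Automatic Inductor Configuration" paragraph above. The tiny model and shapes are hypothetical placeholders, and per PATCH 2/2 a GPU with CUDA compute capability 8.9 or greater plus torch 2.5+ are assumed:

```python
# Sketch only: the model and shapes below are hypothetical placeholders.
import torch
from torchao.quantization import quantize_, float8_weight_only
from torchao.quantization.utils import recommended_inductor_config_setter

# Per PATCH 2/2, the float8 prototype APIs need CUDA compute capability 8.9+.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# Default behavior: quantize_ also applies the recommended inductor settings.
quantize_(model, float8_weight_only())

# Opt-out variant: skip the automatic settings, then apply them (or your own
# overrides) manually before any inputs reach the compiled model:
#   quantize_(model, float8_weight_only(), set_inductor_config=False)
#   recommended_inductor_config_setter()

model = torch.compile(model)
out = model(torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16))
```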