
Feedback on quantize() API #384

Open · gau-nernst opened this issue Jun 16, 2024 · 13 comments
Labels: enhancement (New feature or request)

Comments

@gau-nernst (Collaborator)
Previously we did this:

import torch
from torchao.quantization.quant_api import change_linear_weights_to_int8_woqtensors

model = torch.compile(model, mode="max-autotune", fullgraph=True)
change_linear_weights_to_int8_woqtensors(model)
# compiling after quantizing also works

With the new quantization API, we have to do this:

import torch
from torchao.quantization.quant_api import quantize, int8wo, unwrap_tensor_subclass

model = quantize(model, int8wo())  # or "int8_weight_only"
model = unwrap_tensor_subclass(model)
model = torch.compile(model, mode='max-autotune', fullgraph=True)  # must compile after unwrap

I think the new API is less user-friendly than the previous one.

  1. int8wo() and int4wo() are a bit unintuitive. I understand they are a mechanism to pass params like group size to the quantization. Alternatives: a full-blown class with a __call__() method, e.g. Int8WeightOnlyConfig (kinda verbose, but the intention is clear), or just pass quant params as extra args/kwargs, e.g. quantize("int4wo", groupsize=128). (A sketch of the config-style alternative follows this list.)
  2. It's not clear what unwrap_tensor_subclass() does. Also, why do we need it now to compile the model, but not previously?
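
For illustration, a minimal sketch of what the config-style alternative could look like. Int4WeightOnlyConfig and quantize_sketch are hypothetical names used for the example, not existing torchao APIs:

from dataclasses import dataclass

from torch import nn

# Hypothetical config class: quant params live on the config, where
# they can be documented, instead of hiding behind **kwargs.
@dataclass
class Int4WeightOnlyConfig:
    group_size: int = 128

def quantize_sketch(model: nn.Module, config: Int4WeightOnlyConfig) -> nn.Module:
    # A real implementation would replace each Linear weight with an
    # int4 weight-only quantized tensor using config.group_size; this
    # sketch only shows the shape of the API.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            pass  # apply quantization with config.group_size here
    return model

model = quantize_sketch(nn.Linear(16, 16), Int4WeightOnlyConfig(group_size=32))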

@jerryzh168

msaroufim added the "enhancement" label on Jun 16, 2024

@msaroufim (Member) commented Jun 16, 2024

My understanding is that unwrap_tensor_subclass() exists primarily to work around a limitation of torch.export(); perhaps @tugsbayasgalan can shed more light. If that's the case, then we should ONLY recommend it for export(), since the API is indeed strange: it introduces concepts like unwrapping and subclasses, which are implementation details. So I see two options here:

  1. Remove the call to unwrap() by default, recommend it only for export(), and link to an export() issue explaining why it's needed so people can follow progress.
  2. Call unwrap automatically as part of the quantize() function so end users aren't aware of it (a sketch follows this list).
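
A minimal sketch of option 2, assuming quantize() and unwrap_tensor_subclass() keep their current signatures; quantize_and_unwrap and its unwrap flag are hypothetical:

from torch import nn
from torchao.quantization.quant_api import quantize, unwrap_tensor_subclass

def quantize_and_unwrap(model: nn.Module, apply_tensor_subclass, unwrap: bool = True) -> nn.Module:
    # Hypothetical wrapper: perform the unwrap inside the quantization
    # entry point so end users never have to know about it.
    model = quantize(model, apply_tensor_subclass)
    if unwrap:
        # Currently needed before torch.compile/torch.export; see above.
        model = unwrap_tensor_subclass(model)
    return model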

Regarding the int8wo() point: the problem is that not all quantization algorithms share the same arguments, and kwargs make it hard for users to figure out what's actually supported. Granted, we did explore some other ideas, like:

  1. Don't have a top-level user API for quantization.
  2. Don't shorten the names, so for example just say Int8WeightOnlyConfig. I personally strongly dislike our abbreviations like wo and qat; they're familiar to people who work with quantization all the time, but to no one else.

On your point around compilation: it is indeed unclear when a user should vs. must compile. We need to communicate the benefits, and the necessity of compilation might drive users back to a module-swap API.

@gau-nernst (Collaborator, Author)

Using the new quantize() API, unwrap_tensor_subclass() is a MUST. Without it, I get this error when running the snippet above:

  File "/home/---/code/ao/torchao/dtypes/aqt.py", line 240, in __torch_function__
    return _ATEN_OP_OR_TORCH_FN_TABLE[cls][func](*args, **kwargs)
  File "/home/---/code/ao/torchao/dtypes/utils.py", line 25, in wrapper
    return func(*args, **kwargs)
  File "/home/---/code/ao/torchao/dtypes/aqt.py", line 685, in functional_linear
    weight_tensor = weight_tensor.dequantize()
  File "/home/---/code/ao/torchao/dtypes/aqt.py", line 160, in dequantize
    int_data, scale, zero_point = self.layout_tensor.get_plain()
torch._dynamo.exc.TorchRuntimeError: Failed running call_method forward(*(Linear(in_features=4096, out_features=4096, bias=False), FakeTensor(..., device='cuda:0', size=(1, 4096), dtype=torch.float16)), **{}):
'FakeTensor' object has no attribute 'get_plain'

from user code:
   File "/home/---/miniconda3/envs/dev_nightly/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 43, in inner
    return fn(*args, **kwargs)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

@tugsbayasgalan

@msaroufim I am working on support for unwrapping/wrapping nested tensor subclasses in PT2. In general, we expect to be able to preserve tensor subclasses when users target our training IR, so they shouldn't have to rely on unwrap_tensor_subclass().
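
For context, a rough sketch of the export path under discussion as it stands today, using the imports from the snippet at the top of the issue. The toy model is a stand-in, and CUDA + fp16 mirrors the setup in the traceback above; once PT2 preserves nested subclasses, the unwrap line should become unnecessary:

import torch
from torch import nn
from torchao.quantization.quant_api import quantize, int8wo, unwrap_tensor_subclass

# Toy stand-in model, matching the dtype/device from the traceback.
model = nn.Sequential(nn.Linear(4096, 4096, bias=False)).cuda().half()
model = quantize(model, int8wo())
model = unwrap_tensor_subclass(model)  # still required before export/compile today

example_inputs = (torch.randn(1, 4096, dtype=torch.float16, device="cuda"),)
exported_program = torch.export.export(model, example_inputs)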

@yiliu30 (Contributor) commented Jun 17, 2024

Hi, I noticed that the GPTQ-related API was marked to be moved to prototype. Is there any alternative API to use, or are there any plans to support GPTQ formally?

@jerryzh168 (Contributor) commented Jun 17, 2024

@gau-nernst thanks for the feedback.

  1. Makes sense. Configs sound reasonable to me; I'll gather a bit more feedback on this one. I think we don't want to pass around kwargs, since we can't document them.
  2. The reason we need it for torch.compile now is that we have multiple levels of tensor subclass, whereas the previous implementation did not. This is a temporary workaround, and I hope it gets fixed soon; @tugsbayasgalan is working on it.

@jerryzh168 (Contributor)

> Hi, I noticed that the GPTQ-related API was marked to be moved to prototype. Is there any alternative API to use, or are there any plans to support GPTQ formally?

We are thinking of deprecating GPTQ once we make HQQ work. cc @HDCharles to confirm that HQQ is better than GPTQ in general.

@jerryzh168 (Contributor)

> Hi, I noticed that the GPTQ-related API was marked to be moved to prototype. Is there any alternative API to use, or are there any plans to support GPTQ formally?

Can you also describe your use case for GPTQ?

@HDCharles (Contributor) commented Jun 17, 2024

> Hi, I noticed that the GPTQ-related API was marked to be moved to prototype. Is there any alternative API to use, or are there any plans to support GPTQ formally?

@yiliu30 To add on to what @jerryzh168 said: we haven't seen many people interested in this API so far, so it's not something we've invested a ton of effort into. There are some limitations in the existing API/implementation that make it fail on parts of some models unless they're carefully handled (https://github.com/pytorch/ao/blob/main/torchao/_models/llama/model.py#L89-L96). We could fix those if we rewrote the whole thing, but until we do, it hasn't been tested as thoroughly and isn't expected to work as widely as something like int8 weight-only quantization. If you have a significant use case for GPTQ, that may change what we do with it.

@yiliu30 (Contributor) commented Jun 18, 2024

> Can you also describe your use case for GPTQ?

@jerryzh168 @HDCharles My reason for keeping GPTQ support is that it is quite popular within the community :). For instance, Hugging Face currently hosts 3000+ GPTQ models.

jerryzh168 added several commits to jerryzh168/ao referencing this issue (Jun 19-21, 2024), and one to pytorch/ao on Jun 21, 2024, all with the same message:

Summary:
Addressing feedback from pytorch#384 and pytorch#375

Test Plan:
regression tests

python test/quantization/test_quant_api.py
python test/integration/test_integration.py
@gau-nernst (Collaborator, Author)

@jerryzh168 Revisiting this issue, particularly unwrap_tensor_subclass(): when I tested with the latest main (96d49cd), unwrap_tensor_subclass() is still needed. Are there any drawbacks to including it inside quantize() so that users don't need to care about it? (as suggested by @msaroufim in #384 (comment))

@jerryzh168 (Contributor)

> @jerryzh168 Revisiting this issue, particularly unwrap_tensor_subclass(): when I tested with the latest main (96d49cd), unwrap_tensor_subclass() is still needed. Are there any drawbacks to including it inside quantize() so that users don't need to care about it? (as suggested by @msaroufim in #384 (comment))

The main thing is that it makes things a bit harder to debug, I think. We'll be removing this soon though, within the next couple of days; stay tuned. We're waiting for pytorch/pytorch#127431 to land, and then I'll put up a PR to remove it.

@gau-nernst (Collaborator, Author)

@jerryzh168 That's good to hear! However, users of previous PyTorch versions (e.g. v2.3) will still need to unwrap tensor subclasses? Might not be that important.

@jerryzh168 (Contributor)

> @jerryzh168 That's good to hear! However, users of previous PyTorch versions (e.g. v2.3) will still need to unwrap tensor subclasses? Might not be that important.

Yeah, that's true. I hope at some point we can stop supporting 2.2 and 2.3, so we can deprecate the old APIs as well (a version-gated stopgap is sketched below).

Also, we have an updated timeline of weeks to a month; see the comments in #462 (comment) for more details.
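
For older PyTorch versions, a version-gated call is one possible stopgap. A minimal sketch, assuming the unwrap requirement is lifted from 2.4 onward once pytorch/pytorch#127431 lands; the 2.4 threshold is an assumption, and maybe_unwrap is a hypothetical helper:

import torch
from torchao.quantization.quant_api import unwrap_tensor_subclass

def maybe_unwrap(model):
    # Assumption: PyTorch 2.4+/nightlies no longer need the unwrap once
    # pytorch/pytorch#127431 lands; older releases (e.g. 2.3) still do.
    major, minor = (int(v) for v in torch.__version__.split(".")[:2])
    if (major, minor) < (2, 4):
        model = unwrap_tensor_subclass(model)
    return model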

dbyoung18 pushed a commit to dbyoung18/ao that referenced this issue on Jul 31, 2024 (…orch#400), with the same commit message as above.