
All quantized ops should take scalar_type as an argument #34351

Closed
jerryzh168 opened this issue Mar 6, 2020 · 21 comments
Labels
low priority (We're unlikely to get around to doing this in the near future) · oncall: quantization (Quantization support in PyTorch) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@jerryzh168 (Contributor) commented Mar 6, 2020

https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qconcat.cpp#L111

quantized::cat should take scalar_type as an argument, because quantized::cat is fused from dequant - aten::cat - quantize, and quantize has a scalar_type argument.

We'll first need to implement a requantize op that can change the scalar_type of a Tensor (maybe just use the to API?), and then add a scalar_type argument to all the quantized ops; we can change one op per PR.
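For example, something like the following (a rough Python sketch on top of the existing quantize_per_tensor/dequantize APIs; the helper name is hypothetical, and a real kernel would avoid the float round trip):

import torch

def requantize(qx, scale, zero_point, dtype):
    # Hypothetical helper: round-trip through float and quantize again with
    # the requested parameters and scalar_type. A real kernel would do this
    # with integer/fixed-point math instead of dequantizing.
    return torch.quantize_per_tensor(qx.dequantize(), scale, zero_point, dtype)

q8 = torch.quantize_per_tensor(torch.randn(2, 3), scale=0.1, zero_point=128, dtype=torch.quint8)
q32 = requantize(q8, scale=0.1, zero_point=0, dtype=torch.qint32)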

cc @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @Xia-Weiwen @leslie-fang-intel @jerryzh168 @dzhulgakov

@jerryzh168 added the oncall: quantization and triaged labels on Mar 6, 2020
@imskr commented Mar 6, 2020

Can I work on this? @jerryzh168

@jerryzh168 (Contributor, Author) commented:

@imskr sure, feel free to submit a PR and tag us :)

@imskr commented Mar 6, 2020

@jerryzh168 Thanks. Do I have to add scalar_type as an argument to quantized::cat?

@jerryzh168 (Contributor, Author) commented Mar 6, 2020

@imskr yeah, I think what we need here is:
PR 0. Implement a requantize op that can convert between different scalar_types.
PR 1. Add scalar_type as an argument to quantized::cat, and call requantize at the end of the op.
PR 2-?. Add scalar_type to the other quantized ops.

I just discovered that none of our quantized ops has a scalar_type argument; we want to add scalar_type to all of them.

@jerryzh168 changed the title from "quantized::cat should take scalar_type as an argument" to "All quantized ops should take scalar_type as an argument" on Mar 6, 2020
@imskr commented Mar 7, 2020

I have to add the scalar_type argument to the functions in here, right? @jerryzh168

@jerryzh168 (Contributor, Author) commented:

@imskr yeah, and requantize is implemented in this PR: #34376; you can wait until that PR has landed.

@imskr commented Mar 10, 2020

@jerryzh168 Thanks

@imskr commented Mar 12, 2020

@jerryzh168 Since requantize is implemented in PR #34376, I can now add SCALAR_TYPE in the file, right?

@masahi commented Mar 16, 2020

@jerryzh168 I believe the output dtype of the concat op should be the same as that of the input qtensors, so adding scalar_type to the qconcat args doesn't seem right to me.

Instead, now that you have a requantize op, I think a better way is to use it to make the input quant params the same as the output's, and then use the normal concat op (aten::cat). This is how TVM implements qconcat: https://github.com/apache/incubator-tvm/blob/master/src/relay/qnn/op/concatenate.cc#L132-L143

In my PR on translating quantized torch models to TVM apache/tvm#4977, I tried two different approaches for converting quantized::cat:

  • Dequantize -> normal concat -> Quantize (Torch way)
  • Requantize inputs and use normal concat (TVM way linked above)

I didn't see any accuracy loss using the second approach. And the second approach is better because there are no float values in the middle, if requantize is implemented with fixed-point math (which doesn't seem to be the case for Torch, https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/quantized/Quantizer.cpp#L359).
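Roughly, the two options look like this in Python (hypothetical helper names; the requantization is sketched through a float round trip for brevity, unlike TVM's fixed-point path):

import torch

def qcat_dequant_quant(qxs, dim, scale, zero_point, dtype=torch.quint8):
    # Approach 1 (current Torch pattern): dequantize -> aten::cat -> quantize.
    fx = [qx.dequantize() for qx in qxs]
    return torch.quantize_per_tensor(torch.cat(fx, dim=dim), scale, zero_point, dtype)

def qcat_requantize_inputs(qxs, dim, scale, zero_point, dtype=torch.quint8):
    # Approach 2 (TVM-style): align every input to the output quantization
    # parameters first; the concatenation itself is then conceptually a pure
    # data copy of the integer values.
    aligned = [torch.quantize_per_tensor(qx.dequantize(), scale, zero_point, dtype)
               for qx in qxs]
    return torch.cat(aligned, dim=dim)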

@jerryzh168 (Contributor, Author) commented Mar 17, 2020

@masahi Thanks for providing your input! Yes, for quantized::cat we might want to enforce that inputs and outputs have the same scalar_type, but this is not the case in general. If we take a step back and look at how we get quantized ops, it is through patterns like the following, taking quantized::conv2d from graph mode quantization as an example:

  std::string conv2d = R"(
graph(%a_quant, %packed_params, %r_scale, %r_zero_point, %r_dtype, %stride, %padding, %dilation, %groups):
        %a_dequant = aten::dequantize(%a_quant)
        %w_quant : Tensor, %b : Tensor? = quantized::conv2d_unpack(%packed_params)
        %w_dequant = aten::dequantize(%w_quant)
        %r = aten::conv2d(%a_dequant, %w_dequant, %b, %stride, %padding, %dilation, %groups)
        %r_quant = aten::quantize_per_tensor(%r, %r_scale, %r_zero_point, %r_dtype)
        return (%r_quant) )";

  std::string quantized_conv2d = R"(
graph(%a_quant, %packed_params, %r_scale, %r_zero_point, %r_dtype, %stride, %padding, %dilation, %groups):
        %r_quant = quantized::conv2d(%a_quant, %packed_params, %stride, %padding, %dilation, %groups, %r_scale, %r_zero_point)
        return (%r_quant) )";

Notice that %r_dtype in aten::quantize_per_tensor is ignored in quantized::conv2d right now, which means this is not an equivalent transformation. And we do have cases where the scalar_type matters; cc @z-a-f for examples.

As for implementation, we can still explicitly enforce that the input quantized Tensors have the same scalar_type as the output for ops like quantized::cat.
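For example, a check along these lines (a hypothetical Python sketch, not the actual kernel code):

import torch

def check_cat_scalar_type(qxs, out_dtype):
    # Hypothetical guard for quantized::cat: refuse an output scalar_type that
    # differs from the inputs' scalar_type instead of silently converting.
    if any(qx.dtype != out_dtype for qx in qxs):
        raise RuntimeError(
            "quantized::cat: input scalar_type must match the requested output scalar_type")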

@masahi commented Mar 17, 2020

I believe the concat op doesn't need a scalar_type argument (assuming it means the output dtype), but for other ops it makes sense to add scalar_type.

From what I've seen, right now all Torch quantized ops output a quint8 tensor, so there is no need to pass around scalar_type. Can I use this occasion to ask whether the Torch team has a plan for other output types, like uint4? If that is the case, the output scalar_type really does matter. Personally I'm very interested in 4-bit quantization support.

@jerryzh168 (Contributor, Author) commented:

Yeah, but we would like to keep things consistent across all quantized ops. quantized::cat and similar ops can have checks that the input/output scalar_types are the same.

I think we'll support more scalar_types if we see them become popular on different hardware, but right now you can already do 4-bit quantization during training using fake quant to estimate the accuracy impact.
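A minimal sketch of that, assuming the FakeQuantize and MovingAverageMinMaxObserver modules in torch.quantization, with the clamped range [0, 15] standing in for an unsigned 4-bit type:

import torch
import torch.quantization as tq

# Fake-quantize to 16 levels: values stay float but are rounded onto a 4-bit
# grid, so the accuracy impact can be measured without real 4-bit kernels.
fq = tq.FakeQuantize(observer=tq.MovingAverageMinMaxObserver,
                     quant_min=0, quant_max=15, dtype=torch.quint8)
x = torch.randn(4, 8)
y = fq(x)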

@z-a-f (Contributor) commented Mar 18, 2020

FYI: It is good to have a scalar_type. For example, in our LSTM implementation we want 32-bit quant for the internal state (c), but the input/output and gates are 8-bit. At the same time, the sigmoid output is unsigned (could be signed though), while tanh is signed. It makes it more "natural" to have different scalar types for different parts. Of course, we could make an exception and adapt the gates to specific examples, so this is not a strong preference for me.
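As a rough illustration (made-up scales, not the actual quantized LSTM kernel), the different pieces would simply be quantized with different scalar types:

import torch

c = torch.randn(4, 16)                 # internal state: higher precision
g = torch.sigmoid(torch.randn(4, 16))  # sigmoid output: non-negative

qc = torch.quantize_per_tensor(c, scale=2 ** -16, zero_point=0, dtype=torch.qint32)
qg = torch.quantize_per_tensor(g, scale=1 / 256, zero_point=0, dtype=torch.quint8)              # unsigned
qt = torch.quantize_per_tensor(torch.tanh(c), scale=2 / 256, zero_point=0, dtype=torch.qint8)   # signed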

@z-a-f (Contributor) commented Mar 18, 2020

> I believe the concat op doesn't need a scalar_type argument (assuming it means the output dtype), but for other ops it makes sense to add scalar_type.
>
> From what I've seen, right now all Torch quantized ops output a quint8 tensor, so there is no need to pass around scalar_type. Can I use this occasion to ask whether the Torch team has a plan for other output types, like uint4? If that is the case, the output scalar_type really does matter. Personally I'm very interested in 4-bit quantization support.

For the concat_out you are right, we can infer the type from the output tensor. The problem is that when different paths with different quantization params come together, it is not 100% clear what the result should be. During the concatenation we have to requantize even if the scalar_type is the same for all tensors, for example when the scale and zero_point differ. In such a case, imho, changing the scalar_type comes for free.

I think a good approach would be to allow scalar_type in the cat, but also allow a default None arg.

But then again, this is not a strong preference for me; I can live with either implementation.
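A sketch of that proposal at the Python level (hypothetical signature; the requantization is shown through float for brevity):

import torch
from typing import List, Optional

def quantized_cat(qxs: List[torch.Tensor], dim: int, scale: float, zero_point: int,
                  scalar_type: Optional[torch.dtype] = None) -> torch.Tensor:
    # When scalar_type is None, keep the inputs' scalar type; otherwise the
    # requantization we do anyway also changes the output dtype.
    out_dtype = scalar_type if scalar_type is not None else qxs[0].dtype
    fx = torch.cat([qx.dequantize() for qx in qxs], dim=dim)
    return torch.quantize_per_tensor(fx, scale, zero_point, out_dtype)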

@z-a-f (Contributor) commented Mar 18, 2020

BTW FYI: torch.cat tries promoting the types when concatenating. The problem is that it is not behaving properly (@jerryzh168 I wonder if we should open an issue about that):

xt = torch.ones((2, 1, 3), dtype=torch.int32)
yt = torch.ones((2, 1, 3), dtype=torch.int8)
zt = torch.cat([xt, yt])
print(zt)
# tensor([[[          1,           1,           1]],
#         [[          1,           1,           1]],
#         [[   16843009,         257,   176279536]],
#         [[          1,    94132789, -1342175233]]], dtype=torch.int32)
zt = torch.cat([xt, yt.to(torch.int32)])
print(zt)
# tensor([[[1, 1, 1]],
#         [[1, 1, 1]],
#         [[1, 1, 1]],
#         [[1, 1, 1]]], dtype=torch.int32)

In numpy this is behaving properly:

x = np.ones((2, 1, 3), dtype=np.int32)
y = np.ones((2, 1, 3), dtype=np.int8)
z = np.concatenate([x,y])
print(z)
# array([[[1, 1, 1]],
#        [[1, 1, 1]],
#        [[1, 1, 1]],
#        [[1, 1, 1]]], dtype=int32)

@jerryzh168 (Contributor, Author) commented:

> BTW FYI: torch.cat tries promoting the types when concatenating. The problem is that it is not behaving properly (@jerryzh168 I wonder if we should open an issue about that): [...]

We don't allow type promotion for quantized data types, since quantized data types are not meaningful independently: they are coupled with other quantization parameters like scale/zero_point to represent the quantization, so the usual idea of data type promotion doesn't really apply here, I think.

@raghuramank100 (Contributor) commented:

Let's not do type promotion in concat. The input data types should be the same. I do not see the need for a scalar_type for cat; the output type should be the same as that of the input.

@dzhulgakov (Collaborator) commented:

Btw, in any case, for regular cat we should not return garbage; let's error out or fix it properly (meaning yes, let's create an issue for it).

@z-a-f (Contributor) commented Mar 19, 2020

@dzhulgakov #35014

@andrewor14 added the low priority label on Nov 17, 2023
@andrewor14 (Contributor) commented:

@jerryzh168 Is this something we still want to do?

@jerryzh168 (Contributor, Author) commented:

Probably not, since the native PyTorch quantized ops on CPU are no longer important for us.
