How can I get INT8 weight in my model? #58

Closed
zero90169 opened this issue Mar 12, 2020 · 8 comments

@zero90169

zero90169 commented Mar 12, 2020

Hi there,
I'm trying to train a VGG16 model (using the vgg16 provided in brevitas/examples/imagenet_classification/models/vgg.py, with settings following common.py) on our own dataset, and the model has trained well.
I looked at the code in brevitas/examples/imagenet_classification/models/common.py and found line 7: QUANT_TYPE = QuantType.INT.
Can I assume the weights in qnn.QuantConv2d will be INT type?
But when I load model.pt (my saved model weights, using torch.load), the weights look like this:
[image: screenshot of the weight values in the saved state dict]

How can I get the INT8 weights in my model, and how do I use those weights for inference on an FPGA? Do I just directly port the weights into my VGG design on the FPGA, or do I need to add some scaling step?

@volcacius
Contributor

Hi,

Conv layers have the properties int_weight and quant_weight_scale to extract the integer weights and their scale factor; activations have the method quant_act_scale() to extract their scale factor. Mapping to any hardware is up to the user, and it really depends on how the quantization was set up.
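
For example, a minimal sketch of pulling out the integer weights and scale factor from a conv layer (attribute names follow the description above; the exact API may differ across Brevitas versions, and the layer here is a stand-in for one taken from a trained model):

from brevitas.nn import QuantConv2d
from brevitas.core.quant import QuantType

# Stand-in layer for illustration; in practice take a layer from your trained model.
conv = QuantConv2d(3, 64, 3, 1, 1,
                   weight_bit_width=8,
                   weight_quant_type=QuantType.INT)

int_w = conv.int_weight          # integer-valued weights
scale = conv.quant_weight_scale  # scale factor of the weight quantizer

# The floating-point values stored in the saved state dict correspond
# (approximately) to int_w * scale, which is why torch.load shows floats.
print(int_w.shape, scale)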

Alessandro

@zero90169
Author

zero90169 commented Mar 15, 2020

Hi @volcacius,
Thanks for your friendly reply.
I have a few more questions about how to implement a residual block. I want to implement MobileNetV2, but I have no idea what to do when I face the add operation under quantization.

Output tensors with different scale factors cannot be added directly, right?
In other words, if x_1 (float32) is quantized as q_1 (INT8) with scale factor s_1, and x_2 (float32) is quantized as q_2 (INT8) with scale factor s_2, then directly adding q_1 and q_2 gives a wrong value after dequantization, right?
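For example, with toy numbers (a simple numerical illustration, not Brevitas code):

q1, s1 = 10, 0.5   # represents x_1 ≈ q1 * s1 = 5.0
q2, s2 = 10, 0.1   # represents x_2 ≈ q2 * s2 = 1.0

correct = q1 * s1 + q2 * s2   # 6.0, the sum we actually want
naive = (q1 + q2) * s1        # 10.0 if we pretend both tensors share s1: clearly wrong
print(correct, naive)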
I am really new to this field. Could you please provide a toy example of how to quantize the following network?

import torch.nn as nn
from torch.nn import Conv2d

class Net(nn.Module):
    def __init__(self, in_ch, out_ch):
        super(Net, self).__init__()
        self.c1 = Conv2d(in_ch, out_ch, 3, 1, 1)
        self.c2 = Conv2d(out_ch, out_ch, 3, 1, 1)

    def forward(self, x):
        r = self.c1(x)
        x = self.c2(r)
        return r + x

@volcacius
Contributor

Hello,

Yes, this is a typical problem to address. You have two main strategies.
In general, you can easily check that you are doing the right thing by passing around a QuantTensor where possible (return_quant_tensor=True in activations; compute_output_scale=True, compute_output_bit_width=True and return_quant_tensor=True for conv/fc). When you sum two QuantTensors, Brevitas checks that they have the same scale factor.
The first strategy is to re-use activations like QuantReLU or QuantHardTanh so that you get the same scale factor wherever you have an element-wise add or concatenation. If you look at the ProxylessNAS example you can get an idea. Basically, I allocate a new activation function every time a chain of residual connections starts, which typically happens after some subsampling (stride != 1).
This applies to your example: you insert a QuantHardTanh (which is really just a signed quantized identity) and call it twice:

import torch.nn as nn
from brevitas.nn import QuantConv2d, QuantHardTanh
from brevitas.core.quant import QuantType

class QuantNet(nn.Module):
    def __init__(self, in_ch, out_ch):
        super(QuantNet, self).__init__()
        self.shared_quant = QuantHardTanh(
            bit_width=8,
            quant_type=QuantType.INT,
            min_val=-1.0,  # arbitrary
            max_val=1.0,
            return_quant_tensor=True)
        self.c1 = QuantConv2d(
            in_ch, out_ch, 3, 1, 1,
            weight_bit_width=8,
            weight_quant_type=QuantType.INT)
        self.c2 = QuantConv2d(
            out_ch, out_ch, 3, 1, 1,
            weight_bit_width=8,
            weight_quant_type=QuantType.INT)

    def forward(self, x):
        r = self.shared_quant(self.c1(x))
        x = self.shared_quant(self.c2(r))
        return r + x  # you can check that r and x have the same scale factor

The second strategy is to share a weight quantizer between multiple conv/fc layers. This applies to any scenario where you need weights with the same scale factor (across the whole layer, or across corresponding output channels, depending on whether you set weight_scaling_per_output_channel=True).
An example of the second strategy:

import torch.nn as nn
from brevitas.nn import QuantConv2d, QuantHardTanh
from brevitas.core.quant import QuantType

class NetSharedWeightQuant(nn.Module):
    def __init__(self, in_ch, out_ch):
        super(NetSharedWeightQuant, self).__init__()
        self.inp_quant = QuantHardTanh(
            bit_width=8,
            quant_type=QuantType.INT,
            min_val=-1.0,  # arbitrary
            max_val=1.0,
            return_quant_tensor=True)
        self.c1 = QuantConv2d(
            in_ch, out_ch, 3, 1, 1,
            weight_bit_width=8,
            weight_quant_type=QuantType.INT,
            compute_output_scale=True,
            compute_output_bit_width=True,
            return_quant_tensor=True)
        self.c2 = QuantConv2d(
            in_ch, out_ch, 3, 1, 1,
            weight_quant_override=self.c1.weight_quant,
            # weight_quant_type is not necessary when setting weight_quant_override,
            # but otherwise this check would fail:
            # https://github.com/Xilinx/brevitas/blob/d9b1e355299abd5fe0e4a5527c1f69b8371fb121/brevitas/nn/quant_conv.py#L131
            weight_quant_type=QuantType.INT,
            compute_output_scale=True,
            compute_output_bit_width=True,
            return_quant_tensor=True)

    def forward(self, inp):
        x = self.inp_quant(inp)
        r1 = self.c1(x)
        r2 = self.c2(x)
        return r1 + r2  # you can check that r1 and r2 have the same scale factors

Depending on the topology you can combine the two strategies. The good thing is that in both cases, learned scale factors adapt to the fact that they are used in multiple places.

@ziqi-zhang

Hi @volcacius, thanks for the clear example. I still have a question about a small difference between the first strategy and the ProxylessNAS implementation. The first strategy applies shared_quant to r and x separately and then adds them. But in the ProxylessNAS implementation, you add r and x first and then apply shared_quant. I think when you add r and x first, their scale factors are different, so I'm confused. Can you explain this to me?

Thanks!

@volcacius
Contributor

Hello,

The ProxylessNAS implementation is doing what the first strategy does; it's just the way the code is organized that makes it harder to see.
The reason you call shared_act(x) at line 175 here is that you want the input to the next block to be quantized with that scale factor. That means that in the next block, the variable identity has been quantized with shared_act. Additionally, again in the next block, when you call self.body, the last activation called is again the same shared_act. So when you perform identity + x, you are summing two quantized tensors that have passed through the same shared_act. If you run the code, you'll see at runtime that identity.scale == x.scale.
Unfortunately there isn't an easier way to do this sort of thing at the moment, but the good thing is that if you make a mistake for any reason (identity.scale != x.scale), Brevitas will raise an exception (you can see it in the implementation of __add__ in QuantTensor).
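
To make the pattern concrete, here is a minimal sketch of the structure described above (a simplified illustration with made-up names like ResidualBlock, not the actual ProxylessNAS code; details such as where return_quant_tensor is enabled may need adjusting):

import torch
import torch.nn as nn
from brevitas.nn import QuantConv2d, QuantHardTanh
from brevitas.core.quant import QuantType

class ResidualBlock(nn.Module):
    def __init__(self, ch, shared_act):
        super(ResidualBlock, self).__init__()
        # The same QuantHardTanh instance is shared by every block in the chain.
        self.shared_act = shared_act
        self.body = nn.Sequential(
            QuantConv2d(ch, ch, 3, 1, 1,
                        weight_bit_width=8,
                        weight_quant_type=QuantType.INT),
            shared_act)  # the last activation of the body is the shared one

    def forward(self, x):
        # x is assumed to have been quantized by the same shared_act, either at the
        # start of the chain or by the previous block's output below.
        identity = x
        x = self.body(x)             # also quantized by shared_act
        out = identity + x           # both addends carry the same scale factor
        return self.shared_act(out)  # quantize the block output for the next block

shared_act = QuantHardTanh(bit_width=8, quant_type=QuantType.INT,
                           min_val=-1.0, max_val=1.0,  # arbitrary
                           return_quant_tensor=True)
# The chain input is quantized with the same shared_act before entering the blocks.
x = shared_act(torch.randn(1, 64, 32, 32))
y = nn.Sequential(ResidualBlock(64, shared_act), ResidualBlock(64, shared_act))(x)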
Hope this helps.

Alessandro

@volcacius volcacius reopened this Mar 30, 2020
@ziqi-zhang

Hi @volcacius, thanks for your explanation; I understand now. I have another question about the data flow inside Brevitas. I found that in the conv layer, what Brevitas does is convert the conv weights to quantized values and multiply them with the input. The scale of the input doesn't seem to be involved in the computation. Besides, a QuantTensor holds a tensor whose values are unquantized floating point. So I was wondering: if I just want to see the effect of quantization on accuracy and don't care about speedup, can I just feed plain PyTorch tensors to each operation and leave the tensor scale out? Then the scale problem in this issue would no longer be a problem.

@volcacius
Contributor

Hi,

Yes, you don't have to use quant tensors if you don't want to. For quantized weights and activations, they are not required. In QuantConv2d and QuantLinear, you can set compute_output_scale=False, compute_output_bit_width=False, and return_quant_tensor=False. In QuantReLU and QuantHardTanh, you just need to set return_quant_tensor=False.
However, quant tensors are required for some things. If you enable bias quantization (bias_quant_type == INT), you need to pass a QuantTensor to QuantConv2d/QuantLinear and have compute_output_scale=True and compute_output_bit_width=True. This is because the scale factor (and possibly the bit width) of the bias depends on the scale factor of the input. You have the same issue with QuantAvgPool. For those cases, the easiest thing is to set return_quant_tensor=True only where you need it; for example, for QuantAvgPool you can enable it only in the activation right before the QuantAvgPool layer and leave it disabled everywhere else.
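
A minimal sketch of the bias-quantization case described above (names and parameters follow the thread and the older Brevitas API; treat the exact argument set as an assumption that may need adjusting for your version):

import torch.nn as nn
from brevitas.nn import QuantConv2d, QuantHardTanh
from brevitas.core.quant import QuantType

class BiasQuantBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super(BiasQuantBlock, self).__init__()
        # The activation returns a QuantTensor so the conv below knows the input scale.
        self.act = QuantHardTanh(
            bit_width=8,
            quant_type=QuantType.INT,
            min_val=-1.0,  # arbitrary
            max_val=1.0,
            return_quant_tensor=True)
        self.conv = QuantConv2d(
            in_ch, out_ch, 3, 1, 1,
            weight_bit_width=8,
            weight_quant_type=QuantType.INT,
            bias_quant_type=QuantType.INT,   # bias quantization needs the input scale
            compute_output_scale=True,
            compute_output_bit_width=True,
            return_quant_tensor=False)       # plain tensor out; QuantTensor only where needed

    def forward(self, x):
        return self.conv(self.act(x))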

Alessandro

@Shishuii

Shishuii commented Feb 5, 2021

Hi, @ziqi-zhang @zero90169

I am still having some issues implementing MobileNetV2 (passing shared quantizers). Could you share your model code?
It would be of great help.

Pritesh
