
Performance Bottleneck for Quantize/Dequantize #142

@liangan1

Description:
The quantize/dequantize ops are implemented single-threaded in fbgemm, and they become the performance bottleneck when used in an int8 model.
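For reference, per-tensor affine quantize/dequantize is conceptually just an elementwise loop over the tensor. A minimal single-threaded sketch (illustrative only; the names and signatures are assumptions, not fbgemm's actual code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstddef>

// Illustrative per-tensor affine quantize/dequantize kernels (assumed
// names/signatures, not fbgemm's actual code). Each one walks the whole
// tensor on a single thread, so for large activation tensors these
// loops dominate the runtime of an otherwise-parallel int8 model.
void quantize_serial(const float* src, uint8_t* dst, size_t n,
                     float scale, int32_t zero_point) {
  for (size_t i = 0; i < n; ++i) {
    // Round to nearest, shift by the zero point, then clamp to uint8.
    int32_t q =
        static_cast<int32_t>(std::nearbyint(src[i] / scale)) + zero_point;
    dst[i] = static_cast<uint8_t>(std::min(255, std::max(0, q)));
  }
}

void dequantize_serial(const uint8_t* src, float* dst, size_t n,
                       float scale, int32_t zero_point) {
  for (size_t i = 0; i < n; ++i) {
    dst[i] = scale * (static_cast<int32_t>(src[i]) - zero_point);
  }
}
```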

We use pytorch-transformers to enable the int8 model, and only the Linear ops are quantized, so many quantize/dequantize ops are needed. For the GLUE/CoLA task with a large BERT model, the profiling results are as follows:

[profiling screenshot: op-level breakdown for the baseline int8 model]

To improve performance, we used OpenMP to parallelize the quantize/dequantize ops. The profiling results are as follows:

[profiling screenshots: op-level breakdown with OpenMP-parallelized quantize/dequantize]
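A minimal sketch of how the OpenMP parallelization might look (illustrative, not the exact patch): the elementwise loop has no cross-iteration dependencies, so it can be split across threads with a single pragma.

```cpp
#include <cstdint>
#include <cstddef>

// Same dequantize kernel as above, parallelized with OpenMP in the
// spirit of the change described here (illustrative, not the exact
// patch). Each thread handles a chunk of the tensor; iterations are
// independent, so no synchronization is needed.
// Build with -fopenmp (GCC/Clang) to enable the pragma.
void dequantize_parallel(const uint8_t* src, float* dst, size_t n,
                         float scale, int32_t zero_point) {
  #pragma omp parallel for
  for (ptrdiff_t i = 0; i < static_cast<ptrdiff_t>(n); ++i) {
    dst[i] = scale * (static_cast<int32_t>(src[i]) - zero_point);
  }
}
```

The quantize kernel can be parallelized the same way, since its loop body is likewise independent per element.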

From the above results, we can see that the single-threaded quantize/dequantize ops seriously impact the performance of the quantized model.

Environment:
Intel Xeon Platinum 8280 (Cascade Lake) CPU
