Description:
The dequantize/quantize ops are implemented single-threaded in fbgemm, and these dequantize ops become the performance bottleneck when they are used in an int8 model.
We use pytorch-transformers to enable an int8 model in which only the Linear ops are quantized, so many dequantize/quantize ops are needed. For the GLUE/CoLA task with the large-base-mode, the profiling results are as follows:

In order to improve performance, we use OpenMP to speed up the dequantize/quantize ops. The profiling results are as follows:


From the above results, we can see that the single-threaded dequantize/quantize ops seriously limit the performance of the quantized model.
Environment:
Cascade Lake 8280 CPU