
Performance Bottleneck for Quantize/Dequantize #142

@liangan1

Description:
The quantize/dequantize ops are implemented single-threaded in fbgemm, and they become the performance bottleneck when used in an int8 model.
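For reference, per-tensor affine quantize/dequantize is conceptually just an elementwise loop over the tensor. A minimal single-threaded sketch (illustrative only; the names and signatures are assumptions, not fbgemm's actual code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstddef>

// Illustrative per-tensor affine quantize/dequantize kernels (assumed
// names/signatures, not fbgemm's actual code). Each one walks the whole
// tensor on a single thread, so for large activation tensors these
// loops dominate the runtime of an otherwise-parallel int8 model.
void quantize_serial(const float* src, uint8_t* dst, size_t n,
                     float scale, int32_t zero_point) {
  for (size_t i = 0; i < n; ++i) {
    // Round to nearest, shift by the zero point, then clamp to uint8.
    int32_t q =
        static_cast<int32_t>(std::nearbyint(src[i] / scale)) + zero_point;
    dst[i] = static_cast<uint8_t>(std::min(255, std::max(0, q)));
  }
}

void dequantize_serial(const uint8_t* src, float* dst, size_t n,
                       float scale, int32_t zero_point) {
  for (size_t i = 0; i < n; ++i) {
    dst[i] = scale * (static_cast<int32_t>(src[i]) - zero_point);
  }
}
```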

We use pytorch-transformers to enable the int8 model, and only the Linear ops are quantized, so many quantize/dequantize ops are needed. For the GLUE/CoLA task with a large BERT model, the profiling results are as follows:

[profiling screenshot: op-level breakdown for the baseline int8 model]

To improve performance, we used OpenMP to parallelize the quantize/dequantize ops. The profiling results are as follows:

[profiling screenshots: op-level breakdown with OpenMP-parallelized quantize/dequantize]
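A minimal sketch of how the OpenMP parallelization might look (illustrative, not the exact patch): the elementwise loop has no cross-iteration dependencies, so it can be split across threads with a single pragma.

```cpp
#include <cstdint>
#include <cstddef>

// Same dequantize kernel as above, parallelized with OpenMP in the
// spirit of the change described here (illustrative, not the exact
// patch). Each thread handles a chunk of the tensor; iterations are
// independent, so no synchronization is needed.
// Build with -fopenmp (GCC/Clang) to enable the pragma.
void dequantize_parallel(const uint8_t* src, float* dst, size_t n,
                         float scale, int32_t zero_point) {
  #pragma omp parallel for
  for (ptrdiff_t i = 0; i < static_cast<ptrdiff_t>(n); ++i) {
    dst[i] = scale * (static_cast<int32_t>(src[i]) - zero_point);
  }
}
```

The quantize kernel can be parallelized the same way, since its loop body is likewise independent per element.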

From the above results, we can see that the single-threaded quantize/dequantize ops seriously impact the performance of the quantized model.

Environment:
Intel Xeon Platinum 8280 (Cascade Lake) CPU
