Describe the feature request
DequantizeLinear's current implementation is naive: it runs on a single thread and uses only scalar instructions.
Could we prioritize a multithreaded/vectorized implementation for this code path, to match the MlasQuantizeLinearKernel implementation?
@fajin-corp has already made some comments about this.
Describe scenario use case
We should see performance improvements across a wide range of use cases.
Recently, the QNN-EP made this code path its default for performance reasons as well; vectorization would likely help that effort even more.
https://github.com/microsoft/onnxruntime/releases/tag/v1.20.2
Another user recently reported performance gains with the Qwen 2.5 0.5B model, but that thread has since become stale, so I cannot add onto it.