Qualcomm AI Engine Direct - Observer Fix and remove unused passes #6225
Summary
- `ConvertToLinear()` is redundant in `qnn_preprocess.py`, since this pass is already called in `executorch/backends/qualcomm/utils/utils.py`.
- Some models are experiencing a significant drop in accuracy, with a few models having 0% accuracy. Add new conditions to perform requantization, and change `ptq_per_channel_quant_config`'s IO observer from `MinMaxObserver` to `MovingAverageMinMaxObserver` to resolve the issue (a config sketch follows below).
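The observer swap itself is only a configuration change. A minimal sketch of the idea using the stock `torch.ao.quantization` observer APIs; the spec fields below are illustrative, not the exact values used in `ptq_per_channel_quant_config`:

```python
import torch
from torch.ao.quantization.observer import (
    MinMaxObserver,
    MovingAverageMinMaxObserver,
)
from torch.ao.quantization.quantizer import QuantizationSpec

# Before: IO activations observed with a plain running min/max.
io_spec_before = QuantizationSpec(
    dtype=torch.uint8,
    quant_min=0,
    quant_max=255,
    qscheme=torch.per_tensor_affine,
    observer_or_fake_quant_ctr=MinMaxObserver.with_args(eps=2**-12),
)

# After: a moving average damps outlier batches, giving more stable ranges.
io_spec_after = QuantizationSpec(
    dtype=torch.uint8,
    quant_min=0,
    quant_max=255,
    qscheme=torch.per_tensor_affine,
    observer_or_fake_quant_ctr=MovingAverageMinMaxObserver.with_args(
        averaging_constant=0.01, eps=2**-12
    ),
)
```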
For the requantization part: before the PyTorch PR, the check for whether adjacent quantize/dequantize (QDQ) pairs were redundant only compared `dtype` and `is_dynamic`. After that change, it checks more attributes, such as `scale`, `zero_point`, etc. As a result, some nodes end up with an extra pair of QDQ nodes. As shown in the image below, there are 2 pairs of QDQ nodes after the PyTorch PR, and the 2 pairs have different scale and offset. During the QNN lowering process, a node only saves the quant info immediately following its output. For example, the `cat` op below will use `quantize_per_tensor_default_18`'s scale and offset as the node's quant attribute, and all other quant and dequant nodes will be ignored. This causes an accuracy drop, but by inserting a requantize node we see an improvement in accuracy for most models. Taking inceptionv3 as an example, the average top-1 accuracy goes from 0% to ~75%. I have checked a couple of other models, and accuracy either stays the same or improves.
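To make the fix concrete, here is an illustrative-only sketch of detecting such mismatched DQ→Q pairs on the exported FX graph. The helper names (`q_params`, `annotate_requant`) and the `meta` key are hypothetical, not the exact pass added in this PR:

```python
import torch

Q_OPS = {torch.ops.quantized_decomposed.quantize_per_tensor.default}
DQ_OPS = {torch.ops.quantized_decomposed.dequantize_per_tensor.default}

def q_params(node: torch.fx.Node):
    # For the decomposed q/dq ops, args are (input, scale, zero_point, ...).
    return node.args[1], node.args[2]

def annotate_requant(graph_module: torch.fx.GraphModule):
    for node in graph_module.graph.nodes:
        if node.target not in DQ_OPS:
            continue
        for user in list(node.users):
            # A DQ feeding a Q with a different scale/offset cannot simply
            # be folded away; the value must actually be requantized.
            if user.target in Q_OPS and q_params(user) != q_params(node):
                # Record the target params so lowering can emit a QNN
                # requantize op instead of silently dropping them.
                node.meta["requantize"] = q_params(user)
```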
I have also provided an option for users to skip this requant optimization if they prefer not to use it.
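A hypothetical usage sketch of the opt-out; the actual flag name and entry point in this PR may differ:

```python
def run_qnn_passes(graph_module, skip_requant_optimization: bool = False):
    # Gate the requant annotation behind a user-facing flag.
    if not skip_requant_optimization:
        annotate_requant(graph_module)  # from the sketch above
```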
Before:
*(graph screenshot)*

After:
*(graph screenshot)*
After the above change, there is an inference speed drop due to the extra requantize nodes. By switching the IO observer to `MovingAverageMinMaxObserver`, I observed an inference speed improvement for some models such as inceptionv3, likely because the more stable ranges leave fewer mismatched QDQ pairs to requantize.
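For intuition on why the observer choice matters, the two observers react very differently to an outlier calibration batch. A standalone check with the stock `torch.ao` observers (independent of this PR):

```python
import torch
from torch.ao.quantization.observer import (
    MinMaxObserver,
    MovingAverageMinMaxObserver,
)

minmax = MinMaxObserver()
moving = MovingAverageMinMaxObserver(averaging_constant=0.01)

# One outlier batch permanently widens MinMaxObserver's range, while the
# moving average only nudges its running min/max toward the outlier.
for batch in (torch.randn(64), 100 * torch.randn(64), torch.randn(64)):
    minmax(batch)
    moving(batch)

print(minmax.min_val, minmax.max_val)  # dominated by the outlier batch
print(moving.min_val, moving.max_val)  # much tighter range
```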