TFLite Converter, add possibility to ignore some OPs from quantization

### Issue type

Feature Request

### Have you reproduced the bug with TensorFlow Nightly?

No

### Source

binary

### TensorFlow version

v2.13.0-17-gf841394b1b7

### Custom code

No

### OS platform and distribution

_No response_

### Mobile device

_No response_

### Python version

3.10.13

### Bazel version

_No response_

### GCC/compiler version

_No response_

### CUDA/cuDNN version

_No response_

### GPU model and memory

_No response_

### Current behavior?

Full-integer quantization works as expected, but because some of the final operations run in INT8, some models show a large accuracy drop.
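One plausible reason for the drop, stated here as an assumption rather than a measured cause: the output tensor of a detection head can mix values with very different ranges (for example pixel-scale box coordinates and class scores in 0..1), and a single per-tensor INT8 scale cannot resolve both. The pure-Python sketch below round-trips such values through affine INT8 quantization. The sample values and the helper names `quantize_params` and `fake_quant` are hypothetical and not part of any TFLite API.

```python
def quantize_params(values, qmin=-128, qmax=127):
    """Per-tensor affine INT8 parameters (scale, zero_point) for a list of floats."""
    lo = min(min(values), 0.0)  # the range includes 0 so that 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)  # assumes the values are not all zero
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def fake_quant(values, scale, zero_point, qmin=-128, qmax=127):
    """Quantize to INT8 and dequantize again, returning the recovered floats."""
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


# Hypothetical head output: box coordinates in pixels and class scores in 0..1.
boxes = [12.0, 320.5, 640.0, 87.25]
scores = [0.02, 0.35, 0.71, 0.93]

# Case 1: one scale shared by boxes and scores (both live in the same tensor).
scale, zp = quantize_params(boxes + scores)
scores_mixed = fake_quant(scores, scale, zp)

# Case 2: the scores get a scale of their own.
s_scale, s_zp = quantize_params(scores)
scores_alone = fake_quant(scores, s_scale, s_zp)

mixed_err = max(abs(a - b) for a, b in zip(scores, scores_mixed))
alone_err = max(abs(a - b) for a, b in zip(scores, scores_alone))
print(f"shared scale: step={scale:.4f}, max score error={mixed_err:.4f}")
print(f"own scale:    step={s_scale:.6f}, max score error={alone_err:.6f}")
```

With the shared scale the quantization step is about 2.5, so every score in this example collapses to 0.0. Keeping such ops in FP32 avoids that loss, which is the motivation for the request.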

This ticket is a feature request to be able to exclude specific operations from quantization and execute them in FP32. OpenVINO supports this through the `ignored_scope` parameter during quantization. [Link to OpenVINO quantizer documentation.](https://docs.openvino.ai/2022.3/basic_qauntization_flow.html#tune-quantization-parameters) Considering how the Edge TPU works, the solution should be to set a point where quantization stops and to execute the rest of the ops in FP32 on the CPU.
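To make the requested semantics concrete, here is a minimal sketch of the "stop point" idea in plain Python. It is not an existing TFLite converter API: the function `split_at_stop_ops`, the op names, and the idea of passing stop points as a set of names are all hypothetical, and the graph is simplified to a topologically ordered list of `(op_name, input_names)` pairs.

```python
def split_at_stop_ops(ops, stop_ops):
    """Split a topologically ordered op list into an INT8 part and an FP32 part.

    An op is kept in FP32 if it is listed in stop_ops or if any of its inputs
    is produced by an op that is already FP32, so everything downstream of a
    stop point stays in floating point.
    """
    fp32 = set()
    for name, inputs in ops:
        if name in stop_ops or any(i in fp32 for i in inputs):
            fp32.add(name)
    int8_part = [name for name, _ in ops if name not in fp32]
    fp32_part = [name for name, _ in ops if name in fp32]
    return int8_part, fp32_part


# Hypothetical, heavily simplified detection graph.
graph = [
    ("backbone_conv", ["image"]),
    ("neck_conv", ["backbone_conv"]),
    ("head_box_conv", ["neck_conv"]),
    ("head_cls_conv", ["neck_conv"]),
    ("head_concat", ["head_box_conv", "head_cls_conv"]),
    ("head_transpose", ["head_concat"]),
]

int8_ops, fp32_ops = split_at_stop_ops(graph, stop_ops={"head_concat"})
print("INT8:", int8_ops)
print("FP32:", fp32_ops)
```

In a real converter, a dequantize step would presumably be needed at each edge that crosses from the INT8 part into the FP32 part; the sketch only shows which ops land on which side.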

Let's take yolov8n as an example and convert the PyTorch model to TF using onnx2tf. Let's compare the main branch with [FULL INT8](https://github.com/adamp87/ultralytics/tree/main) quantization against a dirty hack that detaches the last operations and executes them in FP32, i.e. [INT8 + FP32](https://github.com/adamp87/ultralytics/tree/tflite_detach_dirty). As a note, Edge TPU compiled models with an input larger than 192 pixels execute the head on the CPU, as some Transpose operations are too large for the TPU.

| Model (yolov8n) | mAP50 | mAP50-95 | Note | Speed on Intel CPU |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| Baseline FP32 | 52.6 | 37.4 | Main branch | N/A |
| TFLite Full INT8 | 48.8 | 32.9 | per-tensor | 162.2 ms |
| TFLite INT8 + FP32 | 50.3 | 35.2 | per-tensor | 166.0 ms |
| TFLite Full INT8 | 49.8 | 33.9 | per-channel | N/A |
| TFLite INT8 + FP32 | 51.4 | 36.3 | per-channel | N/A |
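
One way to read the table is as the share of the FP32-to-INT8 accuracy gap that the INT8 + FP32 variant wins back. The short calculation below uses only the numbers from the table; the helper name `gap_recovered` is just for illustration.

```python
def gap_recovered(baseline, full_int8, mixed):
    """Fraction of the accuracy lost by full INT8 that INT8 + FP32 recovers."""
    return (mixed - full_int8) / (baseline - full_int8)


baseline = {"mAP50": 52.6, "mAP50-95": 37.4}
results = {
    "per-tensor": {
        "full": {"mAP50": 48.8, "mAP50-95": 32.9},
        "mixed": {"mAP50": 50.3, "mAP50-95": 35.2},
    },
    "per-channel": {
        "full": {"mAP50": 49.8, "mAP50-95": 33.9},
        "mixed": {"mAP50": 51.4, "mAP50-95": 36.3},
    },
}

recovered = {}
for scheme, r in results.items():
    for metric, base in baseline.items():
        share = gap_recovered(base, r["full"][metric], r["mixed"][metric])
        recovered[(scheme, metric)] = share
        print(f"{scheme:12s} {metric:9s} gap recovered: {share:.1%}")

# Relative slowdown of INT8 + FP32 versus full INT8 (per-tensor, Intel CPU).
latency_overhead = 166.0 / 162.2 - 1
print(f"latency overhead: {latency_overhead:.1%}")
```

With these numbers, keeping the head in FP32 recovers roughly 39% to 69% of the lost accuracy, depending on metric and quantization scheme, for about 2.3% extra latency in the one configuration where speed was measured.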


### Standalone code to reproduce the issue

```shell
https://github.com/adamp87/ultralytics/blob/tflite_detach_dirty/yolo8_full_int8_nohead_test.ipynb
```


### Relevant log output

_No response_

TFLite Converter, add possibility to ignore some OPs from quantization #62923
