System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Android
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: Huawei P30 Lite
- TensorFlow installed from (source or binary): source
- TensorFlow version (use command below): 2.4.0
- Python version: -
- Bazel version (if compiling from source): 3.1.0
- GCC/Compiler version (if compiling from source): Android NDK 21.3.6528147
- CUDA/cuDNN version: -
- GPU model and memory: Mali-G51 MP4 as per gsmarena on an Android smartphone
I have been trying to run inference of a CNN model using TFLite 2.4.0 with the OpenCL GPU delegate enabled and found that the Conv2D operator may produce NaNs, Infs, and other invalid values when running on the Mali-G51 MP4 GPU if precision loss is allowed (I assume that getting NaNs is not considered a reasonable precision loss) and Conv2D padding is set to `same`. With `valid` padding the model produces valid results.
I've created a simple Conv2D-only model (simple_conv.zip, shown in the illustration below) to test via the inference_diff util:
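For reference, a model with the relevant properties can be generated in a few lines of TF/Keras. This is a hedged sketch, not necessarily the exact contents of the attached simple_conv.tflite: the filter count and input shape below are placeholders. The key properties are a 3x3 kernel, stride 1, and `same` padding, which make the node eligible for the Winograd path:

```python
import tensorflow as tf

# Sketch of a Conv2D-only model with the properties that trigger the issue:
# 3x3 kernel, stride 1, `same` padding. The filter count and input shape are
# placeholders, not necessarily those of the attached simple_conv.tflite.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=3, strides=1, padding="same",
                           input_shape=(64, 64, 3)),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("simple_conv.tflite", "wb") as f:
    f.write(tflite_model)
```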

Here are some sample outputs of the inference_diff/run_eval util obtained using the described model on the Huawei P30 Lite (Mali-G51 MP4 GPU) smartphone:
```
$ adb shell /data/local/tmp/run_eval --model_file=/data/local/tmp/simple_conv.tflite --num_runs=1 --delegate=gpu
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for GPU.
GPU delegate is created.
INFO: Initialized OpenCL-based API.
INFO: Created 1 GPU delegate kernels.
native : inference_profiler_stage.cc:77 Test interpreter has been initialized.
native : tflite_inference_stage.cc:128
native : inference_profiler_stage.cc:91 Reference interpreter (1 thread on CPU) has been initialized.
Num evaluation runs: 1
Reference run latency: avg=55112(us), std_dev=17548(us)
Test run latency: avg=11990(us), std_dev=1488(us)
OutputDiff[0]: avg_error=inf, std_dev=nan
```
After manually disabling precision loss, the model produced correct results, though much slower, as expected:
```
$ adb shell /data/local/tmp/run_eval --model_file=/data/local/tmp/simple_conv.tflite --num_runs=1 --delegate=gpu --gpu_precision_loss_allowed=false
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for GPU.
GPU delegate is created.
INFO: Initialized OpenCL-based API.
INFO: Created 1 GPU delegate kernels.
native : inference_profiler_stage.cc:77 Test interpreter has been initialized.
native : tflite_inference_stage.cc:128
native : inference_profiler_stage.cc:91 Reference interpreter (1 thread on CPU) has been initialized.
Num evaluation runs: 1
Reference run latency: avg=121364(us), std_dev=20829(us)
Test run latency: avg=28716(us), std_dev=1158(us)
OutputDiff[0]: avg_error=0.000120298, std_dev=0
```
After further investigation I found that this behavior can be fixed by commenting out the piece of code responsible for selecting the Winograd kernel as the Conv2D node implementation (i.e. so that the SelectConvolution branch is always used). After this change, the model appeared to work:
```
$ adb shell /data/local/tmp/run_eval --model_file=/data/local/tmp/simple_conv.tflite --num_runs=1 --delegate=gpu
GPU delegate is created.
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for GPU.
INFO: Initialized OpenCL-based API.
INFO: Created 1 GPU delegate kernels.
native : inference_profiler_stage.cc:77 Test interpreter has been initialized.
native : tflite_inference_stage.cc:128
native : inference_profiler_stage.cc:91 Reference interpreter (1 thread on CPU) has been initialized.
Num evaluation runs: 1
Reference run latency: avg=113876(us), std_dev=17465(us)
Test run latency: avg=30590(us), std_dev=3084(us)
OutputDiff[0]: avg_error=0.304206, std_dev=0
```
Thus I assume that the Winograd algorithm implementation in the OpenCL delegate is the root cause of the issue. To sum up, here is the list of conditions to reproduce the bug, at least on the Mali-G51 MP4 GPU:
- Create a Conv2D node that is suitable for the Winograd algorithm as per the check.
- Use `same` padding in the Conv2D node.
- Use OpenCL backend.
- Allow precision loss.
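To illustrate why the Winograd path is a plausible suspect under reduced precision: the Winograd input transform adds neighbouring input values before the element-wise multiply, so intermediates can overflow half precision even when the true convolution result is well within fp16 range. Below is a minimal 1-D numpy sketch of my own (the textbook F(2,3) transform, not the actual TFLite 4x4 kernel), showing the overflow:

```python
import numpy as np

def winograd_f23(d, g):
    """1-D Winograd F(2,3): two outputs of a 3-tap correlation over 4 inputs.
    All arithmetic stays in the dtype of d/g, mimicking an fp16 kernel."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Input transform B^T d -- note the d1 + d2 term, which can reach
    # twice the input magnitude before the element-wise multiply.
    v0, v1, v2, v3 = d0 - d2, d1 + d2, d2 - d1, d1 - d3
    # Filter transform G g.
    u0, u1, u2, u3 = g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2
    # Element-wise multiply in the transformed domain.
    m0, m1, m2, m3 = u0 * v0, u1 * v1, u2 * v2, u3 * v3
    # Output transform A^T m.
    return np.array([m0 + m1 + m2, m1 - m2 - m3])

def direct(d, g):
    """Reference direct 3-tap correlation."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    return np.array([d0*g0 + d1*g1 + d2*g2, d1*g0 + d2*g1 + d3*g2])

# Inputs well inside the fp16 range (max ~65504), small filter values.
d16 = np.array([40000, 40000, 40000, 40000], dtype=np.float16)
g16 = np.array([0.1, 0.1, 0.1], dtype=np.float16)

print(direct(d16, g16))        # finite, ~12000 per output
print(winograd_f23(d16, g16))  # d1 + d2 = 80000 overflows fp16 -> inf
print(winograd_f23(d16.astype(np.float32),
                   g16.astype(np.float32)))  # fine in fp32
```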
The same behavior was also observed when running on a Samsung Galaxy M31 (Mali-G72 MP3 GPU) and a Huawei P20 (Mali-G72 MP12 GPU). However, the default build (i.e. without disabling Winograd manually) ran successfully on a Samsung Galaxy S20+ (Mali-G77 MP11 GPU).
Please let me know if you need more details/logs/code, etc.