TFLiteGPUDelegate : FirstNLargestPartitions : It may not be a very good solution #66677
Labels
comp:lite (TF Lite related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), stat:contribution welcome (Status - Contributions welcome), TFLiteGpuDelegate (TFLite Gpu delegate issue), type:feature (Feature requests), type:performance (Performance Issue)
Issue type: Performance
Have you reproduced the bug with TensorFlow Nightly? No
Source: source
TensorFlow version: tf 2.4.1
Custom code: Yes
OS platform and distribution: Linux Ubuntu 18.04
Mobile device: Jetson Xavier NX
Python version: 3.6.9
Bazel version: 3.1.0
GCC/compiler version: 7.5.0
CUDA/cuDNN version: x
GPU model and memory: No response
Current behavior?
Suppose the input tflite model is accelerated on the GPU. If the middle of the model contains layers that are incompatible with the GPU backend (called fallback layers in TFLite), the partition_helper class uses the graph_info class to divide the model into delegatable and non-delegatable partitions. The FirstNLargestPartitions function is then applied to choose which partitions to delegate. In the FirstNLargestPartitions logic, the partitions containing the most layers are chosen first. The variable N can be set directly by the user, and the default is 1. I think there are two major problems with this GPU delegation policy.
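For illustration, here is a minimal C++ sketch of the selection policy described above, with a hypothetical Partition struct standing in for TFLite's internal classes: among the delegatable partitions, keep the N with the largest node count.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for a partition produced by the partition helper.
struct Partition {
  std::vector<int> node_indices;  // ops contained in this delegatable partition
};

// The current policy as I understand it: keep the N partitions with the most nodes.
std::vector<Partition> FirstNLargestByNodeCount(std::vector<Partition> partitions,
                                                int n) {
  std::sort(partitions.begin(), partitions.end(),
            [](const Partition& a, const Partition& b) {
              return a.node_indices.size() > b.node_indices.size();
            });
  if (static_cast<int>(partitions.size()) > n) partitions.resize(n);
  return partitions;
}
```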
First, it may not be appropriate to preferentially select the partitions that contain the most layers. In most cases such partitions will indeed have the largest amount of computation, but a partition with only a few layers can still be very expensive. For example, a partition with two convolutional layers may require more computation than a partition with 10 cheap element-wise layers such as add or mul. In such cases the FirstNLargest logic does not capture the benefit of acceleration well.
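To make the point concrete, a rough per-node cost estimate (hypothetical helper types, not part of TFLite) already distinguishes the two cases: convolutions scale with the MACs per output element, while element-wise ops cost roughly one operation per element.

```cpp
#include <cstdint>
#include <vector>

enum class OpType { kConv2D, kAdd, kMul };

// Hypothetical per-node description; the numbers come from tensor shapes.
struct NodeCost {
  OpType type;
  int64_t output_elements;   // elements in the op's output tensor
  int64_t macs_per_element;  // e.g. kernel_h * kernel_w * input_channels for conv
};

// Very rough FLOP estimate: conv scales with MACs per output element,
// element-wise ops cost about one operation per element.
int64_t EstimateFlops(const NodeCost& n) {
  switch (n.type) {
    case OpType::kConv2D: return 2 * n.output_elements * n.macs_per_element;
    case OpType::kAdd:
    case OpType::kMul:    return n.output_elements;
  }
  return 0;
}

int64_t PartitionFlops(const std::vector<NodeCost>& nodes) {
  int64_t total = 0;
  for (const auto& n : nodes) total += EstimateFlops(n);
  return total;
}
```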
Second, it is inconvenient that the user has to set the N value, i.e. the variable that controls how many partitions are selected and delegated. The user must tune and test one value at a time to find the N that accelerates the input model the most. Because of the data-exchange overhead between the CPU and the GPU, the inference time is usually fastest when N is 1, but depending on the structure of the model there are cases where it is not.
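As far as I understand the GPU delegate API, N corresponds to the max_delegated_partitions field of TfLiteGpuDelegateOptionsV2, so each candidate value requires building a new delegate and re-running the benchmark:

```cpp
#include "tensorflow/lite/delegates/gpu/delegate.h"

// Build a GPU delegate that may delegate at most `num_partitions` partitions
// (the "N" discussed above). Field name assumed from the v2 delegate options.
TfLiteDelegate* MakeGpuDelegate(int num_partitions) {
  TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
  options.max_delegated_partitions = num_partitions;
  return TfLiteGpuDelegateV2Create(&options);
}
```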
As a result, I think the existing delegation policy has room to be made more efficient by addressing the problems above. However, even the most recent version of tflite still seems to use the same policy.
Recently, I conducted an inference performance test with yolov4-tiny, delegating every possible combination of the delegatable partitions. As a result, the more computation the delegated partitions contained, the better the inference performance.
However, this trend did not always hold, because the larger the N value, the greater the CPU-GPU data-exchange overhead.
Given this overall tendency, under the same N it would be better to choose the partitions with the largest amount of computation rather than the partitions with the largest number of layers.
In addition, the best N value, i.e. how many partitions to select and delegate, may vary depending on the performance of the target hardware. I have not yet worked out a logic that frees users from tuning N and instead selects and delegates partitions automatically. It seems difficult to find the most appropriate N for the target hardware without something like a profiling step.
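For reference, the brute-force profiling I have in mind looks roughly like the sketch below, using the standard Interpreter API and the hypothetical MakeGpuDelegate helper from the earlier snippet; the warm-up and iteration counts are arbitrary.

```cpp
#include <chrono>
#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

// Brute-force search for the fastest N: rebuild the interpreter for each
// candidate, time a few invocations, keep the best.
int PickBestN(const char* model_path, int max_n) {
  int best_n = 1;
  double best_ms = 1e30;
  for (int n = 1; n <= max_n; ++n) {
    auto model = tflite::FlatBufferModel::BuildFromFile(model_path);
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);
    TfLiteDelegate* delegate = MakeGpuDelegate(n);
    interpreter->ModifyGraphWithDelegate(delegate);
    interpreter->AllocateTensors();
    interpreter->Invoke();  // warm-up
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 10; ++i) interpreter->Invoke();
    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / 10;
    if (ms < best_ms) { best_ms = ms; best_n = n; }
    interpreter.reset();  // destroy the interpreter before deleting its delegate
    TfLiteGpuDelegateV2Delete(delegate);
  }
  return best_n;
}
```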
On the other hand, obtaining an approximate amount of computation for each delegatable partition and delegating the largest partitions first would be very simple and effective compared to the current method.
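A sketch of that alternative, reusing the hypothetical Partition, NodeCost and PartitionFlops helpers from the snippets above: rank the same partitions by estimated compute instead of by node count.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Alternative policy: costs_per_partition[i] describes the nodes in partitions[i].
std::vector<Partition> FirstNLargestByCompute(
    const std::vector<Partition>& partitions,
    const std::vector<std::vector<NodeCost>>& costs_per_partition, int n) {
  std::vector<int> order(partitions.size());
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    return PartitionFlops(costs_per_partition[a]) >
           PartitionFlops(costs_per_partition[b]);
  });
  std::vector<Partition> selected;
  for (int i = 0; i < n && i < static_cast<int>(order.size()); ++i) {
    selected.push_back(partitions[order[i]]);
  }
  return selected;
}
```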
I wonder why tflite still uses the FirstNLargestPartitions logic, and whether the approach described above would be appropriate from an overall perspective.
Standalone code to reproduce the issue
Relevant log output