
TFLiteGPUDelegate : FirstNLargestPartitions : It would not be very good solution #66677

Open
easyhardhoon opened this issue Apr 30, 2024 · 4 comments
Assignees
Labels
comp:lite TF Lite related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower stat:contribution welcome Status - Contributions welcome TFLiteGpuDelegate TFLite Gpu delegate issue type:feature Feature requests type:performance Performance Issue

Comments

@easyhardhoon

Issue type

Performance

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

tf 2.4.1

Custom code

Yes

OS platform and distribution

Linux Ubuntu 18.04

Mobile device

Jetson-Xavier nx

Python version

3.6.9

Bazel version

3.1.0

GCC/compiler version

7.5.0

CUDA/cuDNN version

x

GPU model and memory

No response

Current behavior?

Suppose an input tflite model is accelerated on the GPU. If the model contains layers in the middle of the graph that the GPU backend cannot handle (called fallback layers in tflite), the partition_helper class uses the graph_info class to split the model into delegatable and non-delegatable partitions. The FirstNLargestPartitions function then selects which partitions to delegate: partitions containing the most layers are chosen first. The variable N can be set directly by the user and defaults to 1. I see two major problems with this GPU delegation policy.
First, preferentially selecting the partitions with the most layers may not be appropriate. In most cases such partitions do carry the most computation, but a partition with few layers can still be very expensive. For example, a partition with two convolutional layers may require more computation than a partition of ten cheap element-wise layers such as ADD or MUL. In such cases the FirstNLargest logic fails to capture much of the acceleration benefit.
Second, it is inconvenient that the user must set N, the number of partitions to select and delegate, by hand. To find the N that accelerates a given model most, the user has to tune and test values one by one. Because of the data-exchange overhead between CPU and GPU, inference is usually fastest at N = 1, but depending on the structure of the model this is not always the case.
As a result, I think the existing delegation policy leaves room for a more efficient design by addressing the problems above; however, even the most recent version of tflite still uses the policy as-is.
I recently ran an inference-performance test with yolov4-tiny, delegating every possible combination of delegatable partitions. The result: the more computation was delegated, the better the inference performance.
However, this trend did not always hold, because a larger N also increases the CPU-GPU data-exchange overhead.
Given this overall tendency, for the same value of N it would be better to choose the partitions with the largest amount of computation rather than those containing the largest number of layers.
In addition, the best N may vary with the performance of the target hardware. I have not yet developed logic that frees users from tuning N by selecting and delegating partitions automatically; finding the most appropriate N from the target hardware's characteristics without something like a profiling step seems difficult.
On the other hand, estimating the approximate computation of each delegatable partition and delegating the largest ones first would be very simple and effective compared with the current method.
I wonder why tflite still uses the FirstNLargestPartitions logic, and whether the approach described above is appropriate from an overall perspective.

Standalone code to reproduce the issue

// Inside the per-node loop: `reg` is the node's TfLiteRegistration,
// `tensor` is the NHWC output tensor, `i_tensor` the input tensor,
// `filter` the (square) kernel size, and `tot` accumulates total MFLOPs.
if (strcmp(GetOpName(reg), "CONV_2D") == 0) {
  // MACs = out_h * out_w * out_c * in_c * k_h * k_w
  double mac = (double)tensor->dims->data[1] * tensor->dims->data[2] *
               tensor->dims->data[3] * i_tensor->dims->data[3] *
               filter * filter;
  flops = 2 * mac / 1000000;  // 2 FLOPs per MAC, reported in MFLOPs
  tot += flops;
  printf("\033[0;31mFLOPs : %.1f\033[0m\n", flops);
}
The code above roughly calculates the amount of computation per layer.
If the observations in the current-behavior section above are valid, a new delegation policy could be developed on top of this estimate.

A program that analyzes the inference results by delegating the delegatable partitions in all possible combinations is available at the links below.
[tflite APP code] https://github.com/easyhardhoon/FBF-TF-hoon/tree/hoon/APP_DOT 
[tflite source code] https://github.com/easyhardhoon/FBF-TF

Relevant log output

Testing yolov4-tiny.tflite.
The partition helper class finds seven delegatable partitions.
The log below is the result of the delegation test on a Jetson Xavier NX, using my custom tflite application and tflite source.
It shows that the default GPU delegation policy in tflite is not appropriate.


=== Fallback node number info === : 
8 20 32 55 57 60 63 67 69 70 73 75 76 81 86 91 95 97 98 103 105 108 111 117 122 127 132 134 135 138 140 141 144 146 147 
=== Delegated_partitions info === : 
[0] : 0 1 2 3 4 5 6 7 
[1] : 9 10 11 12 13 14 15 16 17 18 19 
[2] : 21 22 23 24 25 26 27 28 29 30 31 
[3] : 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 102 
[4] : 56 59 62 66 72 78 79 80 83 84 85 88 89 90 94 104 107 110 114 115 116 119 120 121 124 125 126 131 137 143 
[5] : 58 61 64 65 68 74 82 87 92 93 96 106 109 112 113 118 123 128 129 130 133 139 145 
[6] : 71 77 99 100 101 136 142 148 149 150 151 
=== Fallback reason info === : 
ERROR: Following operations are not supported by GPU delegate:
ADD: tensor type error 
MUL: tensor type error
SPLIT: Operation is not supported.
SPLIT_V: Operation is not supported.

N = 1
[0 ]case's latency is 669ms
[1 ]case's latency is 677ms
[2 ]case's latency is 645ms
[3 ]case's latency is 402.00ms  -> Best case
[4 ]case's latency is 632ms   ---> chosen at default delegation.
[5 ]case's latency is 630ms
[6 ]case's latency is 618ms
[END]...Choose_1's average latency is 610.43ms
N = 2
[0 1 ]case's latency is 393ms
[0 2 ]case's latency is 486ms
[0 3 ]case's latency is 293ms
[0 4 ]case's latency is 534ms
[0 5 ]case's latency is 511ms
[0 6 ]case's latency is 510ms
[1 2 ]case's latency is 408ms
[1 3 ]case's latency is 289.00ms
[1 4 ]case's latency is 519ms
[1 5 ]case's latency is 520ms
[1 6 ]case's latency is 519ms
[2 3 ]case's latency is 296ms --> best case
[2 4 ]case's latency is 536ms
[2 5 ]case's latency is 550ms
[2 6 ]case's latency is 534ms
[3 4 ]case's latency is 403ms
[3 5 ]case's latency is 404ms
[3 6 ]case's latency is 408ms
[4 5 ]case's latency is 638ms  --> chosen at default delegation.
[4 6 ]case's latency is 628ms
[5 6 ]case's latency is 635ms
[END]...Choose_2's average latency is 476.86ms
N = 3
[0 1 2 ]case's latency is 280ms
[0 1 3 ]case's latency is 171.00ms
[0 1 4 ]case's latency is 389ms
[0 1 5 ]case's latency is 385ms
[0 1 6 ]case's latency is 391ms
[0 2 3 ]case's latency is 216ms
[0 2 4 ]case's latency is 470ms
[0 2 5 ]case's latency is 481ms
[0 2 6 ]case's latency is 454ms
[0 3 4 ]case's latency is 285ms
[0 3 5 ]case's latency is 287ms
[0 3 6 ]case's latency is 277ms
[0 4 5 ]case's latency is 526ms
[0 4 6 ]case's latency is 546ms
[0 5 6 ]case's latency is 517ms
[1 2 3 ]case's latency is 187ms
[1 2 4 ]case's latency is 407ms
[1 2 5 ]case's latency is 414ms
[1 2 6 ]case's latency is 408ms
[1 3 4 ]case's latency is 292ms
[1 3 5 ]case's latency is 293ms
[1 3 6 ]case's latency is 292ms
[1 4 5 ]case's latency is 529ms
[1 4 6 ]case's latency is 522ms
[1 5 6 ]case's latency is 524ms
[2 3 4 ]case's latency is 298ms
[2 3 5 ]case's latency is 301ms
[2 3 6 ]case's latency is 306ms
[2 4 5 ]case's latency is 535ms
[2 4 6 ]case's latency is 547ms
[2 5 6 ]case's latency is 549ms
[3 4 5 ]case's latency is 406ms
[3 4 6 ]case's latency is 402ms
[3 5 6 ]case's latency is 408ms
[4 5 6 ]case's latency is 640ms
[END]...Choose_3's average latency is 398.14ms
N = 4
[0 1 2 3 ]case's latency is 65.00ms
[0 1 2 4 ]case's latency is 287ms
[0 1 2 5 ]case's latency is 281ms
[0 1 2 6 ]case's latency is 277ms
[0 1 3 4 ]case's latency is 168ms
[0 1 3 5 ]case's latency is 169ms
[0 1 3 6 ]case's latency is 168ms
[0 1 4 5 ]case's latency is 390ms
[0 1 4 6 ]case's latency is 387ms
[0 1 5 6 ]case's latency is 389ms
[0 2 3 4 ]case's latency is 216ms
[0 2 3 5 ]case's latency is 232ms
[0 2 3 6 ]case's latency is 230ms
[0 2 4 5 ]case's latency is 456ms
[0 2 4 6 ]case's latency is 436ms
[0 2 5 6 ]case's latency is 480ms
[0 3 4 5 ]case's latency is 283ms
[0 3 4 6 ]case's latency is 277ms
[0 3 5 6 ]case's latency is 281ms
[0 4 5 6 ]case's latency is 534ms
[1 2 3 4 ]case's latency is 189ms
[1 2 3 5 ]case's latency is 188ms
[1 2 3 6 ]case's latency is 187ms
[1 2 4 5 ]case's latency is 410ms
[1 2 4 6 ]case's latency is 410ms
[1 2 5 6 ]case's latency is 405ms
[1 3 4 5 ]case's latency is 300ms
[1 3 4 6 ]case's latency is 292ms
[1 3 5 6 ]case's latency is 312ms
[1 4 5 6 ]case's latency is 525ms
[2 3 4 5 ]case's latency is 301ms
[2 3 4 6 ]case's latency is 313ms
[2 3 5 6 ]case's latency is 298ms
[2 4 5 6 ]case's latency is 543ms
[3 4 5 6 ]case's latency is 411ms
[END]...Choose_4's average latency is 316.86ms
N = 5
[0 1 2 3 4 ]case's latency is 66ms
[0 1 2 3 5 ]case's latency is 67ms
[0 1 2 3 6 ]case's latency is 64.00ms
[0 1 2 4 5 ]case's latency is 299ms
[0 1 2 4 6 ]case's latency is 285ms
[0 1 2 5 6 ]case's latency is 277ms
[0 1 3 4 5 ]case's latency is 178ms
[0 1 3 4 6 ]case's latency is 175ms
[0 1 3 5 6 ]case's latency is 177ms
[0 1 4 5 6 ]case's latency is 389ms
[0 2 3 4 5 ]case's latency is 193ms
[0 2 3 4 6 ]case's latency is 202ms
[0 2 3 5 6 ]case's latency is 177ms
[0 2 4 5 6 ]case's latency is 462ms
[0 3 4 5 6 ]case's latency is 286ms
[1 2 3 4 5 ]case's latency is 189ms
[1 2 3 4 6 ]case's latency is 190ms
[1 2 3 5 6 ]case's latency is 188ms
[1 2 4 5 6 ]case's latency is 412ms
[1 3 4 5 6 ]case's latency is 299ms
[2 3 4 5 6 ]case's latency is 332ms
[END]...Choose_5's average latency is 233.67ms
N = 6
[0 1 2 3 4 5 ]case's latency is 69ms
[0 1 2 3 4 6 ]case's latency is 67.00ms
[0 1 2 3 5 6 ]case's latency is 68ms
[0 1 2 4 5 6 ]case's latency is 292ms
[0 1 3 4 5 6 ]case's latency is 174ms
[0 2 3 4 5 6 ]case's latency is 225ms
[1 2 3 4 5 6 ]case's latency is 193ms
[END]...Choose_6's average latency is 155.43ms
N = 7
[0 1 2 3 4 5 6 ]case's latency is 70.00ms
[END]...Choose_7's average latency is 70.00ms
Minimum CASE's value : 64.00ms
Minimum CASE's N : 5
Minimum CASE's th : 2
Minimum CASE's combination : [0 1 2 3 6 ]
@google-ml-butler google-ml-butler bot added the type:performance Performance Issue label Apr 30, 2024
@sawantkumar sawantkumar added the comp:lite TF Lite related issues label May 14, 2024
@sawantkumar

Hi @easyhardhoon ,

There is an option in the tflite benchmarking tool to select the first and last nodes you want delegated, as you can see below.

first_delegate_node_index: int (default=0)
The index of the first node that could be delegated. Debug only. Add '--define=tflite_debug_delegate=true' in your build command line to use it.
Currently only supported by CoreML delegate.
last_delegate_node_index: int (default=INT_MAX)
The index of the last node that could be delegated. Debug only. Add '--define=tflite_debug_delegate=true' in your build command line to use it.

This is intended for debugging purposes only. However, you raise some excellent points regarding the current delegation policy and partition-selection strategy. I will get back to you if I find something related to this.

@sawantkumar sawantkumar assigned pkgoogle and unassigned pkgoogle and sawantkumar May 14, 2024
@pkgoogle pkgoogle added type:feature Feature requests TFLiteGpuDelegate TFLite Gpu delegate issue labels May 14, 2024
@pkgoogle

This seems like a reasonable request, @grantjensen, can you please take a look? Thanks.

@pkgoogle pkgoogle added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label May 14, 2024
@grantjensen
Contributor

This seems good to me. It is pretty low priority, because there is a workaround, as listed above, if need be. Moreover, in general we delegate the entire graph to the GPU because context switching is so slow. However, if you were to implement a clean solution, I would approve it.

@pkgoogle pkgoogle added the stat:contribution welcome Status - Contributions welcome label May 15, 2024
@easyhardhoon
Author

Thank you for your reply @grantjensen @pkgoogle @sawantkumar. Delegating the entire graph to the GPU would be reasonable, but in practice I think models can easily contain layers that are incompatible with the GPU. For models in that situation, there is room for improvement over the current GPU delegation policy. I will think about a more efficient policy in terms of inference latency and user convenience, and get back to you. Thank you.
