Very slow quantized tflite model #40183
I'm having the exact same issue.
Which model is it? Can you use the benchmark tool (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark) to get detailed profiling? Thanks.
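For reference, a rough sketch of how that benchmark tool is typically built and run from a TensorFlow source checkout (the model path is a placeholder; check the tool's README for the authoritative flags):

    # Build the TFLite benchmark tool from a TensorFlow source checkout.
    bazel build -c opt //tensorflow/lite/tools/benchmark:benchmark_model

    # Run it against a .tflite file; model path and thread count are placeholders.
    bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
      --graph=/path/to/model.tflite \
      --num_threads=1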
It's a semantic segmentation FPN with a ResNet101 backbone (trained using https://github.com/qubvel/segmentation_models).
Hi Chao, can you help verify whether there's a regression? Thanks.
On what hardware architecture was this benchmarking performed? A Coral board? And what compilation options did you use to build the binary?
On a PC with an i7-8650U CPU. All options were left at their defaults; I followed the instructions from the README file.
I see. I've added T.J., who can give more insights here; he is much more familiar with the x86_64 optimizations for quantized models in TFLite's underlying math library, RUY. By the way, if float models are acceptable for you on x86, you could try the new XNNPACK delegate (see https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/xnnpack#enable-xnnpack-via-bazel-build-flags-recommended), which delivers much better performance on x86. As for performance on the Coral dev board (especially with the Edge TPU inside the board), could you report that issue to the Coral repo (https://github.com/google-coral/edgetpu/issues)?
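As a rough sketch of the Bazel route that link describes, assuming the benchmark tool as the build target; the define and runtime flag should be double-checked against the linked README:

    # Compile the XNNPACK delegate in (opt-in define per the linked README).
    bazel build -c opt --define tflite_with_xnnpack=true \
      //tensorflow/lite/tools/benchmark:benchmark_model

    # Benchmark a float model with XNNPACK applied; model path is a placeholder.
    bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
      --graph=/path/to/float_model.tflite --use_xnnpack=true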
Here is a detailed report of my case. Please note that I'm benchmarking the CPU tflite models; the Edge TPU compiler output is only added for additional information.

CPU: Threadripper 1920X

My MobileNetV2 (conversion code, abridged):

    import tensorflow.compat.v1 as tf
    converter = tf.lite.TFLiteConverter.from_keras_model_file(chkp)
    ...
    with open('mobilenet_v2_1.0_224_quant.tflite', 'wb') as f:
        ...

Benchmark of my model:

    kriszfekete@datascience01:~/jupyter/tensorflow$ bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model --graph=mobilenet_v2_1.0_224_quant.tflite --num_threads=1
    Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
    Inference timings in us: Init: 1063, First inference: 630344, Warmup (avg): 630344, Inference (avg): 623294

Pretrained net from https://www.tensorflow.org/lite/guide/hosted_models (Mobilenet_V2_1.0_224_quant):

    Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
    Inference timings in us: Init: 772, First inference: 166513, Warmup (avg): 159732, Inference (avg): 161145

Edge TPU compiler output for my model (excerpt):

    Model compiled successfully in 478 ms.
    Input model: mobilenet_v2_1.0_224_quant.tflite
    Operator   Count   Status
    PAD        5       Mapped to Edge TPU

Edge TPU compiler output for MobileNet_V2 from the hosted models (excerpt):

    Model compiled successfully in 378 ms.
    Input model: hub_mobilenet_v2_1.0_224_quant.tflite
    Operator   Count   Status
    CONV_2D    36      Mapped to Edge TPU

UPDATE: My Coral Edge TPU (USB stick) arrived, and this is even weirder: my model, which ran about 5x as slow on CPU as the one from the hosted models, is actually faster on the Edge TPU. My model had a latency of about 3.7-3.8 ms.
Hi, I have a very similar issue. OS: Linux 18.04.04 LTS. I trained a ResNet-50 V2 model; it takes ~40 ms per inference on my CPU with an efficiency of 64.15%. I then converted it to TF Lite and the speed was ~89 ms, with efficiency identical. Then I converted with dynamic range quantization and it took ~548 ms per inference, with efficiency dropping to 54.35%. On Monday I converted it to full integer quantization with uint8 input/output, and it takes ~7 seconds on CPU and ~40 ms on the Edge TPU. Worth mentioning that, besides the ridiculously slow speed on CPU, the full-integer quantized model predicts the same value on CPU and Edge TPU all the time. Therefore: slow and broken. Does anyone have a clue how to solve this? Here is a link to a Google Drive folder with the original model, conversion code, converted model, and the code I use to test the TF Lite model. Many thanks! https://drive.google.com/file/d/1hNc6xCLch1T9EEqahpiT6FzIDg3P423u/view?usp=sharing
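For context, here is a minimal sketch of the full-integer post-training quantization path being discussed, using the TF2 converter API. The stand-in MobileNetV2 model, the random representative dataset, and the uint8 I/O settings are illustrative assumptions; check the TFLite quantization guide for the options supported by your TF version.

    import numpy as np
    import tensorflow as tf

    # Stand-in model; in the reports above this would be the trained ResNet/FPN.
    keras_model = tf.keras.applications.MobileNetV2(weights=None)

    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    # Calibration data so activation ranges can be estimated; random data is a
    # placeholder, real samples from the training set should be used instead.
    def representative_dataset():
        for _ in range(100):
            yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

    converter.representative_dataset = representative_dataset
    # Restrict to int8 kernels and request uint8 input/output, as in the report.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8

    tflite_model = converter.convert()
    with open('model_full_int_quant.tflite', 'wb') as f:
        f.write(tflite_model)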
Hi all, it's also worth mentioning that I've observed similar behavior (with a tiny bit of difference) when testing out a YOLOv4 model (note that unfortunately, only 1/962 ops were mapped to the Edge TPU, so we don't see much speed-up here):
A quick look into the model with
Now let's check this again on my dev board; everything is as expected:
My suggestion for everyone is to run the tflite model on an ARM platform, since that's what tflite models are optimized for. Benchmarking a tflite model against a CPU graph model is not ideal. Hope these findings are helpful!
TFLite on x86 CPUs may not have been fully optimized for quantized models. But the XNNPACK delegate, as mentioned in #40183 (comment), will deliver significant x86 performance improvements for float models.
It would be interesting to hear whether performance is significantly different if you build with this flag:
ruy is not as heavily optimized for x86 as it is for ARM, which is part of why it isn't the default yet, but it might already perform better than the default. However, ruy is only an implementation of matrix multiplication. If your model spends most of its time in other nodes, it will run into the fact that tflite's operators are implemented in NEON intrinsics, compiling on x86 thanks to a NEON->SSE intrinsics translation header. In other words, the compromise here has been minimal x86 implementation effort at the expense of x86 performance. It is to be expected that another inference engine with a more first-class x86 implementation would outperform it, as mentioned in the previous comment.
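If anyone wants to try the ruy path mentioned above, here is a sketch of opting in at build time; the define name below is my assumption of the flag being referenced earlier and should be verified against the TFLite build files:

    # Assumed opt-in define for the ruy matrix-multiplication backend on x86;
    # verify the exact name (tflite_with_ruy) against tensorflow/lite/BUILD.
    bazel build -c opt --define=tflite_with_ruy=true \
      //tensorflow/lite/tools/benchmark:benchmark_model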
I understand that; I was suggesting that users deploy their tflite models to ARM machines instead of comparing them to the graph model on x86.
I've checked the same model conversion, but using a much smaller MobileNetV2 classification model from
TFLite has a couple of built-in profilers that are available wherever you can run tflite and look at terminal output. One is enabled by passing |
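For what it's worth, the flag being referred to is presumably the benchmark tool's per-op profiling switch; a sketch under that assumption (the model path is a placeholder):

    # Assumes the flag in question is --enable_op_profiling; it prints a
    # per-operator timing breakdown to the terminal after the benchmark run.
    bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
      --graph=/path/to/model.tflite \
      --num_threads=1 \
      --enable_op_profiling=true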
So can we conclude that TFLite post-training optimizations are not meant for x86 CPUs?
Hi, Apologies that this issue has gone stale. Some additional x86 optimizations have landed (for AVX, AVX2, and AVX512) and they will soon be the default on x86, but aren't yet. For this issue, it would be good to know if the poor performance persists for you on x86 CPU. Can you please do as follows:
and then post the output to this issue. Also, please provide your exact build line for any executable you are running. Thanks!
Any plan on making this the default?
Hi, Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information for this issue may not be relevant with the current state of the code base. The TFLite team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings, with all the debugging information which could help us investigate. Please follow the release notes to stay up to date with the latest developments which are happening in the TFLite space. Thanks.
This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you. |
This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further. |
System information
Command used to run the converter or code if you’re using the Python API
If possible, please share a link to Colab/Jupyter/any notebook.
The output from the converter invocation
Also, please include a link to the saved model or GraphDef
Failure details
Before conversion, running the standard Keras model on CPU took ~300 ms per frame. After conversion it takes ~55 s.
Eventually I want to deploy the model on a Coral Dev Board. Currently, after compiling it for the Edge TPU, inference takes ~4 s using Coral.
Is it normal that it's so slow? I expect it to be at least no slower than before conversion.
Any other info / logs
Logs from the Edge TPU compiler: