Very slow quantized tflite model #40183

Closed
mieszkokl opened this issue Jun 5, 2020 · 23 comments
Labels
comp:lite TF Lite related issues stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author TF 2.2 Issues related to TF 2.2 TFLiteConverter For issues related to TFLite converter type:performance Performance Issue

Comments

@mieszkokl

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (or github SHA if from source): 2.2.0

Command used to run the converter or code if you’re using the Python API
If possible, please share a link to Colab/Jupyter/any notebook.

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.representative_dataset = representative_dataset_gen
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tf_lite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tf_lite_model)
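(`representative_dataset_gen` is referenced above but not shown; a minimal sketch of what such a generator typically looks like, assuming a hypothetical `calibration_images` list of preprocessed float32 frames matching the model's input shape:)

import numpy as np

# Hypothetical calibration set: a few hundred preprocessed frames, each shaped
# like the model input (here assumed to be 512x512 RGB, float32 in [0, 1]).
calibration_images = [np.random.rand(512, 512, 3).astype(np.float32)
                      for _ in range(100)]

def representative_dataset_gen():
    for image in calibration_images:
        # The converter expects a list of input arrays with a batch dimension.
        yield [np.expand_dims(image, axis=0)]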

The output from the converter invocation

2020-06-05 10:53:29.063149: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-06-05 10:53:29.063233: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-06-05 10:53:29.080730: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:797] Optimization results for grappler item: graph_to_optimize
2020-06-05 10:53:29.080748: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   function_optimizer: function_optimizer did nothing. time = 0.006ms.
2020-06-05 10:53:29.080752: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   function_optimizer: function_optimizer did nothing. time = 0ms.
2020-06-05 10:53:32.284115: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-06-05 10:53:32.284242: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-06-05 10:53:33.407982: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:797] Optimization results for grappler item: graph_to_optimize
2020-06-05 10:53:33.408011: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   constant_folding: Graph size after: 1092 nodes (-568), 1139 edges (-568), time = 474.12ms.
2020-06-05 10:53:33.408016: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   constant_folding: Graph size after: 1092 nodes (0), 1139 edges (0), time = 213.886ms.

Also, please include a link to the saved model or GraphDef

https://drive.google.com/file/d/1imjVvw8IqQ6tvQRYaKJi_ynxQUHBBSH_/view?usp=sharing

Failure details
Before conversion, running the standard Keras model on the CPU took ~300 ms per frame. After conversion it takes ~55 s.
Eventually I want to deploy the model on a Coral Dev Board. Currently, after compiling it for the Edge TPU, inference takes ~4 s on the Coral.

Is it normal that it's this slow? I would expect it to be at least no slower than before conversion.
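(For reference, per-frame latency of the converted model can be measured with the TFLite Python interpreter; a minimal sketch, with a hypothetical input shape that would need to match the model's actual input:)

import time
import numpy as np
import tensorflow as tf

# Hypothetical preprocessed frame; the shape must match the model's input.
frame = np.random.randint(0, 256, size=(1, 512, 512, 3), dtype=np.uint8)

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

interpreter.set_tensor(input_index, frame)
start = time.perf_counter()
interpreter.invoke()
print(f"tflite inference: {time.perf_counter() - start:.3f} s")
mask = interpreter.get_tensor(output_index)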

Any other info / logs
Logs from edge tpu compiler:

Edge TPU Compiler version 2.1.302470888
Input: model.tflite
Output: model_edgetpu.tflite

Operator                       Count      Status

ADD                            1          More than one subgraph is not supported
ADD                            71         Mapped to Edge TPU
MAX_POOL_2D                    1          Mapped to Edge TPU
PAD                            35         Mapped to Edge TPU
MUL                            35         Mapped to Edge TPU
CONCATENATION                  1          More than one subgraph is not supported
QUANTIZE                       1          Operation is otherwise supported, but not mapped due to some unspecified limitation
QUANTIZE                       3          Mapped to Edge TPU
CONV_2D                        115        Mapped to Edge TPU
CONV_2D                        4          More than one subgraph is not supported
DEQUANTIZE                     1          Operation is working on an unsupported data type
RESIZE_BILINEAR                2          Operation is otherwise supported, but not mapped due to some unspecified limitation
RESIZE_BILINEAR                6          Mapped to Edge TPU
SOFTMAX                        1          Max 16000 elements supported

@mieszkokl mieszkokl added the TFLiteConverter For issues related to TFLite converter label Jun 5, 2020
@ravikyram ravikyram added comp:lite TF Lite related issues TF 2.2 Issues related to TF 2.2 type:support Support issues labels Jun 5, 2020
@chrisai-dev

I'm having the exact same issue:
The MobileNetV2 I trained and quantized with the same settings as above runs at about 1.7 FPS.
The MobileNetV2 from https://www.tensorflow.org/lite/guide/hosted_models runs at about 7 FPS.

@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jun 5, 2020
@renjie-liu
Member

I wonder what model it is?

Can you use the benchmark tool (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark) to get detailed profiling?

Thanks

@mieszkokl
Author

It's a semantic segmentation FPN with a ResNet101 backbone (trained using https://github.com/qubvel/segmentation_models).
Output from the benchmark tool:

STARTING!
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [8]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [model.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input layer values files: []
Allow fp16 : [0]
Require full delegation : [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
Max number of delegated partitions : [0]
Use gpu : [0]
Use xnnpack : [0]
Loaded model model.tflite
The input model file size (MB): 47.4346
Initialized session in 48.019ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=15908626

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=9 first=16856867 curr=16865810 min=16767506 max=17015249 avg=1.68505e+07 std=69772

Average inference timings in us: Warmup: 1.59086e+07, Init: 48019, Inference: 1.68505e+07
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=11.707 overall=181.352

@renjie-liu
Member

Hi Chao, can you help verify whether there's a regression?

thanks

@multiverse-tf
Contributor

> It's a semantic segmentation FPN with a ResNet101 backbone (trained using https://github.com/qubvel/segmentation_models). Output from the benchmark tool: [quoted above]

On what HW architecture was this benchmarking performed? The Coral board? And what compile options did you use to build the binary?

@mieszkokl
Author

mieszkokl commented Jun 8, 2020

On a PC with an i7-8650U CPU. All options default; I followed the instructions from the readme file.

@multiverse-tf
Contributor

> On a PC with an i7-8650U CPU. All options default; I followed the instructions from the readme file.

I see. I've added T.J., who can give more insight here; he is much more familiar with the x86-64 optimization of quantized models in TFLite's underlying math library, ruy.

Btw, if you are OK with float models on x86, you could try the new XNNPACK delegate (see https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/xnnpack#enable-xnnpack-via-bazel-build-flags-recommended), which delivers much better performance on x86.

As for performance on the Coral dev board (especially with the Edge TPU inside the board), could you report this issue to the Coral repo (i.e. https://github.com/google-coral/edgetpu/issues)?

@chrisai-dev

chrisai-dev commented Jun 8, 2020

Here is a detailed report of my case. Please note that I'm benchmarking the CPU tflite models; the Edge TPU compiler output is only added for additional information.

CPU: Threadripper 1920x

### My MobileNetV2:
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')
base_model.trainable = True
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Conv2D(filters=32, kernel_size=3, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(units=2, activation='softmax')
])
model.build([None, 96, 512, 3])

...
# custom tf training loop with GradientTape
...
model.save(f"./models/model_{epoch}.hdf5")

import tensorflow.compat.v1 as tf

def representative_data_gen():
    dataset_list = tf.data.Dataset.list_files(df.Fn.values)
    for i in range(df.shape[0]):
        image = next(iter(dataset_list))
        image = tf.io.read_file(image)
        image = tf.io.decode_jpeg(image, channels=3)
        image = tf.cast(image, tf.float32) / 255.0
        image = tf.expand_dims(image, 0)
        yield [image]

converter = tf.lite.TFLiteConverter.from_keras_model_file(chkp)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
converter.representative_dataset = representative_data_gen
tflite_model = converter.convert()

with open('mobilenet_v2_1.0_224_quant.tflite', 'wb') as f:
    f.write(tflite_model)

kriszfekete@datascience01:~/jupyter/tensorflow$ bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model --graph=mobilenet_v2_1.0_224_quant.tflite --num_threads=1
STARTING!
Duplicate flags: num_threads
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [mobilenet_v2_1.0_224_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input layer values files: []
Allow fp16 : [0]
Require full delegation : [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
Enable platform-wide tracing: [0]
#threads used for CPU inference: [1]
Max number of delegated partitions : [0]
Min nodes per partition : [0]
External delegate path : []
External delegate options : []
Use gpu : [0]
Use xnnpack : [0]
Loaded model mobilenet_v2_1.0_224_quant.tflite
The input model file size (MB): 3.22496
Initialized session in 1.063ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=630344

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=621263 curr=631745 min=603622 max=655479 avg=623294 std=14346

Inference timings in us: Init: 1063, First inference: 630344, Warmup (avg): 630344, Inference (avg): 623294
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.38672 overall=5.89453



#Pretrained net from: https://www.tensorflow.org/lite/guide/hosted_models, Mobilenet_V2_1.0_224_quant
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
  --graph=hub_mobilenet_v2_1.0_224_quant.tflite \
  --num_threads=1
STARTING!
Duplicate flags: num_threads
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [hub_mobilenet_v2_1.0_224_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input layer values files: []
Allow fp16 : [0]
Require full delegation : [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
Enable platform-wide tracing: [0]
#threads used for CPU inference: [1]
Max number of delegated partitions : [0]
Min nodes per partition : [0]
External delegate path : []
External delegate options : []
Use gpu : [0]
Use xnnpack : [0]
Loaded model hub_mobilenet_v2_1.0_224_quant.tflite
The input model file size (MB): 3.57776
Initialized session in 0.772ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=4 first=166513 curr=159202 min=156239 max=166513 avg=159732 std=4064

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=158890 curr=162784 min=156544 max=167988 avg=161145 std=1948

Inference timings in us: Init: 772, First inference: 166513, Warmup (avg): 159732, Inference (avg): 161145
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=1.55859 overall=9.98047

#######################################################
Edge-TPU compiler outputs:
My model:
edgetpu_compiler mobilenet_v2_1.0_224_quant.tflite -s
Edge TPU Compiler version 2.1.302470888

Model compiled successfully in 478 ms.

Input model: mobilenet_v2_1.0_224_quant.tflite
Input size: 3.08MiB
Output model: mobilenet_v2_1.0_224_quant_edgetpu.tflite
Output size: 3.07MiB
On-chip memory used for caching model parameters: 3.33MiB
On-chip memory remaining for caching model parameters: 4.39MiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 73
Operation log: mobilenet_v2_1.0_224_quant_edgetpu.log

Operator Count Status

PAD 5 Mapped to Edge TPU
QUANTIZE 2 Mapped to Edge TPU
CONV_2D 36 Mapped to Edge TPU
DEPTHWISE_CONV_2D 17 Mapped to Edge TPU
MEAN 1 Mapped to Edge TPU
SOFTMAX 1 Mapped to Edge TPU
FULLY_CONNECTED 1 Mapped to Edge TPU
ADD 10 Mapped to Edge TPU


MobileNet_V2 from hosted models:
edgetpu_compiler hub_mobilenet_v2_1.0_224_quant.tflite -s
Edge TPU Compiler version 2.1.302470888

Model compiled successfully in 378 ms.

Input model: hub_mobilenet_v2_1.0_224_quant.tflite
Input size: 3.41MiB
Output model: hub_mobilenet_v2_1.0_224_quant_edgetpu.tflite
Output size: 3.88MiB
On-chip memory used for caching model parameters: 3.75MiB
On-chip memory remaining for caching model parameters: 3.16MiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 65
Operation log: hub_mobilenet_v2_1.0_224_quant_edgetpu.log

Operator Count Status

CONV_2D 36 Mapped to Edge TPU
DEPTHWISE_CONV_2D 17 Mapped to Edge TPU
RESHAPE 1 Mapped to Edge TPU
ADD 10 Mapped to Edge TPU
AVERAGE_POOL_2D 1 Mapped to Edge TPU

UPDATE: My Coral Edge TPU (USB accelerator) arrived. This is even weirder: my model, which ran about 5x slower on CPU than the one from the hosted models, is actually slightly faster on the Edge TPU.

My model's latency: about 3.7-3.8 ms
The hosted model: about 3.8-3.9 ms

@multiverse-tf
Contributor

> UPDATE: My Coral Edge TPU (USB accelerator) arrived. This is even weirder: my model, which ran about 5x slower on CPU than the one from the hosted models, is actually slightly faster on the Edge TPU (~3.7-3.8 ms vs. ~3.8-3.9 ms).

Acknowledged. That is plausible: the quantized execution path in TFLite on x86-64 has not been optimized as thoroughly as it has for ARM CPUs, the Edge TPU, etc.


@stefano555

Hi,

OS: Linux 18.04.4 LTS
TensorFlow: nightly 2.3.0-dev20200601
CPU: Intel Core i7-8550U

I have a very similar issue. I trained a ResNet-50 V2 model, and it takes ~40 ms per inference on my CPU with 64.15% efficiency. I then converted it to TF Lite, and the speed was ~89 ms with identical efficiency. Then, with dynamic-range quantization, it took ~548 ms per inference and efficiency dropped to 54.35%. On Monday I converted it to full-integer quantization with uint8 input/output, and it takes ~7 seconds per inference on CPU and ~40 ms on the Edge TPU. Worth mentioning that, besides the ridiculously slow speed on CPU, the full-integer quantized model predicts the same value all the time, on both CPU and Edge TPU. Therefore: slow and broken. Does anyone have a clue how to solve this? Here is a link to a Google Drive folder with the original model, the conversion code, the converted model, and the code I use to test the TF Lite model. Many thanks!

https://drive.google.com/file/d/1hNc6xCLch1T9EEqahpiT6FzIDg3P423u/view?usp=sharing
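(One common thing to double-check when a full-integer uint8 model returns the same prediction for every input is whether inputs are mapped into the uint8 domain using the model's input quantization parameters before invoking the interpreter. A minimal sketch, with a hypothetical model path and input shape:)

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="resnet50v2_quant.tflite")  # hypothetical path
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Hypothetical float input in [0, 1]; the shape must match the model's input.
image = np.random.rand(1, 224, 224, 3).astype(np.float32)

# Map float values into the uint8 domain the quantized input tensor expects.
scale, zero_point = input_details["quantization"]
quantized = np.clip(np.round(image / scale + zero_point), 0, 255).astype(np.uint8)

interpreter.set_tensor(input_details["index"], quantized)
interpreter.invoke()
output = interpreter.get_tensor(output_details["index"])
print(output)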

@Namburger

Hi all,
The model speeds up after deploying on the Edge TPU (compiling the CPU tflite model to an edgetpu tflite model) in both cases, so that part is the expected behavior. Since the compiler can only delegate from a fully quantized CPU tflite model, it can't do much about the original graph. It does seem very odd to me that the tflite model performs so much worse than the original graph model, though.

It's also worth mentioning that I've observed similar behavior (with a slight difference) when testing a YOLOv4 model (note that, unfortunately, only 1/962 ops were mapped to the Edge TPU, so we don't see much speedup here):

On my x86_64 Debian 10:
Original model: ~55 seconds on CPU
(non-quantized) tflite model: ~5 seconds
(fully quantized) tflite model: ~56 seconds
(edgetpu) tflite model: ~55 seconds

A quick look at the model with Netron shows many quantize/dequantize ops, which I suspect are causing the slowdown. Again, tflite models weren't optimized for x86_64, so I suspect that's the issue.

Now let's check this again on my dev board, where everything is as expected:

On my dev board:
Original model: unfortunately, this cannot be run on the dev board.
(non-quantized) tflite model: ~27 seconds
(fully quantized) tflite model: ~13 seconds
(edgetpu) tflite model: ~12 seconds

My suggestion for everyone is to run the tflite model on an ARM platform, since that's what tflite models are optimized for; benchmarking a tflite model against a CPU graph model is not ideal.

Hope these findings are helpful!

@multiverse-tf
Contributor

> My suggestion for everyone is to run the tflite model on an ARM platform, since that's what tflite models are optimized for; benchmarking a tflite model against a CPU graph model is not ideal. [quoted above]

TFLite on x86 CPUs may not have been fully optimized for quantized models, but the XNNPACK delegate, as mentioned in #40183 (comment), will deliver significant x86 performance improvements for float models.

@bjacob
Contributor

bjacob commented Jun 10, 2020

It would be interesting to hear whether performance is significantly different if you build with this flag:

bazel build -c opt --define=tflite_with_ruy=true

ruy is not as heavily optimized for x86 as it is for ARM, which is part of why it isn't the default yet, but it might already perform better than the default.

However, ruy is only an implementation of matrix multiplication. If your model spends most of its time in other nodes, it will run into the fact that tflite's operators are implemented with NEON intrinsics, which compile on x86 thanks to a NEON->SSE intrinsics translation header. In other words, the compromise here has been minimal x86 implementation effort at the expense of x86 performance. It is to be expected that another inference engine with a more first-class x86 implementation would outperform it, as mentioned in the previous comment.

@Namburger

Namburger commented Jun 10, 2020

> TFLite on x86 CPUs may not have been fully optimized for quantized models, but the XNNPACK delegate, as mentioned in #40183 (comment), will deliver significant x86 performance improvements for float models.

I understand that; I was suggesting that users deploy their tflite models to ARM machines instead of comparing them to the graph model on x86.

@mieszkokl
Author

I've checked the same conversion with a much smaller MobileNetV2 classification model from tf.keras.applications. I can still observe increased processing time when running the tflite model on an x86 CPU (700 ms using tflite vs. 30 ms using the Keras model), but as you said that's expected since tflite is optimized for ARM. After deploying to the Coral Dev Board, single-frame processing time is ~7 ms, which is even faster than using the CPU on my PC.
How can I check what makes my original segmentation model so slow? Is it simply too big, or does it contain operations that cannot be mapped to the Edge TPU?

@bjacob
Contributor

bjacob commented Jun 12, 2020

TFLite has a couple of built-in profilers that are available wherever you can run tflite and look at terminal output. One is enabled by passing --define=ruy_profiler=true to bazel build, or equivalently in other build systems by adding -DRUY_PROFILER to the compiler flags. If you build the benchmark_model binary with that, it will dump an ASCII "treeview" in the terminal with the percentage of time spent in each node. (Despite having "ruy" in the name, this profiler is available regardless of whether tflite_with_ruy is true.)

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jun 25, 2020
@pedroska777

So can we conclude that TFLite post-training optimizations are not intended for x86 CPUs?
Is there any inference optimization for x86 CPUs?

@talumbau
Member

Hi,

Apologies that this issue has gone stale. Some additional x86 optimizations have landed (for AVX, AVX2, and AVX512) and they will soon be the default on x86, but aren't yet. For this issue, it would be good to know if the poor performance persists for you on x86 CPU. Can you please do as follows:

  1. Please build with:
     bazel build -c opt --define=tflite_with_ruy=true -copt=-DRUY_PROFILER
  2. Please run the benchmark_model tool with --enable_op_profiling=true

and then post the output to this issue. Also, please provide your exact build line for any executable you are running. Thanks!

@dwSun

dwSun commented Mar 11, 2021

> Some additional x86 optimizations have landed (for AVX, AVX2, and AVX512) and they will soon be the default on x86, but aren't yet. [quoted above]

Any plan to make this the default?

@pjpratik
Contributor

pjpratik commented Sep 7, 2023

Hi,

Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information for this issue may not be relevant to the current state of the code base.

The TFLite team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings, with all the debugging information which could help us investigate.

Please follow the release notes to stay up to date with the latest developments which are happening in the TFLite space.

Thanks.

@pjpratik pjpratik added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Sep 7, 2023
@github-actions

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Sep 15, 2023
@github-actions

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

