Very slow quantized tflite model #40183

Closed
mieszkokl opened this issue Jun 5, 2020 · 23 comments
Labels
comp:lite TF Lite related issues stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author TF 2.2 Issues related to TF 2.2 TFLiteConverter For issues related to TFLite converter type:performance Performance Issue

Comments

@mieszkokl

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (or github SHA if from source): 2.2.0

Command used to run the converter or code if you’re using the Python API
If possible, please share a link to Colab/Jupyter/any notebook.

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.representative_dataset = representative_dataset_gen
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tf_lite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tf_lite_model)
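(`representative_dataset_gen` is referenced above but not shown; a minimal sketch of what such a generator typically looks like, assuming a hypothetical `calibration_images` list of preprocessed float32 frames matching the model's input shape:)

import numpy as np

# Hypothetical calibration set: a few hundred preprocessed frames, each shaped
# like the model input (here assumed to be 512x512 RGB, float32 in [0, 1]).
calibration_images = [np.random.rand(512, 512, 3).astype(np.float32)
                      for _ in range(100)]

def representative_dataset_gen():
    for image in calibration_images:
        # The converter expects a list of input arrays with a batch dimension.
        yield [np.expand_dims(image, axis=0)]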

The output from the converter invocation

2020-06-05 10:53:29.063149: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-06-05 10:53:29.063233: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-06-05 10:53:29.080730: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:797] Optimization results for grappler item: graph_to_optimize
2020-06-05 10:53:29.080748: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   function_optimizer: function_optimizer did nothing. time = 0.006ms.
2020-06-05 10:53:29.080752: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   function_optimizer: function_optimizer did nothing. time = 0ms.
2020-06-05 10:53:32.284115: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-06-05 10:53:32.284242: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-06-05 10:53:33.407982: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:797] Optimization results for grappler item: graph_to_optimize
2020-06-05 10:53:33.408011: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   constant_folding: Graph size after: 1092 nodes (-568), 1139 edges (-568), time = 474.12ms.
2020-06-05 10:53:33.408016: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   constant_folding: Graph size after: 1092 nodes (0), 1139 edges (0), time = 213.886ms.

Also, please include a link to the saved model or GraphDef

https://drive.google.com/file/d/1imjVvw8IqQ6tvQRYaKJi_ynxQUHBBSH_/view?usp=sharing

Failure details
Before conversion, running the standard Keras model on the CPU took ~300 ms per frame. After conversion it takes ~55 s.
Eventually I want to deploy the model on a Coral Dev Board. Currently, after compiling it for the Edge TPU, inference takes ~4 s on the Coral.

Is it normal that it's this slow? I would expect it to be at least no slower than before conversion.
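(For reference, per-frame latency of the converted model can be measured with the TFLite Python interpreter; a minimal sketch, with a hypothetical input shape that would need to match the model's actual input:)

import time
import numpy as np
import tensorflow as tf

# Hypothetical preprocessed frame; the shape must match the model's input.
frame = np.random.randint(0, 256, size=(1, 512, 512, 3), dtype=np.uint8)

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

interpreter.set_tensor(input_index, frame)
start = time.perf_counter()
interpreter.invoke()
print(f"tflite inference: {time.perf_counter() - start:.3f} s")
mask = interpreter.get_tensor(output_index)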

Any other info / logs
Logs from edge tpu compiler:

Edge TPU Compiler version 2.1.302470888
Input: model.tflite
Output: model_edgetpu.tflite

Operator                       Count      Status

ADD                            1          More than one subgraph is not supported
ADD                            71         Mapped to Edge TPU
MAX_POOL_2D                    1          Mapped to Edge TPU
PAD                            35         Mapped to Edge TPU
MUL                            35         Mapped to Edge TPU
CONCATENATION                  1          More than one subgraph is not supported
QUANTIZE                       1          Operation is otherwise supported, but not mapped due to some unspecified limitation
QUANTIZE                       3          Mapped to Edge TPU
CONV_2D                        115        Mapped to Edge TPU
CONV_2D                        4          More than one subgraph is not supported
DEQUANTIZE                     1          Operation is working on an unsupported data type
RESIZE_BILINEAR                2          Operation is otherwise supported, but not mapped due to some unspecified limitation
RESIZE_BILINEAR                6          Mapped to Edge TPU
SOFTMAX                        1          Max 16000 elements supported

@mieszkokl mieszkokl added the TFLiteConverter For issues related to TFLite converter label Jun 5, 2020
@ravikyram ravikyram added comp:lite TF Lite related issues TF 2.2 Issues related to TF 2.2 type:support Support issues labels Jun 5, 2020
@chrisai-dev

I'm having the exact same issue:
The MobileNetV2 I trained and quantized with the same settings as above runs at about 1.7 FPS.
The MobileNetV2 from https://www.tensorflow.org/lite/guide/hosted_models runs at about 7 FPS.

@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jun 5, 2020
@renjie-liu
Member

I wonder what model it is?

Can you use the benchmark tool (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark) to get detailed profiling?

Thanks

@mieszkokl
Author

It's a semantic segmentation FPN with a ResNet101 backbone (trained using https://github.com/qubvel/segmentation_models).
Output from the benchmark tool:

STARTING!
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [8]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [model.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input layer values files: []
Allow fp16 : [0]
Require full delegation : [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
Max number of delegated partitions : [0]
Use gpu : [0]
Use xnnpack : [0]
Loaded model model.tflite
The input model file size (MB): 47.4346
Initialized session in 48.019ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=15908626

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=9 first=16856867 curr=16865810 min=16767506 max=17015249 avg=1.68505e+07 std=69772

Average inference timings in us: Warmup: 1.59086e+07, Init: 48019, Inference: 1.68505e+07
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=11.707 overall=181.352

@renjie-liu
Member

Hi Chao, can you help verify whether there's a regression?

thanks

@multiverse-tf
Contributor

> It's a semantic segmentation FPN with a ResNet101 backbone (trained using https://github.com/qubvel/segmentation_models). Output from the benchmark tool: [quoted above]

On what HW architecture was this benchmarking performed? The Coral board? And what compile options did you use to build the binary?

@mieszkokl
Author

mieszkokl commented Jun 8, 2020

On a PC with an i7-8650U CPU. All options default; I followed the instructions from the readme file.

@multiverse-tf
Contributor

> On a PC with an i7-8650U CPU. All options default; I followed the instructions from the readme file.

I see. I've added T.J., who can give more insight here; he is much more familiar with the x86-64 optimization of quantized models in TFLite's underlying math library, ruy.

Btw, if you are OK with float models on x86, you could try the new XNNPACK delegate (see https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/xnnpack#enable-xnnpack-via-bazel-build-flags-recommended), which delivers much better performance on x86.

As for performance on the Coral dev board (especially with the Edge TPU inside the board), could you report this issue to the Coral repo (i.e. https://github.com/google-coral/edgetpu/issues)?

@chrisai-dev

chrisai-dev commented Jun 8, 2020

Here is a detailed report of my case. Please note that I'm benchmarking the CPU tflite models; the Edge TPU compiler output is only added for additional information.

CPU: Threadripper 1920x

### My MobileNetV2:
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')
base_model.trainable = True
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Conv2D(filters=32, kernel_size=3, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(units=2, activation='softmax')
])
model.build([None, 96, 512, 3])

...
# custom tf training loop with GradientTape
...
model.save(f"./models/model_{epoch}.hdf5")

import tensorflow.compat.v1 as tf

def representative_data_gen():
    dataset_list = tf.data.Dataset.list_files(df.Fn.values)
    for i in range(df.shape[0]):
        image = next(iter(dataset_list))
        image = tf.io.read_file(image)
        image = tf.io.decode_jpeg(image, channels=3)
        image = tf.cast(image, tf.float32) / 255.0
        image = tf.expand_dims(image, 0)
        yield [image]

converter = tf.lite.TFLiteConverter.from_keras_model_file(chkp)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
converter.representative_dataset = representative_data_gen
tflite_model = converter.convert()

with open('mobilenet_v2_1.0_224_quant.tflite', 'wb') as f:
    f.write(tflite_model)

kriszfekete@datascience01:~/jupyter/tensorflow$ bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model --graph=mobilenet_v2_1.0_224_quant.tflite --num_threads=1
STARTING!
Duplicate flags: num_threads
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [mobilenet_v2_1.0_224_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input layer values files: []
Allow fp16 : [0]
Require full delegation : [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
Enable platform-wide tracing: [0]
#threads used for CPU inference: [1]
Max number of delegated partitions : [0]
Min nodes per partition : [0]
External delegate path : []
External delegate options : []
Use gpu : [0]
Use xnnpack : [0]
Loaded model mobilenet_v2_1.0_224_quant.tflite
The input model file size (MB): 3.22496
Initialized session in 1.063ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=630344

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=621263 curr=631745 min=603622 max=655479 avg=623294 std=14346

Inference timings in us: Init: 1063, First inference: 630344, Warmup (avg): 630344, Inference (avg): 623294
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.38672 overall=5.89453



#Pretrained net from: https://www.tensorflow.org/lite/guide/hosted_models, Mobilenet_V2_1.0_224_quant
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
  --graph=hub_mobilenet_v2_1.0_224_quant.tflite \
  --num_threads=1
STARTING!
Duplicate flags: num_threads
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [hub_mobilenet_v2_1.0_224_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input layer values files: []
Allow fp16 : [0]
Require full delegation : [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
Enable platform-wide tracing: [0]
#threads used for CPU inference: [1]
Max number of delegated partitions : [0]
Min nodes per partition : [0]
External delegate path : []
External delegate options : []
Use gpu : [0]
Use xnnpack : [0]
Loaded model hub_mobilenet_v2_1.0_224_quant.tflite
The input model file size (MB): 3.57776
Initialized session in 0.772ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=4 first=166513 curr=159202 min=156239 max=166513 avg=159732 std=4064

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=158890 curr=162784 min=156544 max=167988 avg=161145 std=1948

Inference timings in us: Init: 772, First inference: 166513, Warmup (avg): 159732, Inference (avg): 161145
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=1.55859 overall=9.98047

#######################################################
Edge-TPU compiler outputs:
My model:
edgetpu_compiler mobilenet_v2_1.0_224_quant.tflite -s
Edge TPU Compiler version 2.1.302470888

Model compiled successfully in 478 ms.

Input model: mobilenet_v2_1.0_224_quant.tflite
Input size: 3.08MiB
Output model: mobilenet_v2_1.0_224_quant_edgetpu.tflite
Output size: 3.07MiB
On-chip memory used for caching model parameters: 3.33MiB
On-chip memory remaining for caching model parameters: 4.39MiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 73
Operation log: mobilenet_v2_1.0_224_quant_edgetpu.log

Operator Count Status

PAD 5 Mapped to Edge TPU
QUANTIZE 2 Mapped to Edge TPU
CONV_2D 36 Mapped to Edge TPU
DEPTHWISE_CONV_2D 17 Mapped to Edge TPU
MEAN 1 Mapped to Edge TPU
SOFTMAX 1 Mapped to Edge TPU
FULLY_CONNECTED 1 Mapped to Edge TPU
ADD 10 Mapped to Edge TPU


MobileNet_V2 from hosted models:
edgetpu_compiler hub_mobilenet_v2_1.0_224_quant.tflite -s
Edge TPU Compiler version 2.1.302470888

Model compiled successfully in 378 ms.

Input model: hub_mobilenet_v2_1.0_224_quant.tflite
Input size: 3.41MiB
Output model: hub_mobilenet_v2_1.0_224_quant_edgetpu.tflite
Output size: 3.88MiB
On-chip memory used for caching model parameters: 3.75MiB
On-chip memory remaining for caching model parameters: 3.16MiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 65
Operation log: hub_mobilenet_v2_1.0_224_quant_edgetpu.log

Operator Count Status

CONV_2D 36 Mapped to Edge TPU
DEPTHWISE_CONV_2D 17 Mapped to Edge TPU
RESHAPE 1 Mapped to Edge TPU
ADD 10 Mapped to Edge TPU
AVERAGE_POOL_2D 1 Mapped to Edge TPU

UPDATE: My Coral Edge TPU (USB accelerator) arrived. This is even weirder: my model, which ran about 5x slower on CPU than the one from the hosted models, is actually slightly faster on the Edge TPU.

My model's latency: about 3.7-3.8 ms
The hosted model: about 3.8-3.9 ms

@multiverse-tf
Contributor

> UPDATE: My Coral Edge TPU (USB accelerator) arrived. This is even weirder: my model, which ran about 5x slower on CPU than the one from the hosted models, is actually slightly faster on the Edge TPU (~3.7-3.8 ms vs. ~3.8-3.9 ms).

Acknowledged. That is plausible: the quantized execution path in TFLite on x86-64 has not been optimized as thoroughly as it has for ARM CPUs, the Edge TPU, etc.


@stefano555

Hi,

OS: Linux 18.04.4 LTS
TensorFlow: nightly 2.3.0-dev20200601
CPU: Intel Core i7-8550U

I have a very similar issue. I trained a ResNet-50 V2 model, and it takes ~40 ms per inference on my CPU with 64.15% efficiency. I then converted it to TF Lite, and the speed was ~89 ms with identical efficiency. Then, with dynamic-range quantization, it took ~548 ms per inference and efficiency dropped to 54.35%. On Monday I converted it to full-integer quantization with uint8 input/output, and it takes ~7 seconds per inference on CPU and ~40 ms on the Edge TPU. Worth mentioning that, besides the ridiculously slow speed on CPU, the full-integer quantized model predicts the same value all the time, on both CPU and Edge TPU. Therefore: slow and broken. Does anyone have a clue how to solve this? Here is a link to a Google Drive folder with the original model, the conversion code, the converted model, and the code I use to test the TF Lite model. Many thanks!

https://drive.google.com/file/d/1hNc6xCLch1T9EEqahpiT6FzIDg3P423u/view?usp=sharing
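(One common thing to double-check when a full-integer uint8 model returns the same prediction for every input is whether inputs are mapped into the uint8 domain using the model's input quantization parameters before invoking the interpreter. A minimal sketch, with a hypothetical model path and input shape:)

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="resnet50v2_quant.tflite")  # hypothetical path
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Hypothetical float input in [0, 1]; the shape must match the model's input.
image = np.random.rand(1, 224, 224, 3).astype(np.float32)

# Map float values into the uint8 domain the quantized input tensor expects.
scale, zero_point = input_details["quantization"]
quantized = np.clip(np.round(image / scale + zero_point), 0, 255).astype(np.uint8)

interpreter.set_tensor(input_details["index"], quantized)
interpreter.invoke()
output = interpreter.get_tensor(output_details["index"])
print(output)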

@Namburger

Hi all,
The model speeds up after deploying on the Edge TPU (compiling the CPU tflite model to an edgetpu tflite model) in both cases, so that part is the expected behavior. Since the compiler can only delegate from a fully quantized CPU tflite model, it can't do much about the original graph. It does seem very odd to me that the tflite model performs so much worse than the original graph model, though.

It's also worth mentioning that I've observed similar behavior (with a slight difference) when testing a YOLOv4 model (note that, unfortunately, only 1/962 ops were mapped to the Edge TPU, so we don't see much speedup here):

On my x86_64 Debian 10:
Original model: ~55 seconds on CPU
(non-quantized) tflite model: ~5 seconds
(fully quantized) tflite model: ~56 seconds
(edgetpu) tflite model: ~55 seconds

A quick look at the model with Netron shows many quantize/dequantize ops, which I suspect are causing the slowdown. Again, tflite models weren't optimized for x86_64, so I suspect that's the issue.

Now let's check this again on my dev board, where everything is as expected:

On my dev board:
Original model: unfortunately, this cannot be run on the dev board.
(non-quantized) tflite model: ~27 seconds
(fully quantized) tflite model: ~13 seconds
(edgetpu) tflite model: ~12 seconds

My suggestion for everyone is to run the tflite model on an ARM platform, since that's what tflite models are optimized for; benchmarking a tflite model against a CPU graph model is not ideal.

Hope these findings are helpful!

@multiverse-tf
Contributor

> My suggestion for everyone is to run the tflite model on an ARM platform, since that's what tflite models are optimized for; benchmarking a tflite model against a CPU graph model is not ideal. [quoted above]

TFLite on x86 CPUs may not have been fully optimized for quantized models, but the XNNPACK delegate, as mentioned in #40183 (comment), will deliver significant x86 performance improvements for float models.

@bjacob
Contributor

bjacob commented Jun 10, 2020

It would be interesting to hear whether performance is significantly different if you build with this flag:

bazel build -c opt --define=tflite_with_ruy=true

ruy is not as heavily optimized for x86 as it is for ARM, which is part of why it isn't the default yet, but it might already perform better than the default.

However, ruy is only an implementation of matrix multiplication. If your model spends most of its time in other nodes, it will run into the fact that tflite's operators are implemented with NEON intrinsics, which compile on x86 thanks to a NEON->SSE intrinsics translation header. In other words, the compromise here has been minimal x86 implementation effort at the expense of x86 performance. It is to be expected that another inference engine with a more first-class x86 implementation would outperform it, as mentioned in the previous comment.

@Namburger

Namburger commented Jun 10, 2020

> TFLite on x86 CPUs may not have been fully optimized for quantized models, but the XNNPACK delegate, as mentioned in #40183 (comment), will deliver significant x86 performance improvements for float models.

I understand that; I was suggesting that users deploy their tflite models to ARM machines instead of comparing them to the graph model on x86.

@mieszkokl
Author

I've checked the same conversion with a much smaller MobileNetV2 classification model from tf.keras.applications. I can still observe increased processing time when running the tflite model on an x86 CPU (700 ms using tflite vs. 30 ms using the Keras model), but as you said that's expected since tflite is optimized for ARM. After deploying to the Coral Dev Board, single-frame processing time is ~7 ms, which is even faster than using the CPU on my PC.
How can I check what makes my original segmentation model so slow? Is it simply too big, or does it contain operations that cannot be mapped to the Edge TPU?

@bjacob
Contributor

bjacob commented Jun 12, 2020

TFLite has a couple of built-in profilers that are available wherever you can run tflite and look at terminal output. One is enabled by passing --define=ruy_profiler=true to bazel build, or equivalently in other build systems by adding -DRUY_PROFILER to the compiler flags. If you build the benchmark_model binary with that, it will dump an ASCII "treeview" in the terminal with the percentage of time spent in each node. (Despite having "ruy" in the name, this profiler is available regardless of whether tflite_with_ruy is true.)

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jun 25, 2020
@pedroska777

So can we conclude that TFLite post-training optimizations are not intended for x86 CPUs?
Is there any inference optimization for x86 CPUs?

@talumbau
Member

Hi,

Apologies that this issue has gone stale. Some additional x86 optimizations have landed (for AVX, AVX2, and AVX512) and they will soon be the default on x86, but aren't yet. For this issue, it would be good to know if the poor performance persists for you on x86 CPU. Can you please do as follows:

  1. Please build with:
     bazel build -c opt --define=tflite_with_ruy=true -copt=-DRUY_PROFILER
  2. Please run the benchmark_model tool with --enable_op_profiling=true

and then post the output to this issue. Also, please provide your exact build line for any executable you are running. Thanks!

@dwSun

dwSun commented Mar 11, 2021

> Some additional x86 optimizations have landed (for AVX, AVX2, and AVX512) and they will soon be the default on x86, but aren't yet. [quoted above]

Any plan to make this the default?

@pjpratik
Contributor

pjpratik commented Sep 7, 2023

Hi,

Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information for this issue may not be relevant to the current state of the code base.

The TFLite team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings, with all the debugging information which could help us investigate.

Please follow the release notes to stay up to date with the latest developments which are happening in the TFLite space.

Thanks.

@pjpratik pjpratik added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Sep 7, 2023
@github-actions

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Sep 15, 2023
@github-actions

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

