Why does quantized graph inference take much more time than using the original graph? #4434

Closed
yossibiton opened this Issue Sep 18, 2016 · 21 comments


yossibiton commented Sep 18, 2016

I followed this tutorial in order to quantize my graph to 8 bits. I can't share the exact graph here, but I can say it's a simple convolutional neural network.

When I run the benchmark tool over the original and quantized networks, it's clear that the quantized network is much slower (roughly 100 ms vs. 4.5 ms per forward pass).
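
For reference, I'm timing with the stock benchmark tool that ships with TensorFlow. A typical invocation looks like the following; the graph path, layer names, and input shape here are placeholders rather than my actual model:

  bazel build -c opt tensorflow/tools/benchmark:benchmark_model
  bazel-bin/tensorflow/tools/benchmark/benchmark_model \
    --graph=/tmp/my_graph.pb \
    --input_layer="input" \
    --input_layer_shape="1,224,224,3" \
    --input_layer_type="float" \
    --output_layer="softmax"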

Slowest nodes in the original network:

time average [ms]   [%] [cdf%]  [Op]    [Name]
1.198   26.54%  26.54%  MatMul  fc10/fc10/MatMul
0.337   7.47%   34.02%  Conv2D  conv2/Conv2D
0.332   7.36%   41.37%  Conv2D  conv4/Conv2D
0.323   7.15%   48.53%  Conv2D  conv3/Conv2D
0.322   7.14%   55.66%  Conv2D  conv5/Conv2D
0.310   6.86%   62.53%  Conv2D  conv1/Conv2D
0.118   2.61%   65.13%  Conv2D  conv2_1/Conv2D
0.105   2.32%   67.45%  MaxPool pool1

Slowest nodes in the quantized network:

time average [ms]   [%] [cdf%]  [Op]    [Name]
8.289   47.67%  47.67%  QuantizedMatMul fc10/fc10/MatMul_eightbit_quantized_bias_add
5.398   5.33%   53.00%  QuantizedConv2D conv5/Conv2D_eightbit_quantized_conv
5.248   5.18%   58.18%  QuantizedConv2D conv4/Conv2D_eightbit_quantized_conv
4.981   4.92%   63.10%  QuantizedConv2D conv2/Conv2D_eightbit_quantized_conv
4.908   4.85%   67.95%  QuantizedConv2D conv3/Conv2D_eightbit_quantized_conv
3.167   3.13%   71.07%  QuantizedConv2D conv5_1/Conv2D_eightbit_quantized_conv
3.049   3.01%   74.08%  QuantizedConv2D conv4_1/Conv2D_eightbit_quantized_conv
2.973   2.94%   77.02%  QuantizedMatMul fc11/MatMul_eightbit_quantized_bias_add

What is the reason for this?
Is this the expected behavior for a quantized network?

Environment info

Operating System: Ubuntu 16.04
Installed from source, without GPU support:

  1. commit hash = 37256f4
  2. bazel version =
    Build label: 0.3.1
    Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
    Build time: Fri Jul 29 09:09:52 2016 (1469783392)
    Build timestamp: 1469783392
    Build timestamp as int: 1469783392
jmchen-g (Contributor) commented Sep 20, 2016

Is it just the first run, or is it consistently slow across many inferences?

yossibiton commented Sep 20, 2016

I'm using the benchmark tool, which runs several inferences. I also ran it several times and got the same results.

yossibiton commented Sep 20, 2016

I will list a few simple steps to reproduce the same behavior with the Inception graph (a minimal sketch of the timing loop follows the list):

  1. Run my script classify_image.py. Basically I took this script and made small changes to support a quantized graph and timing. It downloads the Inception graph file and then runs a forward pass 100 times over an image. I got an average time per forward pass of 785 ms.
  2. Quantize the graph (run from your tensorflow repository folder):
    python tensorflow/contrib/quantization/tools/quantize_graph.py --input=/tmp/imagenet/classify_image_graph_def.pb --output_node_names="softmax" --output=/tmp/imagenet/classify_image_graph_q_def.pb --mode=eightbit
  3. Replace the original graph file with the quantized one:
    mv /tmp/imagenet/classify_image_graph_def.pb /tmp/imagenet/backup_classify_image_graph_def.pb
    mv /tmp/imagenet/classify_image_graph_q_def.pb /tmp/imagenet/classify_image_graph_def.pb
  4. Run classify_image.py again (now the quantized graph will be used). This time I got an average time per forward pass of 6551 ms. That means the quantized graph is about 10x slower.
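
For completeness, here is a minimal sketch of the timing loop in my modified classify_image.py, written against the TF 1.x API; the tensor names ('softmax:0', 'DecodeJpeg/contents:0') and the sample image path come from the stock classify_image.py:

  import time
  import tensorflow as tf

  GRAPH_PATH = '/tmp/imagenet/classify_image_graph_def.pb'
  IMAGE_PATH = '/tmp/imagenet/cropped_panda.jpg'

  # Load the (possibly quantized) GraphDef and import it into the default graph.
  with tf.gfile.GFile(GRAPH_PATH, 'rb') as f:
      graph_def = tf.GraphDef()
      graph_def.ParseFromString(f.read())
  tf.import_graph_def(graph_def, name='')

  image_data = tf.gfile.GFile(IMAGE_PATH, 'rb').read()

  with tf.Session() as sess:
      softmax = sess.graph.get_tensor_by_name('softmax:0')
      # One warm-up run, then average over 100 timed forward passes.
      sess.run(softmax, {'DecodeJpeg/contents:0': image_data})
      start = time.time()
      for _ in range(100):
          sess.run(softmax, {'DecodeJpeg/contents:0': image_data})
      print('average time per forward pass: %.1f ms'
            % ((time.time() - start) / 100 * 1000))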
isabel-schwende commented Sep 28, 2016

Actually, this has been reported several times before; see #2807. I also found the same issue when I played around with 8-bit quantisation. So far it seems that the matrix multiplication at the core of the convolution is not optimised yet. It was mentioned in #1592 that they are actively working on the problem. Since 8-bit quantisation was also mentioned in their new paper on machine translation, I'd expect the optimised version to be released soon (but this is just an educated guess).

jmchen-g (Contributor) commented Sep 30, 2016

@petewarden Could you take a look? Thanks.

chenliu0831 (Contributor) commented Oct 26, 2016

+1 on this. Any progress? We're also seeing slower performance on CPU.

austingg commented Dec 13, 2016

@jmchen-g @chenliu0831 Have you run your benchmarks on mobile devices?

chenliu0831 (Contributor) commented Dec 14, 2016

@austingg Our iOS developer didn't end up benchmarking it, given those issues. Instead, we are looking into Metal to reconstruct the network with the trained weights.

austingg commented Dec 15, 2016

@chenliu0831 Great. Metal uses the GPU, but it supports only a limited set of operators.

mddrill commented May 24, 2017

I'm running into this issue as well: not only is the network 4 times slower, but it no longer works. I haven't measured the accuracy, but it's a pose estimation algorithm, so I don't need to measure the accuracy to see that it's no longer estimating the pose.

clumsydzd commented Jul 11, 2017

@mddrill Have you solved your problem? I ran into the same situation.

mddrill commented Jul 11, 2017

@clumsydzd No, I was never able to figure it out.

tensorflowbutler (Member) commented Dec 20, 2017

It has been 14 days with no activity and the awaiting tensorflower label was assigned. Please update the label and/or status accordingly.

tensorflowbutler (Member) commented Jan 4, 2018

It has been 14 days with no activity and the awaiting tensorflower label was assigned. Please update the label and/or status accordingly.

oceliktutan commented Jan 18, 2018

Hello, I have a similar problem. I cannot observe any run-time difference between my optimised graph and my quantised graph. Were you able to solve your problem? Thanks.

mddrill commented Jan 18, 2018

Nope. I was never able to do it.

tensorflowbutler (Member) commented Feb 6, 2018

Nagging Awaiting TensorFlower: It has been 14 days with no activity and the awaiting tensorflower label was assigned. Please update the label and/or status accordingly.

yifeif (Member) commented Feb 7, 2018

@petewarden @suharshs any comment? Thanks!

suharshs (Member) commented Feb 13, 2018

From #2807:

We are focusing our eight-bit efforts on TF Lite (visible at tensorflow/contrib/lite), so we aren't expecting TensorFlow's quantized performance to improve in cases where it's not currently fast. These tend to be on x86 platforms (we're concentrating on ARM performance for mobile), and for models that use ops that we don't have quantized implementations for (which is most models outside a few vision-related ones we've optimized for).

Since we're not likely to see changes in this area soon, I'm closing this as infeasible. Pull requests or other help in this area would be very welcome of course!
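
For anyone following that advice, here is a rough sketch of converting a graph with the contrib-era converter. tf.contrib.lite.toco_convert is the API from TF releases around early 2018 (it was renamed in later versions), and the tiny graph below is purely illustrative, not a real model:

  import tensorflow as tf

  # Purely illustrative stand-in graph; substitute your real model here.
  inp = tf.placeholder(tf.float32, [1, 224, 224, 3], name='input')
  out = tf.nn.relu(inp, name='output')

  with tf.Session() as sess:
      # toco_convert returns the model serialized as a TFLite flatbuffer.
      tflite_model = tf.contrib.lite.toco_convert(sess.graph_def, [inp], [out])

  with open('/tmp/converted_model.tflite', 'wb') as f:
      f.write(tflite_model)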

suharshs closed this Feb 13, 2018

csuestc commented Apr 10, 2018

@isabel-schwende Can you please tell me which machine translation paper? I want to read it. Thanks.

isabel-schwende commented Apr 10, 2018

There's a section in this paper from 2016 on how they used 8-bit quantization for machine translation: https://arxiv.org/abs/1609.08144
