
TensorFlow 1.4.0 takes more resources and is slower on GPU and CPU #14107

Closed
johnsrude opened this issue Oct 30, 2017 · 37 comments
Assignees
Labels
stat:awaiting response Status - Awaiting response from author type:bug Bug

Comments

@johnsrude

johnsrude commented Oct 30, 2017

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): https://github.com/tkuanlun350/Tensorflow-SegNet
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 7 x64
  • TensorFlow installed from (source or binary): https://pypi.python.org/pypi/tensorflow-gpu/1.4.0rc1
  • TensorFlow version (use command below): 1.4.0
  • Python version: 3.5
  • CUDA/cuDNN version: CUDA release 8.0, V8.0.60; cuDNN 6.
  • GPU model and memory: NVIDIA P4
  • Exact command to reproduce: c:\python35\python3 main.py --log_dir=./logs --image_dir={image dir} --val_dir={validation dir} --batch_size=15 --training=True

Describe the problem

Under 1.3.0 I was able to use a batch size of 15 for training. Under 1.4.0 I get Resource Exhausted errors for that batch size, so GPU resource usage has gone up. Not the right direction.

For me, here are the performance effects:

  • TensorFlow GPU 1.3.0: 9.8 images/sec for batch size: 15
  • TensorFlow GPU 1.4.0: Can't do batch size: 15. 7.8 images/sec for batch size: 12

Source code / logs

tf_bug2.txt

@bshao001

bshao001 commented Nov 4, 2017

And it is slower than release 1.3, at least for the NMT model I am using. When I trained it under 1.3, each epoch took about 600 seconds; now it takes about 700 seconds.

@jmaye

jmaye commented Nov 4, 2017

Same here on Linux: a performance drop. That said, I only tried rc1; I'll evaluate the official release.

@johnsrude johnsrude changed the title TensorFlow 1.4.0 GPU takes more GPU resources than 1.3.0 on Windows TensorFlow 1.4.0 GPU takes more resources and is slower Nov 4, 2017
@reedwm reedwm added the type:bug Bug label Nov 6, 2017
@bignamehyp
Member

Thank you very much for your feedback. It seems some op consumes more memory than before. If you have time, could you please help figure out which op in your graph uses more memory, perhaps by simplifying your code and finding the bottleneck? In the meantime, on our side we plan to add better debugging tools for GPU memory allocation, as well as memory regression tests.
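
One possible way to get per-op timing and memory numbers in TF 1.x is a full-trace run with RunMetadata plus the Chrome timeline. A minimal sketch, using a toy graph as a stand-in for the real model (this is not the code from this issue):

import tensorflow as tf
from tensorflow.python.client import timeline

# Toy graph standing in for the real model; replace with your own ops.
x = tf.random_normal([64, 1024])
w = tf.Variable(tf.random_normal([1024, 1024]))
y = tf.matmul(x, w)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y, options=run_options, run_metadata=run_metadata)

    # Write a Chrome trace that includes per-op memory; open it in chrome://tracing.
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format(show_memory=True))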

@jmaye

jmaye commented Nov 6, 2017

On my side, I have a ResNet in the same style as the examples in the official TensorFlow models repository. Thanks a lot for looking into this.

@johnsrude
Author

I have already provided the steps to reproduce the error. MNIST data works well. I've had real difficulty doing detailed performance profiling in TensorFlow, so I've stopped looking at op-by-op profiling for now. A better profiling tool would be much appreciated.

@angerson angerson added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Nov 7, 2017
@ohommos

ohommos commented Nov 15, 2017

I have the same performance drop as mentioned above (and slightly increased memory use). A batch takes 0.35 s on 1.4.0; it used to take 0.25 s on 1.3.

norman-thomas referenced this issue in norman-thomas/tensorflow-gpu-mac Nov 20, 2017: …ierra 10.13.1 with CLT 8.2 for CUDA 9 / cuDNN 7)
@eyaler

eyaler commented Nov 22, 2017

On CPU, my model's peak RAM usage is more than 300 MB higher on 1.4 compared to 1.3, which is a >30% increase.

@songgc

songgc commented Nov 29, 2017

Same here. Training a ResNet on ImageNet is about 30% slower with v1.4 compared to v1.3. With v1.4, I noticed GPU starvation (GPU usage averages only 60-70% and fluctuates a lot).

@eyaler

eyaler commented Dec 3, 2017

Should I open a separate issue for what I'm seeing on CPU, or should we change this bug to include it?

@johnsrude johnsrude changed the title TensorFlow 1.4.0 GPU takes more resources and is slower TensorFlow 1.4.0 takes more resources and is slower on GPU and CPU Dec 4, 2017
@colmantse

colmantse commented Dec 5, 2017

For me it has been a 50% increase in CPU utilization and a 50% decrease in GPU utilization. Due to the GPU starvation, performance is therefore about 50% of TF 1.3.

@yaroslavvb
Contributor

yaroslavvb commented Dec 5, 2017

@songgc btw, I've recently experimented with the scripts from "High Performance Models" and have not observed CPU starvation or speed degradation on TF 1.4, even when running on V100 GPUs, which put much more pressure on the CPU.

So the way to isolate the problem would be to see what the problematic script is doing that @tfboyd's reference scripts are not doing (i.e., are they using the probably-to-be-deprecated Queues to read data?).

Also, the official ResNet-50 performance has been stable since September (although that doesn't rule out a regression from 1.3, which was last summer). It uses a fixed image size with autotune enabled, so that's another thing to try:

https://benchmarks-dot-tensorflow-testing.appspot.com/test/tf-cnn-benchmark-resnet50
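
For reference, a minimal sketch of the two input styles being compared above, with hypothetical file names (the queue-based path would also need tf.train.start_queue_runners at run time; tf.data is available as of 1.4):

import tensorflow as tf

FILES = ['train-00000.tfrecord']  # hypothetical input files

# Old style: queue runners (likely to be deprecated eventually).
filename_queue = tf.train.string_input_producer(FILES)
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
queue_batch = tf.train.shuffle_batch([serialized], batch_size=32,
                                     capacity=1000, min_after_dequeue=100)

# New style: tf.data pipeline.
dataset = (tf.data.TFRecordDataset(FILES)
           .shuffle(1000)
           .batch(32)
           .prefetch(1))
data_batch = dataset.make_one_shot_iterator().get_next()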

@ebrevdo
Contributor

ebrevdo commented Dec 6, 2017

There is a performance regression affecting NMT decoders in TF 1.4 that we have solved in the nightlies. If you have a performance regression, can you check whether it persists with the TF nightlies?

@colmantse

Hi, after installing the TF nightlies via pip, the problem persists.

@NicholaiStaalung

NicholaiStaalung commented Dec 8, 2017

I'm facing the same resource-exhausted error when running predictions on the validation and test sets (changing the batch size doesn't affect the outcome) on TF 1.4 GPU. The same script worked flawlessly (but slowly) when I ran it on the CPU (TF 1.3). The problem occurred when I compiled TensorFlow to run on the GPU. I have one GeForce 1080 Ti 11 GB.

My conclusions so far are that either (1) the GPU can't offload memory fast enough when all free memory is in use, (2) TensorFlow stores too much information without offloading, or (3) my setup is too weak for the dataset I'm running (see details below).

Training shape: (430056, 21)
Validation shape: (119042, 21)
Test shape: (119043, 21)
Test 2 shape: (892816, 21)
Hidden layer 1 shape: (2000,)
Hidden layer 2 shape: (1000,)

Any help would be appreciated. (I could open a separate issue, but since this discussion deals with TensorFlow using too many resources, which is my second conclusion, I've put it here.)

@songgc

songgc commented Dec 9, 2017

@yaroslavvb We do use queues and Python threading for GPU feeding, which works well with v1.3.

@ppwwyyxx
Contributor

ppwwyyxx commented Dec 9, 2017

I use queues and Python threading (and also StagingArea) for feeding as well, and I haven't seen a significant speed difference in ResNet-50 training on 1.4.

@codrut3
Contributor

codrut3 commented Dec 9, 2017

The OOM issue reported by @johnsrude is likely caused by fused batch norm. The documentation incorrectly states that tf.contrib.layers.batch_norm with fused=None will use the default implementation. This is not true: it will call the newer, fused version, which is more expensive in terms of GPU resources.

@johnsrude, can you please replace batch_norm_layer in model.py with the following:

def batch_norm_layer(inputT, is_training, scope):
  return tf.cond(is_training,
                 lambda: tf.contrib.layers.batch_norm(inputT, is_training=True, center=False,
                                                      updates_collections=None, scope=scope + "_bn", fused=False),
                 lambda: tf.contrib.layers.batch_norm(inputT, is_training=False, center=False,
                                                      updates_collections=None, scope=scope + "_bn", reuse=True, fused=False))

Let me know if this solves the OOM problem.
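
For context, a rough sketch of how such a layer would typically be wired up with an is_training placeholder (the conv layer and names here are hypothetical, not the SegNet code):

import tensorflow as tf

is_training = tf.placeholder(tf.bool, shape=[], name='is_training')
images = tf.placeholder(tf.float32, [None, 64, 64, 3])

conv = tf.layers.conv2d(images, filters=32, kernel_size=3, padding='same', name='conv1')
# batch_norm_layer as defined above; creates/reuses variables under the "conv1_bn" scope.
normed = batch_norm_layer(conv, is_training, scope='conv1')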

@bshao001 @jmaye @11maxed11 @eyaler @songgc @colmantse @NicholaiStaalung
please consider posting the code that reproduces the issue; otherwise it is very hard to find the root cause.

@NicholaiStaalung

So I partially solved my problem by splitting the graph with tf.device() and running it on both the CPU and GPU. What made the most sense was running the forward and backward passes on the CPU while running the predictions on the GPU. It's a bit slower than running everything on the GPU, but way faster than CPU only. So I guess my problem was a combination of my conclusions 1 and 2, and thus not a TensorFlow issue.
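
A rough sketch of that kind of split, with a placeholder model loosely following the layer sizes listed above (the 2-class output width and the optimizer are guesses, and it assumes a GPU build):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 21])
y = tf.placeholder(tf.int32, [None])

with tf.device('/cpu:0'):
    # Forward pass, loss, and backprop pinned to the CPU.
    hidden = tf.layers.dense(x, 2000, activation=tf.nn.relu)
    hidden = tf.layers.dense(hidden, 1000, activation=tf.nn.relu)
    logits = tf.layers.dense(hidden, 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.device('/gpu:0'):
    # Prediction op placed on the GPU.
    predictions = tf.argmax(logits, axis=1)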

@jmaye

jmaye commented Dec 13, 2017

The problem seems to be solved by using TensorFlow 1.4.1. It might be related to this commit: 03ef0a0

@codrut3
Contributor

codrut3 commented Dec 14, 2017

Are you talking about the time performance issue? I can reproduce the OOM problem with TensorFlow 1.4.1 and the code above. I tracked it down to fused batch norm, which uses more memory because it internally transforms tensors from NHWC to NCHW.
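
One way to quantify that would be to compare peak device memory for the two code paths with tf.contrib.memory_stats. A sketch, assuming a GPU build; the activation shape is just a guess, and you would flip USE_FUSED and re-run to compare:

import tensorflow as tf

USE_FUSED = True  # flip to False and re-run to compare the two code paths

with tf.device('/gpu:0'):
    # Roughly SegNet-sized activations (a guess), NHWC layout.
    images = tf.random_normal([15, 360, 480, 64])
    bn = tf.contrib.layers.batch_norm(images, is_training=True,
                                      updates_collections=None, fused=USE_FUSED)
    peak = tf.contrib.memory_stats.MaxBytesInUse()  # peak allocator usage on this device

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(bn)
    print('fused=%s, peak bytes in use: %d' % (USE_FUSED, sess.run(peak)))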

@jmaye

jmaye commented Dec 14, 2017

I'm indeed talking about the time performance issue. It was far worse with TF 1.4 compared to 1.3. At the beginning it started out fine, but then the global step/sec jittered like hell. Stopping and restarting made it stable again for a while.
For information, I'm using tf.layers.batch_normalization with fused=True in my implementation without any problem.

Edit: it was a false hope; it just lasted a bit longer before the performance drop appeared :)

@eyaler

eyaler commented Dec 15, 2017

I am running the tensorpack pix2pix code on Windows 10 x64 with a 1080 Ti GPU (CUDA 8, cuDNN 6):
https://github.com/ppwwyyxx/tensorpack/blob/master/examples/GAN/Image2Image.py
On TF 1.4 I am seeing a 30% increase in runtime over 1.3.

@colmantse

Yes. This can occur if you do not have a clean build of TF for Windows 10 from source. I had the same problem: CPU utilization tripled and GPU utilization dropped to the same extent. I decided to switch to Linux and build from source (much easier), and now I achieve full GPU utilization.

@ppwwyyxx
Contributor

@eyaler On a P100 I saw no performance difference between 1.3 and 1.4, also with Image2Image.py. My TensorFlow was installed from PyPI.

eyaler referenced this issue in tensorpack/tensorpack Jan 1, 2018
@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting tensorflower label was assigned. Please update the label and/or status accordingly.

@songgc

songgc commented Jan 8, 2018

Some updates. I also had an issue similar to #14942, in which the graph contained only data pipelines reading proto data and preprocessing, rather than any training ops. I saw throughput degradation with v1.4.x.

After I moved to v1.5rc0 with CUDA 9 and cuDNN 7, all the problems went away; both the data pipelines and the training process went back to normal and are even slightly (2-3%) faster. I also noticed a reduction in RAM usage (but have no exact numbers). I have decided to skip v1.4.

@eyaler

eyaler commented Jan 8, 2018

@songgc The question is whether you shouldn't actually expect better performance, i.e. there may still be a problem, but perhaps it is masked by CUDA 9, etc.

@songgc

songgc commented Jan 8, 2018

@eyaler I tried two cases. One was the whole training process, and the other was data reading and preprocessing only. I agree that the first one might be affected by the different CUDA versions. However, the second one used the CPU only and had nothing to do with GPUs, and there I saw better throughput with v1.5.

@jmaye

jmaye commented Jan 16, 2018

On my side, 1.5 definitely solves the problem.

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jan 23, 2018
@tensorflowbutler
Member

A member of the TensorFlow organization has replied after the stat:awaiting tensorflower label was applied.

@eyaler

eyaler commented Jan 24, 2018

My memory issue on Linux/CPU is solved. However, I am now seeing 90% (!) longer training times on Windows with TF 1.4 and CUDA 8 on a 1080 Ti. The same issue occurs with TF 1.5rc1 + CUDA 9 + cuDNN 7. I would appreciate any help debugging this.

@tatatodd
Contributor

@johnsrude does switching to TF 1.5 solve your problem, as many others have reported?

@eyaler please file a new issue with details describing your problem, and a minimum reproducible test case (if possible).

@tatatodd tatatodd added the stat:awaiting response Status - Awaiting response from author label Jan 26, 2018
@tensorflowbutler
Member

Nagging Awaiting Response: It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue?


@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue?

@angerson
Contributor

angerson commented Apr 3, 2018

Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!

@angerson angerson closed this as completed Apr 3, 2018