
TensorFlow 1.4.0 takes more resources and is slower on GPU and CPU #14107

Closed
johnsrude opened this issue Oct 30, 2017 · 37 comments
Assignees
Labels
stat:awaiting response Status - Awaiting response from author type:bug Bug

Comments

@johnsrude

johnsrude commented Oct 30, 2017

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): https://github.com/tkuanlun350/Tensorflow-SegNet
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 7 x64
  • TensorFlow installed from (source or binary): https://pypi.python.org/pypi/tensorflow-gpu/1.4.0rc1
  • TensorFlow version (use command below): 1.4.0
  • Python version: 3.5
  • CUDA/cuDNN version: CUDA release 8.0, V8.0.60; cuDNN 6.
  • GPU model and memory: NVIDIA P4
  • Exact command to reproduce: c:\python35\python3 main.py --log_dir=./logs --image_dir={image dir} --val_dir={validation dir} --batch_size=15 --training=True

Describe the problem

Under 1.3.0 I was able to use a batch size of 15 for training. Under 1.4.0 I get Resource Exhausted errors for that batch size, so GPU resource usage has gone up. Not the right direction.

For me, here are the performance effects:

  • TensorFlow GPU 1.3.0: 9.8 images/sec for batch size: 15
  • TensorFlow GPU 1.4.0: Can't do batch size: 15. 7.8 images/sec for batch size: 12

Source code / logs

tf_bug2.txt

@bshao001

bshao001 commented Nov 4, 2017

And it is slower than release 1.3, at least for the NMT model I am using. When I trained it under 1.3, each epoch took about 600 seconds; now it takes about 700 seconds.

@jmaye

jmaye commented Nov 4, 2017

Same here on Linux: a performance drop. That said, I only tried rc1; I'll evaluate the official release.

@johnsrude johnsrude changed the title TensorFlow 1.4.0 GPU takes more GPU resources than 1.3.0 on Windows TensorFlow 1.4.0 GPU takes more resources and is slower Nov 4, 2017
@reedwm reedwm added the type:bug Bug label Nov 6, 2017
@bignamehyp
Member

Thank you very much for your feedback. It seems some op consumes more memory than before. If you have time, could you please help figure out which op in your graph uses more memory, perhaps by simplifying your code and finding the bottleneck? In the meantime, on our side we plan to add better debugging tools for GPU memory allocation, as well as memory regression tests.
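
One possible way to get per-op timing and memory numbers in TF 1.x is a full-trace run with RunMetadata plus the Chrome timeline. A minimal sketch, using a toy graph as a stand-in for the real model (this is not the code from this issue):

import tensorflow as tf
from tensorflow.python.client import timeline

# Toy graph standing in for the real model; replace with your own ops.
x = tf.random_normal([64, 1024])
w = tf.Variable(tf.random_normal([1024, 1024]))
y = tf.matmul(x, w)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y, options=run_options, run_metadata=run_metadata)

    # Write a Chrome trace that includes per-op memory; open it in chrome://tracing.
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format(show_memory=True))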

@jmaye

jmaye commented Nov 6, 2017

On my side, I have a ResNet in the same style as the examples in the official TensorFlow models repository. Thanks a lot for looking into this.

@johnsrude
Author

I have already provided the steps to reproduce the error. MNIST data works well. I've had real difficulty doing detailed performance profiling in TensorFlow, so I've stopped looking at op-by-op profiling for now. A better profiling tool would be much appreciated.

@angerson angerson added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Nov 7, 2017
@ohommos

ohommos commented Nov 15, 2017

I have the same performance drop as mentioned above (and slightly increased memory use). A batch takes 0.35 s on 1.4.0; it used to take 0.25 s on 1.3.

norman-thomas referenced this issue in norman-thomas/tensorflow-gpu-mac Nov 20, 2017: …ierra 10.13.1 with CLT 8.2 for CUDA 9 / cuDNN 7)
@eyaler

eyaler commented Nov 22, 2017

On CPU, my model's peak RAM usage is more than 300 MB higher on 1.4 compared to 1.3, which is a >30% increase.

@songgc

songgc commented Nov 29, 2017

Same here. Training a ResNet on ImageNet is about 30% slower with v1.4 compared to v1.3. With v1.4, I noticed GPU starvation (GPU usage averages only 60-70% and fluctuates a lot).

@eyaler

eyaler commented Dec 3, 2017

Should I open a separate issue for what I'm seeing on CPU, or should we change this bug to include it?

@johnsrude johnsrude changed the title TensorFlow 1.4.0 GPU takes more resources and is slower TensorFlow 1.4.0 takes more resources and is slower on GPU and CPU Dec 4, 2017
@colmantse

colmantse commented Dec 5, 2017

For me it has been a 50% increase in CPU utilization and a 50% decrease in GPU utilization. Due to the GPU starvation, performance is therefore about 50% of TF 1.3.

@yaroslavvb
Contributor

yaroslavvb commented Dec 5, 2017

@songgc btw, I've recently experimented with the scripts from "High Performance Models" and have not observed CPU starvation or speed degradation on TF 1.4, even when running on V100 GPUs, which put much more pressure on the CPU.

So the way to isolate the problem would be to see what the problematic script is doing that @tfboyd's reference scripts are not doing (i.e., are they using the probably-to-be-deprecated Queues to read data?).

Also, the official ResNet-50 performance has been stable since September (although that doesn't rule out a regression from 1.3, which was last summer). It uses a fixed image size with autotune enabled, so that's another thing to try:

https://benchmarks-dot-tensorflow-testing.appspot.com/test/tf-cnn-benchmark-resnet50
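
For reference, a minimal sketch of the two input styles being compared above, with hypothetical file names (the queue-based path would also need tf.train.start_queue_runners at run time; tf.data is available as of 1.4):

import tensorflow as tf

FILES = ['train-00000.tfrecord']  # hypothetical input files

# Old style: queue runners (likely to be deprecated eventually).
filename_queue = tf.train.string_input_producer(FILES)
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
queue_batch = tf.train.shuffle_batch([serialized], batch_size=32,
                                     capacity=1000, min_after_dequeue=100)

# New style: tf.data pipeline.
dataset = (tf.data.TFRecordDataset(FILES)
           .shuffle(1000)
           .batch(32)
           .prefetch(1))
data_batch = dataset.make_one_shot_iterator().get_next()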

@ebrevdo
Contributor

ebrevdo commented Dec 6, 2017

There is a performance regression affecting NMT decoders in TF 1.4 that we have solved in the nightlies. If you have a performance regression, can you check whether it persists with the TF nightlies?

@colmantse

Hi, after installing the TF nightlies via pip, the problem persists.

@NicholaiStaalung

NicholaiStaalung commented Dec 8, 2017

I'm facing the same resource-exhausted error when running predictions on the validation and test sets (changing the batch size doesn't affect the outcome) on TF 1.4 GPU. The same script worked flawlessly (but slowly) when I ran it on the CPU (TF 1.3). The problem occurred when I compiled TensorFlow to run on the GPU. I have one GeForce 1080 Ti 11 GB.

My conclusions so far are that either (1) the GPU can't offload memory fast enough when all free memory is in use, (2) TensorFlow stores too much information without offloading, or (3) my setup is too weak for the dataset I'm running (see details below).

Training shape: (430056, 21)
Validation shape: (119042, 21)
Test shape: (119043, 21)
Test 2 shape: (892816, 21)
Hidden layer 1 shape: (2000,)
Hidden layer 2 shape: (1000,)

Any help would be appreciated. (I could open a separate issue, but since this discussion deals with TensorFlow using too many resources, which is my second conclusion, I've put it here.)

@songgc

songgc commented Dec 9, 2017

@yaroslavvb We do use queues and Python threading for GPU feeding, which works well with v1.3.

@ppwwyyxx
Contributor

ppwwyyxx commented Dec 9, 2017

I use queues and Python threading (and also StagingArea) for feeding as well, and I haven't seen a significant speed difference in ResNet-50 training on 1.4.

@codrut3
Contributor

codrut3 commented Dec 9, 2017

The OOM issue reported by @johnsrude is likely caused by fused batch norm. The documentation incorrectly states that tf.contrib.layers.batch_norm with fused=None will use the default implementation. This is not true: it will call the newer, fused version, which is more expensive in terms of GPU resources.

@johnsrude, can you please replace batch_norm_layer in model.py with the following:

def batch_norm_layer(inputT, is_training, scope):
  return tf.cond(is_training,
                 lambda: tf.contrib.layers.batch_norm(inputT, is_training=True, center=False,
                                                      updates_collections=None, scope=scope + "_bn", fused=False),
                 lambda: tf.contrib.layers.batch_norm(inputT, is_training=False, center=False,
                                                      updates_collections=None, scope=scope + "_bn", reuse=True, fused=False))

Let me know if this solves the OOM problem.
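
For context, a rough sketch of how such a layer would typically be wired up with an is_training placeholder (the conv layer and names here are hypothetical, not the SegNet code):

import tensorflow as tf

is_training = tf.placeholder(tf.bool, shape=[], name='is_training')
images = tf.placeholder(tf.float32, [None, 64, 64, 3])

conv = tf.layers.conv2d(images, filters=32, kernel_size=3, padding='same', name='conv1')
# batch_norm_layer as defined above; creates/reuses variables under the "conv1_bn" scope.
normed = batch_norm_layer(conv, is_training, scope='conv1')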

@bshao001 @jmaye @11maxed11 @eyaler @songgc @colmantse @NicholaiStaalung
please consider posting the code that reproduces the issue; otherwise it is very hard to find the root cause.

@NicholaiStaalung

So I partially solved my problem by splitting the graph with tf.device() and running it on both the CPU and GPU. What made the most sense was running the forward and backward passes on the CPU while running the predictions on the GPU. It's a bit slower than running everything on the GPU, but way faster than CPU only. So I guess my problem was a combination of my conclusions 1 and 2, and thus not a TensorFlow issue.
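
A rough sketch of that kind of split, with a placeholder model loosely following the layer sizes listed above (the 2-class output width and the optimizer are guesses, and it assumes a GPU build):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 21])
y = tf.placeholder(tf.int32, [None])

with tf.device('/cpu:0'):
    # Forward pass, loss, and backprop pinned to the CPU.
    hidden = tf.layers.dense(x, 2000, activation=tf.nn.relu)
    hidden = tf.layers.dense(hidden, 1000, activation=tf.nn.relu)
    logits = tf.layers.dense(hidden, 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.device('/gpu:0'):
    # Prediction op placed on the GPU.
    predictions = tf.argmax(logits, axis=1)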

@jmaye

jmaye commented Dec 13, 2017

The problem seems to be solved by using TensorFlow 1.4.1. It might be related to this commit: 03ef0a0

@codrut3
Contributor

codrut3 commented Dec 14, 2017

Are you talking about the time performance issue? I can reproduce the OOM problem with TensorFlow 1.4.1 and the code above. I tracked it down to fused batch norm, which uses more memory because it internally transforms tensors from NHWC to NCHW.
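
One way to quantify that would be to compare peak device memory for the two code paths with tf.contrib.memory_stats. A sketch, assuming a GPU build; the activation shape is just a guess, and you would flip USE_FUSED and re-run to compare:

import tensorflow as tf

USE_FUSED = True  # flip to False and re-run to compare the two code paths

with tf.device('/gpu:0'):
    # Roughly SegNet-sized activations (a guess), NHWC layout.
    images = tf.random_normal([15, 360, 480, 64])
    bn = tf.contrib.layers.batch_norm(images, is_training=True,
                                      updates_collections=None, fused=USE_FUSED)
    peak = tf.contrib.memory_stats.MaxBytesInUse()  # peak allocator usage on this device

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(bn)
    print('fused=%s, peak bytes in use: %d' % (USE_FUSED, sess.run(peak)))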

@jmaye

jmaye commented Dec 14, 2017

I'm indeed talking about the time performance issue. It was far worse with TF 1.4 compared to 1.3. At the beginning it started out fine, but then the global step/sec jittered like hell. Stopping and restarting made it stable again for a while.
For information, I'm using tf.layers.batch_normalization with fused=True in my implementation without any problem.

Edit: it was a false hope; it just lasted a bit longer before the performance drop appeared :)

@eyaler

eyaler commented Dec 15, 2017

I am running the tensorpack pix2pix code on Windows 10 x64 with a 1080 Ti GPU (CUDA 8, cuDNN 6):
https://github.com/ppwwyyxx/tensorpack/blob/master/examples/GAN/Image2Image.py
On TF 1.4 I am seeing a 30% increase in runtime over 1.3.

@colmantse

Yes. This can occur if you do not have a clean build of TF for Windows 10 from source. I had the same problem: CPU utilization tripled and GPU utilization dropped to the same extent. I decided to switch to Linux and build from source (much easier), and now I achieve full GPU utilization.

@ppwwyyxx
Contributor

@eyaler On a P100 I saw no performance difference between 1.3 and 1.4, also with Image2Image.py. My TensorFlow was installed from PyPI.

eyaler referenced this issue in tensorpack/tensorpack Jan 1, 2018
@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting tensorflower label was assigned. Please update the label and/or status accordingly.

@songgc

songgc commented Jan 8, 2018

Some updates. I also had an issue similar to #14942, in which the graph contained only data pipelines reading proto data and preprocessing, rather than any training ops. I saw throughput degradation with v1.4.x.

After I moved to v1.5rc0 with CUDA 9 and cuDNN 7, all the problems went away; both the data pipelines and the training process went back to normal and are even slightly (2-3%) faster. I also noticed a reduction in RAM usage (but have no exact numbers). I have decided to skip v1.4.

@eyaler

eyaler commented Jan 8, 2018

@songgc The question is whether you shouldn't actually expect better performance, i.e. there may still be a problem, but perhaps it is masked by CUDA 9, etc.

@songgc

songgc commented Jan 8, 2018

@eyaler I tried two cases. One was the whole training process, and the other was data reading and preprocessing only. I agree that the first one might be affected by the different CUDA versions. However, the second one used the CPU only and had nothing to do with GPUs, and there I saw better throughput with v1.5.

@jmaye

jmaye commented Jan 16, 2018

On my side, 1.5 definitely solves the problem.

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jan 23, 2018
@tensorflowbutler
Member

A member of the TensorFlow organization has replied after the stat:awaiting tensorflower label was applied.

@eyaler

eyaler commented Jan 24, 2018

My memory issue on Linux/CPU is solved. However, I am now seeing 90% (!) longer training times on Windows with TF 1.4 and CUDA 8 on a 1080 Ti. The same issue occurs with TF 1.5rc1 + CUDA 9 + cuDNN 7. I would appreciate any help debugging this.

@tatatodd
Contributor

@johnsrude does switching to TF 1.5 solve your problem, as many others have reported?

@eyaler please file a new issue with details describing your problem, and a minimum reproducible test case (if possible).

@tatatodd tatatodd added the stat:awaiting response Status - Awaiting response from author label Jan 26, 2018
@tensorflowbutler
Member

Nagging Awaiting Response: It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue?


@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue?

@angerson
Contributor

angerson commented Apr 3, 2018

Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!

@angerson angerson closed this as completed Apr 3, 2018