TensorFlow 1.4.0 takes more resources and is slower on GPU and CPU #14107
Comments
It is also slower than release 1.3, at least for the NMT model I am using. When I trained it on 1.3, each epoch took about 600 seconds; now it takes about 700 seconds.
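Per-epoch numbers like these are easiest to compare across versions when they are measured the same way. A minimal timing helper in pure Python (independent of TensorFlow; the timed block below is just a stand-in for one training epoch):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record wall-clock seconds for the enclosed block under `label`."""
    start = time.perf_counter()
    yield
    results[label] = time.perf_counter() - start

timings = {}
with timed("epoch", timings):
    sum(range(1000))   # stand-in for one training epoch

print(f"epoch took {timings['epoch']:.4f} s")
```

Using `time.perf_counter` rather than `time.time` avoids clock adjustments skewing the comparison between runs.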
Same here on Linux: a performance drop. I only tried rc1, though; I'll evaluate the official release.
Thank you very much for your feedback. It seems some op consumes more memory than before. If you have time, can you please help figure out which op in your graph uses more memory, perhaps by simplifying your code and finding the bottleneck? On our side, we plan to add better debugging tools for GPU memory allocation and to add memory regression tests.
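One generic way to "simplify your code and find the bottleneck" is to bisect over model components, rebuilding and measuring only a prefix of the graph each time. A framework-agnostic sketch (the component names and the `is_regressed` predicate are hypothetical; in practice the predicate would build that subset of the graph and check its memory use):

```python
def bisect_regression(components, is_regressed):
    """Binary-search for the first component whose inclusion triggers
    the regression. Assumes the predicate is monotonic: once the
    culprit is included, every larger prefix also shows the problem."""
    lo, hi = 0, len(components)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_regressed(components[:mid + 1]):
            hi = mid          # culprit is within the first mid+1 components
        else:
            lo = mid + 1      # regression requires something beyond mid
    return components[lo] if lo < len(components) else None

# Toy usage: pretend the fused batch-norm op is the one using extra memory.
ops = ["conv1", "conv2", "fused_bn", "dense"]
culprit = bisect_regression(ops, lambda subset: "fused_bn" in subset)
print(culprit)  # fused_bn
```

This takes O(log n) rebuild-and-measure runs instead of one per op, which matters when each measurement is a full training step.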
On my side, I have a ResNet in the same style as the examples in the official tensorflow/models repository. Thanks a lot for looking into this.
I have already provided the steps to reproduce the error. MNIST data works fine. I have real trouble with detailed performance profiling in TensorFlow, so I've stopped looking at op-by-op profiling for now. A better profiling tool would be much appreciated.
I see the same performance drop as mentioned above (and slightly increased memory use). A batch takes 0.35 s on 1.4.0; it used to take 0.25 s on 1.3.
(macOS High Sierra 10.13.1 with CLT 8.2 for CUDA 9 / cuDNN 7)
On CPU, my model's peak RAM usage is more than 300 MB higher on 1.4 than on 1.3, which is a >30% increase.
Same here. Training a ResNet on ImageNet becomes about 30% slower with v1.4 compared to v1.3. With v1.4, I noticed GPU starvation (GPU usage averages only 60-70% and fluctuates a lot).
Should I open a separate issue for what I'm seeing on CPU, or should we broaden this bug to include it?
For me it has been a 50% increase in CPU utilization and a 50% decrease in GPU utilization. Due to the GPU starvation, performance is therefore about half that of TF 1.3.
@songgc BTW, I've recently experimented with the scripts from "High Performance Models" and have not observed CPU starvation or speed degradation on TF 1.4, even when running on V100 GPUs, which put much more pressure on the CPU. So one way to isolate the problem would be to see what the problematic script is doing that @tfboyd's reference scripts are not (e.g., are they using the probably-soon-to-be-deprecated queues to read data?). Also, official ResNet-50 performance has been stable since September (although that doesn't rule out a degradation since 1.3, which was last summer). It uses a fixed image size with autotune enabled, so that's another thing to try: https://benchmarks-dot-tensorflow-testing.appspot.com/test/tf-cnn-benchmark-resnet50
There is a performance regression affecting NMT decoders in TF 1.4 that we have fixed in the nightlies. If you see a performance regression, can you check whether it persists with the TF nightlies?
Hi, after installing the TF nightlies via pip, the problem persists.
I'm facing the same resource-exhausted error when running predictions on the validation and test sets (changing the batch size doesn't affect the outcome) in TF 1.4 GPU. The same script worked flawlessly (but slowly) when I ran it on the CPU (TF 1.3). The problem occurred when I compiled TensorFlow to run on the GPU. I have one GeForce 1080 Ti 11 GB. My conclusions so far are that either (1) the GPU can't offload memory fast enough when all free memory is in use, (2) TensorFlow stores too much information without offloading, or (3) my setup is too weak for the dataset I'm running (see details below). Any help would be appreciated. (I could open a separate issue, but as this discussion deals with TensorFlow using too many resources, which is my second conclusion, I've put it here.)
@yaroslavvb We do use queues and Python threading for GPU feeding, which works well with v1.3.
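The queue-plus-threading feeding pattern mentioned here can be sketched with the standard library alone: a producer thread fills a bounded prefetch buffer while the consumer drains it. The batch source below (`range(10)`) is a stand-in for real data loading and preprocessing:

```python
import queue
import threading

def producer(batches, q):
    """Load/preprocess batches on a CPU thread and hand them to the consumer."""
    for b in batches:
        q.put(b)           # blocks when the prefetch buffer is full (backpressure)
    q.put(None)            # sentinel: no more data

prefetch = queue.Queue(maxsize=4)  # small prefetch buffer
threading.Thread(target=producer,
                 args=(range(10), prefetch), daemon=True).start()

consumed = []
while (item := prefetch.get()) is not None:
    consumed.append(item)          # stand-in for the training step consuming a batch
```

The bounded `maxsize` is what keeps the producer from running arbitrarily far ahead; if the consumer stalls, the producer blocks rather than growing memory without limit.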
I used queues and Python threading (and also StagingArea) for feeding as well, and I haven't seen a significant speed difference in ResNet-50 training on 1.4.
The OOM issue reported by @johnsrude is likely caused by fused batch norm. The documentation incorrectly states that […]. @johnsrude, can you please replace […]
Let me know if this solves the OOM problem. @bshao001 @jmaye @11maxed11 @eyaler @songgc @colmantse @NicholaiStaalung
So I partially solved my problem by splitting the graph and running it on both the CPU and GPU with tf.device(). What made most sense was running the forward and backward passes on the CPU while running predictions on the GPU. It's a bit slower than running everything on the GPU, but way faster than CPU only. So I guess my problem was a combination of my conclusions (1) and (2), and thus not a TensorFlow issue.
The problem seems to be solved by using TensorFlow 1.4.1. It might be related to this commit: 03ef0a0
Are you talking about the time performance issue? I can still reproduce the OOM problem with TensorFlow 1.4.1 and the code above. I tracked it down to fused batch norm, which uses more memory because it internally transposes tensors from NHWC to NCHW.
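To put a rough number on that internal NHWC-to-NCHW copy: the transposed tensor has exactly as many elements as the original, so each affected layer can need one extra activation-sized buffer. A back-of-the-envelope estimate (the shape below is a hypothetical ResNet-style activation, not taken from this issue):

```python
def tensor_bytes(shape, dtype_bytes=4):
    """Bytes needed for one dense tensor (float32 by default)."""
    n = 1
    for d in shape:
        n *= d
    return n * dtype_bytes

# Hypothetical activation: batch 32, 56x56 feature map, 256 channels.
nhwc = (32, 56, 56, 256)
extra = tensor_bytes(nhwc)  # the NCHW copy has the same element count
print(extra / 2**20, "MiB per affected layer")  # ~98 MiB
```

Summed over the dozens of batch-norm layers in a deep network, transient copies of this size are enough to push a batch size that fit in 1.3 over the memory limit.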
I'm talking about the time performance issue indeed. It was far worse with TF 1.4 than with 1.3. At the beginning a run would start fine, but then the global step/sec would jitter like hell. Stopping and restarting made it stable again for a while. Edit: that was a false hope; it just lasted a bit longer before the performance drop reappeared. :)
I am running the code for tensorpack/pix2pix on Windows 10 64-bit with a GPU 1080 Ti (CUDA 8, cuDNN 6):
Yes. This can occur if you do not have a clean build of TF for Win10 from source. I had the same problem, with CPU utilization tripling and GPU utilization dropping by a similar amount. I decided to switch to Linux and build from source (much easier) and now achieve full GPU utilization.
@eyaler On a P100 I saw no performance difference between 1.3 and 1.4, also with Image2Image.py. My TensorFlow was installed from PyPI.
Some updates. I also had an issue similar to #14942, in which the graph contained only data pipelines reading proto data and preprocessing, without any training ops. I saw throughput degradation with v1.4.x. After I moved to v1.5rc0 with CUDA 9 and cuDNN 7, all the problems went away; both the data pipelines and the training process went back to normal and are even slightly faster (by 2-3%). I also noticed a reduction in RAM usage (but have no exact number). I have decided to skip v1.4.
@songgc The question is whether you shouldn't actually expect even better performance; i.e., there may still be a problem, but perhaps it is masked by the gains from CUDA 9, etc.
@eyaler I tried two cases. One was the whole training process; the other was data reading and preprocessing only. I agree that the first might be affected by the different CUDA versions. However, the second used the CPU only and had nothing to do with GPUs, and there I saw better throughput with v1.5.
On my side, 1.5 definitely solves the problem.
My memory issue on Linux/CPU is solved. However, I am now seeing +90% (!) longer training times on Windows with TF 1.4 and CUDA 8 on a 1080 Ti. Same issue with TF 1.5rc1 + CUDA 9 + cuDNN 7. I would appreciate any help debugging this.
@johnsrude Does switching to TF 1.5 solve your problem, as many others have reported? @eyaler Please file a new issue with details describing your problem and a minimal reproducible test case (if possible).
Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!
System information
c:\python35\python3 main.py --log_dir=./logs --image_dir={image dir} --val_dir= {validation dir} --batch_size=15 --training=True
Describe the problem
Under 1.3.0 I was able to use a batch size of {15, put your max batch size here} for training. Under 1.4.0 I get Resource Exhausted errors for that batch size. So GPU resource usage is going up; not the right direction.
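A back-of-the-envelope check for why a batch size that fit under 1.3 can OOM under 1.4: activation memory scales roughly linearly with batch size, so any new fixed overhead eats directly into the largest feasible batch. A rough model (all numbers hypothetical, not measured from this setup):

```python
def max_batch(gpu_bytes, per_sample_bytes, overhead_bytes):
    """Largest batch size fitting in memory after subtracting fixed overhead."""
    return (gpu_bytes - overhead_bytes) // per_sample_bytes

gpu = 11 * 2**30          # e.g. an 11 GB card
per_sample = 700 * 2**20  # hypothetical activations + gradients per sample

print(max_batch(gpu, per_sample, overhead_bytes=0))          # 16
print(max_batch(gpu, per_sample, overhead_bytes=2 * 2**30))  # 13
```

Under these made-up numbers, an extra 2 GB of fixed overhead (e.g. internal tensor copies) drops the maximum batch from 16 to 13, which is consistent with a previously working batch size suddenly exhausting resources.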
For me here are the performance effects:
Source code / logs
tf_bug2.txt