
Zero volatile GPU-Util but high GPU Memory Usage #543

Closed
lglhuada opened this issue Dec 18, 2015 · 13 comments

@lglhuada

Hi, I am running a model implemented in TensorFlow on a single GPU; the GPU memory usage is 95% while the volatile GPU-Util is 0.

Specifically, I have a Tesla K40m with CUDA 7.0 and cuDNN 6.5 v2 installed on CentOS 7.0. My project has three files: data_loader.py, model.py, and train.py. In train.py I first declare "with tf.device('/gpu:0'):" and then call sess.run([train_op]). When I run the code, this error is raised:

"tensorflow/core/common_runtime/gpu/gpu_init.cc:45] cannot enable peer access from device ordinal 0 to device ordinal 2"

Also, I installed TensorFlow with pip.

Any help is more than welcome.

@zheng-xq
Contributor

"cannot enable peer access from device ordinal 0 to device ordinal 2"

@lglhuada, this is not an error, just a log message. It means there is no efficient way to transfer data directly from gpu:0 to gpu:2. You can exclude either one of them through CUDA_VISIBLE_DEVICES.
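For reference, a minimal sketch of doing this from Python (not from the original reply; the chosen ordinal and the ops are only illustrative). The environment variable has to be set before TensorFlow initializes CUDA, i.e. before the first GPU work is issued; it can equally be set on the shell command line before launching the script.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"     # keep only device ordinal 0 visible

import tensorflow as tf                      # import after setting the variable

with tf.device('/gpu:0'):                    # '/gpu:0' is now the only visible device
    x = tf.constant([1.0, 2.0])
    y = x * 2.0

with tf.Session() as sess:
    print(sess.run(y))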

the GPU memory usage is 95% while the volatile GPU-Util is 0.

This is also expected. TensorFlow reserves most of the GPU memory when it initializes, even before the first GPU kernels are received, so you will see high memory usage from the beginning. But the fact that the GPU is not actually utilized means none of the kernels are running on the GPU.
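As an aside (not part of the original answer, and assuming the later 1.x-style ConfigProto options), the up-front reservation can be relaxed so that nvidia-smi's memory column better reflects what the process actually uses:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # grow the allocation on demand
# config.gpu_options.per_process_gpu_memory_fraction = 0.4   # or cap it explicitly

with tf.Session(config=config) as sess:
    print(sess.run(tf.constant(42.0)))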

Could you try to run the tutorials and see if you can get any GPU utilized?

bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

If you still have problems after that, please provide more information about your machine set up.

@lglhuada
Author

@zheng-xq Thanks for your answers. I have been trying to run the tutorial to check whether the GPU is utilized. However, there are a lot of errors. I will add comments if the errors persist.

@lglhuada
Author

@zheng-xq Hi, I have verified that I can use the GPU with the Bazel example, and the GPU-Util is 21%. So now I need Bazel to build my project, right? Thanks.

@zheng-xq
Contributor

In this case, it is okay to use bazel to build your project, although it shouldn't be necessary.

Please make sure you installed the GPU-enabled TensorFlow binaries. If you
built from source, please use:

bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-0.6.0-cp27-none-linux_x86_64.whl


@lglhuada
Author

Hi @zheng-xq, thanks for your quick feedback. Actually, I tried to run my code without a Bazel build and the GPU-Util is 0, even with "with tf.device('/gpu:0')", and I am sure I have installed the GPU-enabled TensorFlow binaries, CUDA toolkit 7.0, and cuDNN 6.5. What might cause this issue?

@lglhuada
Author

Hi, I have solved my problem without a Bazel build; I modified my code following the cifar10_multi_gpu_train example. Thanks.
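For later readers, a rough sketch of the single-GPU version of that pattern (the model function here is a stand-in, not the thread author's actual code, and it assumes TF 1.x-style APIs): the whole forward and backward pass is built inside tf.device('/gpu:0'), so the kernels, not just the memory, land on the GPU.

import tensorflow as tf

def build_model(images):
    # placeholder model; any chain of TF ops works here
    return tf.layers.dense(tf.reshape(images, [-1, 784]), 10)

images = tf.random_uniform([128, 28, 28, 1])
labels = tf.random_uniform([128], maxval=10, dtype=tf.int32)

with tf.device('/gpu:0'):
    logits = build_model(images)
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

config = tf.ConfigProto(allow_soft_placement=True)   # lets int32/summary ops fall back
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)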

@OswinGuai

@lglhuada I encountered the same problem. How did you make the code run on the GPU? What's the key code?

@lglhuada
Author

lglhuada commented Mar 8, 2017

@OswinGuai If your computation is light (e.g. just a few add or multiply operations), the GPU utilization will not be noticeable; try implementing larger models. ;)
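As a quick check (a generic snippet, not from this thread), a large matrix multiply run in a loop keeps the GPU busy long enough for nvidia-smi to show non-zero volatile GPU-Util, while a handful of scalar ops will not:

import tensorflow as tf

with tf.device('/gpu:0'):
    a = tf.random_normal([8000, 8000])
    b = tf.random_normal([8000, 8000])
    c = tf.matmul(a, b)

with tf.Session() as sess:
    for _ in range(100):   # run repeatedly so the utilization is visible in nvidia-smi
        sess.run(c)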

@TillLindemann

TillLindemann commented Nov 6, 2017

@lglhuada Hi, recently I had the same issue. I tried three different types of neural network:
1. a simple feedforward network, 2. PixelCNN, 3. a GAN.
And weird things happened:
1 uses very little of the GPU, and that project has very large training data.
2 uses almost 100% of the GPU.
3's GPU usage is periodic, sometimes up to 50%, sometimes down to 0%.
After experimenting, I finally found the reason: the code is to blame. It was written by someone with no GPU on his laptop, so it is not GPU-friendly at all. Most operations do run on the GPU, but some still run on the CPU, and while those run the GPU has to wait, which causes the gaps. There are partial fixes, such as changing the type of variables and constants to tf.float32, but they do not change much. So if the code is not GPU-friendly, you may be better off with the CPU version of TensorFlow, which might even be faster than the GPU version. In conclusion, if your GPU is busy on some code and idle on other code, it is probably the code's problem.
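One way to find the operations that fall back to the CPU (a minimal sketch, not from the original comment) is to enable log_device_placement, which prints the device each op is assigned to when the session first runs:

import tensorflow as tf

with tf.device('/gpu:0'):
    x = tf.random_normal([1024, 1024])
    y = tf.matmul(x, x)

config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(y)   # the console log lists the assigned device (CPU or GPU) per op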

@tuobay

tuobay commented Nov 6, 2017

When I train the faster-rcnn-resnet101 model with the object-detection repo, no matter how many GPUs I use, train.py consumes all of them. E.g., when I use 1, 2, or 4 GPUs, train.py occupies all of their memory, but only one GPU's 'Volatile GPU-Util' is 100%; the others are 0%.

@Kirancgi

So what did you do? ... Actually, I am also facing the same issue. Can you please help?

@turowicz

I had this problem when the .record files were invalid.

cc @Kirancgi
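For anyone hitting this, a minimal sketch of sanity-checking a .record file (the path is a placeholder, and it assumes the file holds tf.train.Example protos): iterate and parse every record; a corrupt file raises an error partway through.

import tensorflow as tf

path = "train.record"   # placeholder path to the TFRecord file under test
count = 0
for serialized in tf.python_io.tf_record_iterator(path):
    tf.train.Example.FromString(serialized)   # raises if the record is corrupt
    count += 1
print("parsed %d records" % count)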

@Kirancgi

Thanks buddy, will try it out.
@turowicz
