
GPU utilization down to 0% without any error infos #79
Open · iamxiaoyubei opened this issue May 7, 2019 · 13 comments

Comments

@iamxiaoyubei

Hi, I've been training models for almost two days. Today, GPU utilization suddenly dropped to 0%, but all GPU memory was still occupied by the experiment. In addition, the experiment log no longer displays any information, neither training progress nor error messages.

The upper-left part of the screenshot below shows the logs; the lower-left part shows the nvidia-smi output.

[screenshot: WXWorkCapture_15572357425815(1)]

Does anyone know what's going on?

In addition, I tried to install the environment without docker, but I ran into #32, where I pasted my error at the end. Could you please help me? Thanks a lot!

@jonathanasdf
Contributor

Can you kill the job and restart it? It should resume training.

I've never seen it just stop progressing without any error messages, so I have no idea what could be going on.
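
For reference, resuming should just be a matter of relaunching the trainer with the same --logdir; it restores from the most recent checkpoint there. A minimal sketch, assuming the usual local launch command from the README (the model name and logdir below are illustrative placeholders, not your exact setup):

# Kill the hung trainer process, then relaunch with the same --logdir;
# the trainer picks up the latest checkpoint in that directory.
bazel-bin/lingvo/trainer \
  --run_locally=gpu \
  --mode=sync \
  --model=asr.librispeech.Librispeech960Grapheme \
  --logdir=/tmp/librispeech \
  --logtostderr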

@iamxiaoyubei
Author

Thank you~
I have run into this situation twice, both times after two or three days of running. After I stopped and restarted the job, training resumed.

@iamxiaoyubei
Author

Could you please have a look at this: I tried to install the environment without docker, but I ran into #32, where I pasted my error at the end. Could you please help me? Thanks a lot!

I have tried many ways to solve it, but it still does not work.
Also, I did not see tensorflow being installed anywhere in the Dockerfile, yet tensorflow 1.14.1 shows up after setting up the environment via docker. Without docker, tensorflow 1.14.1 has to be built from source, because PyPI does not have that version. I would like to know why docker can install that version directly. Is my problem related to the tensorflow version?

@jonathanasdf
Contributor

The docker image installs tf-nightly, via this line in the Dockerfile:

RUN pip --no-cache-dir install tf-nightly$(test "$base_image" != "$cpu_base_image" && echo "-gpu")

That might be the source of the problem if you have a different tensorflow version.
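
A quick way to confirm a mismatch is to print the installed TensorFlow version (and GPU visibility) both inside the docker container and in your non-docker environment, and compare:

# Run in both environments and compare the output.
python -c "import tensorflow as tf; print(tf.__version__); print(tf.test.is_gpu_available())"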

@iamxiaoyubei
Author

That's right! That was indeed the problem. Thanks!

@datavizweb

I am seeing the same issue with the Librispeech recipe.

While running the Librispeech grapheme recipe (default params, no changes to the recipe), the loss starts decreasing over steps. But after some time the losses are no longer computed at all, and training stays on the same step (the same step gets checkpointed again and again). The first time it happened I killed and restarted the job; after a few more steps, GPU utilization dropped to zero again.

** Runs with decent GPU utilization; loss decreases and steps per second look okay too **

I0505 21:07:48.797380 140580822693632 trainer.py:371] Steps/second: 0.117520, Examples/second: 48.137719
I0505 21:07:53.444988 140580814300928 trainer.py:520] step: 11449 fraction_of_correct_next_step_preds:0.98058969 fraction_of_correct_next_step_preds/logits:0.98058969 grad_norm/all:1.6253868 grad_scale_all:0.61523813 log_pplx:0.062722519 log_pplx/logits:0.062722519 loss:0.062722519 loss/logits:0.062722519 num_samples_in_batch:384 var_norm/all:608.57135
I0505 21:07:58.806945 140580822693632 trainer.py:371] Steps/second: 0.117520, Examples/second: 48.137722
I0505 21:08:02.775715 140580814300928 trainer.py:520] step: 11450 fraction_of_correct_next_step_preds:0.98301238 fraction_of_correct_next_step_preds/logits:0.98301238 grad_norm/all:1.4966037 grad_scale_all:0.66817957 log_pplx:0.054031234 log_pplx/logits:0.054031234 loss:0.054031234 loss/logits:0.054031234 num_samples_in_batch:384 var_norm/all:608.56183

From here on, the loss is no longer computed and GPU usage drops to zero:

I0505 21:08:08.816323 140580822693632 trainer.py:371] Steps/second: 0.117520, Examples/second: 48.137720
I0505 21:08:18.826544 140580822693632 trainer.py:371] Steps/second: 0.117506, Examples/second: 48.132058
I0505 21:08:28.836873 140580822693632 trainer.py:371] Steps/second: 0.117492, Examples/second: 48.126397
I0505 21:08:38.846771 140580822693632 trainer.py:371] Steps/second: 0.117479, Examples/second: 48.120738
I0505 21:08:48.856947 140580822693632 trainer.py:371] Steps/second: 0.117465, Examples/second: 48.115080
I0505 21:08:58.866631 140580822693632 trainer.py:371] Steps/second: 0.117451, Examples/second: 48.109423
I0505 21:09:08.877096 140580822693632 trainer.py:371] Steps/second: 0.117437, Examples/second: 48.103767
I0505 21:09:18.887014 140580822693632 trainer.py:371] Steps/second: 0.117423, Examples/second: 48.098113

** Same checkpoint saved again **
I0505 22:16:35.073483 140580822693632 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450
I0505 22:16:42.773545 140580822693632 trainer.py:371] Steps/second: 0.112100, Examples/second: 45.917725
I0505 22:16:52.783426 140580822693632 trainer.py:371] Steps/second: 0.112088, Examples/second: 45.912573
I0505 22:17:02.793808 140580822693632 trainer.py:371] Steps/second: 0.112075, Examples/second: 45.907422

I0505 22:26:35.644196 140580822693632 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450
I0505 22:26:43.352466 140580822693632 trainer.py:371] Steps/second: 0.111351, Examples/second: 45.610651
I0505 22:26:53.362185 140580822693632 trainer.py:371] Steps/second: 0.111338, Examples/second: 45.605568
I0505 22:27:03.372393 140580822693632 trainer.py:371] Steps/second: 0.111326, Examples/second: 45.600485

I0506 04:06:55.412421 140580822693632 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450
I0506 04:07:03.148743 140580822693632 trainer.py:371] Steps/second: 0.090723, Examples/second: 37.161114
I0506 04:07:13.158670 140580822693632 trainer.py:371] Steps/second: 0.090714, Examples/second: 37.157740
I0506 04:07:23.168779 140580822693632 trainer.py:371] Steps/second: 0.090706, Examples/second: 37.154366
I0506 04:07:33.178835 140580822693632 trainer.py:371] Steps/second: 0.090698, Examples/second: 37.150993

@iamxiaoyubei
Author

That's very strange. My experiment stopped displaying any info and also did not save a checkpoint.

@datavizweb

datavizweb commented May 10, 2019

In async mode I am seeing the same issue. It stops after, say, 14k steps and GPU utilization drops to zero while memory usage stays the same. Unlike in sync mode (previous post), I don't see any progress here at all. It runs normally after I kill it and restart.

@AaronSeunghi

In async mode with two trainers for Librispeech960Wpm, I observed exactly the same phenomenon that datavizweb reported: the same step (33299) is checkpointed again and again.

@jonathanasdf
Contributor

I wonder if it is some kind of threading issue / race condition due to running controller and trainer in the same binary. Internally we always run the jobs as separate binaries and have never observed this problem. That is the only real difference I can think of.
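
If someone wants to rule that out, the idea would be to launch the controller and the trainer as two separate processes sharing the same --logdir instead of one binary running both roles. A rough sketch only; the flag names and values below are assumptions and should be checked against the actual flags defined in trainer.py:

# Hypothetical split launch: one process per job, same logdir.
# The --job values and model name here are illustrative, not verified.
bazel-bin/lingvo/trainer --job=controller --model=asr.librispeech.Librispeech960Grapheme --logdir=/tmp/librispeech --logtostderr &
bazel-bin/lingvo/trainer --job=trainer_client --model=asr.librispeech.Librispeech960Grapheme --logdir=/tmp/librispeech --logtostderr &
wait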

@NiHaoUCAS

@iamxiaoyubei Did you solve the problem? I am hitting the same issue.

@iamxiaoyubei
Author

I didn't solve this problem. I just restart the run whenever it happens. 😂

@NiHaoUCAS

I solved the problem by setting: export TF_CUDNN_USE_AUTOTUNE=0
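
For anyone else hitting this, set the variable in the environment of the process that launches training (or bake it into the Dockerfile with an ENV line). A minimal example; the launch command itself is whatever you normally use:

# Disable TensorFlow's cuDNN algorithm autotuning (the benchmarking of
# convolution algorithms), which this workaround suggests is involved
# in the hang, then launch training as usual.
export TF_CUDNN_USE_AUTOTUNE=0
bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=... --logdir=... --logtostderr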
