GPU utilization drops to 0% without any error info #79
Can you kill the job and restart it? It should resume training. I've never seen it just stop progressing without any error messages, so I have no idea what could be going on.
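For reference, here is a minimal TF2-style sketch of how a restart resumes from the most recent checkpoint; the model and optimizer are placeholders, and Lingvo's own trainer does this bookkeeping internally, so this only illustrates the mechanism:

```python
import tensorflow as tf

# Placeholder model/optimizer; stand-ins for whatever the trainer actually builds.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(
    ckpt, directory="/tmp/librispeech/train", max_to_keep=5)

# On restart, restore the latest checkpoint (if any) and continue training.
latest = manager.latest_checkpoint
if latest:
    ckpt.restore(latest).expect_partial()
    print("Resumed from", latest)
else:
    print("No checkpoint found; starting from scratch.")
```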
Thank you~
Could you please have a look at this: I tried to install the environment without Docker, but I ran into #32, where I pasted my error at the end. Could you please help me? Thanks a lot! I have tried many ways to solve it, but it still doesn't work.
The Docker image installs tf-nightly (see line 75 in e649e65). That might be the source of the problem if you have a different TensorFlow version.
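If you suspect a version mismatch, a quick check of which TensorFlow build is actually being imported (nothing Lingvo-specific, just standard TF introspection) can confirm it against the tf-nightly version pinned in the Docker image:

```python
import tensorflow as tf

# Print the installed TensorFlow version and build info; compare these against
# the tf-nightly build that the Docker image installs.
print("TF version:", tf.__version__)
print("Git version:", tf.version.GIT_VERSION)
print("Built with CUDA:", tf.test.is_built_with_cuda())
```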
That's right! That really was the problem. Thanks!
I am seeing the same issue with the Librispeech recipe. While running the Librispeech grapheme recipe (default params, no changes to the recipe), the loss starts decreasing over steps, but after some time losses are not computed at all. The trainer also stays on the same step (the same step is checkpointed again). The first time it happened I killed and restarted the job; after a few more steps, GPU utilization dropped to zero again.

**Runs with decent GPU utilization, loss decreases, steps/second look okay:**
I0505 21:07:48.797380 140580822693632 trainer.py:371] Steps/second: 0.117520, Examples/second: 48.137719

**From here on, losses are not computed and GPU usage becomes zero:**
I0505 21:08:08.816323 140580822693632 trainer.py:371] Steps/second: 0.117520, Examples/second: 48.137720

**The same checkpoint is saved again:**
I0505 22:26:35.644196 140580822693632 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450
I0506 04:06:55.412421 140580822693632 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450
It's very strange. My experiment stopped printing any info and also stopped saving checkpoints.
In async mode I am seeing the same issue. It stops after, say, 14k steps and GPU utilization drops to zero, while memory usage stays the same. Unlike in sync mode (previous post), I don't see any progress here. It runs normally after I kill it and restart.
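As a stopgap, the kill-and-restart workaround can be automated with a watchdog along these lines. This is only a sketch: the training command and poll interval are placeholders (the interval must be longer than your checkpointing interval), and it treats the run as healthy as long as the latest checkpoint keeps advancing:

```python
import subprocess
import time

import tensorflow as tf

LOGDIR = "/tmp/librispeech/train"      # checkpoint directory from the logs above
TRAIN_CMD = ["python", "trainer.py"]   # placeholder: your actual training command
POLL_SECS = 1800                       # treat the job as stalled after 30 min without progress

def latest_step():
    """Return the step number of the newest checkpoint, or -1 if none exists."""
    ckpt = tf.train.latest_checkpoint(LOGDIR)
    # Checkpoint prefixes look like ".../ckpt-00011450"; extract the step number.
    return int(ckpt.rsplit("-", 1)[-1]) if ckpt else -1

proc = subprocess.Popen(TRAIN_CMD)
last = latest_step()
while True:
    time.sleep(POLL_SECS)
    cur = latest_step()
    if cur <= last:                    # no new checkpoint: assume the trainer is stuck
        proc.kill()
        proc.wait()
        proc = subprocess.Popen(TRAIN_CMD)   # restart; it resumes from the last checkpoint
    last = max(cur, last)
```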
In async mode with two trainers for Librispeech960Wpm, I observed exactly the same phenomenon as datavizweb reported: the same step (33299) is checkpointed again and again.
I wonder if it is some kind of threading issue / race condition due to running the controller and trainer in the same binary. Internally we always run the jobs as separate binaries and have never observed this problem. That is the only real difference I can think of.
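If you want to try running the jobs as separate processes rather than one binary, a launcher like the sketch below is one option. The `--job` values, the other flag names, and the model name are assumptions about the trainer's CLI rather than confirmed usage, so check `python lingvo/trainer.py --help` for the exact spellings:

```python
import subprocess

# Assumed flags and job names ("controller", "trainer") -- illustrative only;
# verify against the trainer's --help output before using.
COMMON = [
    "--model=asr.librispeech.Librispeech960Wpm",
    "--logdir=/tmp/librispeech/train",
    "--mode=async",
]

controller = subprocess.Popen(
    ["python", "lingvo/trainer.py", "--job=controller"] + COMMON)
trainer = subprocess.Popen(
    ["python", "lingvo/trainer.py", "--job=trainer"] + COMMON)

controller.wait()
trainer.wait()
```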
@iamxiaoyubei Did you solve the problem? I'm running into the same issue.
I didn't solve it. I just restart the run to deal with it. 😂
I solved the problem by setting: export TF_CUDNN_USE_AUTOTUNE=0
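For anyone applying this workaround from inside Python rather than the shell, the variable has to be set before TensorFlow is imported; a minimal sketch:

```python
import os

# Disable cuDNN autotuning; this must be set before TensorFlow is imported.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

import tensorflow as tf  # noqa: E402  (imported after setting the env var on purpose)
```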
Hi, I've been training models for almost two days. Today, GPU utilization suddenly dropped to 0%, but all the GPU memory was still occupied by the experiment. In addition, the experiment log stopped displaying any information, whether training progress or error messages.
The upper-left part of the figure below shows the logs; the lower-left part shows the nvidia-smi output.
Does anyone know what's going on?
In addition, I tried to install the environment without Docker, but I ran into #32, where I pasted my error at the end. Could you please help me? Thanks a lot!