GPU utilization drops to 0% without any error info #79
Can you kill the job and restart it? It should resume training. I've never seen it just stop progressing without any error messages, so I have no idea what could be going on.
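For reference, here is a minimal TF2-style sketch of how a restart resumes from the most recent checkpoint; the model and optimizer are placeholders, and Lingvo's own trainer does this bookkeeping internally, so this only illustrates the mechanism:

```python
import tensorflow as tf

# Placeholder model/optimizer; stand-ins for whatever the trainer actually builds.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(
    ckpt, directory="/tmp/librispeech/train", max_to_keep=5)

# On restart, restore the latest checkpoint (if any) and continue training.
latest = manager.latest_checkpoint
if latest:
    ckpt.restore(latest).expect_partial()
    print("Resumed from", latest)
else:
    print("No checkpoint found; starting from scratch.")
```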
Thank you~
Could you please have a look at this: I tried to install the environment without Docker, but I ran into #32, where I pasted my error at the end. Could you please help me? Thanks a lot! I have tried many ways to solve it, but it still doesn't work.
The Docker image installs tf-nightly (see line 75 in e649e65). That might be the source of the problem if you have a different TensorFlow version.
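If you suspect a version mismatch, a quick check of which TensorFlow build is actually being imported (nothing Lingvo-specific, just standard TF introspection) can confirm it against the tf-nightly version pinned in the Docker image:

```python
import tensorflow as tf

# Print the installed TensorFlow version and build info; compare these against
# the tf-nightly build that the Docker image installs.
print("TF version:", tf.__version__)
print("Git version:", tf.version.GIT_VERSION)
print("Built with CUDA:", tf.test.is_built_with_cuda())
```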
That's right! That really was the problem. Thanks!
I am seeing the same issue with the Librispeech recipe. While running the Librispeech grapheme recipe (default params, no changes to the recipe), the loss starts decreasing over steps, but after some time losses are not computed at all. The trainer also stays on the same step (the same step is checkpointed again). The first time it happened I killed and restarted the job; after a few more steps, GPU utilization dropped to zero again.

**Runs with decent GPU utilization, loss decreases, steps/second look okay:**
I0505 21:07:48.797380 140580822693632 trainer.py:371] Steps/second: 0.117520, Examples/second: 48.137719

**From here on, losses are not computed and GPU usage becomes zero:**
I0505 21:08:08.816323 140580822693632 trainer.py:371] Steps/second: 0.117520, Examples/second: 48.137720

**The same checkpoint is saved again:**
I0505 22:26:35.644196 140580822693632 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450
I0506 04:06:55.412421 140580822693632 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450
It's very strange. My experiment stopped printing any info and also stopped saving checkpoints.
In async mode I am seeing the same issue. It stops after, say, 14k steps and GPU utilization drops to zero, while memory usage stays the same. Unlike in sync mode (previous post), I don't see any progress here. It runs normally after I kill it and restart.
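As a stopgap, the kill-and-restart workaround can be automated with a watchdog along these lines. This is only a sketch: the training command and poll interval are placeholders (the interval must be longer than your checkpointing interval), and it treats the run as healthy as long as the latest checkpoint keeps advancing:

```python
import subprocess
import time

import tensorflow as tf

LOGDIR = "/tmp/librispeech/train"      # checkpoint directory from the logs above
TRAIN_CMD = ["python", "trainer.py"]   # placeholder: your actual training command
POLL_SECS = 1800                       # treat the job as stalled after 30 min without progress

def latest_step():
    """Return the step number of the newest checkpoint, or -1 if none exists."""
    ckpt = tf.train.latest_checkpoint(LOGDIR)
    # Checkpoint prefixes look like ".../ckpt-00011450"; extract the step number.
    return int(ckpt.rsplit("-", 1)[-1]) if ckpt else -1

proc = subprocess.Popen(TRAIN_CMD)
last = latest_step()
while True:
    time.sleep(POLL_SECS)
    cur = latest_step()
    if cur <= last:                    # no new checkpoint: assume the trainer is stuck
        proc.kill()
        proc.wait()
        proc = subprocess.Popen(TRAIN_CMD)   # restart; it resumes from the last checkpoint
    last = max(cur, last)
```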
In async mode with two trainers for Librispeech960Wpm, I observed exactly the same phenomenon as datavizweb reported: the same step (33299) is checkpointed again and again.
I wonder if it is some kind of threading issue / race condition due to running the controller and trainer in the same binary. Internally we always run the jobs as separate binaries and have never observed this problem. That is the only real difference I can think of.
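If you want to try running the jobs as separate processes rather than one binary, a launcher like the sketch below is one option. The `--job` values, the other flag names, and the model name are assumptions about the trainer's CLI rather than confirmed usage, so check `python lingvo/trainer.py --help` for the exact spellings:

```python
import subprocess

# Assumed flags and job names ("controller", "trainer") -- illustrative only;
# verify against the trainer's --help output before using.
COMMON = [
    "--model=asr.librispeech.Librispeech960Wpm",
    "--logdir=/tmp/librispeech/train",
    "--mode=async",
]

controller = subprocess.Popen(
    ["python", "lingvo/trainer.py", "--job=controller"] + COMMON)
trainer = subprocess.Popen(
    ["python", "lingvo/trainer.py", "--job=trainer"] + COMMON)

controller.wait()
trainer.wait()
```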
@iamxiaoyubei Did you solve the problem? I'm running into the same issue.
I didn't solve it. I just restart the run to deal with it. 😂
I solved the problem by setting: export TF_CUDNN_USE_AUTOTUNE=0
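For anyone applying this workaround from inside Python rather than the shell, the variable has to be set before TensorFlow is imported; a minimal sketch:

```python
import os

# Disable cuDNN autotuning; this must be set before TensorFlow is imported.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

import tensorflow as tf  # noqa: E402  (imported after setting the env var on purpose)
```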
Hi, I've been training models for almost two days. Today, GPU utilization suddenly dropped to 0%, but all the GPU memory was still occupied by the experiment. In addition, the experiment log stopped displaying any information, whether training progress or error messages.
The upper-left part of the figure below shows the logs; the lower-left part shows the nvidia-smi output.
Does anyone know what's going on?
In addition, I tried to install the environment without Docker, but I ran into #32, where I pasted my error at the end. Could you please help me? Thanks a lot!