loss nan and cost nan while running full librispeech setup #34

Closed
manish-kumar-garg opened this issue Dec 20, 2019 · 5 comments

@manish-kumar-garg

Hi, I am training the full LibriSpeech setup on 1 GPU with no changes to the configuration.
I am getting the logs below. Is this expected?

pretrain epoch 8, step 634, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9799331023823471, error:decision 0.0, error:output/output_prob 0.9799331023823471, loss nan, max_size:classes 57, max_size:data 1595, mem_usage:GPU:0 3.8GB, num_seqs 12, 3.044 sec/step, elapsed 0:30:28, exp. remaining 0:19:15, complete 61.27%
pretrain epoch 8, step 635, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9800000237300992, error:decision 0.0, error:output/output_prob 0.9800000237300992, loss nan, max_size:classes 52, max_size:data 1636, mem_usage:GPU:0 3.8GB, num_seqs 12, 3.082 sec/step, elapsed 0:30:31, exp. remaining 0:19:12, complete 61.39%
pretrain epoch 8, step 636, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9806678476743401, error:decision 0.0, error:output/output_prob 0.9806678476743401, loss nan, max_size:classes 59, max_size:data 1683, mem_usage:GPU:0 3.8GB, num_seqs 11, 3.192 sec/step, elapsed 0:30:34, exp. remaining 0:19:03, complete 61.60%
pretrain epoch 8, step 637, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9802513704635203, error:decision 0.0, error:output/output_prob 0.9802513704635203, loss nan, max_size:classes 57, max_size:data 1640, mem_usage:GPU:0 3.8GB, num_seqs 11, 3.111 sec/step, elapsed 0:30:37, exp. remaining 0:18:56, complete 61.78%
pretrain epoch 8, step 638, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.980902785086073, error:decision 0.0, error:output/output_prob 0.980902785086073, loss nan, max_size:classes 59, max_size:data 1677, mem_usage:GPU:0 3.8GB, num_seqs 11, 3.173 sec/step, elapsed 0:30:41, exp. remaining 0:18:51, complete 61.93%
pretrain epoch 8, step 639, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9810126828961073, error:decision 0.0, error:output/output_prob 0.9810126828961073, loss nan, max_size:classes 59, max_size:data 1656, mem_usage:GPU:0 3.8GB, num_seqs 12, 3.182 sec/step, elapsed 0:30:44, exp. remaining 0:18:48, complete 62.03%
pretrain epoch 8, step 640, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9810996705200523, error:decision 0.0, error:output/output_prob 0.9810996705200523, loss nan, max_size:classes 57, max_size:data 1695, mem_usage:GPU:0 3.8GB, num_seqs 11, 3.207 sec/step, elapsed 0:30:47, exp. remaining 0:18:46, complete 62.12%
pretrain epoch 8, step 641, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9812792513985187, error:decision 0.0, error:output/output_prob 0.9812792513985187, loss nan, max_size:classes 58, max_size:data 1590, mem_usage:GPU:0 3.8GB, num_seqs 12, 3.055 sec/step, elapsed 0:30:50, exp. remaining 0:18:44, complete 62.20%
@albertz
Member

albertz commented Dec 20, 2019

No. I assume this did not happen in the first epoch (first step) but only sometime later? Which is the first epoch/step where it happens?
Unfortunately the training is non-deterministic, and this can happen, although this specific case (getting nan) should be very rare. Can you try to just restart? (Delete the existing models first, so that it starts from scratch.)
If this happens again, can you report the whole log? Or, most importantly, the beginning, where we can see the TF version and your GPU information.

@manish-kumar-garg
Author

manish-kumar-garg commented Dec 20, 2019

@albertz I tried retraining after deleting the existing models. Again the same loss (nan).
stdout here

@manish-kumar-garg
Author

In some places I can see:
W ./tensorflow/core/util/ctc/ctc_loss_calculator.h:499] No valid path found.

@albertz
Copy link
Member

albertz commented Dec 21, 2019

Thanks. So it trains OK initially, and the nan first occurs in epoch 6, step 171.
The CTC warning (No valid path found) is expected and can be ignored; it happens when the input sequence is shorter than the target sequence, which is invalid for CTC (see the sketch below).
Can you maybe try again a few times? Is it always the same epoch/step? As I said, I assume this is non-deterministic, and maybe you just had bad luck.
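
As a side note, here is a minimal Python sketch (a hypothetical helper, not RETURNN or TensorFlow code) of the length condition behind that warning. Standard CTC inserts a blank between repeated labels, so a valid alignment path only exists if the number of input frames is at least the target length plus the number of adjacent repeated targets:

```python
# Hypothetical helper, not RETURNN/TF code: illustrates the CTC length
# condition behind "No valid path found". Standard CTC needs a blank
# between repeated labels, so it requires at least
#   len(targets) + number_of_adjacent_repeats
# input frames for any valid alignment path to exist.
def ctc_path_exists(num_input_frames, targets):
    repeats = sum(1 for a, b in zip(targets, targets[1:]) if a == b)
    return num_input_frames >= len(targets) + repeats

print(ctc_path_exists(5, [3, 7, 7, 2]))  # True: needs 4 + 1 = 5 frames
print(ctc_path_exists(3, [3, 7, 7, 2]))  # False: this would trigger the warning
```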

Otherwise, I can only guess here. Maybe there is something strange in this TF version (1.15). I think I trained it with TF 1.8. I would assume that TF 1.13 or so should also be fine. I'm not sure if people have used TF 1.15 much, but I don't see why that should cause problems. Maybe you can just try some other TF versions (1.8 or 1.13).

Maybe the hyper params are not optimal for your GPU (Tesla K80). I have often observed that hyper params which are optimal for one specific GPU type are less optimal on another GPU. We mostly used a GTX 1080 Ti for our experiments. If you have access to another GPU, you might also want to try that. Otherwise, maybe play around a bit with the hyper params, e.g. make the learning rate warmup (learning_rates in the config) a bit longer (e.g. 15 steps), or start lower (e.g. 0.0001).
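
For reference, a sketch of what a longer and lower warmup could look like in a RETURNN config (the config is a Python file; the concrete values below are illustrative assumptions, not the numbers from this setup, so adapt them to your config):

```python
# Illustrative sketch of a longer/lower learning rate warmup in a RETURNN
# config. The concrete values below are assumptions for illustration,
# not the setup's original numbers.
import numpy

learning_rate = 0.001
# Warm up over the first 15 entries of the schedule, starting lower
# than the final learning rate:
learning_rates = list(numpy.linspace(0.0001, learning_rate, num=15))
```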

Or maybe use one of our new configs for LibriSpeech, for example this one. It might be more stable, and will anyway give you better results. (Make a diff to see the differences. You might also need to adjust some of the file paths.)

@manish-kumar-garg
Author

Changing the learning rate warmup to 15 steps works.
