loss nan and cost nan while running full librispeech setup #34

Closed
manish-kumar-garg opened this issue Dec 20, 2019 · 5 comments

@manish-kumar-garg

Hi, I am training the full LibriSpeech setup on 1 GPU with no changes to the configuration.
I am getting the logs below. Is this expected?

pretrain epoch 8, step 634, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9799331023823471, error:decision 0.0, error:output/output_prob 0.9799331023823471, loss nan, max_size:classes 57, max_size:data 1595, mem_usage:GPU:0 3.8GB, num_seqs 12, 3.044 sec/step, elapsed 0:30:28, exp. remaining 0:19:15, complete 61.27%
pretrain epoch 8, step 635, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9800000237300992, error:decision 0.0, error:output/output_prob 0.9800000237300992, loss nan, max_size:classes 52, max_size:data 1636, mem_usage:GPU:0 3.8GB, num_seqs 12, 3.082 sec/step, elapsed 0:30:31, exp. remaining 0:19:12, complete 61.39%
pretrain epoch 8, step 636, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9806678476743401, error:decision 0.0, error:output/output_prob 0.9806678476743401, loss nan, max_size:classes 59, max_size:data 1683, mem_usage:GPU:0 3.8GB, num_seqs 11, 3.192 sec/step, elapsed 0:30:34, exp. remaining 0:19:03, complete 61.60%
pretrain epoch 8, step 637, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9802513704635203, error:decision 0.0, error:output/output_prob 0.9802513704635203, loss nan, max_size:classes 57, max_size:data 1640, mem_usage:GPU:0 3.8GB, num_seqs 11, 3.111 sec/step, elapsed 0:30:37, exp. remaining 0:18:56, complete 61.78%
pretrain epoch 8, step 638, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.980902785086073, error:decision 0.0, error:output/output_prob 0.980902785086073, loss nan, max_size:classes 59, max_size:data 1677, mem_usage:GPU:0 3.8GB, num_seqs 11, 3.173 sec/step, elapsed 0:30:41, exp. remaining 0:18:51, complete 61.93%
pretrain epoch 8, step 639, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9810126828961073, error:decision 0.0, error:output/output_prob 0.9810126828961073, loss nan, max_size:classes 59, max_size:data 1656, mem_usage:GPU:0 3.8GB, num_seqs 12, 3.182 sec/step, elapsed 0:30:44, exp. remaining 0:18:48, complete 62.03%
pretrain epoch 8, step 640, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9810996705200523, error:decision 0.0, error:output/output_prob 0.9810996705200523, loss nan, max_size:classes 57, max_size:data 1695, mem_usage:GPU:0 3.8GB, num_seqs 11, 3.207 sec/step, elapsed 0:30:47, exp. remaining 0:18:46, complete 62.12%
pretrain epoch 8, step 641, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9812792513985187, error:decision 0.0, error:output/output_prob 0.9812792513985187, loss nan, max_size:classes 58, max_size:data 1590, mem_usage:GPU:0 3.8GB, num_seqs 12, 3.055 sec/step, elapsed 0:30:50, exp. remaining 0:18:44, complete 62.20%
@albertz
Member

albertz commented Dec 20, 2019

No. I assume this did not happen in the first epoch (first step) but only sometime later? Which is the first epoch/step where it happens?
Unfortunately the training is non-deterministic, and this can happen, although this specific case (getting nan) should be very rare. Can you try to just restart? (Delete the existing models first, so that it starts from scratch.)
If this happens again, can you report the whole log? Or, most importantly, the beginning, where we can see the TF version and your GPU information.

@manish-kumar-garg
Author

manish-kumar-garg commented Dec 20, 2019

@albertz I tried retraining after deleting the existing models. Again the same loss (nan).
stdout here

@manish-kumar-garg
Author

In some places I can see:
W ./tensorflow/core/util/ctc/ctc_loss_calculator.h:499] No valid path found.

@albertz
Copy link
Member

albertz commented Dec 21, 2019

Thanks. So it trains OK initially, and the nan first occurs in epoch 6, step 171.
The CTC warning (No valid path found) is expected and can be ignored; it happens when the input sequence is shorter than the target sequence, which is invalid for CTC (see the sketch below).
Can you maybe try again a few times? Is it always the same epoch/step? As I said, I assume this is non-deterministic, and maybe you just had bad luck.
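
As a side note, here is a minimal Python sketch (a hypothetical helper, not RETURNN or TensorFlow code) of the length condition behind that warning. Standard CTC inserts a blank between repeated labels, so a valid alignment path only exists if the number of input frames is at least the target length plus the number of adjacent repeated targets:

```python
# Hypothetical helper, not RETURNN/TF code: illustrates the CTC length
# condition behind "No valid path found". Standard CTC needs a blank
# between repeated labels, so it requires at least
#   len(targets) + number_of_adjacent_repeats
# input frames for any valid alignment path to exist.
def ctc_path_exists(num_input_frames, targets):
    repeats = sum(1 for a, b in zip(targets, targets[1:]) if a == b)
    return num_input_frames >= len(targets) + repeats

print(ctc_path_exists(5, [3, 7, 7, 2]))  # True: needs 4 + 1 = 5 frames
print(ctc_path_exists(3, [3, 7, 7, 2]))  # False: this would trigger the warning
```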

Otherwise, I can only guess here. Maybe there is something strange in this TF version (1.15). I think I trained it with TF 1.8. I would assume that TF 1.13 or so should also be fine. I'm not sure if people have used TF 1.15 much, but I don't see why that should cause problems. Maybe you can just try some other TF versions (1.8 or 1.13).

Maybe the hyper params are not optimal for your GPU (Tesla K80). I have often observed that hyper params which are optimal for one specific GPU type are less optimal on another GPU. We mostly used a GTX 1080 Ti for our experiments. If you have access to another GPU, you might also want to try that. Otherwise, maybe play around a bit with the hyper params, e.g. make the learning rate warmup (learning_rates in the config) a bit longer (e.g. 15 steps), or start lower (e.g. 0.0001).
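
For reference, a sketch of what a longer and lower warmup could look like in a RETURNN config (the config is a Python file; the concrete values below are illustrative assumptions, not the numbers from this setup, so adapt them to your config):

```python
# Illustrative sketch of a longer/lower learning rate warmup in a RETURNN
# config. The concrete values below are assumptions for illustration,
# not the setup's original numbers.
import numpy

learning_rate = 0.001
# Warm up over the first 15 entries of the schedule, starting lower
# than the final learning rate:
learning_rates = list(numpy.linspace(0.0001, learning_rate, num=15))
```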

Or maybe use one of our new configs for LibriSpeech, for example this one. It might be more stable, and will anyway give you better results. (Make a diff to see the differences. You might also need to adjust some of the file paths.)

@manish-kumar-garg
Author

Changing the learning rate warmup to 15 steps works.
