loss nan and cost nan while running full librispeech setup #34
Comments
No. I assume this did not happen in the first epoch (first step) but only sometime later? Which is the first epoch/step where it happens?
At some places I can see
Thanks. So it trains ok initially, and the nan first occurs in epoch 6, step 171. Otherwise, I can only guess here. Maybe there is something strange in this TF version (1.15). I think I trained it with TF 1.8. I would assume that TF 1.13 or so should also be fine. I'm not sure if people used TF 1.15 much, but I wonder why that should make problems. But maybe you can just try some other TF versions (1.8 or 1.13). Maybe the hyperparams are not optimal for your GPU (Tesla K80). I often observed that hyperparams are optimal for one specific GPU type, and less optimal on another GPU. We mostly used the GTX 1080 Ti for our experiments. If you have access to another GPU, you might also want to try that. Or otherwise, maybe slightly play around with the hyperparams, e.g. adjust the learning rate warmup. Or maybe use one of our new configs for LibriSpeech. For example this one. It might be more stable, and will anyway give you better results. (Make a diff to see the differences. You might also need to adjust some of the file paths.)
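The "make a diff to see the differences" suggestion can be sketched like this (the file names and contents here are made-up placeholders, not the actual config paths from the repo):

```shell
# Create two toy stand-ins for the old and new LibriSpeech configs
# (in practice you would diff the real config files from the repo).
printf 'learning_rate_warmup = 10\n' > old_config.py
printf 'learning_rate_warmup = 15\n' > new_config.py

# Unified diff of the two configs; diff exits with status 1 when
# the files differ, so we ignore that exit code here.
diff -u old_config.py new_config.py || true
```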
Changing the learning rate warmup to 15 steps works.
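For reference, a linear learning rate warmup over the first N steps can be sketched as below. This is a minimal standalone illustration, not the actual RETURNN implementation; the function name and the base learning rate are assumptions.

```python
def warmup_lr(step, base_lr=0.0008, warmup_steps=15):
    """Linearly ramp the learning rate up to base_lr over the first
    `warmup_steps` steps, then keep it constant at base_lr.

    A too-short warmup can let early updates blow up the loss to nan;
    a longer warmup (e.g. 15 steps, as reported above) keeps the
    effective learning rate small while training is still unstable.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```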
Might be a better default for the learning rate warmup, as reported in #34. Maybe related to the Tesla K80 or TensorFlow 1.15.
Hi, I am training the full LibriSpeech setup on 1 GPU with no changes to the configuration.
I am getting the logs below. Is this expected?