[Bug]: Training hifigan on ljspeech results in FileNotFoundError for train.json #1777
Comments
Okay, it seems the training cannot be started in distributed mode. Running on a single GPU does not result in the error.
Hi @padmalcom, there is no reason why it could not be started with DDP. @Adel-Moumen @anautsch @BenoitWang Please have a look.
Hi @TParcollet, thanks for taking a look. A single GPU training runs fine with this command:
... while a distributed training fails with the mentioned exception. The command is:
Is there an issue with the arguments? I did not find any information on how hifi_gan is trained using DDP, so I assumed it is similar to the tacotron2 training.
Most likely a run_on_main call is missing somewhere. It may also crash if the whole data preparation takes longer than 15 minutes, for instance because of a very slow NFS. I believe 15 minutes is the default maximum DDP waiting time for non-main processes. Hence, if the main process spends longer than 15 minutes on data preparation, the others will continue and fail because train.json has not been created yet.
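To illustrate the timeout scenario described above, here is a minimal sketch that uses a `threading.Barrier` as a stand-in for the DDP synchronization point. It is not how torch.distributed or SpeechBrain implement waiting; `simulate`, `worker_wait`, and the second-scale timeouts are hypothetical and only model "the main process prepares data while another process waits for a limited time".

```python
import threading
import time


def worker_wait(barrier, timeout_s):
    """Non-main rank: wait for the main rank, give up after timeout_s seconds."""
    try:
        barrier.wait(timeout=timeout_s)
        return "synced"
    except threading.BrokenBarrierError:
        # The wait timed out, so this rank stops waiting for the main rank.
        return "timed out"


def simulate(prep_seconds, timeout_s):
    """Hypothetical two-rank run: main prepares data, one worker waits."""
    barrier = threading.Barrier(2)
    result = {}

    def worker():
        result["worker"] = worker_wait(barrier, timeout_s)

    thread = threading.Thread(target=worker)
    thread.start()

    time.sleep(prep_seconds)  # stands in for the data preparation on main
    try:
        barrier.wait(timeout=timeout_s)
        result["main"] = "synced"
    except threading.BrokenBarrierError:
        # The worker already gave up, so the barrier is broken.
        result["main"] = "barrier broken"

    thread.join()
    return result


if __name__ == "__main__":
    print(simulate(prep_seconds=0.0, timeout_s=1.0))   # fast preparation
    print(simulate(prep_seconds=0.3, timeout_s=0.05))  # preparation outlasts the wait
```

When the preparation outlasts the waiting time, the worker stops waiting before the data exists, which is the failure mode suggested in the comment above.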
The training (including the preprocessing) stops right after I submit the command, so it is not a timeout. When I do the preprocessing with only one GPU and without DDP, the data is processed fine. If I then interrupt and restart with DDP, the error is somewhat different:
@TParcollet a side note: we do not have dedicated DDP testing. I can look into automating this AFTER the refactoring PRs are through. (That follow-up PR could be e.g. on reducing recipe testing time & DDP.) The LJSpeech train recipe lacks the DDP init line - compare
Adding the mentioned line seems to do the trick. Thanks @anautsch. The json file(s) are created now, even with DDP.
FYI I created a short PR for this issue: #1781
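A plausible reading of why the missing init line produces the FileNotFoundError: if the process group is never initialized, the synchronization after the main-only data preparation has nothing to wait on, so the other ranks go straight to reading train.json before it exists. The sketch below models this with threads as ranks. `FakeGroup`, `run_on_main`, and `launch` are hypothetical local helpers, not SpeechBrain's implementation, and the "barrier is a no-op without init" behaviour is an assumption of the sketch.

```python
import json
import os
import tempfile
import threading
import time


class FakeGroup:
    """Hypothetical stand-in for a DDP process group (threads act as ranks)."""

    def __init__(self, world_size):
        self.initialized = False
        self._barrier = threading.Barrier(world_size)

    def init(self):
        self.initialized = True

    def barrier(self):
        # Assumption: without an initialized group, the barrier does nothing.
        if self.initialized:
            self._barrier.wait()


def run_on_main(group, rank, func):
    """Run func on rank 0 only, then synchronize all ranks."""
    if rank == 0:
        func()
    group.barrier()


def launch(world_size, init_group):
    """Simulate a recipe start and report what each rank observed."""
    group = FakeGroup(world_size)
    if init_group:
        group.init()
    outcomes = {}

    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "train.json")

        def prepare():
            time.sleep(0.2)  # stands in for slow data preparation
            with open(path, "w") as f:
                json.dump({"utt1": {"wav": "utt1.wav"}}, f)

        def rank_main(rank):
            run_on_main(group, rank, prepare)
            try:
                with open(path) as f:
                    json.load(f)
                outcomes[rank] = "ok"
            except FileNotFoundError:
                outcomes[rank] = "FileNotFoundError"

        threads = [
            threading.Thread(target=rank_main, args=(r,))
            for r in range(world_size)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    return outcomes


if __name__ == "__main__":
    print(launch(world_size=3, init_group=True))
    print(launch(world_size=3, init_group=False))
```

With the group initialized, every rank finds the file. Without it, only rank 0 does, which matches the symptom reported in this issue under the stated assumption.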
Describe the bug
When I start the hifigan training on ljspeech I get the error FileNotFoundError: [Errno 2] No such file or directory: './results/hifi_gan/1234/save/train.json'
I looked for the train.json and could not find it. I guess it should be created by the ljspeech_prepare.py script but it is not.
Expected behaviour
I expect the train.json to be created automatically when I start the training.
To Reproduce
No response
Versions
No response
Relevant log output
No response
Additional context
No response