
[Bug]: Training hifigan on ljspeech results in FileNotFoundError for train.json #1777

Closed
padmalcom opened this issue Jan 1, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@padmalcom
Contributor

Describe the bug

When I start the hifigan training on ljspeech I get the error FileNotFoundError: [Errno 2] No such file or directory: './results/hifi_gan/1234/save/train.json'

I looked for train.json and could not find it. I assume it should be created by the ljspeech_prepare.py script, but it is not.

Expected behaviour

I expect train.json to be created automatically when I start the training.

To Reproduce

No response

Versions

No response

Relevant log output

No response

Additional context

No response

padmalcom added the bug label Jan 1, 2023
@padmalcom
Contributor Author

Okay, it seems the training cannot be started in distributed mode. Running on a single GPU does not produce the error.

@TParcollet
Collaborator

TParcollet commented Jan 1, 2023

Hi @padmalcom, there is no reason why it could not be started with DDP. @Adel-Moumen @anautsch @BenoitWang please have a look.

TParcollet reopened this Jan 1, 2023
@padmalcom
Contributor Author

Hi @TParcollet, thanks for taking a look. A single GPU training runs fine with this command:

python train.py hparams/train.yaml --data_folder=/home/user1/training

... while distributed training fails with the mentioned exception. The command is:

python -m torch.distributed.launch --nproc_per_node=7 train.py hparams/train.yaml --distributed_launch --data_folder=/home/user1/training --distributed_backend='nccl'

Is there an issue with the arguments? I did not find any information on how hifi_gan is trained using DDP, so I assumed it is trained similarly to tacotron2.

@TParcollet
Collaborator

It is most likely a missing run_on_main somewhere. It may also crash if the whole data preparation takes longer than 15 minutes, for instance on a very slow NFS. I believe 15 minutes is the default maximum DDP waiting time for the non-main processes. Hence, if the main process spends longer than 15 minutes on data preparation, the other processes stop waiting and fail because train.json has not been created yet.
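
Roughly what I mean by the run_on_main guard (a sketch only; the exact prepare_ljspeech arguments depend on the recipe and the values below are illustrative):

import sys
import speechbrain as sb
from ljspeech_prepare import prepare_ljspeech  # the recipe's data preparation script

hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])

# Only the main process runs the (potentially slow) data preparation;
# the other DDP processes wait here until it returns, and only then
# try to read the generated train/valid/test JSON manifests.
sb.utils.distributed.run_on_main(
    prepare_ljspeech,
    kwargs={
        "data_folder": "/home/user1/training",          # illustrative values
        "save_folder": "./results/hifi_gan/1234/save",
    },
)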

@padmalcom
Copy link
Contributor Author

The training (including the preprocessing) stops right after I submit the command, so it is not a timeout. When I do the preprocessing with only one GPU and without DDP, the data is processed fine, and if I then interrupt and restart with DDP, the error is somewhat different:

=========WARNING=========: Please add sb.ddp_init_group() into your exp.py To use DDP backend, start your script with...

So DDP might not be initialized correctly... but that is only a guess.

@anautsch
Collaborator

anautsch commented Jan 3, 2023

@TParcollet a side note: we do not have dedicated DDP testing - I can look into automating this AFTER the refactoring PRs are through. (That follow-up PR could be, e.g., on reducing recipe testing time & DDP.)

The LJSpeech train recipe lacks the DDP init line - compare:

sb.utils.distributed.ddp_init_group(run_opts)
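
In the recipe's main block, that call would sit right after argument parsing, roughly like this (abbreviated sketch, not the full script):

import sys
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml

if __name__ == "__main__":
    # Parse CLI arguments, including --distributed_launch and --distributed_backend
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])

    # Initialize the DDP process group before data preparation or training,
    # so run_on_main can synchronize the non-main processes correctly.
    sb.utils.distributed.ddp_init_group(run_opts)

    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)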

@padmalcom
Contributor Author

Adding the mentioned line seems to do the trick, thanks @anautsch. The JSON file(s) are created now, even with DDP.

anautsch closed this as completed Jan 3, 2023
@padmalcom
Contributor Author

FYI I created a short PR for this issue: #1781
