
[Bug]: Training hifigan on ljspeech results in FileNotFoundError for train.json #1777

Closed
padmalcom opened this issue Jan 1, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@padmalcom
Contributor

Describe the bug

When I start the hifigan training on ljspeech I get the error FileNotFoundError: [Errno 2] No such file or directory: './results/hifi_gan/1234/save/train.json'

I looked for train.json and could not find it. I assume it should be created by the ljspeech_prepare.py script, but it is not.

Expected behaviour

I expect train.json to be created automatically when I start the training.

To Reproduce

No response

Versions

No response

Relevant log output

No response

Additional context

No response

padmalcom added the bug label Jan 1, 2023
@padmalcom
Contributor Author

Okay, it seems the training cannot be started in distributed mode. Running on a single GPU does not produce the error.

@TParcollet
Collaborator

TParcollet commented Jan 1, 2023

Hi @padmalcom, there is no reason why it could not be started with DDP. @Adel-Moumen @anautsch @BenoitWang please have a look.

TParcollet reopened this Jan 1, 2023
@padmalcom
Contributor Author

Hi @TParcollet, thanks for taking a look. A single GPU training runs fine with this command:

python train.py hparams/train.yaml --data_folder=/home/user1/training

... while distributed training fails with the mentioned exception. The command is:

python -m torch.distributed.launch --nproc_per_node=7 train.py hparams/train.yaml --distributed_launch --data_folder=/home/user1/training --distributed_backend='nccl'

Is there an issue with the arguments? I did not find any information on how hifi_gan is trained using DDP, so I assumed it is trained similarly to tacotron2.

@TParcollet
Collaborator

It is most likely a missing run_on_main somewhere. It may also crash if the whole data preparation takes longer than 15 minutes, for instance on a very slow NFS. I believe 15 minutes is the default maximum DDP waiting time for the non-main processes. Hence, if the main process spends longer than 15 minutes on data preparation, the other processes stop waiting and fail because train.json has not been created yet.
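
Roughly what I mean by the run_on_main guard (a sketch only; the exact prepare_ljspeech arguments depend on the recipe and the values below are illustrative):

import sys
import speechbrain as sb
from ljspeech_prepare import prepare_ljspeech  # the recipe's data preparation script

hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])

# Only the main process runs the (potentially slow) data preparation;
# the other DDP processes wait here until it returns, and only then
# try to read the generated train/valid/test JSON manifests.
sb.utils.distributed.run_on_main(
    prepare_ljspeech,
    kwargs={
        "data_folder": "/home/user1/training",          # illustrative values
        "save_folder": "./results/hifi_gan/1234/save",
    },
)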

@padmalcom
Copy link
Contributor Author

The training (including the preprocessing) stops right after I submit the command, so it is not a timeout. When I do the preprocessing with only one GPU and without DDP, the data is processed fine, and if I then interrupt and restart with DDP, the error is somewhat different:

=========WARNING=========: Please add sb.ddp_init_group() into your exp.py To use DDP backend, start your script with...

So DDP might not be initialized correctly... but that is only a guess.

@anautsch
Collaborator

anautsch commented Jan 3, 2023

@TParcollet a side note: we do not have dedicated DDP testing - I can look into automating this AFTER the refactoring PRs are through. (That follow-up PR could be, e.g., on reducing recipe testing time & DDP.)

The LJSpeech train recipe lacks the DDP init line - compare:

sb.utils.distributed.ddp_init_group(run_opts)
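
In the recipe's main block, that call would sit right after argument parsing, roughly like this (abbreviated sketch, not the full script):

import sys
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml

if __name__ == "__main__":
    # Parse CLI arguments, including --distributed_launch and --distributed_backend
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])

    # Initialize the DDP process group before data preparation or training,
    # so run_on_main can synchronize the non-main processes correctly.
    sb.utils.distributed.ddp_init_group(run_opts)

    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)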

@padmalcom
Contributor Author

Adding the mentioned line seems to do the trick, thanks @anautsch. The JSON file(s) are created now, even with DDP.

anautsch closed this as completed Jan 3, 2023
@padmalcom
Contributor Author

FYI I created a short PR for this issue: #1781
