Duration predictor training is really slow. #7
Closed · erogol opened this issue Jan 23, 2023 · 8 comments

erogol commented Jan 23, 2023

I observe very slow progress with the duration loss in the second stage of training. Is this expected, or can you think of any issue that might be causing it?

For each epoch, the eval loss goes 2.21 -> 2.20 -> 2.18 ..., whereas the F0 loss converged very quickly.

BTW I am using VCTK + LibriTTS.

I also tried reducing the dropout to 0.1 for the duration projection layer, but it didn't help.
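
For reference, this is the kind of change I mean; the module names and hidden width here are hypothetical, not the repo's actual code:

```python
import torch.nn as nn

hidden_dim = 512  # assumed hidden width, for illustration only

# Hypothetical duration projection head with the reduced dropout I tried.
duration_proj = nn.Sequential(
    nn.Dropout(0.1),           # lowered to 0.1
    nn.Linear(hidden_dim, 1),  # one duration logit per phoneme
)
```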

yl4579 (Owner) commented Jan 24, 2023

I have checked my learning curve on LibriTTS, which goes from 0.730 -> 0.714 -> 0.705 ..., so I'm not sure what's happening with your settings. It could be because of the implementation difference in the alignment. I will change the code to include both implementations and let you choose which one you want to use.

The current implementation uses cross-entropy loss for alignment learning and reads the attention from the mel dimension. In contrast, the old one (from the original paper) uses L1 loss for alignment and reads the attention from the text dimension. The old one is more stable in training because the monotonic loss aligns with the S2S loss, but it produces a worse alignment than the new one: strictly speaking, it learns a text alignment rather than a mel-spectrogram alignment, though it works nevertheless.
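
To make the difference concrete, here is a minimal sketch of the two variants, assuming a soft alignment `attn` of shape (batch, mel_len, text_len) from the ASR model's cross-attention; the names and shapes are illustrative, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def s2s_alignment_loss(attn, text_targets):
    # New variant: treat each mel frame as a classification over text
    # tokens and apply cross-entropy (alignment read along the mel axis).
    # attn: (B, T_mel, T_text) soft attention, rows summing to 1;
    # text_targets: (B, T_mel) LongTensor with the index of the text
    # token each mel frame should attend to.
    log_attn = torch.log(attn.clamp_min(1e-8))
    return F.nll_loss(log_attn.transpose(1, 2), text_targets)

def l1_alignment_loss(attn, target_attn):
    # Old variant (original paper): L1 between the soft attention and a
    # target monotonic alignment (alignment read along the text axis).
    return F.l1_loss(attn, target_attn)
```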

erogol (Author) commented Jan 24, 2023

Have you tried https://arxiv.org/abs/2108.10447?

We use it in 🐸TTS and it works well.

PS: I am trying to implement a StyleTTS version that is compatible with Coqui.

amitaie commented Jan 24, 2023

> PS: I am trying to implement a StyleTTS version that is compatible with Coqui.

That is great! Looking forward to that.

yl4579 (Owner) commented Jan 24, 2023

@erogol I have updated the repo to include the original implementation and made it the default. You can try training again and see if the problem persists.

The monotonic loss is very similar to https://arxiv.org/abs/2108.10447 (I have included two implementations, one using L1 loss and another using cross-entropy loss). Still, the forward sum loss (CTC loss) is likely worse than our S2S loss, because we use actual autoregressive cross-attention instead of dot products from CNN models, so we believe the S2S loss is better than the CTC loss.
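
For comparison, the forward sum loss from that paper is usually implemented with CTC over the log-attention map; here is a rough sketch, with names and shapes that are assumptions rather than this repo's code:

```python
import torch
import torch.nn.functional as F

def forward_sum_loss(attn_logprob, text_lens, mel_lens, blank_logprob=-1.0):
    # attn_logprob: (B, T_mel, T_text) unnormalized attention scores;
    # text_lens / mel_lens: 1-D LongTensors with the true length per sample.
    ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
    # Prepend a constant "blank" column at index 0 for CTC's blank symbol.
    attn_logprob = F.pad(attn_logprob, (1, 0), value=blank_logprob)
    log_probs = F.log_softmax(attn_logprob, dim=-1)
    total = 0.0
    for b in range(log_probs.size(0)):
        t_len, m_len = int(text_lens[b]), int(mel_lens[b])
        # The target sequence is just the token indices 1..t_len in order,
        # so CTC's forward algorithm sums over all monotonic alignments.
        target = torch.arange(1, t_len + 1, device=log_probs.device)
        lp = log_probs[b, :m_len, : t_len + 1]        # (T, C)
        total = total + ctc(lp.unsqueeze(1),          # (T, 1, C)
                            target.unsqueeze(0),      # (1, S)
                            mel_lens[b : b + 1],
                            text_lens[b : b + 1])
    return total / log_probs.size(0)
```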

By the way, thanks for your effort in incorporating StyleTTS into Coqui!

yl4579 closed this as completed Jan 24, 2023
yl4579 reopened this Jan 24, 2023
erogol (Author) commented Jan 24, 2023

Thanks for the update. I'll give it a try.

One last question; I hope I'm not being too annoying.

If we wanted to use the same sequence length for the durations and the mel frames, what changes would we need to make?

I guess disabling the upsampling and downsampling is one. Is there anything else that comes to mind?

yl4579 (Owner) commented Jan 25, 2023

You can remove the downsampling in the ASR model and the upsampling in the decoder and retrain both models. But an easier way is simply to upsample the attention using interpolation, without retraining the ASR model. Since a phoneme rarely takes less than one frame of the mel spectrogram (12.5 ms in our case), it is safe to assume the interpolated attention will be the same as the attention without downsampling.
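
A minimal sketch of that interpolation trick, assuming the ASR model downsamples the mel axis by a factor of 2 (the function and tensor names are illustrative):

```python
import torch.nn.functional as F

def upsample_attention(attn, factor=2):
    # attn: (batch, text_len, mel_len // factor), the soft alignment from
    # the downsampled ASR model. Nearest-neighbour interpolation along the
    # mel axis approximates the alignment without downsampling, since a
    # phoneme rarely spans less than one 12.5 ms frame.
    return F.interpolate(attn, scale_factor=factor, mode="nearest")

# Per-phoneme durations in full-resolution mel frames:
# durations = upsample_attention(attn).sum(dim=-1)
```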

erogol (Author) commented Jan 25, 2023

That makes sense, thanks.

erogol (Author) commented Jan 27, 2023

I'm closing this as I was able to train the duration predictor. Thanks for all the help.
