Duration predictor training is really slow. #7
Closed · erogol opened this issue Jan 23, 2023 · 8 comments

erogol commented Jan 23, 2023

I observe very slow progress with the duration loss in the second stage of training. Is this expected, or can you think of any issue that might be causing it?

For each epoch, the eval loss goes 2.21 -> 2.20 -> 2.18 ..., whereas the F0 loss converged very quickly.

BTW I am using VCTK + LibriTTS.

I also tried reducing the dropout to 0.1 for the duration projection layer, but it didn't help.
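
For reference, this is the kind of change I mean; the module names and hidden width here are hypothetical, not the repo's actual code:

```python
import torch.nn as nn

hidden_dim = 512  # assumed hidden width, for illustration only

# Hypothetical duration projection head with the reduced dropout I tried.
duration_proj = nn.Sequential(
    nn.Dropout(0.1),           # lowered to 0.1
    nn.Linear(hidden_dim, 1),  # one duration logit per phoneme
)
```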

yl4579 (Owner) commented Jan 24, 2023

I have checked my learning curve on LibriTTS, which goes from 0.730 -> 0.714 -> 0.705 ..., so I'm not sure what's happening with your settings. It could be because of the implementation difference in the alignment. I will change the code to include both implementations and let you choose which one you want to use.

The current implementation uses cross-entropy loss for alignment learning and reads the attention from the mel dimension. In contrast, the old one (from the original paper) uses L1 loss for alignment and reads the attention from the text dimension. The old one is more stable in training because the monotonic loss aligns with the S2S loss, but it produces a worse alignment than the new one: strictly speaking, it learns a text alignment rather than a mel-spectrogram alignment, though it works nevertheless.
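
To make the difference concrete, here is a minimal sketch of the two variants, assuming a soft alignment `attn` of shape (batch, mel_len, text_len) from the ASR model's cross-attention; the names and shapes are illustrative, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def s2s_alignment_loss(attn, text_targets):
    # New variant: treat each mel frame as a classification over text
    # tokens and apply cross-entropy (alignment read along the mel axis).
    # attn: (B, T_mel, T_text) soft attention, rows summing to 1;
    # text_targets: (B, T_mel) LongTensor with the index of the text
    # token each mel frame should attend to.
    log_attn = torch.log(attn.clamp_min(1e-8))
    return F.nll_loss(log_attn.transpose(1, 2), text_targets)

def l1_alignment_loss(attn, target_attn):
    # Old variant (original paper): L1 between the soft attention and a
    # target monotonic alignment (alignment read along the text axis).
    return F.l1_loss(attn, target_attn)
```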

erogol (Author) commented Jan 24, 2023

Have you tried https://arxiv.org/abs/2108.10447?

We use it in 🐸TTS and it works well.

PS: I am trying to implement a StyleTTS version that is compatible with Coqui.

amitaie commented Jan 24, 2023

> PS: I am trying to implement a StyleTTS version that is compatible with Coqui.

That is great! Looking forward to that.

yl4579 (Owner) commented Jan 24, 2023

@erogol I have updated the repo to include the original implementation and made it the default. You can try training again and see if the problem persists.

The monotonic loss is very similar to https://arxiv.org/abs/2108.10447 (I have included two implementations, one using L1 loss and another using cross-entropy loss). Still, the forward sum loss (CTC loss) is likely worse than our S2S loss, because we use actual autoregressive cross-attention instead of dot products from CNN models, so we believe the S2S loss is better than the CTC loss.
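
For comparison, the forward sum loss from that paper is usually implemented with CTC over the log-attention map; here is a rough sketch, with names and shapes that are assumptions rather than this repo's code:

```python
import torch
import torch.nn.functional as F

def forward_sum_loss(attn_logprob, text_lens, mel_lens, blank_logprob=-1.0):
    # attn_logprob: (B, T_mel, T_text) unnormalized attention scores;
    # text_lens / mel_lens: 1-D LongTensors with the true length per sample.
    ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
    # Prepend a constant "blank" column at index 0 for CTC's blank symbol.
    attn_logprob = F.pad(attn_logprob, (1, 0), value=blank_logprob)
    log_probs = F.log_softmax(attn_logprob, dim=-1)
    total = 0.0
    for b in range(log_probs.size(0)):
        t_len, m_len = int(text_lens[b]), int(mel_lens[b])
        # The target sequence is just the token indices 1..t_len in order,
        # so CTC's forward algorithm sums over all monotonic alignments.
        target = torch.arange(1, t_len + 1, device=log_probs.device)
        lp = log_probs[b, :m_len, : t_len + 1]        # (T, C)
        total = total + ctc(lp.unsqueeze(1),          # (T, 1, C)
                            target.unsqueeze(0),      # (1, S)
                            mel_lens[b : b + 1],
                            text_lens[b : b + 1])
    return total / log_probs.size(0)
```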

By the way, thanks for your effort in incorporating StyleTTS into Coqui!

yl4579 closed this as completed Jan 24, 2023
yl4579 reopened this Jan 24, 2023
erogol (Author) commented Jan 24, 2023

Thanks for the update. I'll give it a try.

One last question; I hope I'm not being too annoying.

If we wanted to use the same sequence length for the durations and the mel frames, what changes would we need to make?

I guess disabling the upsampling and downsampling is one. Is there anything else that comes to mind?

yl4579 (Owner) commented Jan 25, 2023

You can remove the downsampling in the ASR model and the upsampling in the decoder and retrain both models. But an easier way is simply to upsample the attention using interpolation, without retraining the ASR model. Since a phoneme rarely takes less than one frame of the mel spectrogram (12.5 ms in our case), it is safe to assume the interpolated attention will be the same as the attention without downsampling.
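
A minimal sketch of that interpolation trick, assuming the ASR model downsamples the mel axis by a factor of 2 (the function and tensor names are illustrative):

```python
import torch.nn.functional as F

def upsample_attention(attn, factor=2):
    # attn: (batch, text_len, mel_len // factor), the soft alignment from
    # the downsampled ASR model. Nearest-neighbour interpolation along the
    # mel axis approximates the alignment without downsampling, since a
    # phoneme rarely spans less than one 12.5 ms frame.
    return F.interpolate(attn, scale_factor=factor, mode="nearest")

# Per-phoneme durations in full-resolution mel frames:
# durations = upsample_attention(attn).sum(dim=-1)
```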

erogol (Author) commented Jan 25, 2023

That makes sense, thanks.

erogol (Author) commented Jan 27, 2023

I'm closing this as I was able to train the duration predictor. Thanks for all the help.
