How to extract alignment from tacotron2? #88

Open
CanKorkut opened this issue Sep 24, 2020 · 6 comments

CanKorkut commented Sep 24, 2020

Hi,

I want to try FastSpeech on a different dataset. Can you share how to extract alignments from Tacotron 2?

I tried the code below, but the synthesis quality is poor when I run inference on long sentences.

_, _, _, alignments = model.inference(sequence)
d = alignments.float().data.cpu().numpy()[0].T
x = np.zeros(d.shape[0])
for i, y in enumerate(d):
    x[i] = y.sum()
np.save("path_to_save_folder" + name + ".npy", x.astype(np.dtype('i4')))

Thank you.
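One possible cause of the poor results: summing the soft attention weights and casting to `i4` truncates every duration toward zero, so the saved durations add up to fewer frames than the mel actually has. A common alternative is to assign each mel frame to its most attended input token and count frames per token. The sketch below is only an illustration, not the removed alignments.py; it assumes a single-utterance attention matrix of shape (decoder steps, encoder steps), and the function name and toy matrix are made up for the example.

```python
import numpy as np

def durations_from_attention(alignment):
    """Hard durations from one attention map of shape (decoder_steps, encoder_steps)."""
    alignment = np.asarray(alignment)
    n_tokens = alignment.shape[1]
    # For every decoder step (mel frame), pick the most attended input token.
    best_token = alignment.argmax(axis=1)
    # Count how many frames were assigned to each token.
    return np.bincount(best_token, minlength=n_tokens).astype(np.int32)

# Toy attention map: 5 mel frames over 3 input tokens (hypothetical values).
toy = np.array([
    [0.9, 0.1, 0.0],
    [0.6, 0.4, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
    [0.0, 0.2, 0.8],
])

durations = durations_from_attention(toy)

# The sum-then-truncate approach from the snippet above, for comparison.
truncated = toy.sum(axis=0).astype(np.int32)

print(durations, durations.sum())   # per-token frame counts and their total
print(truncated, truncated.sum())   # truncated soft sums and their total
```

With the argmax version the durations always sum to the number of decoder steps, which is what a length regulator needs; the truncated sums do not have that property.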

prorev commented Oct 2, 2020

What are the alignments used for, after all? The Tacotron 2 paper does not mention alignments.

prorev commented Oct 2, 2020

I found this in the FastSpeech 2 paper:

The training of FastSpeech relies on an autoregressive teacher model to provide 1) the duration of each phoneme to train a duration predictor, and 2) the generated mel-spectrograms for knowledge distillation. While these designs in FastSpeech ease the learning of the one-to-many mapping problem in TTS, they also bring several disadvantages: 1) the two-stage teacher-student distillation pipeline is complicated; 2) the duration extracted from the attention map of the teacher model is not accurate enough, and the target mel-spectrograms distilled from the teacher model suffer from information loss due to data simplification, both of which limit the voice quality and prosody.

This makes it clear that you need another trained model (the teacher) before you can train FastSpeech on a custom dataset, which is not ideal.

So the alignments are a big problem, because training depends on them: no alignments, no training. The FastSpeech paper is worth reading to understand how this is done in principle, but FastSpeech is probably not the best choice if you want training that works out of the box.

An alignments.py file was present in this project before but was removed. Commit id: e11b60d; no commit message was set to explain the change.
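To make the role of the durations concrete: in FastSpeech the per-phoneme durations tell a length regulator how many mel frames each phoneme's hidden state should be repeated for. A minimal NumPy sketch of that expansion follows; the function name and the toy values are hypothetical, and a real model would do this on batched tensors.

```python
import numpy as np

def length_regulate(hidden, durations):
    """Repeat each phoneme's hidden vector durations[i] times along the time axis."""
    hidden = np.asarray(hidden)
    durations = np.asarray(durations, dtype=np.int64)
    if hidden.shape[0] != durations.shape[0]:
        raise ValueError("need exactly one duration per phoneme")
    return np.repeat(hidden, durations, axis=0)

# Three phonemes with 2-dimensional hidden states (toy values).
phoneme_states = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])

# A duration of 0 drops the phoneme entirely, which is one way bad
# alignments can produce missing sounds.
expanded = length_regulate(phoneme_states, [2, 0, 3])
print(expanded.shape)
```

This is why the duration file needs one entry per input symbol and a total equal to the number of target mel frames.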

@CanKorkut
Author

Thank you, I found alignments.py in a previous commit and tried it. The resulting synthesis quality is not bad, but for sentences longer than five or six words the output stutters and drops letters. I am now trying FastSpeech2. Alignments really are a big problem.
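Stuttering and dropped letters on long sentences are often a sign that the teacher's attention is diffuse for those utterances. The FastSpeech paper uses a "focus rate" (the average, over mel frames, of the largest attention weight) to pick a good attention head; reusing it as a per-utterance quality check, as sketched below, is my assumption and not something this repository does. The function name is hypothetical.

```python
import numpy as np

def focus_rate(alignment):
    """Mean over decoder steps of the largest attention weight in each step.

    alignment has shape (decoder_steps, encoder_steps); values close to 1.0
    mean every frame attends sharply to a single token.
    """
    alignment = np.asarray(alignment, dtype=np.float64)
    return float(alignment.max(axis=1).mean())

sharp = np.eye(4)                # perfectly diagonal toy attention
blurry = np.full((4, 4), 0.25)   # uniform toy attention, no usable alignment

print(focus_rate(sharp), focus_rate(blurry))
```

Utterances with a low focus rate could be excluded before extracting durations; what threshold to use would have to be tuned per dataset.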

@cuongnguyengit

Hi, I have the same question. I am also trying to train FastSpeech2 on my language, but getting the alignments is really difficult.
My Tacotron 2 model trained very well on my dataset, so its alignments should be good, but the FastSpeech2 synthesis is quite bad.
The output seems to be understandable, but the sounds are mixed up. So my question is whether the durations generated by Tacotron match the mels, energies, and pitches generated by librosa or the TacotronSTFT module. It is hard for me to explain this problem, or to see how FastSpeech2 manages to produce good-quality audio. Thanks
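One way to check whether the durations match the features is to compare their sum with the number of mel frames. Note that durations taken from free-running inference add up to the length of the mel the teacher generated, which need not equal the length of a mel computed from the recording. The sketch below assumes a centered STFT (the default `center=True` in `librosa.stft`), for which the frame count is `1 + n_samples // hop_length`; the helper names and numbers are hypothetical.

```python
def centered_frame_count(n_samples, hop_length):
    """Number of frames produced by a centered STFT over n_samples samples."""
    return 1 + n_samples // hop_length

def duration_frame_gap(durations, n_mel_frames):
    """Positive: durations cover too many frames; negative: too few; 0: consistent."""
    return sum(int(d) for d in durations) - n_mel_frames

# Hypothetical utterance: one second of audio at 22050 Hz, hop length 256.
n_frames = centered_frame_count(22050, 256)

# Durations that happen to match the ground-truth mel computed from the audio.
gap_ground_truth = duration_frame_gap([30, 27, 30], n_frames)

# The same durations compared with a 90-frame mel from a different source.
gap_other_mel = duration_frame_gap([30, 27, 30], 90)

print(n_frames, gap_ground_truth, gap_other_mel)
```

A non-zero gap means the durations and the mel, pitch, and energy features were not derived from the same frame sequence.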

@CanKorkut
Author

CanKorkut commented Jan 5, 2021

image
I researched this problem and found something about the reduction factor. I don't fully understand the architecture, but it seems Tacotron learns alignments more easily with a large reduction factor; however, the NVIDIA Tacotron 2 implementation has no reduction factor. Maybe NVIDIA's Tacotron is good for synthesis but bad for extracting alignments. I'm not sure; I will research further and edit this comment.
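For context on the reduction factor: with a reduction factor r the decoder emits r mel frames per step, so the attention map has one row per decoder step rather than per frame, and durations counted in decoder steps have to be scaled by r to be expressed in frames. The sketch below illustrates that conversion under these assumptions; the function name is hypothetical, and trimming surplus frames from the final tokens is just one simple choice.

```python
import numpy as np

def steps_to_frames(step_durations, reduction_factor, n_mel_frames=None):
    """Convert durations counted in decoder steps into durations in mel frames."""
    frames = np.asarray(step_durations, dtype=np.int64) * reduction_factor
    if n_mel_frames is not None:
        # The mel is typically padded to a multiple of r, so the scaled
        # durations may overshoot; remove the surplus from the last tokens.
        surplus = int(frames.sum()) - n_mel_frames
        if surplus < 0:
            raise ValueError("durations cover fewer frames than the mel has")
        i = len(frames) - 1
        while surplus > 0 and i >= 0:
            take = min(surplus, int(frames[i]))
            frames[i] -= take
            surplus -= take
            i -= 1
    return frames

plain = steps_to_frames([2, 1, 3], reduction_factor=2)
trimmed = steps_to_frames([2, 1, 3], reduction_factor=2, n_mel_frames=11)
print(plain, trimmed)
```

With r = 1, as in a model without a reduction factor, the conversion is the identity and the durations are already in frames.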

@khainh3101

@CanKorkut Hi, I'm using that alignment.py (commit id: e11b60d) to extract alignment files, but the results have different dimensions from the LJSpeech alignment files already included in this FastSpeech source code. Can you show me the code you used to extract alignment files, so that I can train on another language? Thank you.
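A dimension mismatch like this can be narrowed down by checking each extracted file against the inputs it is supposed to describe. The sketch below assumes each duration file should be a 1-D array with one entry per input symbol (after the same text cleaning that training uses) and a total equal to the number of mel frames; that layout is my assumption, not something confirmed for this repository, and the function name is hypothetical.

```python
import numpy as np

def validate_durations(durations, n_symbols, n_mel_frames):
    """Return a list of human-readable problems; an empty list means consistent."""
    durations = np.asarray(durations)
    problems = []
    if durations.ndim != 1:
        problems.append(f"expected a 1-D array, got shape {durations.shape}")
        return problems
    if durations.shape[0] != n_symbols:
        problems.append(
            f"{durations.shape[0]} durations for {n_symbols} input symbols"
        )
    if int(durations.sum()) != n_mel_frames:
        problems.append(
            f"durations sum to {int(durations.sum())}, mel has {n_mel_frames} frames"
        )
    return problems

ok = validate_durations([3, 4, 5], n_symbols=3, n_mel_frames=12)
bad = validate_durations([3, 4], n_symbols=3, n_mel_frames=12)
wrong_shape = validate_durations(np.zeros((3, 2)), n_symbols=3, n_mel_frames=12)
print(ok, bad, wrong_shape)
```

If the symbol count differs, the text front end used for extraction (cleaners, symbol set, added end-of-sequence token) probably differs from the one used for training.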
