Training with Tacotron GTA mel specs #52
I think you're in the same case as: |
@Rayhane-mamah thanks for the quick reply. But I think my case is different, since I am using aligned spectrograms output from your Tacotron (not real spectrograms generated from training audio). |
Oh in that case it's something worth looking at! :) Could you report tacotron model params? Reduction factor, training steps, etc? If it's caused by the feature prediction network I'll find the issue. |
I think the spectrograms from your Tacotron are good. I trained for 137,000 steps with the standard settings (no changes in hparams) and synthesised GTA. Here are the plots... When training the WaveNet vocoder on real spectrograms I get decent results after about 50k steps (8-bit softmax), but using GTA Tacotron-generated spectrograms I'm not getting good quality even after 100k+ steps (see the zip example I posted previously). Any help would be greatly appreciated! |
Judging from the loss value and the alignments, it seems you trained your model using reduction factor r=5. Try using this model to generate GTA mels and retrain the WaveNet if you can. The reduction factor is the first difference I can think of between our work and the original T2 paper, so that's the first thing I would suspect to cause such a quality failure. The given model's current state: |
Ah of course. I should have noticed that. Thanks for the pretrained model I will try with r=1 |
A little off topic: I'm wondering if you really get such a smooth and natural (at a glance) mel-spectrogram. Did you use https://github.com/Rayhane-mamah/Tacotron-2/blob/1547b2502305f4ee58bceede1384054c22b0497a/tacotron/utils/plot.py#L36-L38 for plotting the mel-spectrogram? If so, can you try additional params?
I remember I had a hard time getting a smooth mel-spectrogram when I was working on DeepVoice3 (and even on Tacotron). I think RNNs can do better than CNNs, but I'm curious whether Tacotron 2 actually performs better than DeepVoice3 and Tacotron. |
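One way to rule out renderer smoothing when comparing mel plots: disable image interpolation so apparent smoothness comes from the data, not the plot. This is only a sketch, not the repo's `plot.py`; the `interpolation` keyword is my suggestion for the kind of "additional param" worth trying, not necessarily the one meant above.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def plot_mel(mel, path="mel.png"):
    """Plot a (frames, n_mels) mel spectrogram with interpolation disabled."""
    fig, ax = plt.subplots(figsize=(10, 4))
    im = ax.imshow(mel.T, aspect="auto", origin="lower", interpolation="none")
    fig.colorbar(im, ax=ax)
    ax.set_xlabel("frame")
    ax.set_ylabel("mel channel")
    fig.savefig(path)
    plt.close(fig)

plot_mel(np.random.rand(200, 80), "mel.png")
```

With `interpolation="none"`, a genuinely noisy spectrogram will look noisy at any zoom level.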
@r9y9 @Rayhane-mamah correct me if I'm wrong: if I use external mel spectrograms from Taco2, then I only need to make sure the Taco2 specs have the same shape as the WaveNet vocoder specs generated by preprocessing (i.e. same sample rate, same hop size, 80-dimensional)? I presume they do not need to have exactly the same frequency ranges or amplitudes? |
@duvtedudug Thanks! Looks good. I should definitely check out the details of Tacotron2. |
@r9y9 @Rayhane-mamah no problem. Can anybody confirm my question above? re: Taco specs shape only requirement? |
I think it should be no problem. I have set my hparams exactly as in r9y9's wavenet. The only difference in our preprocessing is that I set my output distribution to [-4, 4] to allow for possibly better detail in the mel spectrogram reconstruction.
If there is a WaveNet limitation requiring local conditioning to be in [0, 1], you could try shifting and rescaling the Tacotron output. As far as I know there is no such limitation, but it would be nice to test it out if quality is still poor.
|
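If rescaling does turn out to be needed, mapping the Tacotron output range into [0, 1] is a one-liner. A minimal sketch, assuming the [-4, 4] normalization range mentioned above:

```python
import numpy as np

def rescale_mel(mel, src_min=-4.0, src_max=4.0):
    """Linearly map a mel spectrogram from [src_min, src_max] into [0, 1]."""
    mel = np.clip(mel, src_min, src_max)  # guard against values outside the range
    return (mel - src_min) / (src_max - src_min)
```

The clip makes the mapping safe even if a few predicted values overshoot the training range.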
Thanks! @r9y9 is there a WaveNet limitation requiring local conditioning to be in [0, 1]? Maybe I should just be more patient and train for 200-300k steps? |
Yeah, insufficient training might also be the cause, since GTA mels are noisier than real ones and are in a wider range (I'm supposing weights are initialized close to 0).
I will leave the final word to the wavenet expert, however :)
|
There's no limitation on the range of conditional features. Notice that the current implementation assumes the range [0, 1] for simplicity: wavenet_vocoder/wavenet_vocoder/wavenet.py, lines 162 to 164 in 4d5f68c
|
Are you sure you did the correct time resolution adjustment, as I do in https://github.com/r9y9/deepvoice3_pytorch/blob/3226e415ef1d8412bb159b228aa3c9212fdb892e/generate_aligned_predictions.py#L38-L42? Also, did you use exactly the same audio feature extraction pipeline for Tacotron 2 and WaveNet? If you did both correctly, then I think you should just be more patient. As I mentioned in #45 (comment), it will generally take ~1000k steps to get sufficiently good quality with a MoL output layer. |
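For anyone unsure what the time resolution adjustment amounts to: the waveform length must equal `num_frames * hop_size` so audio samples and conditioning frames stay exactly aligned. A minimal sketch (not the repo's exact code; `hop_size=256` is an assumed value):

```python
import numpy as np

def adjust_time_resolution(wav, mel, hop_size=256):
    """Pad or trim wav so len(wav) == mel.shape[0] * hop_size."""
    target_len = mel.shape[0] * hop_size
    if len(wav) < target_len:
        # zero-pad the tail so every mel frame has hop_size samples under it
        wav = np.pad(wav, (0, target_len - len(wav)), mode="constant")
    else:
        wav = wav[:target_len]
    return wav, mel
```

If this alignment is off by even a few frames, the conditioning features drift relative to the audio during training.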
The main difference I could see is that the WaveNet specs have a few extra frames of padding at the end. I have padded the Taco-generated specs with silence to match the audio files of the WaveNet preprocess; otherwise the audio is identical and the mel timing is aligned. There is one difference: the LWS STFT output of WaveNet seems to be roughly in the range [0.4, 0.8], and I have approximated this [0.4, 0.8] range with scaling.
I'm using 256-level mu-law quantisation for quicker results (on real specs it starts to sound good at 50k-100k steps for me). |
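For context, 256-level mu-law quantisation compands a [-1, 1] waveform and maps it to integer classes in [0, 255] for the categorical softmax output. A sketch of the standard companding formula (not necessarily the repo's exact implementation):

```python
import numpy as np

def mulaw_quantize(x, mu=255):
    """Mu-law compand x in [-1, 1], then quantize to integers in [0, mu]."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # compand to [-1, 1]
    return np.floor((y + 1) / 2 * mu + 0.5).astype(np.int64)   # map to {0..mu}
```

The companding allocates more quantisation levels near zero, where speech energy concentrates, which is why 8-bit mu-law trains faster than 16-bit linear targets.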
train_eval audio is showing some promise. The 190k file isn't there yet, but the 180k file doesn't sound too bad. |
@duvtedudug could you help to share your hpara settings in wavenet to train the synthesised GTA npy from Tacotron 2 ? |
@butterl Normal hparam settings, just changed to 256 mu-law quantised. |
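For anyone else asking, the change described above corresponds to a small hparams tweak. The names below follow r9y9/wavenet_vocoder conventions as I understand them; verify them against your own checkout's `hparams.py` before relying on this.

```python
# hparams.py fragment (verify names against your checkout of wavenet_vocoder)
input_type = "mulaw-quantize"   # mu-law companded, quantized categorical output
quantize_channels = 256         # 256-way softmax instead of the MoL output layer
```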
@r9y9 Why did you include ReLU? The WaveNet paper does not discuss using ReLU in the transposed convolution network. wavenet_vocoder/wavenet_vocoder/wavenet.py Lines 161 to 163 in 740219b
I asked the Tacotron 2 authors. They did not normalize the mel-spectrogram for WaveNet. |
The comment says the reason. I don't think it matters much. |
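For readers following along: the layers being discussed upsample the frame-rate conditioning features to sample rate before they enter the dilated convolutions. A dependency-free sketch of the simplest alternative, nearest-neighbour repetition instead of learned ConvTranspose2d (+ optional ReLU); `hop_size=256` is an assumed value:

```python
import numpy as np

def upsample_conditioning(mel, hop_size=256):
    """Repeat each (frames, n_mels) conditioning frame hop_size times so the
    result has one row per audio sample. The repo instead learns this mapping
    with stacked transposed convolutions, optionally followed by ReLU."""
    return np.repeat(mel, hop_size, axis=0)
```

Whether the extra nonlinearity after the transposed convolutions helps is exactly the open question in this exchange; repetition is the baseline both approaches must beat.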
I've generated a set of ground-truth aligned mel spectrograms from Rayhane's Tacotron-2.
I've trained for over 100k steps but still getting poor results for longer sequences...
gta-taco-wave-120k.zip
Any ideas on how to improve this?