
Training with Tacotron GTA mel specs #52

Closed
duvtedudug opened this issue Apr 23, 2018 · 23 comments

@duvtedudug

I've generated a set of ground-truth aligned mel spectrograms from Rayhane's Tacotron-2.

I've trained for over 100k steps but am still getting poor results on longer sequences...
gta-taco-wave-120k.zip

Any ideas on how to improve this?

@Rayhane-mamah

I think you're in the same case as:

Rayhane-mamah/Tacotron-2#29

@duvtedudug
Author

@Rayhane-mamah thanks for the quick reply.

But I think my case is different, since I'm using ground-truth-aligned spectrograms output by your Tacotron (not real spectrograms extracted from the training audio).

@Rayhane-mamah

Oh in that case it's something worth looking at! :)

Could you report your Tacotron model params? Reduction factor, number of training steps, etc.?

If it's caused by the feature prediction network I'll find the issue.

@duvtedudug
Author

duvtedudug commented Apr 23, 2018

I think the spectrograms from your Tacotron are good. I trained for 137,000 steps with the standard settings (no changes to hparams) and synthesised GTA mels.

Here's the plots...

[Plots at step 137000: real mel spectrogram, alignment, predicted mel spectrogram]

When training the WaveNet vocoder on real spectrograms, I get decent results after about 50k steps (8-bit softmax).

But using the GTA Tacotron-generated spectrograms, I'm not getting good quality even after 100k+ steps (see the zip example I posted previously).

Any help would be greatly appreciated!

@Rayhane-mamah

Judging from the loss value and the alignments, it seems you trained your model with reduction factor r=5 (with r=5 the decoder emits 5 mel frames per step, which tends to blur fine temporal detail compared to r=1).

Try using this model to generate GTA mels and retrain the wavenet if you can.

The reduction factor is the first difference I can think of between our work and the original T2 paper, so it's the first thing I would suspect as the cause of this kind of quality failure.

The given model's current state:

[Plots at step 136500: alignment, predicted mel spectrogram, real mel spectrogram]

@duvtedudug
Author

Ah, of course. I should have noticed that. Thanks for the pretrained model; I'll try with r=1.

@r9y9
Owner

r9y9 commented Apr 24, 2018

A little off-topic: I'm wondering whether you really get such a smooth, natural-looking (at a glance) mel spectrogram. Did you use https://github.com/Rayhane-mamah/Tacotron-2/blob/1547b2502305f4ee58bceede1384054c22b0497a/tacotron/utils/plot.py#L36-L38 for plotting the mel spectrogram? If so, can you try passing the additional parameter interpolation="none" to imshow?

plt.imshow(np.rot90(spectrogram))
plt.imshow(np.rot90(spectrogram), interpolation="none")
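
For a self-contained comparison, a minimal sketch (assuming matplotlib and a saved mel as a NumPy array of shape [T, 80]; the file name is hypothetical):

import matplotlib.pyplot as plt
import numpy as np

mel = np.load("pred_mel.npy")  # hypothetical path to a predicted mel
fig, (ax1, ax2) = plt.subplots(2, 1)
ax1.imshow(np.rot90(mel), aspect="auto")                        # version-dependent default interpolation
ax2.imshow(np.rot90(mel), aspect="auto", interpolation="none")  # raw frame values, no smoothing
plt.savefig("specs_check_interp_none.png")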

I remember having a hard time getting smooth mel spectrograms when I was working on DeepVoice3 (and even on Tacotron). I think RNNs can do better than CNNs here, but I'm curious whether Tacotron 2 actually performs better than DeepVoice3 and Tacotron.

@duvtedudug
Author

@r9y9 @Rayhane-mamah correct me if I'm wrong: if I use external mel spectrograms from Taco2, do I only need to make sure the Taco2 specs have the same shape as the WaveNet vocoder specs generated by preprocessing (i.e. same sample rate, same hop size, 80-dimensional)? I presume they do not need to have exactly the same frequency ranges or amplitudes?
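
As a sanity check I could run something like this (hypothetical file names; assumes both pipelines save mels as .npy arrays of shape [T, num_mels]):

import numpy as np

taco = np.load("taco2_gta_mel.npy")           # hypothetical: GTA mel from Tacotron-2
wn = np.load("wavenet_preprocessed_mel.npy")  # hypothetical: mel from the WaveNet preprocess
assert taco.shape == wn.shape, (taco.shape, wn.shape)
print("taco range:   ", taco.min(), taco.max())
print("wavenet range:", wn.min(), wn.max())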

@duvtedudug
Author

Here's a quick check with no interpolation:

[Plot: spectrogram comparison with interpolation="none"]

@r9y9
Owner

r9y9 commented Apr 25, 2018

@duvtedudug Thanks! Looks good. I should definitely check out the details of Tacotron2.

@duvtedudug
Author

@r9y9 @Rayhane-mamah no problem. Can anybody confirm my question above, re: whether matching shape is the only requirement for the Taco specs?

@Rayhane-mamah

Rayhane-mamah commented Apr 25, 2018 via email

@duvtedudug
Author

Thanks!

@r9y9 is there a WaveNet limitation requiring local conditioning features to be in [0, 1]?

Maybe I should be more patient and train for 200-300k steps?

@Rayhane-mamah

Rayhane-mamah commented Apr 25, 2018 via email

@r9y9
Owner

r9y9 commented Apr 25, 2018

There's no limitation on the range of the conditional features. Note, though, that the current implementation assumes a range of [0, 1] for simplicity:

# assuming we use [0, 1] scaled features
# this should avoid non-negative upsampling output
self.upsample_conv.append(nn.ReLU(inplace=True))
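
If your features aren't already in that range, a minimal min-max rescaling sketch (not the exact normalization used in either repo; lo/hi would ideally be dataset-wide statistics):

import numpy as np

def minmax_scale(mel, lo=None, hi=None):
    # Map mel features into [0, 1]; clip guards against out-of-range test data.
    lo = mel.min() if lo is None else lo
    hi = mel.max() if hi is None else hi
    return np.clip((mel - lo) / (hi - lo), 0.0, 1.0)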

@r9y9
Owner

r9y9 commented Apr 25, 2018

I've generated a set of ground-truth aligned mel spectrograms from Rayhane's Tacotron-2.
I've trained for over 100k steps but am still getting poor results on longer sequences...
Any ideas on how to improve this?

Are you sure you did the time-resolution adjustment correctly, as I did in https://github.com/r9y9/deepvoice3_pytorch/blob/3226e415ef1d8412bb159b228aa3c9212fdb892e/generate_aligned_predictions.py#L38-L42? Also, did you use exactly the same audio feature extraction pipeline for Tacotron 2 and WaveNet? If you did both correctly, then I think you should just be more patient. As I mentioned in #45 (comment), it generally takes ~1000k steps to get sufficiently good quality with the MoL output layer.
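
The gist of that adjustment, as a sketch (not the exact script): the number of conditioning frames times hop_size must match the number of audio samples exactly, padding or trimming as needed. This assumes 1-D audio and a mel of shape [T, 80]; hop_size must match your feature-extraction hparams:

import numpy as np

def adjust_time_resolution(audio, mel, hop_size=256):
    # Pad/trim audio so that len(audio) == mel.shape[0] * hop_size.
    expected_len = mel.shape[0] * hop_size
    if len(audio) < expected_len:
        audio = np.pad(audio, (0, expected_len - len(audio)), mode="constant")
    return audio[:expected_len], mel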

@duvtedudug
Author

Are you sure you did the time-resolution adjustment correctly

The main difference I could see is that the WaveNet specs have a few extra frames of padding at the end. I padded the Taco-generated specs with silence to match the audio files from the WaveNet preprocess. Otherwise the audio is identical and the mel timing is aligned. I assert taco.shape == wavenet.shape before training.

One other difference: the LWS STFT features from the WaveNet preprocess seem to be roughly in the range [0.4, 0.8]? I approximated this [0.4, 0.8] range with scaling.

it generally takes ~1000k steps to get sufficiently good quality with MoL

I'm using 256-way mu-law quantization for quicker results (on real specs it starts to sound good at 50k-100k steps for me).
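
For context, 256-way mu-law quantization in its usual form, as a sketch (the repo has its own implementation; this one assumes audio scaled to [-1, 1]):

import numpy as np

def mulaw_quantize(x, mu=255):
    # mu-law companding: y = sign(x) * log(1 + mu*|x|) / log(1 + mu), with y in [-1, 1]
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # map [-1, 1] to integer classes {0, ..., mu}
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)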

@duvtedudug
Author

train_eval.zip

The train_eval audio is showing some promise. The 190k file isn't there yet, but the 180k file doesn't sound too bad.

@butterl

butterl commented Apr 27, 2018

@duvtedudug could you share your hparam settings in WaveNet for training on the synthesised GTA .npy files from Tacotron 2?

@duvtedudug
Author

@butterl Normal hparam settings; I just changed to 256-way mu-law quantization.
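
In hparams terms that roughly corresponds to (names as I recall them from this repo's hparams.py; verify against your checkout):

input_type="mulaw-quantize",  # instead of the raw/MoL default
quantize_channels=256,
out_channels=256,             # categorical softmax over 256 classes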

@PetrochukM
Contributor

PetrochukM commented May 12, 2018

@r9y9 Why did you include ReLU? The WaveNet paper does not discuss using ReLU in the transposed-convolution upsampling network.

# assuming we use [0, 1] scaled features
# this should avoid non-negative upsampling output
self.upsample_conv.append(nn.ReLU(inplace=True))

I asked the Tacotron 2 authors. They did not normalize the mel spectrogram for WaveNet.

@r9y9
Owner

r9y9 commented May 13, 2018

The comment states the reason. I don't think it matters much.

@stale

stale bot commented May 30, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label May 30, 2019
@stale stale bot closed this as completed Jun 6, 2019