
Training with Tacotron GTA mel specs #52

Closed
duvtedudug opened this issue Apr 23, 2018 · 23 comments

@duvtedudug

I've generated a set of ground-truth aligned mel spectrograms from Rayhane's Tacotron-2.

I've trained for over 100k steps but am still getting poor results on longer sequences...
gta-taco-wave-120k.zip

Any ideas on how to improve this?

@Rayhane-mamah

I think you're in the same case as:

Rayhane-mamah/Tacotron-2#29

@duvtedudug
Author

@Rayhane-mamah thanks for the quick reply.

But I think my case is different, since I'm using ground-truth-aligned spectrograms output by your Tacotron (not real spectrograms extracted from the training audio).

@Rayhane-mamah

Oh in that case it's something worth looking at! :)

Could you report your Tacotron model params? Reduction factor, number of training steps, etc.?

If it's caused by the feature prediction network I'll find the issue.

@duvtedudug
Author

duvtedudug commented Apr 23, 2018

I think the spectrograms from your Tacotron are good. I trained for 137,000 steps with the standard settings (no changes to hparams) and synthesised GTA mels.

Here's the plots...

[Plots at step 137000: real mel spectrogram, alignment, predicted mel spectrogram]

When training the WaveNet vocoder on real spectrograms, I get decent results after about 50k steps (8-bit softmax).

But using the GTA Tacotron-generated spectrograms, I'm not getting good quality even after 100k+ steps (see the zip example I posted previously).

Any help would be greatly appreciated!

@Rayhane-mamah

Judging from the loss value and the alignments, it seems you trained your model with reduction factor r=5 (with r=5 the decoder emits 5 mel frames per step, which tends to blur fine temporal detail compared to r=1).

Try using this model to generate GTA mels and retrain the wavenet if you can.

The reduction factor is the first difference I can think of between our work and the original T2 paper, so it's the first thing I would suspect as the cause of this kind of quality failure.

The given model's current state:

[Plots at step 136500: alignment, predicted mel spectrogram, real mel spectrogram]

@duvtedudug
Author

Ah, of course. I should have noticed that. Thanks for the pretrained model; I'll try with r=1.

@r9y9
Owner

r9y9 commented Apr 24, 2018

A little off-topic: I'm wondering whether you really get such a smooth, natural-looking (at a glance) mel spectrogram. Did you use https://github.com/Rayhane-mamah/Tacotron-2/blob/1547b2502305f4ee58bceede1384054c22b0497a/tacotron/utils/plot.py#L36-L38 for plotting the mel spectrogram? If so, can you try passing the additional parameter interpolation="none" to imshow?

plt.imshow(np.rot90(spectrogram))
plt.imshow(np.rot90(spectrogram), interpolation="none")
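
For a self-contained comparison, a minimal sketch (assuming matplotlib and a saved mel as a NumPy array of shape [T, 80]; the file name is hypothetical):

import matplotlib.pyplot as plt
import numpy as np

mel = np.load("pred_mel.npy")  # hypothetical path to a predicted mel
fig, (ax1, ax2) = plt.subplots(2, 1)
ax1.imshow(np.rot90(mel), aspect="auto")                        # version-dependent default interpolation
ax2.imshow(np.rot90(mel), aspect="auto", interpolation="none")  # raw frame values, no smoothing
plt.savefig("specs_check_interp_none.png")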

I remember having a hard time getting smooth mel spectrograms when I was working on DeepVoice3 (and even on Tacotron). I think RNNs can do better than CNNs here, but I'm curious whether Tacotron 2 actually performs better than DeepVoice3 and Tacotron.

@duvtedudug
Author

@r9y9 @Rayhane-mamah correct me if I'm wrong: if I use external mel spectrograms from Taco2, do I only need to make sure the Taco2 specs have the same shape as the WaveNet vocoder specs generated by preprocessing (i.e. same sample rate, same hop size, 80-dimensional)? I presume they do not need to have exactly the same frequency ranges or amplitudes?
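
As a sanity check I could run something like this (hypothetical file names; assumes both pipelines save mels as .npy arrays of shape [T, num_mels]):

import numpy as np

taco = np.load("taco2_gta_mel.npy")           # hypothetical: GTA mel from Tacotron-2
wn = np.load("wavenet_preprocessed_mel.npy")  # hypothetical: mel from the WaveNet preprocess
assert taco.shape == wn.shape, (taco.shape, wn.shape)
print("taco range:   ", taco.min(), taco.max())
print("wavenet range:", wn.min(), wn.max())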

@duvtedudug
Author

Here's a quick check with no interpolation:

[Plot: spectrogram comparison with interpolation="none"]

@r9y9
Owner

r9y9 commented Apr 25, 2018

@duvtedudug Thanks! Looks good. I should definitely check out the details of Tacotron2.

@duvtedudug
Author

@r9y9 @Rayhane-mamah no problem. Can anybody confirm my question above, re: whether matching shape is the only requirement for the Taco specs?

@Rayhane-mamah

Rayhane-mamah commented Apr 25, 2018 via email

@duvtedudug
Author

Thanks!

@r9y9 is there a WaveNet limitation requiring local conditioning features to be in [0, 1]?

Maybe I should be more patient and train for 200-300k steps?

@Rayhane-mamah

Rayhane-mamah commented Apr 25, 2018 via email

@r9y9
Owner

r9y9 commented Apr 25, 2018

There's no limitation on the range of the conditional features. Note, though, that the current implementation assumes a range of [0, 1] for simplicity:

# assuming we use [0, 1] scaled features
# this should avoid non-negative upsampling output
self.upsample_conv.append(nn.ReLU(inplace=True))
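
If your features aren't already in that range, a minimal min-max rescaling sketch (not the exact normalization used in either repo; lo/hi would ideally be dataset-wide statistics):

import numpy as np

def minmax_scale(mel, lo=None, hi=None):
    # Map mel features into [0, 1]; clip guards against out-of-range test data.
    lo = mel.min() if lo is None else lo
    hi = mel.max() if hi is None else hi
    return np.clip((mel - lo) / (hi - lo), 0.0, 1.0)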

@r9y9
Owner

r9y9 commented Apr 25, 2018

I've generated a set of ground-truth aligned mel spectrograms from Rayhane's Tacotron-2.
I've trained for over 100k steps but am still getting poor results on longer sequences...
Any ideas on how to improve this?

Are you sure you did the time-resolution adjustment correctly, as I did in https://github.com/r9y9/deepvoice3_pytorch/blob/3226e415ef1d8412bb159b228aa3c9212fdb892e/generate_aligned_predictions.py#L38-L42? Also, did you use exactly the same audio feature extraction pipeline for Tacotron 2 and WaveNet? If you did both correctly, then I think you should just be more patient. As I mentioned in #45 (comment), it generally takes ~1000k steps to get sufficiently good quality with the MoL output layer.
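
The gist of that adjustment, as a sketch (not the exact script): the number of conditioning frames times hop_size must match the number of audio samples exactly, padding or trimming as needed. This assumes 1-D audio and a mel of shape [T, 80]; hop_size must match your feature-extraction hparams:

import numpy as np

def adjust_time_resolution(audio, mel, hop_size=256):
    # Pad/trim audio so that len(audio) == mel.shape[0] * hop_size.
    expected_len = mel.shape[0] * hop_size
    if len(audio) < expected_len:
        audio = np.pad(audio, (0, expected_len - len(audio)), mode="constant")
    return audio[:expected_len], mel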

@duvtedudug
Author

Are you sure you did the time-resolution adjustment correctly

The main difference I could see is that the WaveNet specs have a few extra frames of padding at the end. I padded the Taco-generated specs with silence to match the audio files from the WaveNet preprocess. Otherwise the audio is identical and the mel timing is aligned. I assert taco.shape == wavenet.shape before training.

One other difference: the LWS STFT features from the WaveNet preprocess seem to be roughly in the range [0.4, 0.8]? I approximated this [0.4, 0.8] range with scaling.

it generally takes ~1000k steps to get sufficiently good quality with MoL

I'm using 256-way mu-law quantization for quicker results (on real specs it starts to sound good at 50k-100k steps for me).
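
For context, 256-way mu-law quantization in its usual form, as a sketch (the repo has its own implementation; this one assumes audio scaled to [-1, 1]):

import numpy as np

def mulaw_quantize(x, mu=255):
    # mu-law companding: y = sign(x) * log(1 + mu*|x|) / log(1 + mu), with y in [-1, 1]
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # map [-1, 1] to integer classes {0, ..., mu}
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)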

@duvtedudug
Author

train_eval.zip

The train_eval audio is showing some promise. The 190k file isn't there yet, but the 180k file doesn't sound too bad.

@butterl

butterl commented Apr 27, 2018

@duvtedudug could you share your hparam settings in WaveNet for training on the synthesised GTA .npy files from Tacotron 2?

@duvtedudug
Author

@butterl Normal hparam settings; I just changed to 256-way mu-law quantization.
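
In hparams terms that roughly corresponds to (names as I recall them from this repo's hparams.py; verify against your checkout):

input_type="mulaw-quantize",  # instead of the raw/MoL default
quantize_channels=256,
out_channels=256,             # categorical softmax over 256 classes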

@PetrochukM
Contributor

PetrochukM commented May 12, 2018

@r9y9 Why did you include ReLU? The WaveNet paper does not discuss using ReLU in the transposed-convolution upsampling network.

# assuming we use [0, 1] scaled features
# this should avoid non-negative upsampling output
self.upsample_conv.append(nn.ReLU(inplace=True))

I asked the Tacotron 2 authors. They did not normalize the mel spectrogram for WaveNet.

@r9y9
Owner

r9y9 commented May 13, 2018

The comment states the reason. I don't think it matters much.

@stale

stale bot commented May 30, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label May 30, 2019
@stale stale bot closed this as completed Jun 6, 2019