
Planned TODOs #1

Closed · 24 of 28 tasks
r9y9 opened this issue Dec 31, 2017 · 98 comments

Comments

@r9y9 (Owner) commented Dec 31, 2017:

This is an umbrella issue to track progress for my planned TODOs. Comments and requests are welcome.

Goal

  • achieve higher speech quality than conventional vocoders (WORLD, Griffin-Lim, etc.)
  • provide a pre-trained model of a WaveNet-based mel-spectrogram vocoder

Model

  • 1D dilated convolution
  • batch forward
  • incremental inference
  • local conditioning
  • global conditioning
  • upsampling network (by transposed convolutions)

Training script

  • Local conditioning
  • Global conditioning
  • Configurable maximum number of time steps (to avoid out of memory error). 58ad07f

Experiments

  • unconditioned WaveNet trained with CMU Arctic
  • conditioning model on mel-spectrogram (local conditioning) with CMU Arctic
  • conditioning model on mel-spectrogram and speaker id with CMU Arctic
  • conditioning model on mel-spectrogram (local conditioning) with LJSpeech
  • DeepVoice3 + WaveNet vocoder WIP: Support for Wavenet vocoder deepvoice3_pytorch#21

Misc

  • [ ] Time sliced data generator?
  • Travis CI
  • Train/val split
  • README

Sampling frequency

  • 4kHz
  • 16kHz
  • 22.5kHz
  • 44.1kHz
  • 48kHz

Advanced (lower priority)

@r9y9 (Owner) commented Dec 31, 2017:

At the moment, I think I have finished implementing the basic features (batch/incremental inference, local/global conditioning) and confirmed that an unconditioned WaveNet trained on CMU ARCTIC (~1200 utterances, 16 kHz) can generate speech-like sounds. Audio samples are attached.

step80000.zip

Top: real speech; bottom: generated speech. Only the first sample of the real speech was fed to the WaveNet decoder as the initial input.

step000080000_waveplots

step90000.zip

step000090000_waveplots

@geneing commented Dec 31, 2017:

For reference, these are the other WaveNet projects I know of:
https://github.com/ibab/tensorflow-wavenet
https://github.com/tomlepaine/fast-wavenet - a faster implementation of the original WaveNet paper.

@r9y9 (Owner) commented Jan 1, 2018:

Still not quite high quality, but the vocoder conditioned on mel-spectrograms has started to work. Audio samples from a model trained for 10 hours are attached.

step90000.zip
step000090000_waveplots

step95000.zip
step000095000_waveplots

@r9y9 (Owner) commented Jan 2, 2018:

Finished transposed convolution support at 8c0b5a9. Started training again.

@jamestang0219 commented:

Hi, I've already tried using linguistic features as local features, but there may be a problem: linguistic features are at the phoneme level, mel-spectrograms are at the frame level, but WaveNet's local conditioning inputs are at the sample level.

Here is a case: if a phoneme's duration is 0.25 s and the sample rate is 16 kHz, then to create the WaveNet inputs I have to duplicate that phoneme's linguistic feature int(0.25 * 16000) times to serve as the local features for those samples. Do you think this practice is right or not? How do you process the mel-spectrogram features, given that they are frame-level?

Thanks for answering me.

@jamestang0219 commented:

Can WaveNet capture the differences even if many samples share the same local features, as long as its receptive field is wide enough?

@r9y9 (Owner) commented Jan 3, 2018:

@jamestang0219 I think you are right. In the paper http://www.isca-speech.org/archive/Interspeech_2017/pdfs/0314.PDF, they use log-F0 and mel-cepstrum as conditioning features and duplicate them to adjust the time resolution. I also tried this idea and got reasonable results.
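For illustration, a minimal numpy sketch of that duplication, assuming hypothetical phoneme-level features and per-phoneme durations (not code from this repo):

import numpy as np

sample_rate = 16000
# One feature vector per phoneme: e.g. 3 phonemes x 5-dim linguistic features.
phoneme_feats = np.random.rand(3, 5).astype(np.float32)
durations_sec = np.array([0.25, 0.10, 0.15])  # per-phoneme durations in seconds

# Repeat each phoneme's feature vector once per waveform sample it covers.
repeats = (durations_sec * sample_rate).astype(int)  # [4000, 1600, 2400]
sample_level_feats = np.repeat(phoneme_feats, repeats, axis=0)
print(sample_level_feats.shape)  # (8000, 5)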

@r9y9 (Owner) commented Jan 3, 2018:

Latest audio samples attached. The mel-spectrogram is repeated to adjust the time resolution; see wavenet_vocoder/audio.py, lines 39-40 at b8ee2ce:

upsample_factor = quantized.size // mel.shape[0]
mel = np.repeat(mel, upsample_factor, axis=0)

In this case upsample_factor was always 256.
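For example, a minimal sketch of the same frame-to-sample repetition on a dummy mel-spectrogram (the shapes are illustrative):

import numpy as np

hop_size = 256
mel = np.random.rand(62, 80).astype(np.float32)   # (frames, num_mels)
quantized_len = mel.shape[0] * hop_size           # length of the matching waveform

upsample_factor = quantized_len // mel.shape[0]   # 256
mel_upsampled = np.repeat(mel, upsample_factor, axis=0)
print(mel_upsampled.shape)                        # (15872, 80)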

step70000.zip
step000070000_waveplots

@jamestang0219 commented:

@r9y9 In your source code, do you use transposed convolution to implement the upsampling? Have you checked which method is better for upsampling?

@r9y9 (Owner) commented Jan 3, 2018:

@jamestang0219 I implemented transposed convolution but haven't had success with it yet. I suspect 256x upsampling is hard to train, especially on the small dataset I'm experimenting with now. The WaveNet authors reported that transposed convolution is better, though.

@r9y9 (Owner) commented Jan 3, 2018:

# If True, use transposed convolutions to upsample conditional features,
# otherwise repeat features to adjust time resolution
upsample_conditional_features=False,
# should np.prod(upsample_scales) == hop_size
upsample_scales=[16, 16],

For now I am not using transposed convolution.

@jamestang0219 commented:

@r9y9 May I know your hyperparameters for extracting the mel-spectrogram? Frame shift of 0.0125 s and frame length of 0.05 s? If those are your parameters, why do you use 256 as the upsample factor instead of sr (16000) * frame_shift (0.0125) = 200? Any tricks here? Forgive me for the many questions :( I also want to reproduce the Tacotron 2 results.

@r9y9 (Owner) commented Jan 3, 2018:

@jamestang0219 Hyperparameters for audio feature extraction:

# Audio:
sample_rate=16000,
silence_threshold=2,
num_mels=80,
fft_size=1024,
# shift can be specified by either hop_size or frame_shift_ms
hop_size=256,
frame_shift_ms=None,
min_level_db=-100,
ref_level_db=20,

I use a frame shift of 256 samples (16 ms).
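As a quick sanity check of the arithmetic (plain Python, nothing repo-specific):

sample_rate = 16000

# hop size in samples <-> frame shift in milliseconds
hop_size = 256
print(1000 * hop_size / sample_rate)  # 16.0 ms, so the upsample factor is 256

# A Tacotron-style 12.5 ms shift would instead give an upsample factor of 200.
print(int(sample_rate * 0.0125))      # 200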

@jamestang0219 commented:

@r9y9 Thanks :)

@npuichigo commented:

@r9y9 I noticed that in Tacotron 2, two upsampling layers with transposed convolution are used, but in my WaveNet implementation it still doesn't work.

@r9y9 (Owner) commented Jan 3, 2018:

@npuichigo Could you share what parameters (padding, kernel_size, etc.) you are using? I tried a 1D transposed convolution with stride=16, kernel_size=16, padding=0, applied twice, to upsample inputs by 256x.

if upsample_conditional_features:
    self.upsample_conv = nn.ModuleList()
    for s in upsample_scales:
        self.upsample_conv.append(ConvTranspose1d(
            cin_channels, cin_channels, kernel_size=s, padding=0,
            dilation=1, stride=s, std_mul=1.0))
        # Is this non-linearity necessary?
        self.upsample_conv.append(nn.ReLU(inplace=True))
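For reference, a minimal sketch using plain torch.nn layers (not the repo's weight-normalized ConvTranspose1d) that checks two stride-16 layers upsample the time axis by 256x; the shapes are illustrative:

import torch
import torch.nn as nn

cin_channels = 80
upsample = nn.Sequential(
    nn.ConvTranspose1d(cin_channels, cin_channels, kernel_size=16, stride=16, padding=0),
    nn.ReLU(inplace=True),
    nn.ConvTranspose1d(cin_channels, cin_channels, kernel_size=16, stride=16, padding=0),
    nn.ReLU(inplace=True),
)

c = torch.randn(1, cin_channels, 100)  # (batch, mel channels, frames)
print(upsample(c).shape)               # torch.Size([1, 80, 25600]) == 100 frames * 256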

@npuichigo commented:

@r9y9 My parameters are listed below. Because I use a frame shift of 12.5 ms, the upsampling factor is 200.

# Audio
num_mels=80,
num_freq=1025,
sample_rate=16000,
frame_length_ms=50,
frame_shift_ms=12.5,
min_level_db=-100,
ref_level_db=20

# Transposed convolution 10*20=200 (TensorFlow)
up_lc_batch = tf.expand_dims(lc_batch, 1)
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, self.out_channels, (1, 10),
       strides=(1, 10), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / self.out_channels))
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, self.out_channels, (1, 20),
       strides=(1, 20), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / self.out_channels))
up_lc_batch = tf.squeeze(up_lc_batch, 1)

@r9y9 (Owner) commented Jan 4, 2018:

https://r9y9.github.io/wavenet_vocoder/

Created a simple project page and uploaded audio samples for the speaker-dependent WaveNet vocoder. I'm working on global conditioning (speaker embedding) now.

@npuichigo commented:

@r9y9 Regarding the upsampling network, I found that 2D transposed convolution works well, while the 1D version generates speech with unnatural prosody, maybe because the 2D transposed convolution only considers local information along the frequency axis.

height_width = 3  # kernel width along frequency axis
up_lc_batch = tf.expand_dims(lc_batch, 3)
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, 1, (10, height_width),
       strides=(10, 1), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / height_width))
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, 1, (20, height_width),
       strides=(20, 1), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / height_width))
up_lc_batch = tf.squeeze(up_lc_batch, 3)

@r9y9 (Owner) commented Jan 5, 2018:

@npuichigo Thank you for sharing that! Did you check the output of the upsampling network? Could the upsampling network actually learn to upsample, i.e., did you get a high-resolution mel-spectrogram? I was wondering if I need to add a loss term for the upsampling (e.g., MSE between the coarse mel-spectrogram and the 1-shift high-resolution mel-spectrogram), and I'm curious whether it can be learned without an upsampling-specific loss.

@npuichigo commented:

@r9y9 I think a transposed convolution with the same stride and kernel size is similar to duplication. As in the following animation, if the kernel is one everywhere, then it is exactly duplication. So maybe I need to check the kernel values after training.

[figure: transposed convolution animation (padding_no_strides_transposed)]
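A minimal sketch (plain PyTorch and numpy, toy values) verifying that an all-ones kernel with stride == kernel_size reproduces np.repeat:

import numpy as np
import torch
import torch.nn as nn

upsample_factor = 4
conv = nn.ConvTranspose1d(1, 1, kernel_size=upsample_factor,
                          stride=upsample_factor, bias=False)
with torch.no_grad():
    conv.weight.fill_(1.0)  # kernel is one everywhere

x = torch.arange(5, dtype=torch.float32).view(1, 1, -1)
y = conv(x).squeeze().detach().numpy()
print(np.allclose(y, np.repeat(np.arange(5, dtype=np.float32), upsample_factor)))  # True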

@r9y9 (Owner) commented Jan 6, 2018:

https://r9y9.github.io/wavenet_vocoder/

Added audio samples for the multi-speaker version of the WaveNet vocoder.

@rishabh135 commented:

Hello @r9y9, great work and awesome samples. Would you mind sharing the weights of the WaveNet vocoder trained on mel-spectrograms with the CMU ARCTIC dataset, without speaker embedding? I would like to compare it with Griffin-Lim reconstruction to see which works better.

@r9y9 (Owner) commented Jan 8, 2018:

@rishabh135 Not at all. Here it is: https://www.dropbox.com/sh/b1p32sxywo6xdnb/AAB2TU2DGhPDJgUzNc38Cz75a?dl=0

Note that you have to use exactly the same mel-spectrogram extraction, wavenet_vocoder/audio.py, lines 66-69 at f05e520:

def melspectrogram(y):
    D = _lws_processor().stft(y).T
    S = _amp_to_db(_linear_to_mel(np.abs(D)))
    return _normalize(S)

and the same hyperparameters:
sample_rate=16000,
silence_threshold=2,
num_mels=80,
fft_size=1024,
# shift can be specified by either hop_size or frame_shift_ms
hop_size=256,
frame_shift_ms=None,
min_level_db=-100,
ref_level_db=20,

@r9y9 (Owner) commented Jan 8, 2018:

Using the transposed convolution below, I can get a good initialization for the upsampling network. Very nice, thanks @npuichigo!

kernel_size = 3
padding = (kernel_size - 1) // 2
upsample_factor = 16

conv = nn.ConvTranspose2d(1,1,kernel_size=(kernel_size,upsample_factor),
                          stride=(1,upsample_factor), padding=(padding,0))
conv.bias.data.zero_()
conv.weight.data.fill_(1/kernel_size);

Mel-spectrogram (hop_size = 256): [figure]

16x upsampled mel-spectrogram: [figure]
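For reference, a minimal usage sketch of the initialization above on a dummy mel-spectrogram shaped (batch, 1, num_mels, frames); the shapes are illustrative, and two such 16x layers would cover hop_size = 256:

import torch
import torch.nn as nn

kernel_size, upsample_factor, padding = 3, 16, 1
conv = nn.ConvTranspose2d(1, 1, kernel_size=(kernel_size, upsample_factor),
                          stride=(1, upsample_factor), padding=(padding, 0))
conv.bias.data.zero_()
conv.weight.data.fill_(1 / kernel_size)

mel = torch.randn(1, 1, 80, 100)  # 80 mel bins, 100 frames
print(conv(mel).shape)            # torch.Size([1, 1, 80, 1600]) -> time axis upsampled 16x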

@r9y9 (Owner) commented Feb 12, 2018:

https://r9y9.github.io/wavenet_vocoder/

Updated the samples of the multi-speaker WaveNet. It uses a mixture of logistic distributions. It was quite costly to train. Also added ground-truth audio samples for ease of comparison.

@rafaelvalle commented:

@r9y9 What do you mean by costly to train? What are the biggest challenges?

@r9y9 (Owner) commented Feb 15, 2018:

I meant it's very time-consuming. It took a week or more to get sufficiently good quality for LJSpeech and CMU ARCTIC.

@rafaelvalle commented:

Can you share the loss curve?

@r9y9 (Owner) commented Feb 15, 2018:

I'm on a short business trip and don't have access to my GPU PC right now. I can share it when I come back home in a week.

@rafaelvalle commented:

That's great, Ryuichi! Thank you!

@bliep commented Feb 19, 2018:

In the original Salimans PixelCNN++ code, the loss is converted to bits per output dimension, which is quite handy for comparison with other implementations and experiments. For this, just divide the loss by (output dimensionality * ln(2)). How many bits is the model able to predict?
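A minimal sketch of that conversion in plain Python (the 2.0-nat loss below is a hypothetical value):

import math

def bits_per_dim(nll_nats, output_dim=1):
    # nats -> bits, averaged over the output dimensionality
    return nll_nats / (output_dim * math.log(2))

print(bits_per_dim(2.0))  # ~2.885 bits for a hypothetical 2.0-nat loss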

@rafaelvalle commented Feb 19, 2018 (via email)

@bliep commented Feb 19, 2018:

The loss is the negative log probability, and averaged over the output dimensions it is an estimate of the entropy of a sample. In the original paper (predicting pixels in an image) the residual entropy was around 3 bits (out of 8, so 5 bits are predicted). Since it is not easy for me to figure out the output dimension of this WaveNet implementation, a loss of 56-57 doesn't tell me much.
(see https://github.com/openai/pixel-cnn/blob/master/train.py#L148)

@rafaelvalle commented Feb 20, 2018:

I see now, it's just the loss normalized to bits, which facilitates comparison as you mentioned!

From what I understand, the model uses a mixture of 10 logistics with 3 parameters each (pi, mean, log-scale), producing a total of 30 output channels.

This is what I understand from what @r9y9 has in the hparams.py file: https://github.com/r9y9/wavenet_vocoder/blob/master/hparams.py

@et1234et commented:

@r9y9
I tested LJSpeech using the latest code (~1000k steps), but it sounds slightly noisy.
Is the latest code the same setup as the updated samples (https://r9y9.github.io/wavenet_vocoder/)?
checkpoint_step001000000.zip

@r9y9 (Owner) commented Feb 21, 2018:

Yes, current master is the latest and matches what I have locally. Maybe the training procedure I described in #1 (comment) is important for quality.

@dyelax (Contributor) commented Feb 21, 2018:

@r9y9 would you mind re-sharing your weights for the mel-conditioned wavenet? The link you shared earlier is broken. Thanks!

@r9y9 (Owner) commented Feb 22, 2018:

@dyelax Can you check the links in #19 instead?

@azraelkuan (Contributor) commented Feb 26, 2018:

@r9y9 For multi-GPU training, I found that we only need to change

y_hat = model(x, c=c, g=g, softmax=False)

to

y_hat = torch.nn.parallel.data_parallel(model, (x, c, g, False))

and increase num_workers and batch_size. We could also expose device_ids and output_device as command-line arguments.

@bliep commented Feb 27, 2018:

Efficient Neural Audio Synthesis: https://arxiv.org/abs/1802.08435
Lots of interesting tricks, and the claim is real-time synthesis on a mobile CPU thanks to weight pruning.

@r9y9 (Owner) commented Apr 6, 2018:

Added a link to the pre-trained models: https://github.com/r9y9/wavenet_vocoder#pre-trained-models

@twidddj commented Apr 11, 2018:

Hi @r9y9, thank you so much for sharing your work.

We have followed your work and got some results in TensorFlow. While we haven't tested much yet, it works with the same parameters as yours, except without the dropout and weight normalization techniques. You can find some results here. If I get more information during testing, I'll let you know. Thanks!

@r9y9 (Owner) commented Apr 11, 2018:

@twidddj Nice! I'm looking forward to your results.

@r9y9 (Owner) commented May 12, 2018:

I think I can close this now. Discussion of the remaining issues (e.g., DeepVoice + WaveNet) can continue in their specific issues.
