
About generation and input&output types #93

Closed

evinpinar opened this issue Jul 5, 2018 · 4 comments

Comments

@evinpinar
Hey, thank you for your code, it is very helpful for understanding how the model works. I am implementing WaveNet from scratch on my own and have some questions:

  • I feed scalar inputs to the model and use quantized targets to compute the cross-entropy loss. During generation, I decode the output with mu-law into a scalar value in the range [-1, 1], append it to the generated audio, and feed it back into the model for the next sample (a minimal sketch of this loop follows this list). This works when I try to overfit a small dataset and generate, and when I train on the whole dataset the loss decreases. However, when generating after ~700K steps without conditioning, the model only produces a constant value. What could be the problem?

  • In your code, I see this part in incremental generation:

        x = F.softmax(x.view(B, -1), dim=1) if softmax else x.view(B, -1)
        if quantize:
            sample = np.random.choice(
                np.arange(self.out_channels),
                p=x.view(-1).data.cpu().numpy())

    I am not sure why you randomly choose a value here.
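
For reference, here is a minimal sketch of the decode-and-refeed loop I describe in the first point. This is not this repo's code; `model.predict_next` is a hypothetical one-step API that returns a softmax over the 256 mu-law classes:

    import numpy as np

    MU = 255  # 8-bit mu-law

    def mulaw_decode(idx, mu=MU):
        """Map a class index in [0, mu] back to a scalar in [-1, 1]."""
        y = 2.0 * idx / mu - 1.0  # rescale index from [0, mu] to [-1, 1]
        return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

    def generate(model, num_samples, initial=0.0):
        samples = [initial]
        for _ in range(num_samples):
            probs = model.predict_next(samples)  # hypothetical one-step forward pass
            idx = np.random.choice(len(probs), p=probs)  # draw a class index
            samples.append(mulaw_decode(idx))  # decode and feed back as a scalar
        return np.array(samples)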

@r9y9
Owner

r9y9 commented Jul 5, 2018

For the second question, it's just sampling from the categorical distribution conditioned on previously generated samples: x_{t} ~ p(x_{t} | x_1, x_2, ..., x_{t-1}, c_1, c_2, ..., c_{T}), where x_{t} and c_{t} are the sample and the conditioning feature at time t, respectively. https://towardsdatascience.com/the-softmax-function-neural-net-outputs-as-probabilities-and-ensemble-classifiers-9bd94d75932 might help you understand what the softmax output does.
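
As a toy illustration (made-up numbers, not the repo's code): the softmax output is treated as a categorical distribution over the quantized classes, and one class index is drawn from it at each step:

    import numpy as np

    logits = np.array([2.0, 1.0, 0.1])  # made-up network outputs for 3 classes
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax: probabilities summing to 1
    sample = np.random.choice(np.arange(len(probs)), p=probs)  # draw one class index
    print(probs, sample)  # e.g. [0.66 0.24 0.10] and an index drawn with those weights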

As for the first question, you might want to check whether your incremental generation is correct. It could also happen if your dataset has a lot of silent regions: your model might have fit to the silence, resulting in generating only silence.

@evinpinar
Author

For the second question: yeah, I did not get why you draw a random sample instead of taking the value with the maximum probability.

First question: even though I do standard silence trimming on the LJSpeech dataset, I might need to check it again. Otherwise, I probably have an implementation error.
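
For the record, this is roughly how I check the trimming (the file name and the top_db threshold are just illustrative):

    import librosa

    y, sr = librosa.load("LJ001-0001.wav", sr=None)  # illustrative LJSpeech file
    y_trimmed, _ = librosa.effects.trim(y, top_db=25)  # drop leading/trailing silence
    print(len(y) / sr, len(y_trimmed) / sr)  # duration in seconds before/after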

@r9y9
Owner

r9y9 commented Jul 7, 2018

We want to sample from a generative model. Choosing the sample with the highest probability makes more sense for classification tasks.

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

At training time, the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made.
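
A toy contrast between the two decoding strategies (made-up numbers): greedy argmax always picks the same class from a fixed distribution, while sampling draws varied values in proportion to their probabilities:

    import numpy as np

    probs = np.array([0.5, 0.3, 0.2])  # a fixed toy distribution
    greedy = np.argmax(probs)  # always 0
    drawn = np.random.choice(3, size=10, p=probs)  # a mix of 0s, 1s, and 2s
    print(greedy, drawn)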

@evinpinar
Author

Oh, I see: you are randomly sampling from the distribution, not directly taking the result with the maximum value. Thanks!
