
About generation and input&output types #93

Closed

evinpinar opened this issue Jul 5, 2018 · 4 comments

Comments

@evinpinar
Hey, thank you for your code, it is very helpful for understanding how the model works. I am implementing WaveNet from scratch on my own and have some questions:

  • I feed scalar inputs to the model and use quantized targets to compute the cross-entropy loss. During generation, I decode the output with mu-law into a scalar value in the range [-1, 1], append it to the generated audio, and feed it back into the model for the next sample (a minimal sketch of this loop follows this list). This works when I try to overfit a small dataset and generate, and when I train on the whole dataset the loss decreases. However, when generating after ~700K steps without conditioning, the model only produces a constant value. What could be the problem?

  • In your code, I see this part in incremental generation:

        x = F.softmax(x.view(B, -1), dim=1) if softmax else x.view(B, -1)
        if quantize:
            sample = np.random.choice(
                np.arange(self.out_channels),
                p=x.view(-1).data.cpu().numpy())

    I am not sure why you randomly choose a value here.
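
For reference, here is a minimal sketch of the decode-and-refeed loop I describe in the first point. This is not this repo's code; `model.predict_next` is a hypothetical one-step API that returns a softmax over the 256 mu-law classes:

    import numpy as np

    MU = 255  # 8-bit mu-law

    def mulaw_decode(idx, mu=MU):
        """Map a class index in [0, mu] back to a scalar in [-1, 1]."""
        y = 2.0 * idx / mu - 1.0  # rescale index from [0, mu] to [-1, 1]
        return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

    def generate(model, num_samples, initial=0.0):
        samples = [initial]
        for _ in range(num_samples):
            probs = model.predict_next(samples)  # hypothetical one-step forward pass
            idx = np.random.choice(len(probs), p=probs)  # draw a class index
            samples.append(mulaw_decode(idx))  # decode and feed back as a scalar
        return np.array(samples)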

@r9y9
Owner

r9y9 commented Jul 5, 2018

For the second question, it's just sampling from the categorical distribution conditioned on previously generated samples: x_{t} ~ p(x_{t} | x_1, x_2, ..., x_{t-1}, c_1, c_2, ..., c_{T}), where x_{t} and c_{t} are the sample and the conditioning feature at time t, respectively. https://towardsdatascience.com/the-softmax-function-neural-net-outputs-as-probabilities-and-ensemble-classifiers-9bd94d75932 might help you understand what the softmax output does.
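
As a toy illustration (made-up numbers, not the repo's code): the softmax output is treated as a categorical distribution over the quantized classes, and one class index is drawn from it at each step:

    import numpy as np

    logits = np.array([2.0, 1.0, 0.1])  # made-up network outputs for 3 classes
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax: probabilities summing to 1
    sample = np.random.choice(np.arange(len(probs)), p=probs)  # draw one class index
    print(probs, sample)  # e.g. [0.66 0.24 0.10] and an index drawn with those weights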

As for the first question, you might want to check whether your incremental generation is correct. It could also happen if your dataset has a lot of silent regions: your model might have fit to the silence, resulting in generating only silence.

@evinpinar
Author

For the second question: yeah, I did not get why you draw a random sample instead of taking the value with the maximum probability.

First question: even though I do standard silence trimming on the LJSpeech dataset, I might need to check it again. Otherwise, I probably have an implementation error.
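
For the record, this is roughly how I check the trimming (the file name and the top_db threshold are just illustrative):

    import librosa

    y, sr = librosa.load("LJ001-0001.wav", sr=None)  # illustrative LJSpeech file
    y_trimmed, _ = librosa.effects.trim(y, top_db=25)  # drop leading/trailing silence
    print(len(y) / sr, len(y_trimmed) / sr)  # duration in seconds before/after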

@r9y9
Owner

r9y9 commented Jul 7, 2018

We want to sample from a generative model. Choosing the sample with the highest probability makes more sense for classification tasks.

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

At training time, the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made.
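
A toy contrast between the two decoding strategies (made-up numbers): greedy argmax always picks the same class from a fixed distribution, while sampling draws varied values in proportion to their probabilities:

    import numpy as np

    probs = np.array([0.5, 0.3, 0.2])  # a fixed toy distribution
    greedy = np.argmax(probs)  # always 0
    drawn = np.random.choice(3, size=10, p=probs)  # a mix of 0s, 1s, and 2s
    print(greedy, drawn)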

@evinpinar
Author

Oh, I see: you are randomly sampling from the distribution, not directly taking the result with the maximum value. Thanks!
