
A question about gradient #15

Open
CS123n opened this issue Jul 17, 2022 · 3 comments
CS123n commented Jul 17, 2022

```python
target = {
    ModelMeanType.PREVIOUS_X: self.q_posterior_mean_variance(
        x_start=x_start, x_t=x_t, t=t
    )[0],
    ModelMeanType.START_X: x_start,
    ModelMeanType.EPSILON: noise,
}[self.model_mean_type]
assert model_output.shape == target.shape == x_start.shape
terms["mse"] = mean_flat((target - model_output) ** 2)
```

Hi, I see there are three main losses: MSE, decoder_nll, and kl_T. I think decoder_nll is designed for the decoder and MSE is designed for the diffusion model. However, x_start is not detached in these two losses, so they also influence the embedding part. Is this a bug or a deliberate design choice?

@XiangLi1999
Owner

Hi, this is not a bug. We need to backprop the gradient signal from these two losses to the embedding function in order to train jointly.
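For intuition, here is a minimal sketch (illustrative names only, not the repository's actual code) of why leaving x_start attached lets the MSE term update the embedding table: the gradient flows back through the noised latent x_t into the embedding lookup.

```python
import torch

torch.manual_seed(0)
emb = torch.nn.Embedding(10, 4)    # stand-in for the trainable embedding function
net = torch.nn.Linear(4, 4)        # stand-in for the denoising network
input_ids = torch.tensor([1, 2, 3])

x_start = emb(input_ids)           # NOT detached: stays in the autograd graph
noise = torch.randn_like(x_start)
x_t = 0.9 * x_start + 0.1 * noise  # schematic forward diffusion step
model_output = net(x_t)            # network tries to predict the noise

# schematic "mse" term with the EPSILON target
loss = ((noise - model_output) ** 2).mean()
loss.backward()

# gradient reaches the embedding weights through x_t -> x_start
print(emb.weight.grad.abs().sum() > 0)
```

If x_start were detached (`x_start = emb(input_ids).detach()`), the embedding weights would receive no gradient from this term and the embedding could not be trained jointly.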


summmeer commented Aug 5, 2022

@XiangLi1999 Hi, I'm curious about the loss function too. I cannot understand why we need to compute decoder_nll using x_start in decoder_nll = self.token_discrete_loss(x_start, get_logits, input_ids). I think x_start is the word embedding with extra noise added, and this decoder loss is trying to recover the noise, which has no relation to the diffusion model. Besides, this is not consistent with the formulation of the loss function, in the \log p_theta(w|x_0) part. Why can't we replace x_start with the predicted model_out_x_start? Wouldn't that be more reasonable? (But the experimental results were not good.)
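For reference, here is a rough sketch of what a token-level decoder NLL over x_start could look like. The projection of the continuous vectors back onto the vocabulary through the embedding matrix is an assumption for illustration, not the repository's exact token_discrete_loss:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 10, 4
emb = torch.nn.Embedding(vocab_size, dim)
input_ids = torch.tensor([[1, 2, 3]])
x_start = emb(input_ids)            # continuous vectors to be "rounded" to tokens

def get_logits(x):
    # illustrative: score each position against every embedding row
    return x @ emb.weight.t()       # (batch, seq, vocab)

logits = get_logits(x_start)
decoder_nll = F.cross_entropy(
    logits.view(-1, vocab_size), input_ids.view(-1), reduction="none"
).view(input_ids.shape)
print(decoder_nll.mean().item())    # average per-token NLL
```

Under this reading, the term anchors the embedding space so that clean embeddings remain decodable to the right tokens, which is one reason it is computed on x_start rather than on the model's prediction.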

@AlonzoLeeeooo

> @XiangLi1999 Hi, I'm curious about the loss function too. I cannot understand why we need to compute decoder_nll using x_start in decoder_nll = self.token_discrete_loss(x_start, get_logits, input_ids). I think x_start is the word embedding with extra noise added, and this decoder loss is trying to recover the noise, which has no relation to the diffusion model. Besides, this is not consistent with the formulation of the loss function, in the \log p_theta(w|x_0) part. Why can't we replace x_start with the predicted model_out_x_start? Wouldn't that be more reasonable? (But the experimental results were not good.)

Hi @summmeer,
I share the same feeling about this problem. It seems that decoder_nll has little correlation with the training of the diffusion model, and I wonder how it performed in your experiments. While training the model, I found that the NLL loss stayed at zero for a long period (about 8k iterations); at around 10k steps it started to increase. Have you encountered a similar situation? How did the NLL loss behave in your experiments?

Thanks in advance for your reply. It would help me a lot.

Best,
