
A question about gradient #15

Open
CS123n opened this issue Jul 17, 2022 · 3 comments
CS123n commented Jul 17, 2022

```python
target = {
    ModelMeanType.PREVIOUS_X: self.q_posterior_mean_variance(
        x_start=x_start, x_t=x_t, t=t
    )[0],
    ModelMeanType.START_X: x_start,
    ModelMeanType.EPSILON: noise,
}[self.model_mean_type]
assert model_output.shape == target.shape == x_start.shape
terms["mse"] = mean_flat((target - model_output) ** 2)
```

Hi, I see there are three main losses: MSE, decoder_nll, and kl_T. I think decoder_nll is designed for the decoder and MSE is designed for the diffusion model. However, x_start is not detached in these two losses, so they also influence the embedding part. Is this a bug or a deliberate design choice?

@XiangLi1999
Owner

Hi, this is not a bug. We need to backprop the gradient signal from these two losses to the embedding function in order to train jointly.
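For intuition, here is a minimal sketch (illustrative names only, not the repository's actual code) of why leaving x_start attached lets the MSE term update the embedding table: the gradient flows back through the noised latent x_t into the embedding lookup.

```python
import torch

torch.manual_seed(0)
emb = torch.nn.Embedding(10, 4)    # stand-in for the trainable embedding function
net = torch.nn.Linear(4, 4)        # stand-in for the denoising network
input_ids = torch.tensor([1, 2, 3])

x_start = emb(input_ids)           # NOT detached: stays in the autograd graph
noise = torch.randn_like(x_start)
x_t = 0.9 * x_start + 0.1 * noise  # schematic forward diffusion step
model_output = net(x_t)            # network tries to predict the noise

# schematic "mse" term with the EPSILON target
loss = ((noise - model_output) ** 2).mean()
loss.backward()

# gradient reaches the embedding weights through x_t -> x_start
print(emb.weight.grad.abs().sum() > 0)
```

If x_start were detached (`x_start = emb(input_ids).detach()`), the embedding weights would receive no gradient from this term and the embedding could not be trained jointly.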


summmeer commented Aug 5, 2022

@XiangLi1999 Hi, I'm curious about the loss function too. I cannot understand why we need to compute decoder_nll using x_start in decoder_nll = self.token_discrete_loss(x_start, get_logits, input_ids). I think x_start is the word embedding with extra noise added, and this decoder loss is trying to recover the noise, which has no relation to the diffusion model. Besides, this is not consistent with the formulation of the loss function, in the \log p_theta(w|x_0) part. Why can't we replace x_start with the predicted model_out_x_start? Wouldn't that be more reasonable? (But the experimental results were not good.)
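For reference, here is a rough sketch of what a token-level decoder NLL over x_start could look like. The projection of the continuous vectors back onto the vocabulary through the embedding matrix is an assumption for illustration, not the repository's exact token_discrete_loss:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 10, 4
emb = torch.nn.Embedding(vocab_size, dim)
input_ids = torch.tensor([[1, 2, 3]])
x_start = emb(input_ids)            # continuous vectors to be "rounded" to tokens

def get_logits(x):
    # illustrative: score each position against every embedding row
    return x @ emb.weight.t()       # (batch, seq, vocab)

logits = get_logits(x_start)
decoder_nll = F.cross_entropy(
    logits.view(-1, vocab_size), input_ids.view(-1), reduction="none"
).view(input_ids.shape)
print(decoder_nll.mean().item())    # average per-token NLL
```

Under this reading, the term anchors the embedding space so that clean embeddings remain decodable to the right tokens, which is one reason it is computed on x_start rather than on the model's prediction.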

@AlonzoLeeeooo

> @XiangLi1999 Hi, I'm curious about the loss function too. I cannot understand why we need to compute decoder_nll using x_start in decoder_nll = self.token_discrete_loss(x_start, get_logits, input_ids). I think x_start is the word embedding with extra noise added, and this decoder loss is trying to recover the noise, which has no relation to the diffusion model. Besides, this is not consistent with the formulation of the loss function, in the \log p_theta(w|x_0) part. Why can't we replace x_start with the predicted model_out_x_start? Wouldn't that be more reasonable? (But the experimental results were not good.)

Hi @summmeer,
I share the same feeling about this problem. It seems that decoder_nll has little correlation with the training of the diffusion model, and I wonder how it performed in your experiments. While training the model, I found that the NLL loss stayed at zero for a long period (about 8k iterations); at around 10k steps it started to increase. Have you encountered a similar situation? How did the NLL loss behave in your experiments?

Thanks in advance for your reply. It would help me a lot.

Best,
