About BP problem mentioned in the introduction #47

Open
Cooperx521 opened this issue Apr 25, 2024 · 2 comments

Comments

@Cooperx521

Hello~ I recently read your brilliant paper, but I am confused about the BP problem mentioned in the introduction:
Moreover, this would also hinder the back-propagation for the prediction module, which needs to calculate the probability distribution of whether to keep the token even if it is finally eliminated.
My understanding is that the deleted tokens do not participate in subsequent attention calculations, meaning there is no information exchange, and they are also irrelevant to the calculation of the loss. Therefore, it seems that directly deleting these tokens during training does not affect the correct backpropagation of gradients. I am a bit confused about this statement in the paper and would appreciate it if you could clarify any misunderstanding.
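To make my understanding concrete, here is a minimal sketch (the tensor names and shapes are illustrative, not the repository's code): with hard removal, gradients still flow correctly to the kept tokens, and the removed tokens simply receive none.

```python
import torch

# Toy example: 1 image, 4 tokens, 8 channels (shapes are illustrative).
tokens = torch.randn(1, 4, 8, requires_grad=True)   # (B, N, C)
keep_idx = torch.tensor([[0, 2]])                    # tokens a predictor decided to keep

# Physically remove the dropped tokens before any further computation.
kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, 8))
loss = kept.pow(2).mean()
loss.backward()

print(tokens.grad[0, 0])   # kept token: receives a gradient
print(tokens.grad[0, 1])   # dropped token: gradient is all zeros
```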

@raoyongming
Owner

Hi, thanks for your interest in our work. I think the core problem here is to optimize the prediction module. Directly deleting these tokens is correct if we only want to finetune the ViT and improve its performance on incomplete tokens. Here we use a strategy similar to the policy gradient in RL: by keeping the gradients of the probabilities of the dropped tokens, we guide the prediction module to better explore possible sparsification policies.
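Roughly, the idea can be sketched like this (simplified PyTorch with illustrative names, not the actual code in this repo): the hard keep decision comes from a straight-through Gumbel-softmax and masks the attention weights, so dropped tokens are removed from the forward computation while their keep probabilities still receive gradients.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, keep, eps=1e-6):
    # keep: (B, N, 1) hard 0/1 decisions that are still part of the autograd graph.
    attn = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5   # (B, N, N)
    attn = attn.softmax(dim=-1) * keep.transpose(-2, -1)    # zero out dropped keys
    attn = attn / (attn.sum(dim=-1, keepdim=True) + eps)    # renormalize over kept keys
    return attn @ v

B, N, C = 1, 4, 8
q = k = v = torch.randn(B, N, C)

logits = torch.randn(B, N, 2, requires_grad=True)           # prediction module output
keep = F.gumbel_softmax(logits, hard=True)[..., 0:1]        # straight-through hard decision

out = masked_attention(q, k, v, keep)
out.mean().backward()

# The gradient is generally non-zero even for tokens whose keep decision is 0,
# which is what lets the prediction module keep learning about dropped tokens.
print(logits.grad)
```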

@Cooperx521
Author

Thanks a lot for your prompt and insightful response!
I have a bit of confusion and would appreciate your help in identifying where my understanding might be incorrect:
The reason the prediction module can be updated through gradients is that its output, hard_keep_decision (num_image_tokens, 1), establishes a gradient connection with the parameters of the prediction module via the Gumbel-softmax. There are two pathways for the gradient to travel back from the loss to hard_keep_decision:

1. The ratio loss, where both the 0 and 1 entries of hard_keep_decision pass the gradient back successfully.
2. The other losses, with the forward path being hard_keep_decision -> attention map -> subsequent layers -> loss. In the step from hard_keep_decision to the attention map, even though the attention scores of the dropped tokens are zero, the gradient can still be backpropagated there.

Therefore, physically removing the tokens would interrupt the gradient pathway of the dropped tokens in the second path, but in theory the prediction module would still be updated through the gradient path of the kept tokens. However, removing tokens and training with an attention mask (keeping the tokens) would yield different results in practice. I would like to ask whether there are any theoretical advantages or disadvantages between these two methods, and whether retaining the dropped tokens (training with the mask) gives better results.
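To make the comparison concrete, here is the simplified experiment I have in mind (the names, shapes, and toy losses are illustrative, not the repository's code): under removal, the predictor logits of a dropped token receive gradient only from the ratio loss, while under masking they also receive gradient from the task loss.

```python
import torch
import torch.nn.functional as F

B, N, C = 1, 4, 8
x = torch.randn(B, N, C)                                  # token features
target_ratio = 0.5

# (a) "removal": detach the hard decision before it touches the features,
# which mimics physically deleting tokens as far as gradients are concerned.
logits_a = torch.randn(B, N, 2, requires_grad=True)
keep_a = F.gumbel_softmax(logits_a, hard=True)[..., 0:1]
task_loss_a = (x * keep_a.detach()).pow(2).mean()         # task loss cannot reach the predictor
ratio_loss_a = (keep_a.mean() - target_ratio) ** 2
(task_loss_a + ratio_loss_a).backward()

# (b) "masking": the hard decision stays in the graph where it scales the features.
logits_b = torch.randn(B, N, 2, requires_grad=True)
keep_b = F.gumbel_softmax(logits_b, hard=True)[..., 0:1]
task_loss_b = (x * keep_b).pow(2).mean()                  # task loss reaches the predictor
ratio_loss_b = (keep_b.mean() - target_ratio) ** 2
(task_loss_b + ratio_loss_b).backward()

print(logits_a.grad)   # gradient comes only from the ratio loss
print(logits_b.grad)   # gradient also reflects the task loss, including for dropped tokens
```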
