DeBERTa P-Tuning v2 speed #36

Closed
kefirski opened this issue Aug 3, 2022 · 6 comments

Comments

kefirski commented Aug 3, 2022

I've observed that training DeBERTaV2 with P-Tuning v2 takes significantly more time than other methods. Have you observed this behaviour?

It even takes significantly more time than P-Tuning v1, despite the fact that v1 has a higher attention complexity.

The issue seems to be the ad-hoc implementation of past_key_values for DeBERTa, which is the only difference in the backbone model code between v1 and v2, but I can't pinpoint the specific reason.

kefirski commented Aug 3, 2022

The same holds for DeBERTa V1.
I have also noticed that GPU utilization with P-Tuning v2 is dramatically lower than with P-Tuning v1.

dyh1998 commented Aug 11, 2022

I have a question: which is faster, P-Tuning v2 or fine-tuning? To my understanding, P-Tuning v2 should be slower than P-Tuning (v1), because it updates more parameters.

@kefirski

While P-Tuning v1 has fewer parameters to update, evaluating the attention mechanism for this scheme requires O((n + d)^2) operations, since the sequence length is naively extended by a prompt of length d.

For P-Tuning v2, the complexity is O(n(n + d)), since only the tokens of the original sequence attend to the prefixes, which enter the computation via past_key_values. Furthermore, you don't have to evaluate the remaining Transformer sub-layers (e.g., the position-wise feed-forward mappings) on the prefixes, as P-Tuning v1 does. A toy shape comparison is sketched below.
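
To make the shape argument concrete, here is a minimal sketch with toy dimensions (the values of n, d, the single-head setup, and all tensor names are illustrative assumptions, not code from this repo):

```python
import torch

n, d, h = 128, 100, 64  # sequence length, prefix length, head dim (toy values)

x_q = torch.randn(1, n, h)          # queries come only from real tokens

# P-Tuning v1: prompt tokens are prepended to the input, so queries,
# keys and values all have length n + d -> (n + d) x (n + d) score matrix
k_v1 = torch.randn(1, n + d, h)
q_v1 = torch.randn(1, n + d, h)
scores_v1 = torch.einsum("bqh,bkh->bqk", q_v1, k_v1)
print(scores_v1.shape)              # torch.Size([1, 228, 228])

# P-Tuning v2: the prefix enters only as extra keys/values via
# past_key_values, so queries keep length n -> n x (n + d) score matrix
prefix_k = torch.randn(1, d, h)
k_v2 = torch.cat([prefix_k, torch.randn(1, n, h)], dim=1)
scores_v2 = torch.einsum("bqh,bkh->bqk", x_q, k_v2)
print(scores_v2.shape)              # torch.Size([1, 128, 228])
```

In the v2 case the prefix never produces hidden states of its own, so the feed-forward sub-layers also only ever see the n original positions.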

@kefirski

I observed that evaluating P-Tuning v2 with a prefix of length 100 is about twice as fast as P-Tuning v1 for RoBERTa. For DeBERTa, however, P-Tuning v2 is about 40 times slower, which does not seem like a legitimate result.
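
For reference, a hedged sketch of how such a forward-pass comparison can be timed (time_forward, the model variables, and the dict-style batch are placeholders, and it assumes the models and batch live on a CUDA device):

```python
import time
import torch

def time_forward(model, inputs, n_iters=50, warmup=5):
    """Average wall-clock time of a forward pass, with CUDA synchronisation."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):          # warm-up iterations (not timed)
            model(**inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(**inputs)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

# hypothetical usage: same batch, same backbone, different prompt schemes
# t_v1 = time_forward(model_ptuning_v1, batch)
# t_v2 = time_forward(model_ptuning_v2, batch)
# print(f"v2 / v1 time ratio: {t_v2 / t_v1:.2f}")
```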

dyh1998 commented Aug 11, 2022

Alright, thanks for your patient explanation; it made me realize I'm missing some details of the model architecture. I don't know much about DeBERTa, so maybe somebody else can help you.

Best,

@Xiao9905 (Member)

@kefirski Hi,

Thanks for your interest in our work! Sorry for the late reply.

> The issue seems to be the ad-hoc implementation of past_key_values for DeBERTa, which is the only difference in the backbone model code between v1 and v2, but I can't pinpoint the specific reason.

Yes, I think this is the reason. At the time we were experimenting with P-Tuning v2, DeBERTa had not yet been officially implemented in Hugging Face transformers, so we implemented the past_key_value handling ourselves. We are sorry that our own implementation can be slower than the official Hugging Face one.
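
For anyone comparing implementations, this is roughly the pattern a past_key_value-aware self-attention follows (a simplified single-layer sketch, not the actual code in this repo; DeBERTa's disentangled attention additionally mixes in relative-position terms, which is where a hand-rolled implementation can diverge in speed):

```python
import torch
import torch.nn as nn

class SelfAttentionWithPrefix(nn.Module):
    """Minimal self-attention that accepts a past_key_value prefix."""

    def __init__(self, hidden, n_heads):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, hidden // n_heads
        self.q = nn.Linear(hidden, hidden)
        self.k = nn.Linear(hidden, hidden)
        self.v = nn.Linear(hidden, hidden)

    def _split(self, x):
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)

    def forward(self, hidden_states, past_key_value=None):
        q = self._split(self.q(hidden_states))            # (b, heads, n, d_h)
        k = self._split(self.k(hidden_states))
        v = self._split(self.v(hidden_states))
        if past_key_value is not None:
            past_k, past_v = past_key_value               # (b, heads, prefix, d_h)
            k = torch.cat([past_k, k], dim=2)             # keys:   prefix + n
            v = torch.cat([past_v, v], dim=2)             # values: prefix + n
        scores = q @ k.transpose(-1, -2) / self.head_dim ** 0.5  # (b, heads, n, prefix + n)
        return torch.softmax(scores, dim=-1) @ v          # per-head context, (b, heads, n, d_h)
```

Only the key/value tensors grow by the prefix length; the queries and everything downstream of attention keep the original sequence length, so a correct implementation should not be slower than the v1 setup.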
