DeBERTa P-Tuning v2 speed #36

Closed
kefirski opened this issue Aug 3, 2022 · 6 comments

Comments

kefirski commented Aug 3, 2022

I've observed that training DeBERTaV2 with P-Tuning v2 takes significantly more time than other methods. Have you observed this behaviour?

It even takes significantly more time than P-Tuning v1, despite the fact that v1 has a higher attention complexity.

The issue seems to be the ad-hoc implementation of past_key_values for DeBERTa, which is the only difference in the backbone model code between v1 and v2, but I can't pinpoint the specific reason.

kefirski commented Aug 3, 2022

The same holds for DeBERTa V1.
I have also noticed that GPU utilization with P-Tuning v2 is dramatically lower than with P-Tuning v1.

dyh1998 commented Aug 11, 2022

I have a question: which is faster, P-Tuning v2 or fine-tuning? To my understanding, P-Tuning v2 should be slower than P-Tuning (v1), because it updates more parameters.

@kefirski

While P-Tuning v1 has fewer parameters to update, evaluating the attention mechanism for this scheme requires O((n + d)^2) operations, since the sequence length is naively extended by a prompt of length d.

For P-Tuning v2, the complexity is O(n(n + d)), since only the tokens of the original sequence attend to the prefixes, which enter the computation via past_key_values. Furthermore, you don't have to evaluate the remaining Transformer sub-layers (e.g., the position-wise feed-forward mappings) on the prefixes, as P-Tuning v1 does. A toy shape comparison is sketched below.
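
To make the shape argument concrete, here is a minimal sketch with toy dimensions (the values of n, d, the single-head setup, and all tensor names are illustrative assumptions, not code from this repo):

```python
import torch

n, d, h = 128, 100, 64  # sequence length, prefix length, head dim (toy values)

x_q = torch.randn(1, n, h)          # queries come only from real tokens

# P-Tuning v1: prompt tokens are prepended to the input, so queries,
# keys and values all have length n + d -> (n + d) x (n + d) score matrix
k_v1 = torch.randn(1, n + d, h)
q_v1 = torch.randn(1, n + d, h)
scores_v1 = torch.einsum("bqh,bkh->bqk", q_v1, k_v1)
print(scores_v1.shape)              # torch.Size([1, 228, 228])

# P-Tuning v2: the prefix enters only as extra keys/values via
# past_key_values, so queries keep length n -> n x (n + d) score matrix
prefix_k = torch.randn(1, d, h)
k_v2 = torch.cat([prefix_k, torch.randn(1, n, h)], dim=1)
scores_v2 = torch.einsum("bqh,bkh->bqk", x_q, k_v2)
print(scores_v2.shape)              # torch.Size([1, 128, 228])
```

In the v2 case the prefix never produces hidden states of its own, so the feed-forward sub-layers also only ever see the n original positions.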

@kefirski

I observed that evaluating P-Tuning v2 with a prefix of length 100 is about twice as fast as P-Tuning v1 for RoBERTa. For DeBERTa, however, P-Tuning v2 is about 40 times slower, which does not seem like a legitimate result.
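
For reference, a hedged sketch of how such a forward-pass comparison can be timed (time_forward, the model variables, and the dict-style batch are placeholders, and it assumes the models and batch live on a CUDA device):

```python
import time
import torch

def time_forward(model, inputs, n_iters=50, warmup=5):
    """Average wall-clock time of a forward pass, with CUDA synchronisation."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):          # warm-up iterations (not timed)
            model(**inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(**inputs)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

# hypothetical usage: same batch, same backbone, different prompt schemes
# t_v1 = time_forward(model_ptuning_v1, batch)
# t_v2 = time_forward(model_ptuning_v2, batch)
# print(f"v2 / v1 time ratio: {t_v2 / t_v1:.2f}")
```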

dyh1998 commented Aug 11, 2022

Alright, thanks for your patient explanation; it made me realize I'm missing some details of the model architecture. I don't know much about DeBERTa, so maybe somebody else can help you.

Best,

@Xiao9905 (Member)

@kefirski Hi,

Thanks for your interest in our work! Sorry for the late reply.

> The issue seems to be the ad-hoc implementation of past_key_values for DeBERTa, which is the only difference in the backbone model code between v1 and v2, but I can't pinpoint the specific reason.

Yes, I think this is the reason. At the time we were experimenting with P-Tuning v2, DeBERTa had not yet been officially implemented in Hugging Face transformers, so we implemented the past_key_value handling ourselves. We are sorry that our own implementation can be slower than the official Hugging Face one.
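
For anyone comparing implementations, this is roughly the pattern a past_key_value-aware self-attention follows (a simplified single-layer sketch, not the actual code in this repo; DeBERTa's disentangled attention additionally mixes in relative-position terms, which is where a hand-rolled implementation can diverge in speed):

```python
import torch
import torch.nn as nn

class SelfAttentionWithPrefix(nn.Module):
    """Minimal self-attention that accepts a past_key_value prefix."""

    def __init__(self, hidden, n_heads):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, hidden // n_heads
        self.q = nn.Linear(hidden, hidden)
        self.k = nn.Linear(hidden, hidden)
        self.v = nn.Linear(hidden, hidden)

    def _split(self, x):
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)

    def forward(self, hidden_states, past_key_value=None):
        q = self._split(self.q(hidden_states))            # (b, heads, n, d_h)
        k = self._split(self.k(hidden_states))
        v = self._split(self.v(hidden_states))
        if past_key_value is not None:
            past_k, past_v = past_key_value               # (b, heads, prefix, d_h)
            k = torch.cat([past_k, k], dim=2)             # keys:   prefix + n
            v = torch.cat([past_v, v], dim=2)             # values: prefix + n
        scores = q @ k.transpose(-1, -2) / self.head_dim ** 0.5  # (b, heads, n, prefix + n)
        return torch.softmax(scores, dim=-1) @ v          # per-head context, (b, heads, n, d_h)
```

Only the key/value tensors grow by the prefix length; the queries and everything downstream of attention keep the original sequence length, so a correct implementation should not be slower than the v1 setup.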
