Is it that LSRA combines multi-head attention and convolution in a multi-branch manner, while ConvBERT integrates convolution into the transformer blocks themselves?
If the answer is yes, what are the pros and cons of these two approaches? Do you have experiments?
LSRA is designed for machine translation and abstractive summarization. It combines dynamic convolution and multi-head attention in a two-branch manner.
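A rough PyTorch sketch of that two-branch layout (simplified for illustration, not the Lite Transformer code; the channel split and the plain depthwise convolution standing in for dynamic convolution are assumptions):

```python
import torch
import torch.nn as nn

class TwoBranchBlock(nn.Module):
    """Toy two-branch block: attention on one half of the channels, conv on the other."""
    def __init__(self, dim, num_heads=4, kernel_size=7):
        super().__init__()
        half = dim // 2
        # Branch 1: multi-head self-attention (long-range / global context).
        self.attn = nn.MultiheadAttention(half, num_heads, batch_first=True)
        # Branch 2: depthwise conv (short-range / local context); a stand-in for dynamic conv.
        self.conv = nn.Conv1d(half, half, kernel_size,
                              padding=kernel_size // 2, groups=half)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (batch, seq_len, dim)
        a, c = x.chunk(2, dim=-1)                           # split channels between branches
        a, _ = self.attn(a, a, a)                           # global branch
        c = self.conv(c.transpose(1, 2)).transpose(1, 2)    # local branch
        return self.proj(torch.cat([a, c], dim=-1))         # merge the two branches

x = torch.randn(2, 16, 64)
print(TwoBranchBlock(64)(x).shape)                          # torch.Size([2, 16, 64])
```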
ConvBERT is a pre-training based model that can be fine-tuned on downstream tasks such as sentence classification. We also propose a novel span-based dynamic convolution operator and combine it with self-attention to form the mixed attention block.
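A simplified sketch of the span-based dynamic convolution idea (not the official ConvBERT implementation; the module and variable names, shapes, and the single shared kernel per position are assumptions). The convolution kernel applied at each position is generated from a local span of the input rather than a single token; in the mixed attention block this output would be concatenated with the self-attention output, which is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanDynamicConv(nn.Module):
    """Toy span-based dynamic convolution: per-position kernels from a local span."""
    def __init__(self, dim, kernel_size=5):
        super().__init__()
        self.k = kernel_size
        self.query = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        # Depthwise conv summarises a local span, so the generated kernel
        # depends on a window of tokens rather than a single token.
        self.span_key = nn.Conv1d(dim, dim, kernel_size,
                                  padding=kernel_size // 2, groups=dim)
        self.kernel_gen = nn.Linear(dim, kernel_size)

    def forward(self, x):                                        # x: (B, T, D)
        q = self.query(x)
        v = self.value(x)
        span = self.span_key(x.transpose(1, 2)).transpose(1, 2)  # span summary, (B, T, D)
        # Per-position kernel generated from query * span summary, shared across channels.
        kernels = F.softmax(self.kernel_gen(q * span), dim=-1)   # (B, T, K)
        # Unfold values into local windows and apply the dynamic kernels.
        v = F.pad(v.transpose(1, 2), (self.k // 2, self.k // 2)) # (B, D, T + K - 1)
        windows = v.unfold(-1, self.k, 1)                        # (B, D, T, K)
        return torch.einsum('bdtk,btk->btd', windows, kernels)   # (B, T, D)

x = torch.randn(2, 16, 64)
print(SpanDynamicConv(64)(x).shape)                              # torch.Size([2, 16, 64])
```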
Experiments comparing span-based dynamic convolution with dynamic convolution can be found in Section 4.3, Table 2 of our paper.
There you can see that our span-based dynamic convolution outperforms dynamic convolution in this pre-training setting. It is hard, however, to directly compare LSRA with ConvBERT.
LSRA: Lite Transformer with Long-Short Range Attention.
LSRA also integrates convolution operations into transformer blocks; I'm just wondering what makes ConvBERT differ from LSRA.