You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository was archived by the owner on Jul 7, 2023. It is now read-only.
Apart from saving some memory, is there any reason we are adding the positional embeddings instead of concatenating them. It seems more intuitive concatenate useful input features, instead of adding them.
From another perspective, how can we be sure that the Transformer network can separate the densely informative word embeddings and the position information of pos encoding?
Apart from saving some memory, is there any reason we are adding the positional embeddings instead of concatenating them. It seems more intuitive concatenate useful input features, instead of adding them.
From another perspective, how can we be sure that the Transformer network can separate the densely informative word embeddings and the position information of pos encoding?