
Why add positional embedding instead of concatenate? #1591

Closed
Akella17 opened this issue May 30, 2019 · 2 comments

Comments

@Akella17 commented May 30, 2019

Apart from saving some memory, is there any reason we are adding the positional embeddings instead of concatenating them? It seems more intuitive to concatenate useful input features instead of adding them.

From another perspective, how can we be sure that the Transformer network can separate the densely informative word embeddings from the positional information in the positional encoding?

@martinpopel (Contributor) commented May 30, 2019

Interesting questions with no simple answers. Just a few comments:

  • I would guess someone has already tried concatenating PE (positional embeddings/encoding) and WE (word embeddings) instead of summing them, but I am not aware of any publications with results and a comparison. Technically it is simple: you need to add a projection layer to squeeze the dimension back to the original size, which means extra parameters, but this should not be a problem for training (and memory-wise it should be fine as well). Alternatively, you could train WE and PE with smaller dimensions, so that their concatenation has the original hidden size. Either way, you can experiment with giving more dimensions to WE than to PE (and with making sure PE uses all of its dimensions effectively, see below). Both variants are sketched in the code after this list.
  • By default, T2T uses max_timescale=10,000, i.e. the 10,001st word has the same PE as the first word. However, with a maximum sequence length of about 20 and hidden_size=512 (for transformer_base; 1024 for transformer_big), most of the dimensions are effectively used only by WE, and the contribution of PE in those dimensions is almost constant (either 0 or 1). See the visualization of PE taken from https://jalammar.github.io/illustrated-transformer/#representing-the-order-of-the-sequence-using-positional-encoding

[visualization of positional encoding]

  • Note that in the Transformer, WE are trained from scratch with the PE summing, so it is probable that WE are trained so that they don't encode any important information in the first few dimensions, because those dimensions are used intensively by PE. Thus, when using not-too-long sequences, the current T2T code is (or can be, see below) effectively very similar to concatenation, and it would be easy to separate the PE and WE information.
  • Based on your question "how can we be sure that the Transformer network can separate [WE and PE]", it seems you consider separability an advantage. But what if the interaction between WE and PE (in some dimensions) caused by summing is beneficial for the final Transformer? Maybe some WE properties should be encoded differently for words at a specific position (modulo the sin/cos period) in a sentence. I have seen positive effects (and sometimes also negative ones) of summing over concatenation in various tasks, e.g. summing left-to-right and right-to-left LSTM states, or summing character-based and word-based embeddings. It is difficult for me to imagine examples of synergy caused by summing WE and PE, but that may just be a limitation of my imagination. Note that thanks to the skip connections, the PE+WE information is propagated to the deeper layers of the Transformer encoder, where syntactic phenomena are handled (I am thinking about MT), so there are more opportunities for such synergy.
  • It would be very interesting to explore this, to study the properties of WE trained with different versions of the Transformer (compared to, e.g., word2vec WE), and to compare all of this with the relative-position Transformer.
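
For concreteness, here is a minimal sketch of both variants from the first bullet (PyTorch-style pseudocode for illustration only, not the actual T2T TensorFlow implementation; `sinusoidal_pe` is just a helper name, not a T2T function):

```python
import math
import torch

def sinusoidal_pe(length, channels, min_timescale=1.0, max_timescale=1.0e4):
    """Sinusoidal positional encoding in the style described above:
    sin/cos pairs with geometrically spaced timescales up to max_timescale."""
    position = torch.arange(length, dtype=torch.float32).unsqueeze(1)       # (length, 1)
    num_timescales = channels // 2
    log_increment = math.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    inv_timescales = min_timescale * torch.exp(
        -log_increment * torch.arange(num_timescales, dtype=torch.float32))
    scaled_time = position * inv_timescales.unsqueeze(0)                    # (length, channels//2)
    return torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1)

length, hidden = 20, 512
we = torch.randn(length, hidden)              # stand-in for learned word embeddings

# Variant 1: summing at full dimensionality (the current default behaviour).
x_sum = we + sinusoidal_pe(length, hidden)    # no extra parameters

# Variant 2: smaller WE and PE concatenated so the result still has hidden size,
# e.g. 448 dimensions for WE and 64 for PE.
we_small = torch.randn(length, hidden - 64)
x_cat = torch.cat([we_small, sinusoidal_pe(length, 64)], dim=1)   # (length, 512)
```
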
@Akella17 (Author) commented Jun 1, 2019

Hey, thanks for the detailed reply. When WE are learnable parameters, I agree that the Transformer training might shape them such that the information in WE and PE is preserved (recoverable by the Transformer) even after addition. As you suggested, the Transformer might also learn useful features from the addition of WE and PE.

However, my original doubt persists: why not just concatenate? As you suggested, we can add a projection layer to bring the input dimension back to the Transformer's hidden size. The advantage of the additional layer is that it can model more complicated relationships between WE and PE (obviously including simple addition). This advantage comes at the cost of additional parameters, which in most cases is a trivial increase in memory consumption given the size of a (practical) Transformer.
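
For concreteness, a rough PyTorch-style sketch of what I mean (illustrative only; the class name is made up and a learned positional table stands in for the sinusoidal encoding):

```python
import torch
import torch.nn as nn

class ConcatProjectEmbedding(nn.Module):
    """Hypothetical input layer: concat(WE, PE) followed by a learned projection
    back to hidden_size, instead of summing WE and PE."""
    def __init__(self, vocab_size, hidden_size, max_len=4096):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_size)
        # Learned positional table for simplicity; a sinusoidal PE table
        # could be plugged in here instead.
        self.pos_emb = nn.Parameter(torch.randn(max_len, hidden_size) * 0.02)
        # The extra parameters mentioned above: 2*hidden_size -> hidden_size.
        self.proj = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, token_ids):                        # token_ids: (batch, length)
        batch, length = token_ids.shape
        we = self.word_emb(token_ids)                    # (batch, length, hidden)
        pe = self.pos_emb[:length].unsqueeze(0).expand(batch, -1, -1)
        return self.proj(torch.cat([we, pe], dim=-1))    # (batch, length, hidden)

# Usage:
# emb = ConcatProjectEmbedding(vocab_size=32000, hidden_size=512)
# x = emb(torch.randint(0, 32000, (8, 20)))              # (8, 20, 512)
```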

@Akella17 closed this Jun 9, 2019