
Why add positional embedding instead of concatenate? #1591

Closed
Akella17 opened this issue May 30, 2019 · 10 comments

Comments

@Akella17

Apart from saving some memory, is there any reason we add the positional embeddings instead of concatenating them? It seems more intuitive to concatenate useful input features than to add them.

From another perspective, how can we be sure that the Transformer network can separate the densely informative word embeddings from the position information of the positional encoding?

@martinpopel
Contributor

martinpopel commented May 30, 2019

Interesting questions with no simple answers. Just a few comments:

  • I would guess someone has already tried concatenating PE (positional embeddings/encoding) and WE (word embeddings) instead of summing, but I am not aware of any publications with results and a comparison. Technically it is simple: you need to add a projection layer to squeeze the dimension back to the original size (see the sketch after this list), which means extra parameters, but this should not be a problem for training (and memory-wise it should be fine as well). Alternatively, you could train WE and PE with smaller dimensions so their concatenation has the original hidden size. Either way, you can experiment with giving more dimensions to WE than to PE (and making sure PE uses all of its dimensions effectively, see below).
  • By default, T2T uses max_timescale=10,000, i.e. the slowest sinusoid repeats only after tens of thousands of positions. However, with a maximum sequence length of about 20 and hidden_size=512 (for transformer_base; 1024 for big), most of the dimensions are used only by WE and the contribution of PE is almost constant (either 0 or 1). See a visualization of PE taken from https://jalammar.github.io/illustrated-transformer/#representing-the-order-of-the-sequence-using-positional-encoding

[visualization of the positional encoding]

  • Note that in the Transformer, WE are trained from scratch with the PE summing, so it is probable that WE are trained so they don't encode any important information in the first few dimensions, because these dimensions are used intensively by PE. Thus, when using not-too-long sequences, the current T2T code is (or can be, see below) effectively very similar to concatenation, and it would be easy to separate the PE and WE information.
  • Based on your question "how can we be sure that the Transformer network can separate [WE and PE]", it seems you consider separability an advantage. But what if the interaction between WE and PE (in some dimensions) caused by summing is beneficial for the final Transformer? Maybe some WE properties should be encoded differently for words at specific positions (modulo the sin/cos period) in a sentence. I have seen positive effects (and sometimes also negative ones) of summing over concatenation in various tasks, e.g. summing left-to-right and right-to-left LSTM states, or summing character-based and word-based embeddings. It is difficult for me to imagine examples of synergy caused by summing WE and PE, but that may just be a limitation of my imagination. Note that thanks to skip connections, PE+WE information is propagated to the later layers of the Transformer encoder, where syntactic phenomena are handled (I am thinking about MT), so there are more opportunities for such synergy.
  • It would be very interesting to explore this, i.e. the properties of WE trained with different versions of the Transformer (relative to e.g. word2vec WE), and to compare them also with the relative-position Transformer.
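For concreteness, here is a minimal sketch of the two options discussed in the first bullet. It is not tensor2tensor code; the dimension choices and the Dense projection layer are illustrative assumptions.

# Minimal sketch (assumed shapes, not tensor2tensor code): summing WE and PE
# versus concatenating them and projecting back down to the hidden size.
import tensorflow as tf

d_model = 512            # hidden size of the Transformer (transformer_base)
d_we, d_pe = 512, 512    # dimensions chosen for word and positional embeddings

we = tf.random.normal([1, 20, d_we])  # stand-in word embeddings (batch, length, d_we)
pe = tf.random.normal([1, 20, d_pe])  # stand-in positional encodings

# Option 1: summing (what the Transformer does); requires d_we == d_pe == d_model.
summed = we + pe

# Option 2: concatenating, then projecting back to d_model with an extra learned
# layer, which costs (d_we + d_pe) * d_model + d_model additional parameters.
project = tf.keras.layers.Dense(d_model)
concatenated = project(tf.concat([we, pe], axis=-1))

print(summed.shape, concatenated.shape)  # both (1, 20, 512)

Note that option 2 contains option 1 as a special case: the projection can learn to simply add the two halves of the concatenation.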

@Akella17
Author

Akella17 commented Jun 1, 2019

Hey, thanks for the detailed reply. When WE are learnable parameters, I agree that Transformer training might shape them such that the information of WE and PE is preserved (recoverable by the Transformer) even after addition. As you suggested, the Transformer might also learn useful features from the addition of WE and PE.

However, my original doubt still persists: why not just concatenate? As you suggested, we can add a projection layer to bring the input dimension down to the Transformer hidden size. The advantage of an additional layer is that it can model more complicated relationships between WE and PE (obviously including simple addition). This advantage comes at the cost of additional parameters, which in most cases is a trivial increase in memory consumption, given the size of a (practical) Transformer.
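For a rough sense of that cost (illustrative numbers, not from the thread): with WE and PE each of dimension 512, the projection from the 1024-dimensional concatenation back to hidden_size=512 adds about half a million parameters, which is well under 1% of the roughly 65M parameters of transformer_base.

# Back-of-the-envelope count of the extra projection parameters
# (assumed dimensions: d_we = d_pe = 512, d_model = 512).
d_we, d_pe, d_model = 512, 512, 512
extra_params = (d_we + d_pe) * d_model + d_model   # weight matrix + bias
print(extra_params)                                # 524800
print(extra_params / 65e6)                         # ~0.8% of transformer_base's ~65M parameters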

@Akella17 Akella17 closed this as completed Jun 9, 2019
@Khaled-Abdelhamid

If you find a good answer about avoiding this contamination (of WE by PE), can you please refer to it here?

@sooheon

sooheon commented Oct 16, 2020

While we're discussing the relative merits of "concat then project to D" versus "project to D and sum", couldn't we go one step further and decide the mixture of WE and PE via attention? Each would project keys and values; the query would be projected from a global context or from WE.
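One possible reading of that proposal, as a hedged sketch (shapes, layer names, and the choice of query source are my assumptions, not code from this repo): treat WE and PE as two candidate inputs per token and let a learned query attend over them to produce per-token mixture weights.

# Hedged sketch of one interpretation of the attention-based mixing idea above.
import tensorflow as tf

d_model, length = 512, 20
we = tf.random.normal([1, length, d_model])   # stand-in word embeddings
pe = tf.random.normal([1, length, d_model])   # stand-in positional encodings

wq = tf.keras.layers.Dense(d_model)  # query from WE (could instead come from a global context vector)
wk = tf.keras.layers.Dense(d_model)  # key projection shared by WE and PE
wv = tf.keras.layers.Dense(d_model)  # value projection shared by WE and PE

candidates = tf.stack([we, pe], axis=2)                  # (1, length, 2, d_model)
q = wq(we)[:, :, None, :]                                # (1, length, 1, d_model)
k, v = wk(candidates), wv(candidates)                    # (1, length, 2, d_model)

scores = tf.reduce_sum(q * k, axis=-1) / d_model ** 0.5  # (1, length, 2)
weights = tf.nn.softmax(scores, axis=-1)                 # per-token mixture of WE vs PE
mixed = tf.reduce_sum(weights[..., None] * v, axis=2)    # (1, length, d_model)
print(mixed.shape)                                       # (1, 20, 512)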

@CarlosEduardoSaMotta

Perhaps because these sums form a cloud around a point in the word-embedding space, carrying information about the position. Think, for example, of a word in a 1-D embedding and suppose the words are evenly spaced: 1.0, 2.0, 3.0, ... If you add a sequence of equally spaced small numbers that represent distances from the beginning of the sequence, say 0.01, 0.02, 0.03, ..., you get a cluster of position information around the number that encodes the word. For instance, 1.01, 1.02, 1.05, ... all encode the same word, just in different positions. As long as the granularity of the two encodings is different, you can get this result.
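A toy version of that picture, with assumed numbers just to show the clustering:

# Toy illustration (assumed values): a 1-D "word embedding" at 1.0 plus small
# position offsets stays clustered near 1.0, so the word identity remains
# recoverable after the sum as long as the two scales differ enough.
word_value = 1.0                       # hypothetical 1-D embedding of one word
position_offsets = [0.01, 0.02, 0.03]  # hypothetical 1-D positional encodings
summed = [word_value + p for p in position_offsets]
print(summed)                          # [1.01, 1.02, 1.03]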

@KeithYJohnson

Is anyone aware of any papers where they concatenated positional embeddings instead of adding them? I'm wondering if anyone has even tried it.

@sooheon

sooheon commented Mar 10, 2021

Check out DeBERTa, which disentangles the position and content embeddings. This seems like an explicit inductive bias that content and (relative) position should be treated differently: they get explicitly separate weight matrices for projection.
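Very roughly, that idea can be sketched as computing attention scores from separate content and position projections. This is a simplified sketch, not the paper's exact formulation (relative-position bucketing and several details are omitted; shapes and layer names are assumptions):

# Simplified sketch of "separate projections for content and position", in the
# spirit of DeBERTa's disentangled attention (not the paper's exact equations).
import tensorflow as tf

d, length = 64, 8
content = tf.random.normal([1, length, d])   # content (word) embeddings
position = tf.random.normal([1, length, d])  # position embeddings (relative positions in the paper)

qc, kc = tf.keras.layers.Dense(d), tf.keras.layers.Dense(d)  # content projections
qr, kr = tf.keras.layers.Dense(d), tf.keras.layers.Dense(d)  # position projections

c2c = tf.matmul(qc(content), kc(content), transpose_b=True)   # content-to-content
c2p = tf.matmul(qc(content), kr(position), transpose_b=True)  # content-to-position
p2c = tf.matmul(qr(position), kc(content), transpose_b=True)  # position-to-content

scores = (c2c + c2p + p2c) / (3 * d) ** 0.5
print(scores.shape)  # (1, 8, 8)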

@jeffreymei

This is a really good description IMO

@Anshita1Saxena

Just wanted to add this link here for some mathematical context as well: https://enccs.github.io/gnn_transformers/notebooks/session_1/1b_vector_sums_vs_concatenation/

@yaoshiang

yaoshiang commented Feb 8, 2023

The following is informed conjecture, not proven fact.

If you look at how much each scalar in the positional embedding vector changes as a function of position, you'll find that many of the scalars barely change at all. You can visualize this with any positional-embedding plot, where the x-axis usually runs over the (e.g. 512) dimensions of the vector and the y-axis is the position.

For example, this image is from Jay Alammar's well-regarded "The Illustrated Transformer":

[positional-encoding heatmap from "The Illustrated Transformer"]

Let's try to do this mathematically as well. The implementation of PEs that Jay references is in this Google GitHub repo:

https://github.com/tensorflow/tensor2tensor/tree/23bd23b9830059fbc349381b70d9429b5c40a139

Running the function on a PE/WE of length 512 and a max sentence length of 128, let's look at how much the final value in the vector actually changes from the first position to the 64th position to the final position. Answer: not much.
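To make the indexing below concrete, here is a rough reconstruction (an assumption on my part, not the notebook's exact code) of a sinusoidal timing signal in the sin-then-cos layout tensor2tensor uses, so that signal has shape (1, length, channels). The linked notebook at the end is the authoritative source, and the exact values printed below come from it, not from this sketch.

# Rough reconstruction (assumed, not the notebook's code) of a tensor2tensor-style
# sinusoidal timing signal with shape (1, length, channels).
import numpy as np
import tensorflow as tf

def timing_signal(length, channels, min_timescale=1.0, max_timescale=1.0e4):
    position = np.arange(length, dtype=np.float32)
    num_timescales = channels // 2
    log_timescale_increment = (
        np.log(max_timescale / min_timescale) / max(num_timescales - 1, 1))
    inv_timescales = min_timescale * np.exp(
        -log_timescale_increment * np.arange(num_timescales, dtype=np.float32))
    scaled_time = position[:, None] * inv_timescales[None, :]
    signal = np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)
    return tf.constant(signal[None, :, :])  # (1, length, channels)

signal = timing_signal(length=128, channels=512)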

print(signal[0, 0, -1])
print(signal[0, 63, -1])
print(signal[0, 127, -1])

tf.Tensor(1.0, shape=(), dtype=float32)
tf.Tensor(0.99998015, shape=(), dtype=float32)
tf.Tensor(0.99991935, shape=(), dtype=float32)

Ditto for a value 16 steps away from the final location:

print(signal[0, 0, -16])
print(signal[0, 63, -16])
print(signal[0, 127, -16])

tf.Tensor(1.0, shape=(), dtype=float32)
tf.Tensor(0.9984067, shape=(), dtype=float32)
tf.Tensor(0.9935305, shape=(), dtype=float32)

I saw elsewhere that BERT's WEs are typically roughly in the range [-2, 2], so adding a 0.007 delta from the PE would not move the WE very much at the -16th position.

So what I think is probably happening is that only ~256 of the PE vector's values actually move around as a function of position; the rest are ~constant. The learned WE (Transformers don't use pre-learned WE like word2vec or GloVe) then figures out to use only the other ~256 elements. So really, it's conceptually a concat.

Notebook here: https://colab.research.google.com/drive/14RGALTsPIYGAuIByXGutK-aYN-PikWzF
