Conversation
Is there any difference in speed or BLEU?
Aren't those just different orderings of the same set of channels? That shouldn't impact the accuracy in that case, right?
Thanks, always good to be the same as in the paper!
I hope it has little influence on results, but it's always good to have code that does exactly what's written in the paper. Thanks!
The speed and BLEU are almost the same as with the original implementation.
One problem with this PR is that it breaks all trained models: every Transformer model using these signals needs to be re-trained now. While I like this PR a lot, it is causing quite some trouble, so I might go back on it. Sorry if we do that, and thanks in any case!
We might need some dirty work, e.g. checking the model version to make the positional embeddings backward compatible. But yes, the easiest way is apparently to keep the original code.
Thank you
Shanbo
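To make the backward-compatibility idea above concrete, here is a purely illustrative NumPy sketch of a version-gated positional signal. The function name and the `version` flag are assumptions for illustration only, not tensor2tensor's actual API.

```python
import numpy as np

def get_timing_signal(length, channels, version=2, max_timescale=10000.0):
    """Illustrative sketch: pick the channel ordering of the sinusoidal
    positional signal based on a hypothetical checkpoint `version`, so models
    trained with the old ordering could keep using it. The name and the
    `version` flag are assumptions, not the repository's real API."""
    positions = np.arange(length)[:, np.newaxis]        # (length, 1)
    dims = np.arange(0, channels, 2)[np.newaxis, :]     # (1, channels/2)
    angles = positions / np.power(max_timescale, dims / channels)
    if version >= 2:
        # Paper-style ordering: sin on even channels, cos on odd channels.
        signal = np.zeros((length, channels))
        signal[:, 0::2] = np.sin(angles)
        signal[:, 1::2] = np.cos(angles)
    else:
        # Old ordering (assumed): all sines first, then all cosines.
        signal = np.concatenate([np.sin(angles), np.cos(angles)], axis=1)
    return signal
```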
Is it surprising that one way vs. the other doesn't really matter? Also, random question... who first came up with the idea of adding a signal like this? What's the theoretical motivation? Does anybody know?
If you mean the difference discussed in this thread, it is really just a reordering of the dimensions, so I don't find it surprising that it does not matter for speed and BLEU (although guessing the influence on speed is tricky because of many low-level effects: memory alignment, GPU/TPU-internal implementation details, etc.).
According to Attention Is All You Need, learned positional embeddings were used in Facebook's convolutional seq2seq. I am not aware of any prior usage of positional embeddings (so I was wrong - see the comment below): they were not needed in RNNs, which track position implicitly, and they brought only a small improvement (0.5 BLEU) in convolutional networks, because a CNN already has some information about position even without them. The Transformer has no information about position without positional embeddings (or relative-position attention), so they are essential there. According to Attention Is All You Need, Noam Shazeer (one of the authors of the paper) proposed the parameter-free position representation, i.e. the fixed sin/cos positional embeddings (unlike the learned positional embeddings used by Facebook).
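For reference, a minimal NumPy sketch of the fixed sin/cos encoding as defined in the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name is illustrative, not taken from the repository.

```python
import numpy as np

def sinusoidal_encoding(length, d_model, max_timescale=10000.0):
    """PE(pos, 2i)   = sin(pos / max_timescale**(2i / d_model))
       PE(pos, 2i+1) = cos(pos / max_timescale**(2i / d_model))"""
    positions = np.arange(length)[:, np.newaxis]      # (length, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]    # (1, d_model/2)
    angles = positions / np.power(max_timescale, dims / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)   # even channels: sine
    pe[:, 1::2] = np.cos(angles)   # odd channels: cosine
    return pe
```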
The first example of positional embeddings that I'm familiar with was in a follow-up to the original memory networks paper from Jason Weston's team at FAIR, https://arxiv.org/abs/1503.08895 (section 4.1, called "positional encoding"). I wouldn't be surprised if there was something similar even earlier.
Thank you!
This PR fixes the positional embeddings in the code.
As mentioned in the paper, the positional embedding should be
while in the code, it's calculated as
Please check the differences.
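Since the two formulas referenced above were attached as images, here is a hedged NumPy sketch of my reading of the difference: the paper interleaves sin and cos across channels, while the pre-PR code (as I understand this thread) placed all sines in the first half of the channels and all cosines in the second half. Per position, the two layouts contain the same values, just in a different channel order; function names are illustrative.

```python
import numpy as np

def paper_layout(length, d_model, max_timescale=10000.0):
    # Interleaved, as in the paper: PE(pos, 2i) = sin(...), PE(pos, 2i+1) = cos(...).
    positions = np.arange(length)[:, np.newaxis]
    dims = np.arange(0, d_model, 2)[np.newaxis, :]
    angles = positions / np.power(max_timescale, dims / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def concatenated_layout(length, d_model, max_timescale=10000.0):
    # Assumed pre-PR layout: sines in the first d_model/2 channels,
    # cosines in the second half, at the same frequencies.
    positions = np.arange(length)[:, np.newaxis]
    dims = np.arange(0, d_model, 2)[np.newaxis, :]
    angles = positions / np.power(max_timescale, dims / d_model)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

a = paper_layout(8, 16)
b = concatenated_layout(8, 16)
# Per position, the two layouts hold the same set of values, i.e. one is a
# fixed permutation of the other along the depth dimension.
assert np.allclose(np.sort(a, axis=1), np.sort(b, axis=1))
```

If that reading is right, it also explains why speed and BLEU are essentially unchanged: the model only sees a fixed permutation of the same channels.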