
fix positional embedding #177

Merged
merged 1 commit into tensorflow:master on Jul 25, 2017

Conversation

cshanbo
Contributor

@cshanbo commented Jul 22, 2017

This PR fixes the positional embeddings in the code.

As described in the paper, the positional embedding should interleave sine and cosine channels:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

while in the current code the signal is computed with all sine channels in the first half of the dimensions and all cosine channels in the second half, i.e. the sin and cos blocks are concatenated rather than interleaved.

Please check the differences.
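
For reference, here is a minimal NumPy sketch of the paper's interleaved formulation (an illustration only, not the tensor2tensor implementation; the function name and the even-`d_model` assumption are mine):

```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """Paper-style positional embedding: even channels use sin, odd channels use cos.

    Illustrative sketch; assumes d_model is even.
    """
    position = np.arange(max_len)[:, np.newaxis]                      # (max_len, 1)
    div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position / div_term)                         # PE(pos, 2i)
    pe[:, 1::2] = np.cos(position / div_term)                         # PE(pos, 2i+1)
    return pe
```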

@martinpopel
Contributor

Is there any difference in speed or BLEU?
The obvious question is whether it would not be better to update the pdf at arXiv.

@jekbradbury
Contributor

Aren't those just different orderings of the same set of channels? Shouldn't impact the accuracy in that case, right?
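
One quick way to see this (a small NumPy check, under the same assumptions as the sketch above; not code from this repository): the pre-fix layout with all sine channels followed by all cosine channels is a fixed permutation of the paper's interleaved layout, so each position carries exactly the same set of values.

```python
import numpy as np

max_len, d_model = 50, 64
position = np.arange(max_len)[:, np.newaxis]
div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)

# Paper layout: sin/cos interleaved across channels.
interleaved = np.zeros((max_len, d_model))
interleaved[:, 0::2] = np.sin(position / div_term)
interleaved[:, 1::2] = np.cos(position / div_term)

# Pre-fix layout: all sin channels, then all cos channels.
concatenated = np.concatenate(
    [np.sin(position / div_term), np.cos(position / div_term)], axis=1)

# Reordering the interleaved channels (even indices first, then odd) reproduces
# the concatenated layout exactly: the two differ only by a channel permutation.
perm = np.concatenate([np.arange(0, d_model, 2), np.arange(1, d_model, 2)])
assert np.allclose(interleaved[:, perm], concatenated)
```

Since the layers that consume the signal can adapt to either channel order during training, this is consistent with the unchanged BLEU reported later in the thread.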

Contributor

@lukaszkaiser left a comment

Thanks, always good to be the same as in the paper!

@lukaszkaiser merged commit 62a0ee7 into tensorflow:master on Jul 25, 2017
@lukaszkaiser
Contributor

I hope it has little influence on results, but it's always good to have code that does exactly what's written in the paper. Thanks!

@cshanbo
Contributor Author

cshanbo commented Jul 25, 2017

Speed and BLEU are almost the same as with the original implementation.
But I also think it's better to use exactly the same implementation as in the paper.

@lukaszkaiser
Contributor

One problem with this PR is that it breaks all trained models: every Transformer model using these signals now needs to be re-trained. While I like this PR a lot, it is causing quite some trouble, so I might go back on it. Sorry if we do that, and thanks in any case!

@cshanbo
Contributor Author

cshanbo commented Jul 26, 2017 via email

@ghost

ghost commented May 31, 2020

Is it surprising that one way vs. the other doesn't really matter?

Also, random question... who first came up with the idea of adding signal like this? What's the theoretical motivation? Does anybody know?

@martinpopel
Contributor

martinpopel commented May 31, 2020

@krzysztofwos

Is it surprising that one way vs. the other doesn't really matter?

If you mean the difference discussed in this thread, it is really just a reordering of the dimensions, so I don't find it surprising that it does not matter for speed and BLEU (although guessing the influence on speed is tricky because of many low-level effects: memory alignment, GPU/TPU-internal implementation details, etc.).

who first came up with the idea of adding signal like this? What's the theoretical motivation?

According to Attention Is All You Need, learned positional embeddings were used in Facebook's convolutional seq2seq. I am not aware of any prior usage of positional embeddings (so I was wrong; see the comment below). They were not needed in RNNs, which track position implicitly, and they brought only a small improvement (0.5 BLEU) in convolutional networks, because CNNs carry some information about position even without positional embeddings. The Transformer has no information about position without positional embeddings (or relative-position attention), so they are essential there.

According to Attention Is All You Need, Noam Shazeer (one of the authors of the paper) proposed the parameter-free position representation, i.e. the fixed sin/cos positional embeddings (unlike the learned positional embeddings in Facebook's model).

@jekbradbury
Contributor

The first example of positional embeddings that I'm familiar with was from a follow-up to the original memory networks paper from Jason Weston's team at FAIR https://arxiv.org/abs/1503.08895 (section 4.1, called "positional encoding"). I wouldn't be surprised if there was something similar even earlier.

@ghost

ghost commented Jun 1, 2020

Thank you!
