Conversation
Is there any difference in speed or BLEU?
Aren't those just different orderings of the same set of channels? That shouldn't impact the accuracy in that case, right?
Thanks, always good to be the same as in the paper!
I hope it has little influence on results, but it's always good to have code that does exactly what's written in the paper. Thanks!
The speed and BLEU are almost the same as with the original implementation.
One problem with this PR is that it breaks all trained models: every Transformer model using these signals needs to be re-trained now. While I like this PR a lot, it is causing quite some trouble, so I might go back on it. Sorry if we do that, and thanks in any case!
We might need some dirty work, e.g. checking the model version to make the positional embeddings backward compatible. But yes, the easiest way is apparently to keep the original code.
Thank you
Shanbo
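To make the backward-compatibility idea above concrete, here is a purely illustrative NumPy sketch of a version-gated positional signal. The function name and the `version` flag are assumptions for illustration only, not tensor2tensor's actual API.

```python
import numpy as np

def get_timing_signal(length, channels, version=2, max_timescale=10000.0):
    """Illustrative sketch: pick the channel ordering of the sinusoidal
    positional signal based on a hypothetical checkpoint `version`, so models
    trained with the old ordering could keep using it. The name and the
    `version` flag are assumptions, not the repository's real API."""
    positions = np.arange(length)[:, np.newaxis]        # (length, 1)
    dims = np.arange(0, channels, 2)[np.newaxis, :]     # (1, channels/2)
    angles = positions / np.power(max_timescale, dims / channels)
    if version >= 2:
        # Paper-style ordering: sin on even channels, cos on odd channels.
        signal = np.zeros((length, channels))
        signal[:, 0::2] = np.sin(angles)
        signal[:, 1::2] = np.cos(angles)
    else:
        # Old ordering (assumed): all sines first, then all cosines.
        signal = np.concatenate([np.sin(angles), np.cos(angles)], axis=1)
    return signal
```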
Is it surprising that one way vs. the other doesn't really matter? Also, random question... who first came up with the idea of adding a signal like this? What's the theoretical motivation? Does anybody know?
If you mean the difference discussed in this thread, it is really just a reordering of the dimensions, so I don't find it surprising that it does not matter for speed and BLEU (although guessing the influence on speed is tricky because of many low-level effects: memory alignment, GPU/TPU-internal implementation details, etc.).
According to Attention Is All You Need, learned positional embeddings were used in Facebook's convolutional seq2seq. I am not aware of any prior usage of positional embeddings (so I was wrong - see the comment below): they were not needed in RNNs, which track position implicitly, and they brought only a small improvement (0.5 BLEU) in convolutional networks, because a CNN already has some information about position even without them. The Transformer has no information about position without positional embeddings (or relative-position attention), so they are essential there. According to Attention Is All You Need, Noam Shazeer (one of the authors of the paper) proposed the parameter-free position representation, i.e. the fixed sin/cos positional embeddings (unlike the learned positional embeddings used by Facebook).
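For reference, a minimal NumPy sketch of the fixed sin/cos encoding as defined in the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name is illustrative, not taken from the repository.

```python
import numpy as np

def sinusoidal_encoding(length, d_model, max_timescale=10000.0):
    """PE(pos, 2i)   = sin(pos / max_timescale**(2i / d_model))
       PE(pos, 2i+1) = cos(pos / max_timescale**(2i / d_model))"""
    positions = np.arange(length)[:, np.newaxis]      # (length, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]    # (1, d_model/2)
    angles = positions / np.power(max_timescale, dims / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)   # even channels: sine
    pe[:, 1::2] = np.cos(angles)   # odd channels: cosine
    return pe
```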
The first example of positional embeddings that I'm familiar with was in a follow-up to the original memory networks paper from Jason Weston's team at FAIR, https://arxiv.org/abs/1503.08895 (section 4.1, called "positional encoding"). I wouldn't be surprised if there was something similar even earlier.
Thank you!
This PR fixes the positional embeddings in the code.
As mentioned in the paper, the positional embedding should be
while in the code, it's calculated as
Please check the differences.
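Since the two formulas referenced above were attached as images, here is a hedged NumPy sketch of my reading of the difference: the paper interleaves sin and cos across channels, while the pre-PR code (as I understand this thread) placed all sines in the first half of the channels and all cosines in the second half. Per position, the two layouts contain the same values, just in a different channel order; function names are illustrative.

```python
import numpy as np

def paper_layout(length, d_model, max_timescale=10000.0):
    # Interleaved, as in the paper: PE(pos, 2i) = sin(...), PE(pos, 2i+1) = cos(...).
    positions = np.arange(length)[:, np.newaxis]
    dims = np.arange(0, d_model, 2)[np.newaxis, :]
    angles = positions / np.power(max_timescale, dims / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def concatenated_layout(length, d_model, max_timescale=10000.0):
    # Assumed pre-PR layout: sines in the first d_model/2 channels,
    # cosines in the second half, at the same frequencies.
    positions = np.arange(length)[:, np.newaxis]
    dims = np.arange(0, d_model, 2)[np.newaxis, :]
    angles = positions / np.power(max_timescale, dims / d_model)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

a = paper_layout(8, 16)
b = concatenated_layout(8, 16)
# Per position, the two layouts hold the same set of values, i.e. one is a
# fixed permutation of the other along the depth dimension.
assert np.allclose(np.sort(a, axis=1), np.sort(b, axis=1))
```

If that reading is right, it also explains why speed and BLEU are essentially unchanged: the model only sees a fixed permutation of the same channels.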