I am using the Transformer on a dummy dataset, listed below.
The training encoder text is:
my name is sanqiang .
zhao is my name .
my name is sanqiang2 .
zhao2 is my name .
The training decoder text is:
i am sanqiang .
i am zhao .
i am sanqiang2 .
i am zhao2 .
The validation encoder text is:
my name is zhao .
my name is zhao2 .
The validation decoder text is:
i am zhao .
i am zhao2 .
Because the texts are sequences, I add positional encoding (common_attention.add_timing_signal_1d).
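For context, here is roughly how the signal enters the model (a minimal sketch with made-up shapes; I am assuming the function lives in tensor2tensor.layers.common_attention):

    import tensorflow as tf
    from tensor2tensor.layers import common_attention

    # Toy encoder input: batch=1, length=5, hidden=32 (made-up shapes).
    embeddings = tf.random.normal([1, 5, 32])

    # The sinusoidal timing signal is added element-wise, so the same
    # word gets a different input vector at every position.
    encoder_input = common_attention.add_timing_signal_1d(embeddings)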
Ideally, the model would very quickly learn the attention for the four words sanqiang, sanqiang2, zhao, and zhao2. In practice it performs very badly: the attention never picks out the correct word.
My guess is that the positional signal added to the word embeddings at the model input confuses the attention. When I remove the positional encoding, performance is good (perhaps the attention then works because the relevant words sit at similar positions in the encoder and decoder text).
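Concretely, I removed it along these lines (assuming hparams.pos is the switch that controls the timing signal in tensor2tensor):

    from tensor2tensor.models import transformer

    hparams = transformer.transformer_base()
    # "timing" adds add_timing_signal_1d to the inputs; "none" skips it.
    hparams.pos = "none"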
As with earlier RNN seq2seq models, I think we could choose whether the attention keys come from the original inputs or from the cell outputs.
So I wonder: is my understanding correct, or am I missing something about the Transformer? And do you think it makes more sense for self-attention to use positional encoding while the attention over encoder outputs ignores it? A small sketch of the variant I have in mind follows below.
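To make the question concrete, here is the variant sketched in plain numpy (single head, made-up names): self-attention keys include positions, but the encoder-decoder attention keys are the position-free embeddings.

    import numpy as np

    def attend(q, k, v):
        # Scaled dot-product attention for a single query vector.
        scores = k @ q / np.sqrt(len(q))
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ v

    length, depth = 4, 8
    rng = np.random.default_rng(0)
    emb = rng.standard_normal((length, depth))  # word embeddings (content)
    pos = rng.standard_normal((length, depth))  # stand-in for the timing signal
    q = rng.standard_normal(depth)              # a decoder-side query

    # Encoder self-attention: keys carry positional information.
    self_out = attend(q, emb + pos, emb + pos)

    # Encoder-decoder attention as proposed: keys are position-free,
    # so the decoder matches purely on content.
    enc_dec_out = attend(q, emb, emb)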