I am using the Transformer on a dummy dataset, listed below.
The training encoder text is:
my name is sanqiang .
zhao is my name .
my name is sanqiang2 .
zhao2 is my name .
The training decoder text is:
i am sanqiang .
i am zhao .
i am sanqiang2 .
i am zhao2 .
The validation encoder text is:
my name is zhao .
my name is zhao2 .
The validation decoder text is:
i am zhao .
i am zhao2 .
Because the texts are sequences, I add positional encoding (common_attention.add_timing_signal_1d).
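For context, here is roughly how the signal enters the model (a minimal sketch with made-up shapes; I am assuming the function lives in tensor2tensor.layers.common_attention):

    import tensorflow as tf
    from tensor2tensor.layers import common_attention

    # Toy encoder input: batch=1, length=5, hidden=32 (made-up shapes).
    embeddings = tf.random.normal([1, 5, 32])

    # The sinusoidal timing signal is added element-wise, so the same
    # word gets a different input vector at every position.
    encoder_input = common_attention.add_timing_signal_1d(embeddings)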
Ideally, the model would very quickly learn the attention for the four words sanqiang, sanqiang2, zhao, and zhao2. In practice it performs very badly: the attention never picks out the correct word.
My guess is that the positional signal added to the word embeddings at the model input confuses the attention. When I remove the positional encoding, performance is good (perhaps the attention then works because the relevant words sit at similar positions in the encoder and decoder text).
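Concretely, I removed it along these lines (assuming hparams.pos is the switch that controls the timing signal in tensor2tensor):

    from tensor2tensor.models import transformer

    hparams = transformer.transformer_base()
    # "timing" adds add_timing_signal_1d to the inputs; "none" skips it.
    hparams.pos = "none"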
As with earlier RNN seq2seq models, I think we could choose whether the attention keys come from the original inputs or from the cell outputs.
So I wonder: is my understanding correct, or am I missing something about the Transformer? And do you think it makes more sense for self-attention to use positional encoding while the attention over encoder outputs ignores it? A small sketch of the variant I have in mind follows below.
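To make the question concrete, here is the variant sketched in plain numpy (single head, made-up names): self-attention keys include positions, but the encoder-decoder attention keys are the position-free embeddings.

    import numpy as np

    def attend(q, k, v):
        # Scaled dot-product attention for a single query vector.
        scores = k @ q / np.sqrt(len(q))
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ v

    length, depth = 4, 8
    rng = np.random.default_rng(0)
    emb = rng.standard_normal((length, depth))  # word embeddings (content)
    pos = rng.standard_normal((length, depth))  # stand-in for the timing signal
    q = rng.standard_normal(depth)              # a decoder-side query

    # Encoder self-attention: keys carry positional information.
    self_out = attend(q, emb + pos, emb + pos)

    # Encoder-decoder attention as proposed: keys are position-free,
    # so the decoder matches purely on content.
    enc_dec_out = attend(q, emb, emb)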