This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Will positional encoding confuse attention? #314

@Sanqiang

Description

I am using the Transformer on a dummy dataset, listed below.

Training encoder text:
my name is sanqiang .
zhao is my name .
my name is sanqiang2 .
zhao2 is my name .

Training decoder text:
i am sanqiang .
i am zhao .
i am sanqiang2 .
i am zhao2 .

Validation encoder text:
my name is zhao .
my name is zhao2 .

Validation decoder text:
i am zhao .
i am zhao2 .

Because the texts are sequences, I add positional encoding (common_attention.add_timing_signal_1d).
Ideally, the model should learn the attention for the four words sanqiang, sanqiang2, zhao, and zhao2 very quickly. But in practice it performs very badly: the attention almost never picks out the correct word.

So I think the positional signal added to the word embeddings at the model input is what confuses the attention. When I remove the positional encoding, performance is good (the attention may simply work well when the relevant words sit at similar positions in the encoder and decoder text).
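For reference, this is roughly what I mean by "adding positional encoding": a simplified NumPy sketch of the standard sinusoidal timing signal added to the word embeddings. The function name and details here are my own simplification, not the exact tensor2tensor implementation.

```python
import numpy as np

def add_timing_signal(x, min_timescale=1.0, max_timescale=1.0e4):
    """Add a sinusoidal positional signal to x of shape [length, channels].

    Simplified sketch of what I understand add_timing_signal_1d to do;
    names and details may differ from the library code.
    """
    length, channels = x.shape
    position = np.arange(length, dtype=np.float32)              # [length]
    num_timescales = channels // 2
    log_timescale_increment = (
        np.log(max_timescale / min_timescale) / max(num_timescales - 1, 1))
    inv_timescales = min_timescale * np.exp(
        np.arange(num_timescales, dtype=np.float32) * -log_timescale_increment)
    scaled_time = position[:, None] * inv_timescales[None, :]   # [length, num_timescales]
    signal = np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)
    # Pad with zeros if channels is odd so the signal matches the embedding width.
    signal = np.pad(signal, [[0, 0], [0, channels % 2]])
    return x + signal

# Example: a 6-token sentence with 8-dimensional word embeddings.
embeddings = np.random.randn(6, 8).astype(np.float32)
encoder_input = add_timing_signal(embeddings)
```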

Similar to earlier RNN models, where we can choose whether to attend over the original inputs or over the RNN cell outputs, I think we could offer the same choice here.

So I wonder whether my understanding is correct, or whether I am missing something about the Transformer. And do you think it makes more sense for self-attention to use positional encoding, while the attention over the encoder outputs ignores it?
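To make the question concrete, here is a rough NumPy sketch of the variant I have in mind. The single-head, projection-free attention and all the names are my own toy simplification, not anything from tensor2tensor: the keys for encoder-decoder attention would come from the raw word embeddings (so position does not affect the match), while the values still carry the encoder outputs.

```python
import numpy as np

def attention(query, keys, values):
    """Plain scaled dot-product attention (single head, no mask, no projections)."""
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Toy tensors: 6 encoder positions, 8-dimensional vectors.
embeddings = np.random.randn(6, 8).astype(np.float32)       # encoder word embeddings, no timing signal
encoder_outputs = np.random.randn(6, 8).astype(np.float32)  # stand-in for the encoder stack outputs
decoder_query = np.random.randn(1, 8).astype(np.float32)    # one decoder position

# Current behaviour: keys and values both come from the position-aware encoder outputs.
context_current = attention(decoder_query, encoder_outputs, encoder_outputs)

# What I am asking about: position-free keys, values still from the encoder outputs.
context_proposed = attention(decoder_query, embeddings, encoder_outputs)
```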
