
Each head in MSA only responsible for a subset of the input tokens? #7

Closed
valencebond opened this issue Jun 8, 2021 · 1 comment

@valencebond

maybe "Each head in MSA only responsible for a subset of channels of the input tokens".
According to the Transformer, each head is not responsible for part of input tokens, but part of channels of one input token.

@wofmanaf
Owner

wofmanaf commented Jun 8, 2021

If you say "input token" (singular), you mean that all the channels together form one tensor; if you say "input tokens" (plural), you mean that each channel represents one token. In fact, MSA concatenates the outputs of all the SA heads along the channel dimension, so in this setting "input tokens" is the more accurate phrasing. Similarly, for a layer output of shape 1x32x7x7, we usually say it has 32 feature maps. In any case, the code is only for reference; to properly understand the principle, you should follow the paper.
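
To make the channel split concrete, here is a minimal PyTorch sketch of standard multi-head self-attention (this is illustrative, not the code from this repo; the class name `MSA` and the `dim`/`num_heads` parameters are assumptions). Each head attends over all `N` tokens but only over a `C // num_heads` slice of each token's channels, and the per-head results are concatenated back along the channel dimension, exactly as described above:

```python
import torch
import torch.nn as nn

class MSA(nn.Module):
    """Minimal multi-head self-attention sketch (illustrative, not this repo's code)."""
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, N, C)
        B, N, C = x.shape
        # Project, then split the channel dim so each head gets C // num_heads channels.
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (B, heads, N, head_dim)
        # Every head attends over all N tokens, but only within its channel slice.
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)              # (B, heads, N, N)
        out = attn @ v                           # (B, heads, N, head_dim)
        # Concatenate the per-head outputs back along the channel dimension.
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

For example, `MSA(dim=64, num_heads=8)(torch.randn(2, 196, 64))` returns a `(2, 196, 64)` tensor: each of the 8 heads processed an 8-channel slice of every one of the 196 tokens, which is why "a subset of channels of the input tokens" is the precise description.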
