Maybe: "Each head in MSA is only responsible for a subset of the channels of the input tokens".

According to the Transformer paper, each head is not responsible for a subset of the input tokens, but for a subset of the channels of each input token.
When you call it "input token", you mean all the channels form one tensor; when you call it "input tokens", you mean each channel represents one token. In fact, MSA concatenates the results of all the SA heads along the channel dimension, so in this setting "input tokens" is the more accurate wording. Similarly, for the output of a layer with shape 1x32x7x7, we usually say it has 32 feature maps. In any case, the code is only for reference; to properly understand the principle, you should follow the paper.
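To make the point above concrete, here is a minimal numpy sketch (shapes are hypothetical, chosen for illustration) of how MSA splits the channel dimension of each token across heads and then concatenates the per-head outputs back along the channel dimension. Every head sees all tokens, but only a `D // H`-sized slice of each token's channels:

```python
import numpy as np

B, N, D, H = 1, 4, 8, 2    # batch, tokens, embed dim, heads (illustrative)
head_dim = D // H           # channels per head: each head gets D/H channels

x = np.random.randn(B, N, D)

# Split: (B, N, D) -> (B, H, N, head_dim). Each head receives all N
# tokens, but only a head_dim slice of every token's channels.
heads = x.reshape(B, N, H, head_dim).transpose(0, 2, 1, 3)

# Identity "attention" per head, just to trace the shapes; a real SA
# head would compute softmax(Q K^T / sqrt(head_dim)) V here.
out_per_head = heads

# Concatenate the per-head results back along the channel dimension:
# (B, H, N, head_dim) -> (B, N, D).
out = out_per_head.transpose(0, 2, 1, 3).reshape(B, N, D)

assert out.shape == (B, N, D)
assert np.allclose(out, x)  # split + concat round-trips the channels
```

With identity attention, the split-then-concatenate path reproduces the input exactly, which shows that the heads partition channels, not tokens.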