Maybe: "Each head in MSA is only responsible for a subset of the channels of the input tokens".

According to the Transformer paper, each head is not responsible for a subset of the input tokens, but for a subset of the channels of each input token.
When you call it "input token", you mean all the channels form one tensor; when you call it "input tokens", you mean each channel represents one token. In fact, MSA concatenates the results of all the SA heads along the channel dimension, so in this setting "input tokens" is the more accurate wording. Similarly, for the output of a layer with shape 1x32x7x7, we usually say it has 32 feature maps. In any case, the code is only for reference; to properly understand the principle, you should follow the paper.
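To make the point above concrete, here is a minimal numpy sketch (shapes are hypothetical, chosen for illustration) of how MSA splits the channel dimension of each token across heads and then concatenates the per-head outputs back along the channel dimension. Every head sees all tokens, but only a `D // H`-sized slice of each token's channels:

```python
import numpy as np

B, N, D, H = 1, 4, 8, 2    # batch, tokens, embed dim, heads (illustrative)
head_dim = D // H           # channels per head: each head gets D/H channels

x = np.random.randn(B, N, D)

# Split: (B, N, D) -> (B, H, N, head_dim). Each head receives all N
# tokens, but only a head_dim slice of every token's channels.
heads = x.reshape(B, N, H, head_dim).transpose(0, 2, 1, 3)

# Identity "attention" per head, just to trace the shapes; a real SA
# head would compute softmax(Q K^T / sqrt(head_dim)) V here.
out_per_head = heads

# Concatenate the per-head results back along the channel dimension:
# (B, H, N, head_dim) -> (B, N, D).
out = out_per_head.transpose(0, 2, 1, 3).reshape(B, N, D)

assert out.shape == (B, N, D)
assert np.allclose(out, x)  # split + concat round-trips the channels
```

With identity attention, the split-then-concatenate path reproduces the input exactly, which shows that the heads partition channels, not tokens.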