```python
return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
```

and

```python
src = self.embedding(src) * math.sqrt(self.d_model)
```
Shouldn't this be

```python
src = self.embedding(src) / math.sqrt(self.d_model)
```

instead? At least, that is the impression I got when reading the "Attention Is All You Need" paper. Or is there some newer research finding that multiplying is better?
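For context, here is a minimal, self-contained sketch of the embedding module I am asking about, with both scaling variants shown. The class name, attribute names, and usage values below are my own paraphrase for illustration, not verbatim from the tutorial:

```python
import math

import torch
import torch.nn as nn


class TokenEmbedding(nn.Module):
    """Sketch of a token embedding with the scaling in question."""

    def __init__(self, vocab_size: int, emb_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Tutorial version: scale the embeddings UP by sqrt(emb_size).
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
        # What I expected from my reading of the paper (scale DOWN):
        # return self.embedding(tokens.long()) / math.sqrt(self.emb_size)


# Hypothetical usage, just to make the sketch runnable:
emb = TokenEmbedding(vocab_size=1000, emb_size=512)
out = emb(torch.tensor([[1, 2, 3]]))  # shape: (1, 3, 512)
```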
cc @sekyondaMeta @svekars @kit1980 @subramen @albanD