By tracing `transformers` running the same model on a fill-mask task, I was able to determine that the execution of `transformers` and `bert-burn` diverges at the point where normalization happens.
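For anyone who wants to reproduce the reference side, something along these lines is enough (the prompt here is just an illustrative example, not the exact input I used):

```python
# Minimal fill-mask baseline with Hugging Face transformers,
# to compare token scores against bert-burn's output.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")
for pred in fill("The capital of France is <mask>."):
    print(f"{pred['token_str']!r}: {pred['score']:.4f}")
```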
Furthermore, using the `lm_head` weights for roberta-base and attaching an LM head model, I was able to verify that `bert-burn`'s results are correct with `norm_first: false`, but entirely wrong with `norm_first: true`.
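To illustrate what the flag changes: BERT and RoBERTa use the original post-norm ordering, where LayerNorm is applied *after* each residual addition, whereas pre-norm applies it *before* each sublayer. A minimal sketch using PyTorch's `nn.TransformerEncoderLayer` (which exposes the same `norm_first` flag; this is not bert-burn's API) shows that the two orderings produce different outputs even with identical weights:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two encoder layers that differ only in where LayerNorm is applied.
# post-norm (norm_first=False): x = norm(x + attn(x)); x = norm(x + ffn(x))
# pre-norm  (norm_first=True):  x = x + attn(norm(x)); x = x + ffn(norm(x))
post = nn.TransformerEncoderLayer(d_model=64, nhead=4,
                                  norm_first=False, batch_first=True).eval()
pre = nn.TransformerEncoderLayer(d_model=64, nhead=4,
                                 norm_first=True, batch_first=True).eval()

# Copy the weights so the only difference is the norm placement.
pre.load_state_dict(post.state_dict())

x = torch.randn(1, 8, 64)
with torch.no_grad():
    print(torch.allclose(post(x), pre(x)))  # False: same weights, different outputs
```

So if a checkpoint was trained post-norm, loading it with `norm_first: true` silently reorders the computation and produces garbage, which matches what I'm seeing.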
I'd be happy to provide a pull request, but I'm not sure whether other BERT models use `norm_first: true`. I'm very new to machine learning and am not familiar with this family of models.