Layer Norm in XLM-R XL and XXL #3600

Closed
stefan-it opened this issue Jun 8, 2021 · 8 comments

@stefan-it
Contributor

Hi :)

I'm currently trying to convert the recently released XLM-R XL and XXL models into Transformers-compatible weights.

I'm using the latest fairseq master version (with commit 2fd9d8a) and there's something strange with the layer norm parameter.

For debugging, here are the parameter names (shortened) for the XLM-R Base model:

encoder.sentence_encoder.layernorm_embedding.weight        
encoder.sentence_encoder.layernorm_embedding.bias

The parameter name is layernorm_embedding. However, for the new XL models, the output is:

encoder.sentence_encoder.layer_norm.weight
encoder.sentence_encoder.layer_norm.bias

So the parameter name is layer_norm. When loading the model with the fairseq library, like:

from fairseq.models.roberta import RobertaModel as FairseqRobertaModel


xlmr = FairseqRobertaModel.from_pretrained(roberta_checkpoint_path)
xlmr.eval()  # disable dropout
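
For reference, a minimal sketch of how these parameter names can be dumped (the checkpoint path below is a placeholder for a local directory containing model.pt and the accompanying dictionary/sentencepiece files):

from fairseq.models.roberta import RobertaModel as FairseqRobertaModel

# Placeholder path to the downloaded checkpoint directory
xlmr = FairseqRobertaModel.from_pretrained("/path/to/xlmr.xl")
xlmr.eval()

# Print every parameter name and shape so the Base and XL checkpoints can be diffed
for name, param in xlmr.model.named_parameters():
    print(name, tuple(param.shape))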

The (shortened) module list for XLM-R Base shows:

RobertaHubInterface(                                                                                 
  (model): RobertaModel(                                                                                  
    (encoder): RobertaEncoder(                                                
      (sentence_encoder): TransformerEncoder(                               
        (dropout_module): FairseqDropout()                                                               
        (embed_tokens): Embedding(250002, 768, padding_idx=1)               
        (embed_positions): LearnedPositionalEmbedding(514, 768, padding_idx=1)                           
        (layernorm_embedding): FusedLayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)

whereas the module list for the XL model shows:

RobertaHubInterface(                                                                                      
  (model): RobertaModel(                                                                              
    (encoder): RobertaEncoder(                                                                            
      (sentence_encoder): TransformerEncoder(                                                             
        (dropout_module): FairseqDropout()
        (embed_tokens): Embedding(250880, 2560, padding_idx=1)
        (embed_positions): LearnedPositionalEmbedding(514, 2560, padding_idx=1)

So a layer norm is missing in the XL model 🤔

Side note: I've updated the conversion script in the Transformers library to be compatible with the latest fairseq master. At the end, the script compares a (forward) pass of the original fairseq model and the converted model to check for differences. For the old XLM-R Base model the outputs are identical, whereas for XLM-R XL the difference is very high. The script can be found here.
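
Roughly, that comparison looks like this (a simplified sketch; the paths are placeholders and the actual script differs in the details):

import torch
from fairseq.models.roberta import RobertaModel as FairseqRobertaModel
from transformers import XLMRobertaForMaskedLM

# Placeholder paths for the original checkpoint and the converted model
fairseq_model = FairseqRobertaModel.from_pretrained("/path/to/xlmr.xl")
fairseq_model.eval()
hf_model = XLMRobertaForMaskedLM.from_pretrained("/path/to/converted-xlmr-xl")
hf_model.eval()

input_ids = torch.tensor([[0, 31414, 232, 2]])  # arbitrary token ids

with torch.no_grad():
    their_output = fairseq_model.model(input_ids)[0]  # fairseq LM head output
    our_output = hf_model(input_ids).logits           # converted model output

max_diff = (their_output - our_output).abs().max().item()
print(f"max absolute difference: {max_diff:.3e}")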

Thanks for your help!

@ngoyal2707
Copy link
Contributor

@stefan-it XLM-R Base and Large use the post-layernorm setting of the transformer, while XL and XXL use the pre-layernorm setting.

In the pre-LN setting the embeddings are usually not normalized, and there is a layer norm at the start of each transformer block, plus an extra layer norm at the end of the transformer.
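
For illustration only (this is not fairseq's actual code), the two orderings look roughly like this in PyTorch:

import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN (XLM-R Base/Large): normalize after each residual connection."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x)[0])
        x = self.ln2(x + self.ffn(x))
        return x

class PreLNBlock(nn.Module):
    """Pre-LN (XLM-R XL/XXL): normalize before each sublayer; the encoder then
    applies one final layer norm after the last block (the extra layer_norm)."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.ffn(self.ln2(x))
        return x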

@ngoyal2707
Contributor

You will need to build the HF transformer in the same way to get the same output.

@ricardorei

@ngoyal2707 Independently of those changes between Base/Large and XL/XXL, I can't load the new XL and XXL models with any fairseq version (without making changes to the state_dict).

If I use version 0.9.0, I get a bunch of unexpected keys because the "decoder" was renamed to "encoder".
If I use version >=0.10, I get unexpected keys for the emb_layer_norm, which I assume was renamed to layer_norm:

RuntimeError: Error(s) in loading state_dict for RobertaModel:
        Missing key(s) in state_dict: "encoder.sentence_encoder.emb_layer_norm.weight", "encoder.sentence_encoder.emb_layer_norm.bias".
        Unexpected key(s) in state_dict: "encoder.sentence_encoder.layer_norm.weight", "encoder.sentence_encoder.layer_norm.bias", "encoder.sentence_encoder.version".

In any case, those checkpoints seem impossible to load without hacking around.
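
As a hack (the key names below are taken only from the error above, and this does not address the pre-LN architecture difference, just the naming mismatch), the keys can be renamed before loading:

import torch

# Load the raw fairseq checkpoint and rename the mismatched keys.
# Placeholder paths; fairseq checkpoints keep the weights under "model".
ckpt = torch.load("/path/to/model.pt", map_location="cpu")
state = ckpt["model"]

renamed = {
    key.replace("encoder.sentence_encoder.layer_norm.",
                "encoder.sentence_encoder.emb_layer_norm."): value
    for key, value in state.items()
}
renamed.pop("encoder.sentence_encoder.version", None)  # also reported as unexpected

ckpt["model"] = renamed
torch.save(ckpt, "/path/to/model_renamed.pt")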

@stefan-it
Contributor Author

stefan-it commented Jun 9, 2021

@ricardorei I installed fairseq via pip3 install git+https://github.com/pytorch/fairseq.git, as I've also seen different error messages for various fairseq versions. But with the latest master I could load the new, larger models 🤗

@stefan-it
Contributor Author

@ngoyal2707 Thanks for your explanation 👍 I could see the changes in 54423d3, so we're currently adjusting the RoBERTa model in Transformers to support the new models :)

@Soonhwan-Kwon

Soonhwan-Kwon commented Jun 9, 2021

I encountered the same error, and it seems that a layer_norm needs to be added to TransformerSentenceEncoder: https://github.com/pytorch/fairseq/blob/master/fairseq/modules/transformer_sentence_encoder.py
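
A rough sketch of that kind of change (attribute and key names are taken from the error messages above, and this is not the actual fairseq patch; the forward pass would additionally need to apply self.layer_norm after the last transformer layer):

from fairseq.modules import LayerNorm, TransformerSentenceEncoder

class TransformerSentenceEncoderWithFinalLN(TransformerSentenceEncoder):
    """Sketch: an encoder that also owns the final layer norm the XL/XXL
    checkpoints store under encoder.sentence_encoder.layer_norm.*"""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Final layer norm for the pre-LN setting described above
        self.layer_norm = LayerNorm(self.embedding_dim)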

@stale

stale bot commented Apr 17, 2022

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

@stale stale bot added the stale label Apr 17, 2022
@stale
Copy link

stale bot commented Apr 29, 2022

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

@stale stale bot closed this as completed Apr 29, 2022