
Pretraining instruction #41

Closed
JiachengLi1995 opened this issue Jan 5, 2021 · 19 comments

Comments

@JiachengLi1995

Hi authors,

Awesome work! Thanks for your code and instructions. I would like to pretrain a new LUKE model on my own dataset. Could you provide pretraining instructions so I can learn how to do this? Thank you!

@theblackcat102

@JiachengLi1995 The steps to pretrain are fairly straightforward. You can run python -m luke.cli to see the available commands and then follow the steps below in order; a consolidated sketch of the whole pipeline follows the steps.

  1. Parse the Wikipedia dump into a database:
python -m luke.cli build-dump-db enwiki-latest-pages-articles.xml.bz2 ./enwiki_latest
  2. Generate the entity vocabulary from the parsed database; refer to utils/ent_vocab.py for the additional parameters (vocabulary size, whitelist, etc.):
python -m luke.cli build-entity-vocab ./enwiki_latest ./output_ent-vocab.jsonl
  3. Build the training dataset and the metadata needed for pretraining. Note that I am using the roberta-base tokenizer here; you can use others such as bert-base-cased:
python -m luke.cli build-wikipedia-pretraining-dataset ./enwiki_latest roberta-base ./output_ent-vocab.jsonl ./wikipedia_pretrain_dataset
  4. Finally, you can start pretraining by running the following command:
python -m luke.cli pretrain \
    ./wikipedia_pretrain_dataset \
    luke-roberta-base \
    --bert-model-name roberta-base \
    --entity-emb-size 300 \
    --batch-size 28 \
    --gradient-accumulation-steps 1 \
    --learning-rate 0.0005 \
    --warmup-steps 10000 \
    --log-dir logs/luke-roberta-base
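
Putting it together, a consolidated sketch of the whole pipeline, using the same placeholder paths and hyperparameters as in the steps above:

# 1. Parse the Wikipedia dump into a database
python -m luke.cli build-dump-db enwiki-latest-pages-articles.xml.bz2 ./enwiki_latest

# 2. Build the entity vocabulary
python -m luke.cli build-entity-vocab ./enwiki_latest ./output_ent-vocab.jsonl

# 3. Build the pretraining dataset (roberta-base tokenizer here)
python -m luke.cli build-wikipedia-pretraining-dataset ./enwiki_latest roberta-base ./output_ent-vocab.jsonl ./wikipedia_pretrain_dataset

# 4. Pretrain
python -m luke.cli pretrain ./wikipedia_pretrain_dataset luke-roberta-base \
    --bert-model-name roberta-base --entity-emb-size 300 --batch-size 28 \
    --gradient-accumulation-steps 1 --learning-rate 0.0005 --warmup-steps 10000 \
    --log-dir logs/luke-roberta-base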

@ikuyamada
Member

@theblackcat102 Thank you for providing the pretraining instructions for LUKE! Following these steps with the hyperparameters specified in the original paper should produce a model that performs similarly to our pretrained model.

@mgong023

mgong023 commented May 6, 2021

Hi @ikuyamada,

Sorry to bother you again. I wanted to fine-tune the model on the CoNLL-2003 dataset to see if it runs. On the entire dataset training starts, but it takes about 50 hours per epoch. So I took only the first few samples of the CoNLL dataset, but then I get an AssertionError: examples/ner/utils.py", line 238, in convert_examples_to_features assert not entity_labels. I really don't understand why this happens; I only made the dataset smaller.

@ikuyamada
Member

Hi @mgong023,
Please read the source code carefully to investigate why the error happens with your dataset. The error basically means that some entity spans detected in utils.py (lines 122-145) are missing for some reason.
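
If the goal is simply a smaller file for a quick test, one possible cause of the assertion is cutting the file at an arbitrary line so that a document is split in half; subsampling at document boundaries avoids that (this is an assumption based on the error described, not a confirmed diagnosis). A minimal sketch, assuming the standard CoNLL-2003 files with -DOCSTART- markers (file names are placeholders):

# keep only the first 20 documents of the training file
awk '/-DOCSTART-/{n++} n<=20' eng.train > eng.train.small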

@mgong023

mgong023 commented May 6, 2021

Alright, your recommendation gave me an insight, so I have fixed that error. I now encounter a CUDA out-of-memory error on the second epoch. I use 2x Nvidia Tesla K80. Are the GPUs not powerful enough, or is something going wrong? I also get the following warning at the start: Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'") Does this have anything to do with it?

@ikuyamada
Member

ikuyamada commented May 6, 2021

Hi @mgong023,
I do not know the specs of the Tesla K80, but the original NER model was trained on a Tesla V100 with 16GB of memory.
Also, I think the ImportError in that warning indicates that the apex library was not installed properly.
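
For reference, that fallback warning usually goes away after reinstalling apex with its C++ and CUDA extensions enabled. A sketch following the NVIDIA apex README (the exact flags may differ depending on the apex version, so check the current README):

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./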

@mgong023

Hi, I used the exact same data for the train, validation, and test sets to check whether the model actually learns. However, all scores (precision, recall, F1) stay at zero after fine-tuning. It seems like the model doesn't learn anything at all. Is there a way to fix this?

@ikuyamada
Member

@mgong023 Hi, did you use the CoNLL-2003 dataset to test the model? If not, I am sorry, but I cannot provide support for a specific use case. Also, this issue may not be related to your question; it is common practice to create a new issue if no existing issue relates to your problem.

@binhna

binhna commented Mar 9, 2022

@theblackcat102 Following your instructions, I can train a new LUKE model on my own dataset. But there seems to be a mismatch when converting it to Hugging Face transformers: there is no [MASK2] in the entity vocab, and I get unexpected keys: cls.predictions.bias, cls.predictions.transform.dense.weight, cls.predictions.transform.dense.bias, cls.predictions.transform.LayerNorm.weight, cls.predictions.transform.LayerNorm.bias, cls.predictions.decoder.weight, cls.predictions.decoder.bias, cls.seq_relationship.weight, cls.seq_relationship.bias, embeddings.position_ids
https://github.com/huggingface/transformers/blob/5b7dcc7342/src/transformers/models/luke/convert_luke_original_pytorch_checkpoint_to_pytorch.py

@svjan5

svjan5 commented Apr 5, 2022

Hi @theblackcat102 and @ikuyamada,
I followed the instructions you defined for pretraining LUKE on the Wikipedia corpus. However, using the hyperparameters provided, I get a really poorly performing model. Has anyone been able to reproduce the reported numbers using the above pretraining code?

@ikuyamada
Member

Hi @svjan5,

We are working on updated pretraining code and documentation, and plan to release them soon.

Additionally, as described in our paper (Appendix A), we adopt two-stage pretraining: in the first stage, we freeze the BERT weights (by specifying --fix_bert_weights) and update only the entity embeddings; in the second stage, we train the entire model starting from the first-stage model (by specifying the checkpoint file of the first-stage training using --model_file).
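
A minimal sketch of the two stages, reusing the pretrain command from the steps above (the flags --fix_bert_weights and --model_file are quoted from this comment and may be spelled differently in the current CLI; output directories and the stage-1 checkpoint file name are placeholders):

# Stage 1: freeze the RoBERTa weights and train only the entity embeddings
python -m luke.cli pretrain ./wikipedia_pretrain_dataset luke-roberta-base-stage1 \
    --bert-model-name roberta-base --fix_bert_weights

# Stage 2: initialize from the stage-1 checkpoint and train all parameters
python -m luke.cli pretrain ./wikipedia_pretrain_dataset luke-roberta-base-stage2 \
    --bert-model-name roberta-base \
    --model_file luke-roberta-base-stage1/<stage1-checkpoint>.bin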

@svjan5

svjan5 commented Apr 5, 2022

Thank you for your prompt reply, @ikuyamada. I will try the two-stage pretraining and will wait for the updated code and instructions.

@patelrajnath

@ikuyamada

You meant,
Pretraining: first stage
Downstream tasks: second stage

Please confirm.

Regards
Raj

@ikuyamada
Member

Hi @patelrajnath,
Thank you for your comment. No, our pretraining consists of two stages. This is explained in Appendix A of our paper:

To stabilize training, we update only those parameters that are randomly initialized (i.e., fix the parameters that are initialized using RoBERTa) in the first 100K steps, and update all parameters in the remaining 100K steps.

@ikuyamada
Member

@svjan5 Thanks! I will let you know when the pretraining instruction is available.

@bennigeir

While you wait @svjan5, here are instructions I created for pretraining LUKE for myself:
https://colab.research.google.com/drive/1lDxvauAAnCtQfybOyTnpv38vQyCI4Whq?usp=sharing

Note that this is for Icelandic and therefore uses data and models suited to that language, which you will probably want to change.

@svjan5

svjan5 commented Apr 5, 2022

Thanks a lot, @bennigeir. I will definitely give it a try.

@ikuyamada
Member

The pretraining instruction is available here.

@svjan5

svjan5 commented Apr 14, 2022

Great, thanks a lot @ikuyamada!
