
Pretraining instruction #41

Closed
JiachengLi1995 opened this issue Jan 5, 2021 · 19 comments

Comments

@JiachengLi1995

Hi authors,

Awesome work! Thanks for your code and instructions. I would like to pretrain a new LUKE model on my own dataset. Could you provide pretraining instructions so I can learn how to do this? Thank you!

@theblackcat102

@JiachengLi1995 The steps to pretrain are fairly straightforward. You can run python -m luke.cli to see the available commands and then follow the steps below in order; a consolidated sketch of the whole pipeline follows the steps.

  1. Parse the Wikipedia dump into a database:
python -m luke.cli build-dump-db enwiki-latest-pages-articles.xml.bz2 ./enwiki_latest
  2. Generate the entity vocabulary from the parsed database; refer to utils/ent_vocab.py for the additional parameters (vocabulary size, whitelist, etc.):
python -m luke.cli build-entity-vocab ./enwiki_latest ./output_ent-vocab.jsonl
  3. Build the training dataset and the metadata needed for pretraining. Note that I am using the roberta-base tokenizer here; you can use others such as bert-base-cased:
python -m luke.cli build-wikipedia-pretraining-dataset ./enwiki_latest roberta-base ./output_ent-vocab.jsonl ./wikipedia_pretrain_dataset
  4. Finally, you can start pretraining by running the following command:
python -m luke.cli pretrain \
    ./wikipedia_pretrain_dataset \
    luke-roberta-base \
    --bert-model-name roberta-base \
    --entity-emb-size 300 \
    --batch-size 28 \
    --gradient-accumulation-steps 1 \
    --learning-rate 0.0005 \
    --warmup-steps 10000 \
    --log-dir logs/luke-roberta-base
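
Putting it together, a consolidated sketch of the whole pipeline, using the same placeholder paths and hyperparameters as in the steps above:

# 1. Parse the Wikipedia dump into a database
python -m luke.cli build-dump-db enwiki-latest-pages-articles.xml.bz2 ./enwiki_latest

# 2. Build the entity vocabulary
python -m luke.cli build-entity-vocab ./enwiki_latest ./output_ent-vocab.jsonl

# 3. Build the pretraining dataset (roberta-base tokenizer here)
python -m luke.cli build-wikipedia-pretraining-dataset ./enwiki_latest roberta-base ./output_ent-vocab.jsonl ./wikipedia_pretrain_dataset

# 4. Pretrain
python -m luke.cli pretrain ./wikipedia_pretrain_dataset luke-roberta-base \
    --bert-model-name roberta-base --entity-emb-size 300 --batch-size 28 \
    --gradient-accumulation-steps 1 --learning-rate 0.0005 --warmup-steps 10000 \
    --log-dir logs/luke-roberta-base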

@ikuyamada
Member

@theblackcat102 Thank you for providing the pretraining instructions for LUKE! Following these steps with the hyperparameters specified in the original paper should produce a model that performs similarly to our pretrained model.

@mgong023

mgong023 commented May 6, 2021

Hi @ikuyamada,

Sorry to bother you again. I wanted to fine-tune the model on the CoNLL-2003 dataset to see if it runs. On the entire dataset training starts, but it takes about 50 hours per epoch. So I took only the first few samples of the CoNLL dataset, but then I get an AssertionError: examples/ner/utils.py", line 238, in convert_examples_to_features assert not entity_labels. I really don't understand why this happens; I only made the dataset smaller.

@ikuyamada
Member

Hi @mgong023,
Please read the source code carefully to investigate why the error happens with your dataset. The error basically means that some entity spans detected in utils.py (lines 122-145) are missing for some reason.
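
If the goal is simply a smaller file for a quick test, one possible cause of the assertion is cutting the file at an arbitrary line so that a document is split in half; subsampling at document boundaries avoids that (this is an assumption based on the error described, not a confirmed diagnosis). A minimal sketch, assuming the standard CoNLL-2003 files with -DOCSTART- markers (file names are placeholders):

# keep only the first 20 documents of the training file
awk '/-DOCSTART-/{n++} n<=20' eng.train > eng.train.small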

@mgong023

mgong023 commented May 6, 2021

Alright, your recommendation gave me an insight, so I have fixed that error. I now encounter a CUDA out-of-memory error on the second epoch. I use 2x Nvidia Tesla K80. Are the GPUs not powerful enough, or is something going wrong? I also get the following warning at the start: Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'") Does this have anything to do with it?

@ikuyamada
Member

ikuyamada commented May 6, 2021

Hi @mgong023,
I do not know the specs of the Tesla K80, but the original NER model was trained on a Tesla V100 with 16GB of memory.
Also, I think the ImportError in that warning indicates that the apex library was not installed properly.
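
For reference, that fallback warning usually goes away after reinstalling apex with its C++ and CUDA extensions enabled. A sketch following the NVIDIA apex README (the exact flags may differ depending on the apex version, so check the current README):

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./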

@mgong023

Hi, I used the exact same data for the train, validation, and test sets to check whether the model actually learns. However, all scores (precision, recall, F1) stay at zero after fine-tuning. It seems like the model doesn't learn anything at all. Is there a way to fix this?

@ikuyamada
Member

@mgong023 Hi, did you use the CoNLL-2003 dataset to test the model? If not, I am sorry, but I cannot provide support for a specific use case. Also, this issue may not be related to your question; it is common practice to create a new issue if no existing issue relates to your problem.

@binhna

binhna commented Mar 9, 2022

@theblackcat102 Following your instructions, I can train a new LUKE model on my own dataset. But there seems to be a mismatch when converting it to Hugging Face transformers: there is no [MASK2] in the entity vocab, and I get unexpected keys: cls.predictions.bias, cls.predictions.transform.dense.weight, cls.predictions.transform.dense.bias, cls.predictions.transform.LayerNorm.weight, cls.predictions.transform.LayerNorm.bias, cls.predictions.decoder.weight, cls.predictions.decoder.bias, cls.seq_relationship.weight, cls.seq_relationship.bias, embeddings.position_ids
https://github.com/huggingface/transformers/blob/5b7dcc7342/src/transformers/models/luke/convert_luke_original_pytorch_checkpoint_to_pytorch.py

@svjan5

svjan5 commented Apr 5, 2022

Hi @theblackcat102 and @ikuyamada,
I followed the instructions you defined for pretraining LUKE on the Wikipedia corpus. However, using the hyperparameters provided, I get a really poorly performing model. Has anyone been able to reproduce the reported numbers using the above pretraining code?

@ikuyamada
Member

Hi @svjan5,

We are working on updated pretraining code and documentation, and plan to release them soon.

Additionally, as described in our paper (Appendix A), we adopt two-stage pretraining: in the first stage, we freeze the BERT weights (by specifying --fix_bert_weights) and update only the entity embeddings; in the second stage, we train the entire model starting from the first-stage model (by specifying the checkpoint file of the first-stage training using --model_file).
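
A minimal sketch of the two stages, reusing the pretrain command from the steps above (the flags --fix_bert_weights and --model_file are quoted from this comment and may be spelled differently in the current CLI; output directories and the stage-1 checkpoint file name are placeholders):

# Stage 1: freeze the RoBERTa weights and train only the entity embeddings
python -m luke.cli pretrain ./wikipedia_pretrain_dataset luke-roberta-base-stage1 \
    --bert-model-name roberta-base --fix_bert_weights

# Stage 2: initialize from the stage-1 checkpoint and train all parameters
python -m luke.cli pretrain ./wikipedia_pretrain_dataset luke-roberta-base-stage2 \
    --bert-model-name roberta-base \
    --model_file luke-roberta-base-stage1/<stage1-checkpoint>.bin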

@svjan5

svjan5 commented Apr 5, 2022

Thank you for your prompt reply, @ikuyamada. I will try the two-stage pretraining and will wait for the updated code and instructions.

@patelrajnath

@ikuyamada

You meant,
Pretraining: first stage
Downstream tasks: second stage

Please confirm.

Regards
Raj

@ikuyamada
Member

Hi @patelrajnath,
Thank you for your comment. No, our pretraining consists of two stages. This is explained in Appendix A of our paper:

To stabilize training, we update only those parameters that are randomly initialized (i.e., fix the parameters that are initialized using RoBERTa) in the first 100K steps, and update all parameters in the remaining 100K steps.

@ikuyamada
Member

@svjan5 Thanks! I will let you know when the pretraining instruction is available.

@bennigeir

While you wait @svjan5, here are instructions I created for pretraining LUKE for myself:
https://colab.research.google.com/drive/1lDxvauAAnCtQfybOyTnpv38vQyCI4Whq?usp=sharing

Note that this is for Icelandic and therefore uses data and models suited to that language, which you will probably want to change.

@svjan5

svjan5 commented Apr 5, 2022

Thanks a lot, @bennigeir. I will definitely give it a try.

@ikuyamada
Member

The pretraining instruction is available here.

@svjan5

svjan5 commented Apr 14, 2022

Great, thanks a lot @ikuyamada!
