
I have questions about creating a pre-training model. #22

Open

yeontaek opened this issue Jun 22, 2019 · 6 comments

Comments

yeontaek commented Jun 22, 2019

Hi, we're working on a pre-training model. I have two questions about this process.

First, I have about 180 million sentences, and generating the tfrecords takes too long. I need advice on speeding up tfrecord creation.

Second, is there any performance problem if I change the model type when creating the SentencePiece model, e.g. to bpe, char, or word?

zihangdai (Owner) commented Jun 22, 2019

If you look at the data_utils.py script, there are two flags that can be leveraged for parallel preprocessing, namely FLAGS.num_task and FLAGS.task. Intuitively, num_task specifies the total number of processes you want to use, and FLAGS.task specifies the id of the current process.

Hence, one (possible but not recommended) way to do this using bash alone could be

NUM_PROC=1000
# Launch NUM_PROC background processes, each preprocessing one shard.
# ".... " below stands for the other data_utils.py flags.
for i in `seq 0 $((NUM_PROC - 1))`; do
  python data_utils.py \
    .... \
    --task=${i} \
    --num_task=${NUM_PROC} &
done
wait  # block until all background processes finish

Essentially, you launch 1,000 processes that preprocess the data in parallel. For better control, you could wrap this logic with Python's multiprocessing module, as in the sketch below.
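A minimal sketch of that idea, assuming data_utils.py accepts the same flags as in the bash loop above; the shard count and pool size are example values, and you would append your other flags to the command list:

import subprocess
from multiprocessing import Pool

NUM_TASK = 1000  # total number of shards, matching --num_task above

def run_shard(task_id):
    # Each worker preprocesses one shard; append the other
    # data_utils.py flags from your usual command to this list.
    subprocess.run(
        ["python", "data_utils.py",
         "--task={}".format(task_id),
         "--num_task={}".format(NUM_TASK)],
        check=True)

if __name__ == "__main__":
    # Cap concurrency so 1000 shards are not all launched at once.
    with Pool(processes=16) as pool:
        pool.map(run_shard, range(NUM_TASK))

The pool keeps a bounded number of shards running at a time, which avoids backgrounding all 1,000 processes simultaneously as the pure bash version does.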

For the second question: we never tried other sub-word models, but my guess is that it should be fine.
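For reference, the model type is a single argument when training the SentencePiece model. A minimal sketch with placeholder input path, output prefix, and vocabulary size (per the reply above, types other than the one used for the released model are untested):

import sentencepiece as spm

# Placeholder paths and sizes; model_type is the only point here.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # one sentence per line
    model_prefix="sp_model",   # writes sp_model.model / sp_model.vocab
    vocab_size=32000,
    model_type="bpe")          # alternatives: "unigram", "char", "word"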

yeontaek (Author)

@zihangdai
I'll try it the way you suggested and share the results. Thank you so much.

yeontaek (Author)

@zihangdai
I have one more question: why is there no "corpus_info_path" parameter when running "train.py"?

zihangdai (Owner)

corpus_info.json was originally designed to keep track of the meta-information of the processed tfrecords. Potentially, one could use corpus_info.json to replace many data-related flags. However, the currently released code does not take advantage of this and requires users to set them manually. This is a place we should improve.
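A hypothetical sketch of what that improvement could look like: read the meta-information written at preprocessing time and verify it against the flags about to be used for training. The field name below is an illustrative assumption, not the actual corpus_info.json schema:

import json

def check_corpus_info(corpus_info_path, train_flags):
    # Compare values recorded during preprocessing against the values
    # about to be used for training, so mismatches are caught early.
    with open(corpus_info_path) as f:
        corpus_info = json.load(f)
    for name, train_value in train_flags.items():
        if name in corpus_info and corpus_info[name] != train_value:
            raise ValueError(
                "Flag {}: preprocessing used {}, training uses {}".format(
                    name, corpus_info[name], train_value))

# Example usage; "uncased" is an illustrative field name.
check_corpus_info("corpus_info.json", {"uncased": False})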

tomohideshibata

Since the --corpus_info_path option isn't used in train.py/train_gpu.py, we have to specify the same options in both data_utils.py and train.py/train_gpu.py. So it would be better for these scripts to share the same defaults; for example, the default values for --uncased currently differ.

kimiyoung (Collaborator) commented Jun 25, 2019

Good catch. --corpus_info_path has been removed from the README since it's not used. The other flags will be fixed soon.
