
I have questions about creating a pre-training model. #22

Open

yeontaek opened this issue Jun 22, 2019 · 6 comments

Comments

yeontaek commented Jun 22, 2019

Hi, we're working on a pre-training model. I have two questions about this process.

First, I have about 180 million sentences, and generating the tfrecords takes too long. I need advice on speeding up tfrecord creation.

Second, is there any performance problem if I change the model type when creating the SentencePiece model, e.g. to bpe, char, or word?

zihangdai (Owner) commented Jun 22, 2019

If you look at the data_utils.py script, there are two flags that can be leveraged for parallel preprocessing, namely FLAGS.num_task and FLAGS.task. Intuitively, num_task specifies the total number of processes you want to use, and FLAGS.task specifies the id of the current process.

Hence, one (possible but not recommended) way to do this using bash alone could be

NUM_PROC=1000
# Launch NUM_PROC background processes, each preprocessing one shard.
# ".... " below stands for the other data_utils.py flags.
for i in `seq 0 $((NUM_PROC - 1))`; do
  python data_utils.py \
    .... \
    --task=${i} \
    --num_task=${NUM_PROC} &
done
wait  # block until all background processes finish

Essentially, you launch 1,000 processes that preprocess the data in parallel. For better control, you could wrap this logic with Python's multiprocessing module, as in the sketch below.
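A minimal sketch of that idea, assuming data_utils.py accepts the same flags as in the bash loop above; the shard count and pool size are example values, and you would append your other flags to the command list:

import subprocess
from multiprocessing import Pool

NUM_TASK = 1000  # total number of shards, matching --num_task above

def run_shard(task_id):
    # Each worker preprocesses one shard; append the other
    # data_utils.py flags from your usual command to this list.
    subprocess.run(
        ["python", "data_utils.py",
         "--task={}".format(task_id),
         "--num_task={}".format(NUM_TASK)],
        check=True)

if __name__ == "__main__":
    # Cap concurrency so 1000 shards are not all launched at once.
    with Pool(processes=16) as pool:
        pool.map(run_shard, range(NUM_TASK))

The pool keeps a bounded number of shards running at a time, which avoids backgrounding all 1,000 processes simultaneously as the pure bash version does.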

For the second question: we never tried other sub-word models, but my guess is that it should be fine.
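For reference, the model type is a single argument when training the SentencePiece model. A minimal sketch with placeholder input path, output prefix, and vocabulary size (per the reply above, types other than the one used for the released model are untested):

import sentencepiece as spm

# Placeholder paths and sizes; model_type is the only point here.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # one sentence per line
    model_prefix="sp_model",   # writes sp_model.model / sp_model.vocab
    vocab_size=32000,
    model_type="bpe")          # alternatives: "unigram", "char", "word"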

yeontaek (Author)

@zihangdai
I'll try it the way you suggested and share the results. Thank you so much.

yeontaek (Author)

@zihangdai
I have one more question: why is there no "corpus_info_path" parameter when running "train.py"?

zihangdai (Owner)

corpus_info.json was originally designed to keep track of the meta-information of the processed tfrecords. Potentially, one could use corpus_info.json to replace many data-related flags. However, the currently released code does not take advantage of this and requires users to set them manually. This is a place we should improve.
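A hypothetical sketch of what that improvement could look like: read the meta-information written at preprocessing time and verify it against the flags about to be used for training. The field name below is an illustrative assumption, not the actual corpus_info.json schema:

import json

def check_corpus_info(corpus_info_path, train_flags):
    # Compare values recorded during preprocessing against the values
    # about to be used for training, so mismatches are caught early.
    with open(corpus_info_path) as f:
        corpus_info = json.load(f)
    for name, train_value in train_flags.items():
        if name in corpus_info and corpus_info[name] != train_value:
            raise ValueError(
                "Flag {}: preprocessing used {}, training uses {}".format(
                    name, corpus_info[name], train_value))

# Example usage; "uncased" is an illustrative field name.
check_corpus_info("corpus_info.json", {"uncased": False})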

tomohideshibata

Since the --corpus_info_path option isn't used in train.py/train_gpu.py, we have to specify the same options in both data_utils.py and train.py/train_gpu.py. So it would be better for these scripts to share the same defaults; for example, the default values for --uncased currently differ.

kimiyoung (Collaborator) commented Jun 25, 2019

Good catch. --corpus_info_path has been removed from the README since it's not used. The other flags will be fixed soon.
