Pre-training dataset format #14

zhouchx · 2019-06-13T03:00:25Z

Could you please give an example of the pre-training corpus format (such as the .idx and .bin files)? I want to use your model structure with my own corpus to train a new model. Thank you very much.

zzy14 · 2019-06-15T02:56:23Z

Hi,

I have pushed a new branch called pretrain. It contains a folder called pretrain_data, which holds a small dataset built with max_seq_len=256.
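For a first look at the sample data, a minimal sketch like the following could list the .idx and .bin files with their sizes. It assumes the pretrain branch is checked out and that pretrain_data sits in the current working directory; the helper name list_corpus_files is hypothetical and not part of the repository. It does not parse the files, since their internal layout is not described in this thread.

```python
from pathlib import Path


def list_corpus_files(data_dir):
    """Return (name, size_in_bytes) for each .idx/.bin file in data_dir, sorted by name."""
    root = Path(data_dir)
    return sorted(
        (p.name, p.stat().st_size)
        for p in root.iterdir()
        if p.is_file() and p.suffix in {".idx", ".bin"}
    )


# Assumption: "pretrain_data" is relative to the current directory
# (the repository root after checking out the pretrain branch).
if Path("pretrain_data").is_dir():
    for name, size in list_corpus_files("pretrain_data"):
        print(f"{name}\t{size} bytes")
```

Comparing these file names and sizes with the output for your own corpus is only a rough sanity check; the preprocessing code in the branch remains the reference for the actual format.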

zzy14 closed this as completed Jun 19, 2019
