How can I get a reference "txt.jsonl.zst", and what role does the "random text" play in the pretraining steps? #5

Haena0320 · 2022-03-09T15:26:57Z

Hi,

Thanks for releasing your work.

I'm currently trying to run your data/process.py code on my own crawled videos.

Everything works well except text_iterator().

I think this is because I couldn't create the "txt.jsonl.zst" file that is used as random_text in the pretraining batches.

So I was wondering whether there is any reference code or sample data for creating "text.jsonl.zst" on my own?

If that isn't possible, could you explain the role of "random_text" in the pretraining step, to help me understand your work?
(I couldn't work out how the "random text" fits with the MERLOT-Reserve pre-training objectives.)

Thank you,
Haena

AfrinaVT · 2022-07-01T18:12:33Z

Hi, were you able to get the random text working? I am having the same issue.

Haena0320 · 2022-07-02T01:48:08Z

I found that the random text refers to "the Pile" dataset, described in the paper below (this was already mentioned in the MERLOT-Reserve paper, but I only noticed it later).

Gao, Leo, et al. "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv preprint arXiv:2101.00027 (2020).

Before running process.py, the Pile dataset should be split into 16410 sub-files in zst format and stored under the STORAGE_DIR path.
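
For anyone else doing this, here is a rough sketch of one way to split a JSONL text stream into zstd-compressed sub-files. It is not the authors' script. The function names, the `shard_XXXXX.jsonl.zst` naming scheme and the `lines_per_shard` parameter are all my own assumptions, so adjust them to whatever text_iterator() in process.py actually expects. It uses the third-party `zstandard` package.

```python
import itertools
from pathlib import Path


def chunk_lines(lines, lines_per_shard):
    """Yield (shard_index, chunk), where chunk is a list of up to lines_per_shard lines."""
    it = iter(lines)
    for shard_index in itertools.count():
        chunk = list(itertools.islice(it, lines_per_shard))
        if not chunk:
            return
        yield shard_index, chunk


def write_zst_shards(lines, out_dir, lines_per_shard):
    """Write each chunk as one zstd-compressed JSONL file; return the number of files."""
    import zstandard  # third-party (pip install zstandard), imported only when writing

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    compressor = zstandard.ZstdCompressor(level=3)
    num_shards = 0
    for shard_index, chunk in chunk_lines(lines, lines_per_shard):
        # One JSON document per line, newline-terminated, UTF-8 encoded.
        payload = "".join(line.rstrip("\n") + "\n" for line in chunk).encode("utf-8")
        # Hypothetical file name; rename to match what process.py looks for.
        shard_path = out_dir / f"shard_{shard_index:05d}.jsonl.zst"
        shard_path.write_bytes(compressor.compress(payload))
        num_shards += 1
    return num_shards
```

To end up with 16410 files, `lines_per_shard` would have to be chosen from the total number of documents in your copy of the Pile. Each chunk is held in memory before it is written, so this assumes a single sub-file fits comfortably in RAM.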

AfrinaVT · 2022-07-02T12:43:57Z

Thank you very much for the help.

Haena0320 closed this as completed Mar 10, 2022
