How can I get a reference "txt.jsonl.zst", and what role does the "random text" play in the pretraining steps? #5

Closed
Haena0320 opened this issue Mar 9, 2022 · 3 comments

Comments

@Haena0320

Hi,

Thanks for releasing your work.

I'm currently trying to run your data/process.py code on my own crawled videos.

Everything works well except text_iterator().

I think this is because I couldn't create "txt.jsonl.zst", which is used as random_text for the pretraining batches.

So I was wondering whether there is any reference code or sample data for creating "text.jsonl.zst" on my own?

If that isn't possible, could you explain the role of "random_text" in the pretraining step, to help me understand your work?
(I couldn't see how the "random text" fits in with the MERLOT-Reserve pre-training objectives.)

Thank you,
Haena

@AfrinaVT

AfrinaVT commented Jul 1, 2022

Hi, were you able to get the random text working? I am having the same issue.

@Haena0320
Author

I found that the random text refers to "the Pile" dataset, described in the paper below (this was already mentioned in the MERLOT-Reserve paper, but I only noticed it later).

Gao, Leo, et al. "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv preprint arXiv:2101.00027 (2020).

Before running process.py, the Pile dataset should be split into 16410 sub-files in zst format and stored under the STORAGE_DIR path.
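
In case it helps, here is a rough sketch of how that split could be done. It is not the authors' script. It assumes the Pile is available as .jsonl.zst files with one JSON document per line, that the third-party zstandard package is installed, and that the number of documents is already known from a counting pass. The output file naming and the helper names (shard_sizes, iter_pile_lines, write_shards) are my own placeholders, so adjust them to whatever process.py actually looks for.

```python
import itertools
import os
from typing import Callable, Iterable, Iterator, List, Optional

NUM_SHARDS = 16410  # shard count mentioned in this thread


def shard_sizes(total_docs: int, num_shards: int) -> List[int]:
    """Split total_docs into num_shards contiguous, near-equal shard sizes."""
    base, extra = divmod(total_docs, num_shards)
    # The first `extra` shards each take one additional document.
    return [base + (1 if i < extra else 0) for i in range(num_shards)]


def iter_pile_lines(paths: Iterable[str]) -> Iterator[str]:
    """Yield non-empty JSON lines from the given .jsonl.zst files."""
    import zstandard  # third-party (pip install zstandard); imported lazily

    for path in paths:
        with zstandard.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield line


def _zst_text_writer(path: str):
    """Open path for zstd-compressed text output."""
    import zstandard

    return zstandard.open(path, "wt", encoding="utf-8")


def write_shards(
    lines: Iterable[str],
    total_docs: int,
    out_dir: str,
    num_shards: int = NUM_SHARDS,
    opener: Optional[Callable] = None,
) -> List[str]:
    """Write lines into num_shards files, keeping one file open at a time."""
    opener = opener or _zst_text_writer
    os.makedirs(out_dir, exist_ok=True)
    line_iter = iter(lines)
    written = []
    for i, size in enumerate(shard_sizes(total_docs, num_shards)):
        # Hypothetical naming scheme; match the names process.py expects.
        path = os.path.join(out_dir, f"{i:05d}of{num_shards:05d}.jsonl.zst")
        with opener(path) as out:
            for line in itertools.islice(line_iter, size):
                out.write(line if line.endswith("\n") else line + "\n")
        written.append(path)
    return written
```

Writing the shards one after another keeps only a single output file open at a time, which avoids running into the open-file limit that a round-robin split over 16410 files would hit. The cost is that total_docs has to be known up front, for example from one extra pass over iter_pile_lines.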

@AfrinaVT

AfrinaVT commented Jul 2, 2022

Thank you very much for the help.
