
How to get the ratios of various data sources in the pre-training data? #18

Closed
jarheadjoe opened this issue Apr 11, 2023 · 4 comments

Comments


jarheadjoe commented Apr 11, 2023

How did you get the ratios of various data sources in the pre-training data for existing LLMs in Fig. 2?
The data in Fig. 2 differ from the papers I have read.
For example, the GPT-3 paper (https://arxiv.org/abs/2005.14165) does not mention conversation or code data, but Fig. 2 shows GPT-3 using conversation and code data for pre-training.
For PaLM, the data proportions in Table 2 (https://arxiv.org/pdf/2204.02311.pdf) also differ from your ratios.



hyp1231 commented Apr 11, 2023

Thanks so much for pointing out this issue! It seems that the bug was introduced while editing the raw figure file.

We will fix it, check all the other ratios again, and update our arXiv paper ASAP. Thanks again!

@Wangpeiyi9979

Hello, how is the percentage of code data counted? Is it the percentage of GitHub data?


hyp1231 commented Apr 26, 2023

> Hello, how is the percentage of code data counted? Is it the percentage of GitHub data?

Yes. Typically, data collected from GitHub is categorized as "code".
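
For illustration, here is a minimal sketch of how such category ratios could be computed from per-source token counts, assuming a hypothetical source-to-category mapping (with GitHub mapped to "code") and placeholder token counts; the actual tabulation behind Fig. 2 may differ:

```python
# Minimal sketch: computing data-source ratios from per-source token counts.
# The mapping and counts below are hypothetical placeholders, not figures
# taken from any specific paper.
from collections import defaultdict

# Assumed mapping from raw corpus names to coarse categories;
# e.g., GitHub-derived data is counted under "code".
CATEGORY = {
    "CommonCrawl": "webpages",
    "Wikipedia": "wikipedia",
    "Books": "books",
    "GitHub": "code",
    "Reddit": "conversation",
}

# Hypothetical token counts (in billions) for a model's pre-training corpus.
tokens_b = {
    "CommonCrawl": 410,
    "Wikipedia": 3,
    "Books": 67,
    "GitHub": 39,
    "Reddit": 50,
}

def category_ratios(tokens):
    """Aggregate token counts by category and normalize to percentages."""
    totals = defaultdict(float)
    for source, count in tokens.items():
        totals[CATEGORY[source]] += count
    grand = sum(totals.values())
    return {cat: 100.0 * t / grand for cat, t in totals.items()}

if __name__ == "__main__":
    for cat, pct in sorted(category_ratios(tokens_b).items(),
                           key=lambda x: -x[1]):
        print(f"{cat:>12}: {pct:5.1f}%")
```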

@Wangpeiyi9979

Thanks!

EliverQ closed this as completed Aug 31, 2023