
How to get the ratios of various data sources in the pre-training data? #18

Closed
jarheadjoe opened this issue Apr 11, 2023 · 4 comments

Comments


jarheadjoe commented Apr 11, 2023

How did you get the ratios of various data sources in the pre-training data for existing LLMs in Fig. 2?
The data in Fig. 2 differ from the papers I have read.
For example, the GPT-3 paper (https://arxiv.org/abs/2005.14165) does not mention conversation or code data, but Fig. 2 shows GPT-3 using conversation and code data for pre-training.
For PaLM, the data proportions in Table 2 (https://arxiv.org/pdf/2204.02311.pdf) also differ from your ratios.



hyp1231 commented Apr 11, 2023

Thanks so much for pointing out this issue! It seems that the bug was introduced while editing the raw figure file.

We will fix it, check all the other ratios again, and update our arXiv paper ASAP. Thanks again!

@Wangpeiyi9979

Hello, how is the percentage of code data counted? Is it the percentage of GitHub data?


hyp1231 commented Apr 26, 2023

> Hello, how is the percentage of code data counted? Is it the percentage of GitHub data?

Yes. Typically, data collected from GitHub is categorized as "code".
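
For illustration, here is a minimal sketch of how such category ratios could be computed from per-source token counts, assuming a hypothetical source-to-category mapping (with GitHub mapped to "code") and placeholder token counts; the actual tabulation behind Fig. 2 may differ:

```python
# Minimal sketch: computing data-source ratios from per-source token counts.
# The mapping and counts below are hypothetical placeholders, not figures
# taken from any specific paper.
from collections import defaultdict

# Assumed mapping from raw corpus names to coarse categories;
# e.g., GitHub-derived data is counted under "code".
CATEGORY = {
    "CommonCrawl": "webpages",
    "Wikipedia": "wikipedia",
    "Books": "books",
    "GitHub": "code",
    "Reddit": "conversation",
}

# Hypothetical token counts (in billions) for a model's pre-training corpus.
tokens_b = {
    "CommonCrawl": 410,
    "Wikipedia": 3,
    "Books": 67,
    "GitHub": 39,
    "Reddit": 50,
}

def category_ratios(tokens):
    """Aggregate token counts by category and normalize to percentages."""
    totals = defaultdict(float)
    for source, count in tokens.items():
        totals[CATEGORY[source]] += count
    grand = sum(totals.values())
    return {cat: 100.0 * t / grand for cat, t in totals.items()}

if __name__ == "__main__":
    for cat, pct in sorted(category_ratios(tokens_b).items(),
                           key=lambda x: -x[1]):
        print(f"{cat:>12}: {pct:5.1f}%")
```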

@Wangpeiyi9979

Thanks!

EliverQ closed this as completed Aug 31, 2023