How to filter the instruction tuning data? #39

lovecambi · 2023-05-16T01:19:56Z

According to the comment in the config file, the size of each dataset (# [50997(alpaca), 155562(llava), 53456(quora), 101466(sharegpt)] 361481 ) differs from the size of the corresponding original dataset.

Is there any code or script to filter the data?

MAGAer13 · 2023-05-16T02:55:11Z

Hi, we did not filter the dataset. We held out some data for validation (~1k for each dataset), so each dataset is smaller than the original one.
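
For readers who want to reproduce a comparable split, here is a minimal sketch. It assumes a uniformly random hold-out of a fixed number of examples per dataset; the thread does not say how the validation examples were actually selected, so this is not necessarily the authors' procedure. The function name `hold_out_validation`, the seed, and the toy data are hypothetical.

```python
import random


def hold_out_validation(records, val_size=1000, seed=0):
    """Split `records` into (train, val) by holding out `val_size` random items.

    Assumption: a uniformly random hold-out with a fixed seed. The actual
    selection rule used for the datasets in this thread is not documented here.
    """
    if val_size > len(records):
        raise ValueError("val_size exceeds the number of records")
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)      # deterministic shuffle for a given seed
    val_indices = indices[:val_size]
    val_index_set = set(val_indices)
    # Keep the original order for the training portion.
    train = [r for i, r in enumerate(records) if i not in val_index_set]
    val = [records[i] for i in val_indices]
    return train, val


# Toy stand-in for one instruction dataset (not real data).
toy_records = [{"id": i} for i in range(5000)]
train_split, val_split = hold_out_validation(toy_records, val_size=1000)
print(len(train_split), len(val_split))
```

With a split like this, the training portion of each dataset ends up roughly 1k examples smaller than the original, which would account for the sizes listed in the config comment.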

MAGAer13 closed this as completed May 16, 2023

zhangliang-04 mentioned this issue May 23, 2023

Train/Validation splits #56

Closed
