
Questions related to training #11

Closed
JieDengsc opened this issue Nov 22, 2023 · 5 comments

@JieDengsc

Thank you for sharing this work!

I'm trying to train models using my Chinese SFT data. I have some questions as follows:

  1. My first step is to run "pre_experience_analysis.sh", but it appears to process all of my JSON data, which takes a long time. Is that expected? The "start_idx" and "end_idx" arguments of "data_analysis.py" are not set in your code.

  2. Do I need to modify the code for my own Chinese SFT data, or can I use it as-is?

@MingLiiii
Collaborator

Thanks for your interest in our work!

The direct answer to your Q1 is YES. We found that the best way to train a pre-experienced model is to account for diversity, so we obtain embeddings for all the data and then select by diversity.
However:
1. If your base model is already very powerful, you can skip the pre-experienced model and run the cherry_analysis directly on the base model.
2. You can also randomly choose some data to train the pre-experienced model. Though not as good as selecting for diversity, it still works.
3. You can also use other quick methods to account for diversity, for example Sentence-BERT embeddings followed by k-means clustering.
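To make option 3 concrete, here is a minimal sketch of diversity-based subset selection: embed each instruction, cluster the embeddings with k-means, and keep the one sample nearest each cluster center. TF-IDF stands in for a Sentence-BERT encoder so the sketch runs without downloading a model; in practice you would swap in something like `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)` from the sentence-transformers library. This is an illustrative sketch, not the repo's actual selection code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def select_diverse_subset(texts, k):
    """Return indices of k texts spread across the embedding space."""
    # Stand-in embedding; replace with Sentence-BERT for real use.
    emb = TfidfVectorizer().fit_transform(texts).toarray()
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(emb)
    selected = []
    for c in range(k):
        # Among members of cluster c, keep the point closest to its centroid.
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(emb[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(dists)]))
    return sorted(set(selected))

texts = [
    "Translate this sentence to French.",
    "Write a Python function to sort a list.",
    "Summarize the following paragraph.",
    "Fix the bug in this code snippet.",
    "Explain photosynthesis to a child.",
    "Convert the sentence into passive voice.",
]
subset = select_diverse_subset(texts, k=3)
print(subset)  # three distinct indices, one per cluster
```

Because each index is taken from a different cluster, the subset is guaranteed to contain one representative per cluster, which is the "quick diversity" idea in a few lines.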

As for the second question: I don't know which base model and which SFT data you are using, so I cannot give a definite answer, but in most situations you should not need to modify the code.

@JieDengsc
Author

JieDengsc commented Nov 22, 2023

Thank you for your reply!

I asked because the paper describes "Learning from Brief Experience" as selecting a small amount of data, so I wasn't sure it is right to feed all the data in for training.
In addition, training on the full data takes a long time.

I'll try it. Thank you.

@MingLiiii
Collaborator

Ah, I am not sure whether there is still a misunderstanding.

The pre-experienced model indeed only needs a small amount of data. The "pre_experience_analysis.sh" script you were asking about does not "put all the data into it for training"; it only tries to select a suitable small subset of the data for training the pre-experienced model.

@JieDengsc
Author

Thank you for your reply.

Maybe I didn't phrase my question accurately.
The "pre_experience_analysis.sh" script does not perform training: it embeds all the SFT data (that is, "get_perplexity_and_embedding_whole_text"), and then the "pre_experience_selection.sh" script performs the clustering.

Is my understanding correct?

Thank you again

@MingLiiii
Collaborator

Yes, I think you are correct~
