
Questions related to training #11

Closed
JieDengsc opened this issue Nov 22, 2023 · 5 comments

@JieDengsc

Thank you for sharing this work!

I'm trying to train models using my Chinese SFT data. I have some questions as follows:

  1. My first step is to run "pre_experience_analysis.sh", but it appears to process all of my JSON data, which takes a long time. Is that expected? The "start_idx" and "end_idx" arguments of "data_analysis.py" are not set in your code.

  2. Do I need to modify the code for my own Chinese SFT data, or can I use it as-is?

@MingLiiii
Collaborator

Thanks for your interest in our work!

The direct answer to your Q1 is YES. We found that the best way to train a pre-experienced model is to account for diversity, so we obtain embeddings for all the data and then select by diversity.
However:
1. If your base model is already very powerful, you can skip the pre-experienced model and run the cherry_analysis directly on the base model.
2. You can also randomly choose some data to train the pre-experienced model. Though not as good as selecting for diversity, it still works.
3. You can also use other quick methods to account for diversity, for example Sentence-BERT embeddings followed by k-means clustering.
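To make option 3 concrete, here is a minimal sketch of diversity-based subset selection: embed each instruction, cluster the embeddings with k-means, and keep the one sample nearest each cluster center. TF-IDF stands in for a Sentence-BERT encoder so the sketch runs without downloading a model; in practice you would swap in something like `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)` from the sentence-transformers library. This is an illustrative sketch, not the repo's actual selection code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def select_diverse_subset(texts, k):
    """Return indices of k texts spread across the embedding space."""
    # Stand-in embedding; replace with Sentence-BERT for real use.
    emb = TfidfVectorizer().fit_transform(texts).toarray()
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(emb)
    selected = []
    for c in range(k):
        # Among members of cluster c, keep the point closest to its centroid.
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(emb[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(dists)]))
    return sorted(set(selected))

texts = [
    "Translate this sentence to French.",
    "Write a Python function to sort a list.",
    "Summarize the following paragraph.",
    "Fix the bug in this code snippet.",
    "Explain photosynthesis to a child.",
    "Convert the sentence into passive voice.",
]
subset = select_diverse_subset(texts, k=3)
print(subset)  # three distinct indices, one per cluster
```

Because each index is taken from a different cluster, the subset is guaranteed to contain one representative per cluster, which is the "quick diversity" idea in a few lines.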

As for the second question: I don't know which base model and which SFT data you are using, so I cannot give a definite answer, but in most situations you should not need to modify the code.

@JieDengsc
Author

JieDengsc commented Nov 22, 2023

Thank you for your reply!

I asked because the paper describes "Learning from Brief Experience" as selecting a small amount of data, so I wasn't sure it is right to feed all the data in for training.
In addition, training on the full data takes a long time.

I'll try it. Thank you.

@MingLiiii
Collaborator

Ah, I am not sure whether there is still a misunderstanding.

The pre-experienced model indeed only needs a small amount of data. The "pre_experience_analysis.sh" script you were asking about does not "put all the data into it for training"; it only tries to select a suitable small subset of the data for training the pre-experienced model.

@JieDengsc
Author

Thank you for your reply.

Maybe I didn't phrase my question accurately.
The "pre_experience_analysis.sh" script does not perform training: it embeds all the SFT data (that is, "get_perplexity_and_embedding_whole_text"), and then the "pre_experience_selection.sh" script performs the clustering.

Is my understanding correct?

Thank you again

@MingLiiii
Collaborator

Yes, I think you are correct~
