Why is the process so slow? #21
Firstly, thank you for your interest in our work. Computing IFD scores requires inference on LLMs, so it is naturally time-consuming. However, we also proposed Superfiltering (ACL'24), which uses small language models like GPT-2, rather than LLMs, to select the data, tremendously lowering the time and cost of the data selection process. If efficiency is important to you, please try it. Secondly, you did not provide enough information about your observation:
Again, thank you for your interest! We highly recommend trying our Superfiltering (ACL'24) if efficiency is important to you!
Thank you for your answer!!
And does this project support data selection on Chinese datasets?
Thank you for your interest! Based on your data, I think it is quite reasonable that it costs many hours. Though it has only 50k samples, the total size is almost 20 times that of the Alpaca data. Unfortunately, I am no expert on accelerating inference, sorry about that. As for whether this method supports Chinese datasets, I think the answer is yes. Our method is language-agnostic: it computes and compares the losses/perplexities produced by base models. So if the base model itself supports another language, then our method should work for it.
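To illustrate the loss/perplexity comparison described above: the IFD score is, roughly, the perplexity of the answer conditioned on the instruction divided by the perplexity of the answer alone, so a lower ratio means the instruction genuinely helps the model predict the answer. Here is a minimal sketch assuming you have already run the (language-agnostic) base model and collected per-token losses for both passes; the function name and the precomputed-loss inputs are illustrative, not the project's actual API.

```python
import math

def ifd_score(cond_token_losses, direct_token_losses):
    """Illustrative IFD ratio from precomputed per-token cross-entropy losses.

    cond_token_losses:   losses on answer tokens given the instruction as context
    direct_token_losses: losses on the same answer tokens with no instruction

    Perplexity is exp(mean token loss); IFD is their ratio. This assumes
    the base model inference step has already produced these losses.
    """
    cond_ppl = math.exp(sum(cond_token_losses) / len(cond_token_losses))
    direct_ppl = math.exp(sum(direct_token_losses) / len(direct_token_losses))
    return cond_ppl / direct_ppl
```

Because only loss values enter the computation, nothing here depends on the language of the text, which is why the method carries over to Chinese data as long as the base model handles Chinese.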
If you are interested in our method or have further questions, we can also connect on WeChat for easier communication. Thank you!
It costs 100 hours to select from 50,000 multi-turn samples.