
G-DIG: Towards Gradient-based DIverse and hiGh-quality Instruction Data Selection for Machine Translations

In this implementation, we use the bigscience/bloom-560m model and the demo dataset demo.json for demonstration. Our proposed high-quality selection method in G-DIG consists of three steps:

  1. Fine-tune the target model on the candidate data $\mathcal{D}_{raw}$ using a Hugging Face-compatible model, and save the checkpoint to ./checkpoint; it is used later to compute the Hessian matrix and the scores. (A minimal fine-tuning sketch is given after this list.)
  2. Compute the Hessian matrix:

./hessian.sh

# In this script, demo.json should be replaced by the training data you used to fine-tune the LLM.

  3. Run the influence function to compute the data scores (see the conceptual sketch after this list):

./if_score.sh

# In this script, -d demo.json corresponds to the candidate dataset and -q demo.json corresponds to the seed dataset.
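For step 1, the fine-tuning itself is standard supervised instruction tuning with the Hugging Face stack. Below is a minimal sketch, not the repo's training script; the field names "instruction" and "output" in demo.json and all hyperparameters are assumptions for illustration.

```python
# Minimal fine-tuning sketch for step 1 (assumed demo.json schema and hyperparameters).
import json
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

MODEL_NAME = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

class SFTDataset(Dataset):
    """Tokenizes each record into a causal-LM training example."""
    def __init__(self, path, max_len=512):
        with open(path) as f:
            self.records = json.load(f)
        self.max_len = max_len

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        # "instruction" / "output" are hypothetical field names for demo.json.
        text = rec["instruction"] + rec["output"]
        enc = tokenizer(text, truncation=True, max_length=self.max_len,
                        padding="max_length", return_tensors="pt")
        input_ids = enc["input_ids"].squeeze(0)
        return {"input_ids": input_ids,
                "attention_mask": enc["attention_mask"].squeeze(0),
                "labels": input_ids.clone()}

args = TrainingArguments(output_dir="./checkpoint",
                         num_train_epochs=3,
                         per_device_train_batch_size=4,
                         learning_rate=2e-5,
                         save_strategy="epoch")
Trainer(model=model, args=args, train_dataset=SFTDataset("demo.json")).train()
```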
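Steps 2 and 3 are carried out by hessian.sh and if_score.sh. Purely as a conceptual illustration of the quantity they compute, the sketch below shows an influence score of a candidate example on a seed (query) example, -∇L(z_q)ᵀ H⁻¹ ∇L(z), with H⁻¹v approximated by LiSSA-style Hessian-vector-product iterations. This is not the repo's implementation; in practice the computation is typically restricted to a subset of parameters, and the damping/scale/step values here are illustrative.

```python
# Conceptual sketch of influence scoring (not the hessian.sh / if_score.sh implementation).
import torch

def flat_grad(loss, params, create_graph=False):
    """Gradient of a scalar loss w.r.t. params, flattened into one vector."""
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])

def hvp(loss_fn, batch, params, v):
    """Hessian-vector product H v via double backpropagation."""
    g = flat_grad(loss_fn(batch), params, create_graph=True)
    return flat_grad(g @ v, params)

def inverse_hvp(loss_fn, train_batches, params, v,
                damping=0.01, scale=25.0, steps=100):
    """LiSSA-style approximation of H^{-1} v (hyperparameters are illustrative)."""
    estimate = v.clone()
    for batch in train_batches[:steps]:
        estimate = v + (1 - damping) * estimate - hvp(loss_fn, batch, params, estimate) / scale
    return estimate / scale

def influence(loss_fn, candidate, query, train_batches, params):
    """Influence of the candidate example on the query (seed) example."""
    v = inverse_hvp(loss_fn, train_batches, params, flat_grad(loss_fn(query), params))
    return -torch.dot(flat_grad(loss_fn(candidate), params), v).item()
```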

Finally, use the data scores to select high-quality data according to Equation (4) in the paper.
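The precise selection criterion is Equation (4) in the paper. The snippet below is only an illustrative final step: it assumes a hypothetical [num_candidates, num_seeds] score matrix saved as if_scores.npy, aggregates by the mean over seeds, and keeps the top-k candidates; the aggregation, the file name, and the sign convention (highest vs. lowest scores) are assumptions, not the paper's definition.

```python
# Illustrative selection step; see Equation (4) in the paper for the actual criterion.
import json
import numpy as np

scores = np.load("if_scores.npy")      # hypothetical [num_candidates, num_seeds] score matrix
with open("demo.json") as f:
    candidates = json.load(f)

k = 1000                               # desired subset size
agg = scores.mean(axis=1)              # aggregation over the seed set is an assumption
idx = np.argsort(agg)[-k:]             # keep the k highest; flip if lower scores are better
selected = [candidates[i] for i in idx]

with open("selected.json", "w") as f:
    json.dump(selected, f, ensure_ascii=False, indent=2)
```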

Data

We release our selected data (EN->ZH and DE->EN):

  1. DE->EN training data in sizes from 1k to 64k, available here.
  2. ZH->EN training data: to be released.

Citation

If this repo is useful to you, please consider citing:

@article{pan2024g,
  title={G-DIG: Towards Gradient-based DIverse and hiGh-quality Instruction Data Selection for Machine Translation},
  author={Pan, Xingyuan and Huang, Luyang and Kang, Liyan and Liu, Zhicheng and Lu, Yu and Cheng, Shanbo},
  journal={arXiv preprint arXiv:2405.12915},
  year={2024}
}
