Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning
This repository contains the reference implementation for the ACL 2025 main-conference paper **Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning**.
Shaobo Wang†1,2, Xiangqi Jin2, Ziming Wang2,3, Jize Wang1, Jiajun Zhang2,
Kaixin Li4, Zichen Wen2, Zhong Li5, Conghui He6, Xuming Hu7, Linfeng Zhang✉1,2
†Project Head, ✉Corresponding Author
1Shanghai Jiao Tong University 2EPIC Lab, Shanghai Jiao Tong University 3Nanyang Technological University
4National University of Singapore 5Microsoft Research Asia 6Shanghai AI Laboratory
7Hong Kong University of Science and Technology (Guangzhou)

(I) Few-shot In-Context Learning.
A set of demonstration and query examples is randomly sampled from the initial dataset, and an ICL prompt is constructed with a fixed instruction. The LLM to be fine-tuned generates answers for all query examples, and the average evaluation score is computed using the ground truth answers.
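For intuition, the snippet below sketches one such scoring iteration in Python. It is a minimal illustration, not code from this repository: the prompt template, the `generate_fn` callback standing in for the LLM to be fine-tuned, and the exact-match scoring are all simplifying assumptions.

```python
import random

def icl_score_once(dataset, generate_fn, instruction, batch_train=5, batch_test=3):
    """One few-shot ICL iteration: sample demonstrations and queries, build a
    prompt with a fixed instruction, and average the per-query scores."""
    demos = random.sample(dataset, batch_train)    # demonstration examples
    queries = random.sample(dataset, batch_test)   # query examples

    # Fixed instruction, then demonstrations with answers, then query inputs.
    prompt = instruction + "\n"
    prompt += "".join(f"Q: {d['question']}\nA: {d['answer']}\n" for d in demos)
    prompt += "".join(f"Q: {q['question']}\nA:" for q in queries)

    # generate_fn is a placeholder for the LLM to be fine-tuned; it should
    # return one predicted answer per query example.
    predictions = generate_fn(prompt, num_answers=len(queries))

    # Exact-match scoring against the ground truth (ROUGE-L would be used
    # for free-form tasks instead).
    scores = [float(p.strip() == q["answer"].strip())
              for p, q in zip(predictions, queries)]
    return demos, sum(scores) / len(scores)
```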
(II) Context-Aware Weighting.
During each iteration of few-shot ICL, we weight the scores of the demonstration examples based on their attention scores, which quantify their influence on the queries.
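The sketch below shows one way such weighting could be accumulated across iterations. It is a rough illustration under our own assumptions (attention averaged over query positions and normalized over demonstrations), not the exact formulation used in the paper.

```python
import numpy as np

def accumulate_weighted_scores(demo_ids, avg_score, attn_query_to_demo, running_scores):
    """Credit an iteration's average score back to each demonstration in
    proportion to the attention the queries paid to it.

    demo_ids:            ids of the demonstration examples in this iteration
    avg_score:           average evaluation score from step (I)
    attn_query_to_demo:  (num_queries, num_demos) aggregated attention matrix
    running_scores:      dict mapping example id -> accumulated weighted score
    """
    weights = attn_query_to_demo.mean(axis=0)   # influence of each demo on the queries
    weights = weights / weights.sum()           # normalize over demonstrations
    for demo_id, w in zip(demo_ids, weights):
        running_scores[demo_id] = running_scores.get(demo_id, 0.0) + w * avg_score
    return running_scores
```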
```bash
git clone https://github.com/gszfwsb/Data-Whisperer.git
cd Data-Whisperer
pip install -r requirements.txt
```
To run a data selection experiment with Data Whisperer, edit `scripts/run.sh` and adjust the parameters to your setup.
```bash
# Set dataset
DATASET=gsm8k # Support bioinstruct, gsm8k, dialogsum
# Set metric
METRIC=exact_match # Support rouge-L, exact_match
# Set model configurations
MODEL_TYPE=llama3_8b # Support llama3_8b, qwen, mistral
MODEL=Llama-3-8B-Instruct # Support Llama-3-8B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-3B-Instruct, Mistral-Nemo-Instruct-2407, Mistral-7B-Instruct-v0.2
MODEL_PATH= # <YOUR_MODEL_PATH>
# Set numbers of samples for demonstration and query
BATCH_TRAIN=5
BATCH_TEST=3
# Set parallel size
PARALLEL=5
```
- `DATASET`: Dataset name. Supported: `bioinstruct`, `gsm8k`, `dialogsum`.
- `METRIC`: Evaluation metric for the dataset. Supported: `rouge-L` for `bioinstruct` and `dialogsum`, and `exact_match` for `gsm8k`.
- `MODEL_TYPE`: Type of the model. Supported: `llama3_8b`, `qwen`, `mistral`.
- `MODEL`: Name of the model. Supported: `Llama-3-8B-Instruct`, `Qwen2.5-7B-Instruct`, `Mistral-Nemo-Instruct-2407`, and `Qwen2.5-3B-Instruct`, `Mistral-7B-Instruct-v0.2` for weak-to-strong experiments.
- `MODEL_PATH`: Path to your model.
- `BATCH_TRAIN`: Number of samples used as demonstrations in in-context learning.
- `BATCH_TEST`: Number of samples used as queries in in-context learning.
- `PARALLEL`: Parallel size.
After modifying parameters, run:
```bash
bash scripts/run.sh
```
Upon completion of the experiment, the scored dataset will be generated and stored in the `results/pruning` directory. You can then select data points based on the corresponding metric to construct a coreset.
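For example, a coreset covering the top-scoring fraction of the data could be built along these lines; the file name and JSON layout below are assumptions, so adapt them to the actual files produced under `results/pruning`.

```python
import json

# Hypothetical path and record layout; adjust to the actual output of run.sh.
with open("results/pruning/scores.json") as f:
    scored = json.load(f)  # e.g., a list of {"id": ..., "score": ...} records

# Keep the top 10% highest-scoring examples as the coreset.
scored.sort(key=lambda x: x["score"], reverse=True)
coreset = scored[: max(1, len(scored) // 10)]

with open("coreset.json", "w") as f:
    json.dump(coreset, f, indent=2)
```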
If you find Data Whisperer useful for your research and applications, please kindly cite using this BibTeX:
```bibtex
@inproceedings{wang2025datawhisperer,
  title     = {Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning},
  author    = {Wang, Shaobo and Jin, Xiangqi and Wang, Ziming and Wang, Jize and Zhang, Jiajun and Li, Kaixin and Wen, Zichen and Li, Zhong and He, Conghui and Hu, Xuming and Zhang, Linfeng},
  year      = {2025},
  booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
}
```