Awesome ML Data Quality Papers

This is a list of papers about training data quality management for ML models.

Introduction

Data scientists spend roughly 80% of their time on data preparation for an ML pipeline, because data quality issues are unknown beforehand and surface only through iterative debugging [1]. A good Data Quality Management System for ML (DQMS for ML) frees data scientists from the arduous process of data selection and debugging, particularly in the current era of big data and large models. Automating training data quality management is crucial for improving the efficiency and quality of ML pipelines.

With the emergence and development of "Data-Centric AI", there has been increasing research focus on optimizing the quality of training data rather than solely concentrating on model structures and training strategies. This is the motivation behind maintaining this repository.

Before we proceed, let's define data quality for ML. In contrast to traditional data cleaning, training data quality for ML refers to the impact of individual data samples, or groups of samples, on the behavior of ML models for a given task. Note that the model behavior we are concerned with goes beyond performance metrics such as accuracy, recall, AUC, and MSE; we also consider broader properties such as model fairness and robustness.

Consider the following pipeline (see framework.png): the DQMS acts as middleware between the data, the ML model, and the user, and must interact with each of them.

A DQMS for ML typically consists of three components: Data Sculptor [2], Data Attributer, and Data Profiler. Achieving a well-performing ML model often requires multiple rounds of training, during which the DQMS iteratively adjusts the training data based on the results of each round. The workflow of the DQMS in one round of training is as follows:

- (a) Data Sculptor first acquires the training dataset from a data source and trains the ML model with it.
- (b) After one round of training (several epochs), Data Attributer takes feedback from the model together with the user's task requirements and computes the data quality assessment.
- (c) Data Profiler then provides a user-friendly summary of the training data.
- (d) Meanwhile, Data Sculptor uses the data quality assessment as feedback to acquire higher-quality training data, thus initiating a new iteration.
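
The round-based workflow (a) to (d) can be sketched in a few lines of Python. This is only an illustrative toy, not an implementation from any listed paper: the class and method names (`acquire`, `score`, `summarize`, `run_dqms`), the one-parameter "model", and the loss-based quality score are all assumptions made for the example, and the user's task requirements are not modeled.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

Sample = Tuple[float, float]  # (feature x, label y)


def train(data: List[Sample]) -> float:
    """Toy stand-in for model training: least-squares slope through the origin."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)


class DataSculptor:
    """Acquires training data and refines it using quality feedback (hypothetical API)."""

    def __init__(self, source: List[Sample], drop_per_round: int = 1) -> None:
        self.pool = list(source)
        self.drop_per_round = drop_per_round

    def acquire(self, scores: Optional[List[float]] = None) -> List[Sample]:
        if scores is not None:
            # Step (d): keep the highest-scoring samples and drop the worst ones.
            ranked = sorted(zip(scores, self.pool), key=lambda pair: pair[0], reverse=True)
            keep = max(1, len(ranked) - self.drop_per_round)
            self.pool = [sample for _, sample in ranked[:keep]]
        return list(self.pool)


class DataAttributer:
    """Scores each sample from model feedback (assumed here: negative squared residual)."""

    def score(self, data: List[Sample], slope: float) -> List[float]:
        return [-((y - slope * x) ** 2) for x, y in data]


class DataProfiler:
    """Produces a small, human-readable summary of the current training data."""

    def summarize(self, data: List[Sample], scores: List[float], slope: float) -> Dict[str, object]:
        worst = data[scores.index(min(scores))]
        return {"size": len(data), "slope": slope, "mean_score": mean(scores), "worst_sample": worst}


def run_dqms(source: List[Sample], rounds: int = 3) -> List[Dict[str, object]]:
    sculptor, attributer, profiler = DataSculptor(source), DataAttributer(), DataProfiler()
    scores: Optional[List[float]] = None
    history = []
    for _ in range(rounds):
        data = sculptor.acquire(scores)         # (a) in the first round, (d) afterwards
        slope = train(data)                     # (a) train the model on the acquired data
        scores = attributer.score(data, slope)  # (b) compute the data quality assessment
        history.append(profiler.summarize(data, scores, slope))  # (c) profile the data
    return history


if __name__ == "__main__":
    # Three clean samples with y = 2x and one sample with a corrupted label.
    source = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, -8.0)]
    for report in run_dqms(source, rounds=2):
        print(report)
```

In this toy run the corrupted sample receives the lowest score in the first round, so the sculptor drops it and the second round trains on the three clean samples.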

We collect recent influential papers on DQMS for ML and annotate each with the DQMS components it involves, where DS = Data Sculptor, DA = Data Attributer, and DP = Data Profiler.

Paper List

2024

Venue	Paper	Links	Tags	TLDR
ICML'24	DsDm: Dataset Selection with Datamodels		`DS`
ICML'24	BWS: Best Window Selection Based on Sample Scores for Data Pruning across Broad Ranges		`DS`
VLDB'24	MetaStore: Analyzing Deep Learning Meta-Data at Scale	paper	`DA` `DP`
VLDB'24	Optimizing Data Acquisition to Enhance Machine Learning Performance	paper code	`DS`
VLDB'24	MisDetect: Iterative Mislabel Detection using Early Loss	paper code	`DA`
SIGMOD'24	Data Acquisition for Improving Model Confidence
SIGMOD'24	Controllable Tabular Data Synthesis Using Diffusion Models		`DS`
SIGMOD'24	Fast Shapley Value Computation in Data Assemblage Tasks as Cooperative Simple Games	paper code	`DA`	It assigns a Shapley score for data owners and their corresponding datasets in data market.
WWW'24	Exploring Neural Scaling Law and Data Pruning Methods For Node Classification on Large-scale Graphs	paper code	`DS`	This work selects training nodes that are similar to test nodes by minimizing their bottleneck distance. To avoid bias caused by trivial selection, it uses a greedy algorithm to ensure the representativeness of the selected nodes.
AAAI'24	Quality-Diversity Generative Sampling for Learning with Synthetic Data	paper code	`DS`
AAAI'24	Approximating the Shapley Value without Marginal Contributions	paper	`DA`
WSDM'24	FairIF: Boosting Fairness in Deep Learning via Influence Functions with Validation Set Sensitive Attributes	paper	`DA`
WSDM'24	Efficient, Direct, and Restricted Black-Box Graph Evasion Attacks to Any-Layer Graph Neural Networks via Influence Function	paper code	`DA`
ICLR'24	"What Data Benefits My Classifier?" Enhancing Model Performance and Interpretability through Influence-Based Data Selection	paper code	`DS` `DA`	It extends influence function considering utility, fairness and robustness. It trains a decision tree to further estimate and interpret the influence score.
ICLR'24	Canonpipe: Data Debugging with Shapley Importance over Machine Learning Pipelines	paper code	`DA`	It explores data valuation on raw data before preprocessing. It uses data provenance in ML pipelines and proposes data Shapley under a KNN approximation.
ICLR'24	Time Travel in LLMs: Tracing Data Contamination in Large Language Models	paper code	`DA`	Data contamination means the presence of test data from downstream tasks in the pre-training data of LLMs. This work explores both instance-level and partition-level methods to identify potential contamination.
ICLR'24	GIO: Gradient Information Optimization for Training Dataset Selection	paper code	`DA`	GIO selects a small subset of data from large source data by minimizing the KL divergence between the target distribution and subset.
ICLR'24	Intriguing Properties of Data Attribution on Diffusion Models	paper code	`DA`	This paper proposes D-TRAK to attribute images generated by diffusion models back to the training data.
ICLR'24	D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning	paper code	`DS`	A data pruning method that takes diversity into consideration. It is implemented by forward and reverse message passing in the KNN graph.
ICLR'24	Effective pruning of web-scale datasets based on complexity of concept clusters	paper code	`DS`
ICLR'24	Towards a statistical theory of data selection under weak supervision	paper	`DS`
ICLR'24	Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs	paper code	`DS`
ICLR'24	DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models	paper code	`DA`	DataInf approximates the influence function by swapping the order of the matrix inversion and the average calculation.
ICLR'24	What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning	paper	`DS`
ICLR'24	Real-Fake: Effective Training Data Synthesis Through Distribution Matching	paper code	`DS`
ICLR'24	InfoBatch: Lossless Training Speed Up by Unbiased Dynamic Data Pruning	paper code	`DS`	InfoBatch uses training loss to prune well-learned samples in each epoch and estimate gradient distribution for unbiased learning.
arXiv'24	A Decade's Battle on Dataset Bias: Are We There Yet?	paper code	`DS`
arXiv'24	Chameleon: Foundation Models for Fairness-aware Multi-modal Data Augmentation to Enhance Coverage of Minorities	paper code	`DS`	It uses generative AI for augmentation, ensuring that the generated data covers the original data distribution while keeping the generated set as small as possible.
arXiv'24	On the Cause of Unfairness: A Training Sample Perspective	paper	`DA`	The fairness influence can be computed by replacing a training sample with its concept counterfactual sample.

2023

Venue	Paper	Links	Tags	TLDR
arXiv'23	The Journey, Not the Destination: How Data Guides Diffusion Models	paper code	`DA`	-
NIPS'23	Data Selection for Language Models via Importance Resampling	paper code	`DS` `DA`	It selects data matching a target distribution from raw data via importance resampling, achieving lower KL divergence to the target than random selection.
NIPS'23	Model Shapley: Equitable Model Valuation with Black-box Access	paper code	`DA`	-
NIPS'23	Threshold KNN-Shapley: A Linear-Time and Privacy-Friendly Approach to Data Valuation	paper	`DA`	Extends KNN-Shapley while considering data privacy.
NIPS'23	GEX: A flexible method for approximating influence via Geometric Ensemble	paper code	`DA`	-
NIPS'23	Efficient Data Subset Selection to Generalize Training Across Models: Transductive and Inductive Networks	paper code	`DS`	-
NIPS'23	Data Pruning via Moving-one-Sample-out	paper	`DS`	This work proposes a MoSo score (similar to leave-one-out, LOO) and approximates it using gradients over all training epochs.
NIPS'23	Towards Free Data Selection with General-Purpose Models	paper code	`DS`	-
NIPS'23	Towards Accelerated Model Training via Bayesian Data Selection	paper	`DS`	-
NIPS'23	Robust Data Valuation with Weighted Banzhaf Values	paper	`DA`	-
NIPS'23	UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models	paper	`DS`	-
NIPS'23	Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources	paper code	`DS`	-
NIPS'23	Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy	paper code	`DS`	-
NIPS'23	Spuriosity Rankings: Sorting Data to Measure and Mitigate Biases	paper	`DS`	-
NIPS'23	Core-sets for Fair and Diverse Data Summarization	paper code	`DS` `DP`	-
NIPS'23	Retaining Beneficial Information from Detrimental Data for Neural Network Repair	paper	`DS`	-
NIPS'23	Expanding Small-Scale Datasets with Guided Imagination	paper code	`DS`	-
NIPS'23	Error Discovery By Clustering Influence Embeddings	paper code	`DA`	-
NIPS'23	HiBug: On Human-Interpretable Model Debug	paper code	`DP` `DS`	-
NIPS'23	Skill-it! A data-driven skills framework for understanding and training language models	paper code	`DP` `DS`	-
ICML'23	Discover and Cure: Concept-aware Mitigation of Spurious Correlation	paper code	`DS` `DA`	Discovers spurious correlations at the concept level and performs concept-based data augmentation to mitigate bias.
ICML'23	TRAK: Attributing Model Behavior at Scale	paper code	`DA`	TRAK first defines a Newton approximation to estimate LOO for logistic regression and then extends it to NNs (including CLIP, mT5) by viewing them as linear models acting on the input gradient.
ICML'23	RGE: A Repulsive Graph Rectification for Node Classification via Influence	paper code	`DA`	RGE identifies a group of negative edges that are most harmful for GNNs. It iteratively selects negative edges by their individual influence and prefers distant edges first.
ICML'23	Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value	paper code	`DA`	Data-OOB values a datum by its average out-of-bag (OOB) score, i.e., the score it receives from learners trained on bootstrap datasets in which it was not selected.
ICML'23	Towards Sustainable Learning: Coresets for Data-efficient Deep Learning	paper code	`DS`	-
ICML'23 Workshop	Training on Thin Air: Improve Image Classification with Generated Data	paper	`DS`	-
ICML'23 Workshop	Dataset Interfaces: Diagnosing Model Failures Using Controllable Counterfactual Generation	paper code	`DA` `DS`	-
VLDB'23	Equitable Data Valuation Meets the Right to Be Forgotten in Model Markets	paper code	`DA`	-
VLDB'23	Computing Rule-Based Explanations by Leveraging Counterfactuals	paper code	`DP`	-
VLDB'23	Data Collection and Quality Challenges for Deep Learning	paper	`DS` `DA`	-
SIGMOD'23	GoodCore: Coreset Selection over Incomplete Data for Data-effective and Data-efficient Machine Learning	paper	`DS`	GoodCore selects a coreset that achieves expected low gradient approximation error among all possible worlds of missing data.
SIGMOD'23	XInsight: eXplainable Data Analysis Through The Lens of Causality	paper	`DP`	-
SIGMOD'23	HybridPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation	paper code	`DS` `DP`	-
ICLR'23	Data Valuation Without Training of a Model	paper code	`DA`	It proposes a score that measures the gap in data complexity when a certain data instance is removed from the full dataset.
ICLR'23	Distilling Model Failures as Directions in Latent Space	paper code	`DS` `DP`	-
ICLR'23	LAVA: Data Valuation without Pre-Specified Learning Algorithms	paper code	`DA`	LAVA uses a Wasserstein distance to estimate the upper bound of test performance. It values a training sample by its sensitivity to the distance.
ICLR'23	Concept-level Debugging of Part-Prototype Networks	paper code	`DP`	-
ICLR'23	Dataset Pruning: Reducing Training Data by Examining Generalization Influence	paper	`DS`	-
ICLR'23	Moderate Coreset: A Universal Method of Data Selection for Real-world Data-efficient Deep Learning	paper code	`DS`	-
ICLR'23	Learning to Estimate Shapley Values with Vision Transformers	paper code	`DA`	-
ICLR'23	Characterizing the Influence of Graph Elements	paper code	`DA`	Introduces influence functions to graphs, considering node-removal and edge-removal influence for the linear SGC model.
ICDE'23	Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise	paper code	`DP`	-
ICDE'23	Detection of Groups with Biased Representation in Ranking	paper	`DA`	-
AAAI'23	Fundamentals of Task-Agnostic Data Valuation	paper	`DA`	-
AAAI'23	Interpreting Unfairness in Graph Neural Networks via Training Node Attribution	paper code	`DA`	This work proposes a Probabilistic Distribution Disparity to define node-contributed model bias and uses gradient approximation to estimate node-level bias.
WWW'23	GIF: A General Graph Unlearning Strategy via Influence Function	paper code	`DA`	GIF extends influence function to graph data by considering both the directly affected node(s) and the influenced neighborhoods.
AISTATS'23	Data Banzhaf: A Robust Data Valuation Framework for Machine Learning	paper	`DA`	-
arXiv'23	Data-Juicer: A One-Stop Data Processing System for Large Language Models	paper code	`DS` `DP`	-
arXiv'23	Studying Large Language Model Generalization with Influence Functions	paper	`DA`	-
TMLR'23	Synthetic Data from Diffusion Models Improves ImageNet Classification	paper	`DS`	-
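
As an illustration of the out-of-bag idea in the Data-OOB entry above, here is a minimal sketch under simplifying assumptions: the weak learner is a 1-nearest-neighbor rule on a one-dimensional feature, the score is classification correctness, and the function names and toy data are hypothetical. It is not the authors' implementation.

```python
import random
from typing import List, Optional, Tuple

Sample = Tuple[float, int]  # (1-D feature, class label)


def predict_1nn(train: List[Sample], x: float) -> int:
    """Assumed weak learner: return the label of the nearest training point."""
    return min(train, key=lambda s: abs(s[0] - x))[1]


def data_oob(data: List[Sample], num_bootstraps: int = 500, seed: int = 0) -> List[Optional[float]]:
    """Average correctness of each datum over the bootstraps in which it is out-of-bag."""
    rng = random.Random(seed)
    n = len(data)
    hits = [0] * n
    oob_counts = [0] * n
    for _ in range(num_bootstraps):
        chosen = [rng.randrange(n) for _ in range(n)]  # sample n indices with replacement
        in_bag = set(chosen)
        bag = [data[i] for i in chosen]
        for i in range(n):
            if i in in_bag:
                continue  # only out-of-bag data are scored by this learner
            x, y = data[i]
            oob_counts[i] += 1
            hits[i] += int(predict_1nn(bag, x) == y)
    # A datum that was never out-of-bag has no estimate.
    return [h / c if c else None for h, c in zip(hits, oob_counts)]


if __name__ == "__main__":
    clean = [(0.0, 0), (0.1, 0), (0.2, 0), (0.3, 0), (1.0, 1), (1.1, 1), (1.2, 1), (1.3, 1)]
    mislabeled = [(0.45, 1)]  # lies next to class 0 but carries label 1
    for sample, value in zip(clean + mislabeled, data_oob(clean + mislabeled)):
        print(sample, value)
```

In this toy setup the mislabeled point should receive a value near zero, because almost every learner trained without it predicts the label of its class-0 neighbors.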

2022

Venue	Paper	Links	Tags	TLDR
NIPS'22	CS-SHAPLEY: Class-wise Shapley Values for Data Valuation in Classification	paper code	`DA`
NIPS'22	Beyond neural scaling laws: beating power law scaling via data pruning	paper	`DS`
NIPS'22	Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP	paper code	`DS`
NIPS'22	Quantifying memorization across neural language models	paper	`DA`
ICML'22	Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments	paper	`DA`	It proposes the average marginal effect (AME) score $\mathbb{E}_S[U(S\cup \{z\})-U(S)]$, where $S$ is a random subset of the training data. The AME score can be approximated by a LASSO model.
ICML'22	Meaningfully Debugging Model Mistakes using Conceptual Counterfactual Explanations	paper code	`DS` `DP`	It learns concept activation vectors (CAVs) and moves misclassified training samples toward the direction of the CAV.
ICML'22	Datamodels: Predicting Predictions from Training Data	paper code	`DA`	Datamodels learns a linear model to predict the model output on one test data. It takes as input the one-hot mask of training samples.
ICML'22	Prioritized Training on Points that are learnable, Worth Learning, and Not Yet Learnt	paper code	`DS`
ICML'22	Achieving Fairness at No Utility Cost via Data Reweighing with Influence	paper code	`DA`	It employs demographic parity and equal opportunity to compute the influence function (IF) and performs soft reweighing on training samples. A proof of no utility degradation is provided.
ICML'22	DAVINZ: Data Valuation using Deep Neural Networks at Initialization	paper	`DA`	It uses an NTK-based bound to approximate validation performance without training.
ICML'22	Understanding Instance-Level Impact of Fairness Constraint	paper code	`DA`	The influence function (IF) is decomposed as IF of loss + IF of fairness constraint. It considers several constraints, including demographic parity, equal opportunity, covariance, and information, and uses NTK to estimate the IF.
ICLR'22	Domino: Discovering systematic errors with cross-modal embeddings	paper code	`DA` `DP`
ICLR'22	Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning	paper	`DA`
VLDB'22	Toward Interpretable and Actionable Data Analysis with Explanations and Causality	paper	`DP`
SIGMOD'22	Complaint-Driven Training Data Debugging at Interactive Speeds	paper	`DA`
SIGMOD'22	Interpretable Data-Based Explanations for Fairness Debugging	paper video	`DA` `DP`
ACL'22	Deduplicating training data makes language models better	paper code	`DS`
AAAI'22	Scaling Up Influence Functions	paper code	`DA`
AISTATS'22	Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning	paper code	`DA`
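
The AME score $\mathbb{E}_S[U(S\cup \{z\})-U(S)]$ from the ICML'22 entry above can be estimated by plain Monte Carlo sampling, as in the sketch below. Several things here are assumptions for illustration: every other point is included in $S$ with a fixed probability `p`, the utility is a hypothetical additive toy function rather than a trained model's performance, and the LASSO-based estimator mentioned in the TLDR is not shown.

```python
import random
from typing import Callable, FrozenSet, Hashable, Iterable


def estimate_ame(
    z: Hashable,
    pool: Iterable[Hashable],
    utility: Callable[[FrozenSet[Hashable]], float],
    p: float = 0.5,
    num_samples: int = 200,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of E_S[U(S ∪ {z}) - U(S)] over random subsets S without z."""
    rng = random.Random(seed)
    others = [i for i in pool if i != z]
    total = 0.0
    for _ in range(num_samples):
        # Assumed sampling scheme: include each remaining point independently with probability p.
        subset = frozenset(i for i in others if rng.random() < p)
        total += utility(subset | {z}) - utility(subset)
    return total / num_samples


# Hypothetical additive utility: each point contributes a fixed amount,
# so the marginal effect of a point equals its weight for every subset.
WEIGHTS = {"a": 0.5, "b": -0.25, "c": 0.125}


def toy_utility(subset: FrozenSet[Hashable]) -> float:
    return sum(WEIGHTS[i] for i in subset)


if __name__ == "__main__":
    for point in WEIGHTS:
        print(point, estimate_ame(point, WEIGHTS, toy_utility))
```

With an additive utility the estimate recovers each point's weight, which makes the toy case a convenient sanity check; with a real utility (for example, validation accuracy after retraining on the subset) each sample requires a training run, which is what motivates cheaper estimators.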

2021 and before

Venue	Paper	Links	Tags	TLDR
NIPS'21	Explaining Latent Representations with a Corpus of Examples	paper code	`DA`
NIPS'21	Validation free and replication robust volume-based data valuation	paper code	`DA`
NIPS'21	Deep Learning on a Data Diet: Finding Important Examples Early in Training	paper	`DS`
NIPS'21	Interactive Label Cleaning with Example-based Explanations	paper code	`DP`
ICML'21	GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training	paper code	`DS`
CVPR'21	Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data Importance Quantification?	paper code	`DA`
CHI'21	Data-Centric Explanations: Explaining Training Data of Machine Learning Systems to Promote Transparency	paper	`DP`
NIPS'20	Multi-Stage Influence Function	paper	`DA`
NIPS'20	Estimating Training Data Influence by Tracing Gradient Descent	paper code	`DA`	TracIn measures the influence of a training sample by tracing how the test loss changes across the training steps (in practice, saved checkpoints) in which that sample is used.
ICML'20	On second-order group influence functions for black-box predictions	paper	`DA`	The influence score of a group = the sum of individual influence per sample + cross-dependencies among samples in the group.
ICML'20	Coresets for data-efficient training of machine learning models	paper code	`DS`
ICML'20	Optimizing Data Usage via Differentiable Rewards	paper	`DS`
ICML'20	Data Valuation using Reinforcement Learning	paper code	`DA`	DVRL employs a learnable NN as a data value estimator to select data samples during training and uses an RL signal to update it.
ICLR'20	Selection via proxy: Efficient data selection for deep learning	paper code	`DS`
SIGMOD'20	Complaint Driven Training Data Debugging for Query 2.0	paper video	`DA`
PMLR'20	Identifying Statistical Bias in Dataset Replication	paper code
ICML'19	Data Shapley: Equitable Valuation of Data for Machine Learning	paper code	`DA`
VLDB'19	Efficient task-specific data valuation for nearest neighbor algorithms	paper	`DA`
ICML'17	Understanding Black-box Predictions via Influence Functions	paper code	`DA`
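
For the TracIn entry above, the checkpoint-based approximation (called TracInCP in the paper) can be written as follows. The symbols are notation chosen here for illustration: $w_{t_i}$ is the $i$-th saved checkpoint, $\eta_i$ the learning rate used around it, $\ell$ the loss, $z$ a training sample, and $z'$ a test sample.

$$
\mathrm{TracInCP}(z, z') = \sum_{i=1}^{k} \eta_i \, \nabla_w \ell(w_{t_i}, z) \cdot \nabla_w \ell(w_{t_i}, z')
$$

A positive value suggests that training on $z$ tended to reduce the loss on $z'$; a negative value suggests the opposite.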

Surveys

Venue	Paper	Links	Tags
arXiv'24	A Survey on Data Selection for Language Models	paper	`DS`
Nature Machine Intelligence'22	Advances, challenges and opportunities in creating data for trustworthy AI	paper	`DS` `DA`
arXiv'23	Data-centric Artificial Intelligence: A Survey	paper	`DS` `DA` `DP`
arXiv'23	Data Management For Large Language Models: A Survey	paper code	`DS` `DA`
arXiv'23	Training Data Influence Analysis and Estimation: A Survey	paper code	`DA`
TKDE'22	Data Management for Machine Learning: A Survey	paper	`DS` `DA`
IJCAI'21	Data Valuation in Machine Learning: "Ingredients", Strategies, and Open Challenges	paper	`DA`
TACL'21	Explanation-Based Human Debugging of NLP Models: A Survey	paper	`DP` `DA`

Benchmarks

Venue	Paper	Links	Tags
NIPS'23	DataPerf: Benchmarks for Data-Centric AI Development	paper code website	`DS` `DA` `DP`
NIPS'23	OpenDataVal: a Unified Benchmark for Data Valuation	paper code	`DA`
NIPS'23	Improving multimodal datasets with image captioning	paper code	`DS`
NIPS'23	Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias	paper code	`DS`
DEEM'22	dcbench: A Benchmark for Data-Centric AI Systems	paper code	`DS`

Related Workshops

[ICML'23] DMLR Workshop: Data-centric Machine Learning Research video DMLR Website

Related Repos

More papers about Data Valuation can be found in awesome-data-valuation. `DA`
More papers about Data Pruning can be found in Awesome-Coreset-Selection. `DS`

Reference

[1] Gupta, Nitin, et al. "Data quality for machine learning tasks." Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 2021.

[2] Liang, Weixin, et al. "Advances, challenges and opportunities in creating data for trustworthy AI." Nature Machine Intelligence 4.8 (2022): 669-677.
