Resources, tools, papers, and projects for ensuring data reliability and effectiveness across traditional data, LLM pretraining/fine-tuning data, multimodal data, and more.
- Introduction
- Traditional Data
- Large Language Model Data
- Multimodal Data
- Tabular Data
- Time Series Data
- Graph Data
- Data-Centric AI
Data quality is a critical aspect of any data-driven application or research. This repository collects resources related to data quality across different data types, including traditional data, large language model data (both pretraining and fine-tuning), multimodal data, and more.
This section covers data quality for traditional structured and unstructured data.
- Data Cleaning: Problems and Current Approaches - A comprehensive overview of data cleaning approaches. (2000)
- A Survey on Data Quality: Classifying Poor Data - A survey on data quality issues and classification. (2016)
- Great Expectations - A Python framework for validating, documenting, and profiling data. (2018)
- Deequ - A library built on top of Apache Spark for defining "unit tests for data". (2018)
- OpenRefine - A powerful tool for working with messy data, cleaning it, and transforming it. (2010)
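The validation tools above share a common pattern: declare expectations about columns up front and fail loudly when incoming data violates them. Below is a minimal sketch of that pattern in plain pandas (the column names and thresholds are invented for illustration); Great Expectations and Deequ package the same kind of checks as declarative, versioned test suites.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3, None],
    "age": [25, 41, 19, 230],   # 230 is the kind of out-of-range value a check should catch
})

checks = {
    "user_id is never null": df["user_id"].notna().all(),
    "age within [0, 120]": df["age"].between(0, 120).all(),
    "user_id values are unique": df["user_id"].dropna().is_unique,
}

for name, passed in checks.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

In Great Expectations or Deequ, each of these checks becomes a named expectation whose results can be tracked across ingestion runs.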
This subsection covers methods and tools for assessing data readiness for AI applications.
- Data Readiness for AI: A 360-Degree Survey - A comprehensive survey examining metrics for evaluating data readiness for AI training across structured and unstructured datasets. (2024)
- Assessing Student Adoption of Generative Artificial Intelligence across Engineering Education - An empirical study on data quality considerations in educational AI applications. (2025)
- Data Readiness Assessment Framework - A framework for evaluating data quality and readiness for AI applications. (2024)
- AI Data Quality Metrics - Standardized metrics for assessing data quality in AI contexts. (2024)
This section covers data quality for large language model pretraining data.
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling - A large-scale curated dataset for language model pretraining. (2021)
- Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets - An audit of the quality of web-crawled multilingual datasets. (2021)
- Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus - Documentation of the C4 dataset. (2021)
- Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models - REWire method for recycling and improving low-quality web documents through guided rewriting, addressing the "data wall" problem in LLM pretraining. (2025)
- Assessing the Role of Data Quality in Training Bilingual Language Models - A study revealing that unequal data quality is a major driver of performance degradation in bilingual settings, with a practical data filtering strategy for multilingual models. (2025)
- Dolma - A framework for curating and documenting large language model pretraining data. (2023)
- Text Data Cleaner - A tool for cleaning text data for language model pretraining. (2022)
- CCNet - Tools for downloading and filtering CommonCrawl data. (2020)
- Dingo - A comprehensive data quality evaluation tool supporting multiple data sources, types, and modalities. (2024)
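Much of the tooling above (CCNet, Dolma, the C4 documentation work) boils down to document-level heuristic filters plus deduplication. The sketch below shows the general shape of such filters; the specific rules and thresholds are illustrative assumptions, not the published pipelines.

```python
def keep_document(text: str) -> bool:
    """Toy quality filter for web-extracted text; all thresholds are illustrative."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    words = text.split()
    if len(words) < 50:
        return False                                   # too short to be useful
    if "lorem ipsum" in text.lower():
        return False                                   # placeholder boilerplate
    if lines and sum(ln.endswith((".", "!", "?", '"')) for ln in lines) / len(lines) < 0.5:
        return False                                   # mostly non-sentence lines (menus, nav bars)
    if len(set(w.lower() for w in words)) / len(words) < 0.2:
        return False                                   # highly repetitive content
    return True

docs = ["Home | About | Contact | Buy now!", "A long, well-formed article body ..."]
kept = [d for d in docs if keep_document(d)]           # both toy docs fail the length check
```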
This section covers data quality for large language model fine-tuning data.
- Training language models to follow instructions with human feedback - The InstructGPT paper from OpenAI on aligning language models with instructions via RLHF. (2022)
- Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP - A study of how pretraining data sources affect CLIP's robustness, finding that dataset quality and composition matter more than raw scale. (2022)
- Data Quality for Machine Learning Tasks - A survey on data quality for machine learning. (2021)
- LMSYS Chatbot Arena - A crowdsourced platform that ranks LLMs via pairwise human preference votes. (2023)
- OpenAssistant - A project to create high-quality instruction-following data. (2022)
- Argilla - An open-source data curation platform for LLMs. (2021)
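For fine-tuning sets like those curated with OpenAssistant or Argilla, two cheap but high-value hygiene steps are normalization-based deduplication and length sanity checks. The sketch below is a generic illustration; the field names ("instruction", "response") and thresholds are assumptions, not any project's actual schema.

```python
import hashlib

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def clean(records: list[dict]) -> list[dict]:
    seen, kept = set(), []
    for rec in records:
        if not (10 <= len(rec["response"]) <= 8000):
            continue                                   # drop trivially short or runaway responses
        key = hashlib.sha1(normalize(rec["instruction"]).encode()).hexdigest()
        if key in seen:
            continue                                   # drop duplicates after normalization
        seen.add(key)
        kept.append(rec)
    return kept

data = [
    {"instruction": "Summarize the article.", "response": "The article argues that ..."},
    {"instruction": "summarize the  article.", "response": "Duplicate prompt, different case."},
]
print(len(clean(data)))  # -> 1
```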
This section covers comprehensive data management approaches for LLMs, including data processing, storage, and serving.
- A Survey of LLM × DATA - A comprehensive survey on data-centric methods for large language models covering data processing, storage, and serving. (2025)
- Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval - A method for identifying and relabeling false negatives in training data to improve model performance. (2025)
- awesome-data-llm - Official repository of "LLM × DATA" survey paper with curated resources. (2025)
- CommonCrawl - A massive web crawl dataset covering diverse languages and domains. (2008)
- RedPajama - An open-source reproduction of the LLaMA training dataset. (2023)
- FineWeb - A large-scale, high-quality web dataset for language model training. (2024)
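Datasets at this scale (CommonCrawl derivatives, RedPajama, FineWeb) are best audited by streaming a small slice rather than downloading them. The sketch below uses the Hugging Face `datasets` library; the dataset id and field name reflect FineWeb's Hub release and may need adjusting (e.g. passing a specific dump or sample config via `name=`).

```python
from datasets import load_dataset

# Stream a handful of documents for manual quality inspection without
# materializing the corpus locally.
ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example["text"][:200].replace("\n", " "))
    if i == 4:
        break
```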
This section covers cognition engineering and test-time scaling methods, where data quality is improved through long-form reasoning traces and deliberate thinking processes.
- Generative AI Act II: Test Time Scaling Drives Cognition Engineering - A comprehensive survey on cognition engineering through test-time scaling and reinforcement learning. (2025)
- Unlocking Deep Thinking in Language Models: Cognition Engineering through Inference Time Scaling and Reinforcement Learning - A framework for developing AI thinking capabilities through test-time scaling paradigms. (2025)
- O1 Journey--Part 1 - A dataset for math reasoning with long chain-of-thought. (2024)
- Marco-o1 - Reasoning dataset synthesized from Qwen2-7B-Instruct. (2024)
- STILL-2 - Long-form thought data for math, code, science, and puzzle domains. (2024)
- OpenThoughts-114k - Large-scale dataset of reasoning trajectories distilled from DeepSeek R1. (2025)
- High-impact Sample Selection - Methods for prioritizing training samples based on learning impact measurement. (2025)
- Noise Reduction Filtering - Techniques for removing noisy web-extracted data to improve generalization. (2025)
- Length-Adaptive Training - Approaches for handling variable-length sequences in training data. (2024)
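One simple reading of the sample-selection entries above is loss-based prioritization: score each example with the current model and keep the hardest fraction for the next pass. The sketch below is a generic illustration of that idea (the model, data format, and 30% keep rate are assumptions), not the method of any specific work listed here.

```python
import torch
import torch.nn.functional as F

def select_high_impact(model: torch.nn.Module, examples: list, keep_frac: float = 0.3) -> list:
    """examples: list of (input_tensor, target_class_tensor) pairs for a classifier."""
    model.eval()
    scored = []
    with torch.no_grad():
        for idx, (x, y) in enumerate(examples):
            loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
            scored.append((loss.item(), idx))
    scored.sort(reverse=True)                              # hardest (highest-loss) examples first
    keep = {idx for _, idx in scored[: int(len(scored) * keep_frac)]}
    return [ex for idx, ex in enumerate(examples) if idx in keep]
```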
This section covers data quality for multimodal data, including image-text pairs, video, and audio.
- LAION-5B: An open large-scale dataset for training next generation image-text models - A large-scale dataset of image-text pairs. (2022)
- DataComp: In search of the next generation of multimodal datasets - A benchmark for evaluating data curation strategies. (2023)
- CLIP-Benchmark - A benchmark for evaluating CLIP models. (2021)
- img2dataset - A tool for efficiently downloading and processing image-text datasets. (2021)
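A recurring recipe in LAION-style and DataComp-style curation is scoring each image-text pair with CLIP and dropping low-similarity pairs. Below is a minimal sketch using the `transformers` CLIP implementation; the checkpoint and the 0.25 threshold are illustrative choices, not the datasets' exact settings.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# Keep a pair only if the caption plausibly describes the image, e.g.:
# keep = clip_score(Image.open("photo.jpg"), "a dog on a beach") > 0.25
```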
This section covers data quality for tabular data.
- Automating Data Quality Validation for Dynamic Data Ingestion - A framework for automating data quality validation. (2019)
- A Survey on Data Quality for Machine Learning in Practice - A survey on data quality issues in machine learning. (2021)
- Pandas Profiling - A tool for generating profile reports from pandas DataFrames. (2016)
- DataProfiler - A Python library for data profiling and data quality validation. (2021)
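Profiling is usually the first data-quality step for tabular data. The sketch below uses pandas-profiling (published as `ydata-profiling` in newer releases, where the import name differs) to generate a per-column report; the file names are placeholders.

```python
import pandas as pd
from pandas_profiling import ProfileReport  # `from ydata_profiling import ProfileReport` in newer releases

df = pd.read_csv("data.csv")                                 # any tabular file
profile = ProfileReport(df, title="Data quality report", minimal=True)
profile.to_file("report.html")                               # missing values, distributions, correlations, duplicates
```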
This section covers data quality for time series data.
- Cleaning Time Series Data: Current Status, Challenges, and Opportunities - A survey on cleaning time series data. (2022)
- Time Series Data Augmentation for Deep Learning: A Survey - A survey on time series data augmentation. (2020)
- Darts - A Python library for time series forecasting and anomaly detection. (2020)
- tslearn - A machine learning toolkit dedicated to time series data. (2017)
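The cleaning problems surveyed above (irregular sampling, gaps, spikes) have simple first-pass remedies; the sketch below applies them in plain pandas before reaching for Darts or tslearn. The column names, resampling frequency, and clipping quantiles are illustrative.

```python
import pandas as pd

ts = pd.read_csv("sensor.csv", parse_dates=["timestamp"], index_col="timestamp")["value"]

clean = (
    ts.resample("1min").mean()          # enforce a regular 1-minute grid
      .interpolate(limit=5)             # fill gaps of up to 5 consecutive points
)
lo, hi = clean.quantile([0.001, 0.999])
clean = clean.clip(lower=lo, upper=hi)  # winsorize extreme spikes
```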
This section covers data quality for graph data.
- A Survey on Graph Cleaning Methods for Noise and Errors in Graph Data - A survey on graph cleaning methods. (2022)
- Graph Data Quality: A Survey from the Database Perspective - A survey on graph data quality from a database perspective. (2022)
- DGL - A Python package for deep learning on graphs. (2018)
- NetworkX - A Python package for the creation, manipulation, and study of complex networks. (2008)
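Two common noise patterns in graph data, spurious self-loops and isolated nodes, can be removed directly with NetworkX; the toy graph below is invented for illustration.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 3)])   # (3, 3) is a self-loop
G.add_node(99)                               # isolated node

G.remove_edges_from(list(nx.selfloop_edges(G)))
G.remove_nodes_from(list(nx.isolates(G)))
print(G.number_of_nodes(), G.number_of_edges())  # -> 3 2
```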
This section focuses on data quality management for machine learning models, following the Data-Centric AI paradigm. It includes papers and resources related to data valuation, data selection, and benchmarks for evaluating data quality in ML pipelines.
- Data Quality Awareness: A Journey from Traditional Data Management to Data Science Systems - A comprehensive survey on data quality awareness across traditional data management and modern data science systems. (2024)
- A Survey on Data Selection for Language Models - A survey focusing on data selection techniques for language models. (2024)
- Advances, challenges and opportunities in creating data for trustworthy AI - A Nature Machine Intelligence paper discussing the challenges and opportunities in creating high-quality data for AI. (2022)
- Data-centric Artificial Intelligence: A Survey - A comprehensive survey on data-centric AI approaches. (2023)
- Data Management For Large Language Models: A Survey - A survey on data management techniques for large language models. (2023)
- Training Data Influence Analysis and Estimation: A Survey - A survey on methods for analyzing and estimating the influence of training data on model performance. (2022)
- Data Management for Machine Learning: A Survey - A TKDE survey on data management techniques for machine learning. (2022)
- Data Valuation in Machine Learning: "Ingredients", Strategies, and Open Challenges - An IJCAI paper on data valuation methods in machine learning. (2022)
- Explanation-Based Human Debugging of NLP Models: A Survey - A TACL survey on explanation-based debugging of NLP models. (2021)
- Data Shapley: Equitable Valuation of Data for Machine Learning - An ICML paper introducing the Data Shapley method for valuing training data (a Monte Carlo sketch appears after this list). (2019)
- Efficient task-specific data valuation for nearest neighbor algorithms - A VLDB paper on efficient data valuation for nearest neighbor algorithms. (2019)
- Towards Efficient Data Valuation Based on the Shapley Value - An AISTATS paper on efficient data valuation using Shapley values. (2019)
- Understanding Black-box Predictions via Influence Functions - An ICML paper introducing influence functions for understanding model predictions. (2017)
- Data Cleansing for Models Trained with SGD - A NeurIPS paper on data cleansing for SGD-trained models. (2019)
- Modyn: Data-Centric Machine Learning Pipeline Orchestration - A SIGMOD paper on pipeline orchestration for data-centric machine learning. (2023)
- Data Selection via Optimal Control for Language Models - An ICLR paper on optimal control methods for data selection in language models. (2024)
- ADAM Optimization with Adaptive Batch Selection - An ICLR paper on adaptive batch selection for ADAM optimization. (2024)
- Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws - An ICLR paper on dynamic sample selection using scaling laws. (2024)
- Selection via proxy: Efficient data selection for deep learning - An ICLR paper on efficient data selection using proxy models. (2020)
- DataPerf: Benchmarks for Data-Centric AI Development - A NeurIPS paper introducing benchmarks for data-centric AI development. (2023)
- OpenDataVal: a Unified Benchmark for Data Valuation - A NeurIPS paper on a unified benchmark for data valuation. (2023)
- Improving multimodal datasets with image captioning - A NeurIPS paper on improving multimodal datasets with image captioning. (2023)
- Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias - A NeurIPS paper on using LLMs as training data generators. (2023)
- dcbench: A Benchmark for Data-Centric AI Systems - A DEEM paper introducing a benchmark for data-centric AI systems. (2022)
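To make the data valuation line of work above concrete, here is a rough Monte Carlo sketch of the Data Shapley idea referenced earlier in this list: average each training point's marginal contribution to validation accuracy over random permutations. The dataset, model, and permutation count are toy choices, and the original paper adds truncation and convergence criteria that this sketch omits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=120, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=40, random_state=0)

def val_accuracy(idx):
    """Validation accuracy of a model trained on the training subset `idx`."""
    if len(set(y_tr[idx])) < 2:                      # cannot fit on zero or one class
        return max(np.mean(y_val == c) for c in (0, 1))
    model = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    return model.score(X_val, y_val)

n, n_perms = len(X_tr), 25
values = np.zeros(n)
rng = np.random.default_rng(0)
for _ in range(n_perms):
    perm = rng.permutation(n)
    prev = val_accuracy(np.array([], dtype=int))
    for k in range(1, n + 1):
        cur = val_accuracy(perm[:k])                 # marginal gain of adding point perm[k-1]
        values[perm[k - 1]] += (cur - prev) / n_perms
        prev = cur

print("lowest-value points (noise/mislabel candidates):", np.argsort(values)[:5])
```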