A comprehensive collection of data quality resources, tools, papers, and projects across various data types including traditional data, LLM pretraining/fine-tuning data, multimodal data, and more. Essential reference for researchers and practitioners in data-centric AI.

MigoXLab/awesome-data-quality

Awesome Data Quality

Resources, tools, papers, and projects for ensuring data reliability and effectiveness across traditional data, LLM pretraining/fine-tuning data, multimodal data, and more.

Introduction

Data quality is critical to any data-driven application or research project. This repository collects data quality resources across data types: traditional data, large language model data (both pretraining and fine-tuning), multimodal data, and more.

Traditional Data

This section covers data quality for traditional structured and unstructured data.

Papers

Tools & Projects

  • Great Expectations - A Python framework for validating, documenting, and profiling data. (2018)
  • Deequ - A library built on top of Apache Spark for defining "unit tests for data". (2018)
  • OpenRefine - A powerful tool for working with messy data, cleaning it, and transforming it. (2010)
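
To make the "unit tests for data" idea concrete, here is a minimal sketch in plain Python. It does not use the API of Great Expectations or Deequ; the helper names (`check_complete`, `check_unique`, `check_between`, `run_checks`) and the sample `orders` rows are hypothetical and exist only for illustration.

```python
from typing import Any, Callable

Row = dict[str, Any]


def check_complete(rows: list[Row], column: str) -> bool:
    """Pass when no row has a missing (None) value in `column`."""
    return all(row.get(column) is not None for row in rows)


def check_unique(rows: list[Row], column: str) -> bool:
    """Pass when every value in `column` appears exactly once."""
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))


def check_between(rows: list[Row], column: str, low: float, high: float) -> bool:
    """Pass when every value in `column` is present and within [low, high]."""
    return all(
        row.get(column) is not None and low <= row[column] <= high
        for row in rows
    )


def run_checks(
    rows: list[Row], checks: dict[str, Callable[[list[Row]], bool]]
) -> dict[str, bool]:
    """Run each named check against the rows and collect pass/fail results."""
    return {name: check(rows) for name, check in checks.items()}


# Hypothetical sample data with three deliberate defects.
orders = [
    {"order_id": 1, "amount": 19.99, "country": "DE"},
    {"order_id": 2, "amount": 5.00, "country": None},   # missing country
    {"order_id": 2, "amount": -3.50, "country": "FR"},  # duplicate id, negative amount
]

report = run_checks(orders, {
    "order_id is unique": lambda r: check_unique(r, "order_id"),
    "country is complete": lambda r: check_complete(r, "country"),
    "amount is complete": lambda r: check_complete(r, "amount"),
    "amount in [0, 10000]": lambda r: check_between(r, "amount", 0, 10_000),
})

for name, passed in report.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

Naming each check, as done here, keeps the report readable when a pipeline run fails; the tools listed above add profiling, documentation, and scale on top of this basic pattern.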

Data Readiness Assessment

This subsection covers methods and tools for assessing data readiness for AI applications.

Papers

Tools & Projects

Large Language Model Data

Pretraining Data

This section covers data quality for large language model pretraining data.

Papers

Tools & Projects

  • Dolma - A framework for curating and documenting large language model pretraining data. (2023)
  • Text Data Cleaner - A tool for cleaning text data for language model pretraining. (2022)
  • CCNet - Tools for downloading and filtering CommonCrawl data. (2020)
  • Dingo - A comprehensive data quality evaluation tool supporting multiple data sources, types, and modalities. (2024)
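
As a rough illustration of what filtering and deduplicating web text can involve, the sketch below applies two simple heuristics and exact deduplication on normalized text. It is not the pipeline used by CCNet, Dolma, or any other tool listed here; the thresholds (`min_words`, `max_symbol_ratio`) and function names are assumptions chosen for the example.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def passes_heuristics(
    text: str, min_words: int = 5, max_symbol_ratio: float = 0.3
) -> bool:
    """Reject very short documents and documents dominated by symbols."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def filter_and_dedup(docs: list[str]) -> list[str]:
    """Keep the first copy of each document that passes the heuristics."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


# Hypothetical sample documents.
docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "Click here!!!",                                  # too short
    "$$$ ### @@@ !!! %%% ^^^ &&& ***",                # mostly symbols
    "Data quality matters for language model pretraining.",
]

cleaned = filter_and_dedup(docs)
print(cleaned)
```

Production pipelines typically add language identification, near-duplicate detection, and model-based quality scoring, none of which this sketch attempts.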

Fine-tuning Data

This section covers data quality for large language model fine-tuning data.

Papers

Tools & Projects

  • LMSYS Chatbot Arena - A platform for evaluating LLM responses through crowdsourced pairwise human comparisons. (2023)
  • OpenAssistant - A project to create high-quality instruction-following data. (2022)
  • Argilla - An open-source data curation platform for LLMs. (2021)

LLM Data Management

This section covers comprehensive data management approaches for LLMs, including data processing, storage, and serving.

Papers

Tools & Projects

  • awesome-data-llm - Official repository of "LLM × DATA" survey paper with curated resources. (2025)
  • CommonCrawl - A massive web crawl dataset covering diverse languages and domains. (2008)
  • RedPajama - An open-source reproduction of the LLaMA training dataset. (2023)
  • FineWeb - A large-scale, high-quality web dataset for language model training. (2024)

Cognition Engineering & Test-Time Scaling

This section covers cognition engineering and test-time scaling methods, which aim to improve data quality through more explicit reasoning and thinking processes.

Surveys

Data Engineering 2.0

  • O1 Journey--Part 1 - A dataset for math reasoning with long chain-of-thought. (2024)
  • Marco-o1 - Reasoning dataset synthesized from Qwen2-7B-Instruct. (2024)
  • STILL-2 - Long-form thought data for math, code, science, and puzzle domains. (2024)
  • OpenThoughts-114k - Large-scale dataset of reasoning trajectories distilled from DeepSeek R1. (2025)

Training Data Quality

Multimodal Data

This section covers data quality for multimodal data, including image-text pairs, video, and audio.

Papers

Tools & Projects

  • CLIP-Benchmark - A benchmark for evaluating CLIP models. (2021)
  • img2dataset - A tool for efficiently downloading and processing image-text datasets. (2021)

Tabular Data

This section covers data quality for tabular data.

Papers

Tools & Projects

  • Pandas Profiling (now ydata-profiling) - A tool for generating profile reports from pandas DataFrames. (2016)
  • DataProfiler - A Python library for data profiling and data quality validation. (2021)
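
The sketch below shows the kind of per-column summary a profiling tool produces, using only the standard library. It is not the API of Pandas Profiling or DataProfiler; `profile_column`, `profile_table`, and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Any


def profile_column(values: list[Any]) -> dict[str, Any]:
    """Summarize one column: size, missing values, distinct values, value types."""
    present = [v for v in values if v is not None]
    type_counts = Counter(type(v).__name__ for v in present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "types": dict(type_counts),
    }


def profile_table(rows: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Profile every column that appears in at least one row."""
    columns = sorted({key for row in rows for key in row})
    return {
        col: profile_column([row.get(col) for row in rows]) for col in columns
    }


# Hypothetical sample table: "age" has a missing value and mixed types.
rows = [
    {"age": 34, "city": "Oslo"},
    {"age": None, "city": "Oslo"},
    {"age": "41", "city": "Lima"},
]

profile = profile_table(rows)
for column, stats in profile.items():
    print(column, stats)
```

A column reporting more than one type, as `age` does here, is often the first visible sign of an ingestion or parsing problem.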

Time Series Data

This section covers data quality for time series data.

Papers

Tools & Projects

  • Darts - A Python library for time series forecasting and anomaly detection. (2020)
  • tslearn - A machine learning toolkit dedicated to time series data. (2017)
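
Two common time series quality problems are missing intervals and repeated timestamps. The following sketch detects both with the standard library only; it does not use Darts or tslearn, and the function names and the hourly `readings` sample are assumptions for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(
    timestamps: list[datetime], expected_step: timedelta
) -> list[tuple[datetime, datetime]]:
    """Return (previous, next) pairs whose spacing exceeds the expected step."""
    ordered = sorted(set(timestamps))
    return [
        (prev, cur)
        for prev, cur in zip(ordered, ordered[1:])
        if cur - prev > expected_step
    ]


def find_duplicates(timestamps: list[datetime]) -> list[datetime]:
    """Return each timestamp that occurs more than once, in sorted order."""
    seen: set[datetime] = set()
    dupes: set[datetime] = set()
    for ts in timestamps:
        if ts in seen:
            dupes.add(ts)
        seen.add(ts)
    return sorted(dupes)


# Hypothetical hourly readings: 01:00 is repeated, 02:00 and 03:00 are missing.
start = datetime(2024, 1, 1, 0, 0)
hour = timedelta(hours=1)
readings = [start, start + hour, start + hour, start + 4 * hour, start + 5 * hour]

gaps = find_gaps(readings, expected_step=hour)
duplicates = find_duplicates(readings)
print("gaps:", gaps)
print("duplicates:", duplicates)
```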

Graph Data

This section covers data quality for graph data.

Papers

Tools & Projects

  • DGL - A Python package for deep learning on graphs. (2018)
  • NetworkX - A Python package for the creation, manipulation, and study of complex networks. (2008)
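
For graph data, basic quality checks often target the edge list itself. The sketch below flags self-loops, edges that reference unknown nodes, repeated edges, and isolated nodes. It treats the graph as undirected and uses no graph library; `audit_edges` and the sample graph are hypothetical.

```python
def audit_edges(
    nodes: set[str], edges: list[tuple[str, str]]
) -> dict[str, list]:
    """Report common structural defects in an undirected edge list."""
    self_loops = [e for e in edges if e[0] == e[1]]
    dangling = [e for e in edges if e[0] not in nodes or e[1] not in nodes]

    # An undirected edge (u, v) equals (v, u), so compare unordered pairs.
    seen: set[frozenset] = set()
    duplicates: list[tuple[str, str]] = []
    for u, v in edges:
        key = frozenset((u, v))
        if key in seen:
            duplicates.append((u, v))
        else:
            seen.add(key)

    # Nodes that never appear in any edge (a self-loop counts as an appearance).
    connected = {n for edge in edges for n in edge}
    isolated = sorted(nodes - connected)

    return {
        "self_loops": self_loops,
        "dangling": dangling,
        "duplicates": duplicates,
        "isolated": isolated,
    }


# Hypothetical sample graph with one defect of each kind.
nodes = {"a", "b", "c", "d"}
edges = [("a", "b"), ("b", "a"), ("c", "c"), ("a", "x")]

audit = audit_edges(nodes, edges)
for issue, items in audit.items():
    print(issue, items)
```

Whether a self-loop or a repeated edge is actually an error depends on the dataset, so a report like this is a starting point for review rather than a list of rows to delete.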

Data-Centric AI

This section focuses on data quality management for machine learning models, following the Data-Centric AI paradigm. It includes papers and resources related to data valuation, data selection, and benchmarks for evaluating data quality in ML pipelines.

Surveys

Data Valuation
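
One simple way to think about data valuation is leave-one-out: a training point's value is the change in validation accuracy when that point is removed. The sketch below illustrates this with a 1-nearest-neighbour classifier on one-dimensional features. It is a toy example under stated assumptions, not a method taken from any paper in this list; all names and data are hypothetical.

```python
Point = tuple[float, str]  # (feature, label)


def predict_1nn(train: list[Point], x: float) -> str:
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def accuracy(train: list[Point], valid: list[Point]) -> float:
    """Fraction of validation points that 1-NN labels correctly."""
    if not train:
        return 0.0
    correct = sum(predict_1nn(train, x) == y for x, y in valid)
    return correct / len(valid)


def leave_one_out_values(train: list[Point], valid: list[Point]) -> list[float]:
    """Value of point i = accuracy with all points - accuracy without point i."""
    full = accuracy(train, valid)
    return [
        full - accuracy(train[:i] + train[i + 1:], valid)
        for i in range(len(train))
    ]


# Hypothetical data: the last training point is deliberately mislabelled.
train = [(0.0, "neg"), (1.0, "neg"), (4.0, "pos"), (5.0, "pos"), (1.2, "pos")]
valid = [(0.5, "neg"), (1.5, "neg"), (4.5, "pos"), (3.5, "pos")]

values = leave_one_out_values(train, valid)
for point, value in zip(train, values):
    print(point, value)
```

A negative value means the model does better without the point, which is how the mislabelled example stands out here. Leave-one-out retrains once per training point, so it scales poorly to large datasets.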

Data Selection

Benchmarks
