GitHub - sumanthprabhu/DQC-Toolkit: Data quality checks to curate noisy labels in the data

DQC Toolkit is a Python library and framework designed to help improve Machine Learning models by identifying and mitigating label errors in training datasets. Currently, DQC Toolkit offers CrossValCurate for curating text classification datasets (binary / multi-class) using cross-validation-based selection.

Installation

DQC Toolkit can be installed with pip:

pip install dqc-toolkit

Quick Start

Assuming your text classification data is stored as a pandas dataframe named data, with each sample's text in the text column and its corresponding noisy label in the label column, CrossValCurate can be used as follows:

from dqc import CrossValCurate

cvc = CrossValCurate()
data_curated = cvc.fit_transform(data[['text', 'label']])

The result is stored in data_curated, a pandas dataframe similar to data, with the following columns:

>>> data_curated.columns
['text', 'label', 'label_correctness_score', 'is_label_correct', 'predicted_label', 'prediction_probability']

'label_correctness_score' represents a normalized score quantifying the correctness of 'label'.
'is_label_correct' is a boolean flag indicating whether the given 'label' is correct (True) or incorrect (False).
'predicted_label' and 'prediction_probability' represent the curation model's prediction and the corresponding probability score.
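
As a minimal sketch of how these columns might be used downstream, the snippet below splits a curated dataframe into samples to keep and samples to review. The dataframe here is a hand-built stand-in with made-up values (an assumption for illustration, not real CrossValCurate output); only the column names come from the listing above.

```python
import pandas as pd

# Hypothetical stand-in for the output of CrossValCurate.fit_transform.
# The rows and scores are invented for illustration only.
data_curated = pd.DataFrame(
    {
        "text": ["great movie", "terrible plot", "loved it"],
        "label": ["positive", "positive", "positive"],
        "label_correctness_score": [0.93, 0.08, 0.88],
        "is_label_correct": [True, False, True],
        "predicted_label": ["positive", "negative", "positive"],
        "prediction_probability": [0.93, 0.92, 0.88],
    }
)

# Keep only the samples whose given label was flagged as correct.
clean = data_curated[data_curated["is_label_correct"]]

# Collect the flagged samples for manual review, lowest score first.
flagged = data_curated[~data_curated["is_label_correct"]].sort_values(
    "label_correctness_score"
)

print(len(clean), "samples kept;", len(flagged), "flagged for review")
```

Whether to drop flagged samples, relabel them with 'predicted_label', or review them by hand depends on the dataset; the flag is the curation model's estimate and may itself be wrong.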

For more details regarding different hyperparameters available in CrossValCurate, please refer to the API documentation.
