LAVA: Data Valuation without Pre-Specified Learning Algorithms

This repository is the official implementation of "LAVA: Data Valuation without Pre-Specified Learning Algorithms" (ICLR 2023).

We propose LAVA, a novel model-agnostic framework for data valuation based on a non-conventional, class-wise Wasserstein discrepancy. We further introduce an efficient way to measure each datapoint's contribution at no extra cost, directly from the solution of the optimization problem.

Limitations of traditional data valuation methods

Traditional data valuation methods assume knowledge of the underlying learning algorithm, which causes three problems:

The learning algorithm is often unknown prior to valuation.
The training process is stochastic, so the resulting values are unstable.
Valuation requires model training, which imposes a computational burden.

Data Valuation via Optimal Transport

We propose data valuation via optimal transport (OT) as a replacement for current data valuation frameworks, which rely on the underlying learning algorithm.

Strong analytical properties of OT:

well-defined distance metric
computationally tractable
computable from finite samples
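
As a toy illustration of the "computable from finite samples" property, the sketch below computes the 1-D Wasserstein-1 distance between two equal-size samples with uniform weights. This is not LAVA's class-wise Wasserstein distance on feature embeddings (that computation is delegated to OTDD); the function name `wasserstein_1d` is hypothetical and only meant to show that an OT distance can be evaluated directly from samples.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples (uniform weights)."""
    if len(xs) == 0 or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    # In 1-D with uniform weights, an optimal plan matches the sorted samples in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


train_sample = [0.0, 1.0, 2.0]
valid_sample = [2.0, 0.0, 1.0]   # same points, different order
shifted_sample = [1.0, 2.0, 3.0]  # every point moved by 1

print(wasserstein_1d(train_sample, valid_sample))    # identical distributions
print(wasserstein_1d(train_sample, shifted_sample))  # uniform shift by 1
```

The distance is zero for identical samples and symmetric in its arguments, which matches the metric property listed above.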

LAVA: Individual Datapoint Valuation

To value individual datapoints, we propose the notion of a calibrated gradient, which measures the sensitivity of the dataset distance to a datapoint by shifting the probability mass of that datapoint in the dual OT formulation.

$$\text{Value}(z_i) = \frac{\partial\,\text{OT}(\mu_t,\mu_v)}{\partial\mu_t(z_i)} = f_i^* - \sum_{j\in\{1,\dots,N\}\setminus i} \frac{f_j^*}{N-1}$$

It is exactly the gradient of the dual formulation.
It is obtained for free when solving the original OT problem.
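
The formula above can be sketched in a few lines of plain Python. The snippet assumes the dual potentials $f_1^*, \dots, f_N^*$ for the $N$ training points are already available as a list of floats; the function name `calibrated_gradients` is hypothetical and is not part of the `lava` API.

```python
def calibrated_gradients(dual_potentials):
    """Return f_i - mean of the other f_j for every training point i."""
    n = len(dual_potentials)
    if n < 2:
        raise ValueError("need at least two datapoints")
    total = sum(dual_potentials)
    # (total - f) is the sum over all j != i, so dividing by (n - 1) gives their mean.
    return [f - (total - f) / (n - 1) for f in dual_potentials]


# Toy dual solution for three training points (made-up numbers).
values = calibrated_gradients([1.0, 2.0, 3.0])
print(values)  # [-1.5, 0.0, 1.5]
```

Computing the sum once keeps the whole pass linear in $N$, which reflects why the values add no meaningful cost once the OT problem has been solved.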

Applications

LAVA can be applied to numerous data quality problems:

Mislabeled Data
Noisy Features
Dataset Redundancy
Dataset Bias
Irrelevant Data
and more.

Requirements

Create the conda virtual environment from environment.yaml.

conda env create -f environment.yaml python=3.8

Getting Started

Import the LAVA package.

import lava

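Create a corrupted dataset together with the index list of its corrupted data points, or create your own. The variables resize, training_size, valid_size, and portion are settings you define beforehand.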

loaders, shuffle_ind = lava.load_data_corrupted(corrupt_type='shuffle', dataname='CIFAR10', resize=resize, training_size=training_size, test_size=valid_size, currupt_por=portion)

Load a feature embedder.

feature_extractor = lava.load_pretrained_feature_extractor('cifar10_embedder_preact_resnet18.pth', device)

Compute the Dual Solution of the Optimal Transport problem.

dual_sol, trained_with_flag = lava.compute_dual(feature_extractor, loaders['train'], loaders['test'], training_size, shuffle_ind, resize=resize)

Compute the data values with LAVA and visualize them.

calibrated_gradient = lava.compute_values_and_visualize(dual_sol, trained_with_flag, training_size, portion)
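
Once values are available, one way to use them is to rank datapoints and check how many known corrupted points appear among the top-ranked ones. The sketch below is an illustration under stated assumptions, not the repository's evaluation code: it assumes the values are a plain list of floats, that a larger calibrated gradient marks a point whose added mass increases the distance to the validation set (and is therefore more suspicious), and the helper name `detection_rate` is hypothetical.

```python
def detection_rate(values, corrupted_indices, inspect_fraction):
    """Fraction of corrupted points found among the highest-valued datapoints."""
    n = len(values)
    corrupted = set(corrupted_indices)
    if n == 0 or not corrupted:
        return 0.0
    # Number of datapoints to inspect, at least one.
    k = max(1, round(n * inspect_fraction))
    # Assumption: larger value = more likely to be low quality, so sort descending.
    ranked = sorted(range(n), key=lambda i: values[i], reverse=True)
    inspected = set(ranked[:k])
    return len(inspected & corrupted) / len(corrupted)


# Made-up values for five training points; points 0 and 2 are the corrupted ones.
toy_values = [0.9, -0.2, 0.5, -0.7, 0.1]
toy_corrupted = [0, 2]

print(detection_rate(toy_values, toy_corrupted, inspect_fraction=0.2))  # 0.5
print(detection_rate(toy_values, toy_corrupted, inspect_fraction=0.4))  # 1.0
```

In practice, the list of corrupted indices would play the role of shuffle_ind from the data-loading step, although the exact type returned by the library calls above is not specified here.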

Examples

To illustrate how LAVA is applied to data valuation, we provide example notebooks for CIFAR-10 (example-cifar10.ipynb) and STL-10 (example-stl10.ipynb).

Checkpoints

The pretrained embedders are included in the folder 'checkpoint'.

Optimal Transport Solver

This repo relies on the OTDD implementation to compute the class-wise Wasserstein distance.
We are immensely grateful to the authors of that project.

