Code for the AISTATS 2024 Paper "From Data Imputation to Data Cleaning - Automated Cleaning of Tabular Data Improves Downstream Predictive Performance"

se-jaeger/conformal-data-cleaning

Conformal Data Cleaning

This repository contains source code for the experiments conducted in the AISTATS 2024 paper From Data Imputation to Data Cleaning - Automated Cleaning of Tabular Data Improves Downstream Predictive Performance.

Run Experiments

First, use load_corrupt_and_test_datasets.ipynb to download and corrupt the datasets and to set up the expected structure of the data directory.

run_experiment.py implements a simple CLI script (run-experiment) that makes it easy to run experiments.

Conformal Data Cleaning:

run-experiment \
	--task_id \
	"42493" \
	--error_fractions \
	"0.01" \
	"0.05" \
	"0.1" \
	"0.3" \
	"0.5" \
	--num_repetitions \
	"3" \
	--results_path \
	"/conformal-data-cleaning/results/final-experiments" \
	--models_path \
	"/conformal-data-cleaning/models/final-experiments" \
	--how_many_hpo_trials \
	"50" \
	experiment \
	--confidence_level \
	"0.999"

ML Baseline:

run-experiment \
	--task_id \
	"42493" \
	--error_fractions \
	"0.01" \
	"0.05" \
	"0.1" \
	"0.3" \
	"0.5" \
	--num_repetitions \
	"3" \
	--results_path \
	"/conformal-data-cleaning/results/final-experiments" \
	--models_path \
	"/conformal-data-cleaning/models/final-experiments" \
	--how_many_hpo_trials \
	"50" \
	baseline \
	--method \
	"AutoGluon" \
	--method_hyperparameter \
	"0.999"

PyOD Baseline (not included in the paper):

run-experiment \
	--task_id \
	"42493" \
	--error_fractions \
	"0.01" \
	"0.05" \
	"0.1" \
	"0.3" \
	"0.5" \
	--num_repetitions \
	"3" \
	--results_path \
	"/conformal-data-cleaning/results/final-experiments" \
	--models_path \
	"/conformal-data-cleaning/models/final-experiments" \
	--how_many_hpo_trials \
	"50" \
	baseline \
	--method \
	"PyodECOD" \
	--method_hyperparameter \
	"0.3"

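The three run-experiment invocations above share all of their leading arguments and differ only in the subcommand and its options. As a minimal sketch (not part of the repository), the shared arguments could be collected once in a POSIX shell script. The DRY_RUN variable and the run helper are hypothetical names introduced here for illustration; the flags and values are copied from the commands above.

```shell
#!/bin/sh
# Sketch only: reuse the shared arguments of the commands shown above.
set -eu

# Shared arguments, copied verbatim from the commands above.
# They are stored in the positional parameters so quoting is preserved.
set -- \
	--task_id "42493" \
	--error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
	--num_repetitions "3" \
	--results_path "/conformal-data-cleaning/results/final-experiments" \
	--models_path "/conformal-data-cleaning/models/final-experiments" \
	--how_many_hpo_trials "50"

# DRY_RUN is a hypothetical variable, not an option of run-experiment.
# With DRY_RUN=1 (the default here) each command is printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
	if [ "$DRY_RUN" = "1" ]; then
		printf '%s\n' "$*"
	else
		"$@"
	fi
}

# Conformal Data Cleaning
run run-experiment "$@" experiment --confidence_level "0.999"
# ML Baseline
run run-experiment "$@" baseline --method "AutoGluon" --method_hyperparameter "0.999"
# PyOD Baseline
run run-experiment "$@" baseline --method "PyodECOD" --method_hyperparameter "0.3"
```

By default the script only prints the three commands; setting DRY_RUN=0 in the environment would execute them, assuming run-experiment is installed and the data directory has been prepared as described above.
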
For Garf, please use main.py.

python \
	main.py \
	--task_id \
	"42493" \
	--error_fractions \
	"0.01" \
	"0.05" \
	"0.1" \
	"0.3" \
	"0.5" \
	--num_repetitions \
	"3" \
	--results_path \
	"/conformal-data-cleaning/results/final-experiments" \
	--models_path \
	"/conformal-data-cleaning/models/final-experiments"

Run our Experimental Setup

We ran our experiments on Kubernetes using Helm. Please check out the Helm charts and change the image and imagePullSecrets settings in the values.yaml files according to your setup. In addition, some read-write-many volumes are necessary to store the experiment results. Please check out the infrastructure/k8s directory (and don't forget to set up the data directory as described above).

Running make docker builds and pushes the necessary Docker images, and running make helm-install uses deploy_experiments.py to start our experimental setup.

Evaluation

notebooks/evaluation contains the notebooks we use for evaluating the results; 5_plotting.ipynb outputs the plots shown in the paper.
