This repository benchmarks tools for tabular data synthesis, providing insights and comparisons that help users identify the most suitable tool for generating synthetic data for their specific use case. It contains the code for the paper: Benchmarking Tabular Data Synthesis: Evaluating Tools, Metrics, and Datasets on Commodity Hardware for End-Users.
Each tool needs its own directory named `toolname-main` containing:
- the model code (added here using a git submodule)
- `run_tool.py` (see the hedged sketch after this list)
- `python-version.txt`
- `requirements.txt`
- `special-torch.txt` (optional, for tools that need a special torch build installed before the other requirements)
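
The exact interface of `run_tool.py` is defined by each tool's wrapper in this repository; purely as an illustration, a minimal entry point could look like the sketch below. The argument names (`--input`, `--output`, `--rows`) and the column-wise resampling "model" are assumptions, not the benchmark's actual API.

```python
# Hypothetical run_tool.py sketch: read the real dataset, "train" a generator,
# and write a fake dataset. All names and arguments here are illustrative.
import argparse
import pandas as pd

def main():
    parser = argparse.ArgumentParser(description="Generate a synthetic copy of a tabular dataset.")
    parser.add_argument("--input", required=True, help="path to the real .csv dataset")
    parser.add_argument("--output", required=True, help="path for the generated fake .csv")
    parser.add_argument("--rows", type=int, default=None, help="number of synthetic rows (default: same as input)")
    args = parser.parse_args()

    real = pd.read_csv(args.input)
    n_rows = args.rows or len(real)

    # Placeholder "synthesizer": resample each column independently with replacement.
    # A real tool would fit and sample from its generative model here instead.
    fake = pd.DataFrame({
        col: real[col].sample(n=n_rows, replace=True).reset_index(drop=True)
        for col in real.columns
    })
    fake.to_csv(args.output, index=False)

if __name__ == "__main__":
    main()
```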
Put the real dataset inside the data folder as a file with a .csv extension. If you want to compare several tools, pre-processing the original dataset and scaling it first is advised (see the sketch below).
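
For instance, a minimal pre-processing step could look like the following sketch (assuming pandas and scikit-learn are installed; the file names and the choice of min-max scaling are only illustrative):

```python
# Hedged example of pre-processing/scaling a dataset before benchmarking.
# File names and the scaler choice are assumptions, not part of the benchmark.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("data/adult_raw.csv")   # hypothetical raw file
df = df.dropna()                         # drop incomplete rows

num_cols = df.select_dtypes(include="number").columns
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])  # scale numeric columns to [0, 1]

df.to_csv("data/adult.csv", index=False)  # the .csv the benchmark will read from data/
```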
Inside `TDSbenchmark`, run:

`python benchmark.py`

Then, when prompted, provide the path to the experiment .json file. For examples of .json configuration files, see the "experiments" folder (the snippet below shows how to inspect one).
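
If you are unsure what an experiment configuration contains, you can pretty-print one of the shipped examples first. A small sketch, assuming the `experiments/per_dataset/adult.json` file referenced in the example below:

```python
# Inspect an existing experiment configuration before running the benchmark.
import json

with open("experiments/per_dataset/adult.json") as f:
    config = json.load(f)

print(json.dumps(config, indent=2))  # pretty-print the configuration keys and values
```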
The benchmark will run the shell script "monitor_usage" to create a .csv file with resource usage (CPU, GPU, memory, time) during the benchmark, as illustrated below. On macOS, use monitor_usageMac.sh.
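
As a rough, Python-based stand-in for what such a monitor does (the actual script is a shell script; `psutil` is an assumption used only for this illustration):

```python
# Simplified illustration of periodic resource sampling written to a CSV.
# This is NOT the monitor_usage script itself, just a sketch of the idea.
import csv
import time
import psutil  # assumption: psutil is installed

with open("usage_sample.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "cpu_percent", "memory_percent"])
    for _ in range(10):  # sample roughly once per second for ~10 seconds
        writer.writerow([
            time.time(),
            psutil.cpu_percent(interval=1),
            psutil.virtual_memory().percent,
        ])
```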
For example, run `python benchmark.py` and, when prompted, enter `experiments/per_dataset/adult.json`.
When finished, the benchmark saves the following (a quick inspection sketch follows the list):
- A fake dataset under fake_datasets/toolname/toolname dataset.csv
- Performance files: one for CPU and memory usage, one for GPU usage, and several for the other evaluation metrics. For the complete list of evaluation metrics, see the original paper.
- For result plots of our benchmark, see the "results" folder.
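
To sanity-check the outputs, you can load the real and fake datasets with pandas; the paths below are placeholders for whatever the benchmark actually wrote:

```python
# Hedged sketch for a quick comparison of the generated data against the real data.
import pandas as pd

real = pd.read_csv("data/adult.csv")                               # the real dataset placed in data/
fake = pd.read_csv("fake_datasets/toolname/toolname dataset.csv")  # placeholder tool output path

print(real.shape, fake.shape)  # same columns and comparable row counts are expected
print(fake.describe())         # basic statistics of the synthetic data
```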