syndara-lab/debiased-generation

Debiasing Synthetic Data Generated by Deep Generative Models

Code to reproduce the results in "Debiasing Synthetic Data Generated by Deep Generative Models", presented at the 38th Annual Conference on Neural Information Processing Systems (2024) and available from https://arxiv.org/abs/2411.04216.

In previous work, we demonstrated that using deep generative models (DGMs) to generate synthetic data introduces considerable bias and imprecision into synthetic data analyses, leading to an inflated type I error rate. This arises from the regularization bias inherent to these models, which are optimized for prediction error rather than estimation error. In this work, we propose a debiasing strategy that targets synthetic data created by DGMs for specific data analyses. Our approach delivers confidence interval coverage for the target parameter that approximates the nominal level, thereby advancing the reliability and applicability of synthetic data in statistical inference.
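To make the problem concrete, the following is a minimal toy sketch and not the procedure implemented in this repository. It assumes a hypothetical "regularized" generator (`toy_generator`, a Gaussian fit whose mean is shrunk toward zero) and a simple mean-shift correction (`mean_shift_debias`) that targets one specific analysis, the population mean. All function names and parameter values are illustrative assumptions.

```python
import math
import random
import statistics


def mean_ci(sample, z=1.96):
    """Normal-approximation 95% confidence interval for the mean."""
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - z * se, m + z * se


def toy_generator(original, n_synth, shrink, rng):
    """Hypothetical generator: Gaussian fit with its mean shrunk toward 0.

    The shrinkage stands in for regularization bias; it is an assumption
    made for illustration, not a model of CTGAN or TVAE.
    """
    mu = shrink * statistics.fmean(original)
    sigma = statistics.stdev(original)
    return [rng.gauss(mu, sigma) for _ in range(n_synth)]


def mean_shift_debias(synthetic, original):
    """Shift synthetic data so its sample mean matches the original mean."""
    shift = statistics.fmean(original) - statistics.fmean(synthetic)
    return [x + shift for x in synthetic]


def coverage(true_mean=1.0, n=200, runs=500, shrink=0.8, seed=0):
    """Empirical CI coverage for naive and mean-shifted synthetic data."""
    rng = random.Random(seed)
    hits_naive = hits_debiased = 0
    for _ in range(runs):
        original = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        synthetic = toy_generator(original, n, shrink, rng)
        lo, hi = mean_ci(synthetic)
        hits_naive += lo <= true_mean <= hi
        lo, hi = mean_ci(mean_shift_debias(synthetic, original))
        hits_debiased += lo <= true_mean <= hi
    return hits_naive / runs, hits_debiased / runs


if __name__ == "__main__":
    naive, debiased = coverage()
    print(f"naive coverage:    {naive:.3f}")
    print(f"debiased coverage: {debiased:.3f}")
```

Under these assumptions, the naive intervals should cover the true mean far less often than the nominal 95%, while the shifted data should bring coverage back close to it. The correction only repairs the analysis it targets (here, the mean), which mirrors the idea of tailoring the debiasing to a specific data analysis.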

Experiments

The following class and helper files are included:

  • utils/custom_ctgan.py: class to train CTGAN (using the SDV backend)
  • utils/custom_tvae.py: class to train TVAE (using the SDV backend)
  • utils/disease.py: simulate low-dimensional tabular toy data sampled from an arbitrary ground truth population
  • utils/eval.py: functions to calculate and plot inferential utility metrics

Simulation study:

  • sim_generate.py: sample original data and generate (default and debiased) synthetic versions using different generative models
  • sim_evaluate.py: calculate inferential utility metrics of original and synthetic datasets
  • sim_output.ipynb: notebook containing all output (figures and tables) presented in the paper
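This README does not spell out which inferential utility metrics are computed. As a rough sketch, assuming the usual repeated-sampling summaries (bias, empirical standard error, average model-based standard error, and confidence interval coverage), they could be calculated as follows. The function `inferential_utility` is a hypothetical helper written for illustration and is not the API of utils/eval.py.

```python
import statistics


def inferential_utility(estimates, std_errors, true_value, z=1.96):
    """Summarize repeated-sampling results for one estimator.

    estimates:  point estimates, one per simulated dataset
    std_errors: model-based standard errors, one per simulated dataset
    true_value: the population value of the target parameter
    """
    n_runs = len(estimates)
    # Bias: average estimate minus the truth.
    bias = statistics.fmean(estimates) - true_value
    # Empirical SE: spread of the estimates across simulation runs.
    empirical_se = statistics.stdev(estimates)
    # Average of the standard errors reported by the analysis itself.
    mean_model_se = statistics.fmean(std_errors)
    # Coverage: share of runs whose interval contains the truth.
    covered = sum(
        est - z * se <= true_value <= est + z * se
        for est, se in zip(estimates, std_errors)
    )
    return {
        "bias": bias,
        "empirical_se": empirical_se,
        "mean_model_se": mean_model_se,
        "coverage": covered / n_runs,
    }


if __name__ == "__main__":
    summary = inferential_utility(
        estimates=[0.9, 1.0, 1.1, 1.5],
        std_errors=[0.1, 0.1, 0.1, 0.1],
        true_value=1.0,
    )
    print(summary)
```

Comparing `empirical_se` with `mean_model_se` indicates whether the reported standard errors understate the true variability, which is one way that coverage can fall below the nominal level.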

Case study 1:

  • case_study1/IST_generate.py: sample original data and generate (default and debiased) synthetic versions from the International Stroke Trial (IST) dataset using CTGAN and TVAE
  • case_study1/utils/eval_IST.py: additional functions to calculate and plot inferential utility metrics for the IST data
  • case_study1/IST_evaluate.ipynb: notebook containing all output (figures and tables) presented in the paper

Case study 2:

  • case_study2/case_study2.py: sample original data and generate (default and debiased) synthetic versions from the Adult Census Income dataset using TVAE
  • case_study2/case_study2_output.ipynb: notebook containing all output (figures and tables) presented in the paper

Cite

If our paper or code helped you in your own research, please cite our work as:

@InProceedings{decruyenaere2024debiasing,
  title     = {Debiasing Synthetic Data Generated by Deep Generative Models},
  author    = {Decruyenaere, Alexander and Dehaene, Heidelinde and Rabaey, Paloma and Polet, Christiaan and Decruyenaere, Johan and Demeester, Thomas and Vansteelandt, Stijn},
  booktitle = {Proceedings of the 38th Annual Conference on Neural Information Processing Systems},
  year      = {2024},
  pdf       = {https://arxiv.org/pdf/2411.04216},
  url       = {https://arxiv.org/abs/2411.04216},
  abstract  = {While synthetic data hold great promise for privacy protection, their statistical analysis poses significant challenges that necessitate innovative solutions. The use of deep generative models (DGMs) for synthetic data generation is known to induce considerable bias and imprecision into synthetic data analyses, compromising their inferential utility as opposed to original data analyses. This bias and uncertainty can be substantial enough to impede statistical convergence rates, even in seemingly straightforward analyses like mean calculation. The standard errors of such estimators then exhibit slower shrinkage with sample size than the typical 1 over root-n rate. This complicates fundamental calculations like p-values and confidence intervals, with no straightforward remedy currently available. In response to these challenges, we propose a new strategy that targets synthetic data created by DGMs for specific data analyses. Drawing insights from debiased and targeted machine learning, our approach accounts for biases, enhances convergence rates, and facilitates the calculation of estimators with easily approximated large sample variances. We exemplify our proposal through a simulation study on toy data and two case studies on real-world data, highlighting the importance of tailoring DGMs for targeted data analysis. This debiasing strategy contributes to advancing the reliability and applicability of synthetic data in statistical inference.}
}
