stefan-m-lenz/dist-gen-comp

Comparison of distributed generation of synthetic data using DBMs, VAEs, GANs and MICE

This repository contains an experiment that compares deep Boltzmann machines (DBMs), variational autoencoders (VAEs), generative adversarial networks (GANs) and multivariate imputation by chained equations (MICE) as generative approaches on distributed data. The experiment uses data sets of genetic variants from the 1000 Genomes Project that are split across a number of sites, which could represent different medical centers with sensitive patient data. Models are trained at the sites, and the synthetic data generated by these models are collected. The generative approaches are then compared in terms of the distances between the log odds ratios of held-out validation data and those of synthetic observations sampled from the models. More details about the experimental setup can be found in the article

[1] Lenz, S., Hess, M. & Binder, H. Deep generative models in DataSHIELD. BMC Med Res Methodol 21, 64 (2021). https://doi.org/10.1186/s12874-021-01237-6
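
As a rough illustration of the comparison metric described above, the following Julia sketch computes pairwise log odds ratios for two binary data sets and measures how far apart they are. It is not taken from the repository's scripts: the function names (`logoddsratio`, `pairwise_logodds`, `logodds_distance`), the 0.5 continuity correction, and the choice of the Euclidean distance are all assumptions made for this example. See [1] for the definition actually used in the experiment.

```julia
# Hypothetical helper: log odds ratio of two binary vectors.
# The 0.5 added to each cell is an assumed continuity correction to avoid log(0).
function logoddsratio(a::AbstractVector{<:Real}, b::AbstractVector{<:Real})
    n11 = count((a .== 1) .& (b .== 1)) + 0.5
    n10 = count((a .== 1) .& (b .== 0)) + 0.5
    n01 = count((a .== 0) .& (b .== 1)) + 0.5
    n00 = count((a .== 0) .& (b .== 0)) + 0.5
    return log(n11 * n00 / (n10 * n01))
end

# Log odds ratios for all pairs of columns (i < j) of a binary matrix
# whose rows are observations and whose columns are variables.
function pairwise_logodds(x::AbstractMatrix{<:Real})
    nvars = size(x, 2)
    return [logoddsratio(view(x, :, i), view(x, :, j))
            for i in 1:nvars for j in (i + 1):nvars]
end

# Assumed distance: Euclidean distance between the two vectors of
# pairwise log odds ratios (validation data vs. synthetic data).
function logodds_distance(xval::AbstractMatrix{<:Real}, xsyn::AbstractMatrix{<:Real})
    return sqrt(sum(abs2, pairwise_logodds(xval) .- pairwise_logodds(xsyn)))
end

# Toy data: four observations of three binary variables.
xval = [1 0 1; 0 0 1; 1 1 0; 0 1 0]
xsyn = copy(xval)

println(logodds_distance(xval, xsyn))  # identical data sets give 0.0
```

A smaller distance means that the synthetic data reproduce the pairwise association structure of the validation data more closely.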

The code in this repository has been adapted from the article

[2] Nußberger J, Boesel F, Lenz S, Binder H, Hess M. Synthetic observations from deep generative models and binary omics data with limited sample size. Briefings in Bioinformatics. 2020. doi:10.1093/bib/bbaa226.

The original data is from https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html. The preprocessed data for the experiment can be found in the data folder.

Running the Julia script comparison_odds.jl produces the file results_odds.tsv. This took about 25 hours on a cluster of three machines with 8, 12, and 28 cores with clock speeds of about 3 GHz and at least 8 GB RAM per core. The R script comparisonplot_odds.r can be used to reproduce figures 6 and 7 in [1] from results_odds.tsv.

Running the Julia script comparison_membership.jl performs a membership attack and produces the file results_membership.tsv. Running the script took 11 minutes using the same machines. This script uses results_odds.tsv as input and evaluates only the models with the best scores concerning the odds ratio distances therein. The R script comparisonplot_membership.r can be used to reproduce figures 8 and 9 in [1] from the output file.

The Julia code runs with Julia version 1.5. The packages used are defined in the files Manifest.toml and Project.toml.
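
One way to set up the package environment from these two files is sketched below, assuming Julia is started in the root folder of the repository. It uses the standard `Pkg` functions `Pkg.activate` and `Pkg.instantiate`. The repository does not prescribe this exact procedure, so treat it as one possible approach.

```julia
using Pkg

# Use the Project.toml and Manifest.toml in the current directory
# (assumed here to be the root folder of the repository).
Pkg.activate(".")

# Install the package versions recorded in Manifest.toml.
Pkg.instantiate()

# The experiment scripts can then be run from the same session, e.g.:
# include("comparison_odds.jl")   # long-running, see the runtime noted above
```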
