xiemeigongzi/awesome-data-synthesis

Inspired by awesome-production-machine-learning

Awesome Data Synthesis

This repository contains a curated list of awesome resources for creating synthetic data.

Main Content

Data-driven methods

Tabular

  • FakeR - Generates fake data from a dataset of different variable types
  • CTGAN - CTGAN is a GAN-based data synthesizer that can generate synthetic tabular data with high fidelity. - Paper
  • TGAN - Outdated and superseded by CTGAN
  • gretel - Creates fake, synthetic datasets with enhanced privacy guarantees
  • On the Generation and Evaluation of Synthetic Tabular Data using GANs - we propose using the WGAN-GP architecture for training the GAN, which suffers less from mode-collapse and has a more meaningful loss.
  • Synthpop - A tool for producing synthetic versions of microdata containing confidential information so that they are safe to be released to users for exploratory analysis.
  • DataSynthesizer - DataSynthesizer generates synthetic data that simulates a given dataset. It applies Differential Privacy techniques to achieve a strong privacy guarantee.
  • MedGAN - medGAN is a generative adversarial network for generating multi-label discrete patient records. It can generate both binary and count variables (i.e. medical codes such as diagnosis codes, medication codes or procedure codes) - Paper
  • MC-MedGAN - Multi-Categorical GANs - Paper
  • tableGAN - tableGAN is a synthetic data generation technique based on the Generative Adversarial Network architecture (DCGAN), from the paper Data Synthesis based on Generative Adversarial Networks. - Paper
  • VEEGAN - Reducing Mode Collapse in GANs using Implicit Variational Learning - Paper
  • DP-WGAN - The solution trains a Wasserstein generative adversarial network (WGAN) on the real private dataset. Differentially private training is applied by sanitizing (norm clipping and adding Gaussian noise) the gradients of the discriminator. Once the model is trained, it can be used to generate a synthetic dataset by feeding random noise into the generator.
  • DP-GAN - Differentially private release of semantic rich data - Paper
  • DP-GAN 2 - Source code of paper "Differentially Private Generative Adversarial Network" - Paper
  • PateGAN - Our method modifies the Private Aggregation of Teacher Ensembles (PATE) framework and applies it to GANs - Paper
  • bnomics - Synthetic data generation with probabilistic Bayesian Networks - Paper
  • MPoM - Paper
  • CLGP - categorical latent Gaussian process is a generative model for multivariate categorical data - Paper
  • COR-GAN - Correlation-Capturing Convolutional Neural Networks for Generating Synthetic Healthcare Records - Paper
  • synergetr - An R package to generate synthetic data with empirical probability distributions - Paper
  • DPautoGAN - Code for the paper Differentially Private Mixed-Type Data Generation for Unsupervised Learning - Paper
  • SynC - SynC: A Unified Framework for Generating Synthetic Population with Gaussian Copula - Paper
  • NIST-PSCR - Code and Data for NIST PSCR Differential Privacy Synthetic Data Challenge - Paper
  • Bn-learn Latent Model - Generating High-Fidelity Synthetic Patient Data for Assessing Machine Learning Healthcare Software - Paper
  • SAP Security research sample - SAP Security research sample code and tutorials for generating differentially private synthetic datasets using generative deep learning models
  • Python synthpop - Python implementation of the R package synthpop.
  • UCLANesl - NIST Differential Privacy Challenge (Match 3)
  • Repo on generating synthetic data using GAN - Repo on generating synthetic data using GAN
  • synthia - 📈 🐍 Multidimensional synthetic data generation in Python
  • Synthetic_Data_System - The Alpha Build of the SDS for ideas gathering, testing and commentary
  • QUIPP - Privacy preserving synthetic data generation workflows
  • MSFT synthetic data showcase - Generates synthetic data and user interfaces for privacy-preserving data sharing and analysis.
  • extended-MedGan - Synthetic patient data using generative adversarial networks.
  • Synthesizing quality open data - Synthesizing Quality Open Data Assets from Private Health Research Studies
  • bayesian-synthetic-generator - Repository of a software system for generating synthetic personal data based on the Bayesian network block structure
  • Generating-Synthetic-data-using-GANs - How can we safely and efficiently share encrypted data that is also useful? We use the mechanism of GANs used to generate fake images to generate synthetic tabular data.
  • synthetic health data
  • Synthetic data Copula
  • PrivBayes
  • pategan
  • HoloClean - A Machine Learning System for Data Enrichment.
  • SYNDATA - Generation and evaluation of synthetic patient data - Paper
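
Several entries above (DP-WGAN in particular) describe differentially private training as sanitizing gradients by norm clipping and adding Gaussian noise. The sketch below illustrates only that one step in plain Python. It is not code from any of the listed repositories: the function names `clip_by_l2_norm` and `sanitize_gradients` and the parameters `clip_norm` and `noise_multiplier` are assumptions chosen for illustration, and no privacy accounting is performed.

```python
import math
import random


def clip_by_l2_norm(grad, clip_norm):
    """Scale grad so that its L2 norm is at most clip_norm (hypothetical helper)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm or norm == 0.0:
        return list(grad)
    scale = clip_norm / norm
    return [g * scale for g in grad]


def sanitize_gradients(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum them, add Gaussian noise, then average.

    Assumption: the noise standard deviation is noise_multiplier * clip_norm,
    which is a common convention but is not taken from any listed repository.
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    clipped = [clip_by_l2_norm(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / n for v in noisy]


# Toy per-example discriminator gradients (made-up numbers for illustration).
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
sanitized = sanitize_gradients(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
print(sanitized)
```

Clipping bounds how much any single record can influence the update, and the noise is scaled to that bound. A real implementation would also track the cumulative privacy budget, which this sketch leaves out.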

Multiple formats

Time Series

Sensor data

Process-driven methods

Tabular

Students

Population

Patients & Medical Data

Metrics and dataset evaluation

Other

https://github.com/jmschrei/pomegranate
https://github.com/Pushkar-v/Generating-Synthetic-Data-using-GANs
https://github.com/ydataai/ydata-synthetic
https://github.com/DPBayes/data-sharing-examples
https://github.com/jclymo/DataGen_NPBE
https://github.com/theodi/synthetic-data-tutorial
https://github.com/tirthajyoti/Synthetic-data-gen
https://github.com/jhajagos/SynthMedTopia
https://github.com/spiros/tofu
https://github.com/ewvanwinkle/SyntheticDataVault
https://github.com/chasebos91/GAN-for-Synthetic-EEG-Data
https://github.com/jgalilee/data
https://github.com/blt2114/overpruning_in_variational_bnns
https://github.com/avensolutions/synthetic-cdc-data-generator
https://github.com/nikk-nikaznan/SSVEP-Neural-Generative-Models
https://github.com/MertNacar/create-synthetic-data
