This repository contains the supplementary code for the paper: "Generating High-Fidelity Synthetic Patient Data for Assessing Machine Learning Healthcare Software" by Allan Tucker, Zhenchen Wang, Ylenia Rotalinti, and Puja Myles.
The code is used to perform a latent variable model analysis, generate synthetic patient data, and compare it with the original ground truth data.
The project is organized into the following directories:
- data/: This directory should contain the input data files. Place the cvdgt.txt file here.
- results/: This directory will store the output files generated by the scripts, including intermediate data samples, tables, and figures.
- scripts/: This directory contains the R scripts for running the analysis.
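Before running anything, it can help to confirm that the layout described above is in place. The following is a minimal sketch, not part of the repository: the helper name check_layout and its arguments are hypothetical, and it only assumes the data/, results/ and scripts/ directories and the cvdgt.txt file name given above.

```r
# Hypothetical helper (not part of the repository): reports whether the
# expected project layout exists under `root`.
check_layout <- function(root = ".", create_results = FALSE) {
  data_file   <- file.path(root, "data", "cvdgt.txt")
  results_dir <- file.path(root, "results")
  scripts_dir <- file.path(root, "scripts")

  # Optionally create the output directory so the scripts can write to it.
  if (create_results && !dir.exists(results_dir)) {
    dir.create(results_dir, recursive = TRUE)
  }

  c(
    data_file   = file.exists(data_file),
    results_dir = dir.exists(results_dir),
    scripts_dir = dir.exists(scripts_dir)
  )
}

# Run from the project root; any FALSE entry points at a missing piece.
print(check_layout("."))
```

Calling check_layout(".", create_results = TRUE) additionally creates results/ if it does not exist yet.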
The following R packages are required to run the scripts:
- bnlearn
- pcalg
- LaplacesDemon
- Rgraphviz
- ggplot2
- gridExtra
- pracma
- missForest
- gRain
- cluster
- arules
- RevoScaleR
- summarytools
- kernlab
- dplyr
- SuperLearner
- precrec
- lmtest
You can install the CRAN packages using the install.packages() function in R. For example:

    install.packages(c("bnlearn", "pcalg", "LaplacesDemon", "ggplot2", "gridExtra", "pracma", "missForest", "gRain", "cluster", "arules", "summarytools", "kernlab", "dplyr", "SuperLearner", "precrec", "lmtest"))

Note: Rgraphviz is distributed through Bioconductor rather than CRAN and requires additional installation steps. Please refer to the Bioconductor website for instructions. RevoScaleR is also not available from CRAN; it ships with Microsoft's R distributions (Microsoft R Client, Machine Learning Server) and cannot be installed with install.packages() from the default repository.
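To avoid reinstalling packages that are already present, the installation step can be sketched as a check for missing packages first. This is an illustrative sketch rather than a script from the repository: the helper names missing_packages and install_missing are hypothetical, Rgraphviz is handled through BiocManager::install(), and RevoScaleR is left out on the assumption that it comes from a Microsoft R distribution rather than from a package repository.

```r
# Packages expected from CRAN (RevoScaleR is omitted: assumed to be provided
# by a Microsoft R distribution) and from Bioconductor.
cran_pkgs <- c("bnlearn", "pcalg", "LaplacesDemon", "ggplot2", "gridExtra",
               "pracma", "missForest", "gRain", "cluster", "arules",
               "summarytools", "kernlab", "dplyr", "SuperLearner",
               "precrec", "lmtest")
bioc_pkgs <- c("Rgraphviz")

# Hypothetical helper: returns the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: with dry_run = TRUE nothing is installed and the
# missing packages are only reported.
install_missing <- function(dry_run = TRUE) {
  cran_missing <- missing_packages(cran_pkgs)
  bioc_missing <- missing_packages(bioc_pkgs)

  if (!dry_run) {
    if (length(cran_missing) > 0) install.packages(cran_missing)
    if (length(bioc_missing) > 0) {
      if (!requireNamespace("BiocManager", quietly = TRUE)) {
        install.packages("BiocManager")
      }
      BiocManager::install(bioc_missing)
    }
  }

  list(cran = cran_missing, bioc = bioc_missing)
}

# Report what would be installed without touching the library.
print(install_missing(dry_run = TRUE))
```

Once the report looks right, install_missing(dry_run = FALSE) performs the installation.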
- Place your data: Put the cvdgt.txt file into the data/ directory.
- Configure the analysis: Open the scripts/config.R file and adjust the parameters if needed. The default settings are based on the original study.
- Run the latent model script: Execute the scripts/latentModel.R script. This script will perform the main analysis and generate the ground truth and synthetic data samples in the results/ directory.

    Rscript scripts/latentModel.R

- Run the tables and figures script: Execute the scripts/tables_figures.R script. This script will perform the comparison analysis and generate the tables and figures. The function calls at the end of the script are commented out. You can uncomment them to run the specific experiments you are interested in.

    Rscript scripts/tables_figures.R
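The two run steps above can also be chained from a single R session so that the second script only runs if the first one succeeds. The following is a minimal sketch under that assumption: run_step and run_pipeline are hypothetical helpers, not part of the repository, and they only rely on the two script paths named above.

```r
# Hypothetical helper: runs one script with Rscript and stops on failure.
run_step <- function(script,
                     rscript = file.path(R.home("bin"), "Rscript")) {
  if (!file.exists(script)) {
    stop("Script not found: ", script)
  }
  # system2() returns the exit status when output is not captured.
  status <- system2(rscript, shQuote(script))
  if (status != 0) {
    stop("Script failed with exit status ", status, ": ", script)
  }
  invisible(status)
}

# Hypothetical helper: runs the analysis scripts in the documented order.
run_pipeline <- function(scripts = c("scripts/latentModel.R",
                                     "scripts/tables_figures.R")) {
  for (script in scripts) {
    run_step(script)
  }
  invisible(TRUE)
}

# From the project root, once data/cvdgt.txt is in place:
# run_pipeline()
```

Stopping at the first failure avoids producing tables and figures from stale samples left in results/ by an earlier run.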
This project uses the testthat package for unit testing. The tests are located in the tests/testthat/ directory.
If you don't have testthat installed, you can install it from CRAN:
    install.packages("testthat")

To run all the unit tests, you can execute the run_tests.R script from the project root directory:

    Rscript tests/run_tests.R

The main scripts are:

- scripts/config.R: This file contains all the configuration parameters for the analysis, such as file paths and model parameters.
- scripts/latentModel.R: This is the main script for the latent variable model analysis. It loads the data, learns the model, and generates synthetic data.
- scripts/tables_figures.R: This script is used to generate the tables and figures that compare the synthetic data with the ground truth data. It includes various statistical tests and visualizations.
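As an illustration of the kind of comparison between synthetic and ground truth data described above, the sketch below runs a two-sample Kolmogorov-Smirnov test on each shared numeric column using base R. It is not taken from scripts/tables_figures.R and does not claim to reproduce the tests used in the paper: the helper compare_marginals is hypothetical, and the data frames are simulated placeholders whose column names (age, sbp) are invented for the example and are not the columns of cvdgt.txt.

```r
set.seed(1)

# Simulated placeholder data; the real samples are written to results/.
ground_truth <- data.frame(age = rnorm(500, mean = 60, sd = 10),
                           sbp = rnorm(500, mean = 130, sd = 15))
synthetic    <- data.frame(age = rnorm(500, mean = 60, sd = 10),
                           sbp = rnorm(500, mean = 130, sd = 15))

# Hypothetical helper: two-sample KS test for every numeric column that
# appears in both data frames.
compare_marginals <- function(gt, syn) {
  vars <- intersect(names(gt), names(syn))
  is_num <- vapply(vars,
                   function(v) is.numeric(gt[[v]]) && is.numeric(syn[[v]]),
                   logical(1))
  num_vars <- vars[is_num]

  tests <- lapply(num_vars, function(v) ks.test(gt[[v]], syn[[v]]))

  data.frame(
    variable     = num_vars,
    ks_statistic = vapply(tests, function(t) unname(t$statistic), numeric(1)),
    p_value      = vapply(tests, function(t) t$p.value, numeric(1)),
    row.names    = NULL
  )
}

print(compare_marginals(ground_truth, synthetic))
```

A small KS statistic with a large p-value means the test found no evidence that the two marginal distributions differ. It says nothing about whether relationships between variables are preserved, which needs separate checks.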