
zhangyuqing/bea_ensemble


Addressing batch effect with ensemble learning

This repository contains all scripts needed to reproduce the results and figures in the following manuscript:

Zhang, Y., Johnson, W. E., & Parmigiani, G. (2019). Robustifying genomic classifiers to batch effects via ensemble learning. bioRxiv, 703587.

Folders in this directory

  • Scripts are stored under ./code
  • Data used in our simulation and real data examples are stored as R data files under ./data
  • ./figures contains all figures in the main article and in supplementary materials
  • ./results_* contain the result files generated by the pipelines, both for the simulations and for the real data application using 6 or 4 studies. They can be used to reproduce the figures under ./figures

Reproduce results in the paper

  • Download this GitHub repository
  • In R, set the current working directory to the GitHub repository: setwd("<parent path>/bea_ensemble/")
  • Run ./code/make_pub_figures_mainpaper.R to reproduce figures in the main paper. Result files from the pipeline, which were used for the figures in the paper, are provided in ./results_*
  • Run ./code/make_pub_figures_supplement.R to reproduce results in the supplementary materials
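For a non-interactive run, the steps above can be sketched as a small shell script. This is a minimal sketch and not part of the repository: the helper name `is_repo_root` and the `BEA_REPO` variable are hypothetical, and it assumes the figure scripts behave the same when launched with `Rscript` from the repository root as they do in an R session after `setwd()`.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the directory in $1 contains
# the figure scripts and the data folder named in this README.
is_repo_root() {
  [ -f "$1/code/make_pub_figures_mainpaper.R" ] &&
  [ -f "$1/code/make_pub_figures_supplement.R" ] &&
  [ -d "$1/data" ]
}

# Path to the downloaded repository (BEA_REPO is an assumed variable name);
# defaults to the current directory.
repo=${BEA_REPO:-.}

if is_repo_root "$repo"; then
  # Run from the repository root so relative paths such as ./results_* resolve.
  (
    cd "$repo" &&
    Rscript ./code/make_pub_figures_mainpaper.R &&
    Rscript ./code/make_pub_figures_supplement.R
  )
else
  echo "Not the bea_ensemble root: $repo" >&2
fi
```

The check runs first so that a wrong working directory is reported before either figure script starts.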

Simulation

To run the simulation pipeline, run the following from the command line:

Rscript 1_simpipe.R <sample size per batch> <mean batch effect> <variance batch effect>

This writes the performance metrics of the models to a sub-directory named ./results, which is created if it does not exist. The result files can be used to generate the figures in the paper. The scripts can also be run in an HPC environment, using the bash scripts included in the code directory.
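To cover several settings at once, the call above can be wrapped in a loop over its three arguments. The following is a minimal sketch, not one of the bash scripts shipped in the code directory: the function name `sim_commands` and the grid values are hypothetical placeholders, and it assumes the commands are launched from the directory that contains 1_simpipe.R. It only prints the command lines, so they can be inspected before anything runs.

```shell
#!/bin/sh
# Hypothetical helper: print one Rscript call per combination of
# sample size per batch ($1), mean batch effect ($2) and
# variance batch effect ($3). Each argument is a space-separated list.
sim_commands() {
  for n in $1; do
    for mean in $2; do
      for var in $3; do
        echo "Rscript 1_simpipe.R $n $mean $var"
      done
    done
  done
}

# Placeholder grid for illustration, not the values used in the paper.
sim_commands "20 50" "0 3" "1 4"
```

Piping the output to `sh` runs the commands one after another; on a cluster, each printed line could instead become the body of a separate job script.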

Real data application

./code/2_TB_getdata.R generates the data under ./data. This script downloads the real data from GEO using GEOquery, then annotates and cleans them.

./code/3_real_data_pipe.R performs bootstrap sampling on the test data and evaluates the trained model on each bootstrap sample. Run the pipeline with

Rscript 3_real_data_pipe.R

Model performance metrics are written to a sub-directory named ./results_real (created automatically if it does not exist).
