title: adaptest: Data-Adaptive Statistics for High-Dimensional Testing in R
tags: R language, data-adaptive statistics, data mining, multiple testing, computational biology, targeted learning
authors: Weixin Cai, Alan Hubbard, Nima Hejazi
affiliations: Group in Biostatistics, University of California, Berkeley
date: 15 October 2018


The adaptest R package contains an implementation of a methodology based on using data-adaptive statistics for estimating effect sizes, complete with appropriate inference, in high-dimensional settings while avoiding the inferential burdens of multiple testing corrections. To address the issue of multiple testing in situations where the dimensionality is high but sample size comparatively small (e.g., analysis of RNA-seq data), we expose an implementation of a method for statistical inference on data-adaptive target parameters [@hubbard2016statistical] in the form of a software package for the R language and environment for statistical computing [@R].

Data-adaptive test statistics for multiple testing are motivated by the limitations of existing multiple testing methods, such as the popular Benjamini-Hochberg procedure to control the False Discovery Rate (FDR) [@benjamini1995controlling] or the Bonferroni method to control the Family-Wise Error Rate (FWER) [@dunn1961multiple]. Such methods are well studied in the literature on multiple testing, and it is well established that, for a fixed targeted effect size and fixed sample size, power decreases as the number of tests and the corresponding critical values increase [@lazzeroni2010cost]. Further, @lazzeroni2010cost show that if the power for a single test is 80%, then under the Bonferroni correction, a classic method for controlling Type-I error across multiple tests, power falls to approximately 50% for 10 tests, 10% for 1000 tests, and 1% for 100,000 tests. This simple example demonstrates that data analysts and other practitioners must invest substantial, often prohibitive, additional resources in collecting samples in order to obtain meaningful results under high-dimensional multiple testing constraints.
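The power figures quoted above can be approximately reproduced with a short calculation. The sketch below is not taken from @lazzeroni2010cost; it assumes a two-sided z-test at a nominal level of 0.05 (an assumption, since the text does not state the level) and ignores the negligible rejection probability in the opposite tail.

```r
# Assumed setup (not stated in the text): two-sided z-test, nominal alpha = 0.05.
alpha <- 0.05
single_test_power <- 0.80

# Standardized effect (noncentrality) that yields 80% power for one test.
delta <- qnorm(1 - alpha / 2) + qnorm(single_test_power)

# Power when the same effect is tested at the Bonferroni level alpha / m.
bonferroni_power <- function(m) {
  critical_value <- qnorm(alpha / (2 * m), lower.tail = FALSE)
  pnorm(delta - critical_value)
}

n_tests <- c(1, 10, 1000, 100000)
power <- sapply(n_tests, bonferroni_power)
print(data.frame(n_tests = n_tests, power = round(power, 3)))
```

Under these assumptions the computed powers fall close to the 80%, 50%, 10%, and 1% values cited in the paragraph above.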

Our method builds on a recently developed data-adaptive statistical framework and reduces the information loss induced by standard multiple testing procedures through data-adaptive dimensionality reduction. This methodological advance, a data-adaptive multiple testing technique [@cai2018data-adaptive], is a natural extension of the data-adaptive target parameter framework introduced in @hubbard2016statistical and @hubbard2016mining, which present a new class of inference procedures that bring more rigorous statistical inference to problems increasingly addressed by smart yet ad hoc data-mining algorithms.

The approach of data-adaptive test statistics improves on current approaches to multiple testing by applying a set of data-mining algorithms (specified by the user) across splits of a particular sample of data, allowing for parameters of interest to be discovered from the data. Such methods uncover associations that are stable across the full sample and restrict multiple testing to a smaller subset of covariates by allowing for variable importance to be measured via the data-adaptive procedure. Test statistics are formulated on a separately held-out subset of data and are expected to both outperform pre-specified test statistics and provide improved power, all while simultaneously allowing for appropriate statistical inference to be performed.
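The split-then-test idea described above can be illustrated with a deliberately simplified sketch in base R. It is not the implementation in the adaptest package: it uses a single sample split rather than repeated cross-validation, a plain difference in means as a stand-in for a user-specified data-mining algorithm, and Welch t-tests on the held-out half. All object names and the toy data are hypothetical.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Toy data (hypothetical): binary treatment A, p outcomes, 5 with a true effect.
n <- 100; p <- 200; n_top <- 10
A <- rep(0:1, each = n / 2)
Y <- matrix(rnorm(n * p), nrow = n, ncol = p)
Y[, 1:5] <- Y[, 1:5] + 0.6 * A

# Split the sample, stratifying by treatment arm so both halves contain both arms.
train <- sort(c(sample(which(A == 0), n / 4), sample(which(A == 1), n / 4)))
held_out <- setdiff(seq_len(n), train)

# Difference in means between arms, computed only on the given rows.
effect_on <- function(rows) {
  treated <- rows[A[rows] == 1]
  control <- rows[A[rows] == 0]
  colMeans(Y[treated, , drop = FALSE]) - colMeans(Y[control, , drop = FALSE])
}

# Step 1: use the training half to choose which outcomes to test.
selected <- order(abs(effect_on(train)), decreasing = TRUE)[seq_len(n_top)]

# Step 2: test only the selected outcomes, using only the held-out half.
p_values <- vapply(selected, function(j) {
  t.test(Y[held_out[A[held_out] == 1], j],
         Y[held_out[A[held_out] == 0], j])$p.value
}, numeric(1))

# Step 3: adjust for the n_top tests performed rather than all p outcomes.
q_values <- p.adjust(p_values, method = "BH")
results <- data.frame(outcome = selected, p_value = p_values, q_value = q_values)
print(results)
```

Because the outcomes are selected on one half of the data and tested on the other, the multiplicity adjustment covers only the `n_top` selected outcomes instead of all `p`.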

We illustrate how to apply data-adaptive test statistics for multiple testing by considering a simulated randomized trial with a binary treatment and 1000 outcomes (e.g., biomarkers in a microarray analysis). The dataset contains 100 observations. Of the 1000 outcomes (biomarkers), outcomes 1 through 10 have effect sizes equal to 0.6, while the treatment has no effect on outcomes 11 through 1000. After applying our data-adaptive test statistics method (using the adaptest function in the R package), we obtain a rank order (by effect size) for all outcomes in each of multiple cross-validation folds. We then average the rank order across folds and sort the averages in ascending order, which yields Figure \ref{avg_rank}. Among the top 15 outcomes in Figure \ref{avg_rank}, there are two large jumps in average rank order: between outcome 9 and outcome 4, and between outcome 3 and outcome 2. These jumps naturally divide the outcomes into tiers of importance. Outcome 9 consistently ranks highly in the importance measure across the many rounds of cross-validation performed. In this example, we recommend that practitioners first analyze outcome 9 and, if the sample size allows, extend the analysis to the group of outcomes from outcome 4 through outcome 3, and so on. Figure \ref{q_value} displays adjusted p-values for the same set of outcomes as in Figure \ref{avg_rank}; the group of outcomes from outcome 9 through outcome 3 shows highly significant effects.
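A simulation of the kind described above, together with the averaging of rank orders across folds, can be sketched in base R as follows. This is a stand-in for illustration and does not call the adaptest function. The standard normal noise, the balanced treatment assignment, the choice of 5 folds, the difference-in-means effect estimate, and all object names are assumptions not given in the text, so the resulting ranks will not match the figures.

```r
set.seed(2018)  # arbitrary seed for reproducibility

# Simulated trial: 100 observations, 1000 outcomes, effect 0.6 on outcomes 1-10.
n <- 100; p <- 1000; n_signal <- 10; effect <- 0.6
n_fold <- 5                                      # assumed number of folds
A <- rep(0:1, each = n / 2)                      # assumed balanced treatment
Y <- matrix(rnorm(n * p), nrow = n, ncol = p)    # assumed N(0, 1) noise
Y[, seq_len(n_signal)] <- Y[, seq_len(n_signal)] + effect * A

# Assign folds within each arm so every training set contains both arms.
fold <- integer(n)
for (arm in 0:1) {
  idx <- which(A == arm)
  fold[idx] <- sample(rep_len(seq_len(n_fold), length(idx)))
}

# In each fold, rank outcomes by absolute estimated effect on the training part
# (rank 1 = largest absolute difference in means).
rank_mat <- matrix(NA_real_, nrow = n_fold, ncol = p)
for (v in seq_len(n_fold)) {
  train <- fold != v
  est <- colMeans(Y[train & A == 1, , drop = FALSE]) -
    colMeans(Y[train & A == 0, , drop = FALSE])
  rank_mat[v, ] <- rank(-abs(est))
}

# Average the rank order across folds and sort in ascending order.
avg_rank <- colMeans(rank_mat)
names(avg_rank) <- paste0("outcome_", seq_len(p))
top15 <- head(sort(avg_rank), 15)
print(top15)
```

Large gaps between consecutive entries of `top15` play the role of the jumps discussed above: they suggest where one tier of outcomes ends and the next begins.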

Average rank order of outcomes regarding absolute estimated effect size across cross-validation folds (simulated data). The top outcomes are displayed after being sorted in ascending order. \label{avg_rank}

Adjusted p-values (using the Benjamini-Hochberg procedure) of the same set of candidate outcomes, computed on a validation set that is disjoint from the data used to compute the rank order in Figure \ref{avg_rank}. The top outcomes are displayed after being sorted in ascending order. \label{q_value}

The adaptest R package provides utilities for performing the estimation and hypothesis testing procedures discussed above and detailed in @cai2018data-adaptive, alongside utilities for easily producing data visualizations based on the results. The software introduces new classes and methods, based on R's S4 class system, to facilitate its integration into the Bioconductor ecosystem [@huber2015orchestrating], making it well suited for applications in computational biology, where high-dimensional data structures arise very often. The R package includes documentation and detailed vignettes that allow both (bio)statisticians and computational biologists to make efficient use of this new tool in such data-analytic settings.