diff --git a/vignettes/differentialExpression.Rmd b/vignettes/differentialExpression.Rmd index 9cf8549..b6d30d7 100644 --- a/vignettes/differentialExpression.Rmd +++ b/vignettes/differentialExpression.Rmd @@ -12,10 +12,15 @@ vignette: > ## Introduction -The `adaptest` R package can be used to perform data-mining and high-dimensional statistical tests that is common in differential expression studies. The package utilizes a two stage procedure: -1. data-mining stage: reduce the dimension of biomarkers based on the associations of biomarkers with an exposure variable. -2. multiple testing stage: adjust for multiple testing to control false positives. -In this vignette, we illustrate how to use `adaptest` to perform such analysis, using a data set containing microarray expression measures. +The `adaptest` R package can be used to perform data-mining and high-dimensional +statistical tests that is common in differential expression studies. The package +utilizes a two stage procedure: +1. data-mining stage: reduce the dimension of biomarkers based on the + associations of biomarkers with an exposure variable. +2. multiple testing stage: adjust for multiple testing to control false + positives. +In this vignette, we illustrate how to use `adaptest` to perform such analysis, +using a data set containing microarray expression measures. --- @@ -24,6 +29,7 @@ In this vignette, we illustrate how to use `adaptest` to perform such analysis, First, we load the `adaptest` package and the (included) `simpleArray` data set: ```{r setup_data} +set.seed(1234) library(adaptest) data(simpleArray) "%ni%" = Negate("%in%") @@ -32,16 +38,20 @@ data(simpleArray) In order to perform Targeted Minimum Loss-Based Estimation, we need three separate data structures: (1) _W_, baseline covariates that could potentially confound the association of biomarkers with the exposure of interest; (2) _A_, -the point exposure of interest; and (3) _Y_, the biomarkers of interest. All values in _A_ ought to be binarized, in order to avoid practical violations of -the assumption of positivity. To invoke the data-adaptive testing function (`adaptest`), we also need to specify the number of top biomarkers `n_top` to the data-mining algorithm, and the number of folds `n_fold` for cross-validation. The smaller `n_top` is, the more selective data-mining algorithm we have. The larger `n_fold` is, more folds are carried our in cross validaiton. - -The TMLE-based biomarker discovery process can be invoked using the -`adaptest` function. The procedure is quite resource-intensive because it -evaluates the association of each individual potential biomarker (of which there -are 1e3 in the included data set) with an exposure of interest, while -accounting for potential confounding based on all other covariates included in -the design matrix. We demonstrate the necessary syntax for calling -`adaptest` below: +the point exposure of interest; and (3) _Y_, the biomarkers of interest. All +values in _A_ ought to be binarized, in order to avoid practical violations of +the assumption of positivity. To invoke the data-adaptive testing function +(`adaptest`), we also need to specify the number of top biomarkers `n_top` to +the data-mining algorithm, and the number of folds `n_fold` for cross- +validation. The smaller `n_top` is, the more selective data-mining algorithm we +have. The larger `n_fold` is, more folds are carried our in cross validaiton. + +The TMLE-based biomarker discovery process can be invoked using the `adaptest` +function. The procedure is quite resource-intensive because it evaluates the +association of each individual potential biomarker (of which there are 1e3 in +the included data set) with an exposure of interest, while accounting for +potential confounding based on all other covariates included in the design +matrix. We demonstrate the necessary syntax for calling `adaptest` below: ```{r adaptest_eval, eval=TRUE} adaptestout <- adaptest(Y = Y, @@ -56,38 +66,58 @@ adaptestout <- adaptest(Y = Y, data(adaptestout) ``` -The output of `adaptest` is an object of class `adaptest`, containing the following objects: -(1) top_index: (integer vector) - indices for the data-mining selected biomarkers -(2) top_colname: (character vector) - names for the data-mining selected biomarkers -(3) top_colname_significant_q: (character vector) - names for the data-mining selected biomarkers, which are significant after multiple testing stage -(4) DE: (numeric vector) - differential expression effect sizes for the biomarkers in \code{top_colname} -(5) p_value: (numeric vector) - p-values for the biomarkers in \code{top_colname} -(6) q_value: (numeric vector) - q-values for the biomarkers in \code{top_colname} -(7) significant_q: (integer vector) - indices of \code{top_colname} which is significant after multiple testing stage. -(8) mean_rank_top: (numeric vector) - average ranking across cross-validation folds for the biomarkers in \code{top_colname} +The output of `adaptest` is an object of class `adaptest`, containing the +following objects: +(1) top_index: (integer vector) - indices for the data-mining selected +biomarkers +(2) top_colname: (character vector) - names for the data-mining selected +biomarkers +(3) top_colname_significant_q: (character vector) - names for the data-mining +selected biomarkers, which are significant after multiple testing stage +(4) DE: (numeric vector) - differential expression effect sizes for the +biomarkers in \code{top_colname} +(5) p_value: (numeric vector) - p-values for the biomarkers in +\code{top_colname} +(6) q_value: (numeric vector) - q-values for the biomarkers in +\code{top_colname} +(7) significant_q: (integer vector) - indices of \code{top_colname} which is +significant after multiple testing stage. +(8) mean_rank_top: (numeric vector) - average ranking across cross-validation +folds for the biomarkers in \code{top_colname} (9) folds: (origami::folds class) - cross validation object -After invoking `adaptest`, the resultant `adaptest` object will have the slots described above completely filled in. The statistical results of this procedure can be extracted using `summary` method. +After invoking `adaptest`, the resultant `adaptest` object will have the slots +described above completely filled in. The statistical results of this procedure +can be extracted using `summary` method. --- ## Interpret + Visualize the Results -This package provides several interpretation methods that can be used to tabular and visualize the results of the data-adaptive tests. +This package provides several interpretation methods that can be used to tabular +and visualize the results of the data-adaptive tests. -The `get_composition` method for a `adaptest` object will produce a table of composition of each data-adaptive parameters that is significant after multiple testing stage: +The `get_composition` method for a `adaptest` object will produce a table of +composition of each data-adaptive parameters that is significant after multiple +testing stage: ```{r get_comp_small} get_composition(object = adaptestout, type = 'small') ``` -Setting the argument `type = "big"` will instead produce a table of composition of each data-adaptive parameters before multiple testing stage, so that there are more columns +Setting the argument `type = "big"` will instead produce a table of composition +of each data-adaptive parameters before multiple testing stage, so that there +are more columns ```{r get_comp_big} get_composition(object = adaptestout, type = 'big') ``` -The `plot` method for a `adaptest` object will produce two plots that help user interpret the results. The first plot is a plot of sorted average CV-rank for all the biomarkers in the original dataset (`Y`). The second plot is a plot of sorted q-values with labels corresponding to the indices of the data-adaptive parameter (as returned in `get_composition`) +The `plot` method for a `adaptest` object will produce two plots that help user +interpret the results. The first plot is a plot of sorted average CV-rank for +all the biomarkers in the original dataset (`Y`). The second plot is a plot of +sorted q-values with labels corresponding to the indices of the data-adaptive +parameter (as returned in `get_composition`) ```{r plot} plot(adaptestout)