wrap vignette
wilsoncai1992 committed Feb 8, 2018
1 parent c7f6f1a commit e5733fd
Showing 1 changed file with 58 additions and 28 deletions.
86 changes: 58 additions & 28 deletions vignettes/differentialExpression.Rmd
@@ -12,10 +12,15 @@ vignette: >

## Introduction

The `adaptest` R package can be used to perform the data-mining and
high-dimensional statistical tests that are common in differential expression
studies. The package uses a two-stage procedure:

1. Data-mining stage: reduce the dimension of the biomarkers based on their
   associations with an exposure variable.
2. Multiple testing stage: adjust for multiple testing to control false
   positives.

In this vignette, we illustrate how to use `adaptest` to perform such an
analysis, using a data set containing microarray expression measures.
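
The two stages can be sketched in base R on simulated data. This is only an
illustration of the idea, not the package's method: it swaps in a plain mean
difference, a single sample split, and Welch t-tests where `adaptest` uses
TMLE and cross-validation. Every name below (`n_obs`, `n_bio`, `n_top`,
`mean_diff`, and so on) is an assumption made for the sketch.

```r
# Two-stage sketch on simulated data (illustrative names and values only).
set.seed(1234)
n_obs <- 100   # number of samples
n_bio <- 200   # number of candidate biomarkers
n_top <- 10    # biomarkers kept by the data-mining stage

A <- rbinom(n_obs, size = 1, prob = 0.5)          # binary exposure
Y <- matrix(rnorm(n_obs * n_bio), nrow = n_obs)   # biomarker matrix
Y[, 1:5] <- Y[, 1:5] + 1.5 * A                    # five truly affected biomarkers

# Split the samples so that selection and testing use different data.
train <- sample(rep(c(TRUE, FALSE), length.out = n_obs))

# Stage 1 (data mining): rank biomarkers by the absolute mean difference
# between exposure groups on the training half and keep the n_top largest.
mean_diff <- function(y, a) mean(y[a == 1]) - mean(y[a == 0])
score <- apply(Y[train, ], 2, mean_diff, a = A[train])
top_index <- order(abs(score), decreasing = TRUE)[seq_len(n_top)]

# Stage 2 (multiple testing): test only the selected biomarkers on the
# held-out half, then adjust the p-values (Benjamini-Hochberg here).
p_value <- vapply(top_index, function(j) {
  y <- Y[!train, j]
  a <- A[!train]
  t.test(y[a == 1], y[a == 0])$p.value
}, numeric(1))
q_value <- p.adjust(p_value, method = "BH")

data.frame(top_index, p_value, q_value)
```

Because only `n_top` hypotheses reach the second stage, the multiplicity
adjustment is applied to 10 tests here rather than 200.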

---

@@ -24,6 +29,7 @@ In this vignette, we illustrate how to use `adaptest` to perform such analysis,
First, we load the `adaptest` package and the (included) `simpleArray` data set:

```{r setup_data}
set.seed(1234)
library(adaptest)
data(simpleArray)
"%ni%" = Negate("%in%")
@@ -32,16 +38,20 @@ data(simpleArray)
In order to perform Targeted Minimum Loss-Based Estimation, we need three
separate data structures: (1) _W_, baseline covariates that could potentially
confound the association of biomarkers with the exposure of interest; (2) _A_,
the point exposure of interest; and (3) _Y_, the biomarkers of interest. All
values in _A_ ought to be binarized, in order to avoid practical violations of
the positivity assumption. To invoke the data-adaptive testing function
(`adaptest`), we also need to specify the number of top biomarkers, `n_top`,
for the data-mining algorithm to select, and the number of folds, `n_fold`, for
cross-validation. The smaller `n_top` is, the more selective the data-mining
algorithm. The larger `n_fold` is, the more folds are used in cross-validation.

The TMLE-based biomarker discovery process can be invoked using the `adaptest`
function. The procedure is quite resource-intensive because it evaluates the
association of each individual potential biomarker (of which there are 1e3 in
the included data set) with an exposure of interest, while accounting for
potential confounding based on all other covariates included in the design
matrix. We demonstrate the necessary syntax for calling `adaptest` below:

```{r adaptest_eval, eval=TRUE}
adaptestout <- adaptest(Y = Y,
@@ -56,38 +66,58 @@ adaptestout <- adaptest(Y = Y,
data(adaptestout)
```

The output of `adaptest` is an object of class `adaptest`, containing the
following objects:

1. `top_index` (integer vector): indices of the biomarkers selected by data
   mining.
2. `top_colname` (character vector): names of the biomarkers selected by data
   mining.
3. `top_colname_significant_q` (character vector): names of the selected
   biomarkers that remain significant after the multiple testing stage.
4. `DE` (numeric vector): differential expression effect sizes for the
   biomarkers in `top_colname`.
5. `p_value` (numeric vector): p-values for the biomarkers in `top_colname`.
6. `q_value` (numeric vector): q-values for the biomarkers in `top_colname`.
7. `significant_q` (integer vector): indices of the entries of `top_colname`
   that are significant after the multiple testing stage.
8. `mean_rank_top` (numeric vector): average ranking across cross-validation
   folds for the biomarkers in `top_colname`.
9. `folds` (`origami::folds` class): cross-validation object.

After invoking `adaptest`, the resultant `adaptest` object will have the slots
described above completely filled in. The statistical results of this procedure
can be extracted using the `summary` method.
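
To see how the p-value, q-value, and significance fields relate to one
another, here is a small hand-built stand-in for part of such an object. The
biomarker names and p-values are made up. The Benjamini-Hochberg adjustment and
the 0.05 cutoff are assumptions for the sketch; the text above does not say
which adjustment or cutoff the package uses.

```r
# Hypothetical stand-in for part of an `adaptest` result; values are made up.
toy <- list(
  top_colname = c("b12", "b7", "b30", "b3"),
  p_value     = c(0.001, 0.020, 0.300, 0.004)
)

# Assumption: q-values come from a Benjamini-Hochberg adjustment and a
# 0.05 cutoff defines significance.
toy$q_value <- p.adjust(toy$p_value, method = "BH")

# Indices into top_colname that pass the cutoff, then the matching names.
toy$significant_q <- which(toy$q_value < 0.05)
toy$top_colname_significant_q <- toy$top_colname[toy$significant_q]

toy$top_colname_significant_q
```

The point of the sketch is the indexing: `significant_q` holds positions within
`top_colname`, not positions within the original biomarker matrix.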

---

## Interpret + Visualize the Results

This package provides several interpretation methods that can be used to
tabulate and visualize the results of the data-adaptive tests.

The `get_composition` method for an `adaptest` object produces a table showing
the composition of each data-adaptive parameter that is significant after the
multiple testing stage:

```{r get_comp_small}
get_composition(object = adaptestout, type = 'small')
```

Setting the argument `type = "big"` instead produces a table showing the
composition of each data-adaptive parameter before the multiple testing stage,
so the table has more columns:

```{r get_comp_big}
get_composition(object = adaptestout, type = 'big')
```

The `plot` method for an `adaptest` object produces two plots that help the
user interpret the results. The first is a plot of the sorted average CV-rank
for all the biomarkers in the original data set (`Y`). The second is a plot of
the sorted q-values, with labels corresponding to the indices of the
data-adaptive parameters (as returned by `get_composition`):
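
As a rough picture of what an average CV-rank is, the base-R sketch below ranks
simulated biomarkers within each fold's training data and averages the ranks
over folds. The ranking statistic (an absolute mean difference) and all variable
names are assumptions for illustration; the package's own ranking may differ.

```r
# Sketch: average each biomarker's rank over cross-validation folds.
set.seed(1234)
n_obs  <- 90
n_bio  <- 50
n_fold <- 3

A <- rbinom(n_obs, size = 1, prob = 0.5)          # binary exposure
Y <- matrix(rnorm(n_obs * n_bio), nrow = n_obs)   # biomarker matrix
Y[, 1:3] <- Y[, 1:3] + 2 * A                      # three affected biomarkers

# Assign every sample to one of n_fold folds of equal size.
fold_id <- sample(rep(seq_len(n_fold), length.out = n_obs))

# One column of ranks per fold, computed on that fold's training samples.
rank_in_fold <- sapply(seq_len(n_fold), function(k) {
  keep <- fold_id != k
  y <- Y[keep, , drop = FALSE]
  a <- A[keep]
  effect <- colMeans(y[a == 1, , drop = FALSE]) -
    colMeans(y[a == 0, , drop = FALSE])
  rank(-abs(effect))   # rank 1 = strongest association in this fold
})

mean_rank <- rowMeans(rank_in_fold)   # one average rank per biomarker

plot(sort(mean_rank), xlab = "Biomarker (sorted)", ylab = "Average CV rank")
```

Biomarkers that rank near the top in every fold end up with a low average
rank, which is what the left end of such a sorted plot shows.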

```{r plot}
plot(adaptestout)
