# An R example: ashr benchmark

This is a more advanced application of DSC with R scripts. This tutorial demonstrates the following DSC2 features:

*  Inline code as parameters
*  `@ALIAS` decorator
*  R library installation and version check

## DSC Problem
The DSC problem is based on the ASH example of DSCR ([R Markdown version](https://github.com/stephens999/dscr/blob/master/vignettes/dsc_shrink.rmd) and [HTML version](dscr_dsc_shrink.html)). The material needed to run this tutorial can be found in the [DSC2 vignettes repo](https://github.com/stephenslab/dsc2/tree/master/vignettes/ash). The description below is copied from the DSCR vignette:

> To illustrate we consider the problem of shrinkage, which is tackled by the `ashr` package at [http://www.github.com/stephens999/ashr](http://www.github.com/stephens999/ashr). The input to this DSC is a set of estimates $\hat\beta$,  with associated standard errors $s$. These values are estimates of actual (true) values for $\beta$, so the meta-data in this case are the true values of beta. Methods must take $\hat\beta$ and $s$ as input, and provide as output "shrunk" estimates for $\beta$ (so output is a list with one element, called `beta_est`, which is a vector of estimates for beta). The score function then scores methods on their RMSE comparing `beta_est` with beta.

> First define a datamaker which simulates true values of $\beta$ from a user-specified normal mixture, where one of the components is a point mass at 0 of mass $\pi_0$, which is a user-specified parameter. It then simulates $\hat\beta \sim N(\beta_j,s_j)$ (where $s_j$ is again user-specified). It returns the true $\beta$ values and true $\pi_0$ value as meta-data, and the estimates $\hat\beta$ and $s$ as input-data.


> Now define a [method wrapper](https://github.com/stephenslab/dsc2/blob/master/vignettes/ash/bin/runash.R) for the `ash` function from the `ashr` package. Notice that this wrapper does not return output in the required format - it simply returns the entire ash output.

> Finally add a generic (can be used to deal with both $\pi$ and $\beta$) [score function](https://github.com/stephenslab/dsc2/blob/master/vignettes/ash/bin/score.R) to evaluate estimates by `ash`.
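
To make the description above concrete, here is a minimal base-R sketch of a datamaker and an RMSE score. It is not the tutorial's `datamaker.R` or `score.R`: the names `sim_data` and `rmse` and all argument names are hypothetical, and $\pi_0$ is passed in directly instead of being drawn between `min_pi0` and `max_pi0`.

```r
# Illustrative datamaker: NOT the tutorial's datamaker.R.
# `sim_data` and its argument names are hypothetical.
sim_data <- function(seed, nsamp, pi0, mix_prop, mix_mean, mix_sd, betahatsd) {
  set.seed(seed)
  # Decide which effects are exactly zero (point mass at 0 with mass pi0)
  is_null <- runif(nsamp) < pi0
  # Draw a mixture component for every sample, then a normal value from it
  k <- sample(seq_along(mix_prop), nsamp, replace = TRUE, prob = mix_prop)
  beta <- ifelse(is_null, 0, rnorm(nsamp, mean = mix_mean[k], sd = mix_sd[k]))
  # Noisy estimates of beta with known standard errors
  sebetahat <- rep(betahatsd, nsamp)
  betahat <- rnorm(nsamp, mean = beta, sd = sebetahat)
  list(meta = list(beta = beta, pi0 = pi0),
       input = list(betahat = betahat, sebetahat = sebetahat))
}

# Root mean squared error between an estimate and the truth
rmse <- function(est, truth) sqrt(mean((est - truth)^2))

data <- sim_data(seed = 1, nsamp = 1000, pi0 = 0.5,
                 mix_prop = c(2/3, 1/3), mix_mean = c(0, 0), mix_sd = c(1, 2),
                 betahatsd = 1)
# Score the unshrunk estimates as a baseline
rmse(data$input$betahat, data$meta$beta)
```

A shrinkage method is useful in this setting if its `beta_est` scores lower than this unshrunk baseline.
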

## DSC Specification
The problem is fully specified in DSC2 language below, following the structure of the original DSCR implementation:

```
# module alias and executables
simulate: datamaker.R
    # module input and variables
    seed: R(1:5)
    g: raw(ashr::normalmix(c(2/3,1/3),c(0,0),c(1,2))),
       raw(ashr::normalmix(rep(1/7,7),c(-1.5,-1,-0.5,0,0.5,1,1.5),rep(0.5,7))),
       raw(ashr::normalmix(c(1/4,1/4,1/3,1/6),c(-2,-1,0,1),c(2,1.5,1,1)))
    min_pi0: 0
    max_pi0: 1
    nsamp: 1000
    betahatsd: 1
    # module decoration
    @ALIAS: args = List()
    @CONF: queue = midway
    # module output
    $data: data
    $true_beta: raw(data$meta$beta)
    $true_pi0: raw(data$meta$pi0)

shrink: runash.R
    # module input and variables
    input: $data
    mixcompdist: normal, halfuniform
    # module output
    $ash_data: ash_data
    $beta_est: raw(ashr::get_pm(ash_data))
    $pi0_est: raw(ashr::get_pi0(ash_data))

score_beta: score.R
    # module input and variables
    est: $beta_est
    truth: $true_beta
    # module output aka pipeline variable
    $mse_beta: result

score_pi0: score.R
    # module input and variables
    est: $pi0_est
    truth: $true_pi0
    # module output
    $mse_pi: result

DSC:
    # module ensembles
    define:
      score: score_beta, score_pi0
    # pipelines
    run: simulate * shrink * score
    # runtime environments
    R_libs: ashr@stephens999/ashr (2.0.0+)
    exec_path: bin
    output: dsc_result
    # pipeline variables, will overwrite any module variables of the same name
    # it is also the place to configure the global random number generator
```

We suggest you check out the corresponding R code for modules [`simulate`](https://github.com/stephenslab/dsc2/blob/master/vignettes/ash/bin/datamaker.R), [`shrink`](https://github.com/stephenslab/dsc2/blob/master/vignettes/ash/bin/runash.R) and the [score function](https://github.com/stephenslab/dsc2/blob/master/vignettes/ash/bin/score.R) to see how DSC2 communicates with your R scripts.

Here we will walk through all modules to highlight important syntactical elements.

### Module `simulate`
#### Inline code as parameters and output values
The parameter `g` has three candidate values, each a piece of R code wrapped in `raw()`. Content inside `raw()` is interpreted as code rather than as a string. In other words, DSC2 interprets the first value as `g <- ashr::normalmix(c(2/3,1/3),c(0,0),c(1,2))`, so that **at runtime** `g` is assigned the output of the R code in `raw()` for use with `datamaker.R`. Without `raw()`, the line would be interpreted as a string assigned to `g`, which would clearly be problematic. Similarly, `$true_beta: raw(data$meta$beta)` extracts data at runtime and assigns it to the output variable.
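
The difference between a string and evaluated code can be seen in plain R. This is only an illustration of the distinction, not a description of how DSC2 handles `raw()` internally; the variable names are made up for the example.

```r
# The same text, treated two ways (illustration only, not DSC2 internals)
code_text <- "rep(1/7, 7)"

# Treated as a string: g_string is a length-1 character vector
g_string <- code_text

# Treated as code: the text is parsed and evaluated when the script runs
g_value <- eval(parse(text = code_text))

is.character(g_string)  # TRUE
is.numeric(g_value)     # TRUE
length(g_value)         # 7
```
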

#### Decorator `@ALIAS` for R list
Inside `datamaker.R` the core function takes a single parameter: an R [list](http://www.r-tutor.com/r-introduction/list) containing all parameters specified in this module. The decorator `@ALIAS` uses a special DSC2 operation `List()` to consolidate these parameters into an R list `args`, which corresponds to the input parameter in `datamaker.R`.
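
Conceptually, the consolidation amounts to something like the following. This is a sketch under assumptions: `describe_args` is a hypothetical stand-in for the core function, and the code DSC2 actually generates may look different.

```r
# Hypothetical core function that, like the one in datamaker.R,
# takes one list holding every parameter
describe_args <- function(args) {
  sprintf("nsamp=%g, betahatsd=%g, pi0 in [%g, %g]",
          args$nsamp, args$betahatsd, args$min_pi0, args$max_pi0)
}

# Module parameters as individual variables
min_pi0 <- 0
max_pi0 <- 1
nsamp <- 1000
betahatsd <- 1

# What `args = List()` amounts to conceptually: gather them into one list
args <- list(min_pi0 = min_pi0, max_pi0 = max_pi0,
             nsamp = nsamp, betahatsd = betahatsd)
describe_args(args)
```
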

### Module `shrink` 
Notice that here the output variables are also computed at runtime via `raw()` pieces of R code, in this case the `get_pm` and `get_pi0` functions from the `ashr` package.
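
For readers unfamiliar with `ashr`, the two extractions look roughly like this on toy data. The sketch assumes `ashr` is installed (it is skipped otherwise) and uses made-up inputs; it is not the content of `runash.R`.

```r
# Requires the ashr package; skipped if it is not installed
if (requireNamespace("ashr", quietly = TRUE)) {
  set.seed(1)
  betahat <- rnorm(100)       # toy estimates
  sebetahat <- rep(1, 100)    # toy standard errors
  ash_data <- ashr::ash(betahat, sebetahat, mixcompdist = "normal")
  beta_est <- ashr::get_pm(ash_data)   # posterior means, one per estimate
  pi0_est <- ashr::get_pi0(ash_data)   # estimated proportion of null effects
  str(list(beta_est = head(beta_est), pi0_est = pi0_est))
}
```
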

### Module `score_beta` & `score_pi0`
These modules use the same computational routine `score.R` but on different input data. Because the variable names differ, it is best to configure them in separate blocks. However, an alternative style that configures them in the same block is:

```
score_beta, score_pi0: score.R
    @score_beta:
        est: $beta_est
        truth: $true_beta
        $mse_beta: result
    @score_pi0:
        est: $pi0_est
        truth: $true_pi0
        $mse_pi: result        
```

Here `@*` are module-specific variable decorations that configure input and output so that different modules can share the same block.

Notice that, unlike the [DSCR ASH example](https://github.com/stephens999/dscr/blob/master/vignettes/dsc_shrink.rmd), the output score here is an "atomic" value (a single floating-point RMSE). If the outcome object `result` is not such a simple object, for example if it is an R list, you may want to use the `raw` operator to keep only the information you need so that it is readily extractable in the benchmark query process. For a list named `score_output`, for instance, you would write `$mse_pi: raw(score_output$mse)`.
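
The following sketch shows why extracting a single element helps. `score_list` is a hypothetical score function invented for this example; the tutorial's `score.R` returns an atomic value already.

```r
# Hypothetical score function that returns a list rather than a single number
score_list <- function(est, truth) {
  err <- est - truth
  list(mse = mean(err^2), rmse = sqrt(mean(err^2)), n = length(err))
}

score_output <- score_list(est = c(0.1, 0.4), truth = c(0, 0.5))

# Keeping only one element gives an atomic value that is easy to tabulate
score_output$mse
```
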

### `DSC` section
As discussed in previous tutorials, `DSC::define` groups modules into ensembles and `DSC::run` specifies the pipelines to execute. Here there are essentially two pipelines (one ending with `score_beta`, the other with `score_pi0`). The `R_libs` entry specifies the R package required by the DSC. It is formatted as a GitHub package (`repo/pkg`) with a minimum version requirement of `2.0.0`. DSC first checks whether the package is available and installs it if necessary. It then checks the package version and quits with an error if the requirement is not satisfied. **DSC does not attempt to upgrade/downgrade a package in cases of version mismatch.**
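
The availability and version checks described above can be mimicked in plain R. This is an illustrative sketch only: `check_pkg` is a hypothetical helper and not part of DSC, and it reports the situation instead of installing anything.

```r
# Illustrative check only; `check_pkg` is a hypothetical helper, not part of DSC
check_pkg <- function(pkg, min_version) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return("missing")    # this is where an installation would be attempted
  }
  if (utils::packageVersion(pkg) < min_version) {
    return("too old")    # DSC quits with an error here instead of upgrading
  }
  "ok"
}

check_pkg("stats", "2.0.0")     # "ok" on any recent R
check_pkg("stats", "999.0.0")   # "too old"
```
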

## Execution logic
This diagram (generated by the `dot` command from the execution graph of this DSC) shows the logic of this benchmark:

![ash.png](../img/ash.png)
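
The same logic can be enumerated in R. The sketch assumes that the parameter values listed in the specification are fully crossed and indexes the three `g` priors as 1 to 3 for brevity; it does not reproduce DSC's own bookkeeping.

```r
# Module sequences implied by `simulate * shrink * score`
# when `score` stands for score_beta and score_pi0
pipelines <- expand.grid(first = "simulate", second = "shrink",
                         third = c("score_beta", "score_pi0"),
                         stringsAsFactors = FALSE)
apply(pipelines, 1, paste, collapse = " -> ")

# Parameter combinations feeding `shrink`: 5 seeds x 3 priors x 2 mixcompdist
combos <- expand.grid(seed = 1:5, g = 1:3,
                      mixcompdist = c("normal", "halfuniform"),
                      stringsAsFactors = FALSE)
nrow(combos)  # 30
```
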

## Run DSC

In [2]:
! dsc settings.dsc -c 30

INFO: DSC script exported to dsc_result.html
INFO: Constructing DSC from settings.dsc ...
INFO: Building execution graph ...
INFO: DSC in progress ...
DSC: 100%|████████████████████████████████████████| 5/5 [00:30<00:00,  4.26s/it]
INFO: Building DSC database ...
INFO: DSC complete!
INFO: Elapsed time 38.572 seconds.


## Result extraction

***FIXME: Update with dscrutils extraction interface***

### Obtain final score for methods comparison


We can examine the result in `R`, similar to what we have done in the [Quick Start example](Explore_Output.html):

In [4]:
%use ir
options(warn=-1)
dat = readRDS('ashr_pi0_1.rds')
case1 = unlist(dat$case1_pi0_score_result)
case2 = unlist(dat$case2_pi0_score_result)

In [5]:
suppressMessages(library(plotly))
p = plot_ly(y = case1, name = 'case 1', type = "box") %>%
  add_trace(y = case2, name = 'case 2', type = "box")  %>% 
  layout(title = "MSE for pi_0 estimate")
htmlwidgets::saveWidget(as.widget(p), "pi0_score.html")
# IRdisplay::display_html(paste(readLines("pi0_score.html"), collapse="\n"))

You can view the output [here](pi0_score.html).

### Obtain intermediate output
You can also extract quantities of interest from any step in a DSC sequence. For example, suppose we want to compare the MSE of the posterior mean estimates and, at the same time, explore the distribution of the posterior means. We first extract both quantities:

In [6]:
! dsc -e beta_score:result shrink:beta_est --target beta_score -o ashr_beta_1.rds \
    --tags "case1 = An && ash_n" "case2 = An && ash_hu" -b dsc_result

Extracting: 100%|██████████| 5/5 [00:01<00:00,  4.86it/s]
INFO: Data extracted to ashr_beta_1.rds for DSC result beta_score via annotations: 
	case1 = An && ash_n
	case2 = An && ash_hu
INFO: Elapsed time 2.013 seconds.


Then we plot them both:

In [7]:
%use ir
dat = readRDS('ashr_beta_1.rds')
case1 = unlist(dat$case1_beta_score_result)
case2 = unlist(dat$case2_beta_score_result)
case1_beta = rowMeans(data.frame(dat$case1_shrink_beta_est))
case2_beta = rowMeans(data.frame(dat$case2_shrink_beta_est))
#
suppressMessages(library(plotly))
p = plot_ly(y = case1, name = 'case 1', type = "box") %>%
  add_trace(y = case2, name = 'case 2', type = "box")  %>% 
  layout(title = "MSE for beta estimate")
htmlwidgets::saveWidget(as.widget(p), "beta_score.html")
#IRdisplay::display_html(paste(readLines("beta_score.html"), collapse="\n"))

You can view the output [here](beta_score.html).

In [8]:
p = plot_ly(x = case1_beta, name = 'case 1', opacity = 0.9, type = "histogram") %>%
  add_trace(x = case2_beta, name = 'case 2', opacity = 0.9, type = "histogram") %>%
  layout(title = "Posterior mean distribution")
htmlwidgets::saveWidget(as.widget(p), "beta.html")
#IRdisplay::display_html(paste(readLines("beta.html"), collapse="\n"))

You can view the output [here](beta.html).

### Benchmarking runtime
You can also benchmark the time it takes to run a computational step. For example:

In [9]:
case1 = unlist(dat$DSC_TIMER$case1_shrink_beta_est)
case2 = unlist(dat$DSC_TIMER$case2_shrink_beta_est)
#
suppressMessages(library(plotly))
p = plot_ly(y = case1, name = 'case 1', type = "box") %>%
  add_trace(y = case2, name = 'case 2', type = "box")  %>% 
  layout(title = "Time elapsed for posterior mean estimation")
htmlwidgets::saveWidget(as.widget(p), "beta_time.html")
#IRdisplay::display_html(paste(readLines("beta_time.html"), collapse="\n"))

You can view the output [here](beta_time.html).