# Design and Features

## Design Overview
DSC2 is designed to execute a sequence of computational steps that generate (or gather) data, perform statistical analyses, and evaluate the performance of the statistical procedures involved. Each step takes some input and creates some output. A DSC2 user thus has two jobs: define the steps, and define the sequence of steps.

A typical DSC sequence consists of 3 logical types of steps, **scenario**, **method** and **score**, where

*  **scenario**: provides input data and/or computational routines that generate input data.
*  **method**: defines statistical procedures that analyze data.
*  **score**: defines methods that evaluate the results of data analyses against the "truth" (the model from which the data is generated) and calculate scores that measure performance.


> Peter: I was confused here because this is only an illustration of DSC2; not all use cases of DSC2 need to 
> follow this approach. I think you need to reframe this as an illustrative example, and emphasize that DSC2
> is very generic.
>
> Also, maybe use the plural for all three (scenarios, methods and scores)?

_FIXME: At this point I'm not sure if we want to emphasize this typical logic, or emphasize it this heavily. This is essentially DSCR logic. Though DSC2 is more generic I do not want to make people think / use it as a tool to run pipelines. In the domain of methods comparisons any DSC sequence should follow along the line of these 3 broad types of steps_


DSC allows flexible combinations of computational steps, as illustrated in the pipeline below:

```
  | scenario 1               |    | method 1 ->                      |    | score 1 |
  | scenario 2               | -> | method 2 -> method 1             | -> | score 2 |
  | scenario 1 -> scenario 3 |    | method 3 -> method 2 -> method 1 |    | ...     |
  | scenario 4 -> ...        |    | method 4 -> ...                  |    | ...     |
```

where each **scenario**, **method** and **score** is a computational step. Steps of the same type can be different approaches to generating data, performing statistical analysis, or measuring the performance of methods, or they can be the same approach with different parameter settings. It is good to keep the 3 logical types in mind when developing a DSC, though they are not necessarily keywords in DSC syntax, as the case study below demonstrates. Naturally, you do not have to fully develop each of these logical steps, particularly in the exploratory phase of research.



> Peter: What do the arrows mean here?

_FIXME: replace this text illustration with a figure generated via inkscape_


## A Case Study

To understand the DSC design we review a simple example with (mostly) self-explanatory syntax. DSC syntax is completely documented [elsewhere](../../documentation.html#Syntax); readers should not worry about learning the syntax at this point.

> Peter: Can you briefly describe (in a few sentences) what is the aim of this example? It looks like 
> this example is comparing two methods (winsor1 and winsor2) and evaluating these methods
> by computing the MSE between the estimate and the ground-truth value. Is this correct?
> I realize that you describe this example in more detail below, but it is useful to have
> a brief description upfront.

The example reads:

```yaml
  simulate:
      exec: rt.R, rcauchy.R
      seed: R(1:5)
      params:
          n: 1000
          true_loc: 0, 1
      return: x, true_loc

  transform:
      exec: winsor1.R, winsor2.R
      params:
          x: $x
          exec[1]:
              fraction: 0.05
          exec[2]:
              multiple: 3
      return: x

  estimate:
      exec: mean.R, median.R
      params:
          x: $x
      return: loc

  mse:
      exec: MSE.R
      params:
          mean_est: $loc
          true_mean: $true_loc
      return: mse

  DSC:
      run: simulate *
           (transform * estimate, estimate) *
           mse
      exec_path: R
      output: dsc_result
```

Each section in the DSC file, except for the last "DSC" section, is called a *DSC block*. Each DSC block defines **a family of** computational routines (or steps, defined by `exec` entries), the input to these routines (`params`) and their output (`return`). The routines in the same block share similar (not necessarily identical) input parameters and strictly the same return values. For example, the "transform" block defines the family `winsor1.R` and `winsor2.R`: both take `x` and return `x`, but the first `exec` entry also takes `fraction` while the second takes `multiple`. In this example, computational steps from the same block are logically concurrent with each other, because the routines for all `exec` entries are independent. The "DSC" section defines the sequence of steps to execute, via logical combinations of the available DSC blocks.


> Peter: I think I mostly understand what the paragraph above is saying. But maybe you can give an example?
> e.g., in the "transform" block, are the family of routines `winsor1.R` and `winsor2.R`?
>
> Are you required to define 'exec', 'params' and 'return' for all blocks? Are there many other 
> parameters that aren't mentioned here? 


In this example, the first part of the DSC sequence calls the "simulate" block, which has two computational routines, `rcauchy.R` and `rt.R`. These routines generate `n` random samples from a Cauchy distribution with location parameter `true_loc`, and from a *t* distribution with non-centrality parameter `true_loc`, respectively. There are 2 choices of computational routine, and each routine has one choice of `n` (1000) and two choices of `true_loc` (0 and 1). Five replicates are run, as defined by `seed`. As a result, the "simulate" block has 2 × 1 × 2 × 5 = 20 parallel computational steps.
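
The count of 20 can be checked by enumerating the combinations directly. The following is a minimal sketch in Python, not DSC2 code: the dictionary keys are illustrative names chosen here, and it assumes every routine is crossed with every seed and parameter value, as described above.

```python
from itertools import product

# Values copied from the "simulate" block of the example DSC file.
routines = ["rt.R", "rcauchy.R"]
seeds = range(1, 6)            # seed: R(1:5)
n_values = [1000]
true_loc_values = [0, 1]

# One computational step per combination of routine, seed and parameter values.
steps = [
    {"exec": e, "seed": s, "n": n, "true_loc": loc}
    for e, s, n, loc in product(routines, seeds, n_values, true_loc_values)
]

print(len(steps))  # 2 routines * 5 seeds * 1 n * 2 true_loc = 20
```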

The second part of the sequence consists of the "transform" and "estimate" blocks. The "transform" block performs two types of winsorization on the input data. The "estimate" block defines two computational routines that estimate the location parameter, via the sample mean (`mean.R`) and the sample median (`median.R`) respectively. In this part, there are two types of procedures: the first is `transform * estimate`, which runs the "transform" family of steps first and then runs the "estimate" family on the transformed data; the second is `estimate`, which performs parameter estimation directly on the original data produced by its upstream computational steps.
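
One way to picture how the `run` entry yields these two procedures is to expand it into flat sequences of blocks. The sketch below is an illustration only and makes no claim about how DSC2 parses the entry: the `expand` helper and the tuple/list encoding (tuple for `*`, list for `,`) are assumptions made here.

```python
from itertools import product

def expand(expr):
    """Expand a nested run expression into flat block sequences.

    A str is a single block, a tuple means "run in order" (the `*` operator),
    and a list means "alternatives" (the `,` operator).
    """
    if isinstance(expr, str):
        return [[expr]]
    if isinstance(expr, list):
        return [seq for alt in expr for seq in expand(alt)]
    # Tuple: combine every expansion of each part, keeping the order.
    parts = [expand(part) for part in expr]
    return [sum(combo, []) for combo in product(*parts)]

# simulate * (transform * estimate, estimate) * mse
run = ("simulate", [("transform", "estimate"), "estimate"], "mse")
sequences = expand(run)
for seq in sequences:
    print(" * ".join(seq))
```

Under this encoding the expression yields one sequence that passes through "transform" and one that skips it.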

The last part of the sequence calls the "mse" block, which has a single computational routine, `MSE.R`, that calculates the mean squared error between the *true* (`true_mean`) and *estimated* (`mean_est`) location parameters. These two inputs take the values of `$true_loc` and `$loc` respectively, which are `return` values from previous computational steps.
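
The contents of `MSE.R` are not shown here, so the following Python function is only a sketch of what such a score could compute. It reuses the parameter names `mean_est` and `true_mean` from the "mse" block and assumes the routine may receive either a single estimate or a sequence of estimates.

```python
def mse(mean_est, true_mean):
    """Mean squared error between estimate(s) and the true location."""
    # Accept a single estimate or a sequence of estimates.
    estimates = mean_est if isinstance(mean_est, (list, tuple)) else [mean_est]
    return sum((est - true_mean) ** 2 for est in estimates) / len(estimates)

print(mse(1.5, 1))         # single estimate: (1.5 - 1)^2 = 0.25
print(mse([0.0, 2.0], 1))  # two estimates, each off by 1: 1.0
```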

In this case study, the design of blocks logically follows the **scenario**, **method** (pre-processing and analysis), and **score** paradigm: "simulate" is the scenario, "transform" and "estimate" together make up the method, and "mse" is the score, for a total of 4 blocks. It does not have to be this way, though: one can even make a separate block for each computational routine (in this case, for every R script) and combine them in the *run* sequence in the *DSC* section. The style adopted in this example appears to be the most reasonable choice, because computational routines that share input and output are consolidated into single blocks, and the logic between blocks, which reflects the DSC design, is clearly presented.

> Peter: I'm confused---aren't there 4 blocks in this example? (simulate, transform, estimate, mse)

A DSC2 script follows an implicit scoping rule for variables that start with the `$` sigil. Within a DSC block, all parameter variables should be considered *local* to the block and are not accessible from outside it; all returned values should be considered *global* variables, accessible from other DSC blocks via the `$` sigil. It is legitimate, and sometimes necessary, for multiple blocks to have the same variable name in their `return` entry; here, for example, both the "simulate" and "transform" blocks return `x`. In such cases, which block provides the value of `$x` depends on the context of the DSC sequence: DSC2 always searches for the nearest upstream block that yields the variable it is looking for. For example, the DSC sequence `simulate * (transform * estimate, estimate) * mse` can be expanded to `simulate * transform * estimate * mse` and `simulate * estimate * mse`. In the first sequence, `$x` in "estimate" comes from its direct upstream neighbor "transform", not from "simulate", while in the second sequence `$x` comes from "simulate".
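
The nearest-upstream rule can be sketched in a few lines of Python. This is an illustration of the rule as stated above, not DSC2's implementation; the `RETURNS` table and the `resolve` helper are hypothetical names introduced for this example.

```python
# Return variables of each block, copied from the example DSC file.
RETURNS = {
    "simulate": ["x", "true_loc"],
    "transform": ["x"],
    "estimate": ["loc"],
    "mse": ["mse"],
}

def resolve(sequence, block, variable):
    """Name the nearest block upstream of `block` that returns `variable`."""
    position = sequence.index(block)
    for upstream in reversed(sequence[:position]):
        if variable in RETURNS[upstream]:
            return upstream
    raise LookupError(f"no upstream block returns ${variable}")

with_transform = ["simulate", "transform", "estimate", "mse"]
without_transform = ["simulate", "estimate", "mse"]

print(resolve(with_transform, "estimate", "x"))     # transform
print(resolve(without_transform, "estimate", "x"))  # simulate
print(resolve(with_transform, "mse", "true_loc"))   # simulate
```

The last call shows that a variable can also be picked up from a block further upstream when no nearer block returns it.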

> Peter: The "transform" block takes x as input and returns x. Could DSC2 run transform multiple times?

## Executable Modes
DSC2 considers a computational routine a *plugin* when its `exec` is an R or Python script and its return value is not [`File()`](DSC_Configuration.html#File()). Computational results and communication between blocks are handled implicitly (via RDS files). In plugin mode, users only have to make sure that the variables used in plugin scripts are defined in the DSC script rather than in the plugin scripts, and that the return value of the DSC block is one or more variables in the plugin script. All the computational routines in the example above are plugins written in R.
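
To make the plugin convention concrete, here is a toy emulation in Python. It is not how DSC2 executes plugins (the RDS-based exchange is not modeled), and `PLUGIN_SOURCE` and `run_plugin` are hypothetical names used only for illustration. The point is that the plugin body uses `x` without defining it, and the declared return variable `loc` is collected after the body runs.

```python
import statistics

# Hypothetical plugin body: `x` is never assigned here because, in plugin
# mode, inputs are expected to come from `params` in the DSC script.
PLUGIN_SOURCE = "loc = statistics.median(x)"

def run_plugin(source, params, returns):
    """Toy stand-in for plugin execution (not DSC2's real mechanism)."""
    # Inputs are placed in the namespace before the plugin body runs.
    namespace = {"statistics": statistics, **params}
    exec(source, namespace)
    # Only the variables named in `returns` are passed downstream.
    return {name: namespace[name] for name in returns}

result = run_plugin(PLUGIN_SOURCE, params={"x": [3.0, 1.0, 2.0]}, returns=["loc"])
print(result)  # {'loc': 2.0}
```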
