# Design and Features

## Design Overview
DSC2 is designed to execute a sequence of computational steps for *Dynamic Statistical Comparisons*. Typically such comparisons involve generating (or gathering) data, performing statistical analyses, and evaluating the performance of the statistical procedures applied. Each step takes input data from the previous step (unless it is the first step) and creates output that is either passed to the next step or interpreted as the outcome of the DSC. A DSC user thus has two jobs: define computational steps, and specify how the steps are connected into a sequence.

As a first pass at illustrating the DSC design, let's consider a DSC sequence that consists of three types of steps: **scenarios**, **methods** and **scores**, where

*  **scenarios**: provide input data and / or computational routines that generate input data.
*  **methods**: define statistical procedures that analyze data.
*  **scores**: define procedures that evaluate the results of data analyses against the "truth" (the model from which the data are generated) and calculate scores that measure the performance of methods.

In practice DSC is more generic and flexible -- one does not have to follow this paradigm when composing a DSC, as long as steps and sequences are properly defined. You may often have an incomplete DSC in the exploratory phase of a project as you develop methods.

The example below follows the three-step paradigm described above, but allows combinations of computational steps within each part, as illustrated by the pipelines below:

```
  | scenario 1               |    | method 1 ->                      |    | score 1 |
  | scenario 2               | -> | method 2 -> method 1             | -> | score 2 |
  | scenario 1 -> scenario 3 |    | method 3 -> method 2 -> method 1 |    | ...     |
  | scenario 4 -> ...        |    | method 4 -> ...                  |    | ...     |
```

where each **scenario**, **method** and **score** is a computational step. Steps of the same type can be different approaches to generating data, performing statistical analysis, or measuring the performance of methods, or they can be the same approach with different parameter settings. The arrows connect the computational steps into sequences.

## A Case Study

To understand the DSC design we review a simple example with (mostly) self-explanatory syntax. DSC syntax is completely documented [elsewhere](DSC_Configuration.html); readers should not worry about learning the syntax at this point.

In this example we compare methods of estimating a location parameter from data. We generate data from a *t* distribution and a Cauchy distribution, optionally remove outliers (with two Winsorization methods), estimate the location (via the sample mean or the median), and evaluate the performance of each sequence of steps by comparing the estimate with the "ground truth" in terms of mean squared error.

### DSC configuration

```yaml
  simulate:
      exec: rt.R, rcauchy.R
      seed: R(1:5)
      params:
          n: 1000
          true_loc: 0, 1
      return: x, true_loc

  transform:
      exec: winsor1.R, winsor2.R
      params:
          x: $x
          exec[1]:
              fraction: 0.05
          exec[2]:
              multiple: 3
      return: x

  estimate:
      exec: mean.R, median.R
      params:
          x: $x
      return: loc

  mse:
      exec: MSE.R
      params:
          mean_est: $loc
          true_mean: $true_loc
      return: mse

  DSC:
      run: simulate *
           (transform * estimate, estimate) *
           mse
      exec_path: R
      output: dsc_result
```

Each section in the DSC configuration file, except for the last "DSC" section, is called a *DSC block*. Each DSC block defines

1. **a family of** computational routines (or steps, usually names of computer scripts or executables in the `exec` entries). For example, `winsor1.R` and `winsor2.R` are routines in the `transform` family.
2. input data to these routines (`params`), and
3. output data (`return`).

The routines in the same block share similar (not necessarily the same) input parameters, and strictly the same return values. In this example, computational steps from the same block are logically concurrent; consequently, by default, the routines in an `exec` entry are independent of each other. The "DSC" section defines the sequence of steps to execute, via logical combinations of the available DSC blocks.


### DSC execution
The DSC sequence definition

```
      run: simulate *
           (transform * estimate, estimate) *
           mse
```
will be expanded to 2 sequences in terms of blocks:

1. `simulate -> transform -> estimate -> mse`
2. `simulate -> estimate -> mse`
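
The expansion above can be reproduced with a small Python sketch. It is not DSC2's parser: the helper names `alternatives` and `chain` are hypothetical, and the sketch simply assumes that `,` offers a choice between sub-sequences and `*` joins one choice from each group in order.

```python
from itertools import product

def alternatives(*options):
    """Model `,`: a choice between sub-sequences (each a list of block names)."""
    return [list(option) for option in options]

def chain(*groups):
    """Model `*`: pick one option from every group and join them in order."""
    return [sum(choice, []) for choice in product(*groups)]

# run: simulate * (transform * estimate, estimate) * mse
run = chain(
    alternatives(["simulate"]),
    alternatives(["transform", "estimate"], ["estimate"]),
    alternatives(["mse"]),
)

for sequence in run:
    print(" -> ".join(sequence))
```

Here the inner `transform * estimate` is written out by hand as the list `["transform", "estimate"]`; a real parser would build it recursively.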

or 12 pipelines in terms of computational steps:

1. `rt.R -> winsor1.R -> mean.R -> mse`
2. `rt.R -> winsor1.R -> median.R -> mse`
3. `rt.R -> winsor2.R -> mean.R -> mse`
4. `rt.R -> winsor2.R -> median.R -> mse`
5. `rcauchy.R -> winsor1.R -> mean.R -> mse`
6. `rcauchy.R -> winsor1.R -> median.R -> mse`
7. `rcauchy.R -> winsor2.R -> mean.R -> mse`
8. `rcauchy.R -> winsor2.R -> median.R -> mse`
9. `rt.R -> mean.R -> mse`
10. `rt.R -> median.R -> mse`
11. `rcauchy.R -> mean.R -> mse`
12. `rcauchy.R -> median.R -> mse`
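
The count of 12 follows from taking, for each block sequence, the Cartesian product of the routines in its blocks: 2 × 2 × 2 × 1 = 8 for the first sequence and 2 × 2 × 1 = 4 for the second. The following Python sketch illustrates this under that assumption; the names `EXEC`, `SEQUENCES` and `expand` are made up for the illustration and are not part of DSC2. The last step is written as `MSE.R`, the only routine in the `mse` block.

```python
from itertools import product

# `exec` entries copied from the configuration above
EXEC = {
    "simulate": ["rt.R", "rcauchy.R"],
    "transform": ["winsor1.R", "winsor2.R"],
    "estimate": ["mean.R", "median.R"],
    "mse": ["MSE.R"],
}

# The two block sequences that the `run` entry expands to
SEQUENCES = [
    ["simulate", "transform", "estimate", "mse"],
    ["simulate", "estimate", "mse"],
]

def expand(sequence):
    """All ways of choosing one routine from each block in the sequence."""
    return [list(choice) for choice in product(*(EXEC[block] for block in sequence))]

pipelines = [pipeline for sequence in SEQUENCES for pipeline in expand(sequence)]

for number, pipeline in enumerate(pipelines, start=1):
    print(f"{number}. {' -> '.join(pipeline)}")
```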

Allowing different parameter values for each step implicitly defines more combinations, and consequently more unique pipelines.

### More details on DSC blocks
The **scenarios** part of the DSC sequence is the `simulate` block, which has two computational routines, `rcauchy.R` and `rt.R`. These routines generate `n` random samples from a Cauchy distribution with location parameter `true_loc`, and from a *t* distribution with non-centrality parameter `true_loc`, respectively. There are 2 choices of computational routine; each routine has one choice of `n` (1000) and two choices of `true_loc` (0 and 1), and 5 replicates are evaluated as defined by `seed`. As a result, the `simulate` block has 2 × 1 × 2 × 5 = 20 parallel computational steps.
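
The figure of 20 steps can be checked by enumerating every combination of routine, seed and parameter value. This Python sketch is only a counting aid: the dictionary mirrors the `simulate` block, with `R(1:5)` written out as the integers 1 to 5, and says nothing about how DSC2 schedules the steps.

```python
from itertools import product

# Values copied from the `simulate` block; seed R(1:5) is written out explicitly
simulate = {
    "exec": ["rt.R", "rcauchy.R"],
    "seed": [1, 2, 3, 4, 5],
    "n": [1000],
    "true_loc": [0, 1],
}

# One dictionary per computational step: a single value chosen for every entry
steps = [dict(zip(simulate, values)) for values in product(*simulate.values())]

print(len(steps))
print(steps[0])
```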

The **methods** part of the sequence consists of the `transform` and `estimate` blocks. The `transform` block performs two types of Winsorization on the input data. The `estimate` block defines two computational routines that estimate the location parameter, via the sample mean (`mean.R`) and the median (`median.R`) respectively. There are two types of procedures for **methods**: the first is `transform * estimate`, which runs the `transform` family of steps first and then runs the `estimate` family on the transformed data; the second is `estimate` alone, which performs parameter estimation directly on the original data produced by its upstream computational steps, i.e., by the **scenarios** part.

The **scores** part of the sequence is the `mse` block, which has a single computational routine, `MSE.R`. It calculates the mean squared error as a summary of the comparison between the *true* (`true_mean`) and *estimated* (`mean_est`) location parameters. These two parameters take the values `$true_loc` and `$loc`, which are `return` values of the upstream computational steps in `simulate` and `estimate` respectively.
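
The contents of `MSE.R` are not shown in this document. As a rough illustration of the kind of score involved, the Python sketch below computes the squared error of one estimate and its average over several replicates. The function names are hypothetical, and averaging across replicates in this way is an assumption about how the per-step scores might later be summarized.

```python
def squared_error(mean_est, true_mean):
    """Squared difference between one estimate and the true location."""
    return (mean_est - true_mean) ** 2

def mean_squared_error(estimates, true_mean):
    """Average squared error over several estimates of the same true value."""
    errors = [squared_error(estimate, true_mean) for estimate in estimates]
    return sum(errors) / len(errors)

# Two made-up estimates of a true location of 0
print(mean_squared_error([0.5, -0.5], 0))
```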

In this case study, the design of blocks logically follows the **scenarios**, **methods** (pre-processing via `transform` and analysis via `estimate`), and **scores** paradigm. It does not have to be this way, though: one can even make a separate block for each computational routine (in this case, for every R script) and combine them in the *run* sequence of the *DSC* section, in a fashion similar to the 12 pipelines expanded from this DSC in the previous section. The style adopted here seems the most reasonable choice because computational routines that share input and output are consolidated into single blocks, and the logic between blocks, which reflects the DSC design, is clearly presented.


### Variable scope and sharing in DSC
The DSC2 configuration script implements an implicit rule for the scope of variables that start with the `$` sigil. Within a DSC block, all parameter variables should be considered *local* to the block and are not accessible from outside it; all returned values should be considered *global* variables, accessible from other DSC blocks via the `$` sigil. It is legitimate, and sometimes necessary, for multiple blocks to have the same variable name in their `return` entry; for example, here both the `simulate` and `transform` blocks return `x`. In such cases, which block provides the value of `$x` depends on the context of the DSC sequence: DSC2 will always search for the nearest upstream block that yields the variable it is looking for. For example, in the sequence `simulate * transform * estimate * mse`, `$x` in `estimate` comes from its direct upstream neighbor `transform`, while in the sequence `simulate * estimate * mse`, `$x` comes from `simulate`.
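
The nearest-upstream rule can be sketched in a few lines of Python. This illustrates the lookup logic described above and is not DSC2's implementation; `RETURNS` and `resolve` are hypothetical names, and the `return` entries are copied from the case study.

```python
# `return` entries of each block in the case study
RETURNS = {
    "simulate": ["x", "true_loc"],
    "transform": ["x"],
    "estimate": ["loc"],
    "mse": ["mse"],
}

def resolve(sequence, block, variable):
    """Name the nearest block upstream of `block` that returns `variable`."""
    position = sequence.index(block)
    # Walk backwards from the block's direct upstream neighbor to the start
    for upstream in reversed(sequence[:position]):
        if variable in RETURNS[upstream]:
            return upstream
    raise LookupError(f"no block upstream of {block!r} returns {variable!r}")

with_transform = ["simulate", "transform", "estimate", "mse"]
without_transform = ["simulate", "estimate", "mse"]

print(resolve(with_transform, "estimate", "x"))     # transform
print(resolve(without_transform, "estimate", "x"))  # simulate
print(resolve(with_transform, "mse", "true_loc"))   # simulate
```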

## Executable modes

DSC2 will consider a computational routine a *plugin* when the `exec` is an R or Python script and the return value is not [`File()`](DSC_Configuration.html#File()). In this mode, computational results and communication between blocks are handled implicitly, so users do not have to worry about file output. They do, however, have to make sure that the variable names used in the plugin script are defined in the DSC script and are not overwritten in the plugin script, and that the return values of the DSC block are a subset of the variables available from the plugin script. All the computational routines in the example above are plugins written in R. You can look at a concrete example in the [Quick Start](../tutorials/Quick_Start.html) tutorial.
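
As a hedged illustration of these naming rules, a Python counterpart of `median.R` might look like the sketch below. The file name `median.py` is hypothetical. In a real plugin the first assignment would be absent, because `x` would come from the `params` entry of the block; it is included here only so that the sketch runs on its own.

```python
# median.py (hypothetical plugin for the `estimate` block)
import statistics

# Stand-in for the value of the `x` parameter; an assumption made so that
# this sketch is self-contained. A plugin should not reassign `x` itself.
x = [0.2, -1.3, 0.7, 15.0, 0.1]

# `loc` is the name listed under `return`, so the script must define it
loc = statistics.median(x)
print(loc)
```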
