# Design and Features

## Design Overview
DSC2 is designed to manage and execute all computations required for *Dynamic Statistical Comparisons*. Typically such comparisons involve generating (or gathering) data, performing statistical analyses, and evaluating the performance of the statistical procedures applied. Each step takes input data from the previous step (unless it is the first step) and creates output that is either passed to the next step or interpreted as the outcome of the DSC. A DSC user thus has two jobs: define the computational steps (as [DSC modules](Terminology.html)), and specify how the steps are connected for execution (as [DSC pipelines](Terminology.html)).

As a first pass at illustrating the DSC design, let's consider a DSC sequence that consists of 3 types of steps: **scenarios**, **methods** and **scores**, where

*  **scenarios**: provide input data and / or configure computational routines that generate input data.
*  **methods**: define statistical procedures that analyze data.
*  **scores**: define procedures that evaluate the results of data analyses against the "truth" (the model from which the data are generated) and calculate scores that measure the performance of methods.

In practice DSC is more generic and flexible -- one does not have to follow this paradigm when composing a DSC, as long as steps and sequences are properly defined. You may often have an incomplete DSC in the exploratory phase of a project as you develop methods.

The example below follows the three-step paradigm described above, but allows computational steps to be combined within each step, as the following pipelines illustrate:

**FIXME: to be replaced by an Adobe Illustrator figure**

```
  | scenario 1               |    | method 1 ->                      |    | score 1 |
  | scenario 2               | -> | method 2 -> method 1             | -> | score 2 |
  | scenario 1 -> scenario 3 |    | method 3 -> method 2 -> method 1 |    | ...     |
  | scenario 4 -> ...        |    | method 4 -> ...                  |    | ...     |
```

where each **scenario**, **method** and **score** is a computational step. Steps of the same type can be different approaches to generating data, performing statistical analysis, or measuring the performance of methods, or they can be the same approach with different parameter settings. The arrows connect the computational steps into sequences.

## A Case Study

To understand the DSC design we review a simple example with (mostly) self-explanatory syntax. DSC syntax is completely documented [elsewhere](DSC_Configuration.html); readers should not worry about learning the syntax at this point.

In this example we compare methods of estimating a location parameter from data. We generate data from a *t* distribution and a Cauchy distribution, optionally remove outliers (with two Winsorization methods), estimate the location parameter (via sample mean or median), and evaluate the performance of sequences of these steps by comparing the estimate with the "ground truth" in terms of mean squared error.

### DSC script

```
t, cauchy: rt.R, rcauchy.R
 replicate: R(1:5)
 n: 1000
 true_loc: 0, 1
 $x: x
 $true_loc: true_loc

winsor1, winsor2: winsor1.R, winsor2.R
 x: $x
 @winsor1:
     fraction: 0.05
 @winsor2:
     multiple: 3
 $x: x

mean, median: mean.R, median.R
  x: $x
  $loc: loc

mse: mse.R
  mean_est: $loc
  true_mean: $true_loc
  $mse: mse

DSC:
  define: 
      simulate: t, cauchy
      transform: winsor1, winsor2
      estimate: mean, median
      score: mse
  run: simulate *
       (transform * estimate, estimate) *
       score
  exec_path: R
  output: dsc_result
```

### DSC execution
The sequence to execute:
```
      run: simulate *
           (transform * estimate, estimate) *
           score
```
will be expanded to 2 sequences:

1. `simulate -> transform -> estimate -> score`
2. `simulate -> estimate -> score`

Or 12 sequences in terms of computational executables:

1. `rt.R -> winsor1.R -> mean.R -> mse.R`
2. `rt.R -> winsor1.R -> median.R -> mse.R`
3. `rt.R -> winsor2.R -> mean.R -> mse.R`
4. `rt.R -> winsor2.R -> median.R -> mse.R`
5. `rcauchy.R -> winsor1.R -> mean.R -> mse.R`
6. `rcauchy.R -> winsor1.R -> median.R -> mse.R`
7. `rcauchy.R -> winsor2.R -> mean.R -> mse.R`
8. `rcauchy.R -> winsor2.R -> median.R -> mse.R`
9. `rt.R -> mean.R -> mse.R`
10. `rt.R -> median.R -> mse.R`
11. `rcauchy.R -> mean.R -> mse.R`
12. `rcauchy.R -> median.R -> mse.R`
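The expansion above can be reproduced with a Cartesian product over the executables in each group. The following is only an illustrative sketch in Python, not how DSC itself is implemented; the `groups` dictionary, the `group_sequences` list and the `expand` helper are hypothetical names introduced for this example.

```python
from itertools import product

# Module groups as named in the `define` section of the DSC script above.
groups = {
    "simulate": ["rt.R", "rcauchy.R"],
    "transform": ["winsor1.R", "winsor2.R"],
    "estimate": ["mean.R", "median.R"],
    "score": ["mse.R"],
}

# The two group-level sequences obtained from
# simulate * (transform * estimate, estimate) * score
group_sequences = [
    ["simulate", "transform", "estimate", "score"],
    ["simulate", "estimate", "score"],
]


def expand(sequence, groups):
    """Return every executable-level sequence for one group-level sequence."""
    choices = [groups[name] for name in sequence]
    return [" -> ".join(combo) for combo in product(*choices)]


executable_sequences = [
    s for seq in group_sequences for s in expand(seq, groups)
]

# Print the sequences as a numbered list, in the same order as above.
for i, s in enumerate(executable_sequences, start=1):
    print(f"{i}. {s}")
```

The first group-level sequence contributes 2 x 2 x 2 x 1 = 8 executable sequences and the second contributes 2 x 2 x 1 = 4, which gives the 12 listed above.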

By allowing different parameters for each executable, more combinations are implicitly defined and consequently more unique sequences are executed.

## A brief narrative to the case study

As motivated earlier, there are 3 steps in this example: **scenarios**, **methods** and **scores**.

The **scenarios** part is the first code block in the file, which has two computational executables, `rcauchy.R` and `rt.R`. These routines generate `n` random samples from a Cauchy distribution with location parameter `true_loc`, and from a *t* distribution with non-centrality parameter `true_loc`, respectively. There are 2 choices of computational routine; each routine has one choice of `n` (1000) and two choices of `true_loc` (0 and 1). 5 replicates are evaluated, as defined by `replicate`. This code block therefore produces 2 x 1 x 2 x 5 = 20 parallel computations.
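The count of 20 follows from taking every combination of routine, `n`, `true_loc` and replicate. A minimal Python sketch of that arithmetic is given below; the variable names are hypothetical and the code only enumerates the combinations, it does not run any of the R scripts.

```python
from itertools import product

# Parameter choices taken from the first block of the DSC script.
routines = ["rt.R", "rcauchy.R"]
n_values = [1000]
true_loc_values = [0, 1]
replicates = range(1, 6)  # corresponds to R(1:5) in the script

# One entry per (routine, n, true_loc, replicate) combination.
simulate_runs = list(product(routines, n_values, true_loc_values, replicates))

print(len(simulate_runs))  # 20
```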

The **methods** part consists of `transform` and `estimate`. `transform` performs two types of winsorization on the input data. `estimate` has two computational routines to estimate the location parameter, via sample mean (`mean.R`) and median (`median.R`) respectively. There are two types of procedures for **methods**: the first is `transform * estimate`, which runs the `transform` family of steps first and then runs the `estimate` family on the transformed data; the second is `estimate` alone, which performs parameter estimation directly on the original data produced by its upstream steps, i.e., by the **scenarios** part.
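To make the two kinds of procedure concrete, here is a rough Python sketch. The contents of `winsor1.R` and `winsor2.R` are not shown in this document, so the two functions below are assumptions: `winsor_fraction` clips a given fraction of values in each tail, and `winsor_multiple` clips values lying more than a given multiple of the median absolute deviation from the median. The actual scripts may define `fraction` and `multiple` differently.

```python
import statistics


def winsor_fraction(x, fraction=0.05):
    """Clip the lowest and highest `fraction` of values (assumed behavior)."""
    s = sorted(x)
    k = int(len(s) * fraction)  # number of values clipped in each tail
    lo, hi = s[k], s[len(s) - 1 - k]
    return [min(max(v, lo), hi) for v in x]


def winsor_multiple(x, multiple=3):
    """Clip values beyond `multiple` MADs from the median (assumed behavior)."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    lo, hi = med - multiple * mad, med + multiple * mad
    return [min(max(v, lo), hi) for v in x]


# Made-up illustration data with one gross outlier.
x = [0.1, -0.3, 0.2, 0.0, 0.4, -0.2, 0.3, -0.1, 0.05, 50.0]

# Second kind of procedure: estimate directly on the original data.
raw_mean = statistics.mean(x)
raw_median = statistics.median(x)

# First kind of procedure: transform, then estimate.
winsorized_mean = statistics.mean(winsor_fraction(x, fraction=0.1))
winsorized_median = statistics.median(winsor_multiple(x, multiple=3))

print(raw_mean, winsorized_mean)
print(raw_median, winsorized_median)
```

In this toy data the sample mean is pulled far from zero by the outlier and moves back after winsorization, while the median is unaffected by the transformation.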

The **scores** part is `score`, which has a single computational routine, `mse.R`, that calculates the mean squared error as a summary of the comparison between the *true* (`true_mean`) and *estimated* (`mean_est`) location parameters. These two inputs take the values of `$true_loc` and `$loc` respectively, which are the *output* values of the previous steps in **scenarios** and **methods**.
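The scoring step can be sketched as follows. This is a Python illustration under the assumption that each run contributes one squared error and that these are averaged over replicates; the contents of `mse.R` are not shown here and may differ. The function names and the example estimates are made up for illustration.

```python
def squared_error(mean_est, true_mean):
    """Squared difference between one estimate and the true location."""
    return (mean_est - true_mean) ** 2


def mean_squared_error(estimates, true_mean):
    """Average squared error over a collection of estimates."""
    errors = [squared_error(e, true_mean) for e in estimates]
    return sum(errors) / len(errors)


# Made-up estimates from 5 replicates with true location 1.
estimates = [0.9, 1.1, 1.0, 1.2, 0.8]
mse = mean_squared_error(estimates, true_mean=1)

print(round(mse, 6))
```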

In this case study, the design of blocks logically follows the **scenarios**, **methods** (pre-processing via `transform` and analysis via `estimate`), and **scores** paradigm. It does not have to be this way, though: one can even make a separate block for each computational routine (in this case, for every R script) and combine them in the *run* sequence of the *DSC* section in a fashion similar to the 12 pipelines expanded from this DSC, as shown in the previous section of this document -- it is a decision users have the liberty to make based on style preference and logic.

## Output files

Computational results and communication between steps are handled implicitly by DSC, so users do not have to worry about file output. However, users are responsible for making sure that the variables specified in the DSC script match what is actually written in the computational routines they provide. This is a key feature of DSC.