# Syntax: DSC configuration

## Block Parameters
A DSC block is composed of block parameters.

### exec
`exec` specifies the names of executable computational routines, together with their command line arguments if applicable. For example, an `exec` entry reads:

```yaml
  exec: datamaker.R, ms $nsam $nreps -t $theta -seed $seed
```

where data is generated by two programs, `datamaker.R` and `ms`, and `ms` takes the command line arguments `nsam`, `nreps`, `theta` and `seed`. Although `exec` accepts arbitrary command line programs, if the computational routine is [a plugin](01-Design-and-Features#executable-modes), for example `datamaker.R`, there is no need to specify input parameters such as `nsam` and `nreps` explicitly, as long as the parameter names match the variable names used inside the R script. For a non-plugin executable such as `ms` here, each input parameter should start with `$`, followed by a parameter name that is defined under the [`params`](#params-optional) entry of **the same block**.

### params [optional]
`params` defines parameters to be used by computational routines under `exec`. It is an indented list with labels (parameter names) corresponding to command arguments (for non-plugin mode) or variable names (for plugin mode) of `exec`. A typical `params` reads:

```yaml
  params:
    n: 1000, 2000
    mean: 0, 1
```

which indicates that there are 2 input parameters, namely `n` and `mean`, for the computational routines defined in `exec`. Combinations of parameter values (Cartesian product style) will be assigned to every `exec` unless [otherwise instructed](#logic-optional). For example, each executable under `exec` will take 4 sets of parameters from the example above: `(n = 1000, mean = 0), (n = 1000, mean = 1), (n = 2000, mean = 0), (n = 2000, mean = 1)`.
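The Cartesian product expansion described above can be sketched in Python. This is only an illustration of the combination rule, not DSC's own implementation; the function name `expand` is hypothetical.

```python
from itertools import product

# Parameter values copied from the `params` example above.
params = {"n": [1000, 2000], "mean": [0, 1]}

def expand(params):
    """Return every combination of parameter values as a list of dicts."""
    names = list(params)
    return [dict(zip(names, values)) for values in product(*params.values())]

combos = expand(params)
for combo in combos:
    print(combo)  # four dicts, one per parameter set
```

The four resulting dictionaries match the four parameter sets listed above, in the same order.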

#### `exec` specific parameters
Often, some parameters are unique to one `exec`. Executable-specific parameter assignment is needed in this case:

```yaml
  params:
    n: 1000, 2000
    mean: 0, 1
    exec[1]:
      t: 5, 10
```

where `n` and `mean` are shared by all `exec` but parameter `t` is only used by the first executable (indexed by `[1]`) in `exec`.
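As a rough sketch of the idea, assuming that executable-specific values simply join the Cartesian product for that executable (the names `shared`, `specific` and `params_for` are hypothetical and not part of DSC):

```python
from itertools import product

shared = {"n": [1000, 2000], "mean": [0, 1]}
specific = {1: {"t": [5, 10]}}  # keyed by the 1-based index used in exec[1]

def params_for(exec_index):
    """Shared parameters plus any parameters defined only for this executable."""
    merged = dict(shared)
    merged.update(specific.get(exec_index, {}))
    return merged

def count_combinations(parameters):
    """Number of parameter sets under the default Cartesian product rule."""
    return len(list(product(*parameters.values())))

print(count_combinations(params_for(1)))  # 2 * 2 * 2 = 8
print(count_combinations(params_for(2)))  # 2 * 2 = 4
```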

#### Grouped parameters
This is a housekeeping feature that organizes parameters into groups to enhance readability. The group names are ignored by the DSC2 interpreter at run time. For example, the following two blocks are equivalent:

```yaml
  params:
    sample_params:
      n_samples: 1000, 2000
      p_cases: 0.5
    genotype_params:
      n_snps: 500, 800
      n_genes: 20
```

```yaml
  params:
    n_samples: 1000, 2000
    p_cases: 0.5
    n_snps: 500, 800
    n_genes: 20
```
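The equivalence of the two blocks can be illustrated with a small Python sketch that drops group labels. It is a simplification: it treats every nested mapping as a group and does not model `exec[i]` entries. The function name `flatten` is hypothetical.

```python
def flatten(params):
    """Drop group labels: a nested mapping is a group, anything else is a value."""
    flat = {}
    for name, value in params.items():
        if isinstance(value, dict):
            flat.update(flatten(value))
        else:
            flat[name] = value
    return flat

grouped = {
    "sample_params": {"n_samples": [1000, 2000], "p_cases": [0.5]},
    "genotype_params": {"n_snps": [500, 800], "n_genes": [20]},
}
ungrouped = {
    "n_samples": [1000, 2000],
    "p_cases": [0.5],
    "n_snps": [500, 800],
    "n_genes": [20],
}
print(flatten(grouped) == ungrouped)  # True
```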

`params` is an optional parameter; `exec` is a required parameter.

### seed
`seed` sets random seeds for programs that use random number generators. Setting a range of seeds naturally creates *replicates* of the same scenario or the same methods. If there are plugins in `exec`, language-specific functions (`set.seed()` in R and `[numpy].random.seed()` in Python) will be invoked to set seeds; otherwise the seed values will be passed to `exec` in the same fashion as other parameters in `params`, for example `ms ... -seed $seed` as demonstrated above.

`seed` is an optional parameter.

### return
The `return` parameter lists the values to be saved to disk at the end of computation for each block. Only the `return` values of a block can be referred to by other blocks (via the `$` symbol). For example:

```yaml
  simulate:
      ...
      return: x, y
```

then the `return` values `x` and `y` can be used in other blocks, for example,

```yaml
  test:
      params:
        x: $x
        y: $y
```

where the input parameters `x` and `y` for "test" consist of values from the `return` of "simulate".

It is possible to apply aliases to return values. Two types of alias are supported: extract and rename. For example, to extract a specific attribute from an R list:

```yaml
  return: data, beta = R(data$meta$beta)
```

then in addition to returning `data` which is an R list, it also returns value `beta` which is part of the `data` list, i.e., `data$meta$beta`. For another example:

```yaml
  return: x, y = x_new
```
where `x` and `x_new` exist in data but `x_new` is returned as another variable named `y`.

For plugin executables, a return value should correspond to a variable name in the plugin script, and thus may or may not be in `params`. For non-plugin executables, a return value should be one of the `params` values. If a return value is a file name (via the [File()](#file) syntax), the corresponding file will be registered with DSC2 so that its future changes are tracked.

`return` is a required parameter.

### .logic
`.logic` defines how parameter values are combined. It can be used outside or inside `params`.

#### For executables
When `.logic` appears outside `params` (typically under `exec`), it uses the `+` operator to specify how the computational routines should be combined. These routines are independent of each other by default, but can be connected via a `.logic` entry, for example:

```yaml
    method:
        exec: test1.R, test2.R
        .logic: exec[2], exec[1] + exec[2]
```

then the DSC pipeline will run two procedures: one runs only `test2.R`, the other runs `test1.R` followed by `test2.R`.

A handy use case for `.logic` under `exec` is pre/post processing of data from/to third-party software, for example:

```yaml
  admixture:
      exec: make_ped.py $data $ped,
            admixture $ped $K > $output,
            new_admixture_method.R $data $output
      .logic: exec[1] + exec[2], exec[3]
```

Here two different admixture analysis methods are compared: the `admixture` program and a new method under development, coded in `new_admixture_method.R`. The `admixture` program requires input data in PED format, and a `make_ped.py` script is used to convert data to PED. `.logic` here indicates that `exec[1]` is a pre-processor for `exec[2]` and they should always be combined into one unit.
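To make the reading of `.logic` for executables concrete, the following Python sketch turns a `.logic` string into a list of procedures, each an ordered list of executables. It only handles the `exec[i]` and `+` forms shown above; the function name `parse_exec_logic` is hypothetical and this is not DSC's parser.

```python
import re

def parse_exec_logic(logic, executables):
    """Map a `.logic` string to procedures; `exec[i]` indices are 1-based."""
    procedures = []
    for item in logic.split(","):
        steps = []
        for token in item.split("+"):
            match = re.fullmatch(r"exec\[(\d+)\]", token.strip())
            if match is None:
                raise ValueError(f"unsupported token: {token.strip()!r}")
            steps.append(executables[int(match.group(1)) - 1])
        procedures.append(steps)
    return procedures

method = parse_exec_logic("exec[2], exec[1] + exec[2]", ["test1.R", "test2.R"])
print(method)  # [['test2.R'], ['test1.R', 'test2.R']]

admixture = parse_exec_logic(
    "exec[1] + exec[2], exec[3]",
    ["make_ped.py", "admixture", "new_admixture_method.R"],
)
print(admixture)  # [['make_ped.py', 'admixture'], ['new_admixture_method.R']]
```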

#### For parameters

*FIXME: this feature is not yet implemented as of May 01, 2016.*

When `.logic` appears inside `params`, it overrides the default logic (that all parameters are combined as a Cartesian product). [DSC sequence](#dsc-sequences) operators are supported. For example:

```yaml
  params:
    n: 100, 200, 300, 400, 500
    mu: 0, 1
    exec[1]:
      sigma: 1, 2
      .logic: n[1:3] * mu[1] * sigma, n[4,5] * mu[2] * sigma[1]
```

Without `.logic`, DSC will exhaust all combinations of 5 values of `n`, 2 of `mu` and 2 of `sigma`, a total of 20 parallel jobs. The `.logic` here states that instead of 20 jobs, DSC will first run 3 values of `n` with `mu = 0` and 2 values of `sigma`, then run another 2 values of `n` with `mu = 1` and `sigma = 1`, which is a total of 8 jobs. Notice that parameter index slicing makes it possible to run a subset of parameter values.
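The job counts quoted above can be checked with a short Python sketch. It mirrors the arithmetic only and assumes `*` denotes a Cartesian product and `,` a union of job sets; DSC's 1-based indices are translated to Python's 0-based slices by hand.

```python
from itertools import product

n = [100, 200, 300, 400, 500]
mu = [0, 1]
sigma = [1, 2]

# Default behavior: every combination of n, mu and sigma.
default_jobs = list(product(n, mu, sigma))

# n[1:3] * mu[1] * sigma: first three n, first mu, all sigma.
first = list(product(n[0:3], mu[0:1], sigma))
# n[4,5] * mu[2] * sigma[1]: fourth and fifth n, second mu, first sigma.
second = list(product(n[3:5], mu[1:2], sigma[0:1]))
logic_jobs = first + second

print(len(default_jobs), len(logic_jobs))  # 20 8
```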

`.logic` is an optional parameter.

### .alias
#### For executables
When `.alias` appears outside `params`, it should have a one-to-one correspondence with the `exec` entry. These aliases will be used to name the columns of the DSC output database. For example:

```yaml
  pi0_score:
      exec: score.R
      .alias: score_pi0
      params:
          ...
      return: result
```

Without `.alias`, the output step name in DSC database for this DSC block will be `score.r`; with the alias the column name will be `score_pi0`. Here `.alias` is useful when the same `exec` is used in different blocks for different purposes (here it is used to evaluate score for `pi0` but it may also be used in another block evaluating another quantity).

`.alias` can also be used along with `.logic` for `exec` to better name composite steps. For example:

```yaml
simulate:
  exec: BM.R, MultiBM.R, PostProcBM.py, PostProcMultiBM.py
  .logic: exec[1] + exec[3], exec[2] + exec[4]
  .alias: BM, MultiBM
```

#### For parameters
`.alias` is often used to adjust parameter names for input to different executables. For example:

```yaml
  params:
    mu: 1, 2, 3
    exec[2]:
      .alias: theta = mu
```

then every `exec` takes a parameter `mu`, except for `exec[2]`, which requires a parameter named `theta`; `theta` in `exec[2]` is equivalent to `mu` in the other `exec`. Under the hood DSC will load values from `mu` and assign them to `theta` for use with `exec[2]`.

For plugins, `.alias` is often used with the [Pack](#pack) operator to consolidate parameters into a single data object (list for R, dictionary for Python).

`.alias` is an optional parameter.
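The renaming step can be pictured with a minimal Python sketch. The function `apply_alias` and its argument layout (new name mapped to original name) are hypothetical, chosen only to mirror `theta = mu`.

```python
def apply_alias(values, alias):
    """Rename parameters; `alias` maps each new name to the original name."""
    renamed = dict(values)
    for new_name, old_name in alias.items():
        renamed[new_name] = renamed.pop(old_name)
    return renamed

for mu in [1, 2, 3]:
    # exec[2] sees `theta` where the other executables see `mu`.
    print(apply_alias({"mu": mu}, {"theta": "mu"}))
```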

### .options

*FIXME: this feature is not yet implemented as of May 01, 2016.*

`.options` includes parameters that control the behavior of the corresponding `exec` as it executes, for example:

```yaml
  .options: ncpu = 2, mem = 4G
```

Supported options are:

*  `ncpu`: Number of required CPUs.
*  `mem`: Required memory.
*  `inline`: True or False, indicating whether an R script is executed inline with the next procedure instead of producing return files. This feature is useful when the cost of computation for a procedure is trivial compared to the cost of storing its output. For example, if a simulation procedure is simply `runif(500000)`, it makes more sense to save this line of code and execute it inline with the next step than to save a vector of 500,000 random numbers to disk.

#### Scope of .logic, .alias and .options
When these parameters appear in `params` but outside any `exec[i]`, they also affect all parameters under `exec[i]` when applicable. However, this behavior can be overridden inside `exec[i]` by re-defining the same parameter.

*FIXME: example needed*

`.options` is an optional parameter.

## DSC Block Operators
### Sigils
There are two types of wildcard sigils: `$` and `$()`.
#### `$` in `exec` entry
In `exec`, `$` refers to parameters defined inside `params` of the same block.

#### `$` in `params` entries
In `params`, `$` refers to return values from an upstream block.

#### `$()`
`$()` refers to variables defined in `DSC::parameters`. For example:

```yaml
  simulate:
      params: 
         methods: $(data_functions)
  ...
  DSC:
      ...
      parameters:
          data_functions: mvngenotypes, discrete.cosine, discrete.cosine2, discrete.cosine.peaksel
```

is equivalent to

```yaml
  simulate:
      params:
         methods: mvngenotypes, discrete.cosine, discrete.cosine2, discrete.cosine.peaksel
```
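The substitution can be sketched as a plain text replacement in Python. This assumes `$(name)` is expanded textually before values are split, which the source does not state; the function name `substitute_globals` is hypothetical.

```python
import re

def substitute_globals(value, global_params):
    """Replace each `$(name)` with the text stored under `name`."""
    return re.sub(r"\$\((\w+)\)", lambda m: global_params[m.group(1)], value)

# Stands in for the `DSC::parameters` section of the example above.
dsc_parameters = {
    "data_functions": "mvngenotypes, discrete.cosine, discrete.cosine2, discrete.cosine.peaksel",
}
methods = substitute_globals("$(data_functions)", dsc_parameters)
print(methods)
```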

### Group operator
The group operator is `()`, a bare pair of parentheses that groups parameters as one unit. For example:

```yaml
  exec: method.R, program.exe $K
  params:
     K: (1,2,3), (4,5,6)
```

With `()`, `(1,2,3)` will be translated to the vector assignment `c(1, 2, 3)` in an R plugin, the tuple `(1,2,3)` in a Python plugin, or the space-separated argument sequence `program.exe 1 2 3` for other command line programs. Values will be assigned in units of 3 instead of separately.
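The three translations can be mimicked with a few Python helpers. These are illustrative only; the helper names are hypothetical and do not correspond to DSC internals.

```python
def as_r(group):
    """Render a group as an R vector literal."""
    return "c(" + ", ".join(str(v) for v in group) + ")"

def as_python(group):
    """Render a group as a Python tuple."""
    return tuple(group)

def as_command_line(program, group):
    """Render a group as space-separated command line arguments."""
    return " ".join([program] + [str(v) for v in group])

K = [(1, 2, 3), (4, 5, 6)]  # two units of three values each
for group in K:
    print(as_r(group), as_python(group), as_command_line("program.exe", group))
```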

### R(), Python(), Shell()
These operators run the code inside the parentheses using the R, Python or Shell interpreter and evaluate the output. For example, `seed: R(1:5)` results in `seed: 1, 2, 3, 4, 5`. This provides a handy tool for generating input parameters.

### Combo(), Pairs()
`Combo()` forms the Cartesian product of parameters and `Pairs()` forms paired groupings. They can be considered shortcuts for assigning values in DSC entries. For example:

```yaml
  exec: Combo(classifier.R (kernal_1, kernal_2, kernal_3))
```

is equivalent to

```yaml
  exec: classifier.R kernal_1, classifier.R kernal_2, classifier.R kernal_3
```
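A Python sketch of the two operators follows. `combo` reproduces the expansion shown above; `pairs` assumes that "paired grouping" means element-wise pairing, which the source does not spell out. Both function names and the file names passed to `pairs` are hypothetical.

```python
from itertools import product

def combo(*groups):
    """Cartesian product of the groups, joined into command strings."""
    return [" ".join(parts) for parts in product(*groups)]

def pairs(*groups):
    """Element-wise pairing of the groups (assumed meaning of Pairs())."""
    return [" ".join(parts) for parts in zip(*groups)]

expanded = combo(["classifier.R"], ["kernal_1", "kernal_2", "kernal_3"])
print(", ".join(expanded))

paired = pairs(["a.R", "b.R"], ["x", "y"])
print(", ".join(paired))
```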

### Asis()
In a DSC file, numeric and string data types are determined automatically, so there is no need to add quotes to strings. This is convenient in most cases but can be problematic when the input looks like a string but is in fact, for example, R code that should not be converted to a string. The `Asis()` operator is useful in this case. For example,

```yaml
  g: Asis(normalmix(c(2/3,1/3),c(0,0),c(1,2)))
```

will result in

```r
  g = normalmix(c(2/3,1/3),c(0,0),c(1,2))
```

But without `Asis()` it will read

```r
  g = "normalmix(c(2/3,1/3),c(0,0),c(1,2))"
```

`Asis()` can also be used to indicate a *raw string*, for example:

```yaml
  g: (1, 2, 3)
  k: ('1', '2', '3')
  l: (Asis('1', '2', '3'))
```

will result in

```r
  g = c(1, 2, 3)
  k = c('1', '2', '3')
  l = c("'1'", "'2'", "'3'")
```

### File()

*FIXME: multiple extension and multiple return files are not yet implemented as of May 01, 2016.*

When a DSC string parameter is a file name but the file is yet to be created (by the current computational step), the `File()` operator must be applied to the parameter to indicate that it represents one or more files that do not exist yet. The `File()` operator accepts file extensions and will assign a unique basename for the context. For example:

```yaml
      params:
          data: $sim
          K: 1, 2, 3, 4, 5
          ped: File(ped, map)
          score: File(score)
      return: score
```

Here the return value `score` is a file, likely generated by the computational routine taking input `data` file and 5 values of parameter `K`. As a result, the returned `score` should be 5 files. Names of these 5 output `score` files will be automatically assigned, and users do not have to worry about file name specifications for different combinations of parameters. We allow for a simple interface to query the output files upon completion of DSC so that users can keep track of the outcome from particular sets of parameters.

When there are multiple suffixes, for example a pair `(ped, map)`, the resulting files also come in pairs, e.g., `dsc1.ped, dsc1.map`, ... `dsc5.ped, dsc5.map`.

Only files in `return` will be registered and saved. Other files are considered temporary and will not be monitored.
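The naming pattern in the example can be sketched in Python. The `dsc<i>` basename below only copies the illustrative names used above; the basenames DSC actually assigns are not specified here, and the function name `file_names` is hypothetical.

```python
def file_names(extensions, n_combinations, prefix="dsc"):
    """One tuple of file names per parameter combination, sharing a basename."""
    return [
        tuple(f"{prefix}{i}.{ext}" for ext in extensions)
        for i in range(1, n_combinations + 1)
    ]

ped_files = file_names(["ped", "map"], 5)   # File(ped, map) with 5 values of K
score_files = file_names(["score"], 5)      # File(score) with 5 values of K
print(ped_files[0], ped_files[-1])
```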

### Pack()

*FIXME: partial conversion with Pack() is not yet implemented as of May 01, 2016.*

This is used in `.alias` entries to help construct variables for plugin executables. For example, for an R plugin,

```yaml
  .alias: args = Pack()
```
will convert all variables in the corresponding parameter space, say `x,y,z`, to `args = list(x = ..., y = ..., z = ...)`. Likewise, `Pack()` will convert parameters to a dictionary in Python. Partial conversion is also part of the design: for example, `args = Pack(x, y)` converts only the selected variables to an R list, `args = list(x = x, y = y)`.
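For a Python plugin the effect would resemble the following sketch, in which the function `pack` is a hypothetical stand-in for the operator and not DSC code:

```python
def pack(parameters, *names):
    """Bundle all parameters, or only the named ones, into one dictionary."""
    selected = names or tuple(parameters)
    return {name: parameters[name] for name in selected}

parameters = {"x": 1, "y": 2, "z": 3}
args = pack(parameters)               # analogous to args = Pack()
partial = pack(parameters, "x", "y")  # analogous to args = Pack(x, y)
print(args, partial)
```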

### Index and slicing
Indices can be used in the following contexts:

*  Index for parameters in `exec` entry, for example `exec: makeped.py $data $output[1]` where `output` parameter takes the form of `output: (1.ped, 1.map), (2.ped, 2.map)`.
*  Index for `exec` in `.logic` inside and outside `params`; each element corresponds to a computational routine in `exec`
*  Index for parameters defined by `params` in `.logic` inside `params`; each element corresponds to a value for the parameter.
*  Index for block names in `DSC::run` sequence; each element corresponds to a computational routine in `exec`.

Slicing syntax is allowed. For example, `n[1,2,4]` extracts the first, second and fourth elements of `n`. `n[1:4]` extracts elements 1 through 4, and `n[1:9:2]` extracts elements 1, 3, 5, 7, 9.
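Because these indices are 1-based and inclusive, unlike Python's own slices, a small sketch may help pin down the semantics. The function `dsc_slice` is hypothetical and covers only the three forms shown above.

```python
def dsc_slice(values, spec):
    """Select elements using 1-based, inclusive specs: '1,2,4', '1:4', '1:9:2'."""
    picked = []
    for part in spec.split(","):
        fields = [int(f) for f in part.split(":")]
        if len(fields) == 1:
            indices = [fields[0]]
        else:
            start, stop = fields[0], fields[1]
            step = fields[2] if len(fields) == 3 else 1
            indices = range(start, stop + 1, step)  # stop is inclusive
        picked.extend(values[i - 1] for i in indices)
    return picked

n = [10, 20, 30, 40, 50, 60, 70, 80, 90]
print(dsc_slice(n, "1,2,4"))  # [10, 20, 40]
print(dsc_slice(n, "1:4"))    # [10, 20, 30, 40]
print(dsc_slice(n, "1:9:2"))  # [10, 30, 50, 70, 90]
```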

## Block Inheritance
When a new block shares similar specifications with existing blocks, block inheritance is introduced to make new block definition more succinct. For example:

```yaml
  SVA:
      exec: SVA.R
      params:
          data: $data
          .alias: args = Pack()
      return: data

  RUV(SVA):
      exec: RUV.R

  voom(SVA):
      exec: voom.R
```

Here, the 3 blocks differ only in the executable name. With block inheritance, we can fully configure `SVA`, then inherit from it to configure `RUV` and `voom`, where only `exec` has to be re-defined.
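Conceptually, inheritance behaves like copying the parent block and overriding selected entries. A minimal Python sketch under that assumption follows; the function `inherit` is hypothetical and the block contents are abbreviated from the example above.

```python
import copy

def inherit(parent, overrides):
    """Copy the parent block, then replace the entries the child re-defines."""
    child = copy.deepcopy(parent)
    child.update(overrides)
    return child

SVA = {"exec": "SVA.R", "params": {"data": "$data"}, "return": "data"}
RUV = inherit(SVA, {"exec": "RUV.R"})
voom = inherit(SVA, {"exec": "voom.R"})
print(RUV)
print(voom)
```

The deep copy keeps the child blocks independent, so changing `params` in one block would not leak into the others.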