# Syntax: DSC configuration

## Block Properties: main
The main properties of a DSC block are `exec`, `seed`, `params` and `return`. Of these, `exec` and `return` are required, while `seed` and `params` are optional.

### exec
`exec` specifies the names of executable computational routines as well as their command line arguments if applicable. For example an `exec` entry reads:

```yaml
  exec: datamaker.R, ms $nsam $nreps -t $theta -seed $seed
```

Here data is generated by two programs, `datamaker.R` and `ms`, and `ms` takes the command line arguments `nsam`, `nreps`, `theta` and `seed`. Although `exec` accepts arbitrary command line programs, if a computational routine is [a plugin](Design_and_Features.html#Executable-modes), for example `datamaker.R`, there is no need to specify its input parameters explicitly, as is done with `$nsam` and `$nreps` for the non-plugin routine `ms`. Note that for plugins the parameter names must match the variable names coded in the plugin script, whereas for non-plugins each input parameter is written as `$` followed by a parameter name that must be found under the [`params`](#params) entry of **the same block**.
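
As a rough illustration of the `$name` substitution for non-plugin routines, the following Python sketch fills the placeholders of the `ms` command with one set of values. It is not DSC's implementation; the concrete values in `values` are made up for the example.

```python
from string import Template

# Hypothetical parameter values; in DSC they would come from `params` and `seed`.
values = {"nsam": 10, "nreps": 2, "theta": 5.0, "seed": 1}

# The non-plugin part of the `exec` entry above, with `$name` placeholders.
command = Template("ms $nsam $nreps -t $theta -seed $seed")

# substitute() raises KeyError if a placeholder has no matching value,
# mirroring the rule that every `$name` must be defined in the same block.
rendered = command.substitute(values)
print(rendered)
```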

`exec` is a required parameter.

### params
`params` defines parameters to be used by computational routines in `exec`. It is an indented list with labels (parameter names) corresponding to command arguments (for non-plugin mode) or variable names (for plugin mode) of `exec`. A typical `params` reads:

```yaml
  params:
    n: 1000, 2000
    mean: 0, 1
```

which indicates that there are 2 input parameters, namely `n` and `mean`, for computational routines in `exec`. Combinations of parameter values (Cartesian product style by default) will be assigned to every `exec` unless [otherwise instructed](#.logic). For example, each executable under `exec` will take 4 sets of parameters from the example above: `(n = 1000, mean = 0), (n = 1000, mean = 1), (n = 2000, mean = 0), (n = 2000, mean = 1)`.
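
The default Cartesian product expansion can be sketched in Python with `itertools.product`. This is only an illustration of the combination rule described above, not DSC's own code; the helper name `expand` is hypothetical.

```python
from itertools import product

# The `params` entry above, written as a plain dictionary.
params = {"n": [1000, 2000], "mean": [0, 1]}

def expand(params):
    """Return one dictionary per combination of parameter values."""
    names = list(params)
    return [dict(zip(names, combo)) for combo in product(*params.values())]

combos = expand(params)
for combo in combos:
    print(combo)
```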

`params` is an optional parameter.

#### `exec` specific parameters
Often we need to specify parameters that are unique to one `exec` and not applicable to others. Executable-specific parameter assignment is needed in this case:

```yaml
  params:
    n: 1000, 2000
    mean: 0, 1
    exec[1]:
      t: 5, 10
```

where `n` and `mean` are shared by all `exec` but parameter `t` is only used by the first executable (indexed by `[1]`) in `exec`.

#### Grouped parameters
This is a "house-keeping" feature that organizes parameters in groups to enhance readability. The group names are ignored by the DSC2 interpreter. For example, the following two blocks are equivalent:

```yaml
  params:
    sample_params:
      n_samples: 1000, 2000
      p_cases: 0.5
    genotype_params:
      n_snps: 500, 800
      n_genes: 20
```

```yaml
  params:
    n_samples: 1000, 2000
    p_cases: 0.5
    n_snps: 500, 800
    n_genes: 20
```
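
The equivalence of the two blocks can be pictured as flattening a nested mapping. The Python sketch below assumes that every nested entry is a group name; it deliberately ignores `exec[i]` entries, which are nested too but carry meaning. The function name `flatten` is hypothetical.

```python
def flatten(params):
    """Drop group names and lift their parameters to the top level."""
    flat = {}
    for name, value in params.items():
        if isinstance(value, dict):
            # A group: keep its contents, discard its name.
            flat.update(flatten(value))
        else:
            flat[name] = value
    return flat

grouped = {
    "sample_params": {"n_samples": [1000, 2000], "p_cases": [0.5]},
    "genotype_params": {"n_snps": [500, 800], "n_genes": [20]},
}
ungrouped = flatten(grouped)
print(ungrouped)
```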

### seed
`seed` sets random seeds for programs that use random number generators. This is a frequently used feature in simulation studies where multiple replicates of the same analytical routine are required. Setting a range of seeds naturally creates *replicates* of the same scenario or the same methods. If there are plugins in `exec`, DSC will call language-specific functions (`set.seed()` in R and `numpy.random.seed()` in Python) to set seeds; otherwise the seed values will be treated the same as a parameter in `params` and passed to `exec` in the same fashion as other parameters, for example `ms ... -seed $seed` as demonstrated above.

`seed` is an optional parameter.

### return
The `return` property lists values to be saved to disk at the end of computation for each block. Only the `return` values of a block can be referred to by other blocks (via the `$` symbol). For example:

```yaml
  simulate:
      ...
      return: x, y
```

then `return` values `x` and `y` can be used in other blocks, for example,

```yaml
  test:
      params:
        z: $x
        y: $y
```

where the input parameters `z` and `y` for `test` take the values of `x` and `y` from the `return` of the `simulate` step.

It is possible to apply an *alias* to `return` values. Two types of alias are supported: extract and rename. For example, to extract a specific attribute from an R list:

```yaml
  return: data, beta = R(data$meta$beta)
```

then in addition to returning `data` which is an R list, it also returns value `beta` which is part of the `data` list, i.e., `data$meta$beta`. For another example:

```yaml
  return: x, y = x_new
```
where `x` and `x_new` exist in data but `x_new` is returned as another variable named `y`.

It is also possible to specify return alias specific to executables, for example:

```yaml
  return:
    exec[1]: score = lkhd
    exec[2]: score = mse
    exec[3]: score = fisher
```

For plugin executables, a return value should match a variable name in the corresponding script and thus may or may not be in `params`. For non-plugin executables, a return value has to be one of the `params` values, as a non-plugin executable can only receive values through the `params` specification. If a return value is a file name (via the [File()](#File()) syntax), the corresponding file will be registered with DSC2 so that possible changes can be tracked further down the pipeline.

`return` is a required parameter.

## Block properties: optional

Optional block properties fine-tune how the main properties are combined or renamed, or define the particular environment in which a DSC block runs. Available optional properties are `.logic`, `.alias` and `.options`.

### .logic
`.logic` defines how parameter values are combined. It can be used outside or inside `params`.

#### For executables
When `.logic` appears outside `params` (typically under `exec`), it annotates the logic behind `exec`, using the `+` operator to specify how the computational routines should be combined. By default, routines under `exec` are independent of each other; this can be changed via `.logic`, for example:

```yaml
    method:
        exec: test1.R, test2.R
        .logic: exec[2], exec[1] + exec[2]
```

DSC will then run two procedures: one involving only `test2.R`, the other running `test1.R` followed by `test2.R`.
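
To make the reading of such a statement concrete, the Python sketch below turns a `.logic` string into a list of procedures, each an ordered list of executables. It is a simplified illustration under the assumption that the statement only contains `exec[i]` terms joined by `+` and separated by commas; the function name `procedures` is hypothetical and this is not DSC's parser.

```python
import re

def procedures(executables, logic):
    """Translate a `.logic` string into ordered lists of executables."""
    result = []
    for unit in logic.split(","):
        # Indices are 1-based, as in `exec[1]`.
        indices = [int(i) for i in re.findall(r"exec\[(\d+)\]", unit)]
        result.append([executables[i - 1] for i in indices])
    return result

steps = procedures(["test1.R", "test2.R"], "exec[2], exec[1] + exec[2]")
print(steps)
```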

A handy use case for `.logic` under `exec` is pre/post-processing of data between plugin and non-plugin routines, for example:

```yaml
  admixture:
      exec: make_ped.py $data $ped,
            admixture $ped $K > $output,
            new_admixture_method.R $data $output
      .logic: exec[1] + exec[2], exec[3]
      params:
        K: 3, 6, 9, 15
```

Here two different admixture analysis methods are compared: the `admixture` program and a new method under development, coded in `new_admixture_method.R`. The `admixture` program requires input data in PED format, and a `make_ped.py` script is used to convert data to PED. `.logic` here indicates that `exec[1]` is a pre-processor for `exec[2]` and they should always be executed as one unit.

#### For parameters

When `.logic` appears inside `params`, it overrides the default logic (that all parameters are combined in Cartesian product style). The logic statement can be written as:

* simple logic statement with `and`, `or`, `not`
* Pythonic syntax: with Python keywords and lambda functions

For example:

```yaml
  params:
    n: 100, 200, 300, 400, 500
    mu: 0, 1
    exec[1]:
      sigma: 1, 2
      .logic: (.n <= 300 and .mu == 0) or (.n > 300 and .mu == 1)
```

Without `.logic`, DSC will exhaust all combinations of 5 values of `n`, 2 of `mu` and 2 of `sigma`, a total of 20 parallel jobs. The `.logic` here restricts the runs to the 3 values of `n` up to 300 with `mu = 0` and the 2 values of `n` above 300 with `mu = 1`. Because the statement as written does not mention `sigma`, each retained pair is still combined with both values of `sigma`, giving 10 jobs; adding a condition such as `.sigma == 1` to the second clause would reduce this to 8. Parameter index slicing can also be used in `.logic` to run a subset of parameter values.


Alternatively, you can use Pythonic syntax, for example with `in`:

```yaml
      .logic: (.n in [100,200,300] and .mu == 0) or (.n in [400, 500] and .mu == 1)
```

Note that to refer to variables in `.logic`, the dot `.` prefix has to be added to the variable names. Failure to follow this rule will result in an error complaining of an invalid logic statement.
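
The effect of a `.logic` statement on the parameter grid can be sketched in Python as a filter over the Cartesian product. This is an illustration only, not how DSC evaluates `.logic`. The statement as written places no constraint on `sigma`, so the sketch keeps 10 of the 20 combinations; `logic_with_sigma` is a hypothetical variation that also fixes `sigma = 1` in the second clause and keeps 8.

```python
from itertools import product

params = {"n": [100, 200, 300, 400, 500], "mu": [0, 1], "sigma": [1, 2]}

def logic_as_written(n, mu, sigma):
    # (.n <= 300 and .mu == 0) or (.n > 300 and .mu == 1)
    return (n <= 300 and mu == 0) or (n > 300 and mu == 1)

def logic_with_in(n, mu, sigma):
    # The same rule expressed with `in`.
    return (n in [100, 200, 300] and mu == 0) or (n in [400, 500] and mu == 1)

def logic_with_sigma(n, mu, sigma):
    # Hypothetical variation: additionally restrict the second clause to sigma = 1.
    return (n <= 300 and mu == 0) or (n > 300 and mu == 1 and sigma == 1)

all_combos = [dict(zip(params, values)) for values in product(*params.values())]
kept = [c for c in all_combos if logic_as_written(**c)]
kept_in = [c for c in all_combos if logic_with_in(**c)]
kept_sigma = [c for c in all_combos if logic_with_sigma(**c)]

print(len(all_combos), len(kept), len(kept_sigma))
```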

### .alias

#### For parameters

`.alias` is often used to adjust parameter names for input to different executables. For example:

```yaml
  params:
    mu: 1, 2, 3
    exec[2]:
      .alias: theta = mu
```

In this example, every `exec` takes a parameter named `mu`, except for `exec[2]`, which requires a parameter named `theta` that is equivalent to `mu` in the other `exec` entries. To deal with this situation, the `.alias` option can be used to rename `mu` to `theta` for `exec[2]` while keeping it as is for the other executables.

For plugins, `.alias` is often used with the [`List()` (or `Dict()`)](#List()-or-Dict()-137) operator to consolidate parameters into a single data object (a `list` in R, a dictionary in Python).


#### For executables
`.alias` outside `params` should have a one-to-one correspondence with the `exec` entries. The goal of `.alias` here is to rename executables in the output DSC database, which can be [annotated and queried](DSC_Annotation.html) after the DSC has been executed. For example:

```yaml
  pi0_score:
      exec: score.R
      .alias: score_pi0
      params:
          ...
      return: result
```

Without `.alias`, the output step name in DSC database for this DSC block will be `score.R`. With `.alias` however, the output step name will be changed to `score_pi0`. 

Other than enhancing readability, `.alias` is particularly useful when the same `exec` is used in different steps for different purposes. For example, as demonstrated in [the complete version of the example above](../tutorials/Intermediate_R_1.html), the `score.R` routine is used to evaluate both $\pi_0$ and $\beta$ estimates, and it is necessary to distinguish between these two contexts.

`.alias` can also be used along with `.logic` for `exec` to rename composite steps. For example:

```yaml
simulate:
  exec: BM.R, MultiBM.R, PostProcBM.py, PostProcMultiBM.py
  .logic: exec[1] + exec[3], exec[2] + exec[4]
  .alias: BM, MultiBM
```

### .options

*FIXME: this feature is not yet implemented as of Jan 16, 2017. With the cluster support we may end up using a separate cluster job configuration file for it to keep DSC configuration file concise.*

`.options` includes parameters that control the behavior of the corresponding `exec` as it executes, for example:

```yaml
  .options: ncpu = 2, mem = 4G
```

Supported options are:

*  `ncpu`: Number of required CPUs.
*  `mem`: Required memory.
*  `inline`: True or False, indicating whether an R script is executed inline with the next procedure instead of producing return files. This feature is useful when the cost of computation for a procedure is trivial compared to the cost of storing its output. For example, if a simulation procedure is simply `runif(500000)`, it makes more sense to save this line of code and execute it inline with the next step than to save a vector of 500,000 random numbers to disk.

### Scope of optional properties inside `params`

When these properties appear in `params` but outside any `exec[i]`, they also affect all parameters under `exec[i]` where applicable. However, this behavior can be overridden inside `exec[i]` if the same property is re-defined.

Take `.alias` for example:

```yaml
  params:
    mu: $mu
    beta: $beta
    .alias: theta = beta
    exec[2]:
      .alias: theta = mu
```

Then for all `exec` except `exec[2]`, the `theta` parameter will take the value of `beta`, whereas for `exec[2]` the `theta` parameter takes the value of `mu`.
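
The override rule can be sketched in Python as a dictionary merge in which the `exec[i]` level wins over the block level. The sketch assumes that `.alias` renames a parameter (the old name is dropped), and the numeric values stand in for the upstream `$mu` and `$beta`; the helper names are hypothetical.

```python
def effective_alias(block_alias, exec_alias, exec_index):
    """Block-level aliases apply everywhere unless exec[i] redefines them."""
    return {**block_alias, **exec_alias.get(exec_index, {})}

def apply_alias(values, alias):
    """Rename parameters; `alias` maps new name -> existing name."""
    renamed = dict(values)
    for new_name, old_name in alias.items():
        renamed[new_name] = renamed.pop(old_name)
    return renamed

values = {"mu": 0.5, "beta": 2.0}      # placeholder values for $mu and $beta
block_alias = {"theta": "beta"}        # .alias: theta = beta
exec_alias = {2: {"theta": "mu"}}      # exec[2]: .alias: theta = mu

for_exec1 = apply_alias(values, effective_alias(block_alias, exec_alias, 1))
for_exec2 = apply_alias(values, effective_alias(block_alias, exec_alias, 2))
print(for_exec1, for_exec2)
```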

## DSC Block Operators
### Wildcard sigils
There are two types of wildcard sigils: `$` and `$()`.

#### `$` in `exec`
In `exec`, `$` refers to *parameters* defined inside `params` of **the same** block, e.g., parameters `ped` and `K` in this example:

```yaml
  admixture:
      exec: admixture $ped $K
      params:
        K: 3, 6, 9, 15
        ped: ...
```

#### `$` in `params`
In `params`, `$` refers to *return values* from an **upstream block**, for example `$x` in the [Quick Start tutorial](../tutorials/Quick_Start.html).

#### `$()`
`$()` refers to variables defined in `DSC::params`, i.e., the `params` entry under the `DSC` block. For example:

```yaml
  simulate:
      params: 
         methods: $(data_functions)
  ...
  DSC:
      ...
      params:
          data_functions: mvngenotypes, discrete.cosine, discrete.cosine2, discrete.cosine.peaksel
```

is equivalent to

```yaml
  simulate:
      params:
         methods: mvngenotypes, discrete.cosine, discrete.cosine2, discrete.cosine.peaksel
```
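
The `$()` lookup behaves like a textual substitution from `DSC::params`. A minimal Python sketch of that idea, assuming variable names consist of word characters only, is shown below; the function name `expand_globals` is hypothetical and this is not DSC's own code.

```python
import re

def expand_globals(text, dsc_params):
    """Replace each `$(name)` with the value of `name` in DSC::params."""
    # A missing name raises KeyError rather than being silently left in place.
    return re.sub(r"\$\((\w+)\)", lambda m: dsc_params[m.group(1)], text)

dsc_params = {
    "data_functions": "mvngenotypes, discrete.cosine, discrete.cosine2, discrete.cosine.peaksel"
}
expanded = expand_globals("methods: $(data_functions)", dsc_params)
print(expanded)
```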

### group operator
The group operator is `()`, a pair of parentheses that groups parameters as one unit. For example:

```yaml
  exec: method.R, program.exe $K
  params:
     K: (1,2,3), (4,5,6)
```

With `()`, `(1,2,3)` will be translated to the vector assignment `c(1, 2, 3)` in an R plugin, the tuple `(1,2,3)` in a Python plugin, or the space-separated argument sequence `program.exe 1 2 3` for non-plugin routines. Values will be assigned in units defined by `()` instead of separately.
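
The three translations of a group can be illustrated with a small Python helper. The function name `render` and the target labels are hypothetical; the sketch only mirrors the output forms described above.

```python
def render(group, target):
    """Render one parameter group for an R plugin, a Python plugin or a command line."""
    items = [str(value) for value in group]
    if target == "R":
        return "c(" + ", ".join(items) + ")"
    if target == "Python":
        return repr(tuple(group))
    if target == "shell":
        return " ".join(items)
    raise ValueError(f"unknown target: {target}")

# K: (1,2,3), (4,5,6) yields two units, each passed as a whole.
groups = [(1, 2, 3), (4, 5, 6)]
r_form = [render(g, "R") for g in groups]
py_form = [render(g, "Python") for g in groups]
cli_form = ["program.exe " + render(g, "shell") for g in groups]
print(r_form, py_form, cli_form)
```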

### R(), Python(), Shell()
These operators run the code inside the parentheses using the R, Python or Shell interpreter and evaluate the output. For example, `seed: R(1:5)` results in `seed: 1, 2, 3, 4, 5`. This provides a handy tool for generating input parameters with R, Python or Shell.

### ForEach(), Pairs()
These operators combine parameters as a Cartesian product (`ForEach()`) or as paired groups (`Pairs()`), which makes it easier to assign values in DSC. For example:

```yaml
  exec: ForEach(classifier.R, (kernel_1, kernel_2, kernel_3))
```

is equivalent to the Cartesian product logic

```yaml
  exec: classifier.R kernel_1, classifier.R kernel_2, classifier.R kernel_3
```

and
```yaml
  exec: Pairs($(classifier), $(kernel))
...
DSC:
  params:
    classifier: m1.R, m2.R, m3.R
    kernel: k1, k2, k3
```
is equivalent to:

```yaml
  exec: m1.R k1, m2.R k2, m3.R k3
```
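
In Python terms, the two operators correspond to `itertools.product` and `zip`. The sketch below reproduces the two expansions shown above; the function names `for_each` and `pairs` are hypothetical stand-ins, and note that `zip` silently truncates to the shortest input, which may differ from how DSC treats groups of unequal length.

```python
from itertools import product

def for_each(*groups):
    """Cartesian product: every element of each group with every other."""
    return [" ".join(combo) for combo in product(*groups)]

def pairs(*groups):
    """Paired grouping: first with first, second with second, and so on."""
    return [" ".join(combo) for combo in zip(*groups)]

foreach_result = for_each(["classifier.R"], ["kernel_1", "kernel_2", "kernel_3"])
pairs_result = pairs(["m1.R", "m2.R", "m3.R"], ["k1", "k2", "k3"])
print(foreach_result)
print(pairs_result)
```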

### Asis()
In a DSC file, numeric vs. string data types are determined automatically, and there is no need to add quotes to strings. This is convenient in most cases but can be problematic when the input appears to be a string but is in fact, for example, a chunk of R code that should be executed in computational routines rather than converted to a string. The `Asis()` operator safeguards such special input from being treated as a string. For example,

```yaml
  g: Asis(normalmix(c(2/3,1/3),c(0,0),c(1,2)))
```

will result in

```r
  g = normalmix(c(2/3,1/3),c(0,0),c(1,2))
```

But without `Asis()` it will result in

```r
  g = "normalmix(c(2/3,1/3),c(0,0),c(1,2))"
```

which is problematic.

`Asis()` can also be used to indicate a *raw string*, for example:

```yaml
  g: (1, 2, 3)
  k: ('1', '2', '3')
  l: (Asis('1', '2', '3'))
```

will result in

```r
  g = c(1, 2, 3)
  k = c('1', '2', '3')
  l = c("'1'", "'2'", "'3'")
```

### File()

*FIXME: multiple return files are not yet implemented as of Jan 16, 2017.*

*Also File() without extension might be replaced by Temp()*

#### Extension specified
When a parameter is a file name but the file is yet to be created (by the current computational step), the `File()` operator must be applied to the parameter to indicate that it represents a file or files that do not yet exist. The `File()` operator only specifies **file extensions**, and DSC will automatically give the file a unique basename in the system [temporary folder](https://en.wikipedia.org/wiki/Temporary_folder). For example:

```yaml
      params:
          data: $sim
          K: 1, 2, 3, 4, 5
          ped: File(ped, map)
          score: File(score)
      return: score
```

Here the return value `score` is a file, likely generated by the computational routine taking input `data` and parameter `K` with 5 values. As a result, the returned `score` should be 5 files. Names of these 5 output `score` files will be automatically assigned, and users do not have to worry about file name specifications for different combinations of parameters. We allow for a simple interface to [annotate and query](DSC_Annotation.html) the output files upon completion of DSC so that users can keep track of the outcome from particular sets of parameters.

When there are multiple extensions, for example a group of `(ped, map)`, the resulting files are also in groups, e.g., `xxx1.ped, xxx1.map`, ... `xxx5.ped, xxx5.map`.

Only files in `return` will be registered and saved. Other files are considered temporary and DSC will not keep track of them. These files will be written to the temporary folder of your system.

#### Extension not specified
When `File()` is used without specifying an extension, it generates a unique string that you can potentially use as a file prefix, step identifier, etc. Unlike `File(ext)`, the string that `File()` generates does not include a path to the temporary folder. It is a temporary variable rather than a temporary file.

### List() or Dict()

*FIXME: partial conversion is not yet implemented as of Jan 16, 2016.*

This is used in `.alias` entries to help construct variables for plugin executables. For example for an R plugin

```yaml
  .alias: args = List()
```

This will convert all variables in the corresponding parameter space, say `x`, `y`, `z`, to `args = list(x = ..., y = ..., z = ...)`. Likewise, for a Python plugin the parameters will be converted to a dictionary. Partial conversion is also supported: for example, `args = List(x, y)` will convert only the selected variables, which translates to the R code `args <- list(x = x, y = y)`.

The operator has two names in order to indicate that it binds variables to a `list` in R and to a `dict` in Python. However, the two names can be used interchangeably. Using the one that matches the executable type only helps with readability of the DSC configuration file.

### Index and slicing
Index can be used in the following context:

*  Index for parameters in an `exec` entry, for example `exec: makeped.py $data $output[1]` where the `output` parameter takes the form `output: (1.ped, 1.map), (2.ped, 2.map)`. In this case `output[1]` will only use the first value of each parameter group.
*  Index for `exec`. Each element corresponds to a computational routine in `exec`.
*  Index for `params` values when they appear in `.logic` inside `params`; each element corresponds to a subset of the `params` values involved.
*  Index for block names in `DSC::run` sequence; each element corresponds to a computational routine in `exec` for the block involved.

Slicing syntax is allowed. For example, `n[1,2,4]` extracts the first, second and fourth elements of `n`. `n[1:4]` extracts elements 1 through 4, and `n[1:9:2]` extracts elements 1, 3, 5, 7, 9.
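
Unlike Python's own slices, the indices here are 1-based and the end of a range is inclusive. The following Python sketch implements that reading of the three forms above; the function name `dsc_slice` is hypothetical and the sketch does no validation beyond what `int()` and list indexing provide.

```python
def dsc_slice(values, spec):
    """Select elements using 1-based indices with an inclusive end.

    `spec` is the text inside the brackets, e.g. "1,2,4", "1:4" or "1:9:2".
    """
    if ":" in spec:
        parts = [int(p) for p in spec.split(":")]
        start, stop = parts[0], parts[1]
        step = parts[2] if len(parts) == 3 else 1
        positions = range(start, stop + 1, step)
    else:
        positions = [int(p) for p in spec.split(",")]
    return [values[p - 1] for p in positions]

n = [10, 20, 30, 40, 50, 60, 70, 80, 90]
print(dsc_slice(n, "1,2,4"))
print(dsc_slice(n, "1:4"))
print(dsc_slice(n, "1:9:2"))
```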

## Block Inheritance
When one wants to write a new block that is similar in configuration to an existing block, block inheritance can be used to keep the new block definition succinct. For example:

```yaml
  SVA:
      exec: SVA.R
      params:
          data: $data
          .alias: args = List()
      return: data

  RUV(SVA):
      exec: RUV.R

  voom(SVA):
      exec: voom.R
```

Here, the 3 blocks differ only in the executable name. With block inheritance, we can completely configure `SVA`, then inherit from it to configure `RUV` and `voom`, where only `exec` has to be re-defined.
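
Inheritance can be pictured as copying the parent block and replacing the properties that the child re-defines. The Python sketch below assumes replacement happens at the level of whole properties (a shallow override), which is a simplification that the text above does not confirm; the `params` entry is abbreviated and the function name `inherit` is hypothetical.

```python
import copy

def inherit(parent, overrides):
    """Copy the parent block, then replace any re-defined properties."""
    child = copy.deepcopy(parent)
    child.update(overrides)
    return child

# Abbreviated version of the `SVA` block above.
sva = {"exec": "SVA.R", "params": {"data": "$data"}, "return": "data"}

ruv = inherit(sva, {"exec": "RUV.R"})    # RUV(SVA)
voom = inherit(sva, {"exec": "voom.R"})  # voom(SVA)
print(ruv)
print(voom)
```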

## Comment string
As in many scripting languages (R, Python, shell), `#` can be used to add human-readable comments to a DSC script.