# DSC syntax: basics of modules

In this document we mainly use toy DSC examples (including breaking down [this toy](https://github.com/stephenslab/dsc2/blob/master/vignettes/one_sample_location/settings.dsc)) to introduce basic DSC syntax in detail (please refer to [Terminology](Terminology.html) and [Design and Features](Design_and_Features.html) for a broader discussion of this example). We endeavor to keep the syntax for configuring a DSC simple. This documentation covers the basic DSC syntax so that users can get started quickly. Specific use cases, such as using other languages / command-line tools and working with remote computers or HPC clusters, are covered separately in dedicated tutorial examples.

A DSC file consists of one or more syntax blocks for module definitions, and one block for benchmark execution specifications. Here we focus on discussing module blocks. DSC benchmark execution will be discussed in [a separate documentation page](DSC_Execution.html).

## Modules

A module block consists of:

* `headline`
* `input`
* `parameters`
* `output`

and, optionally, `decorators` that customize modules in the same block. Note that these exact terms are not keywords in the module syntax; as will be illustrated shortly, the components are instead distinguished from each other syntactically.

### Module headline

Modules, the building blocks of DSC, can be minimally specified as follows:

```
normal, t: rnorm.R, rt.R
    n: 1000
    $x: x
```

Here we focus on the first line (aka the "headline"):

```
normal, t: rnorm.R, rt.R
```

The two elements separated by `:` are module names and executables. The module named `normal` corresponds to `rnorm.R`, and `t` corresponds to `rt.R`. When there are multiple module names on the left of `:` but only one executable on the right, e.g.:

```
normal, t: simulate.R
```

then the two modules will share the same executable (yet, to be discussed later, different module parameters).

### Module inheritance

Modules can be inherited on the headline as a shorthand to define new modules based on existing ones. For example:

```
normal, t: rnorm.R, rt.R
    n: 1000
    $x: x
    
shifted_normal(normal): 
    mu: 1
```

`shifted_normal` is a derived module from `normal`. The complete definition of `shifted_normal`, after expanding with the derived contents, is in fact:

```
shifted_normal: rnorm.R
    n: 1000
    mu: 1
    $x: x
```

The inheritance syntax makes module definitions not only succinct but also conceptually clearer -- in this example we can tell that `shifted_normal` is essentially `normal` with additional parameters.
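To make the expansion concrete, here is a minimal Python sketch of inheritance as "copy the base module, then layer the new entries on top". The dictionary layout and the `derive` helper are illustrative assumptions, not DSC's internal representation.

```python
# Hypothetical in-memory form of the `normal` module block (an assumption
# made for illustration; DSC may store modules differently).
normal = {"exec": "rnorm.R", "params": {"n": 1000}, "outputs": {"x": "x"}}


def derive(base, params=None, outputs=None):
    """Return a new module that copies `base` and adds or overrides entries."""
    return {
        "exec": base["exec"],
        "params": {**base["params"], **(params or {})},
        "outputs": {**base["outputs"], **(outputs or {})},
    }


# Mirrors `shifted_normal(normal):` with the extra line `mu: 1`.
shifted_normal = derive(normal, params={"mu": 1})
print(shifted_normal)
```

The base module is left untouched, so `normal` and `shifted_normal` can be used side by side.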

### Module parameters

Module parameters are variables to be used with module executables. Each parameter is an indented entry under the module headline, with parameter names corresponding to variable names, or command-line arguments, of module executables.

In the example above, `n` is a module parameter for both modules `normal` and `t`, corresponding to the variable `n` in both [rnorm.R](https://github.com/stephenslab/dsc2/blob/master/vignettes/one_sample_location/R/scenarios/rnorm.R) and [rt.R](https://github.com/stephenslab/dsc2/blob/master/vignettes/one_sample_location/R/scenarios/rt.R) scripts. Here `n` is set to `1000`. When set to an array-like value, e.g. `n: 100, 500, 1000`, then effectively 3 modules are defined for each of `normal` and `t`. Take `normal` for example:

* module 1: `rnorm.R` with `n` set to 100
* module 2: `rnorm.R` with `n` set to 500
* module 3: `rnorm.R` with `n` set to 1000

It is also possible to specify a parameter with groups of values. For example, `p: (0.1, 0.9), (0.2, 0.8)` will initialize 2 modules, each with a parameter `p` of length 2.

When multiple parameters are specified, eg:

```
    n: 10, 20
    p: 0.1, 0.2
```

Combinations of parameters (Cartesian product style by default) will be assigned to all modules unless otherwise instructed via the [`@FILTER`](#Decorator-FILTER-123) decorator. In the example above each module will take 4 sets of parameters: `(n = 10, p = 0.1), (n = 10, p = 0.2), (n = 20, p = 0.1), (n = 20, p = 0.2)`.
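The default expansion can be mimicked in a few lines of Python. This is only a sketch of the Cartesian-product rule described above; the `expand` helper is a hypothetical name, not part of DSC.

```python
from itertools import product


def expand(params):
    """List every combination of parameter values (Cartesian product)."""
    names = list(params)
    return [dict(zip(names, values)) for values in product(*params.values())]


# Mirrors `n: 10, 20` and `p: 0.1, 0.2`.
instances = expand({"n": [10, 20], "p": [0.1, 0.2]})
for instance in instances:
    print(instance)

# Grouped values stay together: `p: (0.1, 0.9), (0.2, 0.8)` gives 2 instances,
# each holding a length-2 value for p.
grouped = expand({"p": [(0.1, 0.9), (0.2, 0.8)]})
print(grouped)
```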

### Module output

Module outputs are pipeline variables that can be accessed by other downstream modules in the pipeline. These variables have a leading `$` symbol followed by variable names. For example:

```
normal, t: rnorm.R, rt.R
    n: 1000
    $x: x
```

`$x` on the left hand side of `:` is a module output. Under the hood, the [`rnorm.R`](https://github.com/stephenslab/dsc2/blob/master/vignettes/one_sample_location/R/scenarios/rnorm.R) script generates a variable `x`, which is a vector of `n` normally distributed numbers. By specifying `$x: x`, `x` becomes a "module output". That is, other downstream steps will be able to use it via `$x` as their input, as will be discussed later in detail.

Language-specific syntax can be used to extract specific data from objects for use as module output. For example `beta = R(data$meta_info$beta)` extracts `beta` from an `R` nested list that stores meta information, and uses the extracted data as module output `beta`.


For modules whose executables are R or Python scripts, a module output can match any arbitrary variable name inside the script, and thus may or may not appear in module parameters. For example, in the above code block `x` is not a module parameter. For command-line executables, a module output has to be one of the module parameters.

**FIXME: give a tutorial on using command-line tools with DSC**

### Module input

Module inputs are pipeline variables generated by other upstream modules in the pipeline. Like module output, module input uses the pipeline variable syntax: `$` followed by a variable name. What distinguishes module input from output is whether the `$` variable appears on the left side of `:` (output) or the right side (input). For example:

```
normal, t: rnorm.R, rt.R
    n: 1000
    $x: x

shrink: method.R
    x: $x
```

Here in the `shrink` module, `$x` is a module input, whose value will be assigned to a variable `x` that will be used in the `method.R` script. The runtime environment determines which module this variable comes from. For example, if the pipeline is `normal -> shrink` then module input `$x` of `shrink` is output `$x` of the `normal` module.
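The flow of a pipeline variable from an upstream output to a downstream input can be sketched as follows. The two Python functions are placeholders standing in for `rnorm.R` and `method.R` (their bodies are assumptions, not the real scripts), and `run_pipeline` is a hypothetical driver, not DSC's execution engine.

```python
import random
import statistics


def normal(n):
    """Stand-in for rnorm.R: creates a script variable x."""
    x = [random.gauss(0, 1) for _ in range(n)]
    return {"x": x}  # `$x: x` publishes x as pipeline variable $x


def shrink(x):
    """Stand-in for method.R: receives x through `x: $x`."""
    return {"x_mean": statistics.mean(x)}  # hypothetical output


def run_pipeline(steps):
    """Run modules in order, wiring inputs from earlier outputs."""
    pipeline_vars = {}
    for func, params, inputs in steps:
        kwargs = dict(params)
        for local_name, var_name in inputs.items():
            kwargs[local_name] = pipeline_vars[var_name]
        pipeline_vars.update(func(**kwargs))
    return pipeline_vars


# Pipeline `normal -> shrink`: $x of shrink is the $x produced by normal.
result = run_pipeline([
    (normal, {"n": 1000}, {}),
    (shrink, {}, {"x": "x"}),
])
print(len(result["x"]), result["x_mean"])
```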

## Decorators

The word "decorator" is borrowed from the term "decorator pattern" in object-oriented programming (a design pattern that allows behavior to be added to an individual object without affecting the behavior of other objects from the same class). A DSC decorator, when applied, modifies the behavior of the specified modules. Available decorators are:

* @ALIAS
* @FILTER
* @CONF

Additionally, when a module name is used as a decorator, the specifications underneath it apply only to that module.

### Module as decorator

A module decorator specifies parameters or inputs unique to one module and not applicable to others. For example:

```
normal, t: rnorm.R, rt.R
    n: 1000
    @t:
        df: 5, 10
...
```

where `n` is shared by both modules but parameter `df` is only used by module `t`. A module decorator always takes precedence over parameters / inputs shared by modules, whether it assigns new variables or modifies existing ones. That is:

```
normal, t: rnorm.R, rt.R
    n: 1000
    @t:
        n: 200
...
```
will set `n = 200` for `t`, rather than using `n = 1000`. 

Note that, to ensure that outputs are the same across all modules in the block, module decorators cannot be used to configure module output variables.
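The precedence rule amounts to overlaying the module-specific entries on the shared ones. A minimal Python sketch, combining the two examples above into one hypothetical block (`n` overridden and `df` added for `t`); the `resolve` helper is an illustrative name only:

```python
# Parameters shared by every module in the block.
shared = {"n": 1000}

# Entries given under a module decorator such as `@t:`.
per_module = {"t": {"n": 200, "df": [5, 10]}}


def resolve(module, shared, per_module):
    """Module-specific entries win over shared ones with the same name."""
    return {**shared, **per_module.get(module, {})}


print(resolve("normal", shared, per_module))
print(resolve("t", shared, per_module))
```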

### Decorator ALIAS

`@ALIAS` is used to adjust the way inputs and parameters are passed to modules. DSC module parameters are of "simple" types (single elements or arrays of single elements). In practice, parameter specifications may get complicated. `@ALIAS` can be used to compose a commonly used parameter input pattern in which all parameters are nested inside a "key-value" structure.

For example, an R script `method.R` accepts one "List" variable `args` that contains `var_a` and `var_b`:

```
f1 = function(x) { ... }
f2 = function(x) { ... }
result = f1(args$var_a) / f2(args$var_b)
```

Then instead of modifying `method.R` to take "simple" parameters, we can use `@ALIAS`,

```
method: method.R
    a: 1
    b: 2
    @ALIAS: args = List(var_a = a, var_b = b)
```

such that `a` and `b` are properly consolidated into an R List `args` as `var_a` and `var_b` respectively. Currently we allow for `List()` and `Dict()` for R's List and Python's Dictionary though more object types may be supported as needed in future versions.

Another use of `@ALIAS` is to adjust parameter names for a specific module. For example, suppose the variable is called `n` in script `rnorm.R` but `n_samples` in `rt.R`; then we can use `@ALIAS` to adjust module `t`:

```
normal, t: rnorm.R, rt.R
    n: 1000
    @ALIAS:
        t: n_samples = n
```

Then `n` is renamed to `n_samples` for input to module `t`.
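Both uses of `@ALIAS` are simple re-mappings of names. The following Python sketch imitates them with dictionaries; `consolidate` and `rename` are hypothetical helpers written for this illustration and do not correspond to DSC functions.

```python
def consolidate(params, mapping):
    """Imitate `args = List(var_a = a, var_b = b)`: nest values under new keys."""
    return {new_key: params[old_name] for new_key, old_name in mapping.items()}


def rename(params, renames):
    """Imitate `t: n_samples = n`: pass a parameter under a different name."""
    return {renames.get(name, name): value for name, value in params.items()}


# First usage: a and b are bundled into one key-value object called args.
args = consolidate({"a": 1, "b": 2}, {"var_a": "a", "var_b": "b"})
print(args)

# Second usage: module t receives n under the name n_samples.
t_params = rename({"n": 1000}, {"n": "n_samples"})
print(t_params)
```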

### Decorator FILTER

**FIXME: add link to syntax documentation for condition statements**

`@FILTER` can be used to modify the default logic by which all parameters are combined in Cartesian product style. The `@FILTER` syntax is the same as the `condition` statement and is documented elsewhere. Here is an example:

```
normal, t: normal.R, t.R
    n: 100, 200, 300, 400, 500
    k: 0, 1
    @FILTER: (t.n <= 300 and t.k = 0) or (t.n > 300 and t.k = 1),
             (normal.n = 500)
```

Without `@FILTER`, DSC will exhaust all combinations of 5 values of `n` and 2 of `k` for both modules, creating a total of 20 parallel jobs. The `@FILTER` here states that, instead of running all 20 jobs, DSC will run `t` with the first 3 values of `n` and `k = 0`, then with the remaining 2 values of `n` and `k = 1`, for a total of 5 jobs for `t`. It then runs `normal` with `n = 500` combined with both values of `k`, a total of 2 jobs. Therefore only 7 jobs will be executed after applying the `@FILTER`.
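The job count can be checked by enumerating the combinations in Python. This sketch only reproduces the arithmetic of the example; expressing each filter as a Python predicate is our own simplification and not how DSC parses `@FILTER`.

```python
from itertools import product

n_values = [100, 200, 300, 400, 500]
k_values = [0, 1]

# One predicate per module, mirroring the two filter conditions.
filters = {
    "t": lambda n, k: (n <= 300 and k == 0) or (n > 300 and k == 1),
    "normal": lambda n, k: n == 500,
}

all_jobs = [
    (module, n, k)
    for module in filters
    for n, k in product(n_values, k_values)
]
kept_jobs = [(module, n, k) for module, n, k in all_jobs if filters[module](n, k)]

print(len(all_jobs))   # every combination for both modules
print(len(kept_jobs))  # combinations that pass the filter
for job in kept_jobs:
    print(job)
```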


When no module name is specified, e.g. `.n`, all modules will be subject to the filter:
```
    @FILTER: (.n in (100,200,300) and .k = 0)
```

### Decorator CONF

`@CONF` provides an interface for configuring how the modules in the current block should be executed. Most `@CONF` options are related to running in a remote computational environment, such as a single workstation server or HPC clusters.

**FIXME: termed `.option` in version 0.2.2, the feature is currently removed due to active changes on SoS task models. It will be added back in DSC release 0.2.5**

## Operators

### Substitution operator

The substitution operator `$()` refers to variables defined in `DSC::globals`.

```
simulate_cosine: cosine.R
    types: $(data_functions)
...

DSC:
    globals: 
        data_functions: discrete.cosine, discrete.cosine2, discrete.cosine.peaksel
```

is equivalent to

```
simulate_cosine: cosine.R
    types: discrete.cosine, discrete.cosine2, discrete.cosine.peaksel
```
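Conceptually, substitution is a textual replacement of `$(name)` by the matching entry in `globals`. A rough Python sketch of that idea, using a regular expression of our own choosing (DSC's actual parser may work differently):

```python
import re

# Mirrors the `DSC: globals:` section of the example.
dsc_globals = {
    "data_functions": "discrete.cosine, discrete.cosine2, discrete.cosine.peaksel",
}


def substitute(value, dsc_globals):
    """Replace every $(name) in `value` with the corresponding global."""
    return re.sub(r"\$\((\w+)\)", lambda match: dsc_globals[match.group(1)], value)


print(substitute("$(data_functions)", dsc_globals))
```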

### Tuple operator
Tuple operator `()` simply creates grouped values as module parameters, for example

```
method: method.R
    K: (1,2,3), (4,5,6)
```

With `()`, `(1,2,3)` will be translated as a group to `c(1, 2, 3)` in `R`, `(1,2,3)` in `Python`, or space separated argument sequence `1 2 3` for command-line tools. Values will be assigned in groups defined by `()` instead of separately.
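The three translations mentioned above can be pictured with a small Python helper. `render_group` is a hypothetical function written for this illustration; it only reproduces the output formats quoted in the text.

```python
def render_group(values, target):
    """Render one grouped value for the given target language."""
    items = [str(value) for value in values]
    if target == "R":
        return "c(" + ", ".join(items) + ")"
    if target == "Python":
        return "(" + ",".join(items) + ")"
    if target == "shell":
        return " ".join(items)
    raise ValueError("unknown target: " + target)


# `K: (1,2,3), (4,5,6)` defines two groups; each is passed as a whole.
for group in [(1, 2, 3), (4, 5, 6)]:
    print(render_group(group, "R"), "|", render_group(group, "shell"))
```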

### Language interpreters 

Currently `R()`, `Python()`, and `Shell()` are supported. These operators run the code inside the parentheses using the corresponding interpreter and evaluate the output. For example `seed: R(1:5)` results in `seed: 1, 2, 3, 4, 5`. This provides a handy tool for generating input parameters with the R, Python or Shell languages.

**FIXME: advanced behavior of `R()` is yet to be implemented, via the companion package.**

### Grouping operators
Currently `for_each()` and `pairs()` are supported to generate Cartesian product and paired groupings of parameters. These operators make it easier to assign values in DSC. For example:

```
n: for_each(1, [1,2,3])
```

is equivalent to the cartesian product

```
n: (1,1), (1,2), (1,3)
```

and
```
...
  settings: pairs($(classifier), $(kernel))
...
DSC:
  globals:
    classifier: svm, ridge
    kernel: k1, k2
```
is equivalent to:

```
...
  settings: (svm, k1), (ridge, k2)
...
```
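The two grouping rules correspond to `itertools.product` and `zip` in Python. The functions below are Python stand-ins that share the operators' names for readability; they are a sketch of the semantics shown in the examples, not DSC code.

```python
from itertools import product


def for_each(*groups):
    """Cartesian product; a bare value is treated as a one-element group."""
    groups = [g if isinstance(g, (list, tuple)) else [g] for g in groups]
    return list(product(*groups))


def pairs(*groups):
    """Element-wise pairing of equally long groups."""
    return list(zip(*groups))


print(for_each(1, [1, 2, 3]))
print(pairs(["svm", "ridge"], ["k1", "k2"]))
```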


### ALIAS operators

**FIXME: partial conversion will be available in version 0.2.5.**

As previously discussed, `List()` and `Dict()` are used in `@ALIAS` decorator to help construct variables of nested `List` or `dictionary` structures in `R` and `Python`. For example for `R` modules:

```
@ALIAS: args = List()
```
 
will convert all inputs / parameters the module has available, say `x,y,z`, to `args <- list(x = ..., y = ..., z = ...)`. Likewise, `Dict()` will convert parameters to a `dictionary` in Python. Partial conversion is also supported; for example, `args = List(x, y)` will convert only the selected variables to an R list, which translates to the R code `args <- list(x = x, y = y)`.

### Raw string indicator
In a DSC file, numeric vs. string data types are automatically determined and there is no need to add quotes (`'` or `"`) to strings. This is convenient in most cases but can be problematic when input is automatically converted to a quoted string when in fact it should not be -- for instance, when we want to pass in a chunk of R code that should be executed only at runtime, not when constructing the DSC. The `raw()` indicator protects such input from having quotes added. For example,

```
g: raw(normalmix(c(2/3,1/3),c(0,0),c(1,2)))
```

will result in the `R` code

```r
g = normalmix(c(2/3,1/3),c(0,0),c(1,2))
```

which is what we'd expect. But without `raw()` it will result in `R` code

```r
g = "normalmix(c(2/3,1/3),c(0,0),c(1,2))"
```

which is problematic.
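The difference between the two results comes down to whether the value is quoted when the R assignment is written out. A simplified Python sketch of that decision, where the `Raw` marker class and `to_r_assignment` are hypothetical names used only for illustration:

```python
class Raw(str):
    """Marks text that must be emitted verbatim, as `raw()` does."""


def to_r_assignment(name, value):
    """Write `name = value` as R code, quoting ordinary strings only."""
    if isinstance(value, Raw):
        rhs = str(value)          # protected: no quotes added
    elif isinstance(value, str):
        rhs = '"' + value + '"'   # ordinary string: quoted
    else:
        rhs = repr(value)         # numbers are written as-is
    return name + " = " + rhs


code = "normalmix(c(2/3,1/3),c(0,0),c(1,2))"
print(to_r_assignment("g", Raw(code)))
print(to_r_assignment("g", code))
```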

Since `raw()` protects its input from any type of conversions, it can be used to pass in *strings* verbatim, for example:

```
g: (1, 2, 3)
k: ('1', '2', '3')
l: (raw('1', '2', '3'))
```

will result in

```r
g = c(1, 2, 3)
k = c('1', '2', '3')
l = c("'1'", "'2'", "'3'")
```

### File generator

**FIXME: multiple output file generator objects are not yet allowed in 0.2.5.**

File generator `file()` will generate two types of files:

1. With an extension specified, eg. `file(txt)` or equivalently `file(.txt)` it generates files with unique names in DSC output folder.
2. Without an extension it generates a temporary file.

#### Extension specified

When a parameter specifies a filename yet to be created by a module instance, `file()` can be used to ensure that the file is properly generated without racing with other module instances. This is achieved by properly naming the file for each module instance. `file()` takes a **file extension**, and DSC will automatically give it a unique and informative basename. The file can either be a module output that DSC will keep track of as a pipeline variable, or an intermediate file that will not be used in other modules. For example:

```
kinship: kinship.sh ...
    data: $sim
    K: 1, 2, 3, 4, 5
    plink_bed: file(bed)
    $score: file(score)
```

Here the output `score` is a file, generated by the module taking parameter `data` and parameter `K`. This will result in multiple `score` output files. Names of these `score` files will be automatically assigned so that users do not have to worry about naming them.

Parameter `plink_bed` is also a file, but it is not a module output. This will create a situation where a properly named file `*.bed` is generated to the output directory yet we do not keep track of it: this means deleting this file manually afterwards will not trigger rerun of the pipelines involved. We keep these files just in case we may want to examine them to debug.

One can also use the `.` notation for file extensions to enhance readability, for example `file(.bed)` is equivalent to `file(bed)`.
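To illustrate why per-instance names avoid races, here is one possible naming scheme sketched in Python: the basename is derived from the module name and a hash of the instance's parameters. This scheme, the `unique_path` helper and the `dsc_output` folder name are all assumptions made for illustration; the names DSC actually generates may look different.

```python
import hashlib
import json


def unique_path(module, params, ext, outdir="dsc_output"):
    """Build a file path that differs for every distinct parameter set."""
    ext = ext.lstrip(".")  # `file(.bed)` and `file(bed)` are treated alike
    encoded = json.dumps(params, sort_keys=True).encode()
    digest = hashlib.md5(encoded).hexdigest()[:8]
    return outdir + "/" + module + "_" + digest + "." + ext


# One `score` file per value of K, mirroring `K: 1, 2, 3, 4, 5`.
paths = [unique_path("kinship", {"K": k}, "score") for k in [1, 2, 3, 4, 5]]
for path in paths:
    print(path)
```

Because each module instance writes to its own path, instances running in parallel never overwrite each other's files.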

#### Extension not specified

When `file()` is used without specifying an extension, it generates a file path in the system [temporary folder](https://en.wikipedia.org/wiki/Temporary_folder) that can be used as a "temp file". The file will not appear in users' work directory, and as with other system temp files there is generally no need to worry about its management / maintenance. In the example above the parameter `plink_bed` can well be made a temporary file if users deem that there is no need to keep it for troubleshooting:

```
...
    plink_bed: file()
...
```

## Improving DSC script readability
### Use comments
As with many scripting languages (R, Python, shell), `#` can be used to add human-readable explanations of the contents of a DSC script. We suggest using a single `#` to distinguish between the various types of syntax groups introduced in this document, and a double `##` to annotate parameters. For example:

```
# Various modules to estimate location parameter
mean, median: mean.R, median.R
    # parameters
    ## set to 0 for no winsorization
    winsorize: 0, 0.02
    # input
    x: $x
    # output
    $mean: mean
    # decorators
    @ALIAS:
        median: w = winsorize
```

### Turn on YAML syntax highlighter
Although DSC scripts are not compatible with YAML, the syntax highlighter for YAML is good enough to enhance the readability of DSC scripts. You may turn on the YAML syntax highlighter in your text editor when composing DSC scripts.