# DSC Basics, Part II

This is the second part of the "DSC Basics" tutorial. Before working through this tutorial, you should have already read [DSC Basics, Part I](Intro_Syntax_I.html). Here we build on the mean estimation example from the previous part to illustrate new concepts and syntax in DSC, with an emphasis on the use of *module parameters*.

Materials used in this tutorial can be found in the [DSC vignettes repository](https://github.com/stephenslab/dsc/tree/master/vignettes/one_sample_location). As before, you may choose to run this example DSC program as you read through the tutorial, but this is not required. For more details, consult the README in the ["one sample location" DSC vignette](https://github.com/stephenslab/dsc/tree/master/vignettes/one_sample_location).

## Adding a module parameter to `normal`

In our example DSC, recall we defined the `normal` module as follows:

```
normal: R(x <- rnorm(n = 100,mean = 0,sd = 1))
  $data: x
  $true_mean: 0
```

Here we propose to make a slight improvement to this module by adding a *module parameter*, `n`:

```
normal: R(x <- rnorm(n,mean = 0,sd = 1))
  n: 100
  $data: x
  $true_mean: 0
```

We have defined a module parameter `n` and set its value to 100. Once we have defined `n`, any of the R code may refer to this module parameter. In the R code, the first argument of `rnorm` is set to the value of `n` (which is 100).

In this first example, there is not much benefit to defining a module parameter `n`. In the examples below, the advantages of module parameters will become more apparent.
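
As a rough illustration in plain R (this is not DSC itself), the module above behaves much as if the parameter `n` were assigned to a variable before the inline code runs. The sketch below uses only base R; the `set.seed` call is our addition to make the sketch repeatable and is not part of the module.

```r
# Plain-R sketch of what the `normal` module computes.
# DSC would define `n` for us; here we assign it by hand.
set.seed(1)                      # added for repeatability; not part of the module
n <- 100                         # stands in for the module parameter `n: 100`
x <- rnorm(n, mean = 0, sd = 1)  # the module's inline R code
length(x)                        # number of simulated samples
```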

## Adding a second module parameter to `normal`

In our current design of `normal`, we made an unfortunate choice: the mean used to simulate the data is defined twice, once inside the call to `rnorm`, where we set `mean = 0`, and once when we set the module output `$true_mean` to zero. If we decide to use a different mean to simulate the data, then we would have to be careful to change the code in two different places. 

It would be better if the mean of the data was defined once. This can be accomplished with a module parameter, which we will name `mu` (the Greek letter conventionally used to denote the mean):

```
normal: R(x <- rnorm(n,mean = mu,sd = 1))
  n: 100
  mu: 0
  $data: x
  $true_mean: mu
```

Here, we have defined a second module parameter `mu`, and set its value to zero. Now the `mean` argument of `rnorm` can be set to the value of `mu`.

Additionally, because DSC defines a script variable `mu` with the same value as the module parameter, the module output `$true_mean` can be set to the value of the script variable `mu`. (In this example, the value of the module parameter is the same as the value of the variable `mu` used in the R code. In some cases, however, the R code might modify `mu`, and then the module parameter and the script variable will differ, so it is important to keep these two quantities distinct.)

With this change to the module definition, modifying the mean used to simulate the data only requires editing one line of code instead of two.
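
To make the distinction between a module parameter and a script variable concrete, here is a small plain-R sketch. The names `mu_param`, `true_mean` and `true_mean_shifted` are hypothetical and exist only for this illustration; DSC does not create them.

```r
# Hypothetical illustration: the value stored for the module parameter
# (mu_param) versus the script variable that the R code can overwrite (mu).
mu_param <- 0        # value of module parameter `mu: 0`
mu <- mu_param       # script variable with the same name and value
n <- 100
x <- rnorm(n, mean = mu, sd = 1)
true_mean <- mu      # `$true_mean: mu` reads the script variable

# If the script modified mu, the output would follow the script variable,
# while the module parameter would keep its original value.
mu <- mu + 3
true_mean_shifted <- mu
c(parameter = mu_param, output = true_mean_shifted)
```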

Likewise, we can use a module parameter to specify the mean of the data simulated from a *t* distribution:

```
t: R(x <- mu + rt(n,df = 2))
  n: 100
  mu: 3
  $data: x
  $true_mean: mu
```

Note that there is no requirement that the module parameters for the `normal` and `t` modules have the same name, `mu`, but in this case it makes sense to do so. One advantage of defining parameters with the same name is that it makes the results easier to query.

## The order of evaluation inside a module

In the examples above, we informally introduced the notion of a *module parameter.* Below, we will give some more elaborate examples with module parameters, so here we take a moment to describe more formally how a module parameter behaves in relation to other components of a module:

+ A module parameter cannot depend on any of the module inputs, and it can only depend on other module parameters through use of the `raw` keyword (this is explained below). In other words, it must be possible to evaluate the module parameter without knowing the values of the module inputs or the values of the other module parameters (again, the one exception is when `raw` is used).

+ Module parameters are evaluated before the module script (except when `raw` is used—see below). The exact procedure for evaluating a module is as follows:
 
    1. Evaluate any R code used to determine the values of the module parameters (we give an example of this below).
    
    2. Set the values of the module parameters (except for module parameters defined with `raw`).
    
    3. Initialize the module inputs according to the current stored values of the pipeline variables.
    
    4. For each module parameter and module input, define a *script variable* in the global environment in which the script is evaluated with the same name and same value as the module parameter or input. 
    
    5. Evaluate the module script or inline source code. All script variables are retained for resolving any module outputs.
    
    6. Set each module output to the stored value of the selected script variable.

If this evaluation procedure is unclear to you at this stage, it will become clearer as we work through the examples below.
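
The evaluation order can also be sketched in plain R. The helper `run_module` and its arguments are hypothetical names chosen for this illustration, not part of DSC, and the sketch evaluates the script in a fresh environment rather than the global environment. It ignores `raw` parameters and step 1 (R code used to compute parameter values).

```r
# Hypothetical helper mimicking the evaluation order described above.
# `run_module` is an illustrative name, not a DSC function.
run_module <- function(params, inputs, script, outputs) {
  env <- new.env()
  # Steps 2-4: module parameters and inputs become script variables.
  for (name in names(params)) assign(name, params[[name]], envir = env)
  for (name in names(inputs)) assign(name, inputs[[name]], envir = env)
  # Step 5: evaluate the module script with those variables in scope.
  eval(parse(text = script), envir = env)
  # Step 6: each output takes the value of the selected script variable.
  lapply(outputs, function(v) get(v, envir = env))
}

out <- run_module(
  params  = list(n = 100, mu = 0),
  inputs  = list(),
  script  = "x <- rnorm(n, mean = mu, sd = 1)",
  outputs = list(data = "x", true_mean = "mu")
)
str(out)
```

Because the outputs are resolved only after the script has run, an output such as `true_mean` reflects whatever value the script variable holds at that point.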

## A single module parameter with multiple alternative values

Above, we gave a couple of examples of defining module parameters. Here, we will demonstrate an important feature of module parameters: they can be used to define multiple modules that are similar to each other.

Our current definition of the `normal` module simulates 100 random samples from a normal distribution. Suppose we would like to define a second module that simulates 1,000 random samples from the same normal distribution. This is easily done by giving the module parameter `n` two different values:

```
normal: R(x <- rnorm(n,mean = mu,sd = 1))
  mu: 0
  n: 100, 1000
  $data: x
  $true_mean: mu
```

The comma delimits the two possible values of module parameter `n`.

Now that we have defined `n` inside this module, we can refer to this module parameter inside the R code that simulates random draws from a normal distribution, as in the example above.

To be precise, this code defines a *module block* with two modules. It is equivalent to defining two modules, `normal_100` and `normal_1000`, that are identical in every way except that the first module includes parameter definition `n: 100` and the second includes `n: 1000`. The module block above is of course much more succinct.

The line `n: 100, 1000` should not be interpreted as defining a vector or sequence with two entries, 100 and 1000. It defines a *set of alternative values*. To put it another way—and this is the terminology we use frequently—`n: 100, 1000` defines two *alternative values* for module parameter `n`, and therefore defines two *alternative modules* that are the same in every way (including their name, `normal`) except for the setting of `n`.
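
In plain R terms, the effect is similar to running the same code once per alternative value, rather than passing a two-element vector to `rnorm`. The sketch below is only an analogy; the names `n_alternatives` and `datasets` are hypothetical.

```r
# Sketch: `n: 100, 1000` behaves like two separate modules, one per value.
set.seed(1)                      # for repeatability of the sketch only
mu <- 0
n_alternatives <- c(100, 1000)   # a set of alternatives, not an argument to rnorm

# One simulated data set per alternative value of n.
datasets <- lapply(n_alternatives, function(n) rnorm(n, mean = mu, sd = 1))
names(datasets) <- paste0("n=", n_alternatives)
sapply(datasets, length)
```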

An important property of module parameters with multiple alternative values is that *their order does not matter*. For example, if we instead wrote `n: 1000, 100`, *the DSC results will be exactly the same as* `n: 100, 1000`. The only thing that will change is the order in which the results will appear in the tables, and the way in which the results are stored in files.

Although the two modules both have the same name, `normal`, their outputs can still be easily distinguished in the results; for example, if you want to compare the accuracy of the estimates in the larger (`n = 1000`) and smaller (`n = 100`) simulated data sets, the results from these two modules can be distinguished by the stored value of the module parameter `n`. We will see an example of this next.

## Executing the DSC with alternative `simulate` modules

Let's go ahead and generate results from our new "mean estimation" DSC. In the new DSC, the `simulate` modules are defined by two module blocks:

```
normal: R(x <- rnorm(n,mean = mu,sd = 1))
  mu: 0
  n: 100, 1000
  $data: x
  $true_mean: mu

t: R(x <- mu + rt(n,df = 2))
  mu: 3
  n: 100, 1000
  $data: x
  $true_mean: mu
```

The rest of the DSC remains unchanged from before.

This new DSC is implemented by `simulate_data_twice.dsc` inside the `one_sample_location` vignette folder.

To run the DSC benchmark, change the working directory (here we have assumed that the dsc repository is stored in the `git` subdirectory of your home directory),

```
cd ~/git/dsc/vignettes/one_sample_location
pwd
```

```
/Users/pcarbo/git/dsc/vignettes/one_sample_location
```


remove any previously generated results,

```
rm -Rf first_investigation.html first_investigation.log first_investigation
```

then let's run 10 replicates of all the pipelines:

```
dsc simulate_data_twice.dsc --replicate 10
```

```
INFO: DSC script exported to first_investigation.html
INFO: Constructing DSC from simulate_data_twice.dsc ...
INFO: Building execution graph & running DSC ...
DSC: 100%|██████████████████████████████████████| 15/15 [00:47<00:00,  2.37s/it]
INFO: Building DSC database ...
INFO: DSC complete!
INFO: Elapsed time 49.815 seconds.
```


*TO DO: Add a sentence here pointing out the number of module outputs that were generated by this command.*

## Inspecting the DSC results with `n=100` and `n=1000`

Now we will view the DSC results in R. Change the R working directory to the location of the DSC file, and use the `dscquery` function from the `dscrutils` package to load the DSC results into R:

```
setwd("~/git/dsc/vignettes/one_sample_location")
library(dscrutils)
dscout <-
  dscquery(dsc.outdir = "first_investigation",
           targets = c("simulate.n","simulate.true_mean","analyze","score.error"))
nrow(dscout)
```

```
Running shell command:
dsc-query first_investigation -o /var/folders/9b/ck4lp8s140lcksryyh4dppdr0000gn/T//RtmpU40I0r/file5f34164f5246.csv -f --target simulate.n simulate.true_mean analyze score.error 
Loading dsc-query output from CSV file.
Reading DSC outputs:
 - simulate.true_mean: extracted atomic values
 - score.error: extracted atomic values
```

The DSC command we ran above generated results for 10 replicates of 16 pipelines, doubling the number of pipelines we had before. This is expected because we now have 4 `simulate` modules (2 `normal` modules and 2 `t` modules), whereas before we had 2 `simulate` modules. To confirm this, we see that each of the `simulate` modules is run 40 times:

```
with(dscout,table(simulate,simulate.n))
```

```
        simulate.n
simulate 100 1000
  normal  40   40
  t       40   40
```
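
These counts can be checked by enumerating the pipelines by hand. The sketch below uses only base R. It assumes two `score` modules, and the name `"abs_err"` for the second one is an assumption, since only `"sq_err"` appears in this part of the tutorial.

```r
# Enumerate pipelines: every simulate module is paired with every
# analyze module and every score module.
# "abs_err" is an assumed name for the second score module.
pipelines <- expand.grid(simulate = c("normal", "t"),
                         n        = c(100, 1000),
                         analyze  = c("mean", "median"),
                         score    = c("sq_err", "abs_err"),
                         stringsAsFactors = FALSE)
replicates <- 10

nrow(pipelines)                # number of distinct pipelines
nrow(pipelines) * replicates   # number of pipeline runs across all replicates

# Runs per simulate module and sample size (each cell: 2 x 2 x 10).
table(pipelines$simulate, pipelines$n) * replicates
```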

We would expect that estimates improve with more data. We can quickly check this by comparing the average error (here, the squared error) in the pipelines with 100 samples against the average error in the pipelines with 1000 samples:

```
dat <- subset(dscout,score == "sq_err")
as.table(by(dat,
            with(dat,list(analyze,simulate.n)),
            function (x) mean(x$score.error)))
```

```
            100     1000
mean   0.308991 0.004709
median 0.015792 0.002163
```

Indeed, based on the results from these 10 replicates, we see that the accuracy of both methods (mean and median) improves considerably with more data, on average, and in both cases the median is more accurate than the mean on average.

## Two module parameters with multiple alternatives

If you provide more than one value for multiple module parameters, DSC considers all combinations of the values. 

For example, suppose we want to evaluate estimators of the population mean when the data are simulated from the *t* distribution with different numbers of degrees of freedom. In DSC, this can be compactly expressed by defining another module parameter, `df`, with multiple values:

```
t: R(x <- mu + rt(n,df))
  mu: 3
  df: 2, 4, 10
  n: 100, 1000
  $data: x
  $true_mean: mu
```

This defines 6 `t` modules from the 6 different ways of setting both the `n` and `df` parameters.
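
The six combinations can be listed in plain R with `expand.grid`. This is only an analogy for how DSC combines alternative values; the names `combos` and `datasets` are hypothetical.

```r
# All combinations of the alternative values of df and n.
combos <- expand.grid(df = c(2, 4, 10), n = c(100, 1000))
combos
nrow(combos)   # number of `t` modules defined by the block

# One simulated data set per combination, as each `t` module would produce.
set.seed(1)    # for repeatability of the sketch only
mu <- 3
datasets <- Map(function(n, df) mu + rt(n, df), combos$n, combos$df)
lengths(datasets)
```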

Next, let's clear the previous results and run the new DSC benchmark:

```
rm -Rf first_investigation.html first_investigation.log first_investigation
dsc simulate_multiple_dfs.dsc --replicate 10
```

```
INFO: DSC script exported to first_investigation.html
INFO: Constructing DSC from simulate_multiple_dfs.dsc ...
INFO: Building execution graph & running DSC ...
DSC: 100%|██████████████████████████████████████| 15/15 [01:55<00:00,  6.08s/it]
INFO: Building DSC database ...
INFO: DSC complete!
INFO: Elapsed time 118.543 seconds.
```


*TO DO: Add a sentence here pointing out the number of module outputs that were generated by this command.*

Now let's load all the results generated with the *t*-simulated data:

```
dscout2 <-
  dscquery(dsc.outdir = "first_investigation",
           targets = c("t.n","t.df","analyze","score.error"))
```

```
Running shell command:
dsc-query first_investigation -o /var/folders/9b/ck4lp8s140lcksryyh4dppdr0000gn/T//RtmpU40I0r/file5f343992f15b.csv -f --target t.n t.df analyze score.error 
Loading dsc-query output from CSV file.
Reading DSC outputs:
 - score.error: extracted atomic values
```


In total, we have results from 240 pipeline runs (10 replicates of 24 pipelines):

```
with(dscout2,table(t.n,t.df))
```

```
      t.df
t.n     2  4 10
  100  40 40 40
  1000 40 40 40
```

Each of the 6 `t` modules was run 40 times: 2 analyze modules x 2 score modules x 10 replicates.

## Using a module parameter to set the seed

To ensure reproducible results, it is often necessary to initialize the state, or "seed", for generating the sequence of pseudorandom numbers. DSC automatically provides a default setting for the seed in R, but you may want to override this choice. A common use of module parameters is to define modules with different seeds.

In this example, we define 10 modules that generate normally distributed data sets with 100 samples: 

```
normal: R(set.seed(seed); x <- rnorm(n,mean = mu,sd = 1))
  seed: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
  mu: 0
  n: 100
  $data: x
  $true_mean: mu
```

The only difference among the 10 `normal` modules is the sequence of pseudorandom numbers used to simulate random draws from the normal distribution.
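
The effect of the seed can be seen in plain R. In the sketch below, `simulate_normal` is a hypothetical wrapper around the module's inline code, written only for this illustration.

```r
# Hypothetical wrapper around the inline code of the `normal` module.
simulate_normal <- function(seed, n = 100, mu = 0) {
  set.seed(seed)
  rnorm(n, mean = mu, sd = 1)
}

seeds <- 1:10
datasets <- lapply(seeds, simulate_normal)   # one data set per seed

# Re-running with the same seed reproduces the same data set exactly.
identical(simulate_normal(1), datasets[[1]])
# Different seeds give different data sets.
identical(datasets[[1]], datasets[[2]])
```

The first comparison returns `TRUE` and the second returns `FALSE`.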

See `multiple_seeds.dsc` in the `one_sample_location` directory for a working example.

## Combining module parameters with module inputs

It is also possible to combine module parameters with module inputs.

Recall that in the [introductory tutorial](Intro_DSC.html) we defined a third `analyze` module that implemented the "Winsorized" mean. The `trim` argument to `winsor.mean` determines the proportion of the data to "squish" from the top and bottom of the distribution. If we wanted to evaluate the impact that the trimming amount has on the accuracy of the estimate, we could introduce a module parameter `trim` with multiple settings:

```
winsor: R(y <- psych::winsor.mean(x,trim,na.rm = TRUE))
  trim: 0.1, 0.2
  x: $data
  $est_mean: y
```

For each data set generated by a `simulate` module, DSC will run two different `winsor` modules: one with `trim = 0.1`, and a second with `trim = 0.2`.
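
The following plain-R sketch shows the idea of applying each `trim` alternative to the same input data. To keep it self-contained, it uses a hypothetical base-R helper, `winsor_mean_sketch`, in place of `psych::winsor.mean`; the helper may differ in detail from the `psych` implementation.

```r
# Base-R stand-in for a Winsorized mean. `winsor_mean_sketch` is a
# hypothetical helper and may differ in detail from psych::winsor.mean.
winsor_mean_sketch <- function(x, trim) {
  limits <- quantile(x, c(trim, 1 - trim), na.rm = TRUE)
  x <- pmin(pmax(x, limits[1]), limits[2])   # squish both tails to the limits
  mean(x, na.rm = TRUE)
}

set.seed(1)                        # for repeatability of the sketch only
x <- 3 + rt(100, df = 2)           # stands in for the module input `$data`
trim_alternatives <- c(0.1, 0.2)   # `trim: 0.1, 0.2`

# One estimate per alternative value of trim, all from the same data set.
est_mean <- sapply(trim_alternatives, function(trim) winsor_mean_sketch(x, trim))
names(est_mean) <- paste0("trim=", trim_alternatives)
est_mean
```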

Intuitively, one may want to adjust the trim setting *dynamically* based on the data (e.g., based on the fit to the normal distribution). However, this is not possible in DSC because module parameters must be set independently of the module inputs.

## Defining more complex module parameters (incomplete)

Above, we showed that DSC defines a module for each combination of the module parameters. Sometimes it is desirable to have finer control over which combinations of module parameters are used. One way to do this is to define a single module parameter that sets the value of multiple parameters used in the script. For example, suppose we wanted to simulate data from *t* distributions with these three settings of the mean (`mu`) and number of degrees of freedom (`df`):

```
mu  df
--  --
 0   2
 1   2
 2   4
```

This cannot be achieved by setting the module parameters `mu` and `df` separately, because DSC would automatically define modules for all combinations of `mu` and `df`. Instead, we can do something like this:

```
t: R(mu <- par[1]; df <- par[2]; x <- mu + rt(n,df))
  par: R(c(mu = 0,df = 2)),
       R(c(mu = 1,df = 2)),
       R(c(mu = 2,df = 4))
  n: 100
  $data: x
  $true_mean: mu
```

In this example, `par` is a module parameter with 3 alternative settings, in which each alternative setting is a vector with two elements; the first vector element is the mean of the simulated data, and the second vector element is the number of degrees of freedom.

Note that the text inside the `R()` is evaluated as R code.
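
A plain-R analogy may help here. In the sketch below, each element of the hypothetical list `par_alternatives` plays the role of one alternative value of `par`. For readability it extracts the elements by name rather than by position.

```r
# Sketch: each alternative of `par` is one (mu, df) pair, so only the
# three listed combinations are used.
set.seed(1)   # for repeatability of the sketch only
n <- 100
par_alternatives <- list(c(mu = 0, df = 2),
                         c(mu = 1, df = 2),
                         c(mu = 2, df = 4))

results <- lapply(par_alternatives, function(par) {
  mu <- par[["mu"]]
  df <- par[["df"]]
  # The true mean of each data set is its own value of mu.
  list(data = mu + rt(n, df), true_mean = mu)
})
sapply(results, function(r) r$true_mean)

# Setting mu and df separately would instead give all combinations:
nrow(expand.grid(mu = c(0, 1, 2), df = c(2, 4)))
```

The last line counts 6 combinations, compared with the 3 settings obtained through `par`.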

## Defining module parameters with many alternative values (incomplete)

Example for this section: 

Start with an example in which the different values of `n` are `10^1`, `10^1.5`, `10^2`, etc., all defined within `R()`.

Then generate many simulated data sets with different values of `n`, from very small (10) to very large (1e6) using `R{}`.

Finally, run the code and show that it works.
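
As a starting point, the values of `n` described in this outline could be generated in plain R as follows. The step of 0.5 in the exponent and the rounding to whole numbers are assumptions based on the values listed above. How this expression would be embedded in a DSC file is not shown here.

```r
# Sample sizes spaced evenly on the log10 scale, from 10 up to 1e6.
# The exponent step (0.5) and the rounding are assumptions.
n_values <- round(10^seq(1, 6, by = 0.5))
n_values
length(n_values)   # number of alternative values of n
```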

## Recap

Add recap here.

## Exploring further

In this tutorial, we introduced the most essential features of DSC. There are many other features of DSC that we did not have a chance to mention in these introductory tutorials.