# Workflow Specification

A SoS script can specify one or more workflows. Each workflow consists of one or more numbered steps. The numbers (which should be non-negative) specify the **logical order** in which the steps are executed, but a later step might be executed before earlier steps complete if it does not depend on their output.

This tutorial shows you how to define steps in a workflow and how to construct nested and combined workflows from single workflows. Although a Jupyter notebook, because of its interactive nature, is rarely used to execute complete workflows, we define workflows in notebook cells and execute them with options passed from cell magics such as `%set` and `%run`. Briefly speaking, `%set` sets persistent and global command line options and `%run` sets additional temporary options for the current cell. For example, the following command sets verbosity level to 2 so that SoS would display log messages of steps executed for the rest of this tutorial.

In [1]:
# set global logging level to INFO to display step executed. 
%set -v2

Set sos options to "-v2"


## Global step

A SoS script can have one and only one `global` section, with definitions shared by all steps in this script. A global section is usually defined implicitly as all statements before the first named step. For example, the `a=1` statement before the definition of `[10]` is visible to all steps.

In [2]:
a = 1

[10]
print(a)

[20]
print(a+1)

INFO: Executing default_10:
INFO: input:    []
INFO: output:   []
INFO: Executing default_20:
INFO: input:    []
INFO: output:   []


1
2


The global section can also be defined with a `[global]` section header, in which case the section does not have to be the first section.

In [3]:
[10]
print(a)

[20]
print(a+1)

[global]
a = 1

INFO: Executing default_10:
INFO: input:    []
INFO: output:   []
INFO: Executing default_20:
INFO: input:    []
INFO: output:   []


1
2


Note that a script can have only one global section, so the following script would trigger an error:

In [4]:
%sandbox --expect-error

a = 1

[global]
b = 1

File contains parsing errors: <string>
	[line  3]: [global]

Cannot define a global section with a non-empty implicit global section


## Single default workflow

Each step of a workflow starts with a **step header** in the format of `[step_name: options]`. A single workflow can be specified without a name in a SoS script. For example, the following sections specify a workflow with four steps `5`, `10`, `20`, and `100`. As you can see, the workflow steps can be specified in any order and do not have to be consecutive (which is actually preferred because it allows easy insertion of extra steps).

In [5]:
[5]
[20]
[10]
[100]

INFO: Executing default_5:
INFO: input:    []
INFO: output:   []
INFO: Executing default_10:
INFO: input:    []
INFO: output:   []
INFO: Executing default_20:
INFO: input:    []
INFO: output:   []
INFO: Executing default_100:
INFO: input:    []
INFO: output:   []


A workflow specified in this way is the **`default`** workflow and is actually called `default` in SoS output. If you want to give it a meaningful name, you can specify the steps as

In [6]:
[mapping_5]
[mapping_20]
[mapping_10]
[mapping_100]

INFO: Executing mapping_5:
INFO: input:    []
INFO: output:   []
INFO: Executing mapping_10:
INFO: input:    []
INFO: output:   []
INFO: Executing mapping_20:
INFO: input:    []
INFO: output:   []
INFO: Executing mapping_100:
INFO: input:    []
INFO: output:   []


Because this SoS script defines only one workflow (`mapping`), you do not have to specify the name of the workflow on the SoS command line.

A workflow name can contain alphabetical and numeric characters, `-`, and `_`, but the first character must be a letter.

In [7]:
[process-doc_20]

INFO: Executing process-doc_20:
INFO: input:    []
INFO: output:   []


Note that the index of a step can be omitted if it is the only step of a workflow.

In [8]:
[mapping]

INFO: Executing mapping_0:
INFO: input:    []
INFO: output:   []
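
The naming rules in this section can be summarized in a short Python sketch. It is only an illustration and not SoS's actual parser: the function `parse_step_name` and its regular expression are assumptions made for this example. It splits a step name into a workflow name and a numeric index, treats a bare number as a step of `default` and a missing index as `0`, and then sorts the steps by index to recover the logical order.

```python
import re


def parse_step_name(step_name):
    """Return (workflow, index) for a step name such as 'mapping_20' or '5'."""
    if step_name.isdigit():
        # A bare number is a step of the unnamed (default) workflow.
        return 'default', int(step_name)
    # Hypothetical pattern: a name that starts with a letter and may contain
    # letters, digits, `-` and `_`, optionally followed by `_<index>`.
    match = re.fullmatch(r'([A-Za-z][A-Za-z0-9_-]*?)(?:_(\d+))?', step_name)
    if match is None:
        raise ValueError(f'invalid step name: {step_name!r}')
    name, index = match.groups()
    # A step without an index is treated as step 0 of its workflow.
    return name, int(index) if index is not None else 0


# Headers may appear in any order; sorting by index gives the logical order.
headers = ['5', '20', '10', '100']
ordered = sorted((parse_step_name(h) for h in headers), key=lambda step: step[1])
print(ordered)
print(parse_step_name('process-doc_20'))
print(parse_step_name('mapping'))
```

The non-greedy name group lets a trailing `_<digits>` be read as the index, so `process-doc_20` is split into `process-doc` and `20`.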


## Short and long descriptions of steps

You can give each step a short description by adding it in parentheses after the step number.

In [9]:
[20 (mapping reads)]
[10 (initialize)]

INFO: Executing default_10 (initialize):
INFO: input:    []
INFO: output:   []
INFO: Executing default_20 (mapping reads):
INFO: input:    []
INFO: output:   []


The first comment block of each step is treated as the description of the step and is displayed when the step is executed.

In [10]:
[10 (initialize)]
# Validate input files and check available
# tools

# this step is actually empty

[20 (mapping reads)]
# Map reads using specified alignment tool

INFO: Executing default_10 (initialize): Validate input files and check available tools
INFO: input:    []
INFO: output:   []
INFO: Executing default_20 (mapping reads): Map reads using specified alignment tool
INFO: input:    []
INFO: output:   []
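
As a rough illustration of how the first comment block becomes the description shown in the log, the following Python sketch joins the leading comment lines of a step body. The helper `step_description` is hypothetical and is not how SoS itself extracts descriptions; it only mirrors the behavior seen in the output above.

```python
def step_description(body):
    """Join the leading comment lines of a step body into one description."""
    words = []
    for line in body.splitlines():
        stripped = line.strip()
        if not stripped.startswith('#'):
            # The first non-comment line (including a blank one) ends the block.
            break
        words.append(stripped.lstrip('#').strip())
    return ' '.join(words)


body = """# Validate input files and check available
# tools

# this step is actually empty
"""
print(step_description(body))
```

The second comment block is ignored because the blank line ends the first block, which is why `this step is actually empty` does not appear in the log.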


## Multiple workflows

A SoS script can define multiple workflows. For example, the following sections of a SoS script define two workflows named `mouse` and `human`.

In [11]:
%run mouse
[mouse_10]
[mouse_20]
[mouse_30]
[human_10]
[human_20]
[human_30]

INFO: Executing mouse_10:
INFO: input:    []
INFO: output:   []
INFO: Executing mouse_20:
INFO: input:    []
INFO: output:   []
INFO: Executing mouse_30:
INFO: input:    []
INFO: output:   []


In this case, a command line option is needed to specify the workflow name. This can be done with the `%run` magic in a Jupyter notebook, or with a positional argument on the command line, e.g.

```bash
    % sos run myscript mouse
```

If you would like to define a ``default`` and a named workflow, you can define them as

In [12]:
[10]
[20]
[30]
[test_10]
[test_20]
[test_30]

INFO: Executing default_10:
INFO: input:    []
INFO: output:   []
INFO: Executing default_20:
INFO: input:    []
INFO: output:   []
INFO: Executing default_30:
INFO: input:    []
INFO: output:   []


The `default` workflow will be executed by default using command

```bash
    % sos run myscript
```

The `test` workflow will be executed if its name is specified from the command line

```bash
    % sos run myscript test
```

## Shared workflow steps

One of the motivations of defining multiple workflows in a single SoS script is that they share certain processing steps. If this is the case, you can define sections such as

In [13]:
%run mouse
[mouse_10,human_10]
[mouse_20]
[human_20]
[mouse_30,human_30]

INFO: Executing mouse_10:
INFO: input:    []
INFO: output:   []
INFO: Executing mouse_20:
INFO: input:    []
INFO: output:   []
INFO: Executing mouse_30:
INFO: input:    []
INFO: output:   []


or

In [14]:
%run mouse
[*_10]
[mouse_20]
[human_20]
[*_30]

INFO: Executing mouse_10:
INFO: input:    []
INFO: output:   []
INFO: Executing mouse_20:
INFO: input:    []
INFO: output:   []
INFO: Executing mouse_30:
INFO: input:    []
INFO: output:   []


In [15]:
#local run
%run fly
[*_10]
[mouse_20,human_20]
[fly_20]
[*_30,fly_50]
[fly_40]


INFO: Executing fly_10:
INFO: input:    []
INFO: output:   []
INFO: Executing fly_20:
INFO: input:    []
INFO: output:   []
INFO: Executing fly_30:
INFO: input:    []
INFO: output:   []
INFO: Executing fly_40:
INFO: input:    []
INFO: output:   []
INFO: Executing fly_50:
INFO: input:    []
INFO: output:   []


In the last case, the step defined by `[*_30,fly_50]` will be expanded to `mouse_30`, `human_30`, `fly_30`, and `fly_50`, and will be executed twice for the `fly` workflow. Note that workflow steps can use the variable `step_name` to act (slightly) differently for different workflows. For example,

In [16]:
%run mouse
[mouse_20,human_20]
reference = "/path/to/mouse/reference" if \
  step_name.startswith('mouse') else "/path/to/human/reference"

print("Reference genome ${reference} is used")

INFO: Executing mouse_20:
INFO: input:    []
INFO: output:   []


Reference genome /path/to/mouse/reference is used


Here the variable `step_name` is `mouse_20` or `human_20` depending on the workflow being executed, and is used to determine the correct reference genome for different workflows.
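
The way shared headers contribute steps to a given workflow can be sketched in Python. This is an illustration under simplifying assumptions, not SoS's implementation: the helper `steps_of` is hypothetical, every name in a header is assumed to end in `_<index>`, and `*` is assumed to match every workflow.

```python
def steps_of(workflow, headers):
    """Return the sorted step indexes that `headers` define for `workflow`."""
    indexes = set()
    for header in headers:
        for name in header.split(','):
            # Split 'mouse_20' into prefix 'mouse' and index '20'.
            prefix, _, index = name.strip().rpartition('_')
            # `*` stands for every workflow, so it always matches.
            if prefix in ('*', workflow):
                indexes.add(int(index))
    return sorted(indexes)


# Section headers of the `fly` example above, without the brackets.
headers = ['*_10', 'mouse_20,human_20', 'fly_20', '*_30,fly_50', 'fly_40']
print(steps_of('fly', headers))
print(steps_of('mouse', headers))
```

For `fly`, the header `*_30,fly_50` contributes both index 30 and index 50, which matches the two executions of that section in the log.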

## Sub- and combined workflows

Although workflows are defined separately with all their steps, they do not have to be executed in their entirety. A `subworkflow` is a workflow that consists of one or more steps of an existing workflow. It is specified using the syntax `workflow:[from-to]`, where `from-to` can be `n` (step `n`), `-n` (up to step `n`), `n-m` (steps `n` to `m`), or `m-` (from step `m` on). For example

  ```python
  A              # complete workflow A
  A:5-10         # step 5 to 10 of A
  A:50-          # step 50 up
  A:-10          # up to step 10 of A
  A:10           # step 10 of workflow A
  ```

In practice, the `-n` format is frequently used to execute part of a workflow for debugging purposes, for example:

In [17]:
%run default:-20
[10]
[20]
[30]

INFO: Executing default_10:
INFO: input:    []
INFO: output:   []
INFO: Executing default_20:
INFO: input:    []
INFO: output:   []


You can also combine subworkflows to execute multiple workflows one after another. For example,

```python
A + B          # workflow A, followed by B
A:0 + B        # step 0 of A, followed by B
A:-50 + B + C  # up to step 50 of workflow A, followed by B, and C
```

This syntax can be used from the command line, e.g.

```bash
sos-runner myscript align+call
```

or from the `%run` magic of Jupyter notebook

In [18]:
#local run
%run check+align+call
[check_10]
[align_10]
[align_20]
[call_10]
[call_20]

INFO: Executing check_10:
INFO: input:    []
INFO: output:   []
INFO: Executing align_10:
INFO: input:    []
INFO: output:   []
INFO: Executing align_20:
INFO: input:    []
INFO: output:   []
INFO: Executing call_10:
INFO: input:    []
INFO: output:   []
INFO: Executing call_20:
INFO: input:    []
INFO: output:   []


It is worth noting that combined workflows might behave differently from when they are executed separately (e.g. the default input of `B` is changed from empty to the output of `A_0`), and it is up to the user to resolve conflicts between them.
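
To make the subworkflow and combination syntax concrete, here is a small Python sketch that parses specifications such as `A:-50 + B + C` and selects the matching step indexes. The functions `parse_subworkflow`, `parse_combined`, and `select_steps` are hypothetical helpers written for this tutorial; they are not part of SoS and skip error handling.

```python
def parse_subworkflow(spec):
    """Parse 'A', 'A:10', 'A:-10', 'A:5-10' or 'A:50-' into (name, first, last).

    `first` or `last` is None when that end of the range is open.
    """
    name, _, step_range = spec.strip().partition(':')
    if not step_range:
        return name, None, None          # complete workflow
    if '-' not in step_range:
        index = int(step_range)
        return name, index, index        # a single step
    first, _, last = step_range.partition('-')
    return name, int(first) if first else None, int(last) if last else None


def parse_combined(spec):
    """Split 'A:-50 + B + C' into a list of parsed subworkflows."""
    return [parse_subworkflow(part) for part in spec.split('+')]


def select_steps(indexes, first, last):
    """Keep the step indexes that fall inside the (possibly open) range."""
    return [i for i in indexes
            if (first is None or i >= first) and (last is None or i <= last)]


print(parse_combined('A:-50 + B + C'))
_, first, last = parse_subworkflow('default:-20')
print(select_steps([10, 20, 30], first, last))
```

Splitting on `:` before looking for `-` keeps workflow names that contain a hyphen, such as `process-doc`, intact.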

## Nested workflow

SoS also supports nested workflows, in which a complete workflow is treated as part of a step process.
The workflow is executed by the SoS action `sos_run`, e.g.

```python
sos_run('A')            # execute workflow A
sos_run('A + B')        # execute workflow B after A
sos_run('D:-10 + C')    # execute up to step 10 of D and workflow C

# execute user-specified aligner and caller workflows
sos_run('${aligner} + ${caller}')  
```

In its simplest form, a nested workflow allows you to define another workflow from existing ones. For example,

In [19]:
[align_10]
[align_20]
[call_10]
[call_20]
[default]
sos_run('align+call')

INFO: Executing default_0:
INFO: input:    []
INFO: Executing workflow align+call with input [] and no args
INFO: Executing align_10:
INFO: input:    []
INFO: output:   []
INFO: Executing align_20:
INFO: input:    []
INFO: output:   []
INFO: Executing call_10:
INFO: input:    []
INFO: output:   []
INFO: Executing call_20:
INFO: input:    []
INFO: output:   []
INFO: output:   []


defines a nested workflow that combines workflows `align` and `call`, so that the script will by default execute both workflows but can also execute either of them separately as workflow `align` or `call`.

Nested workflows also allow you to define multiple mini-workflows and connect them freely. For example

```python
[a_1]
[a_2]
[b]
[c]
[d_1]
sos_run('a+b')
[d_2]
sos_run('a+c')
```

defines a workflow `d` that will execute steps `d_1`, `a_1`, `a_2`, `b_0`, `d_2`, `a_1`, `a_2`, and `c_0`.
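
The execution order stated above can be reproduced with a small Python simulation. It is a sketch only: the `workflows` table and the function `execution_order` are assumptions made for this example, and the nested specification stored with a step stands in for the argument of its `sos_run` call.

```python
# Each workflow maps to its ordered steps. A step is (name, nested_spec),
# where nested_spec mimics the argument of sos_run, or None if there is none.
workflows = {
    'a': [('a_1', None), ('a_2', None)],
    'b': [('b_0', None)],
    'c': [('c_0', None)],
    'd': [('d_1', 'a+b'), ('d_2', 'a+c')],
}


def execution_order(spec):
    """Return step names in the order they run for a spec such as 'a+b'."""
    order = []
    for name in spec.split('+'):
        for step, nested in workflows[name.strip()]:
            order.append(step)
            if nested is not None:
                # The nested workflow runs as part of the enclosing step.
                order.extend(execution_order(nested))
    return order


print(execution_order('d'))
```

Workflow `a` appears twice in the result because both `d_1` and `d_2` invoke it.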

Nested workflows, like other SoS actions, can be executed repeatedly, for example,

In [20]:
[b_1]
print("My seed is ${_seed}")
[b_2]
[b_3]

[default]
import random
seed = [random.randint(1, 10**10) for x in range(2)]
input: for_each='seed'
sos_run('b')

INFO: Executing default_0:
INFO: input:    []
INFO: Executing workflow b with input [] and no args
INFO: Executing b_1:
INFO: input:    []
INFO: output:   []
INFO: Executing b_2:
INFO: input:    []
INFO: output:   []
INFO: Executing b_3:
INFO: input:    []
INFO: output:   []
INFO: Executing workflow b with input [] and no args
INFO: Executing b_1:
INFO: input:    []
INFO: output:   []
INFO: Executing b_2:
INFO: input:    []
INFO: output:   []
INFO: Executing b_3:
INFO: input:    []
INFO: output:   []
INFO: output:   []


My seed is 6257357302
My seed is 7764430190


would execute the complete workflow `b` twice with different random seeds. Similarly you can let the nested workflow process groups of input files.

Nested workflows can also be used to compose workflows from user-provided options given through command line arguments, configuration files, or even results from previous steps. For example, the following script

In [21]:
%run align
parameter: aligner = CONFIG.get('aligner', 'bwa')

[bwa_1]
[bwa_2]
[novaalign_1]
[novaalign_2]

[align]
sos_run(aligner)

INFO: Executing align_0:
INFO: input:    []
INFO: Executing workflow bwa with input [] and no args
INFO: Executing bwa_1:
INFO: input:    []
INFO: output:   []
INFO: Executing bwa_2:
INFO: input:    []
INFO: output:   []
INFO: output:   []


defines workflows `bwa` and `novaalign` to align raw reads. The `align` workflow is a master workflow that executes `bwa` or `novaalign` as determined by option `aligner`, which can be set in a configuration file (command line option `-c`) or by command line option `--aligner`.

## Workflow defined by targets

With the introduction of [auxiliary steps](Auxiliary_Steps.html), a SoS workflow can consist of a graph with or without a "stem" with numbered forward-style steps. By specifying the targets of a workflow instead of which steps to execute, you essentially let SoS execute the required steps to generate the targets. For example,

In [22]:
%sandbox

!touch test.bam
%run -t test.vcf

# this step provides the index file `{filename}.bam.bai`
[index: provides='{filename}.bam.bai']
input: "${filename}.bam"
sh:
   echo "Generating ${output}"
   touch ${output}

[call: provides='{filename}.vcf']
input:   "${filename}.bam"
depends: "${input}.bai"
sh:
   echo "Calling variants from ${input} with ${depends} to ${output}"
   touch ${output}

INFO: Resolving 1 objects from 0 nodes
INFO: Adding step call with output ['test.vcf']
INFO: Executing call:
INFO: input:    ['test.bam']
INFO: Target unavailable: test.bam.bai
INFO: Resolving 1 objects from 1 nodes
INFO: Adding step index with output ['test.bam.bai']
INFO: Executing index:
INFO: input:    ['test.bam']


Generating test.bam.bai


INFO: output:   ['test.bam.bai']
INFO: Executing call:
INFO: input:    ['test.bam']
INFO: _depends: ['test.bam.bai']


Calling variants from test.bam with test.bam.bai to test.vcf


INFO: output:   ['test.vcf']


In this example, instead of specifying a workflow, a target `test.vcf` is requested. SoS checks all auxiliary steps and finds that step `call` provides it. Because `call` depends on `test.bam.bai`, which is not yet available, SoS calls step `index` to generate `test.bam.bai`. After step `index` is completed, step `call` is executed again to produce the final requested target `test.vcf`.

The `-t` option can specify more than one target and can be used in combination with a forward-style workflow. Please refer to the [documentation on makefile-style workflows](Auxiliary_Steps.html) for more details.
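
The idea of working backwards from a target to the steps that provide it can be sketched in Python. This is a simplified illustration, not SoS's resolution algorithm: the `RULES` table and the functions `match_pattern` and `resolve` are hypothetical, and the sketch plans all steps up front, whereas the log above shows SoS starting `call` first and discovering the missing dependency while running it.

```python
import re

# Hypothetical rule table mirroring the example: each auxiliary step provides
# a file pattern and lists the patterns of the files it needs.
RULES = {
    'index': {'provides': '{filename}.bam.bai', 'needs': ['{filename}.bam']},
    'call': {'provides': '{filename}.vcf',
             'needs': ['{filename}.bam', '{filename}.bam.bai']},
}


def match_pattern(pattern, target):
    """Return the value of {filename} if `target` matches `pattern`, else None."""
    prefix, _, suffix = pattern.partition('{filename}')
    regex = re.escape(prefix) + r'(?P<filename>.+)' + re.escape(suffix)
    found = re.fullmatch(regex, target)
    return found.group('filename') if found else None


def resolve(target, existing, plan=None):
    """Return the (step, output) pairs to run, in order, to create `target`."""
    plan = [] if plan is None else plan
    if target in existing:
        return plan
    for step, rule in RULES.items():
        filename = match_pattern(rule['provides'], target)
        if filename is None:
            continue
        # Make sure everything this step needs exists before running it.
        for needed in rule['needs']:
            resolve(needed.format(filename=filename), existing, plan)
        plan.append((step, target))
        existing.add(target)
        return plan
    raise ValueError(f'no step provides {target}')


# Only test.bam exists at the start, as after `touch test.bam`.
print(resolve('test.vcf', {'test.bam'}))
```

The resulting plan runs `index` before `call`, which is the same net order in which the two steps complete in the log.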