# How to define and execute basic SoS workflows

* **Difficulty level**: easy
* **Time needed to learn**: 10 minutes or less
* **Key points**:
  * A forward-style workflow consists of numerically numbered steps
  * Multiple workflows can be defined in a single SoS script or notebook
  * Optional input and output statements can be added to change how workflows are executed
  * The default input of a step is the output of the previous step

## Simple workflows with numerically numbered steps

The workflows you have seen so far have numerically numbered steps. For example, the script below from the [first tutorial](sos_in_notebook.html) has a single workflow `plot` with steps `plot_10` and `plot_20`, and SoS executes the steps in numeric order.

<div class="bs-callout bs-callout-primary" role="alert">
    <h4>Workflows with numerically numbered steps</h4>
    <ul>
        <li>Step names have the format <code>name_index</code> (e.g. <code>step_10</code>), where <code>name</code> is the name of the workflow and <code>index</code> is a number</li>
        <li>Step indexes are usually not consecutive, which leaves room to insert new steps later</li>
        <li>Both the workflow <code>name</code> and the <code>index</code> can be omitted. Steps named <code>10</code>, <code>20</code>, etc. are treated as steps of an unnamed workflow, and a section named <code>step</code> is treated as a one-step workflow</li>
        <li>A workflow can be executed by its name (e.g. <code>%run name</code> in Jupyter, <code>sos run name</code> from the command line). When no name is given, a default workflow is executed if only one workflow is defined or if a default workflow is defined</li>
    </ul>
</div>
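The naming rules above can be pictured with a small plain-Python model. This is only an illustrative sketch, not how SoS parses scripts; the helpers `split_step_name` and `execution_order` are hypothetical names introduced here.

```python
def split_step_name(section):
    """Split a section name into (workflow, index).

    'plot_10' -> ('plot', 10), '20' -> ('', 20), 'convert' -> ('convert', None)
    """
    if section.isdigit():
        # Index only: a step of the unnamed workflow.
        return '', int(section)
    name, sep, index = section.rpartition('_')
    if sep and index.isdigit():
        return name, int(index)
    # Name only: a one-step workflow.
    return section, None


def execution_order(sections, workflow):
    """Return the steps of one workflow, sorted by numeric index."""
    steps = []
    for section in sections:
        name, index = split_step_name(section)
        if name == workflow:
            steps.append((index if index is not None else 0, section))
    return [section for _, section in sorted(steps)]


sections = ['plot_20', 'plot_10', 'convert']
print(execution_order(sections, 'plot'))     # ['plot_10', 'plot_20']
print(execution_order(sections, 'convert'))  # ['convert']
```

The order in which sections appear in the script does not matter in this model; only the numeric index decides which step runs first.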

In [1]:
%run

[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[plot_10]
run: expand=True
    xlsx2csv {excel_file} > {csv_file}

[plot_20]
R: expand=True
    data <- read.csv('{csv_file}')
    pdf('{figure_file}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

xlsx2csv data/DEG.xlsx > DEG.csv


Invalid xlsx file: data/DEG.xlsx


Exception ignored in: <function Xlsx2csv.__del__ at 0x10b080730>


Traceback (most recent call last):


  File "/Users/bpeng1/anaconda3/envs/sos/bin/xlsx2csv", line 205, in __del__


AttributeError: 'Xlsx2csv' object has no attribute 'ziphandle'


ERROR: [plot_10]: [0]:


Failed to execute /bin/bash -ev .sos/plot_10_0_e5c42d97.sh


exitcode=1, workdir=/Users/bpeng1/sos/sos-docs/development


---------------------------------------------------------------------------


[plot]: Exits with 1 pending step (plot_20)


The workflow is executed by default with magic `%run` because only one workflow is defined in the script. You can also define multiple workflows and execute them by their names. For example, the following script defines two single-step workflows `convert` and `plot`. Because there is no default workflow, you will have to refer to them with their names:

```
%run convert
```
and
```
%run plot
```

In [2]:
%run convert

[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[convert]
run: expand=True
    xlsx2csv {excel_file} > {csv_file}

[plot]
R: expand=True
    data <- read.csv('{csv_file}')
    pdf('{figure_file}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

xlsx2csv data/DEG.xlsx > DEG.csv


Invalid xlsx file: data/DEG.xlsx


Exception ignored in: <function Xlsx2csv.__del__ at 0x10acb9730>


Traceback (most recent call last):


  File "/Users/bpeng1/anaconda3/envs/sos/bin/xlsx2csv", line 205, in __del__


AttributeError: 'Xlsx2csv' object has no attribute 'ziphandle'


ERROR: [convert]: [0]:


Failed to execute /bin/bash -ev .sos/convert_0_a49740a7.sh


exitcode=1, workdir=/Users/bpeng1/sos/sos-docs/development


---------------------------------------------------------------------------


In [3]:
%run plot

[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[convert]
run: expand=True
    xlsx2csv {excel_file} > {csv_file}

[plot]
R: expand=True
    data <- read.csv('{csv_file}')
    pdf('{figure_file}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 


  no lines available in input


Calls: read.csv -> read.table


Execution halted


ERROR: [plot]: [0]:


Failed to execute Rscript .sos/plot_0_1b361d96.R


exitcode=1, workdir=/Users/bpeng1/sos/sos-docs/development


---------------------------------------------------------------------------


## Default input of steps

As shown in [How to specify input and output files and process input files in groups](doc/user_guide/input_substeps.html), you can define `input` and `output` for each step.

<div class="bs-callout bs-callout-primary" role="alert">
    <h4>Default <code>input</code> of numerically numbered workflows</h4>
    <p>The default input of a step in a numerically numbered workflow is the output of its previous step</p>
</div>

Therefore, in the following workflow, the `input` statement of `plot_20` can be omitted.

In [4]:
%run

[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[plot_10]
input: excel_file
output: csv_file

run: expand=True
    xlsx2csv {_input} > {_output}

[plot_20]
# input: csv_file
output: figure_file

R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

ERROR: No rule to generate target 'data/DEG.xlsx', needed by 'plot_10'.
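
The default-input rule can be modeled in a few lines of plain Python. This is a conceptual sketch rather than SoS code: the function `resolve_inputs` and the dictionary layout are hypothetical, and the sketch assumes that a step without an `output` statement passes nothing on to the next step.

```python
def resolve_inputs(steps):
    """Work out the effective input of each step in a forward-style workflow.

    `steps` is a list of dicts already sorted by step index. A step that
    does not declare 'input' inherits the output of the previous step.
    """
    previous_output = []
    resolved = {}
    for step in steps:
        resolved[step['name']] = step.get('input', previous_output)
        # Assumption of this sketch: no declared output means nothing is passed on.
        previous_output = step.get('output', [])
    return resolved


steps = [
    {'name': 'plot_10', 'input': ['data/DEG.xlsx'], 'output': ['DEG.csv']},
    {'name': 'plot_20', 'output': ['output.pdf']},  # no explicit input
]
print(resolve_inputs(steps))
# {'plot_10': ['data/DEG.xlsx'], 'plot_20': ['DEG.csv']}
```

Here `plot_20` ends up with `DEG.csv` as its input even though it never names the file, which mirrors the commented-out `input` statement in the workflow above.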


## Basic data-flow based workflows

We have shown the same workflows in the `plot_10`, `plot_20` style, in the `convert` and `plot` style, and with and without specification of input and output. What will happen if you define a workflow in separate steps with `input` and `output` statements?

Let us first remove the intermediate `DEG.csv`,

In [5]:
!rm DEG.csv

and execute the `plot` step of the following workflow

<div class="bs-callout bs-callout-primary" role="alert">
    <h4>Simple data-flow based workflow</h4>
    <p>If the <code>input</code> files of a step do not exist, SoS will automatically check other steps in the workflow and call them to generate the needed files. This allows the creation of workflows based on data flow. 
    </p>
 </div>

In [6]:
%run plot

[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[convert]
input: excel_file
output: csv_file

run: expand=True
    xlsx2csv {_input} > {_output}

[plot]
input: csv_file
output: figure_file

R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

ERROR: No step to generate target data/DEG.xlsx, needed by 'convert'


As you can see, although only the step `plot` is requested, SoS executes both the `convert` and `plot` steps because the required input file `csv_file` (`DEG.csv`) does not exist. In this case, SoS looks for a step that produces `DEG.csv` and executes it to generate the file before `plot` is executed.
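
The lookup described above can be sketched as a small dependency resolver in plain Python. This is a simplified model under stated assumptions, not SoS's actual scheduler: the function `plan` and the `steps` dictionary are hypothetical, every step lists its inputs and outputs as literal filenames, and circular dependencies are not handled.

```python
def plan(target_step, steps, existing_files):
    """Return the step names that must run, in order, to execute target_step."""
    # Map each declared output file to the step that produces it.
    producers = {out: name for name, spec in steps.items() for out in spec['output']}
    order = []

    def visit(name):
        for needed in steps[name]['input']:
            if needed in existing_files:
                continue  # file is already there, nothing to do
            if needed not in producers:
                raise RuntimeError(
                    f"No step to generate target {needed}, needed by '{name}'")
            visit(producers[needed])  # run the producing step first
        if name not in order:
            order.append(name)

    visit(target_step)
    return order


steps = {
    'convert': {'input': ['data/DEG.xlsx'], 'output': ['DEG.csv']},
    'plot': {'input': ['DEG.csv'], 'output': ['output.pdf']},
}
print(plan('plot', steps, existing_files={'data/DEG.xlsx'}))
# ['convert', 'plot']
print(plan('plot', steps, existing_files={'data/DEG.xlsx', 'DEG.csv'}))
# ['plot']
```

If neither `DEG.csv` nor `data/DEG.xlsx` exists, the sketch raises an error for `data/DEG.xlsx` because no step declares it as output, which is the same situation reported in the cell output above.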

 <div class="bs-callout bs-callout-warning" role="alert">
    <h4>Output of data-flow based workflow</h4>
     <p>For output files to be automatically identified by SoS as input for another step, the <code>output</code> statement must be clearly defined. That is to say, it must consist of either
    </p>
    <ul>
        <li>One or more filenames (e.g. <code>output: "DEG.csv"</code>) or</li>
        <li>An expression that can be easily evaluated from variables defined in the global section (e.g. <code>output: csv_file</code>)</li>
    </ul>
   <p> Output derived from <code>_input</code> therefore cannot be used (e.g. <code>output: _input.with_suffix('.bak')</code>). However,</p>
    <ul>
        <li>You can assign a name to complex output and use <code><a href="named_output.html">named_output()</a></code> to refer to it.</li>
        <li>You can create <a href="auxiliary_steps.html">makefile-style steps</a> that allow files to be created through pattern matching.</li>
    </ul>
</div>

## Passing output of steps with substeps *

<div class="bs-callout bs-callout-warning" role="alert">
    <h4>Output of a step with substeps</h4>
    <p>If a step has multiple substeps, the step output consists of the <code>_output</code> of each substep. By default these groups are passed to the next step, which then runs one substep per group.</p>
</div>

Things get a little more complicated when a step has multiple substeps. As you may recall from [How to specify input and output files and process input files in groups](doc/user_guide/input_substeps.html), multiple substeps can be defined with the input option `group_by`, each with its own `_input` and `_output`. When the output of such a step is inherited by another step, each `_output` becomes the `_input` of one substep of that step.

For example, after running `fastqc` on the input fastq files, we would like to process the generated HTML files and check whether the qualities are OK. We use the `beautifulsoup` Python module to find all the `<h2>` headers. Without going into the details of using `beautifulsoup` to parse HTML files, you should notice that

* No `input` is defined for step `20`, so it takes the output of step `10` as its input.
* The output of step `10` contains two groups, `data/S20_R1_fastqc.html` and `data/S20_R2_fastqc.html`, which become the inputs of two substeps of step `20`.
* The input of step `20` is processed one group at a time.

In [7]:
%run 

[10]
input: 'data/S20_R1.fastq', 'data/S20_R2.fastq', group_by=1
output: f'{_input:n}_fastqc.html'

sh: expand=True
    fastqc {_input}
    
[20]

from bs4 import BeautifulSoup

with open(_input) as html:
    # name the parser explicitly to avoid the "no parser specified" warning
    soup = BeautifulSoup(html, 'html.parser')
    for h2 in soup.find_all('h2'):
        if h2.img:
            print(f"{_input:bn} {h2.text}: {h2.img['alt']}")

ERROR: No rule to generate target 'data/S20_R2.fastq', needed by '10'.


If you would like to re-group the default input, you can redefine the `input` explicitly, or apply option `group_by` to the default input:

In [8]:
%run -v0

[10]
input: 'data/S20_R1.fastq', 'data/S20_R2.fastq', group_by=1
output: f'{_input:n}_fastqc.html'

sh: expand=True
    fastqc {_input}
    
[20]
input: group_by='all'
print(f'Input of step 20 is {_input}')

[] Failed with 0 step processed ()


ERROR: No rule to generate target 'data/S20_R2.fastq', needed by '10'.
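
The difference between inheriting the groups of step `10` and regrouping them with `group_by='all'` can be pictured with a short plain-Python model. This is an illustrative sketch only: `group_inputs` and `fastqc_report` are hypothetical helpers, and only an integer group size and `'all'` are modeled.

```python
from pathlib import PurePosixPath


def group_inputs(files, group_by):
    """Split a flat list of files into groups, one group per substep."""
    if group_by == 'all':
        return [list(files)]
    size = int(group_by)
    return [list(files[i:i + size]) for i in range(0, len(files), size)]


def fastqc_report(fastq):
    # Mirrors f'{_input:n}_fastqc.html': drop the extension, add a suffix.
    return str(PurePosixPath(fastq).with_suffix('')) + '_fastqc.html'


# Step 10: group_by=1 gives one substep per fastq file, each with its own output.
step10_groups = group_inputs(['data/S20_R1.fastq', 'data/S20_R2.fastq'], 1)
step10_outputs = [[fastqc_report(f) for f in group] for group in step10_groups]

# Step 20 without an input statement inherits the groups unchanged.
default_groups = step10_outputs
# Step 20 with group_by='all' merges them into a single group.
merged = group_inputs([f for group in step10_outputs for f in group], 'all')

print(default_groups)
# [['data/S20_R1_fastqc.html'], ['data/S20_R2_fastqc.html']]
print(merged)
# [['data/S20_R1_fastqc.html', 'data/S20_R2_fastqc.html']]
```

With the default grouping, step `20` would run twice, once per HTML file; with `group_by='all'` it would run once with both files in `_input`.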


## Further reading
* [How to specify input and output files and process input files in groups](doc/user_guide/input_substeps.html)
* [How to use named output in data-flow style workflows](doc/user_guide/named_output.html)
* [How to use Makefile-style rules to generate required files](doc/user_guide/auxiliary_steps.html)