# Reference of SoS actions and functions

## Predefined Actions

###  `run`

`run` is the most frequently used action in SoS. In most cases it behaves like action `bash`, which uses `bash` to execute the specified script. Under the hood, however, `run` differs from `bash` because it has no default interpreter and behaves differently in different situations.

In the simplest case, when one or more commands are specified, action `run` assumes that the script is a batch script under Windows and a bash script otherwise.

In [1]:
run:
    echo "A"

A


However, if the script starts with a shebang line, this action executes the script directly. This allows you to execute a script in any language. For example, the following step executes a Python script using action `run`:

In [2]:
run:
    #!/usr/bin/env python
    print('This is python')

This is python


The following example runs a complete SoS script using the command `sos-runner`:

In [3]:
# use sigil=None to stop interpolating expressions in script
[sos: sigil=None]
run:
    #!/usr/bin/env sos-runner
    [10]
    print("This is ${step_name}")
    [20]
    print("This is ${step_name}")

This is default_10
This is default_20


INFO: Executing default_10: 
INFO: Executing default_20: 
INFO: Workflow default (ID=fc087fb3ae4c581e) is executed successfully.


Note that action `run` does not analyze the shebang line of a script if it is executed in a docker container (with option `docker-image`); the script is then always assumed to be a `bash` script.
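
The interpreter selection described in this section can be restated as a small Python sketch. This is not SoS's implementation: the function name `choose_interpreter`, its parameters, and its return labels are hypothetical and only mirror the rules given above.

```python
def choose_interpreter(script, platform, in_docker=False):
    """Return a label describing how a `run`-style script would be executed.

    Hypothetical helper that only restates the documented rules.
    """
    first_line = script.lstrip().splitlines()[0] if script.strip() else ''
    if in_docker:
        # Inside a docker container the shebang line is not analyzed.
        return 'bash'
    if first_line.startswith('#!'):
        # A shebang line means the script is executed directly.
        return 'shebang: ' + first_line[2:].strip()
    # No shebang: batch script on Windows, bash script elsewhere.
    return 'batch' if platform.startswith('win') else 'bash'


print(choose_interpreter('echo "A"', 'linux'))
print(choose_interpreter("#!/usr/bin/env python\nprint('This is python')", 'linux'))
```

The `platform` argument is expected to look like the value of `sys.platform` (for example `linux` or `win32`).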

###  `sos_run`

Action `sos_run(workflow, **kwargs)` executes a specified workflow. The workflow can be a single workflow, a subworkflow (e.g. `A_-10`), or a combined workflow (e.g. `A + B`).

The nested workflow inherits the dictionary of the step from which the `sos_run` action is executed. For example, the `sub` step in the following example gets the value of `param` from the `default` step, in which `param` is defined.

In [4]:
[sub]
print(param)

[default]
for param in range(2):
    sos_run('sub')

0
1


The parameters can also be passed to the nested workflow as keyword parameters. For example, the above example can also be written as

In [5]:
[sub]
print(param)

[default]
sos_run('sub', param=0)
sos_run('sub', param=1)

0
1


If the `sos_run` action is called within an input group of the step, it treats `_input` as the input of the nested workflow:

In [6]:
%sandbox
!touch a.txt b.txt

[process]
output: "${input}.processed"
run:
    echo Processing ${input}
    touch ${output}

[10]
input: '*.txt', group_by=1
output: "${_input}.processed"
sos_run('process')

Processing a.txt
Processing b.txt


The above example demonstrates a difference between the batch and interactive modes of SoS. In batch mode, the `process` step is executed in its own process and namespace, so the `output` of `step_10` is `['a.txt.processed', 'b.txt.processed']` after the completion of the step. In interactive mode, however, the `output` of `step_10` is overwritten by the output of step `process`, so you see only `b.txt.processed` in the output.

The subworkflow can also be defined by parameter targets, in which case a workflow would be created dynamically to produce the specified targets. Please refer to section [Auxiliary Steps](Auxiliary_Steps.html) for details about such usage.

### `report`

Action `report` writes some content to an output stream. The input can be either a string or the content of one or more files specified by option `input`. The output is determined by parameter `output` and command line option `-r`:

* If `output='filename'`, the content will be written to the file.
* If `output='>>filename'`, the content will be appended to the specified file.
* If `output=obj` and `obj` has a `write` function (e.g. a file handle), the content will be passed to the `write` function.
* If `output` is unspecified and a filename is specified with option `-r`, the content will be appended to the specified file (no `>>` prefix is needed).
* If `output` is unspecified and no filename is specified with option `-r`, the content will be written to standard output.
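
The rules above can be restated as a short Python sketch. The helper `resolve_report_target` and its return values are hypothetical names used for illustration only; this is not how SoS implements the action.

```python
import io


def resolve_report_target(output=None, default_report=None):
    """Return (kind, target) for a report; `default_report` stands for -r."""
    if output is None:
        if default_report is not None:
            # A filename given with -r is appended to; no '>>' is needed.
            return ('append', default_report)
        return ('stdout', None)
    if hasattr(output, 'write'):
        # Any object with a write function, such as a file handle.
        return ('stream', output)
    if output.startswith('>>'):
        return ('append', output[2:])
    return ('overwrite', output)


buffer = io.StringIO()
for args in [('report.txt',), ('>>report.txt',), (None, 'report.txt'), (), (buffer,)]:
    print(resolve_report_target(*args)[0])
```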

For example, the content of `report` actions is printed to standard output if no output is specified.

In [7]:
[10]
report:
    Runing ${step_name}

[20]
report:
    Runing ${step_name}


Runing default_10

Runing default_20



We can specify an output file with option `output`, but the file will be overwritten if multiple actions write to the same file:

In [8]:
%sandbox
%preview report.txt
[10]
report: output='report.txt'
    Runing ${step_name}

[20]
report: output='report.txt'
    Runing ${step_name}

Runing default_20



This can be fixed by using the `>>` form of the filename:

In [9]:
%sandbox
%preview report.txt
[10]
report: output='report.txt'
    Runing ${step_name}

[20]
report: output='>>report.txt'
    Runing ${step_name}

Failed to process statement report(r"""Runing ${step_name}...ort.txt')\n: Target >>report.txt does not exist after execution of action

Runing default_10

Runing default_20



Option `-r` can be used to set a default output for all `report` actions:

In [None]:
%sandbox
%preview report.txt
%run -r report.txt
[10]
report: 
    Runing ${step_name}

[20]
report:
    Runing ${step_name}

An interesting feature of the `-r` option is that the filename passed by this option is treated as a double-quoted string and is interpolated. This allows a single `-r` option to specify step-dependent output files. For example,

In [None]:
%sandbox
%preview default_10.txt default_20.txt
%run -r '${step_name}.txt'
[10]
report: 
    Runing ${step_name}

[20]
report:
    Runing ${step_name}
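
The effect of interpolating the `-r` value once per step can be imitated in plain Python. This is only an analogy: `string.Template` happens to accept the same `${name}` form but handles plain names only, and the helper `report_filename` is a hypothetical name, not part of SoS.

```python
from string import Template


def report_filename(pattern, step_name):
    # Substitute ${step_name} in the pattern, as happens for each step.
    return Template(pattern).substitute(step_name=step_name)


for step in ('default_10', 'default_20'):
    print(report_filename('${step_name}.txt', step))
```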

Action `report` can also take the content of one or more input files and write them to the output stream. For example, the `report` action in the following example writes the content of `out.txt` to the default report stream (which is the standard output in this case).

In [None]:
%sandbox
[10]
run:
   # run some command and generate an out.txt
   echo "some result " > out.txt

report(input='out.txt')

### `bash`, `sh`, `csh`, `tcsh`, `zsh`

Action `bash(script)` accepts a shell script and executes it using `bash`. Actions `sh`, `csh`, `tcsh`, and `zsh` use the respective shells to execute the provided script.

These actions, as well as all other script-executing actions such as `python`, also accept an option `args` that allows you to pass additional arguments to the interpreter. For example,

In [None]:
run: args='-n ${filename!q}'
      echo "a"

executes the script with command `bash -n` (check syntax), so command `echo` is not actually executed.

### `python` and `python3`

Actions `python(script)` and `python3(script)` accept a Python script and execute it with `python` or `python3`, respectively.

Because SoS can include Python statements directly in a SoS script, it is important to note that embedded Python
statements are interpreted by SoS, whereas the `python` and `python3` actions are executed in separate processes without
access to the SoS environment.

For example, the following SoS step executes some Python statements **within** SoS, with direct access to SoS variables
such as `input` and with `result` written directly to the SoS environment,

```python
[10]
for filename in input:
    with open(filename) as data:
        result = filename + '.res'
        ....
```

while

```python
[10]
input: group_by='single'

python:

with open(${input!r}) as data:
   result = ${input!r} + '.res'
   ...


```

composes a Python script for each input file and calls separate Python interpreters to execute them. Whereas
the Python statement in the first example will always be executed, the statements in `python` will not be executed
in `inspect` mode.
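
The consequence of running a script in a separate interpreter can be seen with plain Python. The sketch below does not use SoS at all; it starts a child interpreter with `subprocess` and shows that the child neither sees the parent's variables nor changes them, so values have to be written into the script text, much like `${input!r}` above. The variable names are chosen for illustration.

```python
import subprocess
import sys

filename = 'a.txt'   # defined only in the parent process
result = None        # never touched by the child

# The child script is plain text; the value of `filename` reaches it only
# because it is formatted into the text before the child starts.
child_script = (
    "print('filename' in globals())\n"
    "result = {!r} + '.res'\n"
    "print(result)\n".format(filename)
)

completed = subprocess.run(
    [sys.executable, '-c', child_script],
    capture_output=True, text=True, check=True)

lines = completed.stdout.split()
print(lines)    # what the child printed
print(result)   # the parent's variable, unchanged
```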

###  `R`

Action `R(script)` executes the passed script using the `Rscript` command.

In [None]:
R:
    D <- data.frame(x=c(1,2,3,1), y=c(7,19,2,2))
    # Sort on x
    indexes <- order(D$x)
    D[indexes,]

###  `perl`

Action `perl(script)` executes the passed script using the `perl` interpreter.

In [None]:
perl:
    my $name = "Brian";
    print "Hello, $name!\n";

###  `ruby`

Action `ruby(script)` executes the passed script using the `ruby` interpreter.

In [None]:
ruby:
    a = [ 45, 3, 19, 8 ]
    b = [ 'sam', 'max', 56, 98.9, 3, 10, 'jill' ]
    print (a + b).join(' '), "\n"
    print a[2], " ", b[4], " ", b[-2], "\n"
    print a.sort.join(' '), "\n"
    a << 57 << 9 << 'phil'
    print "A: ", a.join(' '), "\n"

###  `node` and `JavaScript`

Actions `node(script)` and `JavaScript(script)` execute the passed script using the `node` interpreter.

In [None]:
node:
    var i, a, b, c, max;

    max = 1000000000;

    var d = Date.now();

    for (i = 0; i < max; i++) {
        a = 1234 + 5678 + i;
        b = 1234 * 5678 + i;
        c = 1234 / 2 + i;
    }

    console.log(Date.now() - d);

### `pandoc`

Action `pandoc` uses command [pandoc](http://pandoc.org/) to convert specified input to output. The input to this action can be specified with option `script` (usually given in script format) and option `input`.

First, if a script is specified, `pandoc` assumes that it is in Markdown format and converts it to HTML by default. For example,

In [None]:
pandoc:
    # this is header
    This is some test, with **emphasis**.        

You can specify an output file with option `output`:

In [None]:
%sandbox
%preview out.html
pandoc: output='out.html'
    Itemize

    * item 1
    * item 2

You can convert the input to another file type by using a different file extension for the output:

In [None]:
%sandbox
%preview out.tex
pandoc: output='out.tex'
    Itemize

    * item 1
    * item 2

Or you can add more options to the command line by modifying `args`:

In [None]:
%sandbox
%preview out.html
pandoc: output='out.html', args='${input!q} --output ${output!q} -s'
    Itemize

    * item 1
    * item 2

The second usage of the `pandoc` action is to specify one or more input filenames. You have to use the function form of this action, as follows:

In [None]:
%sandbox
%preview out.html
[10]
report: output = 'out.md'
    Itemize

    * item 1
    * item 2

[20]
pandoc(input='out.md', output='out.html')
    

If multiple files are specified, the content of these input files will be concatenated. This is very useful for generating a single pandoc output with input from different steps. We will demonstrate this feature in the [Generating Reports](../tutorials/Generating_Reports.html) tutorial.

If both `script` and `input` parameters are specified, the content of the input files is appended to `script`. For example,

In [None]:
#%sandbox
%preview out.html
[10]
report: output = 'out10.md'
    Itemize

    * item 1
    * item 2

[20]
report: output= 'out20.md'
    enumerated

    1. item 1
    2. item 2

[30]
pandoc: input=['out10.md', 'out20.md'], output='out.html'
    Markdown supports both itemized and enumerated

### `Rmarkdown`

Action `Rmarkdown` shares the same user interface with action `pandoc`. The only big difference is that it uses `R`'s `rmarkdown` package to render R-flavored Markdown.

For example, the `Rmarkdown` action in the following example collects input files `A_10.md` and `A_20.md` and uses `R`'s `rmarkdown` package to convert them to `out.html`.

In [None]:
%sandbox

[A_10]
report: output="A_10.md"
    step_10

[A_20]
report: output="A_20.md"
    Itemize

    * item 1
    * item 2

[A_30]
Rmarkdown(input=['A_10.md', 'A_20.md'], output='out.html')

###  `docker_build`

Action `docker_build` builds a docker image from an inline Dockerfile. The inline version of the action currently does not support adding any file from the local machine because the Dockerfile is saved to a random directory. You can work around this problem by creating a `Dockerfile` and passing it to the action through option `path`. This action accepts all parameters specified in the [docker-py documentation](http://docker-py.readthedocs.io/en/latest/images.html) because SoS simply passes additional parameters to the `build` function.

For example, the following step builds a docker container for [MISO](http://miso.readthedocs.org/en/fastmiso/) based on anaconda python 2.7.

```
[build_1]
# building miso from a Dockerfile
docker_build: tag='mdabioinfo/miso:latest'

	############################################################
	# Dockerfile to build MISO container images
	# Based on Anaconda python
	############################################################

	# Set the base image to anaconda Python 2.7 (miso does not support python 3)
	FROM continuumio/anaconda

	# File Author / Maintainer
	MAINTAINER Bo Peng <bpeng@mdanderson.org>

	# Update the repository sources list
	RUN apt-get update

	# Install compiler and python stuff, samtools and git
	RUN apt-get install --yes \
	 build-essential \
	 gcc-multilib \
	 gfortran \ 
	 apt-utils \
	 libblas3 \ 
	 liblapack3 \
	 libc6 \
	 cython \ 
	 samtools \
	 libbam-dev \
	 bedtools \
	 wget \
	 zlib1g-dev \ 
	 tar \
	 gzip

	WORKDIR /usr/local
	RUN pip install misopy
```

###  `download`

Action `download(URLs, dest_dir='.', dest_file=None, decompress=False)` downloads files from the specified URLs, which can be a list of URLs or a string of URLs separated by tabs, spaces, or newlines.

* If `dest_file` is specified, only one URL is allowed and the URL can have any form.
* Otherwise all files will be downloaded to `dest_dir`. Filenames are determined from the URLs, so each URL must end with the filename to save.
* If `decompress` is True, `.zip` files, compressed or plain `tar` files (e.g. `.tar.gz`), and `.gz` files will be decompressed to the same directory as the downloaded file.

For example,

```
[10]
GATK_RESOURCE_DIR = '/path/to/resource'
GATK_URL = 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg19/'

download: dest_dir=GATK_RESOURCE_DIR
    ${GATK_URL}/1000G_omni2.5.hg19.sites.vcf.gz
    ${GATK_URL}/1000G_omni2.5.hg19.sites.vcf.gz.md5
    ${GATK_URL}/1000G_omni2.5.hg19.sites.vcf.idx.gz
    ${GATK_URL}/1000G_omni2.5.hg19.sites.vcf.idx.gz.md5
    ${GATK_URL}/dbsnp_138.hg19.vcf.gz
    ${GATK_URL}/dbsnp_138.hg19.vcf.gz.md5
    ${GATK_URL}/dbsnp_138.hg19.vcf.idx.gz
    ${GATK_URL}/dbsnp_138.hg19.vcf.idx.gz.md5
    ${GATK_URL}/hapmap_3.3.hg19.sites.vcf.gz
    ${GATK_URL}/hapmap_3.3.hg19.sites.vcf.gz.md5
    ${GATK_URL}/hapmap_3.3.hg19.sites.vcf.idx.gz
    ${GATK_URL}/hapmap_3.3.hg19.sites.vcf.idx.gz.md5
```

downloads the specified files to `GATK_RESOURCE_DIR`. The `.md5` files are automatically used to validate the content of the associated files.

Note that SoS automatically saves signatures of downloaded and decompressed files, so the files will not be downloaded again if the action is called multiple times. You can, however, still specify the input and output of the step to use step signatures:


```
[10]
GATK_RESOURCE_DIR = '/path/to/resource'
GATK_URL = 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg19/'
GATK_RESOURCE_FILES = '''1000G_omni2.5.hg19.sites.vcf.gz
    1000G_omni2.5.hg19.sites.vcf.gz.md5
    1000G_omni2.5.hg19.sites.vcf.idx.gz
    1000G_omni2.5.hg19.sites.vcf.idx.gz.md5
    dbsnp_138.hg19.vcf.gz
    dbsnp_138.hg19.vcf.gz.md5
    dbsnp_138.hg19.vcf.idx.gz
    dbsnp_138.hg19.vcf.idx.gz.md5
    hapmap_3.3.hg19.sites.vcf.gz
    hapmap_3.3.hg19.sites.vcf.gz.md5
    hapmap_3.3.hg19.sites.vcf.idx.gz
    hapmap_3.3.hg19.sites.vcf.idx.gz.md5'''.split() 
input: []
output:  [os.path.join(GATK_RESOURCE_DIR, x) for x in GATK_RESOURCE_FILES]
download(['${GATK_URL}/${x}' for x in GATK_RESOURCE_FILES], dest_dir=GATK_RESOURCE_DIR)
```

Note that the `download` action uses up to 5 processes to download files. You can change this number by adjusting system configuration `sos_download_processes`.
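
The naming rules for `dest_dir` and `dest_file` described in this section can be sketched in Python as follows. The helpers `split_urls` and `download_destination` are hypothetical names and do not download anything; they only show how a destination path could be derived from a URL under the stated rules.

```python
import os
from urllib.parse import urlparse


def split_urls(urls):
    """Accept a list of URLs or a whitespace-separated string of URLs."""
    return urls.split() if isinstance(urls, str) else list(urls)


def download_destination(url, dest_dir='.', dest_file=None):
    """Return the local path a URL would be saved to."""
    if dest_file is not None:
        # With dest_file the URL can have any form.
        return dest_file
    # Otherwise the last portion of the URL path is the filename.
    filename = os.path.basename(urlparse(url).path)
    if not filename:
        raise ValueError('URL has no filename portion: ' + url)
    return os.path.join(dest_dir, filename)


for url in split_urls('ftp://host/bundle/dbsnp_138.hg19.vcf.gz\n'
                      'ftp://host/bundle/dbsnp_138.hg19.vcf.gz.md5'):
    print(download_destination(url, dest_dir='resource'))
```

The host name `host` in the usage lines is a placeholder, not a real server.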

###  `fail_if`

Action `fail_if(expr, msg='')` raises an exception with `msg` (and terminates the execution of the workflow if the exception is not caught) if `expr` evaluates to `True`.

###  `warn_if`

Action `warn_if(expr, msg)` issues a warning message `msg` if `expr` evaluates to `True`.

###  `stop_if`

Action `stop_if(expr, msg='')` stops the execution of the current step (or of the current process if within a `for_each` loop) if `expr` evaluates to `True`, and gives a warning message if `msg` is specified. For example,

In [None]:
%sandbox
!touch a.txt
!echo 'something' > b.txt

[10]
input: '*.txt', group_by=1

stop_if(os.path.getsize(_input[0]) == 0)
print(_input)

skips `a.txt` because it has size 0.
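
The skipping behavior of this example can be imitated in plain Python, where `continue` plays the role that `stop_if` plays for one input group. This is only an analogy written without SoS; the file names match the example above and are created in a temporary directory.

```python
import os
import tempfile

processed = []
with tempfile.TemporaryDirectory() as workdir:
    # Same setup as the example: an empty a.txt and a non-empty b.txt.
    open(os.path.join(workdir, 'a.txt'), 'w').close()
    with open(os.path.join(workdir, 'b.txt'), 'w') as handle:
        handle.write('something\n')

    for name in sorted(os.listdir(workdir)):   # one file per input group
        path = os.path.join(workdir, name)
        if os.path.getsize(path) == 0:
            continue                           # skip the rest for this group
        processed.append(name)

print(processed)
```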

## Functions and objects

###  `get_output`

Function `get_output(cmd)` returns the output of a command (decoded as `UTF-8`) and is a shortcut for `subprocess.check_output(cmd, shell=True).decode()`.

In [None]:
get_output('which ls')

This function also accepts two options, `show_command=False` and `prompt='$ '`, which can be useful if you would like to present the command that produces the output. For example,

In [None]:
print(get_output('which ls', show_command=True))
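
A rough Python equivalent, built on the `subprocess.check_output` call quoted above, might look like the sketch below. The name `get_output_sketch` is hypothetical, and the layout used when `show_command=True` (prompt and command on one line, followed by the output) is an assumption, not the documented format.

```python
import subprocess


def get_output_sketch(cmd, show_command=False, prompt='$ '):
    """Run a shell command and return its decoded output."""
    output = subprocess.check_output(cmd, shell=True).decode()
    if show_command:
        # Assumed layout: prompt + command, then the command's output.
        return '{}{}\n{}'.format(prompt, cmd, output)
    return output


print(get_output_sketch('echo hello', show_command=True))
```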

###  `expand_pattern`

Function `expand_pattern` expands a string to multiple ones using items of variables quoted between `{ }`. For example,

```python
output: expand_pattern('{a}_{b}.txt')
```

is equivalent to

```python
output: ['{}_{}.txt'.format(x, y) for x, y in zip(a, b)]
```

if `a` and `b` are sequences of the same length. For example,

In [None]:
name = ['Bob', 'John']
salary = [200, 300]
expand_pattern("{name}'s salary is {salary}")

The sequences must have the same length:

In [None]:
%sandbox --expect-error

salary = [200]
expand_pattern("{name}'s salary is {salary}")

An exception is made for variables of simple non-sequence types, which are repeated in all expanded items:

In [None]:
salary = 200
expand_pattern("{name}'s salary is {salary}")
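
The expansion rules described in this section can be sketched in plain Python. Unlike the real function, which looks up variables in the SoS environment, the hypothetical `expand_pattern_sketch` below takes the variables as an explicit dictionary, and it treats only lists and tuples as sequences.

```python
from string import Formatter


def expand_pattern_sketch(pattern, variables):
    """Expand `pattern` once per item of the sequence variables it uses."""
    names = [field for _, field, _, _ in Formatter().parse(pattern) if field]
    sequences = {n: variables[n] for n in names
                 if isinstance(variables[n], (list, tuple))}
    lengths = {len(value) for value in sequences.values()}
    if len(lengths) > 1:
        raise ValueError('sequences must have the same length')
    count = lengths.pop() if lengths else 1
    expanded = []
    for index in range(count):
        # Sequence items are taken by position; other values are repeated.
        values = {n: (variables[n][index] if n in sequences else variables[n])
                  for n in names}
        expanded.append(pattern.format(**values))
    return expanded


print(expand_pattern_sketch("{name}'s salary is {salary}",
                            {'name': ['Bob', 'John'], 'salary': [200, 300]}))
print(expand_pattern_sketch("{name}'s salary is {salary}",
                            {'name': ['Bob', 'John'], 'salary': 200}))
```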

###  `logger` object

The SoS `logger` object is a `logging` object used by SoS to produce various outputs. You can use this object to output error, warning, info, debug, and trace messages to the terminal. For example,

In [None]:
%run -v2
[0]
logger.info("I am at ${step_name}")

The output of `logger` is controlled by the logging level. For example, the above message would not be printed at `-v1` (warning):

In [None]:
%run -v1
[0]
logger.info("I am at ${step_name}")
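
The same filtering can be reproduced with Python's standard `logging` module, independent of SoS. In the sketch below the correspondence between `-v1` and `logging.WARNING`, and between `-v2` and `logging.INFO`, is inferred from the text above; `ListHandler` and `messages_at` are hypothetical names.

```python
import logging


class ListHandler(logging.Handler):
    """Collect emitted messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def messages_at(level):
    """Return the messages that pass a logger set to `level`."""
    log = logging.getLogger('sketch_{}'.format(level))
    log.propagate = False
    handler = ListHandler()
    log.addHandler(handler)
    log.setLevel(level)
    try:
        log.warning('a warning')
        log.info('I am at default_0')
    finally:
        log.removeHandler(handler)
    return handler.messages


print(messages_at(logging.INFO))      # comparable to -v2
print(messages_at(logging.WARNING))   # comparable to -v1
```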