# SoS Functions

## SoS Action

Although arbitrary Python functions can be used in a SoS step process, SoS defines some **`actions`** (e.g. the `run` function in the aforementioned examples)
that can be used in a SoS script. These functions accept a common set of **runtime options**, and are **not called in dryrun mode**.

For example, function `time.sleep(1)` would be executed in run mode,

In [1]:
[0]
import time
st = time.time()
time.sleep(1)
print('I just slept {:.2f} seconds'.format(time.time() - st))

I just slept 1.00 seconds


and also in dryrun mode (option `-n`),

In [2]:
%run -n
[0]
import time
st = time.time()
time.sleep(1)
print('I just slept {:.2f} seconds'.format(time.time() - st))

I just slept 1.00 seconds


because these statements are regular Python functions. However, if you put the statements in an action `python`, the statements would be executed in run mode,

In [3]:
[0]
python:
    import time
    st = time.time()
    time.sleep(1)
    print('I just slept {:.2f} seconds'.format(time.time() - st))

I just slept 1.01 seconds


but not executed in dryrun mode (option `-n`)

In [4]:
%run -n
[0]
python:
    import time
    st = time.time()
    time.sleep(1)
    print('I just slept {:.2f} seconds'.format(time.time() - st))
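The difference between a plain function and an action can be pictured with a small decorator. This is only a sketch of the idea, not how SoS implements it: the `STATE` dictionary, the `action` decorator and both example functions are hypothetical names introduced for illustration.

```python
import functools

# Hypothetical stand-in for the mode SoS is currently running in.
STATE = {'mode': 'run'}

def action(run_mode=('run',)):
    """Decorator: call the function only if the current mode is in run_mode."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if STATE['mode'] not in run_mode:
                return None  # skipped, e.g. in dryrun mode
            return func(*args, **kwargs)
        return wrapper
    return decorator

@action(run_mode=('run',))
def sleep_action(seconds):
    # behaves like an action: skipped outside run mode
    return 'slept {}'.format(seconds)

def plain_function(seconds):
    # behaves like a regular Python function: always executed
    return 'slept {}'.format(seconds)
```

With `STATE['mode']` set to `'dryrun'`, `sleep_action(1)` returns `None` while `plain_function(1)` still runs.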

## Action options

Actions define their own parameters but their execution is controlled by a common set of options.

###  `active`

Action option `active` is used to activate or inactivate an action in an input loop. Basically, when a loop is defined by the `for_each` or `group_by` options of the `input:` statement, an action after input would be repeated for each input group. The `active` parameter accepts a non-negative integer, a negative integer (counting backward from the last group), a sequence of indexes, or a slice object, and identifies the groups for which the action would be active.

For example, for an input loop that loops through a sequence of numbers, the first action `run` is executed for all groups, the second action is executed for groups with even indexes, and the last action is executed only for the last group.

In [5]:
seq = range(5)
input: for_each='seq'
run:
   echo I am active at all groups ${_index}
run: active=slice(None, None, 2)
   echo I am active at even groups ${_index}
run: active=-1
   echo I am active at last group ${_index}

I am active at all groups 0
I am active at even groups 0
I am active at all groups 1
I am active at all groups 2
I am active at even groups 2
I am active at all groups 3
I am active at all groups 4
I am active at even groups 4
I am active at last group 4
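The way an `active` value selects groups can be sketched in plain Python. The helper `is_active` below is hypothetical and only mirrors the rules described in this section (integer, negative integer, sequence of indexes, slice); it is not the SoS implementation.

```python
def is_active(active, index, n_groups):
    """Return True if an action with option `active` would run for group `index`."""
    if isinstance(active, int):
        # a negative number counts backward from the last group
        return index == (active if active >= 0 else n_groups + active)
    if isinstance(active, slice):
        # apply the slice to the range of all group indexes
        return index in range(n_groups)[active]
    # otherwise assume a sequence of indexes (negative values allowed)
    return index in [i if i >= 0 else n_groups + i for i in active]

# groups selected for the three `run` actions of the example above
even_groups = [i for i in range(5) if is_active(slice(None, None, 2), i, 5)]
last_group = [i for i in range(5) if is_active(-1, i, 5)]
```

Here `even_groups` is `[0, 2, 4]` and `last_group` is `[4]`, matching the output of the example.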


### `workdir`

Option `workdir` changes the current working directory for the action, and changes back once the action is executed. The directory will be created if it does not exist.

In [6]:
bash: workdir='tmp'
   touch a.txt
bash:
    ls tmp
    rm tmp/a.txt
    rmdir tmp

a.txt
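The behavior of `workdir` (create the directory if needed, change into it, and change back afterwards) resembles the following context manager. This is a sketch with a hypothetical name `working_directory`, not the code SoS uses.

```python
import os
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily change the working directory, creating it if necessary."""
    os.makedirs(path, exist_ok=True)   # create the directory if it does not exist
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)             # always change back, even on error
```

Code placed inside `with working_directory('tmp'):` would then see `tmp` as its current directory, just like the `bash` action in the example.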


### `docker_image`

If a docker image is specified (either a name, an Id, or a file), the action is assumed to be executed in the specified docker container. The image will be automatically downloaded (pulled) or loaded (if a `.tar` or `.tar.gz` file is specified) if it is not available locally.

For example, executing the following script 

```
[10]
python3: docker_image='python'
  set = {'a', 'b'}
  print(set)
```

under a docker terminal (that is connected to the docker daemon) will

1. Pull docker image `python`,  which is the official docker image for Python 2 and 3.
2. Create a python script with the specified content
3. Run the docker container `python` and make the script available inside the container
4. Use the `python3` command inside the container to execute the script.

Additional `docker_run` parameters can be passed to actions when the action
is executed in a docker image. These options include

* `name`: name of the container (option `--name`)
* `tty`: if a tty is attached (default to `True`, option `-t`)
* `stdin_open`: if stdin should be open (default to `False`, option `-i`)
* `user`: username (default to `root`, option `-u`)
* `environment`: Can be a string, a list of strings, or a dictionary of environment variables for docker (option `-e`)
* `volumes`: string or list of strings, extra volumes that need to be linked, in addition to those mounted by SoS (`/tmp`, `/Users` (if mac), `/Volumes` (if [properly configured](https://github.com/bpeng2000/SOS/wiki/SoS-Docker-guide) under mac) and script file)
* `volumes_from`: container names or Ids to get volumes from
* `working_dir`: working directory (option `-w`), default working directory, or working directory set by runtime option `workdir`.
* `port`: port opened (option `-p`)
* `extra_args`: any extra arguments you would like to pass to the `docker run` process (after you check the actual `docker run` command of SoS)
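To make the correspondence between these options and the `docker run` command line concrete, here is a rough sketch that assembles an argument list from the options listed above. The function `docker_run_args` is hypothetical, the `-v` flag for `volumes` is an assumption based on the standard `docker run` option (the list above does not name it), and the real command composed by SoS may differ.

```python
def docker_run_args(image, name=None, tty=True, stdin_open=False, user='root',
                    environment=None, volumes=None, working_dir=None, port=None):
    """Translate the options of this section into a `docker run` argument list."""
    args = ['docker', 'run']
    if name:
        args += ['--name', name]
    if tty:
        args.append('-t')
    if stdin_open:
        args.append('-i')
    if user:
        args += ['-u', user]
    # environment can be a string, a list of strings, or a dictionary
    if isinstance(environment, str):
        environment = [environment]
    elif isinstance(environment, dict):
        environment = ['{}={}'.format(k, v) for k, v in environment.items()]
    for env in environment or []:
        args += ['-e', env]
    # volumes can be a string or a list of strings (flag -v is an assumption)
    if isinstance(volumes, str):
        volumes = [volumes]
    for vol in volumes or []:
        args += ['-v', vol]
    if working_dir:
        args += ['-w', working_dir]
    if port:
        args += ['-p', str(port)]
    args.append(image)
    return args
```

For instance, `docker_run_args('python', name='c1', environment={'A': '1'})` yields `['docker', 'run', '--name', 'c1', '-t', '-u', 'root', '-e', 'A=1', 'python']`.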

### `docker_file`

This option allows you to import a docker image from the specified `docker_file`, which can be an archive file (`.tar`, `.tar.gz`, `.tgz`, `.bzip`, `.tar.xz`, `.txz`) or a URL to an archive file (e.g. `http://example.com/exampleimage.tgz`). SoS will use command `docker import` to import the `docker_file`. However, because SoS does not know the repository and tag names of the imported docker file, you will still need to use option `docker_image` to specify the image to use.

It is easy to define your own actions. All you need to do is to define a function and decorate it with a `SoS_Action` decorator. For example

```python
from pysos import SoS_Action

@SoS_Action(run_mode=('run', 'interactive'))
def my_action(parameters):
    # placeholder: do something with the parameters here
    return 1
```

###  `args`

All script-executing actions accept an option `args`, which changes how the script is executed.

By default, such an action has an `interpreter` (e.g. `bash`), a default `args='${filename!q}'`, and the script would be executed as `interpreter args`, which is
```
bash ${filename!q}
```
where `${filename!q}` would be replaced by the temporary script file.

If you would like to change the command line with additional parameters, or different format of filename, you can specify an alternative `args`, with variables `filename` (filename of temporary script) and `script` (actual content of the script).

For example, option `-n` can be added to command `bash` to execute script in dryrun mode

In [7]:
bash: args='-n ${filename!q}'
    echo "-n means running in dryrun mode (only check syntax)"

and you can actually execute a command without a filename, instead passing the script directly on the command line

In [8]:
python: args='-m timeit ${script}'
    '"-".join(str(n) for n in range(100))'

10000 loops, best of 3: 31.2 usec per loop
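How `interpreter` and `args` combine into a command line can be sketched as follows. The helper `compose_command` is hypothetical and handles only the two placeholders mentioned in this section; it assumes that `!q` means shell quoting, whereas SoS performs general string interpolation.

```python
import shlex

def compose_command(interpreter, filename, script, args='${filename!q}'):
    """Expand `${filename!q}` and `${script}` in args and prepend the interpreter."""
    expanded = (args
                .replace('${filename!q}', shlex.quote(filename))  # quoted script file
                .replace('${script}', script))                    # raw script content
    return '{} {}'.format(interpreter, expanded)

default_cmd = compose_command('bash', '/tmp/my script.sh', 'echo "a"')
dryrun_cmd = compose_command('bash', 'a.sh', 'echo "a"', args='-n ${filename!q}')
```

Here `default_cmd` is `bash '/tmp/my script.sh'` and `dryrun_cmd` is `bash -n a.sh`.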


## Action reference

###  `run`

`run` is the most frequently used action in SoS. In most cases, it is equivalent to action `bash`, which uses `bash` to execute the specified script. Under the hood, this action is quite different from `bash` because the `run` action does not have a default interpreter and behaves differently in different situations.

In the simplest case when one or more commands are specified, action `run` would assume it is a batch script under windows, and a bash script otherwise.

In [9]:
run:
    echo "A"

A


However, if the script starts with a [shebang line](https://en.wikipedia.org/wiki/Shebang_(Unix)), this action would execute the script directly. This allows you to execute any script in any language. For example, the following script executes a python script using action `run`

In [10]:
run:
    #!/usr/bin/env python
    print('This is python')

This is python


and the following example runs a complete sos script using command `sos-runner`

In [11]:
# use sigil=None to stop interpolating expressions in script
[sos: sigil=None]
run:
    #!/usr/bin/env sos-runner
    [10]
    print("This is ${step_name}")
    [20]
    print("This is ${step_name}")

This is default_10
This is default_20


INFO: Executing default_10: 
INFO: Executing default_20: 
INFO: Workflow default (ID=fc087fb3ae4c581e) is executed successfully.
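The decision made by `run` can be summarized with a small sketch. The function `choose_execution` and its return labels are hypothetical; they only restate the rules described in this section (shebang line first, then batch under Windows and bash otherwise).

```python
import platform

def choose_execution(script, system=None):
    """Return how a script passed to `run` would be executed."""
    system = system or platform.system()
    lines = script.lstrip().splitlines()
    first_line = lines[0] if lines else ''
    if first_line.startswith('#!'):
        return 'direct'   # the shebang line decides the interpreter
    return 'batch' if system == 'Windows' else 'bash'
```

For example, `choose_execution("#!/usr/bin/env python\nprint('This is python')", 'Linux')` returns `'direct'`, while a plain `echo "A"` returns `'bash'` on Linux and `'batch'` on Windows.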


Note that action `run` would not analyze the shebang line of a script if it is executed in a docker container (with option `docker_image`); the script would always be assumed to be a `bash` script.

###  `sos_run`

Action `sos_run(workflow, **kwargs)` executes a specified workflow. The workflow can be a single workflow, a subworkflow (e.g. `A_-10`), or a combined workflow (e.g. `A + B`).

The nested workflow inherits the dictionary of the step from which the `sos_run` action is executed. For example, the `sub` step in the following example gets the value of `param` from the `default` step within which `param` is defined.

In [12]:
[sub]
print(param)

[default]
for param in range(2):
    sos_run('sub')

0
1


The parameters can also be passed to the nested workflow as keyword parameters. For example, the above example can also be written as

In [13]:
[sub]
print(param)

[default]
sos_run('sub', param=0)
sos_run('sub', param=1)

0
1


If the `sos_run` action is called within an input group of the step, it will treat `_input` as the input of the nested workflow,

In [14]:
%sandbox
!touch a.txt b.txt

[process]
output: "${input}.processed"
run:
    echo Processing ${input}
    touch ${output}

[10]
input: '*.txt', group_by=1
output: "${_input}.processed"
sos_run('process')

Processing a.txt
Processing b.txt


The above example demonstrates a difference between the batch and interactive modes of SoS. In batch mode, the `process` step would be executed in its own process and namespace, so the `output` of `step_10` would be `['a.txt.processed', 'b.txt.processed']` after the completion of the step. However, in interactive mode the `output` of `step_10` is overwritten by the output of step `process`, so you see only `b.txt.processed` in the output.

The subworkflow can also be defined by parameter targets, in which case a workflow would be created dynamically to produce the specified targets. Please refer to section [Auxiliary Steps](Auxiliary_Steps.html) for details about such usage.

### `report`

Action `report` writes the passed string to an output, which is determined by parameter `output`, and command line option `-r`.

* If `output='filename'`, the content will be written to a file.
* If `output='>>filename'`, the content will be appended to specified file
* If `output=obj` and `obj` has a `write` function (e.g. a file handle), the content will be passed to the `write` function
* If output is unspecified and a filename is specified with option `-r`, the content will be appended to specified file (no `>>` prefix is needed).
* If output is unspecified and no filename is specified from option `-r`, the content will be written to standard output.
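The rules above can be read as a simple dispatch on the value of `output`. The following sketch uses a hypothetical function `write_report`, with `default_report` standing in for the filename given by option `-r`; it illustrates the rules only and is not the SoS implementation.

```python
import sys

def write_report(text, output=None, default_report=None):
    """Write text following the `output` rules of the `report` action."""
    if output is None:
        if default_report is None:
            sys.stdout.write(text)          # no output and no -r: standard output
            return
        output = '>>' + default_report      # -r file: append, no >> prefix needed
    if hasattr(output, 'write'):
        output.write(text)                  # file-like object
    elif output.startswith('>>'):
        with open(output[2:], 'a') as f:    # >>filename: append
            f.write(text)
    else:
        with open(output, 'w') as f:        # filename: overwrite
            f.write(text)
```

Calling `write_report` twice with the same plain filename keeps only the second text, which is the overwriting behavior that the `>>` prefix avoids.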

For example, the content of `report` actions is printed to standard output if no output is specified.

In [2]:
[10]
report:
    Runing ${step_name}

[20]
report:
    Runing ${step_name}


Runing default_10

Runing default_20

We can specify an output file with option `output`, but the output will be overwritten if multiple actions write to the same file

In [3]:
%sandbox
%preview report.txt
[10]
report: output='report.txt'
    Runing ${step_name}

[20]
report: output='report.txt'
    Runing ${step_name}

Runing default_20

This can be fixed by using the `>>` version of filename,

In [4]:
%sandbox
%preview report.txt
[10]
report: output='report.txt'
    Runing ${step_name}

[20]
report: output='>>report.txt'
    Runing ${step_name}

Runing default_10

Runing default_20

Option `-r` can be used to set a default output of all `report` actions

In [5]:
%sandbox
%preview report.txt
%run -r report.txt
[10]
report: 
    Runing ${step_name}

[20]
report:
    Runing ${step_name}

Runing default_10

Runing default_20

### `bash`, `sh`, `csh`, `tcsh`, `zsh`

Action `bash(script)` accepts a shell script and executes it using `bash`. Actions `sh`, `csh`, `tcsh`, and `zsh` use the respective shell to execute the provided script.

These actions, as well as all script-executing actions such as `python`, also accept an option `args` that allows you to pass additional arguments to the interpreter. For example

In [15]:
run: args='-n ${filename!q}'
      echo "a"

executes the script with command `bash -n` (check syntax), so command `echo` is not actually executed.

### `python` and `python3`

Actions `python(script)` and `python3(script)` accept a Python script and execute it with `python` or `python3`, respectively.

Because SoS can include Python statements directly in a SoS script, it is important to note that embedded Python
statements are interpreted by SoS, whereas the `python` and `python3` actions are executed in separate processes without
access to the SoS environment.

For example, the following SoS step executes some Python statements **within** SoS with direct access to SoS variables
such as `input`, and with `result` written directly to the SoS environment,

```python
[10]
for filename in input:
    with open(filename) as data:
        result = filename + '.res'
        ....
```

while

```python
[10]
input: group_by='single'

python:
    with open(${input!r}) as data:
        result = ${input!r} + '.res'
        ...
```

composes a Python script for each input file and calls separate Python interpreters to execute them. Whereas
the Python statement in the first example will always be executed, the statements in `python` will not be executed
in `inspect` mode.

###  `R`

Action `R(script)` executes the passed script using the `Rscript` command.

In [16]:
R:
    D <- data.frame(x=c(1,2,3,1), y=c(7,19,2,2))
    # Sort on x
    indexes <- order(D$x)
    D[indexes,]

  x  y
1 1  7
4 1  2
2 2 19
3 3  2


###  `perl`

Action `perl(script)` executes the passed script using the `perl` interpreter.

In [17]:
perl:
    my $name = "Brian";
    print "Hello, $name!\n";

Hello, Brian!


###  `ruby`

Action `ruby(script)` executes the passed script using the `ruby` interpreter.

In [18]:
ruby:
    a = [ 45, 3, 19, 8 ]
    b = [ 'sam', 'max', 56, 98.9, 3, 10, 'jill' ]
    print (a + b).join(' '), "\n"
    print a[2], " ", b[4], " ", b[-2], "\n"
    print a.sort.join(' '), "\n"
    a << 57 << 9 << 'phil'
    print "A: ", a.join(' '), "\n"

45 3 19 8 sam max 56 98.9 3 10 jill
19 3 10
3 8 19 45
A: 45 3 19 8 57 9 phil


###  `node` and `JavaScript`

Actions `node(script)` and `JavaScript(script)` execute the passed script using the `node` interpreter.

In [19]:
node:
    var i, a, b, c, max;

    max = 1000000000;

    var d = Date.now();

    for (i = 0; i < max; i++) {
        a = 1234 + 5678 + i;
        b = 1234 * 5678 + i;
        c = 1234 / 2 + i;
    }

    console.log(Date.now() - d);

1071


### `pandoc`

Action `pandoc` uses command [pandoc](http://pandoc.org/) to convert specified input to output. This action can be used in a few different ways.

First, if a script is specified, pandoc assumes it is in markdown format and converts it to HTML format by default. For example,

In [1]:
pandoc:
    # this is header
    This is some test, with **emphasis**.        

<h1 id="this-is-header">this is header</h1>
<p>This is some test, with <strong>emphasis</strong>.</p>


You can specify an output with option `output`

In [9]:
%sandbox
%preview out.html
pandoc: output='out.html'
    Itemize

    * item 1
    * item 2

You can convert input file to another file type using a different file extension

In [10]:
%sandbox
%preview out.tex
pandoc: output='out.tex'
    Itemize

    * item 1
    * item 2

Itemize

\begin{itemize}
\tightlist
\item


Or you can add more options to the command line by modifying `args`,

In [11]:
%sandbox
%preview out.html
pandoc: output='out.html', args='${input!q} --output ${output!q} -s'
    Itemize

    * item 1
    * item 2

The second usage of the `pandoc` action is to specify an input filename. You have to use the function form of this action as follows

In [12]:
%sandbox
%preview out.html
[10]
report: output = 'out.md'
    Itemize

    * item 1
    * item 2

[20]
pandoc(input='out.md', output='out.html')
    

Finally, if action `pandoc` is specified with no script and no `input` parameter, it will try to convert the report specified by command line option `-r`. This allows the following usage:

In [13]:
%sandbox
%run -r out.md
[10]
report:
    Itemize

    * item 1
    * item 2

[20]
pandoc()

<p>Itemize</p>
<ul>
<li>item 1</li>
<li>item 2</li>
</ul>


Basically, the output of the `report` action is written to `out.md` as specified by option `-r out.md`, which then becomes the default input of `pandoc`. Of course you can specify the output of `pandoc` in this case

In [14]:
%sandbox
%preview out.html
%run -r out.md
[10]
report:
    Itemize

    * item 1
    * item 2

[20]
pandoc(output='out.html')

###  `docker_build`

Action `docker_build` builds a docker image from an inline Dockerfile. The inline version of the action currently does not support adding any file from the local machine because the Dockerfile will be saved to a random directory. You can work around this problem by creating a `Dockerfile` and passing it to the action through option `path`. This action accepts all parameters as specified in https://docker-py.readthedocs.org/en/stable/api/#build because SoS simply passes additional parameters to the `build` function.

For example, the following step builds a docker container for [MISO](http://miso.readthedocs.org/en/fastmiso/) based on anaconda python 2.7.

```
[build_1]
# building miso from a Dockerfile
docker_build: tag='mdabioinfo/miso:latest'

	############################################################
	# Dockerfile to build MISO container images
	# Based on Anaconda python
	############################################################

	# Set the base image to anaconda Python 2.7 (miso does not support python 3)
	FROM continuumio/anaconda

	# File Author / Maintainer
	MAINTAINER Bo Peng <bpeng@mdanderson.org>

	# Update the repository sources list
	RUN apt-get update

	# Install compiler and python stuff, samtools and git
	RUN apt-get install --yes \
	 build-essential \
	 gcc-multilib \
	 gfortran \
	 apt-utils \
	 libblas3 \
	 liblapack3 \
	 libc6 \
	 cython \
	 samtools \
	 libbam-dev \
	 bedtools \
	 wget \
	 zlib1g-dev \
	 tar \
	 gzip

	WORKDIR /usr/local
	RUN pip install misopy
```

###  `download`

Action `download(URLs, dest_dir='.', dest_file=None, decompress=False)` downloads files from specified URLs, which can be a list of URLs, or a string with tab, space or newline separated URLs.

* If `dest_file` is specified, only one URL is allowed and the URL can have any form.
* Otherwise all files will be downloaded to `dest_dir`. Filenames are determined from URLs so the URLs must have the last portion as the filename to save. 
* If `decompress` is True, `.zip` files, compressed or plain `tar` files (e.g. `.tar.gz`), and `.gz` files will be decompressed to the same directory as the downloaded file.
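The first two rules determine where each file is saved. A minimal sketch of that logic, using a hypothetical helper `destination` and the standard library only, could look like this; the `ftp://host/...` URL in the usage note is a made-up placeholder.

```python
import os
from urllib.parse import urlparse

def destination(url, dest_dir='.', dest_file=None):
    """Return the local path a URL would be saved to."""
    if dest_file is not None:
        return dest_file                     # explicit name: the URL can have any form
    filename = os.path.basename(urlparse(url).path)
    if not filename:
        # without dest_file, the last portion of the URL must be a filename
        raise ValueError('Cannot determine filename from URL {}'.format(url))
    return os.path.join(dest_dir, filename)
```

For example, `destination('ftp://host/bundle/dbsnp_138.hg19.vcf.gz', '/path/to/resource')` gives `/path/to/resource/dbsnp_138.hg19.vcf.gz` on a POSIX system, whereas a URL ending in `/` raises an error unless `dest_file` is given.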

For example,

```
[10]
GATK_RESOURCE_DIR = '/path/to/resource'
GATK_URL = 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg19/'

download:   dest_dir=GATK_RESOURCE_DIR
    ${GATK_URL}/1000G_omni2.5.hg19.sites.vcf.gz
    ${GATK_URL}/1000G_omni2.5.hg19.sites.vcf.gz.md5
    ${GATK_URL}/1000G_omni2.5.hg19.sites.vcf.idx.gz
    ${GATK_URL}/1000G_omni2.5.hg19.sites.vcf.idx.gz.md5
    ${GATK_URL}/dbsnp_138.hg19.vcf.gz
    ${GATK_URL}/dbsnp_138.hg19.vcf.gz.md5
    ${GATK_URL}/dbsnp_138.hg19.vcf.idx.gz
    ${GATK_URL}/dbsnp_138.hg19.vcf.idx.gz.md5
    ${GATK_URL}/hapmap_3.3.hg19.sites.vcf.gz
    ${GATK_URL}/hapmap_3.3.hg19.sites.vcf.gz.md5
    ${GATK_URL}/hapmap_3.3.hg19.sites.vcf.idx.gz
    ${GATK_URL}/hapmap_3.3.hg19.sites.vcf.idx.gz.md5
```

downloads the specified files to `GATK_RESOURCE_DIR`. The `.md5` files will be automatically used to validate the content of the associated files.

Note that SoS automatically saves signatures of downloaded and decompressed files so the files will not be re-downloaded if the action is called multiple times. You can however still specify input and output of the step to use step signatures:
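The validation against `.md5` files mentioned above amounts to comparing a computed checksum with a stored one. The sketch below uses a hypothetical helper `md5_matches` and assumes the `.md5` file starts with the hexadecimal digest (the common `md5sum` layout); how SoS actually parses these files is not shown in this section.

```python
import hashlib

def md5_matches(data_file, md5_file):
    """Compare the MD5 digest of data_file with the digest stored in md5_file."""
    digest = hashlib.md5()
    with open(data_file, 'rb') as f:
        # read in chunks so that large files do not have to fit in memory
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    with open(md5_file) as f:
        expected = f.read().split()[0]   # assumed layout: "<digest>  <filename>"
    return digest.hexdigest() == expected.lower()
```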


```
[10]
GATK_RESOURCE_DIR = '/path/to/resource'
GATK_URL = 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg19/'
GATK_RESOURCE_FILES =  '''1000G_omni2.5.hg19.sites.vcf.gz
    1000G_omni2.5.hg19.sites.vcf.gz.md5
    1000G_omni2.5.hg19.sites.vcf.idx.gz
    1000G_omni2.5.hg19.sites.vcf.idx.gz.md5
    dbsnp_138.hg19.vcf.gz
    dbsnp_138.hg19.vcf.gz.md5
    dbsnp_138.hg19.vcf.idx.gz
    dbsnp_138.hg19.vcf.idx.gz.md5
    hapmap_3.3.hg19.sites.vcf.gz
    hapmap_3.3.hg19.sites.vcf.gz.md5
    hapmap_3.3.hg19.sites.vcf.idx.gz
    hapmap_3.3.hg19.sites.vcf.idx.gz.md5'''.split() 
input: []
output:  [os.path.join(GATK_RESOURCE_DIR, x) for x in GATK_RESOURCE_FILES]
download(['${GATK_URL}/${x}' for x in GATK_RESOURCE_FILES], dest_dir=GATK_RESOURCE_DIR)
```

Note that the `download` action uses up to 5 processes to download files. You can change this number by adjusting system configuration `sos_download_processes`.

###  `fail_if`

Action `fail_if(expr, msg='')` raises an exception with `msg` (and terminates the execution of the workflow if the exception is not caught) if `expr` evaluates to True.

###  `warn_if`

Action `warn_if(expr, msg)` yields a warning message `msg` if `expr` evaluates to True.
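In plain Python terms, `fail_if` and `warn_if` behave roughly like the two functions below. This is a sketch only: the use of `RuntimeError` and of the `warnings` module is an assumption, since this section does not say which exception type or logging mechanism SoS uses.

```python
import warnings

def fail_if(expr, msg=''):
    """Raise an exception with msg if expr is true."""
    if expr:
        raise RuntimeError(msg)   # exception type is an assumption

def warn_if(expr, msg):
    """Emit a warning with msg if expr is true."""
    if expr:
        warnings.warn(msg)        # SoS may report the warning differently
```

A typical use would be a sanity check such as `fail_if(len(samples) == 0, 'No sample is specified')`, where `samples` is whatever list the step works on.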

###  `stop_if`

Action `stop_if(expr, msg='')` stops the execution of the current step (or current processes if within `for_each` loop) and gives a warning message if `msg` is specified. For example,

In [20]:
%sandbox
!touch a.txt
!echo 'something' > b.txt

[10]
input: '*.txt', group_by=1

stop_if(os.path.getsize(_input[0]) == 0)
print(_input)

['b.txt']


skips `a.txt` because it has size 0.

## Other functions and objects

###  `get_output`

Function `get_output(cmd)` returns the output of command (decoded in `UTF-8`), which is a shortcut for `subprocess.check_output(cmd, shell=True).decode()`.

In [21]:
get_output('which ls')

'/bin/ls\n'

This function also accepts two options, `show_command=False` and `prompt='$ '`, that can be useful if you would like to present the command that produces the output. For example,

In [22]:
print(get_output('which ls', show_command=True))

$ which ls
/bin/ls
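Based on the description above, an equivalent function can be sketched with `subprocess`. The layout of the `show_command` output (prompt, command, newline, then the output) is inferred from the example and is an assumption about the exact format.

```python
import subprocess

def get_output(cmd, show_command=False, prompt='$ '):
    """Return the decoded output of a shell command, optionally preceded by the command."""
    output = subprocess.check_output(cmd, shell=True).decode()
    if show_command:
        return '{}{}\n{}'.format(prompt, cmd, output)
    return output
```

Like `subprocess.check_output`, this sketch raises `subprocess.CalledProcessError` if the command exits with a non-zero status.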



###  `expand_pattern`

Function `expand_pattern` expands a string to multiple ones using items of variables quoted between `{ }`. For example,

```python
output: expand_pattern('{a}_{b}.txt')
```

is equivalent to

```python
output: ['{}_{}.txt'.format(x, y) for x, y in zip(a, b)]
```

if `a` and `b` are sequences of the same length. For example,

In [23]:
name = ['Bob', 'John']
salary = [200, 300]
expand_pattern("{name}'s salary is {salary}")

["Bob's salary is 200", "John's salary is 300"]

The sequences should have the same length

In [24]:
%sandbox --expect-error

salary = [200]
expand_pattern("{name}'s salary is {salary}")

Failed to process statement 'expand_pattern("{name}\'s salary is {salary}")': Undefined variable name in pattern {name}'s salary is {salary}
Sandbox execution failed.

An exception is made for variables of simple non-sequence types, in which case they are repeated in all expanded items

In [25]:
salary = 200
expand_pattern("{name}'s salary is {salary}")

["Bob's salary is 200", "John's salary is 200"]
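The expansion rules can be reproduced with the standard `string.Formatter`. In this sketch the variables are passed explicitly through a `namespace` dictionary, which is an assumption made to keep the example self-contained; the SoS function looks the variables up in its own environment.

```python
from string import Formatter

def expand_pattern(pattern, namespace):
    """Expand {name} fields in pattern using sequences (or scalars) in namespace."""
    names = [field for _, field, _, _ in Formatter().parse(pattern) if field]
    for name in names:
        if name not in namespace:
            raise ValueError('Undefined variable {} in pattern {}'.format(name, pattern))

    def is_sequence(value):
        return isinstance(value, (list, tuple))

    lengths = {len(namespace[n]) for n in names if is_sequence(namespace[n])}
    if len(lengths) > 1:
        raise ValueError('Sequences in pattern {} have different lengths'.format(pattern))
    count = lengths.pop() if lengths else 1
    results = []
    for i in range(count):
        # sequences contribute their i-th item, scalars are repeated
        values = {n: (namespace[n][i] if is_sequence(namespace[n]) else namespace[n])
                  for n in names}
        results.append(pattern.format(**values))
    return results
```

With `{'name': ['Bob', 'John'], 'salary': 200}` the pattern `"{name}'s salary is {salary}"` expands to the same two strings as in the last example.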

###  `logger` object

The SoS logger object is a `logging` object used by SoS to produce various outputs. You can use this object to output error, warning, info, debug, and trace messages to terminal. For example,

In [26]:
%run -v2
[0]
logger.info("I am at ${step_name}")

INFO: Running default_0: 
INFO: I am at default_0


The output of `logger` is controlled by logging level, for example, the above message would not be printed at `-v1` (warning)

In [27]:
%run -v1
[0]
logger.info("I am at ${step_name}")
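The effect of the logging level can be reproduced with the standard `logging` module alone. The sketch below does not use the SoS `logger`; the function `demo`, the logger names and the message format are made up for illustration, with `logging.INFO` and `logging.WARNING` playing the roles of `-v2` and `-v1`.

```python
import io
import logging

def demo(level):
    """Log one info and one warning message at the given level and return the output."""
    stream = io.StringIO()
    logger = logging.getLogger('sos_demo_{}'.format(level))
    logger.setLevel(level)
    logger.propagate = False                      # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
    logger.addHandler(handler)
    logger.info('I am at default_0')
    logger.warning('something looks off')
    return stream.getvalue()

info_output = demo(logging.INFO)        # both messages are kept
warning_output = demo(logging.WARNING)  # the info message is suppressed
```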