# SoS actions, targets, and functions


## SoS Action

Although arbitrary Python functions can be used in a SoS step process, SoS defines many special functions called **`actions`** that accept some shared parameters and can behave differently in different modes of SoS.

For example, function `time.sleep(1)` would be executed in run mode,

In [12]:
%run
[0]
import time
st = time.time()
time.sleep(1)
print('I just slept {:.2f} seconds'.format(time.time() - st))

I just slept 1.00 seconds


and also in dryrun mode (option `-n`),

In [13]:
%run -n
[0]
import time
st = time.time()
time.sleep(1)
print('I just slept {:.2f} seconds'.format(time.time() - st))

I just slept 1.00 seconds


because these statements are regular Python functions. However, if you put the statements in an action `python`, the statements would be executed in run mode,

In [14]:
%run
[0]
python:
    import time
    st = time.time()
    time.sleep(1)
    print('I just slept {:.2f} seconds'.format(time.time() - st))

I just slept 1.00 seconds


but will print out the script it would execute in dryrun mode (option `-n`)

In [1]:
%run -n
[0]
python:
    import time
    st = time.time()
    time.sleep(1)
    print('I just slept {:.2f} seconds'.format(time.time() - st))

python:
import time
st = time.time()
time.sleep(1)
print('I just slept {:.2f} seconds'.format(time.time() - st))




## Action options

Actions define their own parameters but their execution is controlled by a common set of options.

### `input`

Parameter `input` specifies the input files that an action needs before it can be executed. However, unlike targets in the `input:` statement of a step, where a missing input target would trigger the execution of an auxiliary step (if needed) to produce it, SoS would yield an error if an input file of an action does not exist.

For example, in the following example, step `20` is executed after step `10` so its `report` action can report the content of `a.txt` produced by step `10`.

In [8]:
%sandbox
%run
[10]
output: 'a.txt'
bash:
    echo 'content of a.txt' > a.txt

[20]
report: input='a.txt'

content of a.txt



However, in the following example, step `20` is executed as the first step of workflow `default`. The `report` action requires input file `a.txt` and yields an error.

In [7]:
%sandbox --expect-error
%run
[a: provides='a.txt']
bash:
    echo 'content of a.txt' > a.txt

[20]
report: input='a.txt'

Failed to process statement 'report(r"""\n""", input=\'a.txt\')\n' (FileNotFoundError): [Errno 2] No such file or directory: 'a.txt'


`a.txt` has to be put into the input statement of step `20` for the auxiliary step to be executed:

In [9]:
%sandbox
%run
[a: provides='a.txt']
bash:
    echo 'content of a.txt' > a.txt

[20]
input: 'a.txt'
report: input=input[0]

INFO: Resolving 1 objects from 1 nodes
INFO: Adding step a with output ['a.txt']


content of a.txt



### `output`

Similar to `input`, parameter `output` defines the output of an action, which can be a single name (or target) or a list of files or targets. SoS checks the existence of the output targets after the completion of the action. For example,

In [2]:
%sandbox --expect-error
%run
[10]
bash: output='non_existing.txt'

Failed to process statement 'bash(r"""\n""", output=\'non_existing.txt\')\n' (RuntimeError): Output target non_existing.txt does not exist after completion of action bash


In addition to checking the existence of input and output files, specifying `input` and `output` of an action will allow SoS to create signatures of action so that it will not be executed when it is called again with the same input and output files. This is in addition to step-level signature and can be useful for long-running actions.

For example, suppose a time-consuming `sh` action produces output `test.txt`

In [9]:
%run -s default
[10]
import time, os
time.sleep(2)

sh: input=[], output='test.txt'
   touch test.txt

print(os.path.getmtime('test.txt'))


1500226663.0


Because the action has parameters `input` and `output`, a signature will be created so it will not be re-executed even when the step itself is changed (from `sleep(2)` to `sleep(1)`).

In [10]:
%run -s default
[10]
import time, os
time.sleep(1)

sh: input=[], output='test.txt'
   touch test.txt

print(os.path.getmtime('test.txt'))


INFO: Action sh is ignored due to saved signature


1500226663.0


Note that we have to use option `-s default` for our examples because the default signature mode for SoS in Jupyter is `ignore`, so no signatures will be saved or used by default.
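
The idea of an action-level signature can be sketched in plain Python. This is only an illustration under stated assumptions: the names `run_with_signature`, `file_md5`, and the in-memory `_signatures` store are hypothetical, and SoS itself keeps its signatures on disk in its own format.

```python
import hashlib
import os
import tempfile

_signatures = {}   # hypothetical in-memory store, for illustration only


def file_md5(path):
    """Return the MD5 digest of the content of a file."""
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()


def run_with_signature(script, action, inputs, outputs):
    """Run `action` unless it already ran with the same script and files."""
    files = list(inputs) + list(outputs)
    key = hashlib.md5(script.encode()).hexdigest()
    if all(os.path.isfile(p) for p in files):
        if _signatures.get(key) == [file_md5(p) for p in files]:
            return 'ignored'
    action()
    _signatures[key] = [file_md5(p) for p in files]
    return 'executed'


output = os.path.join(tempfile.mkdtemp(), 'test.txt')


def touch():
    # stands in for the shell command `touch test.txt`
    open(output, 'w').close()


first = run_with_signature('touch test.txt', touch, [], [output])
second = run_with_signature('touch test.txt', touch, [], [output])
print(first, second)
```

The second call is skipped because the script and the content of the output file match what was recorded after the first call.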

###  `active`

Action option `active` is used to activate or inactivate an action in an input loop. Basically, when a loop is defined by the `for_each` or `group_by` options of the `input:` statement, an action after the input statement would be repeated for each input group. The `active` parameter accepts an integer (a non-negative index, or a negative index that counts backward from the last group), a sequence of indexes, or a slice object, and the action is active only for the specified groups.

For example, for an input loop that loops through a sequence of numbers, the first `run` action is executed for all groups, the second action is executed for groups with even indexes, and the last action is executed only for the last group.

In [9]:
seq = range(5)
input: for_each='seq'
run:
   echo I am active at all groups ${_index}
run: active=slice(None, None, 2)
   echo I am active at even groups ${_index}
run: active=-1
   echo I am active at last group ${_index}

I am active at all groups 0
I am active at even groups 0
I am active at all groups 1
I am active at all groups 2
I am active at even groups 2
I am active at all groups 3
I am active at all groups 4
I am active at even groups 4
I am active at last group 4
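
The selection rule can be illustrated in plain Python. The helper `is_active` below is a hypothetical name used only for this sketch and is not part of SoS; it assumes `active` is an integer, a sequence of integers, or a slice, as described above.

```python
def is_active(active, index, total):
    """Return True if the group at `index` (out of `total` groups) is selected.

    Hypothetical helper for illustration only, not the SoS implementation.
    """
    if isinstance(active, int):
        # A negative number counts backward from the last group.
        return index == (active if active >= 0 else total + active)
    if isinstance(active, slice):
        return index in range(total)[active]
    # Otherwise treat `active` as a sequence of (possibly negative) indexes.
    return any(is_active(a, index, total) for a in active)


groups = range(5)
even = [i for i in groups if is_active(slice(None, None, 2), i, 5)]
last = [i for i in groups if is_active(-1, i, 5)]
picked = [i for i in groups if is_active([0, -2], i, 5)]
print(even, last, picked)
```

With five groups, the slice selects indexes 0, 2 and 4 and `-1` selects index 4, which matches the output of the example above.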


### `workdir`

Option `workdir` changes the current working directory for the action, and changes it back once the action is executed. The directory will be created if it does not exist.

In [10]:
bash: workdir='tmp'
   touch a.txt
bash:
    ls tmp
    rm tmp/a.txt
    rmdir tmp

a.txt
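
The behavior of `workdir` resembles the following context manager. This is a minimal sketch, assuming only what is described above (create the directory if needed, enter it, and return afterwards); the name `working_directory` is hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Create `path` if needed, enter it, and return to the original directory."""
    original = os.getcwd()
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    try:
        yield
    finally:
        # Always change back, even if the body raises an exception.
        os.chdir(original)


target = os.path.join(tempfile.mkdtemp(), 'tmp')
before = os.getcwd()
with working_directory(target):
    open('a.txt', 'w').close()   # created inside the new directory
after = os.getcwd()
print(os.listdir(target))
```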


### `docker_image`

If a docker image is specified (either a name, an Id, or a file), the action is assumed to be executed in the specified docker container. The image will be automatically downloaded (pulled) or loaded (if a `.tar` or `.tar.gz` file is specified) if it is not available locally.

For example, executing the following script 

```
[10]
python3: docker_image='python'
  set = {'a', 'b'}
  print(set)
```

under a docker terminal (that is connected to the docker daemon) will

1. Pull docker image `python`,  which is the official docker image for Python 2 and 3.
2. Create a python script with the specified content
3. Run the docker container `python` and make the script available inside the container
4. Use the `python3` command inside the container to execute the script.

Additional `docker_run` parameters can be passed to actions when the action
is executed in a docker image. These options include

* `name`: name of the container (option `--name`)
* `tty`: if a tty is attached (default to `True`, option `-t`)
* `stdin_open`: if stdin should be open (default to `False`, option `-i`)
* `user`: username (default to `root`, option `-u`)
* `environment`: Can be a string, a list of strings, or a dictionary of environment variables for docker (option `-e`)
* `volumes`: string or list of strings, extra volumes that need to be linked, in addition to those mounted by SoS (`/tmp`, `/Users` (if mac), `/Volumes` (if [properly configured](http://vatlab.github.io/SOS/doc/tutorials/SoS_Docker_Guide.html) under mac) and script file)
* `volumes_from`: container names or Ids to get volumes from
* `working_dir`: working directory (option `-w`), default working directory, or working directory set by runtime option `workdir`.
* `port`: port opened (option `-p`)
* `extra_args`: any extra arguments you would like to pass to the `docker run` process (after you check the actual `docker run` command used by SoS)

### `docker_file`

This option allows you to import a docker from specified `docker_file`, which can be an archive file (`.tar`, `.tar.gz`, `.tgz`, `.bzip`, `.tar.xz`, `.txz`) or a URL to an archive file (e.g. `http://example.com/exampleimage.tgz`). SoS will use command `docker import` to import the `docker_file`. However, because SoS does not know the repository and tag names of the imported docker file, you will still need to use option `docker_image` to specify the image to use.

It is easy to define your own actions. All you need to do is to define a function and decorate it with a `SoS_Action` decorator. For example

```python
from pysos import SoS_Action

@SoS_Action(run_mode=('run', 'interactive'))
def my_action(parameters):
    # replace this placeholder with the actual work of the action
    do_something_with_parameters
    return 1
```

###  `args`

All script-executing actions accept an option `args`, which changes how the script is executed.

By default, such an action has an `interpreter` (e.g. `bash`) and a default `args='${filename!q}'`, and the script would be executed as `interpreter args`, which is
```
bash ${filename!q}
```
where `${filename!q}` would be replaced by the temporary script file.

If you would like to change the command line with additional parameters, or different format of filename, you can specify an alternative `args`, with variables `filename` (filename of temporary script) and `script` (actual content of the script).

For example, option `-n` can be added to command `bash` to execute script in dryrun mode

In [11]:
bash: args='-n ${filename!q}'
    echo "-n means running in dryrun mode (only check syntax)"

and you can even execute a command without a filename, passing the script directly on the command line instead

In [12]:
python: args='-m timeit ${script}'
    '"-".join(str(n) for n in range(100))'

10000 loops, best of 3: 35 usec per loop
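
How `interpreter args` turns into a command line can be sketched as follows. The helper `build_command` is hypothetical and handles only the placeholders used in this section by plain string replacement, which is far less than SoS string interpolation supports; `shlex.quote` stands in for the `!q` (quote) conversion.

```python
import shlex


def build_command(interpreter, args, filename, script=''):
    """Compose `interpreter args` after substituting `filename` and `script`.

    Hypothetical helper for illustration only.
    """
    expanded = (args
                .replace('${filename!q}', shlex.quote(filename))
                .replace('${filename}', filename)
                .replace('${script}', script))
    return f'{interpreter} {expanded}'


default_cmd = build_command('bash', '${filename!q}', '/tmp/my script.sh')
dryrun_cmd = build_command('bash', '-n ${filename!q}', '/tmp/script.sh')
print(default_cmd)
print(dryrun_cmd)
```

Quoting matters when the temporary filename contains spaces or other shell-special characters, which is why the default `args` uses `${filename!q}` rather than `${filename}`.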


### `allow_error`

Option `allow_error` tells SoS that the action might fail but this should not stop the workflow from executing. This option essentially turns an error into a warning message and changes the return value of the action to `None`.

For example, in the following example, the wrong shell script would stop the execution of the step so the following action is not executed.

In [15]:
%sandbox --expect-error
run: 
    This is not shell
print('Step after run')

/var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmpi62wxece: line 1: This: command not found
Failed to process statement run(r"""This is not shell\n""")...fter run'): Failed to execute script (ret=127). 
Please use command
    /bin/bash /var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmpzn3zpjx3/.sos/interactive_0_0
under /private/var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmpzn3zpjx3 to test it.


but in this example, the error of the `run` action is turned into a warning message and the following statement is still executed.

In [16]:
run: allow_error=True
    This is not shell
print('Step after run')

/var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmpag5371q4: line 1: This: command not found
Please use command
    /bin/bash /var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmpzn3zpjx3/.sos/interactive_0_0
under /Users/bpeng1/SOS/docs/src/documentation to test it.


Step after run
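
In Python terms, the effect of `allow_error` is similar to the wrapper below. This is a sketch under stated assumptions: `call_action` and `failing_action` are hypothetical names, and the warning is issued through the standard `warnings` module rather than the SoS logger.

```python
import warnings


def call_action(action, *args, allow_error=False, **kwargs):
    """Run `action`; with allow_error=True a failure becomes a warning and None."""
    try:
        return action(*args, **kwargs)
    except Exception as e:
        if not allow_error:
            raise
        warnings.warn(f'Action failed: {e}')
        return None


def failing_action():
    raise RuntimeError('Failed to execute script (ret=127)')


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = call_action(failing_action, allow_error=True)

print(result, len(caught))
print('Step after run')
```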


## Core Actions

###  Action `run`

`run` is the most frequently used action in SoS. In most cases, it is similar to action `bash` and uses `bash` to execute the specified script. Under the hood, this action is quite different from `bash` because the `run` action does not have a default interpreter and behaves differently in different situations.

In the simplest case when one or more commands are specified, action `run` would assume it is a batch script under Windows, and a bash script otherwise.

In [1]:
run:
    echo "A"

A


echo "A"


It is different from a `bash` action in that it will exit with an error if any of the commands exits with a non-zero code. That is to say, whereas a `sh` action would print an error message but continue as follows

In [2]:
sh:
    echoo "A"
    echo "B"

B


/var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmp030oqbna.sh: line 1: echoo: command not found


the `run` action would exit with an error

In [3]:
%sandbox --expect-error
run:
    echoo "A"
    echo "B"

echoo "A"
/var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmpsuk9ul5a.sh: line 1: echoo: command not found
Failed to process statement 'run(r"""echoo "A"\necho "B"\n""")\n' (RuntimeError): Failed to execute script (ret=127).
Please use command
	``/bin/bash \
	  /var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmp2be4gap5/.sos/scratch_0_0_cec60319.sh``
under "/private/var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmp2be4gap5" to test it.


In other words,
```
run:
    command1
    command2
    command3
```
is equivalent to

```
bash:
    command1 && command2 && command3
```
under Linux/MacOS systems.

However, if the script starts with a shebang line, this action would execute the script directly. This allows you to execute any script in any language. For example, the following script executes a python script using action `run`

In [4]:
run:
    #!/usr/bin/env python
    print('This is python')

This is python


and the following example runs a complete sos script using command `sos-runner`

In [5]:
%run
# use sigil=None to stop interpolating expressions in script
[sos: sigil=None]
run:
    #!/usr/bin/env sos-runner
    [10]
    print("This is ${step_name}")
    [20]
    print("This is ${step_name}")

This is default_10
This is default_20


INFO: Executing default_10: 
INFO: input:    []
INFO: output:   []
INFO: Executing default_20: 
INFO: input:    []
INFO: output:   []
INFO: Workflow default (ID=546b66a93731ebf2) is executed successfully.


Note that action `run` would not analyze the shebang line of a script if it is executed in a docker container (with option `docker_image`); the script would always be assumed to be a `bash` script.
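
The decision rules described in this section can be summarized in a short sketch. The function `choose_interpreter` and its parameters `platform` and `in_docker` are hypothetical and only mirror the prose above; the actual logic inside SoS may differ.

```python
import sys


def choose_interpreter(script, platform=sys.platform, in_docker=False):
    """Describe how a `run`-style script would be executed.

    Hypothetical helper for illustration only.
    """
    first_line = script.lstrip().split('\n', 1)[0]
    if first_line.startswith('#!') and not in_docker:
        # The shebang line names the interpreter, e.g. '/usr/bin/env python'.
        return first_line[2:].strip()
    # No usable shebang: batch script under Windows, bash script otherwise.
    return 'batch' if platform.startswith('win') else 'bash'


python_script = "#!/usr/bin/env python\nprint('This is python')"
print(choose_interpreter(python_script))
print(choose_interpreter('echo "A"', platform='linux'))
print(choose_interpreter(python_script, platform='linux', in_docker=True))
```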

### Action `script`

Action `script` is the general form of all script-executing actions in SoS. It accepts a script, and parameters `interpreter` (required), `suffix` (if required by the interpreter) and optional `args` (command line arguments). It can be used to execute any script for which its interpreter is not currently supported by SoS. For example, the action

```
python:
    print('HERE')
```

can be executed as

In [6]:
script: interpreter='python'
    print('HERE')

HERE


###  Action `sos_run`

Action `sos_run(workflow=None, targets=None, shared=[], args={}, **kwargs)` executes a specified workflow. The workflow can be a single workflow, a subworkflow (e.g. `A_-10`), a combined workflow (e.g. `A + B`), or a workflow that is constructed to generate `targets`. The workflow

* Takes `_input` of the parental step as the input of the first step of the subworkflow
* Takes `args` (a dictionary) and `**kwargs` as parameters as if they are specified from command line
* Copies variables specified in `shared` (a string or a list of string) to the subworkflow if they exist in the parental namespace
* Returns variables defined in `shared` to the parental namespace after the completion of the workflow

`sos_run` would be executed in a separate process in batch mode, and would be executed in the same process in interactive mode, so parameter `shared` is only needed for batch execution.

The simplest use of action `sos_run` is for the execution of one or more workflows. For example,

In [7]:
%run
[A]
print(step_name)

[B]
print(step_name)

[default]
sos_run('A + B')

A_0
B_0


The subworkflows are executed separately and only take the `_input` of the step as the `input` of the workflow. For example,

In [8]:
%sandbox
%run
!touch a.txt b.txt

[process]
print("Handling ${input}")

[default]
input: 'a.txt', 'b.txt', group_by=1
sos_run('process')

Handling a.txt
Handling b.txt


If you would like to send one or more variables to the subworkflow, or return a variable from the execution of the subworkflow, you can specify them with the `shared` parameter. Returning a variable is a bit tricky because you can only return workflow-level variables, which are usually `shared` from a step of the subworkflow. For example,

In [9]:
%sandbox
%run

[process]
print("Working with seed ${seed}")

[default]
for seed in range(5):
    sos_run('process', seed=seed)

Working with seed 0
Working with seed 1
Working with seed 2
Working with seed 3
Working with seed 4


In [10]:
%sandbox
%run

[process: shared='result']
result = 100

[default]
sos_run('process')
print("Result from subworkflow process is ${result}")
    

Result from subworkflow process is 100


If the subworkflow accepts parameters, they can be specified using keyword arguments or as a dictionary for parameter `args` of the `sos_run` function. The subworkflow would take values from parameters as if they are passed from command line. 

For example, the following workflow defines parameter `cutoff` with default value 10. When it is executed without command line option, the default value is used.

In [11]:
%sandbox
%run

[default]
parameter: cutoff=10
print("Process with cutoff=${cutoff}")

[batch]
for value in range(2, 10, 2):
    sos_run('default', cutoff=value)

Process with cutoff=10


Command line argument could be used to specify a different `cutoff` value:

In [12]:
%sandbox
%rerun --cutoff=4

Process with cutoff=4


Now, if we run the `batch` workflow, which calls the `default` workflow with parameter `cutoff`, the `parameter: cutoff=10` statement takes the passed value as if it were specified from command line.

In [13]:
%sandbox
%rerun batch

Process with cutoff=2
Process with cutoff=4
Process with cutoff=6
Process with cutoff=8


Note that the parameters could also be specified with parameter `args`, 

In [14]:
%sandbox
%run batch

[default]
parameter: cutoff=10
print("Process with cutoff=${cutoff}")

[batch]
for value in range(2, 10, 2):
    sos_run('default', args={'cutoff': value})

Process with cutoff=2
Process with cutoff=4
Process with cutoff=6
Process with cutoff=8


although the keyword arguments are usually easier to use.

Action `sos_run` cannot be used in `task` (see [Remote Execution](Remote_Execution.html) for details) because tasks are designed to be executed independently of the workflow. 

### Action `report`

Action `report` writes some content to an output stream. The input can either be a string or content of one or more files specified by option `input`. The output is determined by parameter `output`, and command line option `-r`.

* If `output='filename'`, the content will be written to a file.
* If `output=obj` and `obj` has a `write` function (e.g. a file handle), the content will be passed to the `write` function
* If output is unspecified and no filename is specified from option `-r`, the content will be written to standard output.
* If output is unspecified and a filename is specified with option `-r`, the content will be appended to specified file.
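
The four rules above amount to a small dispatch on the type of `output`. The function below is a minimal sketch with hypothetical names (`write_report`, `report_file` for the file given by option `-r`), not the SoS implementation.

```python
import io
import os
import sys
import tempfile


def write_report(content, output=None, report_file=None):
    """Send `content` to the destination chosen by `output` and the `-r` file."""
    if isinstance(output, str):
        with open(output, 'w') as f:        # output='filename': write to the file
            f.write(content)
    elif hasattr(output, 'write'):
        output.write(content)               # object with a `write` function
    elif report_file is None:
        sys.stdout.write(content)           # nothing specified: standard output
    else:
        with open(report_file, 'a') as f:   # file from option -r: append
            f.write(content)


buffer = io.StringIO()
write_report('Running default_10\n', output=buffer)

report_path = os.path.join(tempfile.mkdtemp(), 'report.md')
write_report('step 10\n', report_file=report_path)
write_report('step 20\n', report_file=report_path)
with open(report_path) as f:
    appended = f.read()
print(appended, end='')
```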

For example, the content of `report` actions is printed to standard output if no output is specified.

In [15]:
%run

[10]
report:
    Running ${step_name}

[20]
report:
    Running ${step_name}


Running default_10

Running default_20



We can specify an output file with option `output`, but the output will be overwritten if multiple actions write to the same file

In [16]:
%sandbox
%run

%preview report.txt
[10]
report: output='report.txt'
    Running ${step_name}

[20]
report: output='report.txt'
    Running ${step_name}

Action `report` can also take the content of one or more input files and write them to the output stream, after the script content (if specified). For example, the `report` action in the following example writes the content of `out.txt` to the default report stream (which is the standard output in this case).

In [17]:
%sandbox
%run

[10]
output: 'out.txt'
run:
   # run some command and generate a out.txt
   echo "* some result " > out.txt

[20]
report: input='out.txt'
    Summary Report:

# run some command and generate a out.txt
echo "* some result " > out.txt



Summary Report:

* some result



### Action  `bash`

Action `bash(script)` accepts a shell script and executes it using `bash`. Actions `sh`, `csh`, `tcsh`, and `zsh` use their respective shells to execute the provided script.

These actions, as well as all script-executing actions such as `python`, also accept an option `args` that allows you to pass additional arguments to the interpreter. For example

In [18]:
run: args='-n ${filename!q}'
      echo "a"

executes the script with command `bash -n` (check syntax), so command `echo` is not actually executed.

### Action `sh`
Execute script with a `sh` interpreter

### Action  `csh`
Execute script with a `csh` interpreter

### Action  `tcsh`
Execute script with a `tcsh` interpreter

###  Action `zsh`
Execute script with a `zsh` interpreter

### Action  `perl`

Action `perl(script)` executes the passed script using the `perl` interpreter.

In [19]:
perl:
    my $name = "Brian";
    print "Hello, $name!\n";

Hello, Brian!


### Action  `ruby`

Action `ruby(script)` executes the passed script using the `ruby` interpreter.

In [20]:
ruby:
    a = [ 45, 3, 19, 8 ]
    b = [ 'sam', 'max', 56, 98.9, 3, 10, 'jill' ]
    print (a + b).join(' '), "\n"
    print a[2], " ", b[4], " ", b[-2], "\n"
    print a.sort.join(' '), "\n"
    a << 57 << 9 << 'phil'
    print "A: ", a.join(' '), "\n"

45 3 19 8 sam max 56 98.9 3 10 jill
19 3 10
3 8 19 45
A: 45 3 19 8 57 9 phil


###  Action  `node`

Action `node(script)` executes the passed script using `node` (JavaScript) interpreter.

In [21]:
node:
    var i, a, b, c, max;

    max = 1000000000;

    var d = Date.now();

    for (i = 0; i < max; i++) {
        a = 1234 + 5678 + i;
        b = 1234 * 5678 + i;
        c = 1234 / 2 + i;
    }

    console.log(Date.now() - d);

976


### Action  `pandoc`

Action `pandoc` uses command [pandoc](http://pandoc.org/) to convert specified input to output. The input to this action can be specified with option `script` (usually specified in script format) and option `input`.

First, if a script is specified, pandoc assumes it is in markdown format and converts it to 'HTML' format by default. For example,

In [22]:
pandoc:
    # this is header
    This is some test, with **emphasis**.        

<h1 id="this-is-header">this is header</h1>
<p>This is some test, with <strong>emphasis</strong>.</p>


You can specify an output with option `output`

In [23]:
%sandbox
%preview out.html
pandoc: output='out.html'
    Itemize

    * item 1
    * item 2

INFO: Report saved to out.html


You can convert input file to another file type using a different file extension

In [24]:
%sandbox
%preview out.tex
pandoc: output='out.tex'
    Itemize

    * item 1
    * item 2

INFO: Report saved to out.tex


Or you can add more options to the command line by modifying `args`,

In [25]:
%sandbox
%preview out.html
pandoc: output='out.html', args='${input!q} --output ${output!q} -s'
    Itemize

    * item 1
    * item 2

INFO: Report saved to out.html


The second usage of the `pandoc` action is to specify one or more input filenames. You have to use the function form of this action as follows

In [26]:
%sandbox
%preview out.html
[10]
report: output = 'out.md'
    Itemize

    * item 1
    * item 2

[20]
pandoc(input='out.md', output='out.html')
    

If multiple files are specified, the content of these input files will be concatenated. This is very useful for generating a single pandoc output with input from different steps. We will demonstrate this feature in the [Generating Reports](../tutorials/Generating_Reports.html) tutorial.

If both `script` and `input` parameters are specified, the content of input files would be appended to `script`. So

In [27]:
#%sandbox
%preview out.html
[10]
report: output = 'out10.md'
    Itemize

    * item 1
    * item 2

[20]
report: output= 'out20.md'
    enumerated

    1. item 1
    2. item 2

[30]
pandoc: input=['out10.md', 'out20.md'], output='out.html'
    Markdown supports both itemized and enumerated

### Action `docker_build`

Build a docker image from an inline Docker file. The inline version of the action currently does not support adding any file from the local machine because the docker file will be saved to a random directory. You can work around this problem by creating a `Dockerfile` and passing it to the action through option `path`. This action accepts all parameters as specified in [docker-py documentation](http://docker-py.readthedocs.io/en/latest/images.html) because SoS simply passes additional parameters to the `build` function.

For example, the following step builds a docker container for [MISO](http://miso.readthedocs.org/en/fastmiso/) based on anaconda python 2.7.

```
[build_1]
# building miso from a Dockerfile
docker_build: tag='mdabioinfo/miso:latest'

	############################################################
	# Dockerfile to build MISO container images
	# Based on Anaconda python
	############################################################

	# Set the base image to anaconda Python 2.7 (miso does not support python 3)
	FROM continuumio/anaconda

	# File Author / Maintainer
	MAINTAINER Bo Peng <bpeng@mdanderson.org>

	# Update the repository sources list
	RUN apt-get update

	# Install compiler and python stuff, samtools and git
	RUN apt-get install --yes \
	 build-essential \
	 gcc-multilib \
	 gfortran \ 
	 apt-utils \
	 libblas3 \ 
	 liblapack3 \
	 libc6 \
	 cython \ 
	 samtools \
	 libbam-dev \
	 bedtools \
	 wget \
	 zlib1g-dev \ 
	 tar \
	 gzip

	WORKDIR /usr/local
	RUN pip install misopy
```

### Action  `download`

Action `download(URLs, dest_dir='.', dest_file=None, decompress=False)` downloads files from specified URLs, which can be a list of URLs, or a string with tab, space or newline separated URLs.

* If `dest_file` is specified, only one URL is allowed and the URL can have any form.
* Otherwise all files will be downloaded to `dest_dir`. Filenames are determined from URLs so the URLs must have the last portion as the filename to save.
* If `decompress` is True, `.zip` files, compressed or plain `tar` (e.g. `.tar.gz`) files, and `.gz` files will be decompressed to the same directory as the downloaded file.
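
The rules for splitting URLs and choosing local filenames can be sketched as below. The helpers `split_urls` and `destination` are hypothetical names for illustration; they only mirror the rules in the list above and do not download anything.

```python
import os
from urllib.parse import urlparse


def split_urls(urls):
    """Accept a list of URLs or a tab, space, or newline separated string."""
    return urls.split() if isinstance(urls, str) else list(urls)


def destination(url, dest_dir='.', dest_file=None):
    """Return the local path a URL would be saved to under the rules above."""
    if dest_file is not None:
        return dest_file            # any URL form is acceptable in this case
    filename = os.path.basename(urlparse(url).path)
    if not filename:
        raise ValueError(f'Cannot determine a filename from {url}')
    return os.path.join(dest_dir, filename)


urls = split_urls('http://example.com/a.vcf.gz\thttp://example.com/a.vcf.gz.md5')
paths = [destination(u, dest_dir='resource') for u in urls]
print(paths)
```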

For example,

```
[10]
GATK_RESOURCE_DIR = '/path/to/resource'
GATK_URL = 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg19/'

download:   dest_dir=GATK_RESOURCE_DIR
    ${GATK_URL}/1000G_omni2.5.hg19.sites.vcf.gz
    ${GATK_URL}/1000G_omni2.5.hg19.sites.vcf.gz.md5
    ${GATK_URL}/1000G_omni2.5.hg19.sites.vcf.idx.gz
    ${GATK_URL}/1000G_omni2.5.hg19.sites.vcf.idx.gz.md5
```

downloads the specified files to `GATK_RESOURCE_DIR`. The `.md5` files will be automatically used to validate the content of the associated files.

Note that SoS automatically saves signatures of downloaded and decompressed files so the files will not be re-downloaded if the action is called multiple times. You can, however, still specify the input and output of the step to use step signatures:


```
[10]
GATK_RESOURCE_DIR = '/path/to/resource'
GATK_URL = 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg19/'
GATK_RESOURCE_FILES =  '''1000G_omni2.5.hg19.sites.vcf.gz
    1000G_omni2.5.hg19.sites.vcf.gz.md5
    1000G_omni2.5.hg19.sites.vcf.idx.gz
    1000G_omni2.5.hg19.sites.vcf.idx.gz.md5'''.split()
input: []
output:  [os.path.join(GATK_RESOURCE_DIR, x) for x in GATK_RESOURCE_FILES]
download(['${GATK_URL}/${x}' for x in GATK_RESOURCE_FILES], dest_dir=GATK_RESOURCE_DIR)
```

Note that the `download` action uses up to 5 processes to download files. You can change this number by adjusting system configuration `sos_download_processes`.

###  Action `fail_if`

Action `fail_if(expr, msg='')` raises an exception with `msg` (and terminates the execution of the workflow if the exception is not caught) if `expr` returns True.

###  Action `warn_if`

Action `warn_if(expr, msg)` yields a warning message `msg` if `expr` evaluates to true.
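
The semantics of `fail_if` and `warn_if` can be mimicked with two small functions. The names `sketch_fail_if` and `sketch_warn_if` are hypothetical stand-ins, and the exception type (`RuntimeError`) and the use of the `warnings` module are assumptions made for this illustration.

```python
import warnings


def sketch_fail_if(expr, msg=''):
    """Raise an exception carrying `msg` when `expr` is true."""
    if expr:
        raise RuntimeError(msg)


def sketch_warn_if(expr, msg):
    """Emit a warning carrying `msg` when `expr` is true."""
    if expr:
        warnings.warn(msg)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    sketch_warn_if(1 > 2, 'never shown')
    sketch_warn_if(2 > 1, 'cutoff looks too small')

try:
    sketch_fail_if(True, 'input is empty')
    failed = False
except RuntimeError as e:
    failed = str(e)

print(len(caught), failed)
```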

###  Action `stop_if`

Action `stop_if(expr, msg='')` stops the execution of the current step (or current iteration if within input loops specified by parameters `group_by` or `for_each`) and gives a warning message if `msg` is specified. For example,

In [28]:
%sandbox
%run
!touch a.txt
!echo 'something' > b.txt

[10]
input: '*.txt', group_by=1

stop_if(os.path.getsize(_input[0]) == 0)
print(_input)

['b.txt']


skips `a.txt` because it has size 0.

A side effect of `stop_if` is that it will clear `_output` of the iteration so that the step `output` consists of only files from non-stopped iterations. For example,

In [29]:
%sandbox
%run
[10]
input: for_each={'idx': range(10)}
output: "${idx}.txt"
stop_if(idx % 2 == 0)
run:
    echo "Generating ${_output}"
    touch ${_output}

[20]
print("Output of last step is ${input}")

Generating 1.txt


echo "Generating 1.txt"
touch 1.txt



Generating 3.txt


echo "Generating 3.txt"
touch 3.txt



Generating 5.txt


echo "Generating 5.txt"
touch 5.txt



Generating 7.txt


echo "Generating 7.txt"
touch 7.txt



Generating 9.txt


echo "Generating 9.txt"
touch 9.txt



Output of last step is 1.txt 3.txt 5.txt 7.txt 9.txt


## Core Targets

Targets are objects that a SoS step can input, output, or depend on. They are usually files that are represented by filenames, but can also be other types of targets.

### Target `FileTarget`

Targets of type `FileTarget` represent files on a file system. The type `FileTarget` does not need to be used explicitly because SoS treats string type targets as `FileTarget`.

SoS uses MD5 signatures to detect changes in the content of files. To reduce the time to generate signatures for large files, SoS extracts strips of data from large files to calculate a partial MD5. The resulting signatures are different from MD5 signatures of complete files calculated by other tools.
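
A partial MD5 of this kind can be sketched as follows. The size limit, strip size, and number of strips are assumed values chosen for illustration, and `partial_md5` is a hypothetical name; the sampling scheme used by SoS may differ.

```python
import hashlib
import os
import tempfile

# Assumed values for illustration only.
FULL_HASH_LIMIT = 1 << 20   # files up to 1 MiB are hashed completely
STRIP_SIZE = 1 << 16        # 64 KiB per strip
STRIP_COUNT = 8             # number of evenly spaced strips


def partial_md5(path):
    """MD5 of the whole file if small, otherwise of evenly spaced strips."""
    size = os.path.getsize(path)
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        if size <= FULL_HASH_LIMIT:
            md5.update(f.read())
        else:
            step = (size - STRIP_SIZE) // (STRIP_COUNT - 1)
            for i in range(STRIP_COUNT):
                f.seek(i * step)
                md5.update(f.read(STRIP_SIZE))
    return md5.hexdigest()


folder = tempfile.mkdtemp()
small_data = b'content of a.txt\n'
large_data = bytes(range(256)) * 8192          # 2 MiB of data
small_path = os.path.join(folder, 'small.txt')
large_path = os.path.join(folder, 'large.bin')
for path, data in [(small_path, small_data), (large_path, large_data)]:
    with open(path, 'wb') as f:
        f.write(data)

print(partial_md5(small_path) == hashlib.md5(small_data).hexdigest())
print(partial_md5(large_path) == hashlib.md5(large_data).hexdigest())
```

For the small file the partial signature equals the regular MD5, while for the large file only a fraction of the data is read, so the signature differs from the one reported by tools that hash the complete file.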

A file can be **zapped** by command `sos remove --zap` and still be considered available by SoS. This command removes a `file` and generates `${file}.zapped` with essential information such as the signature and size of the original file. A step would not be rerun if its input, dependent, or output files are zapped instead of removed. This feature is useful for removing large intermediate files generated by the execution of workflows, while still keeping the complete runtime information of the workflow.

### Target `executable` 

`executable` targets are commands that should be accessible and executable by SoS. These targets are usually listed in the `depends` section of a SoS step. For example, SoS would stop if command `some_command` is not found.

In [30]:
%sandbox --expect-error
[10]
input:     'a.txt'
depends:   executable('some_command')
sh:
    some_command ${input}

An `executable` target can also be the output of a step, but installing executables can be tricky because the commands should be installed to an existing `$PATH` so that they are immediately accessible by SoS. Because SoS automatically adds `~/.sos/bin` to `$PATH` (option `-b`), an environment-neutral way for on-the-fly installation is to install commands to this directory. For example

!rm -f ~/.sos/bin/lls

[lls: provides=executable('lls')]
sh:
    echo "#!/usr/bin/env bash" > ~/.sos/bin/lls
    echo "echo I am lls" >> ~/.sos/bin/lls
    chmod +x ~/.sos/bin/lls

[10]
depends: executable('lls')
sh:
    lls

creates an executable command `lls` under `~/.sos/bin` when this executable does not exist.

You can also have finer control over which version of the command is eligible by checking the output of commands. The trick here is to provide a complete command and one or more version strings as the string that should appear in the output of the command.

For example, command `perl --version` is executed in the following example to check if the output contains string `5.18`. The step would only be executed if the right version exists.

[10]
depends: executable('perl --version', version='5.18')
print('ok')

If no version string is provided, SoS will only check the existence of the command and not actually execute the command.
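
The check described above can be approximated with the standard library. The function `executable_available` is a hypothetical name for this sketch, which uses the current Python interpreter as the command so that it can run anywhere; it is not the SoS implementation.

```python
import shlex
import shutil
import subprocess
import sys


def executable_available(command, version=None):
    """Check that a command exists and, optionally, that its output
    contains one of the given version strings."""
    parts = shlex.split(command)
    if shutil.which(parts[0]) is None:
        return False
    if version is None:
        return True     # existence check only, the command is not executed
    versions = [version] if isinstance(version, str) else list(version)
    result = subprocess.run(parts, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return any(v in result.stdout for v in versions)


python_cmd = shlex.quote(sys.executable) + ' --version'
current = '{}.{}'.format(*sys.version_info[:2])
print(executable_available(python_cmd, version=current))
print(executable_available('no_such_command_xyz_123'))
```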

### Target `sos_variable` 

`sos_variable(name)` targets represent SoS variables that are created by a SoS step and shared to other steps. These targets can be used to provide information to other steps. For example,

In [31]:
%sandbox
[counts: shared='counts']
input: 'result.txt'
with open(input[0]) as ifile:
    counts = int(ifile.read())

[10]

# perform some task and create a file with some statistics
output: 'result.txt'
run:
   echo 100 >> result.txt 

[100]
depends: sos_variable('counts')
report:
    There are ${counts} objects

Step `100` needed some information extracted from output of another step (step `10`). You can either parse the information in step `100` or use another step to provide the information. The latter is recommended because the information could be requested by multiple steps. Note that `counts` is an auxiliary step that provides `sos_variable('counts')` through its `shared` section option.

### Target `env_variable` 

SoS keeps track of the runtime environment and creates signatures of executed steps so that they do not have to be executed again. Some commands, especially shell scripts, could however behave differently with different environment variables. To make sure a step is re-executed when the environment changes, you should list the variables that affect the output of these commands as dependencies of the step. For example

%sandbox --expect-error
[10]
depends:   env_variable('DEBUG')
sh:
    echo DEBUG is set to $DEBUG
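To see why listing a variable as a dependency matters, consider a signature that covers both the script and the values of the listed variables. The following is a conceptual sketch only; the function `step_signature` is hypothetical and does not reflect how SoS actually computes its signatures.

```python
import hashlib
import os


def step_signature(script, env_names):
    """Hash a script together with the current values of the listed
    environment variables (hypothetical helper for illustration)."""
    md5 = hashlib.md5()
    md5.update(script.encode())
    for name in sorted(env_names):
        value = os.environ.get(name, '')
        md5.update('{}={}'.format(name, value).encode())
    return md5.hexdigest()


script = 'echo DEBUG is set to $DEBUG'

os.environ['DEBUG'] = '0'
sig_before = step_signature(script, ['DEBUG'])

os.environ['DEBUG'] = '1'
sig_after = step_signature(script, ['DEBUG'])

# The script is unchanged but the signature differs, so a step that
# depends on DEBUG would be considered out of date
print(sig_before != sig_after)
```

If `DEBUG` were not listed, the two signatures would be identical and the change of environment would go unnoticed.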

### Target `sos_step` 

The `sos_step` target represents, needless to say, a SoS step. This target provides a straightforward method to specify step dependencies. For example,

[init]
print("Initialize")

[10]
depends: sos_step("init")
print("I am ${step_name}")

### Target `dynamic` 

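A `dynamic` target is a target that can only be determined when the step is actually executed.

For example, the following workflow fails because the output of step `10` is specified as a wildcard pattern `*.txt`, which cannot be resolved to actual files before the step is executed.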

In [1]:
%sandbox --expect-error
[10]
output: '*.txt'
sh:
    touch a.txt

[20]
print('Last output is ${input}')

To address this problem, you should try to expand the output file after the completion of the step, using a `dynamic` target.

In [None]:
%sandbox
[10]
output: dynamic('*.txt')
sh:
    touch a.txt

[20]
print("Last output is ${input}")

Please refer to chapter [SoS Step](SoS_Step.html) for details of such targets.
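The timing issue behind `dynamic` targets can be reproduced with Python's `glob` module. This sketch is independent of SoS and only shows that the same pattern expands to different file lists before and after the files are created; the temporary directory and file name are chosen for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    pattern = os.path.join(workdir, '*.txt')

    # Expanding the pattern before the "step" runs finds nothing
    before = glob.glob(pattern)

    # The "step" creates its output file
    open(os.path.join(workdir, 'a.txt'), 'w').close()

    # Expanding the same pattern afterwards finds the new file
    after = [os.path.basename(path) for path in glob.glob(pattern)]

print(before, after)
```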

### Target `remote`

A target that is marked as `remote` would be instantiated only when it is executed by a task. Please check section [Remote Execution](Remote_Execution.html) for details.

## Functions and objects

### Function `get_output`

Function `get_output(cmd)` returns the output of a command (decoded in `UTF-8`), which is a shortcut for `subprocess.check_output(cmd, shell=True).decode()`.

In [32]:
get_output('which ls')

'/bin/ls\n'

This function also accepts two options, `show_command=False` and `prompt='$ '`, which can be useful if you would like to present the command that produces the output. For example,

In [33]:
print(get_output('which ls', show_command=True))

$ which ls
/bin/ls
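For readers who want to reproduce this behavior outside of SoS, the function can be approximated in a few lines of Python. This is a sketch based on the description above, not the SoS source; the name `get_output_sketch` is hypothetical and the way the prompt and command are prepended is an assumption inferred from the output shown.

```python
import subprocess


def get_output_sketch(cmd, show_command=False, prompt='$ '):
    """Return the decoded output of a shell command, optionally
    preceded by the command itself (hypothetical re-implementation)."""
    output = subprocess.check_output(cmd, shell=True).decode()
    if show_command:
        # Prepend the prompt and the command on a line of their own
        return '{}{}\n{}'.format(prompt, cmd, output)
    return output


print(get_output_sketch('echo hello', show_command=True))
```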



### Function `expand_pattern`

Function `expand_pattern` expands a string to multiple ones using items of variables quoted between `{ }`. For example,

```python
output: expand_pattern('{a}_{b}.txt')
```

is equivalent to

```python
output: ['{}_{}.txt'.format(x, y) for x, y in zip(a, b)]
```

if `a` and `b` are sequences of the same length. For example,

In [34]:
name = ['Bob', 'John']
salary = [200, 300]
expand_pattern("{name}'s salary is {salary}")

["Bob's salary is 200", "John's salary is 300"]

The sequences should have the same length

In [35]:
%sandbox --expect-error

salary = [200]
expand_pattern("{name}'s salary is {salary}")

Failed to process statement 'expand_pattern("{name}\'s salary is {salary}")' (ValueError): Undefined variable name in pattern {name}'s salary is {salary}


An exception is made for variables of simple non-sequence types, in which case they are repeated in all expanded items

In [36]:
salary = 200
expand_pattern("{name}'s salary is {salary}")

["Bob's salary is 200", "John's salary is 200"]
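The rules illustrated by these examples (parallel expansion of sequences, an error for mismatched lengths, and repetition of non-sequence values) can be summarized in a short Python sketch. It is not the SoS implementation: `expand_pattern_sketch` is a hypothetical name, it takes the variables as an explicit dictionary instead of looking them up in the SoS environment, and the error messages are illustrative.

```python
import re


def expand_pattern_sketch(pattern, variables):
    """Expand {name} placeholders using parallel items of sequences
    (hypothetical helper; variables are passed in explicitly)."""
    names = re.findall(r'\{(\w+)\}', pattern)
    lengths = set()
    for name in names:
        if name not in variables:
            raise ValueError('Undefined variable {} in pattern {}'.format(name, pattern))
        if isinstance(variables[name], (list, tuple)):
            lengths.add(len(variables[name]))
    if len(lengths) > 1:
        raise ValueError('Sequences in pattern {} differ in length'.format(pattern))
    # Without any sequence, the pattern expands to a single item
    size = lengths.pop() if lengths else 1
    expanded = []
    for idx in range(size):
        values = {}
        for name in names:
            value = variables[name]
            # Sequences contribute their idx-th item, other values are repeated
            values[name] = value[idx] if isinstance(value, (list, tuple)) else value
        expanded.append(pattern.format(**values))
    return expanded


print(expand_pattern_sketch("{name}'s salary is {salary}",
                            {'name': ['Bob', 'John'], 'salary': 200}))
```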

### Object `logger`

The SoS logger object is a `logging` object used by SoS to produce various outputs. You can use this object to output error, warning, info, debug, and trace messages to the terminal. For example,

In [37]:
%run -v2
[0]
logger.info("I am at ${step_name}")

INFO: I am at default_0


The output of `logger` is controlled by the logging level. For example, the above message would not be printed at `-v1` (warning):

In [38]:
%run -v1
[0]
logger.info("I am at ${step_name}")
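The same filtering can be observed with Python's standard `logging` module, on which the SoS `logger` is based. The sketch below uses a separate logger named `verbosity_demo` (a name chosen for illustration) and assumes that `-v2` corresponds to level `INFO` and `-v1` to level `WARNING`, as suggested by the examples above.

```python
import io
import logging

stream = io.StringIO()
demo_logger = logging.getLogger('verbosity_demo')
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
demo_logger.addHandler(handler)
demo_logger.propagate = False

# Comparable to -v2: info messages are emitted
demo_logger.setLevel(logging.INFO)
demo_logger.info('I am at default_0')

# Comparable to -v1: info messages are dropped, warnings still pass
demo_logger.setLevel(logging.WARNING)
demo_logger.info('I am at default_0')
demo_logger.warning('Something looks odd')

captured = stream.getvalue()
print(captured, end='')
```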

## Language `Python`

### Action `python`

Actions `python(script)` and `python3(script)` accept a Python script and execute it with `python` or `python3`, respectively.

Because SoS can include Python statements directly in a SoS script, it is important to note that embedded Python
statements are interpreted by SoS, whereas the `python` and `python3` actions are executed in separate processes without
access to the SoS environment.

For example, the following SoS step executes some Python statements **within** SoS with direct access to SoS variables
such as `input`, and with `result` written directly to the SoS environment,

```python
[10]
for filename in input:
    with open(filename) as data:
        result = filename + '.res'
        ....
```

while

```python
[10]
input: group_by='single'

python:

with open(${input!r}) as data:
   result = ${input!r} + '.res'
   ...


```

composes a Python script for each input file and calls separate Python interpreters to execute them. Whereas
the Python statements in the first example will always be executed, the statements in `python` will not be executed
in `inspect` mode.
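The distinction between statements that run inside the current interpreter and a script that runs in a separate process can be demonstrated in plain Python. This is an analogy rather than SoS code: `exec` stands in for embedded statements and `subprocess` for the `python` action, and the file name `a.txt` is made up for the example.

```python
import subprocess
import sys

# In-process execution: the assignment lands in a namespace we can inspect
namespace = {}
exec("result = 'a.txt' + '.res'", namespace)
print(namespace['result'])

# Separate process: the variable lives and dies in the child interpreter,
# so the only way to get the value back is through its output
script = "result = 'a.txt' + '.res'\nprint(result)"
completed = subprocess.run([sys.executable, '-c', script],
                           stdout=subprocess.PIPE,
                           universal_newlines=True)
child_output = completed.stdout.strip()
print(child_output)
```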

### Action `python2`

Action `python2` is similar to `python` but it tries to use interpreter `python2` (or `python2.7` on some systems) before `python`, which could be python 3. Note that this action does not actually test the version of interpreter so it would use python 3 if this is the only available version.

### Action `python3`

Action `python3` is similar to `python` but it tries to use interpreter `python3` (version 3 of python) before `python`, which could be python 2. Note that this action does not actually test the version of interpreter so it would use python 2 if this is the only available version.
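The fallback behavior described for `python2` and `python3` amounts to taking the first interpreter name that can be found in `$PATH`. A minimal sketch of this lookup, using a hypothetical helper `pick_interpreter` and not the actual SoS code, could look like this:

```python
import shutil


def pick_interpreter(candidates):
    """Return the path of the first candidate found in $PATH, or None
    if none exists (hypothetical helper; versions are not checked)."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


# Order of preference for a python3-like action
print(pick_interpreter(['python3', 'python']))
```

Because only the names are checked, the interpreter that is returned for `python` may be of either major version, which matches the caveat above.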

### Target `Py_Module`

This target is usually used in the `depends` statement of a SoS step to specify a required Python module. If a module is not available, SoS will try to execute command `pip install` to install it, which might or might not succeed depending on your system configuration. For example,

In [39]:
depends: Py_Module('tabulate')
from tabulate import tabulate
table = [["Sun",696000,1989100000],["Earth",6371,5973.6],
    ["Moon",1737,73.5],["Mars",3390,641.85]]
print(tabulate(table))

-----  ------  -------------
Sun    696000     1.9891e+09
Earth    6371  5973.6
Moon     1737    73.5
Mars     3390   641.85
-----  ------  -------------
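The check-then-install behavior described above can be sketched with the standard library. The helper names `module_available` and `ensure_module` are hypothetical, and the exact `pip` invocation used by SoS is not shown in this document, so the command below is an assumption.

```python
import importlib
import importlib.util
import subprocess
import sys


def module_available(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def ensure_module(name):
    """Try to install a missing module with pip (hypothetical helper);
    installation may fail depending on the system configuration."""
    if module_available(name):
        return True
    subprocess.call([sys.executable, '-m', 'pip', 'install', name])
    # Make newly installed packages visible to the import system
    importlib.invalidate_caches()
    return module_available(name)


print(module_available('json'))
```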


## Language `R`

### Action `R`

Action `R(script)` executes the passed script using the `Rscript` command.

In [40]:
R:
    D <- data.frame(x=c(1,2,3,1), y=c(7,19,2,2))
    # Sort on x
    indexes <- order(D$x)
    D[indexes,]

  x  y
1 1  7
4 1  2
2 2 19
3 3  2


### Action `Rmarkdown`

Action `Rmarkdown` shares the same user interface with action `pandoc`. The only big difference is that it uses `R`'s `rmarkdown` package to render R-flavored Markdown language.

For example, the `Rmarkdown` action of the following example collects input files `A_10.md` and `A_20.md` and uses `R`'s `rmarkdown` package to convert them to `out.html`.

In [41]:
%sandbox

[A_10]
report: output="A_10.md"
    step_10

[A_20]
report: output="A_20.md"
    Itemize

    * item 1
    * item 2

[A_30]
Rmarkdown(input=['A_10.md', 'A_20.md'], output='out.html')

### Target `R_library`

The `R_library` target represents an R library. If the library is not available, SoS will try to install it from [CRAN](https://cran.r-project.org/), [bioconductor](https://www.bioconductor.org/), or [github](https://github.com/). A Github package name should be formatted as `repo/pkg`. A typical usage of this target would be

In [42]:
%sandbox
[10]
output: 'test.jpg'
depends: R_library('ggplot2')
R:
  library(ggplot2) 
  jpeg(${output!r})
  qplot(Sepal.Length, Petal.Length, data = iris, color = Species)
  dev.off()


`R_library` can also be used to check for specific versions of packages. For example:

```
R_library('edgeR', '3.12.0')
```
will result in a warning if the installed version of edgeR is not 3.12.0. You can specify multiple versions,

```
R_library('edgeR', ['3.12.0', '3.12.1'])
```

a certain version or newer,
```
R_library('edgeR', '3.12.0+')
```

or a certain version or older,
```
R_library('ggplot2', '1.0.0-')
```
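The version specifications accepted by `R_library` (an exact version, a list of versions, or a version followed by `+` or `-`) can be interpreted as in the following Python sketch. The helper names `parse_version` and `version_matches` are hypothetical, the sketch assumes purely numeric dot-separated versions, and it is not the comparison code used by SoS.

```python
def parse_version(text):
    """Convert a dotted version such as '3.12.0' to a tuple of integers."""
    return tuple(int(part) for part in text.split('.'))


def version_matches(installed, spec):
    """Return True if the installed version satisfies a specification
    (hypothetical helper mirroring the forms described above)."""
    specs = [spec] if isinstance(spec, str) else list(spec)
    for item in specs:
        if item.endswith('+'):
            # Certain version or newer
            if parse_version(installed) >= parse_version(item[:-1]):
                return True
        elif item.endswith('-'):
            # Certain version or older
            if parse_version(installed) <= parse_version(item[:-1]):
                return True
        elif installed == item:
            return True
    return False


print(version_matches('3.14.0', '3.12.0+'))
```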

The default R library repo is `http://cran.us.r-project.org`. It is possible to customize the repo from which an R library would be installed, for example:

```
R_library('Rmosek', repos = "http://download.mosek.com/R/7")
```