# Task Options

* **Difficulty level**: easy
* **Time needed to learn**: 10 minutes or less

  

## Task options

The following options can be passed to the `task:` keyword to specify how tasks are executed.

* The resource options such as `walltime` and `cores` are sent to individual task queues in the appropriate format. You do not have to specify all options because task queues may support only a subset of them, and some task queues provide default values (and some do not). It is however generally a good idea to specify them all so that your tasks can be executed on all types of task queues.

* The execution options such as `workdir`, `env`, `concurrent` specify environments in which tasks will be submitted and executed. 

### Option `walltime`

Estimated maximum running time of the task. This parameter is passed to the task queue, and it is up to the task queue to decide whether a task is killed if it does not complete within the specified `walltime`. `walltime` can be specified as a string in the format `HH:MM:SS`, where `HH`, `MM` and `SS` are hours, minutes, and seconds, or as an integer followed by a unit `s` (second), `m` (minute), `h` (hour), or `d` (day). The internal format of `walltime` (for example, when you use `walltime` in a `job_template`) is always `HH:MM:SS`. For example, you could use `walltime='240:00:00'` or `walltime='10d'` for a job that would run 10 days.
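The two input formats can be illustrated with a short Python sketch. The helper `normalize_walltime` is hypothetical (it is not part of SoS) and only mirrors the conversion described above, assuming the unit form is a whole number immediately followed by `s`, `m`, `h`, or `d`.

```python
import re

# Seconds per unit suffix, as listed in the text.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def normalize_walltime(value: str) -> str:
    """Convert '10d', '90m', or 'HH:MM:SS' into an 'HH:MM:SS' string.

    Hypothetical helper for illustration; not the SoS implementation.
    """
    value = value.strip()
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if match:
        total = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    else:
        parts = value.split(":")
        if len(parts) != 3 or not all(p.isdigit() for p in parts):
            raise ValueError(f"Unrecognized walltime: {value!r}")
        hours, minutes, seconds = (int(p) for p in parts)
        total = hours * 3600 + minutes * 60 + seconds
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(normalize_walltime("10d"))        # 240:00:00
print(normalize_walltime("240:00:00"))  # 240:00:00
```

Both spellings of a 10-day job reduce to the same `HH:MM:SS` string, which is the form a `job_template` would see.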

It is worth noting that, if some tasks fail because of insufficient `walltime` (or `cores` or `mem`), it is safe to change these options and re-run the jobs. Only the failed jobs will be restarted because completed or running jobs are not affected by changes to these options.

### Option `cores`

Number of cores on each computing node, which corresponds to the `ppn` option of a PBS system.

The PBS task queue also accepts a parameter `nodes` (corresponding to the PBS resource option `nodes`, defaulting to 1), but it is currently unused because SoS does not yet support multi-node tasks.

### Option `mem`

The total amount of memory needed across all nodes. Default units are bytes; the value can also be expressed in megabytes (`mem=4000MB`), gigabytes (`mem=4GB`), or gibibytes (`mem=4GiB`), although all inputs are converted to bytes internally. To use this option in a `job_template`, you generally need to use expressions such as `{mem//1e9}GB`.
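As a rough illustration of the unit handling, the following Python sketch converts memory strings to bytes. The function `mem_to_bytes` and its unit table are assumptions made for this example, not the SoS implementation; in particular, treating a bare `G` as a decimal gigabyte is a guess.

```python
import re

# Multipliers in bytes. MB/GB are decimal, GiB is binary, as in the text.
# The single-letter forms (K, M, G, T) are an assumption of this sketch.
_UNITS = {
    "": 1, "B": 1,
    "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40,
}


def mem_to_bytes(value) -> int:
    """Return the number of bytes for values such as 4000, '4GB' or '4GiB'."""
    if isinstance(value, int):
        return value  # plain integers are already in bytes
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*", str(value))
    if not match or match.group(2).upper() not in _UNITS:
        raise ValueError(f"Unrecognized memory size: {value!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2).upper()])


mem = mem_to_bytes("4GB")
print(mem)                        # 4000000000
print(mem_to_bytes("4GiB"))       # 4294967296
print(f"{mem // 10**9}GB")        # 4GB
```

In plain Python, `4000000000 // 1e9` evaluates to the float `4.0`, so an integer divisor such as `10**9` may be preferable when a scheduler expects a string like `4GB`.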

### Option `queue`

Option `queue` specifies a task queue to which the current task will be submitted. This option overrides the system default (command line option `-q`), so it is generally better to use the command line option `-q` so that the task can be submitted to different task queues, unless the task has to be executed on a particular server (e.g. one with software that is unavailable elsewhere).

### Option `to_host`

Option `to_host` specifies additional files or directories that would be synchronized to the remote host before tasks are executed. It can be specified as

* A single file or directory (with respect to local file system), or
* A list of files or directories, or
* A dictionary of `{local: remote}` file maps that specify how local files are synchronized to the remote host.

In the first two cases, the files or directories will be translated using the host-specific path maps. In the last case, the `remote` path (which should be relative to the remote file system) will be used as is, without translating the `local` path.

Note that:

1. If a symbolic link is specified in `to_host`, both the symbolic link and the path it refers to will be synchronized to the remote host.
2. If the task is executed on the local host (the remote host coincides with the local host), `to_host` is usually ignored unless it is specified in the third (dictionary) format, which copies files to another location before task execution.
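The three accepted forms can be reduced to a single `{local: remote}` mapping, as in the following Python sketch. The function `resolve_to_host` and the prefix-based `path_map` are hypothetical and only approximate the idea of host-specific path maps; the actual translation in SoS may differ.

```python
from typing import Dict, List, Union

ToHost = Union[str, List[str], Dict[str, str]]


def resolve_to_host(spec: ToHost, path_map: Dict[str, str]) -> Dict[str, str]:
    """Return a {local: remote} mapping for the three accepted forms.

    `path_map` maps local path prefixes to remote prefixes (an assumption
    of this sketch, standing in for the host-specific path maps).
    """
    def translate(local: str) -> str:
        # Apply the first matching prefix; leave the path unchanged otherwise.
        for local_prefix, remote_prefix in path_map.items():
            if local.startswith(local_prefix):
                return remote_prefix + local[len(local_prefix):]
        return local

    if isinstance(spec, str):
        return {spec: translate(spec)}       # single file or directory
    if isinstance(spec, dict):
        return dict(spec)                    # remote paths are used as given
    return {local: translate(local) for local in spec}  # list of paths


path_map = {"/Users/me": "/home/me"}
print(resolve_to_host("/Users/me/data", path_map))
print(resolve_to_host({"/Users/me/ref.fa": "/scratch/ref.fa"}, path_map))
```

Only the string and list forms go through `translate`; the dictionary form bypasses it, which matches the behavior described for explicit `remote` paths.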

### Option `from_host`

Option `from_host` specifies additional files or directories that would be synchronized from the remote host after tasks are executed. It can be specified as

* A single file or directory (with respect to local file system), or
* A list of files or directories, or
* A dictionary of `{local: remote}` file maps that specify how local files are synchronized from the remote host.

In the first two cases, the files or directories will be translated using the host-specific path maps to determine what remote files to retrieve. In the last case, the `remote` path (which should be relative to the remote file system) will be used without path translation. If the task is executed on the local host (the remote host coincides with the local host), this option is usually ignored unless it is specified in the third (dictionary) format, which copies files to another location after the task is executed.

### Option `map_vars`

In addition to `input` (`_input`), `output` (`_output`), and `depends` (`_depends`), which are defined implicitly by the `input:`, `output:` and `depends:` statements, you can specify additional variables that will be translated from the local to the remote host. This option accepts paths in the format of `str` or a sequence (`list`, `tuple`, `set` etc) of `str`, which will be mapped to variables of the same type (with paths replaced by remote paths on the remote host).

### Option `trunk_size`

Options `trunk_size` and `trunk_workers` are useful for dividing a large number of small tasks into several larger tasks so that they can be executed efficiently on a cluster system.

Option `trunk_size` groups concurrent tasks into trunks of the specified size. For example, if you need to run 10000 simulations that each last about 1 minute, you can group the tasks into `10` umbrella tasks, each running `1000` simulations.

```
[10]
import random
input: for_each={'seed': [random.randint(1, 10000000) for x in range(10000)]}
task: mem='1G', walltime='1m', cores=2, trunk_size=1000
sh: expand=True
    run_simulation --seed {seed} >> res_{seed}.res
```

The umbrella tasks have the following properties:
1. Tasks embedded by an umbrella task are executed normally in the sense that they have their own input, output, task ID, signatures etc, although only the umbrella tasks are visible to task engines.
2. Umbrella task IDs are prefixed by `M#_` where `#` is the number of embedded tasks.
3. Umbrella tasks adjust resource options such as `walltime` automatically so in the above example, each umbrella task will have `walltime='16:40:00'` (1000 minutes). 
4. Option `name` (job name on PBS systems) will be adjusted to `{name}_##` (e.g. `default_10_6000_1000` if default `name='{step_name}_{_index}'` is used) where `##` is the number of subtasks.
5. The entire umbrella will fail if any of the subtasks fails. However, since each subtask has its own signature, completed tasks will be ignored when you rerun the umbrella task (unless `-s force` is specified to forcefully re-execute all tasks).
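The grouping and the automatic `walltime` adjustment can be reproduced with a few lines of Python. The helpers `group_into_trunks` and `umbrella_walltime` are hypothetical names used only for this sketch, which assumes subtasks inside an umbrella task run one after another.

```python
def group_into_trunks(task_ids, trunk_size):
    """Split a list of task IDs into consecutive trunks of `trunk_size`."""
    return [
        task_ids[start:start + trunk_size]
        for start in range(0, len(task_ids), trunk_size)
    ]


def umbrella_walltime(minutes_per_task, n_tasks):
    """Total walltime, as HH:MM:SS, for tasks run sequentially."""
    hours, minutes = divmod(minutes_per_task * n_tasks, 60)
    return f"{hours:02d}:{minutes:02d}:00"


trunks = group_into_trunks(list(range(10000)), 1000)
print(len(trunks), len(trunks[0]))   # 10 1000
print(umbrella_walltime(1, 1000))    # 16:40:00
```

With 10000 one-minute tasks and `trunk_size=1000`, the sketch yields 10 trunks of 1000 tasks and a per-umbrella `walltime` of 1000 minutes.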

### Option `trunk_workers`

Option `trunk_workers` specifies the number of workers for umbrella tasks. If this option is specified, an umbrella task will be executed by a master process that dispatches embedded tasks to `trunk_workers` workers. Using the same simulation example with `trunk_workers=5`,

```
import random
input: for_each={'seed': [random.randint(1, 10000000) for x in range(10000)]}
task: mem='1G', walltime='1m', cores=2, trunk_size=1000, trunk_workers=5
sh: expand=True
    run_simulation --seed {seed} >> res_{seed}.res
```

* There would be `10000 / 1000 = 10` umbrella tasks each with `1000` (`trunk_size`) subtasks.
* Each umbrella task would use `2 * 5 + 1 = 11` cores where the extra core is used by the master process.
* Each umbrella task would use 5G of RAM (`5 * 1G`).
* Each umbrella task would have a `walltime` of `1000 / 5 * 1 = 200` minutes (`walltime='03:20:00'`).
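The arithmetic in this list can be checked with a small Python sketch. The function `umbrella_resources` is hypothetical, and rounding up when `trunk_size` is not a multiple of `trunk_workers` is an assumption of this example rather than documented SoS behavior.

```python
import math


def umbrella_resources(cores, mem_gb, walltime_min, trunk_size, trunk_workers):
    """Estimate per-umbrella resources when subtasks run on several workers."""
    # Each worker handles roughly trunk_size / trunk_workers subtasks in turn.
    total_minutes = math.ceil(trunk_size / trunk_workers) * walltime_min
    hours, minutes = divmod(total_minutes, 60)
    return {
        "cores": cores * trunk_workers + 1,   # one extra core for the master
        "mem_gb": mem_gb * trunk_workers,     # each worker needs its own memory
        "walltime": f"{hours:02d}:{minutes:02d}:00",
    }


print(umbrella_resources(cores=2, mem_gb=1, walltime_min=1,
                         trunk_size=1000, trunk_workers=5))
# {'cores': 11, 'mem_gb': 5, 'walltime': '03:20:00'}
```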


### Option `workdir`

Defaults to the current working directory.

Option `workdir` controls the working directory of the task. For example, the following step downloads a file to the `resource_dir` using command `wget`.

```python
[10]

task: workdir=resource_dir

run:
  wget a_url -O filename
```

Runtime option `workdir` will be translated to remote host if the task is executed remotely.

### Option `concurrent`

Defaults to `True`.

If the step process is repeated for multiple substeps (using input options `group_by` or `for_each`), all loop processes will by default be sent to the task engine to be executed in parallel (subject to `max_running_jobs` of individual task queue). If your tasks are sequential in nature (e.g. the next substep depends on the result of the current substep), you can set `concurrent=False`, in which case the next task will be generated and sent to the task queue only after the current one has been completed.

### Option `shared`

SoS tasks are executed externally and by default do not return any value. Similar to the `shared` step option (which passes step variables to the workflow), you can use the `shared` task option to pass task variables to the step in which the task is defined.

For example, the following script performs some simulations in 10 tasks and returns the result in variable `rng`, which is then shared to the workflow by step option `shared` so that it is available to the next step.

```sos
%run
[10 (simulate): shared=['rng', 'step_rng']]
input: for_each={'i': range(10)}
task: shared='rng'
print(f"{i}")
import random
rng = random.randint(1, 1000)

[20]
print(rng)
print(step_rng)
```

```
524
[129, 479, 825, 773, 453, 923, 459, 257, 676, 524]
```


It is important to note that option `shared` of the task passes variable `rng` to substeps of the step. The step level `shared='rng'` will only return `rng` of the last substep, and `shared='step_rng'` will return `rng` from all substeps as a list.

Also similar to step option `shared`, task option `shared` accepts a single variable (e.g. `rng`), a sequence of variables (e.g. `('rng', 'sum')`), a dictionary of variables derived from expressions (e.g. `{'result': 'float(open(output).read())'}`), or sequences of names and dictionaries. In the dictionary case, each value should be an expression (a string) that will be evaluated upon completion of the task and assigned to the specified variable.
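The accepted forms of `shared` can be pictured with the following Python sketch, which collects variables from a task's namespace. The function `collect_shared` is a hypothetical illustration, not the SoS implementation, and it uses `eval` purely to show how a dictionary value is treated as an expression.

```python
def collect_shared(shared, namespace):
    """Return {name: value} for a name, a dict of expressions, or a sequence."""
    if isinstance(shared, str):
        # A single variable name is looked up directly.
        return {shared: namespace[shared]}
    if isinstance(shared, dict):
        # Each value is an expression evaluated against the task's variables.
        # eval() is used for illustration only; avoid it on untrusted input.
        return {
            name: eval(expr, {}, dict(namespace))
            for name, expr in shared.items()
        }
    # Otherwise treat it as a sequence mixing names and dictionaries.
    result = {}
    for item in shared:
        result.update(collect_shared(item, namespace))
    return result


task_vars = {"rng": 524, "values": [1, 2, 3]}
print(collect_shared("rng", task_vars))
# {'rng': 524}
print(collect_shared(["rng", {"total": "sum(values)"}], task_vars))
# {'rng': 524, 'total': 6}
```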

### Option `env`

The `env` option allows you to modify the runtime environment, similar to the `env` parameter of the `subprocess.Popen` function. For example, you can execute your command with a specific directory added to `PATH` using

```sos
task:  env={'PATH': '/path/to/mycommand' + os.pathsep + os.environ['PATH']}
run:
   mycommand 
```

Option `env` is NOT translated to the remote host because it is of type dictionary. The job template is usually a good place to set a host-specific environment.

### Option `prepend_path`

Option `prepend_path` is a shortcut to option `env` to prepend one (a string) or more (a list of strings) paths to system path. For example, the above example can be shortened to

```sos
task:  prepend_path='/path/to/mycommand'
run:
   mycommand 
```

Option `prepend_path` is NOT translated to remote host because it is likely to be host specific.
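The relationship between `prepend_path` and `env` can be sketched in plain Python. The helper `prepend_to_path` is hypothetical and only shows how one or more directories might be joined in front of an existing `PATH` using `os.pathsep`, the platform's `PATH` separator (`:` on POSIX systems, `;` on Windows).

```python
import os


def prepend_to_path(paths, current_path=None):
    """Return a PATH string with `paths` (str or list of str) placed first."""
    if isinstance(paths, str):
        paths = [paths]
    if current_path is None:
        current_path = os.environ.get("PATH", "")
    # Skip empty entries so an empty PATH does not leave a trailing separator.
    entries = [p for p in [*paths, current_path] if p]
    return os.pathsep.join(entries)


# An `env`-style dictionary equivalent to prepending a single directory.
env = {"PATH": prepend_to_path("/path/to/mycommand", "/usr/bin")}
print(env["PATH"])
```

On a POSIX system this prints `/path/to/mycommand:/usr/bin`.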

### Option `active`

Option `active` specifies the active task within an input loop. It can be `True` (default), `False` (or a condition that returns `True` or `False`), or an index or a list of indexes of the substeps for which the task will be executed. Negative indexes are acceptable (e.g. with `active=-1`, the task is executed only for the last input loop).

For example, `task` in the following example is not executed because `a.txt` already exists.

```sos
%preview -n a.txt
!echo "hello" > a.txt

task: active=not path('a.txt').exists()
sh:
    echo "Task executed"
```

```
hello
```
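The boolean and index forms of `active` can be modeled in Python as follows. The function `is_active` is a hypothetical sketch of the rule described in this section, assuming negative indexes count from the end of the input loop as they do for Python sequences.

```python
def is_active(active, index, n_substeps):
    """Decide whether the task for substep `index` should run."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(active, bool):
        return active
    if isinstance(active, int):
        active = [active]
    # Map negative indexes onto their non-negative equivalents.
    selected = {i if i >= 0 else n_substeps + i for i in active}
    return index in selected


print([is_active(-1, i, 3) for i in range(3)])       # [False, False, True]
print([is_active([0, 2], i, 3) for i in range(3)])   # [True, False, True]
```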


### Option `tags`

By default, a task is tagged by step name and workflow ID so that tasks from the same step and/or workflow can be targeted with option `--tags` for commands such as `sos status`, `sos kill`, and `sos purge`.

You can specify additional tags for tasks using option `tags`. A tag can contain alphabetic and numeric characters, dash (`-`), underscore (`_`), and dot (`.`). Tags with other characters will still be accepted but with non-conforming characters removed.

For example, if each task handles a sample with an ID, you can use

```
input: for_each={'ID': IDs}
task: tags=ID
```
to add sample IDs to task tags, or use 

```
input: input_files, paired_with=['IDs', 'barcode']
task: tags=[_ID, _barcode, 'my_cmd']
sh:
    my_cmd ...
```
to tag each task with its sample ID, barcode, and command name.
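The character rule for tags can be expressed as a one-line regular expression. The function `sanitize_tag` below is a hypothetical sketch of that rule, assuming "alphabetic" means ASCII letters; it is not taken from the SoS source.

```python
import re


def sanitize_tag(tag) -> str:
    """Drop every character that is not a letter, digit, '-', '_' or '.'."""
    return re.sub(r"[^A-Za-z0-9_.\-]", "", str(tag))


print(sanitize_tag("sample#1/A.b-c"))  # sample1A.b-c
print(sanitize_tag("my_cmd"))          # my_cmd
```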
