# Customized or remote execution of workflows

* **Difficulty level**: intermediate
* **Time needed to learn**: 30 minutes or less
* **Key points**:
  * Option `-r host` executes workflow on `host`, optionally through a `workflow_template` specified through host configuration.
  * The remote host could be a regular server, or a cluster system, in which case the workflow could be executed using multiple computing nodes.

Option `-r host` executes workflow on `host`. Depending on the properties of `host`, this option allows you to

1. Execute workflows locally, but in a customized environment
2. Execute workflows on a remote server directly
3. Execute workflows on a remote cluster system

Please refer to [host configuration](host_setup.html) for details on how to define hosts.

## Customized environment for workflow execution

On my system there are two R installations: a system installation under `/usr/local/bin` and a local installation in a conda environment. The latter is the default, but if for some reason you want to use the system R (e.g. if a library is only available there), you can change the `PATH` of the `R` action using the `env` option (see [SoS actions](sos_actions.html) for details).

In [1]:
R:
    R.Version()$version.string

[1] "R version 3.6.1 (2019-07-05)"


In [2]:
import os

R: env={'PATH': f"/usr/local/bin:{os.environ['PATH']}"}
    R.Version()$version.string

[1] "R version 3.5.2 (2018-12-20)"


This action-level `env` configuration is very flexible (e.g. you can use different versions of R in the same workflow) but can be difficult to maintain if you have multiple `R` actions. If you intend to use the same version of R throughout the workflow, it is easier to execute the entire workflow in a customized environment.
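The effect of prepending a directory to `PATH` can be checked outside SoS. The following is a minimal Python sketch, not part of SoS: the two temporary directories and the `mytool` executable are hypothetical stand-ins for the conda and system R installations, and a POSIX system is assumed.

```python
import os
import shutil
import stat
import tempfile


def make_fake_tool(directory, name="mytool"):
    """Create a tiny executable called `name` in `directory` (stand-in for an R binary)."""
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def resolve_both():
    """Return whether the default and the customized PATH resolve to the expected directory."""
    with tempfile.TemporaryDirectory() as conda_bin, \
         tempfile.TemporaryDirectory() as system_bin:
        make_fake_tool(conda_bin)
        make_fake_tool(system_bin)
        # Default search path: the (hypothetical) conda directory comes first.
        default_path = os.pathsep.join([conda_bin, system_bin])
        # Customized search path: the system directory is prepended, as in the env option.
        custom_path = os.pathsep.join([system_bin, default_path])
        found_default = shutil.which("mytool", path=default_path)
        found_custom = shutil.which("mytool", path=custom_path)
        return (os.path.dirname(found_default) == conda_bin,
                os.path.dirname(found_custom) == system_bin)


print(resolve_both())
```

Both values should be `True`: the first matching directory in `PATH` wins, which is why prepending `/usr/local/bin` selects the system R.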

To achieve this, you can define a host as follows:

In [3]:
%save myconfig.yml -f
hosts:
  system_R:
    workflow_template: |
      export PATH=/usr/local/bin:$PATH
      {command}            

Then, we will be using the conda version of R by default

In [4]:
%run -v1
R:
    R.Version()$version.string

[1] "R version 3.6.1 (2019-07-05)"


and be using the system R if we execute the workflow in the `system_R` host, despite the fact that `system_R` is just a localhost with a template

In [5]:
%run -r system_R -c myconfig.yml -v1
R:
    R.Version()$version.string

[1] "R version 3.5.2 (2018-12-20)"


As you can imagine, the template can set up a variety of environments, for example by activating a conda environment, enabling debugging settings, or calling `module load` on a cluster system.
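To make the role of `{command}` concrete, here is a minimal Python sketch of how a template like the one for `system_R` could be expanded into a shell script. It assumes plain `str.format`-style substitution; the function name `expand_template` is hypothetical and this is not necessarily how SoS performs the expansion internally.

```python
# Template text mirroring the `system_R` host defined above.
WORKFLOW_TEMPLATE = """\
export PATH=/usr/local/bin:$PATH
{command}
"""


def expand_template(template, **variables):
    """Fill `{name}` placeholders; shell variables such as $PATH are left untouched."""
    return template.format(**variables)


script = expand_template(WORKFLOW_TEMPLATE, command="sos run script workflow")
print(script)
```

With this kind of substitution, a literal brace in the template (for example `${HOME}`) would have to be doubled, so plain `$VAR` references are the safer choice in this sketch.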

## Execution of workflow on a remote host

<p align="center" height="500">
  <img src="https://vatlab.github.io/sos-docs/doc/media/remote_1_workflow.jpeg">
</p>

If the `host` is a real remote host, then

```bash
sos run script workflow -r host [other options]
```
would execute the entire workflow on the `host` through some command similar to the following
```bash
ssh host "bash --login -c 'sos run script workflow [other options]'"
```
after the script is copied to the remote host.
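The quoting in such a command matters because `bash -c` takes the whole command as a single argument. A minimal Python sketch of how a comparable command line could be assembled is shown below; the helper `remote_command` is hypothetical and only illustrates the quoting, not the command SoS actually issues.

```python
import shlex


def remote_command(host, sos_args):
    """Build an ssh argument list that runs `sos run ...` in a login shell on `host`."""
    # Join the sos arguments into one shell-safe string.
    sos_cmd = shlex.join(["sos", "run", *sos_args])
    # bash -c expects the full command as ONE argument, so quote it again.
    remote = f"bash --login -c {shlex.quote(sos_cmd)}"
    return ["ssh", host, remote]


cmd = remote_command("host", ["script", "workflow", "-v1"])
print(cmd)
```

The resulting list can be passed to `subprocess.run` without `shell=True`, so only the remote shell interprets the quoted command.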

This option is useful if you would like to **write the entire workflow for a remote host and execute the workflow with all input, software, and output files on the remote host**. Typical use cases for this option are when the data is too large to be processed locally, or when the software is only available on the remote host.

For example, with a host definition similar to

```
hosts:
  bcb:
    address:  myserver.mdanderson.edu
    paths:
      home: /Users/bpeng1/scratch
```

the following cell executes the workflow on `bcb`

In [6]:
%run -r bcb
R:
  set.seed(1)
  x <- 1:100
  y <- 0.029*x + rnorm(100)
  png("test.png", height=400, width=600)
  plot(x, y, pch=19, col=rgb(0.5, 0.5, 0.5, 0.5), cex=1.5)
  abline(lm(y ~ x))
  dev.off()


ERROR: Failed to connect to bcb:
The ECDSA host key for bcbm-bpeng.mdanderson.edu has changed,
and the key for the corresponding IP address 23.217.138.110
is unknown. This could either mean that
DNS SPOOFING is happening or the IP address for the host
and its host key have changed at the same time.
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:6MJJtqKhTdHXF2yzH/0UqGN2o4RZ2PDEp2ttdA/IJR8.
Please contact your system administrator.
Add correct host key in /Users/bpeng1/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/bpeng1/.ssh/known_hosts:23
ECDSA host key for bcbm-bpeng.mdanderson.edu has changed and you have requested strict checking.
Host key verification failed.


The resulting `test.png` is generated on `bcb` and is unavailable for local preview. You can, however, preview the file with the `-r` option of magic `%preview`:

In [7]:
%preview -n test.png -r bcb

Failed to preview ['test.png'] on remote host bcb


In this case we do not define any `workflow_template` for `bcb` so the workflow is executed directly on `bcb`. If a `workflow_template` is defined, the workflow will be executed through the shell script that is expanded from the template.

## Executing workflows on multiple workers

<p align="center" height="500">
  <img src="https://vatlab.github.io/sos-docs/doc/media/remote_4_workers.jpeg">
</p>

SoS uses multiple worker processes to execute steps, substeps, and subworkflows. By default, SoS creates `n/2` workers on a local computer with `n` CPU cores, although it limits the default number of workers to 8 when there are more than 16 cores because those computers are most likely shared by multiple users.
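The default described above can be written as a small function. This is a sketch of the stated rule only, not SoS source code; the handling of odd core counts (rounding down) and of single-core machines (at least one worker) are assumptions.

```python
import os


def default_workers(n_cores):
    """Half of the cores, capped at 8 on machines with more than 16 cores."""
    if n_cores > 16:
        # Large machines are assumed to be shared, so the default is capped.
        return 8
    # Assumption: round down, but never go below one worker.
    return max(1, n_cores // 2)


print(default_workers(os.cpu_count() or 1))
```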

The number of workers used by SoS can be controlled by option `-j`, where `-j 4` creates 4 workers so that you will see 5 sos processes (1 master and 4 workers) when you execute a workflow with the command

```
sos run script -j 4
```

It is possible to start workers on multiple remote machines by specifying the name of the machine and number of processes on each of them with an extended version of option `-j`. For example

```
sos run script -j 4 node1:4 node2:4
```

will create 12 workers: 4 on the localhost where the master SoS process resides, 4 on `node1`, and 4 on `node2`. Here `node1` and `node2` can be names or IP addresses of machines, or aliases defined in SoS configuration files. A limitation of starting workers on remote servers is that the remote servers must share the same file systems as the local host, so this approach only works for workstations with auto-mounted home directories or for computing nodes of cluster systems.

<div class="bs-callout bs-callout-alert" role="alert">
  <h4>Using workers on remote hosts</h4>
    <p>SoS assumes that <b>all local and remote hosts share the same file systems</b> and propagates environment variables such as <code>$PATH</code> to all remote hosts. This ensures that all worker processes have identical running environments. </p>
    <p>You should create and execute external tasks if you would like to execute code on remote computing environments that do not share file systems with the local host.</p>
</div>
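The extended `-j` syntax can be read as a list of `host:count` pairs, where a bare number refers to the local host. The following Python sketch parses such values and totals the workers; `parse_workers` is a hypothetical helper, and it assumes every value carries an explicit count.

```python
def parse_workers(args):
    """Parse values of an extended -j option, e.g. ["4", "node1:4", "node2:4"]."""
    workers = {}
    for arg in args:
        # Split on the LAST colon so that host names or IP addresses stay intact.
        host, sep, count = arg.rpartition(":")
        if not sep:
            # A bare number means workers on the local host.
            host = "localhost"
        workers[host] = workers.get(host, 0) + int(count)
    return workers


spec = parse_workers(["4", "node1:4", "node2:4"])
print(spec, sum(spec.values()))
```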

For example, the following starts 4 local workers and 4 workers on a remote host `macpro`.

In [8]:
%run -j 4 macpro:4

input: for_each=dict(i=range(10))

import time
time.sleep(i*2)

## Executing workflows on cluster systems

Having a number of servers that share the same file system is a scenario that appears most frequently on cluster systems, where computing nodes share the same network file system. If you execute a workflow on a supported cluster system, SoS will be able to obtain the number of nodes and the number of processes allocated on each node, and set `-j` automatically.
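As an illustration of how such an allocation can be discovered, PBS-style schedulers typically expose a node file (its path is in `$PBS_NODEFILE`) that lists one hostname per allocated process slot. The Python sketch below turns such lines into `host:count` values; it is an assumption about one possible approach, and `workers_from_nodefile` is a hypothetical helper rather than a SoS function.

```python
from collections import Counter


def workers_from_nodefile(lines):
    """Count allocated process slots per node from node-file lines (one hostname per slot)."""
    counts = Counter(line.strip() for line in lines if line.strip())
    # Counter keeps first-seen order, so nodes appear in the order they were listed.
    return [f"{host}:{n}" for host, n in counts.items()]


# Hypothetical node-file content for 2 nodes with 2 slots each.
example = ["n1\n", "n1\n", "n2\n", "n2\n"]
print(workers_from_nodefile(example))
```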

So to execute a workflow on the cluster, all you need to do is wrap the command
```
sos run script [options]
```
(without option `-j`) in a shell script that specifies the resources used, and submit it to the cluster system. This part can, again, be achieved with proper host configuration with a `workflow_template`.

<p align="center">
  <img src="https://vatlab.github.io/sos-docs/doc/media/cluster_execution.jpg">
</p>

So basically you need a host that

1. uses `pbs` as `queue_type`,
2. defines commands to submit, query, and kill jobs, and
3. defines a `workflow_template` that will be expanded to a shell script to be executed on the cluster

The host can be a local host (if you are submitting jobs on the headnode of a cluster system) or a remote host (if you are submitting jobs remotely). Its definition should be similar to

```yaml
hosts:
  htc:
    address: htc_cluster.mdanderson.edu
    description: HTC cluster (PBS)
    queue_type: pbs
    status_check_interval: 60
    submit_cmd: qsub {job_file}
    status_cmd: qstat {job_id}
    kill_cmd: qdel {job_id}
    nodes: 2
    cores: 4
    walltime: 01:00:00
    mem: 4G
    workflow_template: |
      #!/bin/bash
      #PBS -N {job_name}
      #PBS -l nodes={nodes}:ppn={cores}
      #PBS -l walltime={walltime}
      #PBS -l mem={mem}
      #PBS -m n
      module load R
      {command}
```

With this definition, you can submit your workflow to it with option `-r htc` as follows:

In [9]:
%run -r htc walltime=00:10:00 nodes=4
[10]
input: for_each=dict(i=range(100))
R: expand=True
  Sys.sleep(100+{i})


ERROR: Failed to connect to htc:
The ECDSA host key for q1prphtch00.mdanderson.edu has changed,
and the key for the corresponding IP address 23.217.138.110
is unknown. This could either mean that
DNS SPOOFING is happening or the IP address for the host
and its host key have changed at the same time.
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:6MJJtqKhTdHXF2yzH/0UqGN2o4RZ2PDEp2ttdA/IJR8.
Please contact your system administrator.
Add correct host key in /Users/bpeng1/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/bpeng1/.ssh/known_hosts:24
ECDSA host key for q1prphtch00.mdanderson.edu has changed and you have requested strict checking.
Host key verification failed.


The workflow is submitted to the cluster with the script

```sh
#!/bin/bash
#PBS -N e0e212c1577d5990
#PBS -l nodes=4:ppn=4
#PBS -l walltime=00:10:00
#PBS -l mem=4G
#PBS -m n
module load R
sos run /home/bpeng1/sos/sos-docs/src/user_guide/.tmp_script_mjno7xgq.sos
```
saved under `~/.sos/workflows/e0e212c1577d5990.sh`.

Here we note that

1. The template arguments are specified from the command line (`walltime=00:10:00 nodes=4`).
2. The host definition provides default values for template variables, which are used if they are not specified from the command line.
3. Unlike the execution of remote tasks, SoS currently does not provide any means to check the status of jobs. You will have to do that manually with the job ID returned from SoS.
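The interplay between host defaults and command line arguments noted above can be sketched in Python as follows. This is an illustration under assumptions, not SoS code: `render_job_script` is a hypothetical helper, `str.format`-style substitution is assumed, and `script.sos` is a placeholder file name.

```python
# Default values taken from the `htc` host definition above.
HOST_DEFAULTS = {"nodes": 2, "cores": 4, "walltime": "01:00:00", "mem": "4G"}

TEMPLATE = """\
#!/bin/bash
#PBS -N {job_name}
#PBS -l nodes={nodes}:ppn={cores}
#PBS -l walltime={walltime}
#PBS -l mem={mem}
#PBS -m n
module load R
{command}
"""


def render_job_script(template, defaults, cli_args, **fixed):
    """Merge host defaults with key=value overrides, then fill the template."""
    overrides = dict(arg.split("=", 1) for arg in cli_args)
    # Later dictionaries win: command line values replace host defaults.
    variables = {**defaults, **overrides, **fixed}
    return template.format(**variables)


job_script = render_job_script(
    TEMPLATE,
    HOST_DEFAULTS,
    ["walltime=00:10:00", "nodes=4"],
    job_name="e0e212c1577d5990",
    command="sos run script.sos",  # placeholder script name
)
print(job_script)
```

Here `nodes` and `walltime` come from the command line while `cores` and `mem` fall back to the host defaults, matching the `nodes=4:ppn=4` and `mem=4G` lines in the submitted script.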

## Further reading

* [Host configuration](host_setup.html)