# Execution of tasks on remote hosts

SoS allows you to execute tasks on remote hosts or task queues with or without their own file systems. For example, you can execute a complex workflow mostly locally, but execute a few tasks on a remote host if it provides more computing power, or if it has some software that cannot be installed locally. The remote host could have its own file system (separate systems), share its file system with the local machine (e.g. nodes on the cluster), or share some storage with the local machine (e.g. have the same shared storage), so file synchronization will be needed in some cases.

With help from a few runtime options (options to `task`), SoS can

* Copy specified local files to the remote host, possibly to different directories
* Start a SoS task on the remote machine or submit the task to a task queue (e.g. PBS or Celery) and wait for the completion of the task
* Copy results back from the remote host if the execution is successful

## System setup

### Public-key access

Following any online tutorial, set up public-key access from your local machine to the remote host. If your public key does not work, check the file permissions of `~/.ssh`, of the files under `~/.ssh`, and in some cases of `$HOME`. After setting up the server, make sure you can log in without a password using the command

```bash
% ssh remote-host
```

### Software installation

You will need to install the latest version of SoS (preferably the same version on the local and remote hosts) and the software you need to run. Test the installation by logging in to the remote machine and running

```bash
% sos -h
```

### Check `$PATH`

Commands that are available in a login shell are not necessarily available during remote execution, because remote execution through `ssh` invokes a non-interactive, non-login shell with a basic `$PATH`. SoS tries to address this problem by executing commands through a login shell

```bash
% ssh remote-host "bash --login -c 'sos execute task_id'"
```

However, the default `.bashrc` on the remote server might contain a line like

```bash
[ -z "$PS1" ] && return
```

that makes `.bashrc` return early when `bash` is not running interactively. This line has to be removed in order to have a complete `$PATH` during remote execution.

Now run the command

```bash
% ssh remote-host "bash --login -c 'sos -h'"
```

from your local machine and see if `sos` can be invoked. Similarly, verify that the commands you would like to execute remotely can be executed in this manner.
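The same check can be scripted when several commands need to be verified. The following is a minimal Python sketch and not part of SoS: the helper names `login_shell_command` and `remote_command_available` are made up for illustration, and the sketch assumes that `ssh` is on the local `$PATH` and that public-key access already works.

```python
import shlex
import subprocess


def login_shell_command(host, command):
    """Build the argv that runs `command` on `host` through a bash login shell."""
    remote = "bash --login -c " + shlex.quote(command)
    return ["ssh", host, remote]


def remote_command_available(host, command, timeout=30):
    """Return True if `command` exits with status 0 on `host` (hypothetical helper)."""
    try:
        result = subprocess.run(
            login_shell_command(host, command),
            capture_output=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is missing locally, or the connection hung
        return False
    return result.returncode == 0
```

For example, `remote_command_available('remote-host', 'sos -h')` should return `True` once the remote `$PATH` is set up correctly.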

### Configure local and remote hosts

Now you need to configure your local and remote hosts so that SoS knows how to communicate between them. The host configurations should be defined in `~/.sos/hosts.yml` and should look similar to

```
hosts:
    desktop:
        paths:
            home: /Users/myuser
    monster:
        address: dcdrlue8ee.yourdomain.com
        paths:
            home: /home/myuser
```

The format is easy enough to edit directly, but you can also use commands such as

```
% sos config --hosts --set hosts.monster.address dcdrlue8ee.yourdomain.com
```

to add or change key `hosts['monster']['address']` to `dcdrlue8ee.yourdomain.com`.

### Configure `address`

You should specify the address of the remote host in the `hosts` section, similar to

```
hosts:
  monster:
    address: dcdrlue8ee.yourdomain.com
```

If your account name differs between the local and remote servers, the complete address should be `username@address`. In this example, the address would be `john@dcdrlue8ee.yourdomain.com` if the remote server account is `john`.

You can also specify `address` for your localhost if you plan to remotely login to the localhost.

### Configure `paths`

`paths` is a list of directories that will be translated between hosts. For example, if you work locally on a Mac machine with home directory `/Users/myuser`, and the remote server is a Linux machine with home directory `/home/myuser`, you should define a `home` entry under `paths` for both hosts as follows:

```
hosts:
    desktop:
        paths:
            home: /Users/myuser
    monster:
        address: dcdrlue8ee.yourdomain.com
        paths:
            home: /home/myuser
```

In this way, if the local data is `/Users/myuser/projects/input.fastq`, the path will be translated to `/home/myuser/projects/input.fastq` during remote execution.

In more complicated cases where there are different directories, more than one entry can be specified under `paths`. For example, if you have directories under different volumes, you can map them differently using

```
hosts:
    desktop:
        paths:
            home: /Users/myuser
            project: /Users/myuser/projects
            resource: /Volumes/Resource
    monster:
        address: dcdrlue8ee.yourdomain.com
        paths:
            home: /home/myuser
            project: /home/myuser/scratch/projects
            resource: /home/myuser/resource
```

Note that

1. You can define multiple `paths` such as `home`, `scratch`, `working`, `resource`, but **paths should be defined for all hosts**.
2. All `paths` should be absolute (start with `/` on Linux-like systems).
3. SoS expands local directories to absolute paths before matching them against `paths`.
4. If there are multiple matches, SoS chooses the longest matching path. For example, path `/Users/myuser/projects/input.txt` would be identified as `project` (not `home`) and be mapped to `/home/myuser/scratch/projects/input.txt`.
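The matching rules above can be illustrated with a small Python sketch. This is not the actual SoS implementation: the function name `map_path` and the `HOSTS` dictionary (a hand-written copy of the `paths` shown above) are assumptions made for illustration, and leaving unmatched paths unchanged is also an assumption.

```python
# Hand-written copy of the `paths` entries above (illustrative only)
HOSTS = {
    "desktop": {
        "paths": {
            "home": "/Users/myuser",
            "project": "/Users/myuser/projects",
            "resource": "/Volumes/Resource",
        }
    },
    "monster": {
        "paths": {
            "home": "/home/myuser",
            "project": "/home/myuser/scratch/projects",
            "resource": "/home/myuser/resource",
        }
    },
}


def map_path(path, local, remote, hosts=HOSTS):
    """Translate an absolute local path using the longest matching `paths` entry."""
    best = None  # (name, prefix) of the longest match found so far
    for name, prefix in hosts[local]["paths"].items():
        prefix = prefix.rstrip("/")
        # Match whole path components only, so /Users/myuser2 does not match /Users/myuser
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[1]):
                best = (name, prefix)
    if best is None:
        return path  # assumption: paths without a match are left unchanged
    name, prefix = best
    return hosts[remote]["paths"][name].rstrip("/") + path[len(prefix):]
```

With this sketch, `map_path('/Users/myuser/projects/input.txt', 'desktop', 'monster')` picks `project` over `home` because `/Users/myuser/projects` is the longer matching prefix.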

### Configure `shared`

Option `shared` tells SoS which file systems are shared between local and remote hosts so that it does not have to synchronize files under these directories between the hosts.

* SoS assumes independent file systems, so you do not have to specify option `shared` if the local and remote hosts do not share any file system.
 
* If your local and remote hosts share all file systems, you should list `/` as shared.

    ```
    hosts:
        desktop:
            shared:
                ALL: /
        monster:
            shared:
                ALL: /
    ```
    The name `ALL` does not matter as long as they match between hosts.

  
* If your local and remote hosts share one or more volumes, you can specify them with

    ```    
    hosts:
        desktop:
            shared:
                project: /project
        monster:
            shared:
                project: /scratch/project
    ```
  
  to indicate that local files under `/project` are shared with `monster`, where they appear under `/scratch/project`.

Items under `shared` are treated as special `paths`. Files under these directories are mapped, but not synchronized.
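As a rough illustration of "mapped, but not synchronized", the following Python sketch decides whether a local file would need to be copied. It is not taken from SoS: the function names `under` and `needs_sync` and the `HOSTS` dictionary are hypothetical and only mirror the example configuration above.

```python
# Illustrative configuration: /project is shared, the home directory is not
HOSTS = {
    "desktop": {
        "paths": {"home": "/Users/myuser"},
        "shared": {"project": "/project"},
    },
    "monster": {
        "paths": {"home": "/home/myuser"},
        "shared": {"project": "/scratch/project"},
    },
}


def under(path, directory):
    """Return True if `path` is `directory` itself or lies inside it."""
    directory = directory.rstrip("/")
    return path == directory or path.startswith(directory + "/")


def needs_sync(path, local, hosts=HOSTS):
    """A file must be copied unless it lives under a local `shared` directory."""
    shared_dirs = hosts[local].get("shared", {}).values()
    return not any(under(path, d) for d in shared_dirs)
```

In this sketch a file such as `/project/data/a.txt` is only mapped, whereas `/Users/myuser/a.txt` would have to be transferred.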

Note that it is a bad idea to use Dropbox or Google Drive folders as shared drives: files under these directories are synchronized rather than actually shared, so a file created locally will not be available instantly on the remote host.

### Specify `localhost`

After you configure both local and remote hosts, you will need to tell SoS which entry in the `hosts` list is your `localhost`, using the command

```
% sos config --global --set localhost desktop
```

which actually writes `localhost: desktop` in the system configuration file. 

If you have defined multiple hosts in the `hosts.yml` file, you should distribute this file to all hosts and set `localhost` accordingly, so that all machines know how to communicate with each other.

### Sample configurations

The server settings are critically important for the successful execution of commands on remote servers. As an example, I am working on a Mac mini (with limited CPU/RAM) and have access to a Mac Pro workstation and a Linux server. The host configurations for these machines are

```
hosts:
  mini:
    paths:
      home: /Users/bpeng1
      resource: /Users/bpeng1/.sos/resource
  macpro:
    address: mp-bpeng.mdanderson.edu
    paths:
      home: /Users/bpeng1
      resource: /Volumes/HOME/resource
  linux:
    address: dcdrlpmcfd.mdanderson.edu
    paths:
      home: /home/bpeng1
      resource: /home/bpeng1/.sos/resource
```

I defined two `paths` named `home` and `resource` because, although `resource` is at `~/.sos/resource` on `mini` and `linux`, it is on a dedicated volume (`/Volumes/HOME/`) on `macpro`.

With this `hosts.yml` and proper definition of `localhost` on each machine, it is possible to submit jobs from `mini` to `macpro` and `linux`, from `macpro` to `linux` and from `linux` to `macpro`. It is not possible to submit jobs remotely to `mini` because no `address` is defined for this host.
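The rule that a host can only receive jobs if it defines an `address` can be checked mechanically. The Python sketch below is illustrative only: `HOSTS` is a hand-written copy of the sample configuration and `submission_targets` is a hypothetical helper, not an SoS function.

```python
# Hand-written copy of the sample hosts.yml above (illustrative only)
HOSTS = {
    "mini": {
        "paths": {"home": "/Users/bpeng1", "resource": "/Users/bpeng1/.sos/resource"},
    },
    "macpro": {
        "address": "mp-bpeng.mdanderson.edu",
        "paths": {"home": "/Users/bpeng1", "resource": "/Volumes/HOME/resource"},
    },
    "linux": {
        "address": "dcdrlpmcfd.mdanderson.edu",
        "paths": {"home": "/home/bpeng1", "resource": "/home/bpeng1/.sos/resource"},
    },
}


def submission_targets(localhost, hosts=HOSTS):
    """List the hosts that `localhost` can submit jobs to (those with an `address`)."""
    return sorted(
        name
        for name, config in hosts.items()
        if name != localhost and "address" in config
    )
```

Calling `submission_targets('mini')` lists both `linux` and `macpro`, while `mini` itself never appears as a target because it has no `address`.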

## Running tasks remotely

### Option `queue`

If you have set up a host in the SoS configuration file properly, you can use option `queue` to specify the host on which the task will be executed.

```
task:   queue='monster'
```

Here `monster` is the alias of the host.

### Command line option `-q`

Command line option `-q` specifies a default task queue (or remote host) to which all tasks would be submitted. This option does not override `queue` options defined in the script, so you can submit most jobs to a queue while submitting some jobs to particular servers (e.g. the one with specific software) using task options.

### Options `to_host` and `from_host`

Now that your machines are configured, try copying some files to check that the configuration works. File copy is specified with options `to_host` and `from_host`, which accept a single file or directory name (a string) or a list (or nested lists) of filenames relative to the local file system. You can test these options using a simple SoS step such as the following (replace the filenames with files you have, of course):

```sos
[1]
task: queue='monster',
    to_host=['~/projects/data/test1', '/Volumes/Resources/hg19.fasta'],
    from_host='~/projects/data/test1.res'
run:
    echo "Hello, World"
```

**Input, dependent, and output files are automatically transferred** so `to_host` and `from_host` are only needed to transfer files or directories in addition to step input and outputs. If there is any problem with file transfer, use option `-v3` to check if filenames are mapped correctly.

### Variable translation

Each task has a **context dictionary** that contains variables that will be used to, for example, compose scripts to be executed. Even if the task will be executed remotely, you should write your task using local paths, and **define variables for all paths that would differ between local and remote hosts**. For example, you might have a script that generates a `STAR` index from a fasta file. You can have all these files available locally and write the task as:

```
depends:      hg19_fasta
run:
    STAR \
        --runThreadN 8 \
        --runMode genomeGenerate \
        --genomeDir ${hg19_star_index} \
        --genomeFastaFiles ${hg19_fasta} \
        --sjdbGTFfile ${hg19_genes_gtf} \
        --sjdbOverhang 100
```

where  `hg19_fasta`, `hg19_genes_gtf` and `hg19_star_index` are variables pointing to input and output files of this process.

**SoS will by default translate all variables (of type string and list of strings) as if they are local paths**. In this case, all three variables will be translated to remote paths during remote execution. You can view details of variable translation using option `-v3` (debug output).
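The effect of this default can be pictured with the following Python sketch. It is a simplified illustration rather than SoS code: it assumes a single `home` mapping from `/Users/myuser` to `/home/myuser`, and the names `translate_path` and `translate_context`, as well as the example file locations, are hypothetical.

```python
# Assumed single mapping between local and remote home directories
LOCAL_HOME = "/Users/myuser"
REMOTE_HOME = "/home/myuser"


def translate_path(value):
    """Rewrite a string that starts with the local home directory."""
    if value == LOCAL_HOME or value.startswith(LOCAL_HOME + "/"):
        return REMOTE_HOME + value[len(LOCAL_HOME):]
    return value


def translate_context(context):
    """Translate strings and lists of strings; leave every other type alone."""
    translated = {}
    for name, value in context.items():
        if isinstance(value, str):
            translated[name] = translate_path(value)
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            translated[name] = [translate_path(v) for v in value]
        else:
            translated[name] = value
    return translated


# Example context with made-up file locations
context = {
    "hg19_fasta": "/Users/myuser/resource/hg19.fasta",
    "hg19_star_index": "/Users/myuser/resource/STAR_index",
    "fastq_files": ["/Users/myuser/data/a.fastq", "/Users/myuser/data/b.fastq"],
    "threads": 8,
}
remote_context = translate_context(context)
```

Here every string and list of strings is rewritten to point under `/home/myuser`, while the integer `threads` is passed through untouched.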

### Option `preserved_vars`

Automatic variable translation is convenient, but SoS can make mistakes because it does not really know which variables contain path names that need to be converted. For example, if you do not have `hg19_fasta` locally and use variable `hg19_fasta` to point to a fasta file on the remote host, you can add this variable to option `preserved_vars` so that its value will not be mapped during context switch:

```sos
task:     queue='monster', preserved_vars='hg19_fasta'
run:
    STAR \
        --runThreadN 8 \
        --runMode genomeGenerate \
        --genomeDir ${hg19_star_index} \
        --genomeFastaFiles ${hg19_fasta} \
        --sjdbGTFfile ${hg19_genes_gtf} \
        --sjdbOverhang 100
```

Other variables that need to be preserved include sample names, command line options etc.

Note that you can write tasks for remote hosts (e.g. use hard-coded paths or preserve related variables), but that will make your task host-dependent. It is recommended that you **write your script in local paths** and let SoS do the conversion, so that you do not have to change the script itself if you would like to execute the task locally or switch to hosts with different configurations.
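To make the role of `preserved_vars` concrete, here is a hedged Python sketch of a translation step that skips preserved names. It does not reproduce SoS internals: the mapping from `/data` to `/scratch/data`, the file locations, and the function names `translate` and `translate_context` are all hypothetical.

```python
# Assumed mapping: local /data corresponds to /scratch/data on the remote host
LOCAL_PREFIX = "/data"
REMOTE_PREFIX = "/scratch/data"


def translate(value):
    """Rewrite strings that start with the local prefix; pass other values through."""
    if isinstance(value, str) and (
        value == LOCAL_PREFIX or value.startswith(LOCAL_PREFIX + "/")
    ):
        return REMOTE_PREFIX + value[len(LOCAL_PREFIX):]
    return value


def translate_context(context, preserved_vars=()):
    """Translate every variable except those named in `preserved_vars`."""
    if isinstance(preserved_vars, str):
        # Accept a single name, as in preserved_vars='hg19_fasta'
        preserved_vars = [preserved_vars]
    return {
        name: value if name in preserved_vars else translate(value)
        for name, value in context.items()
    }


# hg19_fasta is assumed to exist only on the remote host, at exactly this path
context = {
    "hg19_fasta": "/data/ref/hg19.fasta",
    "hg19_star_index": "/data/index/star",
}
mistranslated = translate_context(context)
preserved = translate_context(context, preserved_vars="hg19_fasta")
```

Without the option, the remote-only path in `hg19_fasta` would be rewritten to a location that does not exist; with it, only `hg19_star_index` is translated.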

### Running task

With all the pieces put together, you can now execute the task on the remote host using `task` options

```sos
depends:  hg19_fasta, hg19_genes_gtf
output:   "${hg19_star_index}/chrName.txt"
task:     queue='monster', from_host=hg19_star_index
run:
    STAR \
        --runThreadN 8 \
        --runMode genomeGenerate \
        --genomeDir ${hg19_star_index} \
        --genomeFastaFiles ${hg19_fasta} \
        --sjdbGTFfile ${hg19_genes_gtf} \
        --sjdbOverhang 100
```

For this example,

1. SoS automatically transfers all input (none in this example) and dependent files (`hg19_fasta` and `hg19_genes_gtf` in this example), so no `to_host` is needed.
2. All variables are path names that can be safely translated by SoS, so option `preserved_vars` is not needed.
3. Option `from_host` is needed because we need to transfer not only the representative output file (`${hg19_star_index}/chrName.txt`), but also the whole directory containing all the index files (`hg19_star_index`).

SoS tries its best to automate the process while allowing you to tweak the details with runtime options. Just to recap the use of these  options:

* `to_host` is needed to transfer **additional input** files or directories to remote host.
* `from_host` is needed to transfer **additional output** files or directories from remote host.
* `preserved_vars` is needed to prevent some variables from being translated automatically by SoS.

### Monitor the status of the remote task

Tasks executed on remote servers are external in the sense that they can be monitored and killed externally. When you submit the task, you will be given a task ID, with which you can query the status of the task or cancel it. For example,

```
sos status task_id -q shark
```

can be used to check the status of `task_id` on a remote server `shark`, and command

```
sos kill task_id -q shark
``` 
can be used to kill a running task with ID `task_id`. You can also use the command

```
sos status task_id -q shark -v 3
```
to display the details of the task, including the SoS (Python) statements that will be executed.

### Submit tasks to a task queue

Instead of executing the task directly on a remote server, you can submit the task to a task queue (e.g. PBS/Torque on a cluster system), from which the task is distributed to the worker nodes of the cluster. This requires a little more configuration. You will need to configure

* `queue_type`: type of the task queue

and for `queue_type = pbs`, configurations such as

* `job_template`: A template to generate jobs to be submitted,
* `submit_cmd`: command to submit PBS jobs,
* `status_cmd`: command to check status of jobs, and
* `kill_cmd`: command to cancel/delete pending or running jobs.

You will also need to specify the resources needed to execute your task, using task options such as

* `walltime`: estimated time
* `mem`: estimated memory usage
* `procs`: number of processes needed for the task

With these configurations and options properly set up, you will be able to execute the task using the same syntax, e.g.

```
sos run myscript -q pbs
sos status -q pbs
sos kill tasks -q pbs
```

where `pbs` is the alias of the cluster system.

### An example using docker

Here is a real-world example of running a bioinformatics tool (`tophat2`) on a remote server using docker. The remote server is a Linux server with docker installed. The local machine has all the reference data and annotation files (`hg19_fasta`, `hg19_genes_gtf`) and the Bowtie2 index (`hg19_Bowtie2_index`), but does not have tophat installed (it lacks a Python 2 environment).

The following step runs tophat2 on the input fastq files using a remote host (with alias `linux`) and docker image `genomicpariscentre/tophat2`.

```sos
[tophat-align]
# align reads using the TOPHAT aligner
depends:  hg19_genes_gtf, hg19_fasta, "${hg19_Bowtie2_index}/genome.1.bt2"
input:    fastq_files
output:   "${output_dir}/tophat_main/alignments.bam"

task:   queue='linux', to_host=hg19_Bowtie2_index,
	    from_host="${output_dir}/tophat_main", preserved_vars='sample_name'

R1 = sorted([x for x in input if '_R1_' in x])
R2 = sorted([x for x in input if '_R2_' in x])
stop_if(len(R1) != len(R2), "Unequal number of R1 and R2 files from input ${fastq_files}")

# genomicpariscentre only has tophat2 (bowtie2) so it does not support option --bowtie1
run:    docker_image='genomicpariscentre/tophat2'

	[ -d ${output_dir} ] || mkdir -p ${output_dir}
	tophat2  \
		--read-realign-edit-dist 1 \
		--segment-length 24 \
		-o '${output_dir}/tophat_main' \
		-p 7 \
		--GTF '${hg19_genes_gtf}' \
		--rg-id 0 \
		--rg-sample '${sample_name}' \
		--library-type fr-firststrand \
		--no-coverage-search \
		--keep-fasta-order  \
		--fusion-search --fusion-anchor-length 13 \
		--fusion-ignore-chromosomes chrM,M '${hg19_Bowtie2_index}/genome' \
		'${R1!ae,}' '${R2!ae,}'
```

When the step is executed, SoS will
1. Transfer `hg19_genes_gtf`, `hg19_fasta`, `${hg19_Bowtie2_index}/genome.1.bt2` (specified by `depends`), `fastq_files` (specified by `input`) and `hg19_Bowtie2_index` (specified by `to_host`) to remote server.
2. Translate all variables (`input`, `output`, `hg19_genes_gtf` etc) except for `sample_name` (specified by option `preserved_vars`).
3. On server `linux`, before starting the script, download docker image `genomicpariscentre/tophat2` if not already available.
4. Start the bash script in the docker container.

With this setup, everything is provided and specified by the local host. The server does not need to have any data or software installed, so you are free to use any server with docker installed.