# Execution of tasks on remote hosts

SoS allows you to execute tasks on remote hosts with or without their own file systems. For example, you can execute a complex workflow mostly locally, but execute a few jobs on a remote host if it provides more computing power, or if it has some software that cannot be installed locally. The remote host could have its own file system (separate systems), share its file system with the local machine (e.g. nodes on the cluster), or share some storage with the local machine (e.g. have the same shared storage), so file synchronization will be needed in some cases.

With help from a few runtime options (options to `task`), SoS can

* Copy specified local files to the remote host, possibly to different directories
* Start a SoS task on the remote machine and wait for the completion of the task
* Copy results back from the remote host if the execution is successful

## System setup

### Public-key access

Following any online tutorial, set up public-key access from your local machine to the remote host. If your public key does not work, check the file permissions of `~/.ssh`, of the files under `~/.ssh`, and in some cases of `$HOME`. After setting up the server, make sure you can log in without a password using the command

```bash
% ssh remote-host
```

### Software installation

You will need to install the latest version of SoS (preferably the same version on the local and remote hosts) and the software you need to run. Test the installation by logging in to the remote machine and running

```bash
% sos -h
```

### Check `$PATH`

Commands that are available in a login shell are not necessarily available during remote execution, because remote execution through `ssh` invokes a non-interactive, non-login shell with a basic `$PATH`. SoS tries to address this problem by executing commands through a login shell

```bash
% ssh remote-host "bash --login -c 'sos execute task_id'"
```

but the default `.bashrc` on the remote server might contain a line like

```bash
[ -z "$PS1" ] && return
```

that makes the script return early when `bash` is not running interactively. This line has to be removed in order to have a complete `$PATH` during remote execution.

Now, run the command

```bash
% ssh remote-host "bash --login -c 'sos -h'"
```

from your local machine and see if `sos` can be invoked. Similarly, check if the command you would like to execute remotely can be executed in this manner.
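The check above can also be scripted. The following is a minimal Python sketch that composes the same `ssh` invocation for an arbitrary command; the helper name `login_shell_command` is hypothetical and is not part of SoS.

```python
import shlex


def login_shell_command(host, command):
    """Return the argv list that runs `command` on `host` through a login shell."""
    # shlex.quote protects the command so that it reaches `bash -c` as one argument
    remote = "bash --login -c " + shlex.quote(command)
    return ["ssh", host, remote]


argv = login_shell_command("remote-host", "sos -h")
# Show the command roughly as it would be typed in a terminal
print(argv[0], argv[1], '"' + argv[2] + '"')
```

The returned list can be passed to `subprocess.run(argv)` to perform the test, provided public-key access to the host is already working.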

### Configure `address`

This step is optional but highly recommended. You can save the necessary information for each remote host in a SoS configuration file so that you do not have to specify it each time.

First, give your host an alias so that you do not have to specify the long URL each time. To do this, execute the command

```
% sos config --global --set hosts.monster.address dcdrlue8ee.yourdomain.com
```

where `monster` is a short alias and `dcdrlue8ee.yourdomain.com` is the complete address.

This command writes the following entry to `~/.sos/config.yml`

```bash
$ cat ~/.sos/config.yml
hosts:
  monster:
    address: dcdrlue8ee.yourdomain.com
```

You can edit this file directly if you are familiar with the YAML format.
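To see how the dotted key given to `sos config --set` corresponds to the nested entries in the file, here is a small Python sketch. It only illustrates the correspondence and is not how SoS itself is implemented; the helper `set_dotted` is hypothetical.

```python
def set_dotted(config, dotted_key, value):
    """Store `value` in a nested dict under a key such as 'hosts.monster.address'."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Create intermediate levels on demand
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


config = {}
set_dotted(config, "hosts.monster.address", "dcdrlue8ee.yourdomain.com")
print(config)
```

Each component of the key becomes one level of nesting, which is the structure you would write by hand when editing the configuration file.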

If your account name differs between the local and remote servers, the complete address should be `username@address`. In this example, the address would be `john@dcdrlue8ee.yourdomain.com` if the remote server account is `john`.

### Configure `path_map`

`path_map` is a list of directory mappings between local and remote directories. For example, if you work locally on a Mac machine with home directory `/Users/myuser`, and the remote server is a Linux machine with home directory `/home/myuser`, you should define a `path_map` using command

```
% sos config --global --set hosts.monster.path_map /Users/myuser/:/home/myuser/
```

In this way, if the local data is `/Users/myuser/projects/input.fastq`, the path will be translated to `/home/myuser/projects/input.fastq` during remote execution.

In more complicated cases where there are different directories, more than one mapping can be specified. For example, if you have directories under different volumes, you can map them differently using

```bash
% sos config --global --set hosts.monster.path_map \
    /Users/myuser/projects/:/home/myuser/scratch/projects/  \
    /Volumes/Resource:/home/myuser/resource \
    /Users/myuser:/home/myuser
```

This command will write the following entries to the configuration file

```
$ cat ~/.sos/config.yml
hosts:
  monster:
    address: dcdrlue8ee.yourdomain.com
    path_map:
    - /Users/myuser/projects/:/home/myuser/scratch/projects/
    - /Volumes/Resource:/home/myuser/resource
    - /Users/myuser:/home/myuser
```

and will map your local directories as follows

```
~/projects/data ==> /home/myuser/scratch/projects/data
/Volumes/Resource/hg19.fasta ==> /home/myuser/resource/hg19.fasta
~/myscript.py ==> /home/myuser/myscript.py
```

Note that

1. Both source and destination paths should be absolute (start with `/` on Linux-like systems).
2. SoS expands local directories to absolute paths before applying `path_map`.
3. SoS applies path maps in the order in which they are specified, so you should specify general mappings after more specific ones. For example, if you specify `/Users/myuser:/home/myuser` before `/Users/myuser/projects/:/home/myuser/scratch/projects/`, the path `/Users/myuser/projects/input.txt` would be mapped to `/home/myuser/projects/input.txt`, not to the intended `/home/myuser/scratch/projects/input.txt`.
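The effect of ordering can be reproduced with a few lines of Python. This is a minimal sketch of first-match prefix translation, assuming that a mapping applies to a directory and everything below it; the function `map_path` is hypothetical and the actual matching rules in SoS may differ in detail.

```python
def map_path(path, path_map):
    """Translate an absolute local path with the first matching 'local:remote' entry."""
    for entry in path_map:
        local, remote = entry.split(":", 1)
        local, remote = local.rstrip("/"), remote.rstrip("/")
        # Match the directory itself or anything underneath it
        if path == local or path.startswith(local + "/"):
            return remote + path[len(local):]
    return path  # no mapping applies, keep the path unchanged


path_map = [
    "/Users/myuser/projects/:/home/myuser/scratch/projects/",
    "/Volumes/Resource:/home/myuser/resource",
    "/Users/myuser:/home/myuser",
]

# Specific mapping listed first: the intended result
print(map_path("/Users/myuser/projects/input.txt", path_map))
# General mapping listed first: the specific mapping is never reached
print(map_path("/Users/myuser/projects/input.txt", path_map[::-1]))
```

The first call yields a path under `/home/myuser/scratch/projects`, while the second, with the list reversed, yields a path under `/home/myuser/projects`.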


### Configure `shared`

Option `shared` tells SoS which file systems are shared between local and remote hosts so that it does not have to synchronize files under these directories between the hosts.

* SoS assumes independent file systems, so you do not have to specify option `shared` if the local and remote hosts do not share any file system.
 
* If your local and remote host share all file systems, you should use command

  ```bash
  % sos config --global --set hosts.monster.shared /
  ```
  to indicate that the root directory is shared so no cross-network copy is needed.
  
* If your local and remote host share one or more shared volumes, you can specify them with command

  ```bash
  % sos config --global --set hosts.monster.shared /projects /data
  ```
  to indicate that files under these directories are available on the remote host.

Shared file systems do not have to be mounted at the same locations. For example, a local file system `/projects` might be available at the remote host under `/scratch/projects`. In this case, you should

* Set `/projects` as `shared` so that files under `/projects` will not be copied.
* Set `/projects:/scratch/projects` in `path_map` so that the path can be correctly translated between local and remote hosts.

It is important to remember that **SoS does not copy files under shared directories**. If the local and remote hosts share a file system but you really would like to copy files to a different directory, you can omit the `shared` option and let SoS copy files as if the file systems were separate.
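The rule can be pictured as a filter over the files a task needs. The following Python sketch assumes that a file counts as shared when it lies under one of the `shared` directories; the helpers `is_shared` and `files_to_copy` are hypothetical names, not SoS functions.

```python
def is_shared(path, shared):
    """Return True if `path` lies under one of the shared directories."""
    roots = [shared] if isinstance(shared, str) else shared
    for root in roots:
        root = root.rstrip("/")
        # An empty root means '/' was given, so every absolute path is shared
        if root == "" or path == root or path.startswith(root + "/"):
            return True
    return False


def files_to_copy(paths, shared):
    """Keep only the files that are not available through a shared directory."""
    return [p for p in paths if not is_shared(p, shared)]


files = ["/projects/sample.fastq", "/Users/myuser/script.py"]
print(files_to_copy(files, ["/projects", "/data"]))
print(files_to_copy(files, "/"))
```

With `/projects` and `/data` shared only the script would be copied, and with `/` shared nothing would be copied.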

### Sample configurations

The server settings are critically important for the successful execution of commands on remote servers. As an example, I am working on a Mac mini (with limited CPU/RAM) and have access to a Mac Pro workstation and a Linux server, and I use Google Drive to store scripts that are shared by the Mac mini and the Mac Pro workstation.

The host configurations for the two remote machines are

```
hosts:
  macpro:
    address: mp-bpeng.mdanderson.edu
    path_map:
    - /Users/bpeng1/.sos/RNASeq/:/Volumes/Home/.sos/
    - /Users/bpeng1/Google Drive:/Users/bpeng1/gdrive
    shared: /Users/bpeng1/Google Drive
  linux:
    address: dcdrlpmcfd.mdanderson.edu
    path_map:
    - /Users/bpeng1/Google Drive:/home/bpeng1/gdrive
    - /Users/bpeng1/.sos/RNASeq/:/home/bpeng1/.sos/
    - /Users/bpeng1:/home/bpeng1
```

For the Mac Pro,
1. Resources under `~/.sos/RNASeq` are mapped to a separate volume under `/Volumes/Home/.sos`.
2. Local `Google Drive` is mapped to `gdrive` on the remote server. This is not required, but paths with spaces cause trouble for a lot of software, so I created a symbolic link `gdrive` to `Google Drive` on the Mac Pro and use the path `gdrive` to access `Google Drive`.
3. `~/Google Drive` is declared to be `shared`, so files under `Google Drive` will not be synchronized (but paths will still be mapped to `gdrive`).
4. Both systems have `/Users/bpeng1`, so no other mapping is needed.

For the Linux server,
1. Scripts under `Google Drive` are mapped to `/home/bpeng1/gdrive` because Linux is even less tolerant of spaces in paths.
2. Resources under local `~/.sos/RNASeq` are mapped to `/home/bpeng1/.sos` on the server.
3. Home directory `/Users/bpeng1` is mapped to `/home/bpeng1` on the Linux server.
4. No file system is shared between the Mac mini and the Linux server.
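Because a general mapping listed too early silently hides the more specific ones, it can be worth checking a `path_map` before relying on it. The sketch below is one possible check under the assumption of first-match prefix translation; `shadowed_entries` is a hypothetical helper and not a SoS command.

```python
def shadowed_entries(path_map):
    """Return (earlier, later) pairs where an earlier local prefix hides a later one."""
    local_dirs = [entry.split(":", 1)[0].rstrip("/") for entry in path_map]
    problems = []
    for i, earlier in enumerate(local_dirs):
        for later in local_dirs[i + 1:]:
            # `later` can never match if `earlier` already covers it
            if later == earlier or later.startswith(earlier + "/"):
                problems.append((earlier, later))
    return problems


linux_map = [
    "/Users/bpeng1/Google Drive:/home/bpeng1/gdrive",
    "/Users/bpeng1/.sos/RNASeq/:/home/bpeng1/.sos/",
    "/Users/bpeng1:/home/bpeng1",
]
print(shadowed_entries(linux_map))        # the order used above: no problems
print(shadowed_entries(linux_map[::-1]))  # general mapping first: two hidden entries
```

The `linux` configuration above passes the check because the home directory mapping comes last.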

## Running tasks remotely

### Option `on_host`

If you have set up a host in the SoS configuration file properly, you can use option `on_host` to specify the host on which the task will be executed.

```
task:   on_host='monster'
```

Here `monster` is the alias of the host.

Instead of using a configuration file to save the configurations of all hosts, you can also specify all host-related information in option `on_host` as a dictionary. This method is preferred by users who would like to keep everything in the script. A common usage pattern is to define all hosts in the global section of the script:

```sos
host1 = {
    'address': 'address-of-host1',
    'path_map': ['path1:remote_path1',
                 'path2:remote_path2'],
    'shared': '/shared/path'
}
host2 = ...
```

and use the configurations as follows:

```sos
task:     on_host=host1,
          to_host=..., from_host=...
```

You can even allow the host to be selected from the command line:

```sos
hosts = {'host1': ...,
         'host2': ...,
         'localhost': None
         }
# by default execute locally        
parameter: host = 'localhost'

[10]
task:   on_host=hosts[host]
```

### Options `to_host` and `from_host`

Now that you have your machine configured, you should try to copy some files and see if the configuration works correctly. File copy is specified with options `to_host` and `from_host`, which accept a single file or directory name (string) or a list (or nested lists) of filenames. You can test these options using simple SoS steps such as the following (replace the filenames with files you have, of course):

```
[1]
task:   on_host='monster',
        to_host=['~/projects/data/test1', '/Volumes/Resource/hg19.fasta'],
        from_host='~/projects/data/test1.res'
run:
    echo "Hello, World"
```

**Input, dependent, and output files are automatically transferred**, so `to_host` and `from_host` are only needed to transfer files or directories in addition to step inputs and outputs. If there is any problem with file transfer, use option `-v3` to check if filenames are mapped correctly.
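Since `to_host` and `from_host` accept a string, a list, or nested lists, the values have to be reduced to a flat list of names at some point. The following Python sketch shows one way to do that; the function `flatten` is hypothetical and only illustrates the accepted input shapes.

```python
def flatten(items):
    """Turn a filename, a list of filenames, or nested lists into a flat list."""
    if isinstance(items, str):
        return [items]
    result = []
    for item in items:
        # Recurse so that lists nested to any depth are handled
        result.extend(flatten(item))
    return result


print(flatten("~/projects/data/test1"))
print(flatten(["~/projects/data/test1", ["ref/hg19.fasta", ["ref/genes.gtf"]]]))
```

Nested lists are convenient when the names come from several variables, some of which are themselves lists of files.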

### Variable translation

Each task has a *context* that contains variables that will be used to, for example, compose scripts to be executed. Even if the task will be executed remotely, you should write your task using local paths, and **define variables for all paths that would differ between local and remote hosts**. For example, you might have a script that generates a `STAR` index from a fasta file. You can have all these files available locally and write the task as:

```
depends:      hg19_fasta
run:
    STAR \
        --runThreadN 8 \
        --runMode genomeGenerate \
        --genomeDir ${hg19_star_index} \
        --genomeFastaFiles ${hg19_fasta} \
        --sjdbGTFfile ${hg19_genes_gtf} \
        --sjdbOverhang 100
```

where  `hg19_fasta`, `hg19_genes_gtf` and `hg19_star_index` are variables pointing to input and output files of this process.

**SoS will by default translate all variables (of type string and list of strings) as if they are local paths**. In this case, all three variables will be translated to remote paths during remote execution. You can view details of variable translation using option `-v3` (debug output).

### Option `preserved_vars`

Automatic variable translation is convenient, but SoS can make mistakes because it does not really know which variables contain path names that need to be converted. For example, if you do not have `hg19_fasta` locally and use variable `hg19_fasta` to point to a fasta file on the remote host, you can add this variable to option `preserved_vars` so that its value will not be mapped during context switch:

```sos
task:     on_host='monster', preserved_vars='hg19_fasta'
run:
    STAR \
        --runThreadN 8 \
        --runMode genomeGenerate \
        --genomeDir ${hg19_star_index} \
        --genomeFastaFiles ${hg19_fasta} \
        --sjdbGTFfile ${hg19_genes_gtf} \
        --sjdbOverhang 100
```

Other variables that need to be preserved include sample names, command line options etc.
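The interaction between automatic translation and `preserved_vars` can be sketched in Python as follows. This is an illustration under simplifying assumptions (first-match prefix mapping, only strings and lists of strings are translated); the functions `map_path` and `translate_context` and the sample paths are hypothetical.

```python
def map_path(path, path_map):
    """Translate a path with the first matching 'local:remote' entry."""
    for entry in path_map:
        local, remote = entry.split(":", 1)
        local, remote = local.rstrip("/"), remote.rstrip("/")
        if path == local or path.startswith(local + "/"):
            return remote + path[len(local):]
    return path


def translate_context(context, path_map, preserved_vars=()):
    """Translate string and list-of-string variables unless they are preserved."""
    if isinstance(preserved_vars, str):
        preserved_vars = [preserved_vars]
    translated = {}
    for name, value in context.items():
        if name in preserved_vars:
            translated[name] = value  # keep exactly as written
        elif isinstance(value, str):
            translated[name] = map_path(value, path_map)
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            translated[name] = [map_path(v, path_map) for v in value]
        else:
            translated[name] = value  # numbers and other types are left untouched
    return translated


context = {
    "hg19_fasta": "/data/hg19.fasta",        # already a path on the remote host
    "hg19_star_index": "/data/index/STAR",   # a local path
    "threads": 8,
}
path_map = ["/data:/scratch/data"]
print(translate_context(context, path_map))
print(translate_context(context, path_map, preserved_vars="hg19_fasta"))
```

Without `preserved_vars`, the remote path in `hg19_fasta` is rewritten a second time and no longer points to the file; with it, only `hg19_star_index` is translated.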

Note that you can write tasks for remote hosts (e.g. use hard-coded paths or preserve related variables), but that will make your task host-dependent. It is recommended that you **write your script in local paths** and let SoS do the conversion so that you do not have to change the script itself if you would like to execute the task locally or switch to hosts with different configurations.

### Running task

With all the pieces put together, you can now execute the task on the remote host using `task` options

```sos
depends:  hg19_fasta, hg19_genes_gtf
output:   "${hg19_star_index}/chrName.txt"
task:     on_host='monster', from_host=hg19_star_index
run:
    STAR \
        --runThreadN 8 \
        --runMode genomeGenerate \
        --genomeDir ${hg19_star_index} \
        --genomeFastaFiles ${hg19_fasta} \
        --sjdbGTFfile ${hg19_genes_gtf} \
        --sjdbOverhang 100
```

For this example,

1. SoS automatically transfers all input (none in this example) and dependent files (`hg19_fasta` and `hg19_genes_gtf` in this example), so no `to_host` is needed.
2. All variables are path names that can be safely translated by SoS, so option `preserved_vars` is not needed.
3. Option `from_host` is needed because we need to transfer not only the representative output file (`${hg19_star_index}/chrName.txt`), but also the whole directory containing all the index files (`hg19_star_index`).

SoS tries its best to automate the process while allowing you to tweak the details with runtime options. Just to recap the use of these options:

* `to_host` is needed to transfer **additional input** files or directories to the remote host.
* `from_host` is needed to transfer **additional output** files or directories from the remote host.
* `preserved_vars` is needed to prevent some variables from being translated automatically by SoS.

### An example using docker

Here is a real-world example of running a bioinformatics tool (`tophat2`) on a remote server using docker. The remote server is a Linux server with docker installed. The local machine has all the reference data and annotation files (`hg19_fasta`, `hg19_genes_gtf`) and the Bowtie2 index (`hg19_Bowtie2_index`), but does not have tophat installed (it lacks a Python 2 environment).

The following step runs tophat2 on the input fastq files using a remote host (with alias `linux`) and docker image `genomicpariscentre/tophat2`.

```sos
[tophat-align]
# align reads using the TOPHAT aligner
depends:  hg19_genes_gtf, hg19_fasta, "${hg19_Bowtie2_index}/genome.1.bt2"
input:    fastq_files
output:   "${output_dir}/tophat_main/alignments.bam"

task:   on_host='linux', to_host=hg19_Bowtie2_index,
	    from_host="${output_dir}/tophat_main", preserved_vars='sample_name'

R1 = sorted([x for x in input if '_R1_' in x])
R2 = sorted([x for x in input if '_R2_' in x])
stop_if(len(R1) != len(R2), "Unequal number of R1 and R2 files from input ${fastq_files}")

# genomicpariscentre only has tophat2 (bowtie2) so it does not support option --bowtie1
run:    docker_image='genomicpariscentre/tophat2'

	[ -d ${output_dir} ] || mkdir -p ${output_dir}
	tophat2  \
		--read-realign-edit-dist 1 \
		--segment-length 24 \
		-o '${output_dir}/tophat_main' \
		-p 7 \
		--GTF '${hg19_genes_gtf}' \
		--rg-id 0 \
		--rg-sample '${sample_name}' \
		--library-type fr-firststrand \
		--no-coverage-search \
		--keep-fasta-order  \
		--fusion-search --fusion-anchor-length 13 \
		--fusion-ignore-chromosomes chrM,M '${hg19_Bowtie2_index}/genome' \
		'${R1!ae,}' '${R2!ae,}'
```

When the step is executed, SoS will
1. Transfer `hg19_genes_gtf`, `hg19_fasta`, `${hg19_Bowtie2_index}/genome.1.bt2` (specified by `depends`), `fastq_files` (specified by `input`) and `hg19_Bowtie2_index` (specified by `to_host`) to remote server.
2. Translate all variables (`input`, `output`, `hg19_genes_gtf` etc) except for `sample_name` (specified by option `preserved_vars`).
3. On server `linux`, before starting the script, download docker image `genomicpariscentre/tophat2` if not already available.
4. Start the bash script in the docker container.

With this setup, everything is provided and specified by the local host. The server does not have to have any data and software installed so you are free to make use of any server with docker installed.

## Advanced usages

### Configure `send_cmd`, `received_cmd`, and `execute_cmd`

SoS uses the `rsync` command to exchange files between hosts and `ssh` to execute commands. If the default commands do not work for your configuration (e.g. if you do not have `rsync` and need to use `scp`), you can

1. Use option `-v3` to display the exact command used to transfer files and execute commands.

2. Define options `send_cmd`, `received_cmd` and `execute_cmd` for your particular configuration. These variables should be defined with `${source}` and `${dest}` which will be replaced by source and destination filenames for each file.

Similar to options `path_map` and `shared`, these three options can be defined in configuration file or as keys to `on_host`.
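To make the role of `${source}` and `${dest}` concrete, the following Python sketch fills a command template once per file. It uses `string.Template`, which happens to understand the same `${name}` placeholders, purely for illustration; the `scp` template shown is a hypothetical value for `send_cmd`, not the default used by SoS, and `render_command` is a hypothetical helper.

```python
from string import Template


def render_command(template, source, dest):
    """Fill the ${source} and ${dest} placeholders of a command template."""
    return Template(template).substitute(source=source, dest=dest)


# Hypothetical send_cmd for a host that has scp but no rsync
send_cmd = "scp ${source} monster:${dest}"

print(render_command(send_cmd,
                     source="/Users/myuser/projects/input.fastq",
                     dest="/home/myuser/projects/input.fastq"))
```

Use option `-v3` to see the commands SoS actually runs before writing a replacement.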
