# Execution of tasks on remote hosts, an overview

* **Difficulty level**: intermediate
* **Time needed to learn**: 30 minutes or less
* **Key points**:
  * A configuration file specifies how SoS should interact with a remote host
  * Tasks should be written in a way that can be executed remotely
  * Tasks can be submitted to remote hosts with option `-q`

This tutorial gives an overview of the SoS task model and steps to submit tasks to remote hosts.

## How remote task execution works

<p align="center">
  <img src="https://vatlab.github.io/sos-docs/doc/media/task_overview.png" width="80%">
</p>

Executing tasks on remote hosts is generally easy once everything is set up correctly, but can be very tricky if you do not have a clear idea of how things work. As illustrated above, here is a summary of how SoS executes tasks on remote hosts:

**Step 1: Prepare task**
* SoS generates a task file, which contains information such as the content of the task (e.g. the script to execute) and its environment (variables, parameters, etc.).
* If the remote system is a batch system such as `PBS`, `SLURM` and `LSF`, a shell script is generated from a host-specific template. 

**Step 2: Copy tasks and required files to remote host**
* SoS prepares the task file to be executed on the remote host. It essentially records the path definitions of the remote host so that the task can be executed with the correct paths.
* SoS copies the task file, shell script, and sometimes input files to the remote host using `rsync` or `rcp`.

**Step 3: Execute task remotely, optionally through a task queue**
* If the remote system is not a batch system, a `sos execute task_id` command is started on the remote host (through `ssh`) to execute the task.
* If the remote system is a batch system, a system-dependent command (e.g. `qsub task_id.sh`) is executed on the remote host to submit the task to the scheduler. The shell script is essentially a wrapper to the `sos execute` command.

**Step 4: Monitor the execution of jobs**
* The task is run on the remote host, either directly or by a batch system.
* SoS periodically sends a `sos status` command to the remote host to check the status of the task.

**Step 5: Retrieve the results once the task is done**
* Once the task is completed, SoS copies the task file and the specified output files back to the local host.
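
The five steps above boil down to a short sequence of commands issued against the remote host. The following Python sketch only builds these command lines and does not run anything. The function name `remote_task_commands`, the `.task` file name, the copy destination, and the exact `rsync` and `ssh` arguments are assumptions made for illustration and are not SoS's actual implementation; only `sos execute task_id`, `sos status`, and `qsub task_id.sh` come from the description above.

```python
def remote_task_commands(host, task_id, batch=False):
    """Return illustrative commands for steps 2 to 5 as argument lists."""
    task_file = f"{task_id}.task"  # hypothetical task file name
    script = f"{task_id}.sh"       # wrapper script, only used for batch systems

    # Step 2: copy the task file (and the shell script, if any) to the remote host
    to_copy = [task_file] + ([script] if batch else [])
    copy = ["rsync", "-a", *to_copy, f"{host}:"]

    # Step 3: start the task directly, or submit the wrapper script to the scheduler
    if batch:
        run = ["ssh", host, "qsub", script]
    else:
        run = ["ssh", host, "sos", "execute", task_id]

    # Step 4: ask the remote host for the status of the task
    # (passing the task ID to `sos status` is an assumption)
    status = ["ssh", host, "sos", "status", task_id]

    # Step 5: copy the task file back to the local host
    fetch = ["rsync", "-a", f"{host}:{task_file}", "."]

    return [copy, run, status, fetch]


for command in remote_task_commands("vm", "a5dbf5181f2cf8d5"):
    print(" ".join(command))
```

Passing `batch=True` replaces the direct `sos execute` call with a `qsub` submission of the wrapper script, which mirrors the two cases described in Step 3.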

This tutorial walks through the setup and execution of remote tasks through simple examples. I will be using an Ubuntu VM with a task-spooler program that mimics a task queue system, and I will make some intentional errors to show how things work.

## One-time system configuration

SoS requires public-key access to remote hosts and tools such as `ssh`, `rsync` and `scp` to copy files. Unlike other programs that require a daemon process (a service) on the server to coordinate the execution of tasks, SoS relies on a basic remote access toolchain to access the remote host and execute commands remotely, which makes it easy to deploy SoS to the majority of servers.
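
Before configuring anything, it can help to confirm that the tools named above are installed on the local machine. The following is a minimal sketch using only the Python standard library; the helper name `check_local_tools` is made up for this example and is not part of SoS.

```python
import shutil


def check_local_tools(tools=("ssh", "rsync", "scp")):
    """Map each tool name to True if an executable is found on the local PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


for tool, found in check_local_tools().items():
    print(f"{tool}: {'found' if found else 'missing'}")
```

This only checks the local side; whether the same tools work against the remote host is what `sos remote test` verifies later in this tutorial.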

### 1. Describe the host in `~/.sos/hosts.yml`

A VM is created with a basic system and an SSH server. After getting the IP address of the VM with the command `ifconfig`, it is time to add it to the SoS host definition file `~/.sos/hosts.yml`.

Here I create such a file using the `report` action where

1. `localhost` specifies the machine I am working on, which is a Mac Pro desktop.
2. `vm` specifies the remote host, with `192.168.47.134` being the IP address of the VM.
3. Two `paths` named `home` and `project` are defined for each host.

In [3]:
report: output='~/.sos/hosts.yml'

    localhost: macpro
    hosts:
        macpro:
            address: localhost
            paths:
                home:  /Users/{user_name}
                project: /Users/{user_name}/Documents
        vm:
            address: 192.168.47.134
            paths:
                home: /home/{user_name}
                project: /home/{user_name}/projects
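
To make the role of the named paths concrete, the sketch below mirrors the configuration above as a Python dictionary and expands `{user_name}` for a given host. It is only an illustration: the function `named_path` is hypothetical, and using `str.format` for the expansion is a stand-in, since this tutorial does not show how SoS interpolates `{user_name}` internally.

```python
# The host definitions from ~/.sos/hosts.yml, mirrored as a plain dictionary
HOSTS = {
    "macpro": {
        "address": "localhost",
        "paths": {
            "home": "/Users/{user_name}",
            "project": "/Users/{user_name}/Documents",
        },
    },
    "vm": {
        "address": "192.168.47.134",
        "paths": {
            "home": "/home/{user_name}",
            "project": "/home/{user_name}/projects",
        },
    },
}


def named_path(host, name, user_name):
    """Expand the named path `name` of `host` for the given user."""
    template = HOSTS[host]["paths"][name]
    return template.format(user_name=user_name)


print(named_path("macpro", "project", "bpeng1"))
print(named_path("vm", "project", "bpeng1"))
```

The same name (`project`) resolves to a different directory on each host, which is what later allows one task definition to run on either machine.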

### 2. Set up public-key access to the server

There are plenty of online tutorials for setting up public-key access to remote hosts. You can do this manually or simply enter the following command

In [None]:
!sos remote setup vm

and enter your password when prompted.

### 3. Install `sos` on remote host

Since the VM comes with only Python 2.7 and SoS requires Python 3.6+, I installed [Anaconda python 3](https://www.anaconda.com/distribution/) and then `SoS` with command

```bash
$ pip install sos
```

If you are on a system without root privileges, you can ask your system administrator to install `SoS`, or install Anaconda Python locally.

### 4. Testing the remote host

Now you will need to run the command

In [7]:
!sos remote test vm

Alias Address        Queue Type ssh scp sos paths shared
----- -------        ---------- --- --- --- ----- ------
vm    192.168.47.134 process    OK  OK  OK  OK    OK    


to verify if the remote host is ready to use. This command tests

* `ssh`: whether or not sos can ssh to the host without being prompted for a password. If this test fails after `sos remote setup`, please try to manually enable public-key access.
* `scp`: whether or not `rsync`, `rcp` etc are available and can be used. Install these tools if they are not available.
* `sos`: whether or not the `sos` command can be called directly on the server. If you have installed `sos` and this test fails, it is likely that your `PATH` is set in `.bashrc` rather than `.bash_profile`, whereas `sos` is called with the command `ssh host "bash --login -c sos"`, which starts a login shell that does not read `.bashrc` by default. Check the [bash manual](https://www.gnu.org/software/bash/manual/html_node/Bash-Startup-Files.html) for details.
* `paths`: whether all named paths actually exist and are accessible. I had to log in to the VM and create `~/projects` to fix this test.
* `shared`: useful only when two hosts share some or all file systems.

If you see five `OK`s, the remote host `vm` is ready to use.
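
If you run this test for several hosts, a small script can point out which checks need attention. The sketch below parses one result row of the table shown above. It assumes that the five check results are the last five whitespace-separated fields of the row and that anything other than `OK` means the check did not pass; the `FAIL` value in the second call is invented for the example, since the actual failure text is not shown in this tutorial.

```python
CHECKS = ("ssh", "scp", "sos", "paths", "shared")


def failed_checks(row):
    """Return the names of the checks in one result row that are not 'OK'."""
    fields = row.split()
    results = fields[-len(CHECKS):]  # the last five fields are the check results
    return [name for name, value in zip(CHECKS, results) if value != "OK"]


print(failed_checks("vm    192.168.47.134 process    OK  OK  OK  OK    OK    "))
print(failed_checks("vm    192.168.47.134 process    OK  OK  OK  FAIL  OK    "))
```

A failure message that contains spaces would break this simple split, so treat it as a starting point rather than a robust parser.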

## Writing tasks for remote execution

### 1. Writing tasks with remote paths

<div class="bs-callout bs-callout-primary" role="alert">
    <h4>Tasks written for remote host</h4>
    Tasks written for a remote host can use paths on the remote host directly. It is recommended to specify the queue directly using the task option <code>queue</code>.
</div>

The `vm` host has type `process`, which means that tasks will be executed directly on the host. This is good enough in many cases, so let us write and submit a task to the VM:

In [1]:
task: queue='vm', workdir='/home/bpeng1/projects'
sh:
   echo 'This is my first task on vm' > 'result.txt'

a5dbf5181f2cf8d5 | 76998591254477ab scratch_0 | Ran for < 5 seconds | completed


In this workflow, we use `queue='vm'` to let SoS know that the task will be submitted to host `vm`, and the task uses a `workdir` (`/home/bpeng1/projects`) that is only available on the remote host. You can also specify the queue using the command line option `-q`:

In [2]:
%run -q vm
task: workdir='/home/bpeng1/projects'
sh:
   echo 'This is my first task on vm' > 'result.txt'

INFO: a5dbf5181f2cf8d5 started
INFO: Task a5dbf5181f2cf8d5 for substep scratch_0 (index=0) is ignored due to saved signature


However, the task cannot be executed locally because `/home/bpeng1/projects` is not a valid `workdir` locally.

In [4]:
task: workdir='/home/bpeng1/projects'
sh:
   echo 'This is my first task on vm' > 'result.txt'

a5dbf5181f2cf8d5 | 889b4523f1093852 scratch_0 | Ran for < 5 seconds | failed


### 2. Interpreters, applications, libraries, and the use of containers

The second problem with remote execution is the availability of language interpreters (e.g. `R`), applications (e.g. `STAR`), and libraries (e.g. `bioconductor`) on the remote host. Your tasks will not be able to run if they need these tools and the tools are not available on the remote host.

Other than installing these applications and libraries on the remote host, or begging your system administrator to do so for you, you can use containers in your task. For example,

In [5]:
task:
R: container='r-base'
   cat('this is R')

Failed to execute process
"'R("cat(\'this is R\')\\n", container=\'r-base\')\n'"
name 'path' is not defined


If the remote host has internet access, it might be able to pull the images and execute the scripts directly. Otherwise you will have to transfer the images to the remote host manually.

### 3. Input and supporting files

Input data of tasks can be small or large, can be available locally or remotely, and you do not always want the data to be synchronized. So here is the rule of thumb for data synchronization:

1. By default, SoS automatically synchronizes input and output files. If you use `input` and `output` statements to specify the `_input` and `_output` of your task, SoS will synchronize input files to the remote host before task execution, and synchronize output files from the remote host after task execution.

2. If you are submitting tasks to process data that already resides on the remote host, do not specify `input`. If you do want to specify it for clarity, use the `remote()` function to let SoS know that the data is on the remote host.
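
The two rules above can be read as a simple filter on the files of a task. The following sketch models each file as a `(path, is_remote)` pair, where `is_remote` plays the role of wrapping the file in `remote()`. The function `plan_transfers` and this data model are assumptions made for illustration and do not reflect how SoS represents files internally.

```python
def plan_transfers(inputs, outputs):
    """Decide which files would be synchronized under the rules above.

    `inputs` and `outputs` are lists of (path, is_remote) pairs.
    Returns (to_upload, to_download).
    """
    # Local input files are sent to the remote host before execution
    to_upload = [path for path, is_remote in inputs if not is_remote]
    # Output files not marked as remote are fetched after execution
    to_download = [path for path, is_remote in outputs if not is_remote]
    return to_upload, to_download


uploads, downloads = plan_transfers(
    inputs=[("params.txt", False), ("/data/large_input.bam", True)],
    outputs=[("summary.txt", False)],
)
print("upload:", uploads)
print("download:", downloads)
```

In this example only `params.txt` is uploaded, because the large input is declared to live on the remote host already, and `summary.txt` is brought back afterwards.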


## Working with batch systems

### 1. Job submission script

If the remote host is a batch system such as `PBS`, your task needs to be submitted as a job using particular commands. SoS allows you to define the submission scripts and a job template and automatically execute the command to submit the task for you.



### 2. Queue, CPU, and memory specifications

All batch systems require the specification of CPU and memory usage, so you will have to estimate the required resources and specify them with the task.
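
To illustrate how a job template and the resource specifications fit together, here is a minimal sketch that renders a PBS-style submission script wrapping `sos execute`. The template text, the placeholder names (`task_id`, `cores`, `mem`, `walltime`), and the use of Python's `string.Template` are all assumptions for this example; the template syntax that SoS actually expects in its host configuration is not covered in this tutorial.

```python
from string import Template

# Hypothetical job template; the #PBS lines are standard PBS directives,
# but the placeholders are invented for this sketch.
PBS_TEMPLATE = Template("""\
#!/bin/bash
#PBS -N $task_id
#PBS -l nodes=1:ppn=$cores
#PBS -l mem=$mem
#PBS -l walltime=$walltime
sos execute $task_id
""")


def render_job_script(task_id, cores, mem, walltime):
    """Fill in the template with the task ID and requested resources."""
    return PBS_TEMPLATE.substitute(
        task_id=task_id, cores=cores, mem=mem, walltime=walltime
    )


print(render_job_script("a5dbf5181f2cf8d5", cores=4, mem="8gb", walltime="01:00:00"))
```

The rendered script would then be submitted with a scheduler command such as `qsub`, as described in the overview at the top of this tutorial.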

## Writing portable tasks

### Reminder: named paths

Host-dependent paths play a key role in the portability of SoS tasks. As described in [this tutorial](config_files.html#Host-dependent-paths), 

```
path(name='home')
```
refers to the entry `home` in the `paths` definition of the local host. What that tutorial did not say is that, inside a task, `path(name='home')` refers to the named path of the host on which the task will be executed.

That is to say, with the configuration of `macpro` and `vm` given earlier, the shell script below will be executed under `/Users/bpeng1/Documents` on `macpro` and under `/home/bpeng1/projects` on `vm`, because `path(name='project')` returns host-dependent values for the named path `project`.

In [4]:
task:

run: workdir=path(name='project')
  echo "I am working at `pwd`"

5afa824d28b68f06 | 2c1300c4eac2562c scratch_0 | Ran for < 5 seconds | completed


### 1. `workdir` of tasks and actions

<div class="bs-callout bs-callout-primary" role="alert">
    <h4><code>workdir</code> of tasks and actions</h4>
    <p>All SoS tasks and actions (scripts) executed within the tasks have a <code>workdir</code>, which determines the starting location when the task will be executed. Users can use either the default <code>workdir</code>, which is the current directory, or specify a <code>workdir</code>. For the task to be executable on a remote host, the <code>workdir</code> needs to exist on remote host. That is to say</p>
    <ul>
        <li>If a default <code>workdir</code> is used, the current directory must be under one of the named directories such as <code>home</code> </li>
        <li>A user-specified <code>workdir</code> should preferably be under one of the named directories.</li>
    </ul>
</div>

The first rule to remember is that you should **either work under one of the named directories (e.g. `home` or `project`) or specify a `workdir` that is under one of the named directories**. The reason behind this rule is that SoS executes tasks on the remote host either under a user-specified `workdir` or under a `workdir` that is derived from the current working directory, so whereas `workdir` in tasks

```
task: workdir=path(name='project')
```
and
```
task:
```
with default `workdir` = `path(name='home') / 'sos-docs'` could be located on remote host,

```
task: workdir='/some/local/path'
```
would fail if `/some/local/path` is not a valid path on remote host. Note that the last example would work perfectly locally, but it cannot be executed on remote host without `/some/local/path`. On the other hand, it is perfectly ok for you to hard-code remote path as in

```
task: workdir='/some/remote/path'
```
if you only intend to execute the task on a particular remote host.
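
The rule can be pictured as a translation of the local working directory into its remote counterpart through the named paths. The sketch below is a simplified model of that idea, not SoS's actual code: the function `map_workdir` is hypothetical, and the two dictionaries simply restate the `macpro` and `vm` paths for user `bpeng1`.

```python
from pathlib import PurePosixPath

# Named paths of the two hosts, expanded for user bpeng1
LOCAL_PATHS = {"home": "/Users/bpeng1", "project": "/Users/bpeng1/Documents"}
REMOTE_PATHS = {"home": "/home/bpeng1", "project": "/home/bpeng1/projects"}


def map_workdir(local_dir, local_paths=LOCAL_PATHS, remote_paths=REMOTE_PATHS):
    """Translate a local directory to the remote host via the named paths."""
    local_dir = PurePosixPath(local_dir)
    # Try the longest (most specific) named path first
    candidates = sorted(local_paths.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, base in candidates:
        try:
            relative = local_dir.relative_to(base)
        except ValueError:
            continue  # local_dir is not under this named path
        return str(PurePosixPath(remote_paths[name]) / relative)
    raise ValueError(f"{local_dir} is not under any named path")


print(map_workdir("/Users/bpeng1/sos-docs"))
print(map_workdir("/Users/bpeng1/Documents/analysis"))
```

A directory such as `/some/local/path` is not under any named path, so `map_workdir` raises a `ValueError`, which corresponds to the failing case described above.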

## Summary

So at the end of the day, after you have set up everything, or if you are lucky enough that a system administrator has set up everything for you, you can execute your tasks on almost any defined remote host, and you only need to remember:

1. If you are writing a task specifically for a remote host, you can
    * use host-specific path name
    * rely on host-specific applications
    * do not specify step input output or use `remote()` for `input` or `output` if 
2. If you would like to make your task more portable, namely executable on local and multiple remote hosts,
    * Use named paths in your task
3. Verify if the applications, libraries, or containers are available on remote host.
4. Specify CPU and memory usages of tasks if you are submitting the tasks to a batch system

Good luck.

## Further reading

* [Named host-dependent paths](config_files.html#Host-dependent-paths)