# Path translation and file synchronization

* **Difficulty level**: intermediate
* **Time needed to learn**: 20 minutes or less
* **Key points**:
  * A remote host might have different paths from the local host, making the execution of tasks difficult
  * SoS automatically translates paths specified in `_input`, `_depends` and `_output` according to host configurations
  * Options `to_host` and `from_host` specify files and directories to send before task execution and to retrieve after task execution, respectively
  * Using named paths can make your workflow more portable and easier to read

## Translation of input and output paths

When local and remote hosts do not share file systems (or share only some file systems), things can get a bit complicated because SoS will need to decide what paths to use on the remote host. There are a few things to understand here:

**The current project directory, and all input, output and dependent files that are involved, need to be under paths defined for both the local and remote hosts.** This is usually not a problem if you are working under your home directory and you have `home` defined under `paths` of both local and remote hosts, but it can become more complicated if your tasks involve system directories such as `resource`, `temp`, and `scratch` that are outside of `home`. In these cases, all involved directories need to be defined for both local and remote hosts.
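The requirement above can be illustrated with a minimal sketch of prefix-based path translation. This is not SoS's actual implementation; the function `translate` and the two dictionaries are hypothetical stand-ins for the `paths` entries of the host definitions, and the values mirror the hosts used later in this tutorial.

```python
from pathlib import PurePosixPath

# Hypothetical path maps; in SoS these would come from the `paths`
# entries of the local and remote host definitions.
LOCAL_PATHS = {"home": "/Users/bpeng1"}
REMOTE_PATHS = {"home": "/home/bpeng1/scratch"}


def translate(local_path, local_paths=LOCAL_PATHS, remote_paths=REMOTE_PATHS):
    """Map an absolute local path to its remote counterpart."""
    path = PurePosixPath(local_path)
    # Try the longest local root first so that nested roots take precedence.
    for name, root in sorted(local_paths.items(), key=lambda kv: -len(kv[1])):
        if name not in remote_paths:
            continue  # the named path must be defined on both hosts
        try:
            relative = path.relative_to(root)
        except ValueError:
            continue  # the path is not under this root
        return str(PurePosixPath(remote_paths[name]) / relative)
    raise ValueError(f"{local_path} is not under any path defined for both hosts")


print(translate("/Users/bpeng1/sos/sos-docs/src/user_guide/random_output.txt"))
```

A path outside every shared root, such as `/tmp/x` in this sketch, raises an error, which is the situation the paragraph above warns about.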

**Unless specified otherwise, the tasks will be executed under the remote version of the current working directory.** That is to say, the execution of tasks will leave files on remote hosts that will not be automatically removed, and in a worse scenario **might overwrite remote files without warning**. This is why we recommend that you set the remote `home` to a directory other than the true `home` (e.g. `/home/user_name/scratch`, or `/home/user_name/sos_temp`). In this way SoS will write to SoS-specified directories on remote hosts and will not contaminate your real `home` directory.

**Unless specified otherwise, input and dependent files will be copied to remote host before execution, and output files will be copied to local host after the completion of the task.** It is therefore important for you to plan ahead and avoid synchronization of large files that should stay on remote hosts.

## Working directory of tasks (Option `workdir`)

The `workdir` of a task defaults to the current working directory or, in the case of remote execution, the remote counterpart of the current working directory.

Option `workdir` controls the working directory of the task. For example, the following step downloads a file to the `resource` directory using [action `download`](download.html).

In [1]:
task: queue='localhost', workdir='resource'

download:
  ftp://speedtest.tele2.net/512KB.zip

85ea891331ab4bcb, 5057fa441d6e1755scratch_0user_guide, Ran for < 5 seconds, completed


In [2]:
!ls resource

512KB.zip


## Sending additional files before task execution (Option `to_host`)

Option `to_host` specifies additional files or directories that are synchronized to the remote host before tasks are executed. It can be specified as

* A single file or directory (with respect to the local file system), or
* A list of files or directories.

The files or directories will be translated using the host-specific path maps. Note that if a symbolic link is specified in `to_host`, both the symbolic link and the path it refers to would be synchronized to the remote host.
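The accepted forms and the symbolic-link rule can be sketched as follows. This is an illustration under stated assumptions, not SoS code: `expand_to_host` is a hypothetical helper, and wildcard expansion is assumed here only because the example below passes a pattern such as `task*.ipynb`.

```python
import glob
import os


def expand_to_host(spec):
    """Return the sorted local paths covered by a `to_host`-style value."""
    # Accept either a single path or a list of paths.
    patterns = [spec] if isinstance(spec, str) else list(spec)
    selected = set()
    for pattern in patterns:
        # glob.glob returns an empty list when nothing matches.
        for path in glob.glob(os.path.expanduser(pattern)):
            selected.add(path)
            if os.path.islink(path):
                # Send the link target along with the link itself.
                selected.add(os.path.realpath(path))
    return sorted(selected)


print(expand_to_host("task*.ipynb"))
print(expand_to_host(["data", "scripts/*.py"]))
```

Each resulting local path would then go through the host-specific path maps to find its remote destination.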

To demonstrate how to use this option, let us copy all notebooks in this directory whose names start with `task` to a remote host and count the number of lines in each of them.

In [3]:
%preview -n wc.txt 
output: 'wc.txt'
task: to_host='task*.ipynb', queue='bcb' 
sh: expand=True
  wc -l *.ipynb > {_output}

9e7b75df6a5d3767, 5b7627b1ac52aa8fscratch_0user_guide, Ran for < 5 seconds, completed


     363 task_files.ipynb
     386 task_management.ipynb
     817 task_statement.ipynb
     223 task_tags.ipynb
     390 task_template.ipynb

## Retrieving additional files after task completion (Option `from_host`)

Option `from_host` specifies additional files or directories that are synchronized from the remote host after tasks are executed. It can be specified as

* A single file or directory (with respect to the local file system), or
* A list of files or directories.

The files or directories will be translated using the host-specific path maps to determine what remote files to retrieve.

## Absolute paths and named paths

The use of relative paths is highly recommended because relative paths are not system dependent. Although `data/sample1.csv` can be under different paths on local and remote hosts, SoS handles the mapping of the current project directory, so `data/sample1.csv` represents the same file on both hosts.

Things get a lot more complicated when absolute paths are involved. In the following example, `_output` is specified with an absolute path. The task still works on a cluster system with home directory `/home/bpeng1` because SoS automatically translates input and output files and knows that the output should be `/home/bpeng1/scratch/sos/sos-docs/src/user_guide/random_output.txt` on the remote host. The output file is then correctly synchronized back to the local host.

In [4]:
%preview -n random_output.txt
output: '/Users/bpeng1/sos/sos-docs/src/user_guide/random_output.txt'
task: queue='htc', mem='4G'
import random
with open(_output, 'w') as out:
  out.write(f'Random number is {random.randint(0, 1000)}')

4b724175bd0657a0, 7cb2935fe9e6b29dscratch_0user_guide, Ran for < 5 seconds, completed


Random number is 762

However, if you execute the workflow directly on the remote host using option `-r`, it fails because `/Users` is not a writable directory on the remote host.

In [5]:
%env --expect-error

%run -r htc-headnode
output: '/Users/bpeng1/sos/sos-docs/src/user_guide/random_output.txt'

import random
with open(_output, 'w') as out:
  out.write(f'Random number is {random.randint(0, 1000)}')

INFO: Running default:
ERROR: [default]: [default]: Failed to process step output ('/Users/bpeng1/sos/sos-docs/src/user_guide/random_output.txt'): [Errno 13] Permission denied: '/Users'
ERROR: Failed to submit workflow sos run /home/bpeng1/sos/sos-docs/src/user_guide/.tmp_script_p9m5uv2f.sos: Command 'ssh -o "ControlMaster=auto" -o "ControlPath=/Users/bpeng1/.ssh/controlmasters/%r@%h:%p" -o "ControlPersist=10m" -q q1prphtch00.mdanderson.edu -p 22 "bash --login -c ' [ -d /home/bpeng1/sos/sos-docs/src/user_guide ] || mkdir -p /home/bpeng1/sos/sos-docs/src/user_guide; cd /home/bpeng1/sos/sos-docs/src/user_guide &&  sos run /home/bpeng1/sos/sos-docs/src/user_guide/.tmp_script_p9m5uv2f.sos'" ' returned non-zero exit status 1.


RuntimeError: Workflow exited with code 1

This problem can be solved by using host-specific paths. For example, if you are running the workflow on `htc-headnode`, you can change the output to use the correct path for that host.

In [6]:
%run -r htc-headnode
output: '/home/bpeng1/sos/sos-docs/src/user_guide/random_output.txt'

import random
with open(_output, 'w') as out:
  out.write(f'Random number is {random.randint(0, 1000)}')

INFO: Running default:
INFO: default is completed.
INFO: default output:   /home/bpeng1/sos/sos-docs/src/user_guide/random_output.txt
INFO: Workflow default (ID=1c5299a18b627a6e) is executed successfully with 1 completed step.


A better choice, which makes your workflow more portable, is to use [named paths](targets.ipynb). For example, if you use `#home` on `htc-headnode`, which has the correct named paths defined, the workflow executes successfully.

In [7]:
%run -r htc-headnode

output: '#home/sos/sos-docs/src/user_guide/random_output.txt'

import random
with open(_output, 'w') as out:
  out.write(f'Random number is {random.randint(0, 1000)}')

INFO: Running default:
INFO: default (index=0) is ignored due to saved signature
INFO: default output:   /home/bpeng1/sos/sos-docs/src/user_guide/random_output.txt
INFO: Workflow default (ID=95ff85f084c10b32) is ignored with 1 ignored step.
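To make the portability argument concrete, here is a minimal sketch of how a named path such as `#home/...` could resolve differently on each host. The helper `expand_named_path` and the two dictionaries are hypothetical and only illustrate the idea; SoS performs this expansion internally using the `paths` defined for each host.

```python
def expand_named_path(path, host_paths):
    """Expand a `#name/...` path using one host's `paths` definitions."""
    if not path.startswith("#"):
        return path  # ordinary path, nothing to expand
    # Split "#home/a/b" into the name "home" and the remainder "a/b".
    name, _, rest = path[1:].partition("/")
    if name not in host_paths:
        raise KeyError(f"named path {name!r} is not defined for this host")
    root = host_paths[name].rstrip("/")
    return f"{root}/{rest}" if rest else root


# Hypothetical `paths` definitions for the two hosts in this tutorial.
local_paths = {"home": "/Users/bpeng1"}
headnode_paths = {"home": "/home/bpeng1"}

target = "#home/sos/sos-docs/src/user_guide/random_output.txt"
print(expand_named_path(target, local_paths))
# -> /Users/bpeng1/sos/sos-docs/src/user_guide/random_output.txt
print(expand_named_path(target, headnode_paths))
# -> /home/bpeng1/sos/sos-docs/src/user_guide/random_output.txt
```

Because the same `output:` statement resolves to a valid location on whichever host runs it, the workflow does not need to be edited when it moves between hosts.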
