# Staging Data with RADICAL-Pilot

RADICAL-Pilot (RP) can move data from the client side to the agent side (on the HPC platform) and back, as well as within the agent space (i.e., between the sandboxes for the session, pilots and tasks). To ensure that a task finds the data it needs on the HPC platform where it runs, RP provides a mechanism to stage the task's input data before its execution. RP can also stage the output data generated by tasks back to the user or to any other location on the HPC platform where it will be collected.

Staging directives are not needed when the task's input and output data do not have to be transferred. For example, if the data resides on a shared file system on the HPC platform, is accessible from the compute nodes where tasks will be executed, and its paths are provided in the task description, then the data is already available to RP. The rest of the tutorial shows how the user can describe data staging and what that means for the running application.

## Staging directives

### Format

The staging directives are specified using a `dict` type in the following form:

```python
staging_directive = {'source': str, 'target': str, 'action': str, 'flags': int}
```

- **Source** - data files (or directory) that need to be staged (see the section [Locations](#Locations));
- **Target** - the target path for the staged data (see the section [Locations](#Locations));
- **Action** - defines how the provided data should be staged (see the section [Actions](#Actions));
- **Flags** - sets the options applied to the corresponding action (see the section [Flags](#Flags)).
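
As a concrete illustration of this form, the sketch below assembles a directive as a plain `dict` and checks that it carries exactly the four keys listed above. The constant values are stand-ins so that the snippet runs without RP installed (in a real application one would use the `radical.pilot` constants), and the `make_directive` helper is hypothetical, not part of RP's API.

```python
# Stand-in constants so the sketch runs without RP installed (assumption:
# in a real application use rp.TRANSFER, rp.CREATE_PARENTS, rp.RECURSIVE).
TRANSFER       = 'Transfer'
CREATE_PARENTS = 1
RECURSIVE      = 2

ALLOWED_KEYS = {'source', 'target', 'action', 'flags'}


def make_directive(source, target, action=TRANSFER, flags=0):
    """Hypothetical helper: assemble a staging directive as a plain dict."""
    if not source or not target:
        raise ValueError('both source and target are required')
    return {'source': source, 'target': target,
            'action': action, 'flags': flags}


directive = make_directive(source='client:///input_dir',
                           target='pilot:///input_dir',
                           flags=CREATE_PARENTS | RECURSIVE)

# the directive carries exactly the four documented keys
assert set(directive) == ALLOWED_KEYS
print(directive)
```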

### Locations

`Source` and `Target` locations can be given as strings or `radical.utils.Url` instances. Strings containing `://` are converted into URLs, while strings without `://` are considered absolute or relative paths and are thus interpreted in the context of the client's working directory (see the section [Simplified directive format](#Simplified-directive-format) for examples).

Special URL schemas are relative to certain locations:
* `client://` - client's working directory (`./`);
* `endpoint://` - root of the file system on the target platform;
* `resource://` - agent sandbox (`radical.pilot.sandbox/`) on the target platform;
* `session://` - session sandbox on the target platform (within the agent sandbox);
* `pilot://` - pilot sandbox on the target platform (within the session sandbox);
* `task://` - task sandbox on the target platform (within the pilot sandbox).

All locations are interpreted as directories, never as files. For the above schemas, we interpret `schema://` the same as `schema:///`, i.e., we treat the schema as a namespace, not as a location qualified by a hostname. The `hostname` element of the URL is expected to be empty, and the path is _always_ considered relative to the locations specified above (even though URLs usually don't have a notion of relative paths).

`endpoint://` is based on the `filesystem_endpoint` attribute of the platform config (see the tutorial [RADICAL-Pilot Configuration System](configuration.ipynb#User-defined-resource)) and points to the file system accessible via that URL. Note that the notion of `root` depends on the access protocol and the providing service implementation.

The initial description of sandboxes is provided in the tutorial [Getting Started](getting_started.ipynb#Generated-Output). The hierarchy of the sandboxes is as follows:

```shell
<default_remote_workdir>/radical.pilot.sandbox/<session_sandbox_ID>/<pilot_sandbox_ID>/<task_sandbox_ID>
```

where `default_remote_workdir` is an attribute of the platform config (see the tutorial [RADICAL-Pilot Configuration System](configuration.ipynb#User-defined-resource)). If it is not provided, the current directory (`$PWD`) is used. Sandboxes for session, pilot and task are named with their unique IDs (`uid`).

__Examples of the expanded locations__

```shell
# assumptions for the examples below
#   - client's working directory
#        /home/user
#   - agent's sandboxes hierarchy
#        /tmp/radical.pilot.sandbox/rp.session.0000/pilot.0000/task.0000

in : 'client:///tmp/input_data'
out: '/home/user/tmp/input_data'

in : 'task:///test.txt'
out: '/tmp/radical.pilot.sandbox/rp.session.0000/pilot.0000/task.0000/test.txt'
```
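
The expansion shown above can be sketched in a few lines of standard-library Python. This is only an illustration of the rules described in this section, not RP's implementation: `expand_location` and the `ROOTS` table are hypothetical names, and the directory layout is copied from the assumptions in the example rather than queried from RP.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed layout, taken from the example above (not queried from RP).
CLIENT_CWD = '/home/user'
SANDBOX    = '/tmp/radical.pilot.sandbox'
SESSION    = posixpath.join(SANDBOX, 'rp.session.0000')
PILOT      = posixpath.join(SESSION, 'pilot.0000')
TASK       = posixpath.join(PILOT,   'task.0000')

# root location of each special schema
ROOTS = {'client'  : CLIENT_CWD,
         'endpoint': '/',
         'resource': SANDBOX,
         'session' : SESSION,
         'pilot'   : PILOT,
         'task'    : TASK}


def expand_location(location):
    """Hypothetical helper: map a special-schema URL to an absolute path."""
    url = urlsplit(location)
    if url.scheme not in ROOTS:
        raise ValueError('unsupported schema: %r' % url.scheme)
    if url.netloc:
        # the schema is a namespace, so the hostname element must be empty
        raise ValueError('hostname must be empty: %r' % location)
    # the path is always taken relative to the schema's root location
    relative = url.path.lstrip('/')
    return posixpath.normpath(posixpath.join(ROOTS[url.scheme], relative))


print(expand_location('client:///tmp/input_data'))
print(expand_location('task:///test.txt'))
```

Because the hostname element has to be empty, a URL such as `pilot://host/file` is rejected in this sketch instead of being silently reinterpreted.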

### Actions

* `radical.pilot.TRANSFER` (__*default*__) - remote file transfer from `source` URL to `target` URL;
* `radical.pilot.COPY` - local file copy (i.e., not crossing host boundaries);
* `radical.pilot.MOVE` - local file move;
* `radical.pilot.LINK` - local file symlink.
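
To make the difference between the three local actions tangible, the sketch below performs the analogous operations with Python's standard library in a temporary directory. It illustrates the general copy, move and symlink semantics only; it is an assumption that this mirrors what a user observes in the sandboxes, and RP's own stager is not shown here.

```python
import os
import shutil
import tempfile

# Local analogues of COPY, LINK and MOVE (illustration only, not RP code).
with tempfile.TemporaryDirectory() as workdir:
    src    = os.path.join(workdir, 'input.txt')
    copied = os.path.join(workdir, 'copied.txt')
    linked = os.path.join(workdir, 'linked.txt')
    moved  = os.path.join(workdir, 'moved.txt')

    with open(src, 'w') as f:
        f.write('data')

    shutil.copy(src, copied)    # COPY: source stays, target is a separate file
    os.symlink(src, linked)     # LINK: target is a symlink to the source
    shutil.move(copied, moved)  # MOVE: the moved path disappears

    with open(moved) as f:
        moved_content = f.read()

    states = {
        'source kept after copy' : os.path.exists(src),
        'link is a symlink'      : os.path.islink(linked),
        'link points to source'  : os.path.realpath(linked) == os.path.realpath(src),
        'old path gone on move'  : not os.path.exists(copied),
        'moved file has the data': moved_content == 'data',
    }

for name, ok in states.items():
    print('%-24s: %s' % (name, ok))
```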

### Flags

RP sets flags automatically, but the user can also set them explicitly.

* `radical.pilot.CREATE_PARENTS` - create the directory hierarchy for targets on the fly;
* `radical.pilot.RECURSIVE` - if `source` is a directory, handle it recursively.
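
Flags are integers that are combined with the bitwise OR operator (`|`) and tested with bitwise AND (`&`). The sketch below shows that mechanism with a stand-in `IntFlag` class; the class name `Flag` and the numeric values are assumptions made for illustration and are not taken from RP.

```python
from enum import IntFlag


class Flag(IntFlag):
    """Stand-in for RP's flag constants (the values are assumptions)."""
    CREATE_PARENTS = 1
    RECURSIVE      = 2


# combine several options into one integer value
flags = Flag.CREATE_PARENTS | Flag.RECURSIVE

# test whether a particular option is part of the combination
has_recursive = bool(flags & Flag.RECURSIVE)
only_parents  = Flag.CREATE_PARENTS
has_recursive_when_unset = bool(only_parents & Flag.RECURSIVE)

print(has_recursive, has_recursive_when_unset)
```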

### Simplified directive format

RP allows a shorthand for staging between the client side and the pilot and task sandboxes: if a user provides only names (absolute or relative paths of files or directories), RP expands them into the corresponding directives.
* If a string directive is a single path, the expanded _source_ is that path within the `client://` location, and the expanded _target_ is the base name of that path within the `pilot://` or the `task://` location for [`radical.pilot.PilotDescription`](../apidoc.rst) or [`radical.pilot.TaskDescription`](../apidoc.rst) respectively.
* The directional characters `>` and `<` within a string directive define the direction of the staging between the corresponding paths:
   * Input staging: `source_path > target_name`, where `source_path` is a path within the `client://` location and `target_name` is a base name within the `pilot://` or the `task://` location for [`radical.pilot.PilotDescription`](../apidoc.rst) or [`radical.pilot.TaskDescription`](../apidoc.rst) respectively.
   * Output staging: `target_name < source_path` (applies to [`radical.pilot.TaskDescription`](../apidoc.rst) only), where `source_path` is a path within the `task://` location and `target_name` is a base name within the `client://` location.

__Examples of the staging directives being expanded__

[`radical.pilot.PilotDescription.input_staging`](../apidoc.rst)
```shell
in : [ '/tmp/input_data/' ]
out: [{'source' : 'client:///tmp/input_data',
       'target' : 'pilot:///input_data',
       'action' : radical.pilot.TRANSFER,
       'flags'  : radical.pilot.CREATE_PARENTS|radical.pilot.RECURSIVE}]
in : [ 'input.dat > staged.dat' ]
out: [{'source' : 'client:///input.dat',
       'target' : 'pilot:///staged.dat',
       'action' : radical.pilot.TRANSFER,
       'flags'  : radical.pilot.CREATE_PARENTS}]
```

[`radical.pilot.TaskDescription.input_staging`](../apidoc.rst)
```shell
in : [ '/tmp/task_input.txt' ]
out: [{'source' : 'client:///tmp/task_input.txt',
       'target' : 'task:///task_input.txt',
       'action' : radical.pilot.TRANSFER,
       'flags'  : radical.pilot.CREATE_PARENTS}]
```

[`radical.pilot.TaskDescription.output_staging`](../apidoc.rst)
```shell
in : [ 'collected.dat < output.txt' ]
out: [{'source' : 'task:///output.txt',
       'target' : 'client:///collected.dat',
       'action' : radical.pilot.TRANSFER,
       'flags'  : radical.pilot.CREATE_PARENTS}]
```
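
The expansion rules can be summarized in a short sketch that reproduces the examples above. It is a hypothetical re-implementation for illustration, not RP's code: the `expand` function, its `sandbox` and `is_dir` parameters, and the constant values are all assumptions. In particular, the caller states explicitly whether the source is a directory, whereas RP decides this on its own.

```python
import posixpath

# Stand-in constants (assumption: use the radical.pilot ones in practice).
TRANSFER, CREATE_PARENTS, RECURSIVE = 'Transfer', 1, 2


def expand(directive, sandbox='task', is_dir=False):
    """Hypothetical sketch of the string expansion rules.

    sandbox: 'pilot' or 'task', depending on the description type.
    is_dir : whether the source is a directory (decided by the caller here).
    """
    if '<' in directive:
        # output staging: target_name < source_path
        target, source = (part.strip() for part in directive.split('<', 1))
        src = '%s:///%s' % (sandbox, source.strip('/'))
        tgt = 'client:///%s' % posixpath.basename(target.rstrip('/'))
    else:
        # input staging: source_path > target_name, or a single path
        source, _, target = (part.strip()
                             for part in directive.partition('>'))
        target = target or source
        src = 'client:///%s' % source.strip('/')
        tgt = '%s:///%s' % (sandbox, posixpath.basename(target.rstrip('/')))

    flags = CREATE_PARENTS | (RECURSIVE if is_dir else 0)
    return {'source': src, 'target': tgt, 'action': TRANSFER, 'flags': flags}


print(expand('/tmp/input_data/', sandbox='pilot', is_dir=True))
print(expand('input.dat > staged.dat', sandbox='pilot'))
print(expand('collected.dat < output.txt'))
```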

## Examples

<div class="alert alert-info">
    
__Note:__ For the initial setup regarding MongoDB see the tutorial [Getting Started](tutorials/getting_started.ipynb).

</div>

In [None]:
%env RADICAL_PILOT_DBURL=mongodb://guest:guest@mongodb:27017/default

<div class="alert alert-info">

__Note:__ In our examples, we will not show a progress bar while waiting for some operation to complete, e.g., while waiting for a pilot to stop. That is because the progress bar offered by RP's reporter does not work within a notebook. You could use it when executing an RP application as a standalone Python script.

</div>

In [None]:
%env RADICAL_REPORT_ANIME=FALSE

In [None]:
import radical.pilot as rp
import radical.utils as ru

In [None]:
session = rp.Session()
pmgr    = rp.PilotManager(session=session)
tmgr    = rp.TaskManager(session=session)

In [None]:
!mkdir -p ./input_dir

In [None]:
# Staging directives for the pilot.

with open('./input_dir/input.txt', 'w') as f:
    f.write('Staged input (pilot_id=$RP_PILOT_ID | session_id=$RP_SESSION_ID)')

pd = rp.PilotDescription({
    'resource'      : 'local.localhost',
    'cores'         : 2,
    'runtime'       : 15,
    'input_staging' : ['input_dir']
})

# The staging directive above lists a single directory name.
# This will automatically be expanded to:
#
#    {'source' : 'client:///input_dir',
#     'target' : 'pilot:///input_dir',
#     'action' : rp.TRANSFER,
#     'flags'  : rp.CREATE_PARENTS|rp.RECURSIVE}

pilot = pmgr.submit_pilots(pd)
tmgr.add_pilots(pilot)

<div class="alert alert-info">
    
__Note:__ Staging of input data for a pilot can be described within [`radical.pilot.PilotDescription`](../apidoc.rst) or passed as input parameters to the [`radical.pilot.Pilot.stage_in()`](../apidoc.rst) method, but staging of output data can only be done with the [`radical.pilot.Pilot.stage_out()`](../apidoc.rst) method.

</div>

In [None]:
# Staging directives for the task.

td = rp.TaskDescription({
    'executable'    : 'eval',
    'arguments'     : ['echo "$(cat input.txt)"'],
    'stdout'        : 'output.txt',
    'input_staging' : [{'source': 'pilot:///input_dir/input.txt',
                        'target': 'task:///input.txt',
                        'action': rp.LINK}],
    'output_staging': [{'source': 'task:///output.txt',
                        'target': 'pilot:///output.txt',
                        'action': rp.COPY}]
})

tmgr.submit_tasks(td)
tmgr.wait_tasks()

In [None]:
pilot.stage_out({'source': 'pilot:///output.txt',
                 'target': 'client:///result.txt',
                 'action': rp.TRANSFER})

In [None]:
!cat result.txt

In [None]:
session.close(cleanup=True)