# Using RADICAL-Pilot on HPC Platforms

RADICAL-Pilot consists of a client and a so-called agent. Client and agent can execute on two different machines, e.g., the former on your workstation, the latter on an HPC platform's compute node. Alternatively, the client can execute on the login node of an HPC platform and the agent on a compute node, or both client and agent can execute on a compute node. 

How you deploy RADICAL-Pilot depends on the platform's access policies (ssh, DUO, hardware token) and on the amount and type of resources that can be used on a login node (usually very little). Further, each HPC platform requires a specific resource configuration file (provided with RADICAL-Pilot) and, in some cases, some user-dependent configuration.

## Access via SSH

If you can manually `ssh` into the target HPC platform, RADICAL-Pilot can do the same. You will have to set up an ssh key; if you are not familiar with how to do that, check out this [guide](https://www.ssh.com/academy/ssh-keys#how-to-configure-key-based-authentication). Don't forget to configure your `ssh-agent`: without it, RADICAL-Pilot would have to enter your ssh key passphrase to access the HPC platform, which it cannot do, so it will not work.

You can also use RADICAL-Pilot's API and store ssh-specific information within your application. Remote usernames, passwords, and keyfiles can be set in a [`radical.pilot.Context`](apidoc.html#radical.pilot.Context) object. For example, if you want to tell RADICAL-Pilot your user-id on the remote resource, you can use the following:

```python
import radical.pilot as rp

session = rp.Session()

# Describe the ssh credentials RADICAL-Pilot should use on the remote resource.
context = rp.Context('ssh')
context.user_id = "user1"
session.add_context(context)
```

## Pre-Configured Resources

The RADICAL-Pilot developer team maintains a growing set of resource configuration files. Several of the settings in those files can be overridden in the `radical.pilot.PilotDescription` object. For example, the snippet below replaces the default queue `standard` with the queue `large`.

```python
pdesc = rp.PilotDescription()
pdesc.queue = "large"
```

For a list of supported configurations, see [List of Supported Platforms](supported_platforms.rst). Resource configuration files are located at `radical/pilot/configs/` in the [RADICAL-Pilot](https://github.com/radical-cybertools/radical.pilot) git repository.

## Writing a Custom Resource Configuration File

If you want to use RADICAL-Pilot with a resource that is not in any of the provided resource configuration files, you can write your own and save it in `$HOME/.radical/pilot/configs/<your_resource_configuration_file_name>.json`.

<div class="alert alert-info">

__Note:__ The remote resource configuration file name must start with “resource_” and end with the `.json` suffix. Within each resource file, multiple resources can be listed. For example, the [resource_xsede.json](https://radicalpilot.readthedocs.io/en/stable/_downloads/6f233ab23e448ee0a3071b6b39b907df/resource_xsede.json) file contains many different HPC resources from XSEDE.

</div>
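
The naming rule above can be sketched with the Python standard library alone. `write_resource_config` is a hypothetical helper introduced for illustration and is not part of the RADICAL-Pilot API; it assumes the user-level configuration directory named in this section and simply serializes a dictionary of resources to `resource_<name>.json`.

```python
import json
from pathlib import Path


def write_resource_config(name, resources, config_dir=None):
    """Write `resources` to resource_<name>.json and return the file path.

    Hypothetical helper for illustration, not a RADICAL-Pilot API.
    `resources` maps resource names (e.g., "supermuc") to their settings.
    """
    if config_dir is None:
        # User-level location mentioned in this section.
        config_dir = Path.home() / ".radical" / "pilot" / "configs"
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)

    # The file name must start with "resource_" and end with ".json".
    path = config_dir / f"resource_{name}.json"
    with path.open("w") as fout:
        json.dump(resources, fout, indent=4)
    return path
```

Passing an explicit `config_dir` (for example, a temporary directory) lets you inspect the generated file before placing it under `$HOME/.radical/pilot/configs/`.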

<div class="alert alert-warning">

__Warning:__ Be advised that you may need specific knowledge about the target resource to do so. Also, while RADICAL-Pilot can handle very different types of systems and batch systems, it may run into trouble on specific configurations or software versions we did not encounter before. If you run into trouble using a resource not in our list of officially supported ones, please open [an issue](https://github.com/radical-cybertools/radical.pilot/issues).

</div>

### Example of a Customized Configuration File

A configuration file has to be valid JSON. For example, the resource configuration file named `resource_lrz.json` has the following keys and values:

```{json}
{
    "supermuc":
    {
        "description"                 : "The SuperMUC petascale HPC cluster at LRZ.",
        "notes"                       : "Access only from registered IP addresses.",
        "schemas"                     : ["gsissh", "ssh"],
        "ssh"                         :
        {
            "job_manager_endpoint"    : "loadl+ssh://supermuc.lrz.de/",
            "filesystem_endpoint"     : "sftp://supermuc.lrz.de/"
        },
        "gsissh"                      :
        {
            "job_manager_endpoint"    : "loadl+gsissh://supermuc.lrz.de:2222/",
            "filesystem_endpoint"     : "gsisftp://supermuc.lrz.de:2222/"
        },
        "default_queue"               : "test",
        "resource_manager"            : "SLURM",
        "task_launch_method"          : "SSH",
        "mpi_launch_method"           : "MPIEXEC",
        "forward_tunnel_endpoint"     : "login03",
        "virtenv"                     : "/home/hpc/pr87be/di29sut/pilotve",
        "python_dist"                 : "default",
        "pre_bootstrap_0"             : ["source /etc/profile",
                                         "source /etc/profile.d/modules.sh",
                                         "module unload mpi.ibm", "module load mpi.intel",
                                         "source /home/hpc/pr87be/di29sut/pilotve/bin/activate"
                                        ],
        "valid_roots"                 : ["/home", "/gpfs/work", "/gpfs/scratch"],
        "agent_scheduler"             : "CONTINUOUS",
        "agent_spawner"               : "POPEN"
    },
    "ANOTHER_KEY_NAME":
    {
        ...
    }
}
```

The name of your file (here `resource_lrz.json`) together with the name of the resource (`supermuc`) form the resource key which is used in the `PilotDescription` resource attribute (`lrz.supermuc`).
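
As a rough illustration of how that key is composed, the sketch below derives the resource keys from a file name and the resource names it contains. `resource_labels` is a hypothetical helper written for this example; it only mirrors the naming convention described here and is not how RADICAL-Pilot itself resolves configurations.

```python
from pathlib import Path


def resource_labels(config_file, resources):
    """Return the '<site>.<resource>' keys implied by a configuration file.

    Hypothetical helper for illustration, not a RADICAL-Pilot API.
    """
    stem = Path(config_file).stem            # e.g., "resource_lrz"
    prefix = "resource_"
    if not stem.startswith(prefix):
        raise ValueError(f"file name must start with '{prefix}': {config_file}")

    site = stem[len(prefix):]                # e.g., "lrz"
    return [f"{site}.{name}" for name in resources]


labels = resource_labels("resource_lrz.json", {"supermuc": {}})
print(labels)  # ['lrz.supermuc']
```

The resulting string is the value you would assign to the resource attribute of a `PilotDescription`.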

All fields are **mandatory**, unless indicated otherwise below.

* `description`: a human readable description of the resource.
* `notes`: information needed to form valid pilot descriptions, such as which parameters are required, etc.
* `schemas`: allowed values for the `access_schema` parameter of the pilot description. The first schema in the list is used by default. For each schema, a subsection is needed which specifies `job_manager_endpoint` and `filesystem_endpoint`.
* `job_manager_endpoint`: access url for pilot submission (interpreted by SAGA).
* `filesystem_endpoint`: access url for file staging (interpreted by SAGA).
* `default_queue`: queue to use for pilot submission (optional).
* `resource_manager`: type of job management system. Valid values are: LOADL, LSF, PBSPRO, SGE, SLURM, TORQUE, FORK.
* `task_launch_method`: type of compute node access, required for non-MPI tasks. Valid values are: SSH, APRUN or LOCAL.
* `mpi_launch_method`: type of MPI support, required for MPI tasks. Valid values are: MPIRUN, MPIEXEC, APRUN, IBRUN, etc.
* `python_interpreter`: path to python (optional).
* `python_dist`: anaconda or default, i.e., not anaconda (mandatory).
* `pre_bootstrap_0`: list of commands to execute for initialization of main agent (optional).
* `pre_bootstrap_1`: list of commands to execute for initialization of sub-agent (optional).
* `valid_roots`: list of shared file system roots (optional). Note: pilot sandboxes must lie under these roots.
* `forward_tunnel_endpoint`: name of the host which can be used to create ssh tunnels from the compute nodes to the outside world (optional).
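
Before using a custom file, it can help to check one resource entry against the list above. The following is a minimal sketch under the assumption that the mandatory fields are exactly those not marked optional in this list; `missing_fields`, `MANDATORY_KEYS`, and `ENDPOINT_KEYS` are hypothetical names introduced for this example and do not exist in RADICAL-Pilot. Keys not covered by the list are ignored.

```python
# Fields described above as mandatory (assumption: everything not marked optional).
MANDATORY_KEYS = (
    "description", "notes", "schemas", "resource_manager",
    "task_launch_method", "mpi_launch_method", "python_dist",
)
# Each schema listed in "schemas" needs a subsection with these two endpoints.
ENDPOINT_KEYS = ("job_manager_endpoint", "filesystem_endpoint")


def missing_fields(resource):
    """Return the names of mandatory fields absent from one resource entry."""
    missing = [key for key in MANDATORY_KEYS if key not in resource]

    for schema in resource.get("schemas", []):
        section = resource.get(schema)
        if not isinstance(section, dict):
            # The whole per-schema subsection is missing.
            missing.append(schema)
            continue
        missing.extend(
            f"{schema}.{key}" for key in ENDPOINT_KEYS if key not in section
        )
    return missing


entry = {
    "description": "Example cluster.",
    "schemas": ["ssh"],
    "ssh": {"job_manager_endpoint": "slurm+ssh://cluster.example.org/"},
    "resource_manager": "SLURM",
    "task_launch_method": "SSH",
    "mpi_launch_method": "MPIEXEC",
    "python_dist": "default",
}
print(missing_fields(entry))  # ['notes', 'ssh.filesystem_endpoint']
```

The host name `cluster.example.org` is a placeholder; an empty list means the entry contains every field this sketch treats as mandatory.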