# Using HPC Platforms

RADICAL-Pilot (RP) has two main components: **Client** and **Agent**. The Client component initiates and handles the process managers (`rp.Session`, `rp.PilotManager`, `rp.TaskManager`) and the pilot and task descriptions (`rp.PilotDescription`, `rp.TaskDescription`). The Agent component executes tasks within the pilot once the requested resources are allocated. Client and Agent can run either on the same machine or on different ones.

<div class="alert alert-info">

__Note:__ Whether Client and Agent can run on different machines (e.g., the Client on a user workstation and the Agent on the target HPC platform) depends on the access policy of the target platform, where the computing tasks will be executed. We advise you to check the platform user guides and [supported HPC platforms](../supported.rst) for more details.

</div>

RP provides two ways to use [supported HPC platforms](../supported.rst) to execute workloads:

* Launching an RP application **from the target platform** (_local access_)
   * Run the Client component on platform login nodes or within a batch job (using either an interactive session or a batch script) on compute nodes. The Agent component will run on the batch node (i.e., the launcher node within the job allocation, also called the MOM node, which can be a regular compute node if the platform doesn't provide a dedicated node type).
* Launching an RP application **outside the target platform** (_remote access_)
   * Run the Client component on a machine that is not associated with the target platform. The Client will submit the job remotely, and the Agent will run within the job allocation in the same way as in the previous mode.

## Launching from a batch job

We recommend launching RP applications from a batch job, since in this mode no RP processes run on login nodes. Some HPC platforms limit the number of processes running on login nodes and the amount of resources those processes use (see [Examples of login nodes policies](#Examples-of-login-nodes-policies)), and system daemons might terminate any user process that violates the corresponding rules and policies. _If that is not the case for your chosen platform_, feel free to [launch the RP application from a login node](#Launching-from-a-login-node). The downside of launching from a batch job is that the user has to do one of the following _manually_: either start an interactive session (i.e., an interactive job) or submit a batch script that calls the RP application.

<div class="alert alert-info">

__Note:__ The command to acquire an interactive job and the script language to write a batch job depend on the batch system deployed on the HPC platform and on its configuration. You may therefore have to use different commands or scripts depending on the HPC platform that you want to use. See the guide for each [supported HPC platform](../supported.rst) for more details.
    
</div>

<div class="alert alert-info">

__Note:__ Make sure that the amount of resources specified within [`rp.PilotDescription`](../apidoc.rst#radical.pilot.PilotDescription) (`pd.nodes` and `pd.runtime`) of your RP application corresponds to the amount of resources requested for a batch job.
    
</div>
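
The note above can be illustrated with a small helper that derives the pilot size from the values given to the scheduler. This is only a sketch: `walltime_to_minutes` and `pilot_size_from_job` are hypothetical names, the helper handles only the `H:MM:SS` walltime form used in the examples on this page, and it assumes that `runtime` is expressed in minutes (as the 30-minute examples on this page suggest).

```python
def walltime_to_minutes(walltime: str) -> int:
    """Convert an 'H:MM:SS' walltime string into whole minutes (rounded down)."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return (hours * 3600 + minutes * 60 + seconds) // 60


def pilot_size_from_job(nodes: int, walltime: str) -> dict:
    """Build the `nodes` and `runtime` entries for a pilot description.

    Hypothetical helper: the returned dict is meant to be merged into the
    dict that is passed to `rp.PilotDescription`.
    """
    return {"nodes": nodes, "runtime": walltime_to_minutes(walltime)}


# Values taken from a request such as `salloc -N 1 -t 00:30:00`
single_node_job = pilot_size_from_job(nodes=1, walltime="00:30:00")

# Values taken from `#PBS -l select=4...` and `#PBS -l walltime=0:30:00`
four_node_job = pilot_size_from_job(nodes=4, walltime="0:30:00")

print(single_node_job)
print(four_node_job)
```

Rounding down keeps the pilot runtime within the walltime of the enclosing batch job; whether you want an additional safety margin depends on your platform.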

### Examples of interactive jobs

As with any job, an interactive job is queued until the specified number of nodes is available. Once the job starts as an interactive session, activate the virtual environment that has the RP package installed and launch the RP application with `python rp_application.py`.

* **SLURM Scheduler**. Initiate an [interactive job](https://slurm.schedmd.com/faq.html#prompt) with [`salloc`](https://slurm.schedmd.com/salloc.html). **It is recommended to use `salloc` over `srun --pty $SHELL`, since `srun` has certain limitations regarding interactive jobs.**
   * [OLCF/ORNL Frontier](https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#interactive-jobs)
     ```shell
     salloc -A PROJECT_NAME -p PARTITION_NAME -J JOB_NAME \
            -N 1 -t 00:30:00
     ```
   * [NCSA Delta](https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/running_jobs.html#interactive-jobs)
     ```shell
     salloc --account=PROJECT_NAME --partition=gpuA40x4-interactive,gpuA100x4-interactive \
            --nodes=1 --cpus-per-task=2 --gpus-per-node=1 --mem=16g --time=00:30:00
     ```
* **PBSPro Scheduler**. The `qsub` command is used to request an interactive job.
   * [ALCF/ANL Polaris](https://docs.alcf.anl.gov/polaris/running-jobs/)
     ```shell
     qsub -I  -A PROJECT_NAME -q PARTITION_NAME \
              -l select=1 -l filesystems=home:eagle -l walltime=00:30:00
     ```

### Examples of batch scripts

Batch jobs are submitted through a _batch script_ using the corresponding job submission command: for example, `sbatch` for SLURM and `qsub` for PBSPro. The batch script specifies your resource requirements, the application run time, and the RP application that you want to execute.

* **SLURM Scheduler**. Job submission: `sbatch jobscript.slurm`, where `jobscript.slurm` looks as follows:
   * [OLCF/ORNL Frontier](https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#batch-scripts)
     ```shell
     #!/bin/bash
     #SBATCH -A PROJECT_NAME
     #SBATCH -p PARTITION_NAME
     #SBATCH -N 1
     #SBATCH -t 00:30:00
     #SBATCH -J JOB_NAME
     #SBATCH -o %x-%j.out
     source ~/ve_rp/bin/activate
     python rp_application.py
     ```
* **PBSPro Scheduler**. Job submission: `qsub jobscript.pbs`, where `jobscript.pbs` looks as follows:
   * [ALCF/ANL Polaris](https://docs.alcf.anl.gov/polaris/running-jobs/)
     ```shell
     #!/bin/bash -l
     #PBS -A PROJECT_NAME
     #PBS -q PARTITION_NAME
     #PBS -l select=4:ncpus=256
     #PBS -l walltime=0:30:00
     #PBS -N JOB_NAME
     source ~/ve_rp/bin/activate
     python rp_application.py
     ```
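
Because the same `rp_application.py` can be started from a login node or from inside a job allocation, it can be useful to check at startup where it is running. The following is a minimal sketch, not part of RP: `detect_batch_job` is a hypothetical helper, and it assumes that the scheduler exports `SLURM_JOB_ID` (SLURM) or `PBS_JOBID` (PBSPro) inside the job, which may differ on platforms with a customized setup.

```python
import os

# Environment variables that the schedulers are assumed to set inside a job.
JOB_ID_VARIABLES = {"SLURM": "SLURM_JOB_ID", "PBSPro": "PBS_JOBID"}


def detect_batch_job(environ=None):
    """Return (scheduler, job_id) if a job ID variable is set, else None."""
    environ = os.environ if environ is None else environ
    for scheduler, variable in JOB_ID_VARIABLES.items():
        job_id = environ.get(variable)
        if job_id:
            return scheduler, job_id
    return None


job = detect_batch_job()
if job is None:
    print("No batch job detected: the Client would run on this node directly.")
else:
    print(f"Running inside {job[0]} job {job[1]}")
```

A check like this can be placed at the top of the application to warn the user before Client processes are started on a login node by accident.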

## Launching from a login node

<div class="alert alert-warning">

__Warning:__ Launching applications from login nodes might be restricted by platform rules and policies. Please check platform user guides regarding such restrictions (see [Examples of login nodes policies](#Examples-of-login-nodes-policies)).
    
</div>

To run your RP application on the login node of a supported HPC platform, `ssh` into the login node, activate the Python virtual environment (see [Getting Started](../getting_started.ipynb)), and launch your RP application. **RP will start Client-related processes on the login node and keep them running until the RP application finishes.** RP will submit a job on the user's behalf and will start executing tasks once the corresponding batch job starts (i.e., the pilot state `rp.PMGR_ACTIVE` indicates that the job has started).

```shell
ssh username@target_platform
# assuming that the virtual environment is
# already prepared with the RP package in it
source ~/ve_rp/bin/activate
python rp_application.py
```

```python
import radical.pilot as rp

# within `rp_application.py` for ALCF/ANL Polaris
pd = rp.PilotDescription({'resource' : 'anl.polaris',     # target platform
                          'project'  : 'PROJECT_NAME',    # project (allocation) name
                          'queue'    : 'PARTITION_NAME',  # batch queue
                          'runtime'  : 30})               # pilot runtime in minutes
```
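
To make the role of the `rp.PMGR_ACTIVE` state more concrete, the sketch below submits a pilot and blocks until the batch job has started (or the pilot has reached a final state). It is an illustration under assumptions rather than a complete application: `run_until_pilot_is_active` is a hypothetical helper, no tasks are submitted, and RP is imported inside the function so that the snippet can be loaded on a machine where RP is not installed.

```python
def run_until_pilot_is_active(pilot_description: dict) -> str:
    """Submit one pilot and return its state once it is active or final."""
    import radical.pilot as rp  # requires the RP package in the active environment

    session = rp.Session()
    try:
        pmgr = rp.PilotManager(session=session)
        pilot = pmgr.submit_pilots(rp.PilotDescription(pilot_description))

        # Block while the batch job is queued; also return if the pilot
        # fails or is canceled, so the call does not wait forever.
        pilot.wait(state=[rp.PMGR_ACTIVE] + rp.FINAL)
        return pilot.state
    finally:
        # Always close the session to shut down the Client processes.
        session.close()
```

Called with a dict such as the one shown above for Polaris, the function returns `rp.PMGR_ACTIVE` once the job has started. Note that it then closes the session right away, which a real application would do only after its tasks have completed.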

### Examples of login nodes policies

* [OLCF/ORNL Frontier](https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#login-vs-compute-nodes):
  > When you connect to the system, you are placed on a login node. Login nodes are used for tasks such as code editing, compiling, etc. They are shared among all users of the system, so it is not appropriate to run tasks that are long/computationally intensive on login nodes. Users should also limit the number of simultaneous tasks on login nodes (e.g. concurrent tar commands, parallel make).
  > Compute-intensive, memory-intensive, or other disruptive processes running on login nodes may be killed without warning.
* [TACC Frontera Conduct](https://docs.tacc.utexas.edu/basics/conduct/#conduct-loginnodes):
  > Each HPC resource's login nodes are shared amongst all users. Depending on the resource, dozens of users may be logged on at one time accessing the shared file systems. A single user running computationally expensive or disk intensive task/s will negatively impact performance for other users. Running jobs on the login nodes is one of the fastest routes to account suspension. Instead, run on the compute nodes via an interactive session (idev) or by submitting a batch job.

## Launching remotely

<div class="alert alert-warning">

__Warning:__ Remote submission **does not work with two-factor authentication**. Target HPC platforms need to support passphrase-protected SSH keys as a login method without a second authentication factor. Usually, the user needs to reach an agreement with the system administrators of the platform to allow `ssh` connections from a specific IP address.

</div>

<div class="alert alert-warning">

__Warning:__ Remote submissions **require an `ssh` connection to stay alive for the entire duration of the application run**. If the `ssh` connection fails while the application runs, the application will fail. This can leave an orphaned RP Agent running on the HPC platform, consuming allocation and failing to properly execute any new application task. Remote submissions should not be attempted from a laptop with a Wi-Fi connection, and the risk of interrupting the `ssh` connection increases with the time the application takes to complete.

</div>

If you can manually `ssh` into the target HPC platform, RP can do the same. You will have to set up an SSH key; if you need to become more familiar with that, follow, for example, this [guide](https://www.ssh.com/academy/ssh-keys#how-to-configure-key-based-authentication). RP will not work unless the `ssh-agent` is configured, and the user's SSH key passphrase has to be entered to access the HPC platform.
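
Since remote submission relies on a configured `ssh-agent`, a quick pre-flight check on the workstation can save a failed run. The sketch below is not part of RP: `ssh_agent_status` is a hypothetical helper that looks for the `SSH_AUTH_SOCK` variable and then interprets the exit code of `ssh-add -l` (0: keys are loaded, 1: the agent holds no keys, 2: the agent cannot be contacted). It assumes an OpenSSH client is installed.

```python
import os
import subprocess


def ssh_agent_status(environ=None) -> str:
    """Return 'ready', 'no-keys' or 'no-agent' for the current ssh-agent."""
    environ = os.environ if environ is None else environ

    # Without an agent socket there is nothing to query.
    if not environ.get("SSH_AUTH_SOCK"):
        return "no-agent"

    try:
        result = subprocess.run(
            ["ssh-add", "-l"],
            env=dict(environ),
            capture_output=True,
            text=True,
            check=False,
            timeout=10,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # ssh-add is not installed or the agent does not answer.
        return "no-agent"

    if result.returncode == 0:
        return "ready"
    if result.returncode == 1:
        return "no-keys"
    return "no-agent"


print(f"ssh-agent status: {ssh_agent_status()}")
```

If the status is `no-keys`, adding the key with `ssh-add` (and entering its passphrase) before starting the RP application is the usual remedy.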

After setting up and configuring `ssh`, you can instruct RP to run its Client on your local workstation and its Agent on one or more HPC platforms. With the remote submission mode, you need to set a particular `access_schema`, which will point to corresponding endpoints for the job submission and the filesystem access. All other parameters stay the same as for launching from a login node.

```python
import radical.pilot as rp

pd = rp.PilotDescription({'resource'     : 'tacc.frontera',
                          'access_schema': 'ssh',
                          'project'      : 'PROJECT_NAME',
                          'queue'        : 'PARTITION_NAME',
                          'runtime'      : 6000})

# where `tacc.frontera` configuration has the following endpoints set:
#
#    "schemas"                     : {
#        "ssh"                     : {
#            "job_manager_endpoint": "slurm+ssh://frontera.tacc.utexas.edu/",
#            "filesystem_endpoint" : "sftp://frontera.tacc.utexas.edu/"
#        }
#    }
```
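
The way an `access_schema` points to its endpoints can be sketched with plain Python. The dict below only mirrors the configuration fragment quoted above (the actual RP resource configuration may contain more keys and a different layout), and `endpoints_for` is a hypothetical helper, not an RP function.

```python
from urllib.parse import urlsplit

# Mirrors the `tacc.frontera` fragment shown above (illustration only).
resource_config = {
    "schemas": {
        "ssh": {
            "job_manager_endpoint": "slurm+ssh://frontera.tacc.utexas.edu/",
            "filesystem_endpoint": "sftp://frontera.tacc.utexas.edu/",
        }
    }
}


def endpoints_for(config: dict, access_schema: str) -> dict:
    """Return the parsed endpoint URLs configured for an access schema."""
    try:
        schema = config["schemas"][access_schema]
    except KeyError:
        raise ValueError(f"access schema '{access_schema}' is not configured")
    return {name: urlsplit(url) for name, url in schema.items()}


endpoints = endpoints_for(resource_config, "ssh")
for name, url in endpoints.items():
    # The URL scheme names the access method, the host is the remote machine.
    print(f"{name}: scheme={url.scheme} host={url.hostname}")
```

Here both endpoints resolve to the same login host, which is the machine your workstation must be able to reach over `ssh` for the remote submission to work.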
