# RADICAL-Pilot Configuration System

RADICAL-Pilot (RP) uses a configuration system to set control and management parameters for the initialization of its components and to define resource entry points.

It includes:

- **[Resource description](#Resource)**
  - Batch system (e.g., `SLURM`, `LSF`, etc.);
  - Provided launch methods (e.g., `SRUN`, `MPIRUN`, etc.);
  - Environment setup (including package manager, working directory, etc.);
  - Entry points: batch system URL, file system URL.
- **[Run description](#Run-description)**
  - Project allocation name (i.e., account/project, _specific for HPC platforms_);
  - Job queue name (i.e., queue/partition, _specific for HPC platforms_);
  - Amount of resources for the run (e.g., `cores`, `gpus`, `memory`) and the period of time they are needed for (`runtime`);
  - Mode to access the resource (e.g., `local`, `ssh`, `batch/interactive`).

All this information is provided to the pilot of an RP application via the `radical.pilot.PilotDescription` object. The following examples use a corresponding instance `pd`.

## Resource

Users have to describe at least one pilot in each RADICAL-Pilot application. That is done by instantiating a [`radical.pilot.PilotDescription`](../apidocs.rst#PilotDescription) object. Among that object's attributes, `resource` is mandatory and requires an HPC platform ID as its value. Users need to know what ID corresponds to the HPC platform on which they want to execute their RP application. 

### Predefined resources

The RADICAL-Pilot development team maintains a growing set of configuration files, each containing a number of IDs of supported HPC platforms. Users can find these pre-defined IDs in RP's [GitHub repository](https://github.com/radical-cybertools/radical.pilot/tree/master/src/radical/pilot/configs).

Each configuration file identifies a facility (e.g., ACCESS, TACC, ORNL, ANL, etc.), is written in JSON and is named following the `resource_<facility_name>.json` convention. Each facility configuration file contains a set of platform IDs. 

For example, if users want to execute their RP application on Frontera, they will have to search for the [`resource_tacc.json`](https://github.com/radical-cybertools/radical.pilot/blob/master/src/radical/pilot/configs/resource_tacc.json) file and, inside that file, for the key(s) that start with the name `frontera`. The file `resource_tacc.json` contains the keys `frontera`, `frontera_rtx`, and `frontera_prte`. Each key identifies a specific set of configuration parameters: `frontera` offers a general-purpose set of configuration parameters; `frontera_rtx` enables the use of the `rtx` queue; and `frontera_prte` the use of the PRTE launch method to execute the application's tasks. 

Once users decide what configuration to use, they will compose the value to use for the `resource` attribute of their `radical.pilot.PilotDescription` object. That value is named following the `<facility_name>.<platform_name>` convention. Thus, for Frontera, the value will be `tacc.frontera`, `tacc.frontera_rtx` or `tacc.frontera_prte`.  
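
The naming convention can be sketched in a few lines of Python. The JSON string below is a trimmed, hypothetical stand-in for `resource_tacc.json` (only the three keys named above, with empty bodies), and `resource_labels` is a helper made up for this illustration; it is not part of RP.

```python
import json
import os

# Hypothetical, trimmed stand-in for the content of `resource_tacc.json`;
# the real file holds a full configuration under each key.
FILE_NAME = 'resource_tacc.json'
FILE_BODY = '{"frontera": {}, "frontera_rtx": {}, "frontera_prte": {}}'


def resource_labels(file_name, file_body, prefix=''):
    """Compose `<facility_name>.<platform_name>` labels from one config file."""
    # 'resource_tacc.json' -> 'resource_tacc' -> 'tacc'
    stem = os.path.splitext(os.path.basename(file_name))[0]
    facility = stem[len('resource_'):]
    platforms = json.loads(file_body)
    return ['%s.%s' % (facility, name)
            for name in platforms if name.startswith(prefix)]


labels = resource_labels(FILE_NAME, FILE_BODY, prefix='frontera')
print(labels)
```

Any of the resulting strings is a candidate value for the `resource` attribute.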

### Customizing a predefined resource

Users can customize existing platform configuration files by overwriting existing key/value pairs. There are two ways to do that:

1. Redefining a key/value pair when coding the `radical.pilot.PilotDescription` object;
2. Creating a configuration file in their local home directory.

Redefining a key/value pair is easy, but it needs to be done for every pilot description that uses that ID and that specific set of configuration parameters. For example, looking at [`tacc.frontera`](https://github.com/radical-cybertools/radical.pilot/blob/master/src/radical/pilot/configs/resource_tacc.json#L4), the configured queue is `normal`, but a user could overwrite that parameter by doing the following:

In [None]:
import radical.pilot as rp

pd_init = {'resource'     : 'tacc.frontera',
           'runtime'      : 30,
           'queue'        : 'development',
           'nodes'        : 1
          }
pdesc = rp.PilotDescription(pd_init)

Alternatively, users can write a file in `$HOME/.radical/pilot/configs/` with the same name as the one that contains the ID for which they want to override some configuration parameters. For the Frontera example, the file would be called `resource_tacc.json` and would contain the following:

```
{
    "frontera": {
        "default_queue": "development"
    }
}
```

With that file, every pilot description using `'resource': 'tacc.frontera'` would use the queue `development` by default.
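
The effect of such an override file can be pictured as a recursive dictionary merge. The sketch below is only an illustration under that assumption: `merge_config` is a hypothetical helper, RP's own merge logic is not shown here, and the `predefined` excerpt contains a placeholder value next to the `normal` queue mentioned above.

```python
import copy


def merge_config(base, override):
    """Return a copy of `base` with `override` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # both sides are sections: merge them key by key
            merged[key] = merge_config(merged[key], value)
        else:
            # plain value (or new key): the override wins
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical excerpt of the predefined entry; only `default_queue: normal`
# is taken from the text above, the other value is a placeholder.
predefined = {'frontera': {'default_queue'   : 'normal',
                           'resource_manager': 'SLURM'}}
user_file  = {'frontera': {'default_queue'   : 'development'}}

effective = merge_config(predefined, user_file)
print(effective['frontera'])
```

Keys that the user file does not mention keep their predefined values, which is why the override file can stay as short as the one above.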
 
<div class="alert alert-info">

__Note:__ To change the location of user-defined resource descriptions, set the environment variable `RADICAL_CONFIG_USER_DIR`; its value is used instead of `HOME` in the location path above. Make sure that the corresponding path exists before creating configuration files there.

</div>

In [None]:
!mkdir -p "${RADICAL_CONFIG_USER_DIR:-$HOME}/.radical/pilot/configs/"
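
The same path resolution can be written in Python, for instance when the directory is needed later in a script. This is a minimal sketch that mirrors the shell expansion above; `user_config_dir` is a hypothetical helper, not an RP function, and the example paths are made up.

```python
import os


def user_config_dir(env=None):
    """Resolve the directory for user-defined resource configurations."""
    env = os.environ if env is None else env
    # RADICAL_CONFIG_USER_DIR, when set and non-empty, replaces HOME
    base = env.get('RADICAL_CONFIG_USER_DIR') or env.get('HOME', '')
    return os.path.join(base, '.radical', 'pilot', 'configs')


# Made-up environments, passed explicitly so that the real one is untouched
print(user_config_dir({'HOME': '/home/alice'}))
print(user_config_dir({'HOME': '/home/alice',
                       'RADICAL_CONFIG_USER_DIR': '/scratch/alice'}))
```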

### User-defined resource

Users can write entirely new platform configurations. For example, let us create a custom platform configuration entry, `frontera_tutorial`, in a local `resource_tacc.json` file. That file will be loaded into RP's `radical.pilot.Session` object alongside the other TACC-related platforms.

In [None]:
resource_tacc_tutorial = \
{
    "frontera_tutorial":
    {
        "description"                 : "Short description of the resource",
        "notes"                       : "Notes about resource usage",

        "schemas"                     : ["local", "ssh", "batch", "interactive"],
        "local"                       :
        {
            "job_manager_endpoint"    : "slurm://frontera.tacc.utexas.edu/",
            "filesystem_endpoint"     : "file://frontera.tacc.utexas.edu/"
        },
        "ssh"                         :
        {
            "job_manager_endpoint"    : "slurm+ssh://frontera.tacc.utexas.edu/",
            "filesystem_endpoint"     : "sftp://frontera.tacc.utexas.edu/"
        },
        "batch"                       : "interactive",
        "interactive"                 :
        {
            "job_manager_endpoint"    : "fork://localhost/",
            "filesystem_endpoint"     : "file://localhost/"
        },

        "default_queue"               : "production",
        "resource_manager"            : "SLURM",

        "cores_per_node"              : 56,
        "gpus_per_node"               : 0,
        "system_architecture"         : {
                                         "smt"           : 1,
                                         "options"       : ["nvme", "intel"],
                                         "blocked_cores" : [],
                                         "blocked_gpus"  : []
                                        },

        "agent_config"                : "default",
        "agent_scheduler"             : "CONTINUOUS",
        "agent_spawner"               : "POPEN",
        "default_remote_workdir"      : "$HOME",

        "pre_bootstrap_0"             : [
                                        "module unload intel impi",
                                        "module load   intel impi",
                                        "module load   python3/3.9.2"
                                        ],
        "launch_methods"              : {
                                         "order"  : ["MPIRUN"],
                                         "MPIRUN" : {
                                             "pre_exec_cached": [
                                                 "module load TACC"
                                             ]
                                         }
                                        },
        
        "python_dist"                 : "default",
        "virtenv_mode"                : "local"
    }
}

The definition of each field:

* `description` (__*optional*__) - human readable description of the resource.
* `notes` (__*optional*__) - information needed to form valid pilot descriptions, such as what parameters are required, etc.
* `schemas` - allowed values for the `pd.access_schema` attribute of the pilot description. The first schema in the list is used by default. For each schema, a subsection is needed, which specifies `job_manager_endpoint` and `filesystem_endpoint`.
* `job_manager_endpoint` - access URL for pilot submission (interpreted by RADICAL-SAGA).
* `filesystem_endpoint` - access URL for file staging (interpreted by RADICAL-SAGA).
* `default_queue` (__*optional*__) - queue name to be used for pilot submission to a corresponding batch system (see `job_manager_endpoint`).
* `resource_manager` - the type of a job management system. Valid values are: `CCM`, `COBALT`, `FORK`, `LSF`, `PBSPRO`, `SLURM`, `TORQUE`, `YARN`.
* `cores_per_node` (__*optional*__) - number of available cores per compute node. If not provided, it is discovered by RADICAL-SAGA and by the Resource Manager in RADICAL-Pilot.
* `gpus_per_node` (__*optional*__) - number of available GPUs per compute node. If not provided, it is discovered by RADICAL-SAGA and by the Resource Manager in RADICAL-Pilot.
* `system_architecture` (__*optional*__) - set of options that describe resource features:
   * `smt` - Simultaneous MultiThreading (i.e., threads per physical core). If not provided, the default value `1` is used. It can be overridden with the environment variable `RADICAL_SMT`, exported before running the RADICAL-Pilot application. RADICAL-Pilot uses `cores_per_node x smt` to calculate the number of available cores/CPUs per node.
   * `options` - list of system specific attributes/constraints, which are provided to RADICAL-SAGA.
      * `COBALT` uses option `--attrs` for configuring location as `filesystems=home,grand`, mcdram as `mcdram=flat`, numa as `numa=quad`;
      * `LSF` uses option `-alloc_flags` to support `gpumps`, `nvme`;
      * `PBSPRO` uses option `-l` for configuring location as `filesystems=grand:home`, placement as `place=scatter`;
      * `SLURM` uses option `--constraint` for compute nodes filtering.
   * `blocked_cores` - list of core/CPU indices that the Scheduler in RADICAL-Pilot does not use for task assignment.
   * `blocked_gpus` - list of GPU indices that the Scheduler in RADICAL-Pilot does not use for task assignment.
* `agent_config` - configuration file for RADICAL-Pilot Agent (default value is `default` for a corresponding file [`agent_default.json`](https://github.com/radical-cybertools/radical.pilot/blob/master/src/radical/pilot/configs/agent_default.json)).
* `agent_scheduler` - Scheduler in RADICAL-Pilot (default value is `CONTINUOUS`).
* `agent_spawner` - Executor in RADICAL-Pilot, which spawns task execution processes (default value is `POPEN`).
* `default_remote_workdir` (__*optional*__) - directory for the agent sandbox (see the tutorial [Getting Started](getting_started.ipynb), section "Generated Output", for more details). If not provided, the current directory (`$PWD`) is used.
* `forward_tunnel_endpoint` (__*optional*__) - name of the host which can be used to create ssh tunnels from the compute nodes to the outside of the resource.
* `pre_bootstrap_0` (__*optional*__) - list of commands to execute during the bootstrapping process that launches the RADICAL-Pilot Agent.
* `pre_bootstrap_1` (__*optional*__) - list of commands to execute for the initialization of sub-agents, which are used to run additional instances of RADICAL-Pilot components such as Executor and Stager.
* `launch_methods` - set of supported launch methods. Valid values are `APRUN`, `CCMRUN`, `FLUX`, `FORK`, `IBRUN`, `JSRUN` (`JSRUN_ERF`), `MPIEXEC` (`MPIEXEC_MPT`), `MPIRUN` (`MPIRUN_CCMRUN`, `MPIRUN_DPLACE`, `MPIRUN_MPT`, `MPIRUN_RSH`), `PRTE`, `RSH`, `SRUN`, `SSH`. For each launch method, a subsection is needed, which specifies `pre_exec_cached` with a list of commands to be executed to configure the launch method, and method-related options (e.g., `dvm_count` for `PRTE`).
   * `order` - sets the order of launch methods to be selected for the task placement.
* `python_dist` - python distribution. Valid values are `default` and `anaconda`.
* `virtenv_mode` - how the bootstrapping process sets up the environment for the RADICAL-Pilot Agent:
   * `create` - create a Python virtual environment from scratch;
   * `recreate` - delete the existing virtual environment and build it from scratch; if not found, then `create`;
   * `use` - use the existing virtual environment; if not found, then `create`;
   * `update` - update the existing virtual environment; if not found, then `create` (__*default*__);
   * `local` - use the client's existing virtual environment (the environment from which the RADICAL-Pilot application was launched).
* `virtenv` (__*optional*__) - path to the existing virtual environment or its name with the pre-installed RCT stack; use it only when `virtenv_mode=use`.
* `rp_version` - RADICAL-Pilot installation or reuse process:
   * `local` - install from tarballs, from client existing environment (__*default*__);
   * `release` - install the latest released version from PyPI;
   * `installed` - do not install, target virtual environment has it.
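
Two of the rules above lend themselves to a quick self-check before a description is saved: every entry in `schemas` needs a subsection with both endpoints, and the per-node core count is `cores_per_node x smt`. The sketch below is an illustration, not RP's own validation. `check_schemas` and `cores_with_smt` are hypothetical helpers, and treating a string value such as `"batch": "interactive"` as a pointer to another schema's subsection is an assumption inferred from the example description.

```python
def check_schemas(cfg):
    """Return a list of problems found in the schema subsections of `cfg`."""
    problems = []
    for schema in cfg.get('schemas', []):
        section = cfg.get(schema)
        # Assumption: a string value points to another schema's subsection
        if isinstance(section, str):
            section = cfg.get(section)
        if not isinstance(section, dict):
            problems.append('%s: missing subsection' % schema)
            continue
        for key in ('job_manager_endpoint', 'filesystem_endpoint'):
            if key not in section:
                problems.append('%s: missing %s' % (schema, key))
    return problems


def cores_with_smt(cfg):
    """Apply `cores_per_node x smt`, with `smt` defaulting to 1."""
    smt = cfg.get('system_architecture', {}).get('smt', 1)
    return cfg['cores_per_node'] * smt


# Trimmed, hypothetical description; `ssh` is deliberately incomplete
cfg = {
    'schemas'            : ['local', 'ssh', 'batch', 'interactive'],
    'local'              : {'job_manager_endpoint': 'slurm://frontera.tacc.utexas.edu/',
                            'filesystem_endpoint' : 'file://frontera.tacc.utexas.edu/'},
    'ssh'                : {'job_manager_endpoint': 'slurm+ssh://frontera.tacc.utexas.edu/'},
    'batch'              : 'interactive',
    'interactive'        : {'job_manager_endpoint': 'fork://localhost/',
                            'filesystem_endpoint' : 'file://localhost/'},
    'cores_per_node'     : 56,
    'system_architecture': {'smt': 2},
}

print(check_schemas(cfg))
print(cores_with_smt(cfg))
```

An empty list from `check_schemas` means that each listed schema resolves to a subsection with both endpoints.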

In [None]:
import os

import radical.pilot as rp
import radical.utils as ru

In [None]:
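# save resource description in the user-defined configuration space
# (RADICAL_CONFIG_USER_DIR, if set, replaces the home directory)
base_dir = os.environ.get('RADICAL_CONFIG_USER_DIR') or os.path.expanduser('~')
ru.write_json(resource_tacc_tutorial, os.path.join(base_dir, '.radical/pilot/configs/resource_tacc.json'))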

In [None]:
tutorial_cfg = rp.utils.get_resource_config(resource='tacc.frontera_tutorial')

for attr in ['schemas', 'resource_manager', 'cores_per_node', 'system_architecture']:
    print('%-20s : %s' % (attr, tutorial_cfg[attr]))

In [None]:
print('job_manager_endpoint : ', rp.utils.get_resource_job_url(resource='tacc.frontera_tutorial', schema='ssh'))
print('filesystem_endpoint  : ', rp.utils.get_resource_fs_url(resource='tacc.frontera_tutorial', schema='ssh'))

This resource description will also be available within the RADICAL-Pilot `Session` (see the following section).

## Run description

### Resource allocation parameters

Every run should state the project name (i.e., allocation account) and the amount of required resources explicitly, unless it is a local run without accessing any batch system.

In [None]:
import radical.pilot as rp

pd = rp.PilotDescription({
    'resource': 'tacc.frontera_tutorial',  # resource description label
    'project' : 'XYZ000',                  # allocation account
    'queue'   : 'debug',                   # optional (default value might be set in resource description)
    'cores'   : 10,                        # amount of CPU slots
    'gpus'    : 6,                         # amount of GPU slots
    'runtime' : 15                         # maximum runtime for a pilot (in minutes)
})
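
The rule stated above can be turned into a small pre-submission check. This is a sketch under stated assumptions, not part of RP: `missing_run_keys` is a hypothetical helper, it works on a plain dictionary, and it treats any resource label that starts with `local.` as a local run (the label `local.localhost` below is used for illustration only).

```python
def missing_run_keys(desc):
    """List what a run description (plain dict) still needs."""
    missing = []
    if 'resource' not in desc:
        missing.append('resource')
    # some amount of resources has to be requested explicitly
    if not any(key in desc for key in ('cores', 'gpus', 'memory')):
        missing.append('cores/gpus/memory')
    # Assumption: labels starting with 'local.' denote a local run,
    # which does not need an allocation account
    is_local = str(desc.get('resource', '')).startswith('local.')
    if not is_local and 'project' not in desc:
        missing.append('project')
    return missing


print(missing_run_keys({'resource': 'tacc.frontera_tutorial', 'cores': 10}))
print(missing_run_keys({'resource': 'local.localhost', 'cores': 1}))
```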

### Resource access schema

The resource access schema (`pd.access_schema`) is provided as part of a resource description. If a resource offers more than one access schema, users can select a specific one in the `PilotDescription`. Check which schemas are available for each resource.
* `local` - launch the RADICAL-Pilot application from the target resource (e.g., login nodes of the specific machine).
* `ssh` - launch the RADICAL-Pilot application outside of the target resource and use the `ssh` protocol and a corresponding SSH client to access the resource remotely.
* `gsissh` - launch the RADICAL-Pilot application outside of the target resource and use GSI-enabled SSH to access the resource remotely.
* `interactive` - launch the RADICAL-Pilot application from the target resource within an interactive session, after being placed on the allocated compute nodes.
* `batch` - launch the RADICAL-Pilot application from a submitted batch script.
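
The selection rule (the first entry of `schemas` is the default, and an explicit choice has to be one of the listed schemas) can be sketched as follows. `select_schema` is a hypothetical helper written for this illustration; it does not reproduce RP's internal code.

```python
def select_schema(schemas, requested=None):
    """Pick the access schema: the first one listed unless one is requested."""
    if not schemas:
        raise ValueError('resource description lists no schemas')
    if requested is None:
        return schemas[0]
    if requested not in schemas:
        raise ValueError('schema %r not in %s' % (requested, schemas))
    return requested


schemas = ['local', 'ssh', 'batch', 'interactive']
print(select_schema(schemas))            # no explicit choice
print(select_schema(schemas, 'batch'))   # explicit choice
```

Requesting a schema that the resource does not list (for example `gsissh` with the list above) raises a `ValueError` in this sketch.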

<div class="alert alert-info">
    
__Note:__ For a detailed description of running applications on HPC platforms, see the tutorial [Using RADICAL-Pilot on HPC Platforms](hpc.ipynb).

</div>

<div class="alert alert-info">
    
__Note:__ For the initial setup regarding MongoDB, see the tutorial [Getting Started](getting_started.ipynb).

</div>

In [None]:
os.environ['RADICAL_PILOT_DBURL'] = 'mongodb://guest:guest@mongodb:27017/default'

In [None]:
session = rp.Session()

Let's confirm that the newly created resource description is available within the session.

In [None]:
tutorial_cfg = ru.as_dict(session.get_resource_config(resource='tacc.frontera_tutorial', schema='batch'))
for attr in ['label', 'launch_methods', 'job_manager_endpoint', 'filesystem_endpoint']:
    print('%-20s : %s' % (attr, tutorial_cfg[attr]))

The `Session` object holds all provided resource configurations (both predefined and user-defined). For a pilot, we therefore need to select one configuration and a corresponding access schema in the pilot description.

In [None]:
pd.access_schema = 'batch'

pmgr  = rp.PilotManager(session=session)
pilot = pmgr.submit_pilots(pd)

In [None]:
from pprint import pprint

pprint(pilot.as_dict())

After exploring the pilot setup and configuration, we close the session.

In [None]:
session.close()