# RADICAL-Pilot Configuration System

RADICAL-Pilot (RP) uses a configuration process to set control and management parameters for the initialization of its components and to define resource entry points.

It includes:

- **[Resource description](#Resource-description)**
  - Batch system (e.g., `SLURM`, `LSF`, etc.);
  - Provided launch methods (e.g., `SRUN`, `MPIRUN`, etc.);
  - Environment setup (including package manager, working directory, etc.);
  - Entry points: batch system URL, file system URL.
- **[Run description](#Run-description)**
  - Project allocation name (i.e., account/project, _specific for HPC platforms_);
  - Job queue name (i.e., queue/partition, _specific for HPC platforms_);
  - Amount of resources for the run (e.g., `cores`, `gpus`, `memory`) and the period of time they are requested for (`runtime`);
  - Mode to access the resource (e.g., `local`, `ssh`, `batch/interactive`).

All this information is provided for the `Pilot` by a `PilotDescription` object (the following examples will use a corresponding instance `pd`).

## Resource description

A resource description (i.e., resource configuration) is a record for a particular resource within a JSON file. Each file collects the resources of one facility and is named `resource_<facility_name>.json`. The resource description label (`pd.resource`) has the form `<facility_name>.<resource_name>` and is referenced in `PilotDescription`.
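
The naming convention above can be illustrated with a small helper. This is a minimal sketch, not part of RADICAL-Pilot: the function name `split_resource_label` is hypothetical, and it assumes that the facility name is everything before the first dot of the label.

```python
def split_resource_label(label):
    """Split '<facility_name>.<resource_name>' into its two parts and derive
    the name of the JSON file expected to hold the resource record."""
    # assumption of this sketch: the first dot separates facility and resource
    facility, sep, resource = label.partition('.')
    if not sep or not facility or not resource:
        raise ValueError('expected <facility_name>.<resource_name>, got %r' % label)
    return facility, resource, 'resource_%s.json' % facility


facility, resource, file_name = split_resource_label('tacc.frontera_tutorial')
print(facility, resource, file_name)
```

For the label used later in this tutorial, the record `frontera_tutorial` would therefore be looked up in a file named `resource_tacc.json`.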

Local paths for resource descriptions (located on the machine from which the RADICAL-Pilot application is launched):
- **Pre-defined** resource descriptions ([GitHub repo](https://github.com/radical-cybertools/radical.pilot/tree/master/src/radical/pilot/configs))
  - `<ve>/lib/python<py_ver>/site-packages/radical/pilot/configs/`
- **User-defined** resource descriptions
  - `$HOME/.radical/pilot/configs/`
  - **These are merged with the pre-defined resource descriptions; parameters with the same name are overwritten by the user-defined ones**
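
The override rule can be pictured with plain dictionaries. The following is only a sketch of the idea, not RADICAL-Pilot's implementation: `merge_configs` is a hypothetical helper, the queue names are made-up sample values, and merging nested dictionaries key by key is an assumption.

```python
import copy


def merge_configs(predefined, user_defined):
    """Return a new dict in which user-defined values override pre-defined
    ones. Nested dicts are merged key by key (an assumption of this sketch)."""
    merged = copy.deepcopy(predefined)
    for key, value in user_defined.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# hypothetical sample values, for illustration only
predefined = {'frontera': {'default_queue': 'normal', 'cores_per_node': 56}}
user_defined = {'frontera': {'default_queue': 'development'},
                'frontera_tutorial': {'default_queue': 'production'}}

merged = merge_configs(predefined, user_defined)
print(merged['frontera'])
print(sorted(merged))
```

In this sketch the user-defined `default_queue` replaces the pre-defined one, `cores_per_node` is kept, and the additional record `frontera_tutorial` appears next to the pre-defined one.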

<div class="alert alert-info">

__Note:__ To change the location of user-defined resource descriptions, set the environment variable `RADICAL_CONFIG_USER_DIR`; it is used instead of the environment variable `HOME` in the location path above. Make sure that the corresponding path exists before creating configs there.

</div>

In [1]:
!mkdir -p "${RADICAL_CONFIG_USER_DIR:-$HOME}/.radical/pilot/configs/"
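
The same path resolution can be written in Python, which is convenient when configs are created from a script rather than from a shell. This is a minimal sketch based on the note above; the helper names `user_config_dir` and `ensure_user_config_dir` are hypothetical and not part of RADICAL-Pilot.

```python
import os
from pathlib import Path


def user_config_dir(environ=None):
    """Resolve the directory for user-defined resource descriptions:
    RADICAL_CONFIG_USER_DIR replaces HOME when it is set and non-empty."""
    environ = os.environ if environ is None else environ
    base = (environ.get('RADICAL_CONFIG_USER_DIR')
            or environ.get('HOME')
            or str(Path.home()))
    return Path(base) / '.radical' / 'pilot' / 'configs'


def ensure_user_config_dir(environ=None):
    """Create the directory if needed (like `mkdir -p`) and return its path."""
    path = user_config_dir(environ)
    path.mkdir(parents=True, exist_ok=True)
    return path


print(user_config_dir())
```

Calling `ensure_user_config_dir()` has the same effect as the shell command in the cell above.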

### Custom resource description file

Let's create a custom file `resource_tacc.json` locally (it will be loaded into RADICAL-Pilot `Session` with other TACC-related resources).

In [2]:
resource_tacc_tutorial = \
{
    "frontera_tutorial":
    {
        "description"                 : "Short description of the resource",
        "notes"                       : "Notes about resource usage",

        "schemas"                     : ["local", "ssh", "batch", "interactive"],
        "local"                       :
        {
            "job_manager_endpoint"    : "slurm://frontera.tacc.utexas.edu/",
            "filesystem_endpoint"     : "file://frontera.tacc.utexas.edu/"
        },
        "ssh"                         :
        {
            "job_manager_endpoint"    : "slurm+ssh://frontera.tacc.utexas.edu/",
            "filesystem_endpoint"     : "sftp://frontera.tacc.utexas.edu/"
        },
        "batch"                       : "interactive",
        "interactive"                 :
        {
            "job_manager_endpoint"    : "fork://localhost/",
            "filesystem_endpoint"     : "file://localhost/"
        },

        "default_queue"               : "production",
        "resource_manager"            : "SLURM",

        "cores_per_node"              : 56,
        "gpus_per_node"               : 0,
        "system_architecture"         : {
                                         "smt"           : 1,
                                         "options"       : ["nvme", "intel"],
                                         "blocked_cores" : [],
                                         "blocked_gpus"  : []
                                        },

        "agent_config"                : "default",
        "agent_scheduler"             : "CONTINUOUS",
        "agent_spawner"               : "POPEN",
        "default_remote_workdir"      : "$HOME",

        "pre_bootstrap_0"             : [
                                        "module unload intel impi",
                                        "module load   intel impi",
                                        "module load   python3/3.9.2"
                                        ],
        "launch_methods"              : {
                                         "order"  : ["MPIRUN"],
                                         "MPIRUN" : {
                                             "pre_exec_cached": [
                                                 "module load TACC"
                                             ]
                                         }
                                        },
        
        "python_dist"                 : "default",
        "virtenv_mode"                : "local"
    }
}

The definition of each field:

* `description` (__*optional*__) - human readable description of the resource.
* `notes` (__*optional*__) - information needed to form valid pilot descriptions, such as what parameters are required, etc.
* `schemas` - allowed values for the `pd.access_schema` attribute of the pilot description. The first schema in the list is used by default. For each schema, a subsection is needed, which specifies `job_manager_endpoint` and `filesystem_endpoint`.
* `job_manager_endpoint` - access url for pilot submission (interpreted by RADICAL-SAGA).
* `filesystem_endpoint` - access url for file staging (interpreted by RADICAL-SAGA).
* `default_queue` (__*optional*__) - queue name to be used for pilot submission to a corresponding batch system (see `job_manager_endpoint`).
* `resource_manager` - the type of a job management system. Valid values are: `CCM`, `COBALT`, `FORK`, `LSF`, `PBSPRO`, `SLURM`, `TORQUE`, `YARN`.
* `cores_per_node` (__*optional*__) - number of available cores per compute node. If not provided then it will be discovered by RADICAL-SAGA and by Resource Manager in RADICAL-Pilot.
* `gpus_per_node` (__*optional*__) - number of available gpus per compute node. If not provided then it will be discovered by RADICAL-SAGA and by Resource Manager in RADICAL-Pilot.
* `system_architecture` (__*optional*__) - set of options that describe resource features:
   * `smt` - Simultaneous MultiThreading (i.e., threads per physical core). If it is not provided then the default value `1` is used. It could be reset with env variable `RADICAL_SMT` exported before running RADICAL-Pilot application. RADICAL-Pilot uses `cores_per_node x smt` to calculate all available cores/cpus per node.
   * `options` - list of system specific attributes/constraints, which are provided to RADICAL-SAGA.
      * `COBALT` uses option `--attrs` for configuring location as `filesystems=home,grand`, mcdram as `mcdram=flat`, numa as `numa=quad`;
      * `LSF` uses option `-alloc_flags` to support `gpumps`, `nvme`;
      * `PBSPRO` uses option `-l` for configuring location as `filesystems=grand:home`, placement as `place=scatter`;
      * `SLURM` uses option `--constraint` for compute nodes filtering.
   * `blocked_cores` - list of core/CPU indices that the Scheduler in RADICAL-Pilot does not use for task assignment.
   * `blocked_gpus` - list of GPU indices that the Scheduler in RADICAL-Pilot does not use for task assignment.
* `agent_config` - configuration file for RADICAL-Pilot Agent (default value is `default` for a corresponding file [`agent_default.json`](https://github.com/radical-cybertools/radical.pilot/blob/master/src/radical/pilot/configs/agent_default.json)).
* `agent_scheduler` - Scheduler in RADICAL-Pilot (default value is `CONTINUOUS`).
* `agent_spawner` - Executor in RADICAL-Pilot, which spawns task execution processes (default value is `POPEN`).
* `default_remote_workdir` (__*optional*__) - directory for agent sandbox (see the tutorial [Getting Started](getting_started.ipynb), section "Generated Output", for more details). If not provided, the current directory (`$PWD`) is used.
* `forward_tunnel_endpoint` (__*optional*__) - name of the host which can be used to create ssh tunnels from the compute nodes to the outside of the resource.
* `pre_bootstrap_0` (__*optional*__) - list of commands to execute for the bootstrapping process to launch RADICAL-Pilot Agent.
* `pre_bootstrap_1` (__*optional*__) - list of commands to execute for the initialization of sub-agents, which are used to run additional instances of RADICAL-Pilot components such as Executor and Stager.
* `launch_methods` - set of supported launch methods. Valid values are `APRUN`, `CCMRUN`, `FLUX`, `FORK`, `IBRUN`, `JSRUN` (`JSRUN_ERF`), `MPIEXEC` (`MPIEXEC_MPT`), `MPIRUN` (`MPIRUN_CCMRUN`, `MPIRUN_DPLACE`, `MPIRUN_MPT`, `MPIRUN_RSH`), `PRTE`, `RSH`, `SRUN`, `SSH`. For each launch method, a subsection is needed, which specifies `pre_exec_cached` with a list of commands to be executed to configure the launch method, and method-related options (e.g., `dvm_count` for `PRTE`).
   * `order` - sets the order of launch methods to be selected for the task placement.
* `python_dist` - python distribution. Valid values are `default` and `anaconda`.
* `virtenv_mode` - how the bootstrapping process sets up the environment for RADICAL-Pilot Agent:
   * `create` - create a Python virtual environment from scratch;
   * `recreate` - delete the existing virtual environment and build it from scratch, if not found then `create`;
   * `use` - use the existing virtual environment, if not found then `create`;
   * `update` - update the existing virtual environment, if not found then `create` (__*default*__);
   * `local` - use the client's existing virtual environment (the environment from which the RADICAL-Pilot application was launched).
* `virtenv` (__*optional*__) - path to the existing virtual environment or its name with the pre-installed RCT stack; use it only when `virtenv_mode=use`.
* `rp_version` - RADICAL-Pilot installation or reuse process:
   * `local` - install from tarballs, from client existing environment (__*default*__);
   * `release` - install the latest released version from PyPI;
   * `installed` - do not install, target virtual environment has it.
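
Several of the rules listed above can be checked mechanically before a file is saved. The following is a minimal sketch that works on plain dictionaries and is not part of RADICAL-Pilot: `check_resource_description` is a hypothetical helper, it covers only a few of the fields, and it accepts a string as a schema subsection because the tutorial example uses `"batch": "interactive"` to point to another schema.

```python
VALID_RESOURCE_MANAGERS = {'CCM', 'COBALT', 'FORK', 'LSF',
                           'PBSPRO', 'SLURM', 'TORQUE', 'YARN'}
ENDPOINT_KEYS = ('job_manager_endpoint', 'filesystem_endpoint')


def check_resource_description(cfg):
    """Return a list of problems found in one resource record."""
    problems = []
    schemas = cfg.get('schemas') or []
    if not schemas:
        problems.append('schemas: at least one access schema is required')
    for schema in schemas:
        section = cfg.get(schema)
        if isinstance(section, str):
            # a string points to another schema, as in the tutorial example
            if not isinstance(cfg.get(section), dict):
                problems.append('%s: refers to unknown schema %r' % (schema, section))
        elif isinstance(section, dict):
            for key in ENDPOINT_KEYS:
                if key not in section:
                    problems.append('%s: missing %s' % (schema, key))
        else:
            problems.append('%s: missing subsection' % schema)
    if cfg.get('resource_manager') not in VALID_RESOURCE_MANAGERS:
        problems.append('resource_manager: unknown value %r' % cfg.get('resource_manager'))
    launch_methods = cfg.get('launch_methods', {})
    for name in launch_methods.get('order', []):
        if name not in launch_methods:
            problems.append('launch_methods: %s is in order but has no subsection' % name)
    return problems


record = {
    'schemas': ['local', 'batch', 'interactive'],
    'local': {'job_manager_endpoint': 'slurm://frontera.tacc.utexas.edu/',
              'filesystem_endpoint': 'file://frontera.tacc.utexas.edu/'},
    'batch': 'interactive',
    'interactive': {'job_manager_endpoint': 'fork://localhost/',
                    'filesystem_endpoint': 'file://localhost/'},
    'resource_manager': 'SLURM',
    'launch_methods': {'order': ['MPIRUN'],
                       'MPIRUN': {'pre_exec_cached': ['module load TACC']}},
}
print(check_resource_description(record))

# a deliberately broken copy: unknown resource manager, incomplete subsection
broken = dict(record, resource_manager='SGE',
              interactive={'job_manager_endpoint': 'fork://localhost/'})
for problem in check_resource_description(broken):
    print(problem)
```

The first call reports no problems for the trimmed tutorial record; the broken copy yields one message for the incomplete `interactive` subsection and one for the unknown `resource_manager` value.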

In [3]:
import os

import radical.pilot as rp
import radical.utils as ru

In [4]:
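# save resource description in the user-defined configuration space
# (honors RADICAL_CONFIG_USER_DIR, consistent with the `mkdir` cell above)
cfg_base = os.environ.get('RADICAL_CONFIG_USER_DIR') or os.path.expanduser('~')
ru.write_json(resource_tacc_tutorial, os.path.join(cfg_base, '.radical/pilot/configs/resource_tacc.json'))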

In [5]:
tutorial_cfg = rp.utils.get_resource_config(resource='tacc.frontera_tutorial')

for attr in ['schemas', 'resource_manager', 'cores_per_node', 'system_architecture']:
    print('%-20s : %s' % (attr, tutorial_cfg[attr]))

schemas              : ['local', 'ssh', 'batch', 'interactive']
resource_manager     : SLURM
cores_per_node       : 56
system_architecture  : {'blocked_cores': [], 'blocked_gpus': [], 'options': ['nvme', 'intel'], 'smt': 1}
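
The values printed above can be combined as described for `smt` and `blocked_cores`. This is a minimal sketch, not RADICAL-Pilot's code: `usable_cores_per_node` is a hypothetical helper, and subtracting the number of blocked indices from `cores_per_node x smt` is an assumption about how the two settings interact.

```python
import os


def usable_cores_per_node(cores_per_node, system_architecture=None, environ=None):
    """Estimate the cores/cpus per node that remain available for tasks."""
    arch = system_architecture or {}
    environ = os.environ if environ is None else environ
    # RADICAL_SMT, when exported, takes precedence over the configured value
    smt = int(environ.get('RADICAL_SMT') or arch.get('smt', 1))
    # assumption: every blocked index removes one logical core from the total
    blocked = set(arch.get('blocked_cores', []))
    return cores_per_node * smt - len(blocked)


# values of the tutorial resource (empty environment passed for determinism)
print(usable_cores_per_node(56, {'smt': 1, 'blocked_cores': []}, environ={}))

# hypothetical variations: two blocked cores, then SMT overridden via env
print(usable_cores_per_node(56, {'smt': 1, 'blocked_cores': [0, 1]}, environ={}))
print(usable_cores_per_node(56, {'smt': 1}, environ={'RADICAL_SMT': '2'}))
```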


In [6]:
print('job_manager_endpoint : ', rp.utils.get_resource_job_url(resource='tacc.frontera_tutorial', schema='ssh'))
print('filesystem_endpoint  : ', rp.utils.get_resource_fs_url(resource='tacc.frontera_tutorial', schema='ssh'))

job_manager_endpoint :  slurm+ssh://frontera.tacc.utexas.edu/
filesystem_endpoint  :  sftp://frontera.tacc.utexas.edu/


This resource description will also be available within a RADICAL-Pilot `Session` (see the following section).

## Run description

### Resource allocation parameters

Every run should state the project name (i.e., allocation account) and the amount of required resources explicitly, unless it is a local run without accessing any batch system.

In [7]:
import radical.pilot as rp

pd = rp.PilotDescription({
    'resource': 'tacc.frontera_tutorial',  # resource description label
    'project' : 'XYZ000',                  # allocation account
    'queue'   : 'debug',                   # optional (default value might be set in resource description)
    'cores'   : 10,                        # amount of CPU slots
    'gpus'    : 6,                         # amount of GPU slots
    'runtime' : 15                         # maximum runtime for a pilot (in minutes)
})
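
The requirement stated above can be turned into a quick check before submission. This is a minimal sketch on a plain dictionary, not a RADICAL-Pilot API: `missing_run_parameters` is a hypothetical helper, the caller states whether a batch system is involved, and accepting `nodes` as an alternative to `cores`/`gpus` is an assumption.

```python
def missing_run_parameters(description, uses_batch_system=True):
    """Return the names of run parameters that still need to be set."""
    missing = []
    if not description.get('resource'):
        missing.append('resource')
    if uses_batch_system:
        if not description.get('project'):
            missing.append('project')
        # assumption: any one of these counts as an explicit resource request
        if not any(description.get(key) for key in ('cores', 'gpus', 'nodes')):
            missing.append('cores/gpus')
        if not description.get('runtime'):
            missing.append('runtime')
    return missing


description = {'resource': 'tacc.frontera_tutorial', 'queue': 'debug', 'cores': 10}
print(missing_run_parameters(description))
print(missing_run_parameters(description, uses_batch_system=False))
```

In this example the first call reports `project` and `runtime` as missing, while the same description is considered complete for a run that does not access a batch system.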

### Resource access schema

The resource access schema (`pd.access_schema`) is provided as part of a resource description. If a resource offers more than one access schema, the user can select a specific one in `PilotDescription`; check which schemas are available for each resource.
* `local` - launch RADICAL-Pilot application from the target resource (e.g., login nodes of the specific machine).
* `ssh` - launch RADICAL-Pilot application outside of the target resource and use `ssh` protocol and corresponding SSH client to access the resource remotely.
* `gsissh` - launch RADICAL-Pilot application outside of the target resource and use GSI-enabled SSH to access the resource remotely.
* `interactive` - launch RADICAL-Pilot application from the target resource within the interactive session after being placed on allocated compute nodes.
* `batch` - launch RADICAL-Pilot application by a submitted batch script.

<div class="alert alert-info">
    
__Note:__ For a detailed description of running applications on HPC platforms, see the tutorial [Using RADICAL-Pilot on HPC Platforms](hpc.ipynb).

</div>

<div class="alert alert-info">
    
__Note:__ For the initial setup regarding MongoDB see the tutorial [Getting Started](getting_started.ipynb).

</div>

In [8]:
os.environ['RADICAL_PILOT_DBURL'] = 'mongodb://guest:guest@mongodb:27017/default'

In [9]:
session = rp.Session()

new session: [rp.session.95984fa0-ca8c-11ed-a943-0242ac140003]                 \
database   : [mongodb://guest:****@mongodb:27017/default]                     ok

Let's confirm that the newly created resource description is available within the session.

In [10]:
tutorial_cfg = ru.as_dict(session.get_resource_config(resource='tacc.frontera_tutorial', schema='batch'))
for attr in ['label', 'launch_methods', 'job_manager_endpoint', 'filesystem_endpoint']:
    print('%-20s : %s' % (attr, tutorial_cfg[attr]))

label                : tacc.frontera_tutorial
launch_methods       : {'MPIRUN': {'pre_exec_cached': ['module load TACC']}, 'order': ['MPIRUN']}
job_manager_endpoint : fork://localhost/
filesystem_endpoint  : file://localhost/
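
The output above shows that `schema='batch'` yields the endpoints of the `interactive` schema, which matches the entry `"batch": "interactive"` in the resource description. The lookup can be pictured as follows. This is a minimal sketch on a plain dictionary, not RADICAL-Pilot's implementation; `resolve_endpoints` is a hypothetical helper.

```python
def resolve_endpoints(cfg, schema=None):
    """Return (job_manager_endpoint, filesystem_endpoint) for an access schema.
    Without a schema, the first entry of `schemas` is used as the default."""
    schema = schema or cfg['schemas'][0]
    if schema not in cfg['schemas']:
        raise ValueError('schema %r is not available: %s' % (schema, cfg['schemas']))
    section = cfg[schema]
    seen = {schema}
    # a string value points to another schema; follow it, guarding against loops
    while isinstance(section, str):
        if section in seen:
            raise ValueError('schema alias loop at %r' % section)
        seen.add(section)
        section = cfg[section]
    return section['job_manager_endpoint'], section['filesystem_endpoint']


cfg = {
    'schemas': ['local', 'ssh', 'batch', 'interactive'],
    'local': {'job_manager_endpoint': 'slurm://frontera.tacc.utexas.edu/',
              'filesystem_endpoint': 'file://frontera.tacc.utexas.edu/'},
    'ssh': {'job_manager_endpoint': 'slurm+ssh://frontera.tacc.utexas.edu/',
            'filesystem_endpoint': 'sftp://frontera.tacc.utexas.edu/'},
    'batch': 'interactive',
    'interactive': {'job_manager_endpoint': 'fork://localhost/',
                    'filesystem_endpoint': 'file://localhost/'},
}

print(resolve_endpoints(cfg, 'batch'))
print(resolve_endpoints(cfg))
```

The first call follows the alias to the `interactive` subsection; the second call falls back to `local`, the first schema in the list.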


The `Session` object holds all provided resource configurations (pre-defined and user-defined ones), so for a pilot we select the needed configuration and a corresponding access schema in the pilot description.

In [11]:
pd.access_schema = 'batch'

pmgr  = rp.PilotManager(session=session)
pilot = pmgr.submit_pilots(pd)

create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   tacc.frontera_tutorial     10 cores       6 gpus         ok

In [12]:
from pprint import pprint

pprint(pilot.as_dict())

{'client_sandbox': '/tutorials/radical-pilot',
 'description': {'access_schema': 'batch',
                 'app_comm': [],
                 'candidate_hosts': [],
                 'cleanup': False,
                 'cores': 10,
                 'exit_on_error': True,
                 'gpus': 6,
                 'input_staging': [],
                 'job_name': None,
                 'layout': 'default',
                 'memory': 0,
                 'nodes': 0,
                 'output_staging': [],
                 'prepare_env': {},
                 'project': 'XYZ000',
                 'queue': 'debug',
                 'resource': 'tacc.frontera_tutorial',
                 'runtime': 15,
                 'sandbox': None,
                 'services': [],
                 'uid': None},
 'endpoint_fs': 'file://localhost/',
 'js_hop': 'fork://localhost/',
 'js_url': 'fork://localhost/',
 'log': None,
 'pilot_sandbox': 'file://localhost/home/jovyan/radical.pilot.sandbox/rp.session.95984

After exploring the pilot setup and configuration, we close the session.

In [13]:
session.close()

closing session rp.session.95984fa0-ca8c-11ed-a943-0242ac140003                \
close pilot manager                                                            \
wait for 1 pilot(s)