## Getting Started

This notebook walks through an example of how to run RADICAL-Pilot on any Linux or macOS machine.
The tutorial includes an example of how to execute a simple workload of identical tasks (a _bag of tasks_).

### Activate your environment

```shell
# replace the virtualenv path with the correct path for your system
source ~/.virtualenvs/radical-pilot-env/bin/activate
```

### Check the versions

`!radical-stack`

```
  python               : /home/workstation/.local/share/virtualenvs/radical-pilot-env/bin/python3
  pythonpath           : 
  version              : 3.6.15
  virtualenv           : /home/workstation/.local/share/virtualenvs/radical-pilot-env

  radical.gtod         : 1.6.7
  radical.pilot        : 1.13.0-v1.13.0-161-gef63995ca@feature-issue_1578
  radical.saga         : 1.13.0
  radical.utils        : 1.13.0
```

Load the environment variables from the `.env` file. To read on how to set up `.env` for RP, see this. `RADICAL_PILOT_DBURL` must be set in the `.env` file for RP to work.

In [None]:
%load_ext dotenv
%dotenv ../../../.env

!radical-stack

In [None]:
import os
import sys
import pprint

import radical.pilot as rp
import radical.utils as ru


### Configuration Files

The examples in this notebook target `localhost` as the execution resource, but can also run on other resources for which we have pre-configured some relevant values.  We load those configuration options here:

In [None]:
from dotenv import load_dotenv
load_dotenv()

RADICAL_PILOT_DBURL = os.getenv("RADICAL_PILOT_DBURL")

resource = 'local.localhost'
config = ru.read_json('../config.json')
pprint.pprint(config, width=1)
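
The contents of `config.json` are not shown in this notebook. As a minimal sketch, assuming a flat JSON file with the keys read further below (`project`, `queue`, `cores`, `gpus`; the values here are made up for illustration), plain `dict.get()` lookups behave as follows: keys that are missing from the file yield `None` instead of raising an error.

```python
import json

# hypothetical stand-in for ../config.json; the real file may look different
example_text   = '{"cores": 4, "gpus": 0, "queue": null}'
example_config = json.loads(example_text)

cores   = example_config.get('cores')     # value taken from the file
queue   = example_config.get('queue')     # JSON null becomes None
project = example_config.get('project')   # key not present: None

# fall back to an explicit default where None is not acceptable
gpus = example_config.get('gpus') or 0

print(cores, queue, project, gpus)
```

Using a separate name (`example_config`) keeps the sketch from overwriting the `config` object loaded above.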


### Reporter for a better visualization
All code examples of this guide use the reporter facility of RADICAL-Utils to print well-formatted runtime and progress information. You can control that output with the `RADICAL_REPORT` variable, which can be set to `TRUE` or `FALSE` to enable or disable reporter output. We assume the setting to be `TRUE` when referencing any output in this chapter.

**NOTE:** notebooks don't handle ANSI escape sequences very well - you may want to disable reporter animations with the following setting in your `.env`:

    export RADICAL_REPORT_ANIME=False
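
If editing `.env` is inconvenient, the same variables can also be set from Python. This is a minimal sketch using only `os.environ`; it assumes that the variables are evaluated when the reporter is created, so the assignments should run before the `ru.Reporter(...)` call (and, to be on the safe side, before `radical.utils` is imported).

```python
import os

# enable reporter output, but switch off the animations
# which notebooks tend to render poorly
os.environ['RADICAL_REPORT']       = 'TRUE'
os.environ['RADICAL_REPORT_ANIME'] = 'False'

print(os.environ['RADICAL_REPORT'], os.environ['RADICAL_REPORT_ANIME'])
```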

In [None]:
report = ru.Reporter(name='radical.pilot')
report.title('Getting Started (RP version %s)' % rp.version)

### Setting up the session

A `rp.Session` is the root object for all other objects in RADICAL-Pilot.  The session ID is used to uniquely identify the application run.  That ID will, for example, be used to create a session sandbox in the current working directory in which log files etc. will be stored.

**NOTE:** It is important to close the session at the end of the application's execution: 

    session.close()

For one, that will terminate a number of local processes the session spawns to manage large workloads.  Possibly more importantly, it will also terminate all pilots of that session.  If you fail to terminate those pilots, they will keep running and holding resources, and even if those resources sit idle, they will still count against your allocation on that resource.  Both processes and pilots will eventually time out - but that might take quite a bit of time.

In [None]:
session = rp.Session()
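
One way to make sure that `close()` is always called, even when the workload raises an exception, is `contextlib.closing` (or an equivalent `try`/`finally` block). The sketch below shows the pattern with a hypothetical `DemoSession` stand-in class, which is not part of RADICAL-Pilot, so that it runs without starting any pilot; in an application, the `rp.Session()` instance would take its place.

```python
from contextlib import closing


class DemoSession:
    """Hypothetical stand-in for a session object; only models close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


demo = DemoSession()
try:
    with closing(demo):
        # the workload would go here; simulate a failure in the middle of it
        raise RuntimeError('workload failed')
except RuntimeError as exc:
    print('caught: %s' % exc)

# close() was called although the workload raised
print('closed:', demo.closed)
```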

A `rp.PilotManager` will, as the name suggests, manage the pilot instances the application will use, and the `rp.TaskManager` will manage the application tasks to be executed.  Both manager instances are attached to the Session, and their lifetime is controlled by the session.

In [None]:
pmgr = rp.PilotManager(session=session)
tmgr = rp.TaskManager(session=session)

### Setting up a Pilot

RP applications allocate resources by submitting a pilot job to the target resource.  That is usually an HPC cluster, but in these examples we will run the pilot on localhost.  We first describe that resource allocation request in the `rp.PilotDescription`, and then will request the pilot to be launched by the `rp.PilotManager` which we instantiated above.

### Pilot resource specification

A `rp.PilotDescription` is used to specify the type and number of resources (CPU cores and GPUs) to allocate, and also what project is to be used for accounting, for how long the resources are requested, etc.  We use the values from the configuration loaded above - feel free to set your own values!

The most important elements of the `PilotDescription` are:

  - `resource`: a label which specifies the target resource, either local or remote, on which to run the pilot;
  - `runtime`: the number of minutes the pilot is expected to be active, i.e., the runtime of the pilot;
  - `cores`: the number of CPU cores the pilot is expected to manage, i.e., the size of the pilot;
  - `gpus`: the number of GPUs the pilot is expected to manage.

In [None]:
pdesc = rp.PilotDescription({'resource': resource,
                             'runtime' : 30,
                             'project' : config.get('project'),
                             'queue'   : config.get('queue'),
                             'cores'   : config.get('cores'),
                             'gpus'    : config.get('gpus')})

### Launching the Pilot

Pilots are launched via a `rp.PilotManager`, by passing the `rp.PilotDescription` to the `submit_pilots()` method. To make use of the pilot, we register it with the task manager which then can use it for task execution  (you can add any number of pilots to the task manager).

In [None]:
pilot = pmgr.submit_pilots(pdesc)
tmgr.add_pilots(pilot)

### Task execution

At this point we have the system set up and ready to execute tasks.  In fact, the pilot job may still be waiting in a batch queue, or the pilot job's bootstrapper may still be preparing the pilot virtualenv - but either way, we can begin to submit tasks to the `TaskManager` for execution.

Similar to the pilot submission, we create an `rp.TaskDescription` for each task we want to execute.  That description specifies what we want to execute as a task, how to execute it, what resources it needs, etc.  For now, we submit 16 trivial tasks, each of which runs the `/bin/date` executable on a single core:

In [None]:
report.progress_tgt(16, label='create')

tds = list()
for _ in range(16):
    tds.append(rp.TaskDescription({'executable': '/bin/date'}))
    report.progress()
    
report.progress_done()

tasks = tmgr.submit_tasks(tds)


The tasks will now be scheduled for execution on the pilot we created above.  We now wait for the tasks to complete, i.e., to reach one of the final states `DONE`, `CANCELED` or `FAILED`.  Unless instructed otherwise, `tmgr.wait_tasks()` will wait for all tasks known to that task manager:

In [None]:
tmgr.wait_tasks()

Once completed, we can inspect the tasks for details of their execution: we print a summary for all tasks and then inspect one of them in a bit more detail:

In [None]:
for task in tasks:
    report.plain('  * %s: %s, exit: %3s, out: %s'
                % (task.uid, task.state[:4], task.exit_code, task.stdout[:35]))

report.plain('\n')

report.plain('uid             : %s\n' % task.uid)
report.plain('tmgr            : %s\n' % task.tmgr.uid)
report.plain('pilot           : %s\n' % task.pilot)
report.plain('name            : %s\n' % task.name)
report.plain('executable      : %s\n' % task.description['executable'])
report.plain('state           : %s\n' % task.state)
report.plain('exit_code       : %s\n' % task.exit_code)
report.plain('stdout          : %s\n' % task.stdout.strip())
report.plain('stderr          : %s\n' % task.stderr)
report.plain('return_value    : %s\n' % task.return_value)
report.plain('exception       : %s\n' % task.exception)
report.plain('\n')
report.plain('endpoint_fs     : %s\n' % task.endpoint_fs)
report.plain('resource_sandbox: %s\n' % task.resource_sandbox)
report.plain('session_sandbox : %s\n' % task.session_sandbox)
report.plain('pilot_sandbox   : %s\n' % task.pilot_sandbox)
report.plain('task_sandbox    : %s\n' % task.task_sandbox)
report.plain('client_sandbox  : %s\n' % task.client_sandbox)
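
For larger workloads, a per-state count is often easier to read than one line per task. The following is a minimal sketch of such a summary. It uses `types.SimpleNamespace` objects as hypothetical stand-ins for tasks and assumes that the state names appear as the plain strings `DONE`, `FAILED` and `CANCELED`; in the notebook, the `tasks` list would be used instead of `demo_tasks`.

```python
from collections import Counter
from types import SimpleNamespace

# hypothetical stand-ins: every fourth task is marked as failed
demo_tasks = [SimpleNamespace(uid='task.%06d' % i,
                              state='FAILED' if i % 4 == 0 else 'DONE')
              for i in range(16)]

# count how many tasks ended up in each state
state_counts = Counter(t.state for t in demo_tasks)

for state, count in sorted(state_counts.items()):
    print('%-10s: %3d' % (state, count))
```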



Tasks do not always succeed - be it because of internal errors (a simulation ran into an invalid condition), or because a compute node went down, etc.  Let's submit a new set of tasks and inspect the failure modes: we will scan `/bin/date` for acceptable single letter arguments:

In [None]:
import string
letters = string.ascii_lowercase + string.ascii_uppercase

report.progress_tgt(len(letters), label='create')

tds = list()
for letter in letters:
    tds.append(rp.TaskDescription({'executable': '/bin/date',
                                   'arguments': ['-' + letter]}))
    report.progress()
    
report.progress_done()

tasks = tmgr.submit_tasks(tds)

This time we wait only for the newly submitted tasks and check which ones succeeded - for those we check the resulting output.

In [None]:
tmgr.wait_tasks([task.uid for task in tasks])

for task in tasks:
    if task.state == rp.DONE:
        print('%s: %s: %s' % (task.uid, task.description['arguments'], task.stdout.strip()))
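
As a side note, `string.ascii_letters` is defined as the same concatenation of lowercase and uppercase letters, so the scan above covers 52 single-letter options. The sketch below only builds the argument lists as plain Python data, without submitting anything:

```python
import string

letters = string.ascii_letters
assert letters == string.ascii_lowercase + string.ascii_uppercase

# one argument list per task: ['-a'], ['-b'], ..., ['-Z']
argument_lists = [['-' + letter] for letter in letters]

print(len(argument_lists), argument_lists[0], argument_lists[-1])
# prints: 52 ['-a'] ['-Z']
```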


## Handle Failing Tasks

All applications can fail, often for reasons outside the user's control. A task is no different: it can fail as well. Many non-trivial applications need a way to handle failing tasks. Detecting the failure is the first and necessary step, and RP makes that part easy: RP’s task state model defines that a failing task immediately goes into the `FAILED` state, and that state information is available as the `task.state` property.

The task also has the `task.stderr` property available for further inspection of the cause of the failure. That property is only populated, though, if the task reached the `EXECUTING` state in the first place.


In [None]:
report.info('create %d task description(s)\n\t' % 10)

tds = list()
for i in range(0, 10):

    td = rp.TaskDescription()
    if i % 2:
        td.executable = '/bin/date'
    else:
        # trigger an error now and then
        td.executable = '/bin/data'  # does not exist
    tds.append(td)
    report.progress()

report.ok('>>ok\n')

tasks = tmgr.submit_tasks(tds)

report.header('gather results')
tmgr.wait_tasks()


report.info('\n')
for task in tasks:
    if task.state in [rp.FAILED, rp.CANCELED]:
        report.plain('  * %s: %s, exit: %5s, err: -%35s-'
                    % (task.uid, task.state[:4],
                       task.exit_code, task.stderr))
        report.error('>>err\n')

    else:
        report.plain('  * %s: %s, exit: %5s, out: %35s'
                    % (task.uid, task.state[:4],
                       task.exit_code, task.stdout))
        report.ok('>>ok\n')
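
Once failures are detected, a simple handling strategy is to collect the failed tasks for closer inspection or for a later resubmission. The following is a minimal sketch of that selection step only. The task objects are hypothetical `SimpleNamespace` stand-ins, and the state names are assumed to be plain strings; in the notebook, the `tasks` list and `rp.FAILED` would be used.

```python
from types import SimpleNamespace

FAILED, DONE = 'FAILED', 'DONE'   # assumed to match the state names used above

# hypothetical stand-ins mirroring the cell above: even indices fail
demo_tasks = []
for i in range(10):
    ok = bool(i % 2)
    demo_tasks.append(SimpleNamespace(
        uid='task.%06d' % i,
        state=DONE if ok else FAILED,
        description={'executable': '/bin/date' if ok else '/bin/data'}))

# select the failed tasks and keep a copy of their descriptions
failed             = [t for t in demo_tasks if t.state == FAILED]
retry_descriptions = [dict(t.description) for t in failed]

print('%d of %d tasks failed' % (len(failed), len(demo_tasks)))
print(sorted({d['executable'] for d in retry_descriptions}))
```

In this particular example a resubmission would fail again, since `/bin/data` still does not exist; the descriptions would need to be corrected first.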
            

## Use Multiple Pilots

We have seen in the previous examples how an RP pilot acts as a container for multiple task executions. There is in principle no limit on how many of those pilots are used to execute a specific workload, and specifically, pilots don’t need to run on the same resource!

The example below demonstrates that. For simplicity, we submit the same pilot description twice, which creates two pilots on the same target resource.
The tasks are distributed over the created set of pilots according to some scheduling mechanism – the section Selecting a Task Scheduler discusses how an application can choose between different scheduling policies. The default policy used here is Round Robin.

In [None]:
# submit the same pilot description twice to obtain two pilots
pdescs = [pdesc, pdesc]
pilots = pmgr.submit_pilots(pdescs)

# use a fresh task manager which only knows about the two new pilots
tmgr = rp.TaskManager(session=session)
tmgr.add_pilots(pilots)

n = 10 
report.info('create %d task description(s)\n\t' % n)

tds = list()
for i in range(0, n):
    td = rp.TaskDescription()
    td.executable = '/bin/echo'
    td.arguments  = ['$RP_PILOT_ID']

    tds.append(td)
    report.progress()
report.ok('>>ok\n')
tasks = tmgr.submit_tasks(tds)
report.header('gather results')
tmgr.wait_tasks()
            


In [None]:
counts = dict()
for task in tasks:
    out_str = task.stdout.strip()[:35]
    report.plain('  * %s: %s, exit: %3s, out: %s\n'
            % (task.uid, task.state[:4],
                task.exit_code, out_str))
    if out_str not in counts:
        counts[out_str] = 0
    counts[out_str] += 1

report.info("\n")
for out_str in counts:
    report.info("  * %-20s: %3d\n" % (out_str, counts[out_str]))
report.info("  * %-20s: %3d\n" % ('total', sum(counts.values())))
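
To illustrate what a Round Robin policy implies for the counts printed above, the following sketch assigns ten task IDs to two pilot IDs in turn. It is a plain-Python illustration of the policy with made-up IDs, not RP's scheduler implementation:

```python
from collections import Counter
from itertools import cycle

pilot_ids = ['pilot.0000', 'pilot.0001']            # made-up IDs
task_ids  = ['task.%06d' % i for i in range(10)]    # made-up IDs

# hand out tasks to the pilots in turn: first, second, first, second, ...
assignment = dict(zip(task_ids, cycle(pilot_ids)))

per_pilot = Counter(assignment.values())
for pid in pilot_ids:
    print('%-12s: %3d' % (pid, per_pilot[pid]))
```

With an even number of tasks and two pilots, each pilot receives the same number of tasks.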


### MPI Tasks and Task Resources

In the examples we have, so far, been running single-core tasks.  By expanding the task description to include the `ranks` attribute, we can request multiple MPI ranks to be created where each rank uses one core.  Additional resource requirements can be specified per rank:

  - `cores_per_rank`: the number of cores each rank can use for spawning additional threads or processes
  - `gpus_per_rank`: the number of GPUs each rank can utilize
  - `mem_per_rank`: the size of memory (in megabytes) which is available to each rank
  - `lfs_per_rank`: the amount of node-local file storage which is available to each rank
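
Because the per-rank values multiply with the number of ranks, it can be useful to estimate the total request of a task before submitting it. Below is a small sketch of that arithmetic. It assumes one core per rank when `cores_per_rank` is not set (as described above) and uses a hypothetical pilot size of 8 cores; `total_cores` is an illustrative helper, not an RP function.

```python
def total_cores(ranks, cores_per_rank=1):
    """Cores a task occupies: every rank gets `cores_per_rank` cores."""
    return ranks * cores_per_rank


pilot_cores = 8   # hypothetical pilot size

for ranks in range(1, 5):
    for cpr in (1, 2, 4):
        need = total_cores(ranks, cpr)
        fits = 'fits' if need <= pilot_cores else 'too large'
        print('ranks=%d cores_per_rank=%d -> %2d cores (%s)'
              % (ranks, cpr, need, fits))
```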

We use the `radical-pilot-hello.sh` command to report on rank creation:

In [None]:
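# assumption: `radical-pilot-hello.sh` is installed in the `bin/` directory
# of the active Python environment (e.g., the virtualenv activated above)
base = sys.prefix

report.progress_tgt(4, label='create')

tds = list()
for n in range(4):
    tds.append(rp.TaskDescription({'executable': '%s/bin/radical-pilot-hello.sh' % base,
                                   'ranks': (n + 1)}))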
    report.progress()
    
report.progress_done()

tasks = tmgr.submit_tasks(tds)
tmgr.wait_tasks([task.uid for task in tasks])

for task in tasks:
    print('--- %s:\n%s\n' % (task.uid, task.stdout.strip()))


In [None]:
report.header('finalize')
# session.close()