<h1>3.4. Use Multiple Pilots</h1>

We have seen in the previous examples how an RP pilot acts as a container for multiple task executions. There is, in principle, no limit on how many pilots are used to execute a given workload, and the pilots do not need to run on the same resource.

The example below demonstrates this. Instead of creating a single pilot description, we create one for each resource specified as a command line parameter, regardless of whether those parameters point to the same resource target.
The tasks are distributed over the created set of pilots according to a scheduling policy; the section Selecting a Task Scheduler discusses how an application can choose between different policies. The default policy used here is Round Robin.
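
To make the idea of Round Robin distribution concrete, here is a minimal sketch in plain Python. It does not use or reproduce RP's scheduler code; the helper `assign_round_robin` and the task and pilot IDs are hypothetical names chosen for illustration.

```python
from itertools import cycle


def assign_round_robin(task_ids, pilot_ids):
    """Hand out tasks to pilots in turn (illustrative, not RP's scheduler)."""
    if not pilot_ids:
        raise ValueError('need at least one pilot')
    assignment = {pid: [] for pid in pilot_ids}
    # cycle() repeats the pilot list endlessly; zip() stops with the tasks
    for tid, pid in zip(task_ids, cycle(pilot_ids)):
        assignment[pid].append(tid)
    return assignment


# Hypothetical IDs, for illustration only
demo_tasks  = ['task.%04d' % i for i in range(5)]
demo_pilots = ['pilot.0000', 'pilot.0001']
rr = assign_round_robin(demo_tasks, demo_pilots)
```

Under strict rotation as sketched here, the 128 tasks of this example spread over two pilots would place 64 tasks on each; the scheduler RP actually uses may differ in detail.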

<h2>3.4.1. Running the Example</h2>

We start by importing the radical.pilot module and initializing the reporter facility used for printing well-formatted runtime and progress information.

In [None]:
import os
import sys

verbose  = os.environ.get('RADICAL_PILOT_VERBOSE', 'REPORT')
os.environ['RADICAL_PILOT_VERBOSE'] = verbose

import radical.pilot as rp
import radical.utils as ru

report = ru.Reporter(name='radical.pilot')
report.title('Getting Started (RP version %s)' % rp.version)

We now use the dotenv module to load environment variables from a .env file. Creating a new Session requires the URL of a MongoDB server, which we read from that file as RADICAL_PILOT_DBURL.

Note that in this example we start one pilot on local.localhost and one on xsede.stampede:

In [None]:
from dotenv import load_dotenv
load_dotenv()

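RADICAL_PILOT_DBURL = os.getenv("RADICAL_PILOT_DBURL")
if not RADICAL_PILOT_DBURL:
    # os.environ only accepts strings, so fail early with a clear message
    raise RuntimeError('RADICAL_PILOT_DBURL is not set (check your .env file)')
os.environ['RADICAL_PILOT_DBURL'] = RADICAL_PILOT_DBURL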

# The resources would normally be given on the command line.  In a notebook,
# sys.argv holds the kernel's own arguments, so we list the targets here.
resources = ['local.localhost', 'xsede.stampede']
session = rp.Session()
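
Each selected resource label is later used as a key into the configuration read from `../config.json`. That file is not shown in this section, so the following sketch assumes a layout in which every label maps to a flat dict of optional settings. The helper `build_pilot_inits` and all sample values are hypothetical, and the code is plain Python that does not call into RP.

```python
def build_pilot_inits(resources, config, runtime=15):
    """Return one pilot-description dict per resource label.

    Hypothetical helper mirroring the keys used in this example.
    """
    inits = []
    for resource in resources:
        if resource not in config:
            # report the missing label instead of failing on a bare lookup
            raise KeyError('no config entry for resource %r' % resource)
        rcfg = config[resource]
        inits.append({
            'resource'     : resource,
            'runtime'      : runtime,             # pilot runtime (min)
            'project'      : rcfg.get('project'),
            'queue'        : rcfg.get('queue'),
            'access_schema': rcfg.get('schema'),
            'cores'        : rcfg.get('cores', 1),
            'gpus'         : rcfg.get('gpus', 0),
        })
    return inits


# Made-up sample configuration (assumed layout, placeholder values)
sample_config = {
    'local.localhost': {'cores': 4},
    'xsede.stampede' : {'project': 'TG-XXXXXX', 'queue': 'normal',
                        'schema' : 'ssh',       'cores': 16},
}
inits = build_pilot_inits(['local.localhost', 'xsede.stampede'], sample_config)
```

Settings missing from an entry fall back to `None`, one core, and zero GPUs, which matches the defaults used in the example's own lookup.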

All remaining pilot code is wrapped in a try/except block. If an exception is caught, we can rely on the session object to exist and be valid, so we can tear down the whole RP stack via a <i>'session.close()'</i> call in the <i>'finally'</i> clause.
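
The teardown pattern described above can be sketched without RP. `DummySession` and `run_workload` are hypothetical stand-ins; unlike the real example, the sketch records the error instead of re-raising it so that both paths can be inspected.

```python
class DummySession:
    """Stand-in for a session object; not part of RP."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def run_workload(session, fail=False):
    """Run a fake workload and always close the session afterwards."""
    events = []
    try:
        events.append('work')
        if fail:
            raise RuntimeError('simulated failure')
    except Exception as e:
        # the real example re-raises here; the sketch only records the error
        events.append('error: %s' % e)
    finally:
        # runs on success and on failure alike
        session.close()
        events.append('closed')
    return events


ok_session  = DummySession()
ok_events   = run_workload(ok_session)
bad_session = DummySession()
bad_events  = run_workload(bad_session, fail=True)
```

In both runs the session ends up closed, which is the property the `finally` clause is there to guarantee.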

In [None]:
try:

    # read the config used for resource details
    report.info('read config')
    config = ru.read_json('../config.json')
    report.ok('>>ok\n')

    report.header('submit pilots')

    # Add a Pilot Manager. Pilot managers manage one or more Pilots.
    pmgr = rp.PilotManager(session=session)

    # Define one pilot per resource; each runs for 15 minutes, and its
    # project, queue, cores and gpus are taken from the config.
    # Here we use a dict to initialize the description object
    pdescs = list()
    for resource in resources:
        pd_init = {
                   'resource'      : resource,
                   'runtime'       : 15,  # pilot runtime (min)
                   'exit_on_error' : True,
                   'project'       : config[resource].get('project', None),
                   'queue'         : config[resource].get('queue', None),
                   'access_schema' : config[resource].get('schema', None),
                   'cores'         : config[resource].get('cores', 1),
                   'gpus'          : config[resource].get('gpus', 0),
                  }
        pdescs.append(rp.PilotDescription(pd_init))

    # Launch the pilots.
    pilots = pmgr.submit_pilots(pdescs)


    for gen in range(1):

        report.header('submit tasks [%d]' % gen)

        # Register the pilots in a TaskManager object.
        tmgr = rp.TaskManager(session=session)
        tmgr.add_pilots(pilots)

        # Create a workload of Tasks.
        # Each task reports the id of the pilot it runs on.

        n = 128   # number of tasks to run
        report.info('create %d task description(s)\n\t' % n)

        tds = list()
        for i in range(0, n):

            # create a new Task description, and fill it.
            # Here we don't use dict initialization.
            td = rp.TaskDescription()
            td.executable = '/bin/echo'
            td.arguments  = ['$RP_PILOT_ID']

            tds.append(td)
            report.progress()
        report.ok('>>ok\n')
        tasks = tmgr.submit_tasks(tds)
        report.header('gather results')
        tmgr.wait_tasks()

    report.info('\n')
    counts = dict()
    for task in tasks:
        # stdout may be unset if a task did not run, so fall back to ''
        out_str = (task.stdout or '').strip()[:35]
        report.plain('  * %s: %s, exit: %3s, out: %s\n'
                % (task.uid, task.state[:4],
                    task.exit_code, out_str))
        if out_str not in counts:
            counts[out_str] = 0
        counts[out_str] += 1

    report.info("\n")
    for out_str in counts:
        report.info("  * %-20s: %3d\n" % (out_str, counts[out_str]))
    report.info("  * %-20s: %3d\n" % ('total', sum(counts.values())))


except Exception as e:
    # Something unexpected happened in the pilot code above
    report.error('caught Exception: %s\n' % e)
    raise

except (KeyboardInterrupt, SystemExit):
    # A shutdown was requested, e.g. via Ctrl-C (KeyboardInterrupt) or a
    # call to sys.exit() (SystemExit).  Neither derives from Exception, so
    # both are handled separately here.
    report.warn('exit requested\n')

finally:
    # always clean up the session, no matter if we caught an exception or
    # not.  This will kill all remaining pilots.
    report.header('finalize')
    session.close(cleanup=False)

report.header()
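
The per-pilot tally at the end of the cell above can also be written with `collections.Counter`. The sketch below is plain Python on made-up output strings; `tally_by_pilot` is a hypothetical helper, and it assumes that the output of a task which did not run may be `None`.

```python
from collections import Counter


def tally_by_pilot(stdouts, width=35):
    """Count how many task outputs name each pilot ID."""
    counts = Counter()
    for out in stdouts:
        # tolerate missing output, then trim as in the example
        key = (out or '').strip()[:width]
        counts[key] += 1
    return counts


# Made-up outputs: two tasks on one pilot, one on another, one without output
sample_outputs = ['pilot.0000\n', 'pilot.0001\n', 'pilot.0000\n', None]
pilot_counts   = tally_by_pilot(sample_outputs)
```

A roughly even split between the pilot IDs is what one would expect if tasks are handed out in strict rotation.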