# Debugging a RADICAL-Pilot Application

RADICAL-Pilot is a complex runtime system which employs multiple distributed components to orchestrate workload execution.  It is also *research software*, funded by research grants.  As such, it is possibly not quite comparable to commercially supported software systems.

Also, RADICAL-Pilot targets mostly academic HPC environments and high-end machines which are usually at the cutting edge of hardware and software development.  Those machines thus usually have their own custom, sometimes peculiar, and evolving system environments.

All that is to say that it might be necessary to investigate various possible failure modes: failures related to the execution of your workload tasks, and possibly also failures related to RADICAL-Pilot's own operation.

This notebook attempts to guide you through different means to investigate possible failure modes.  That is not necessarily an intuitive process, but the notebook hopefully covers the most common problems.  We encourage you to seek support from the RCT developer community via TODO if the presented means prove insufficient.

## Investigating Task Failures

You created a task description and submitted your task, but it ends up in `FAILED` state.  On the API level, you can inspect the task's `stdout` and `stderr` values as follows:

In [23]:
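from unittest import mock

import radical.pilot as rp

# stand-in for a real task object (this notebook cell does not submit a task)
task = mock.Mock()

# `stdout` and `stderr` hold the captured output streams of the task
if task.state == rp.FAILED:
    print('%s stdout: %s' % (task.uid, task.stdout))
    print('%s stderr: %s' % (task.uid, task.stderr))
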

Note though that both values are truncated to 1024 characters.  If that is insufficient, you can still inspect the complete values on the file system of the target resource.  For that you would navigate to the task sandbox (whose location can be inspected via `task.sandbox`).

Assume you run a task with the following description, which passes an option that `/bin/date` does not accept:

In [10]:

td = rp.TaskDescription()
td.executable     = '/bin/date'
td.arguments      = ['-O']


The task sandbox usually contains a set of files similar to the example shown below.  The `<task.uid>.out` and `<task.uid>.err` files will have captured the task's stdout and stderr streams, respectively:

In [33]:

%cd /home/merzky/radical.pilot.sandbox/rp.session.rivendell.merzky.019489.0008/pilot.0000/task.000000
!ls
!cat task.000000.err


/mnt/home/merzky/radical.pilot.sandbox/rp.session.rivendell.merzky.019489.0008/pilot.0000/task.000000
task.000000.err      task.000000.launch.out  task.000000.out
task.000000.exec.sh  task.000000.launch.sh   task.000000.prof
/bin/date: invalid option -- 'O'
Try '/bin/date --help' for more information.
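
When the 1024-character excerpts are not enough, the full streams can also be read programmatically from the sandbox files listed above.  The following is a minimal sketch, assuming the task sandbox is reachable as a local directory and that the streams are stored as `<task.uid>.out` and `<task.uid>.err`, as in the listing.  `read_task_streams` is a hypothetical helper and not part of the RADICAL-Pilot API.

```python
from pathlib import Path


def read_task_streams(sandbox, uid):
    """Return the full (stdout, stderr) text captured in a task sandbox.

    Hypothetical helper: assumes `sandbox` is a local directory and that the
    streams are stored as `<uid>.out` and `<uid>.err`.  A missing file is
    reported as None.
    """
    streams = []
    for ext in ('out', 'err'):
        path = Path(sandbox) / ('%s.%s' % (uid, ext))
        streams.append(path.read_text() if path.is_file() else None)
    return tuple(streams)
```

Called with the sandbox directory shown above and the uid `task.000000`, the second element of the returned tuple would hold the complete `date` error message.  Note that `task.sandbox` may be reported as a URL rather than as a plain path, in which case the path component has to be extracted first.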


A very common cause of task failures is an invalid environment setup: scientific applications frequently require software modules to be loaded, virtual environments to be activated, etc.  Those actions are specified in the task description's `pre_exec` statements.  You may want to inspect `<task.uid>.exec.sh` in the task sandbox to check whether the environment setup is indeed what you expect it to be.
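
That check can be partially automated.  The sketch below assumes that each `pre_exec` statement appears verbatim somewhere in the `<task.uid>.exec.sh` script; this is an assumption about the script layout, so an entry reported as missing should be confirmed by reading the script.  `missing_setup_commands` is a hypothetical helper, not part of RADICAL-Pilot.

```python
from pathlib import Path


def missing_setup_commands(script, expected):
    """Return the expected commands which do not appear in the script text.

    Hypothetical helper: performs a plain substring search and does not
    interpret the shell script in any way.
    """
    text = Path(script).read_text()
    return [cmd for cmd in expected if cmd not in text]
```

A call such as `missing_setup_commands('task.000000.exec.sh', td.pre_exec)` would then return an empty list if all environment setup commands made it into the script.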

## Investigating RADICAL-Pilot Failures

If the investigation of the task sandbox did not yield any clues as to the origin of the failure, but your task still ends up in `FAILED` state, or if RP itself fails in any other way, we suggest the following steps, in the given order, to investigate the problem further.

First, check the client-side session sandbox for any `ERROR` log messages or error messages in general:


In [39]:

%cd /home/merzky/j/rp.3/rp.session.rivendell.merzky.019489.0008
! grep 'ERROR' *log
! ls -l *.out *.err


/mnt/home/merzky/radical/radical.pilot.3/rp.session.rivendell.merzky.019489.0008
-rw-rw-r-- 1 merzky merzky 0 May 12 19:02 control_pubsub.err
-rw-rw-r-- 1 merzky merzky 0 May 12 19:02 control_pubsub.out
-rw-rw-r-- 1 merzky merzky 0 May 12 19:02 log_pubsub.err
-rw-rw-r-- 1 merzky merzky 0 May 12 19:02 log_pubsub.out
-rw-rw-r-- 1 merzky merzky 0 May 12 19:02 pmgr_launching.0000.err
-rw-rw-r-- 1 merzky merzky 0 May 12 19:02 pmgr_launching.0000.out
-rw-rw-r-- 1 merzky merzky 0 May 12 19:02 pmgr_launching_queue.err
-rw-rw-r-- 1 merzky merzky 0 May 12 19:02 pmgr_launching_queue.out
-rw-rw-r-- 1 merzky merzky 0 May 12 19:02 stager.0000.err
-rw-rw-r-- 1 merzky merzky 0 May 12 19:02 stager.0000.out
-rw-rw-r-- 1 merzky merzky 0 May 12 19:02 stager_request_queue.err
-rw-rw-r-- 1 merzky merzky 0 May 12 19:02 stager_request_queue.out
-rw-rw-r-- 1 merzky merzky 0 May 12 19:02 stager_response_pubsub.err
-rw-rw-r-- 1 merzky merzky 0 May 12 19:02 stager_response_pubsub.out
-rw-rw-r-- 1 merzky merzky 0 

You would expect no `ERROR` lines to show up in the log files, and all stdout/stderr files of the RP components to be empty.
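
The same check can be scripted, which is convenient when several sandboxes need to be inspected.  The following is a minimal sketch that mirrors the `grep 'ERROR' *log` and `ls -l *.out *.err` commands above for a single directory.  `scan_sandbox` is a hypothetical helper name and not part of RADICAL-Pilot.

```python
from pathlib import Path


def scan_sandbox(sandbox):
    """Collect 'ERROR' log lines and non-empty stdio files of one sandbox.

    Returns two lists: (file name, line number, line) tuples for all log
    lines containing 'ERROR', and (file name, size in bytes) tuples for all
    `*.out` and `*.err` files which are not empty.
    """
    sandbox = Path(sandbox)

    error_lines = []
    for log in sorted(sandbox.glob('*log')):
        if not log.is_file():
            continue
        with log.open(errors='replace') as fin:
            for num, line in enumerate(fin, 1):
                if 'ERROR' in line:
                    error_lines.append((log.name, num, line.rstrip()))

    non_empty = []
    for pattern in ('*.out', '*.err'):
        for path in sorted(sandbox.glob(pattern)):
            size = path.stat().st_size
            if path.is_file() and size > 0:
                non_empty.append((path.name, size))

    return error_lines, non_empty
```

For a healthy client-side sandbox both returned lists would be empty.  In a pilot sandbox some bootstrapper output files are expected to be non-empty, so entries in the second list are hints to look at, not necessarily errors.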

The next step is to repeat that process in the pilot sandbox:

In [38]:

%cd /home/merzky/radical.pilot.sandbox/rp.session.rivendell.merzky.019489.0008/pilot.0000/
! grep 'ERROR' *log
! ls -l *.out *.err 


/mnt/home/merzky/radical.pilot.sandbox/rp.session.rivendell.merzky.019489.0008/pilot.0000
-rw-rw-r-- 1 merzky merzky    0 May 12 19:02 agent.0.bootstrap_2.err
-rw-rw-r-- 1 merzky merzky    1 May 12 19:02 agent.0.bootstrap_2.out
-rw-rw-r-- 1 merzky merzky    0 May 12 19:02 agent.0.err
-rw-rw-r-- 1 merzky merzky    0 May 12 19:02 agent.0.out
-rw-rw-r-- 1 merzky merzky    0 May 12 19:02 agent_executing.0000.err
-rw-rw-r-- 1 merzky merzky    0 May 12 19:02 agent_executing.0000.out
-rw-rw-r-- 1 merzky merzky    0 May 12 19:02 agent_executing_queue.err
-rw-rw-r-- 1 merzky merzky    0 May 12 19:02 agent_executing_queue.out
-rw-rw-r-- 1 merzky merzky    0 May 12 19:02 agent_schedule_pubsub.err
-rw-rw-r-- 1 merzky merzky    0 May 12 19:02 agent_schedule_pubsub.out
-rw-rw-r-- 1 merzky merzky    0 May 12 19:02 agent_scheduling.0000.err
-rw-rw-r-- 1 merzky merzky    0 May 12 19:02 agent_scheduling.0000.out
-rw-rw-r-- 1 merzky merzky    0 May 12 19:02 agent_scheduling_queue.err
-rw-rw-r-- 1 merzky 


Here you will always find `bootstrap_0.out` to be populated with the output of RP's shell bootstrapper.  If no other errors in the log or stdio files show up, you may want to look at that `bootstrap_0.out` output to see if and why the pilot bootstrapping failed.
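
The bootstrapper output can be long, and the last lines of the file are often a reasonable place to start looking.  The following sketch returns the last lines of a text file; `tail` is a hypothetical helper and the default of 20 lines is an arbitrary choice.

```python
from collections import deque


def tail(path, n=20):
    """Return the last `n` lines of a text file as a list of strings."""
    with open(path, errors='replace') as fin:
        # a bounded deque keeps only the most recent `n` lines in memory
        return [line.rstrip('\n') for line in deque(fin, maxlen=n)]
```

Within the pilot sandbox, `print('\n'.join(tail('bootstrap_0.out')))` would show the end of the bootstrapper output.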


## Ask for Help from the RADICAL Team

If neither of the above steps provided any insight into the causes of the observed failures, please execute the following steps:

  - create a tarball of the client-side session sandbox
  - create a tarball of the session sandbox on the target resource (the directory which contains the pilot sandbox)
  - open an issue at https://github.com/radical-cybertools/radical.pilot/issues/new and attach both tarballs
  - describe the observed problem and include the following additional information:
    - output of the `radical-stack` command
    - information about any changes to the resource configuration of the target resource
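
The tarballs can be created with any tool.  As a minimal sketch using Python's standard `tarfile` module, the hypothetical helper `make_tarball` below packs one sandbox directory into a gzip-compressed archive.  The optional `tarball_name` argument is useful because the sandboxes to be packed may share the same session ID as directory name.

```python
import tarfile
from pathlib import Path


def make_tarball(sandbox, target_dir='.', tarball_name=None):
    """Pack a sandbox directory into a gzip-compressed tarball.

    The archive is written to `target_dir` and is named after the sandbox
    directory unless `tarball_name` is given.  Returns the tarball path.
    """
    sandbox = Path(sandbox)
    if tarball_name is None:
        tarball_name = sandbox.name + '.tar.gz'
    tarball = Path(target_dir) / tarball_name
    with tarfile.open(tarball, 'w:gz') as tar:
        # store entries relative to the sandbox's parent directory
        tar.add(sandbox, arcname=sandbox.name)
    return tarball
```

The helper would be run once per sandbox, on the machine where that sandbox resides, and the resulting files attached to the issue.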
    
We will likely be able to infer the cause of the problem from the provided sandbox tarballs and will be happy to help you correct it; otherwise we will ask for further information about the environment your application is running in.