
Issue with multiple subagents on Summit #2186

Closed
mturilli opened this issue Jul 15, 2020 · 7 comments

@mturilli
Contributor

mturilli commented Jul 15, 2020

Stack:

  python               : 3.7.0

  radical.entk         : 1.4.1.post1
  radical.pilot        : 1.4.0-v1.4.0-49-gfac8e06@hotfix-prte_profiling
  radical.saga         : 1.4.0-v1.4.0-10-ga942e0f@devel
  radical.utils        : 1.4.0-v1.3.1-73-ga7accc7@devel

Currently, subagent support is broken. Debugging is ongoing; here are some preliminary results.

Issue with prte.py

PRTE sets a logical cvd_id_mode on Summit but a physical one on Lassen. The switch to physical mode broke PRRTE on Summit. Workaround:

        lm_info = {
                   'dvm_uri'     : dvm_uri,
                   'version_info': prte_info,
                  # 'cvd_id_mode' : 'physical'
                   'cvd_id_mode' : 'logical'
                  }

Issues with bootstrap_2.sh:

  1. The module function is not available on the batch/work nodes, therefore module load commands fail.
  2. RADICAL_BASE is undefined.

Workaround for issues 1 and 2 in src/radical/pilot/configs/resource_ornl.json:

...
"pre_bootstrap_1" : [
    ". /sw/summit/lmod/lmod/init/profile",
    "export RADICAL_BASE=/gpfs/alpine/scratch/mturilli1/csc343/radical.pilot.sandbox/",
    "export RADICAL_UTILS_BASE=/gpfs/alpine/scratch/mturilli1/csc343/radical.pilot.sandbox/",
...

Ongoing

We can successfully run bootstrap_2.sh by hand from the batch node, but when it is executed by RP, the subagent(s) do not come online.

@andre-merzky
Member

Thanks Matteo! Some notes:

RADICAL_BASE should always be exported in the bootstrapper. I had the impression we do that actually, need to check where the code went.

The module env sourcing is nothing we can fix short of including your workaround into the resource config. It's a bit messy, but at least the path seems to be somewhat generic (i.e., stable) (fingers crossed).

Finally, the cvd_id_mode is only used by jsrun, and we should move its setting into the jsrun LM. I think it is not set because the launch_method_hook is not executed for the agent_launch_method - let's discuss on the devel call whether we should also execute it here.

@andre-merzky
Member

Another note: can you use #!/bin/sh -x as the shebang of bootstrap_2.sh? That should help track the script's progress. Sorry, I did not think of that yesterday.
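For reference, the same tracing can also be enabled inside the script with `set -x` (a generic shell sketch, equivalent in effect to the shebang change):

```shell
# Equivalent in effect to the `#!/bin/sh -x` shebang: xtrace makes the shell
# print each command (prefixed with '+') to stderr before executing it, so
# the script's progress is visible in the captured output.
set -x
echo "starting bootstrap"
set +x
```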

BTW: another issue we saw: there is a naming mixup for the bootstrap stages: pre_bootstrap_1 is applied to bootstrap_2 - that should be fixed (the config should change, bootstrap_1 is reserved for the upcoming partition step).

@mturilli
Contributor Author

mturilli commented Jul 17, 2020

I did the following:

  • Ran bootstrap_2.sh with -x and -xv. Nothing relevant was observed; same problem as before.
  • Ran agent.n.sh with -x. Nothing relevant was observed; same problem as before.
  • Manually ran bootstrap_2.sh with -x from the batch node, via the command created by agent_0.py. Both subagents were created, apparently successfully. Note that despite their creation, tasks did not execute.
  • Manually ran bootstrap_2.sh on the work nodes, line by line. Also in this case, subagents were created, apparently successfully.
  • Eliminated exec from bootstrap_2.sh. No errors were observed, but also in this case subagents were not created.
  • Eliminated exec from agent.n.sh. I was not able to test this because of the time taken to start agent_0 on Summit when creating a new virtualenv.

Debugging the agent is difficult for the following reasons:

  • Runtime creation of multiple executables via bootstrap_0.sh and agent_0.py. Both codes do a lot and are somewhat organic; they end up hiding, in variables that need to be hunted down, multiple parameters used to create the executables.
  • Logs do not contain information about which file created them. This is a major problem and cost multiple hours of grepping and dead ends to find out where the code to debug resides.
  • Multiple levels of caching, both in compiled Python that needs to be manually deleted and in the pre-existing environment for the agent.
  • Currently, starting the agent on Summit without a pre-existing virtualenv takes more than 1 hour. I am now waiting to see whether 2 hours are enough.
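On the caching point, stale Python bytecode can be cleared with a generic sweep like the following (a sketch, not an RP-provided command):

```shell
# Generic sketch for clearing stale Python bytecode caches under the current
# tree; not RP-specific tooling. Removes __pycache__ directories and any
# stray .pyc files so the next run re-imports fresh sources.
find . -name '__pycache__' -type d -prune -exec rm -rf {} +
find . -name '*.pyc' -type f -delete
```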

Analysis of the logs (apparently) produced by PRRTE seems to indicate that the failure to start the sub-agents does not depend on a failure to create the DVM: the logs indicate that it is successfully bootstrapped. I did not have the time to test whether the DVM is actually working.

@andre-merzky
Member

Thanks for the update, Matteo! Good point about including a comment on the source of created scripts. I also don't think that the startup problem is related to the DVM. I would guess that an environment issue is at play (again), since identical commands behave differently in interactive and non-interactive mode...

So another thing you could test: add this to bootstrap_2.sh near the top: env | sort > env.bs2. Then compare the resulting files for the interactive and non-interactive runs. Maybe manually eliminate differences one by one until the behavior is the same.
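The comparison step can be sketched like this (the two `printf` lines are illustrative stand-ins for the `env | sort > env.bs2` capture from each run mode; the file names are hypothetical):

```shell
# Illustrative demo of the env-diff technique suggested above: capture one
# sorted environment snapshot per run mode, then diff them line by line.
printf 'HOME=/ccs/home/user\nPATH=/usr/bin\n' > env.bs2.interactive
printf 'HOME=/ccs/home/user\nPATH=/bin\n'     > env.bs2.batch
diff env.bs2.interactive env.bs2.batch || true  # diff exits non-zero when files differ
```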

@mturilli
Contributor Author


I have:

  • Verified that the DVM works correctly.
  • Added -vvv to all ssh calls.
  • Created dedicated log files.
  • Replaced Popen with placeholder commands.

Upon inspection, it became apparent that the ssh commands used to launch the agent were not executed. Andre checked whether there were differences with the code we used to successfully run multiple agents for the previous paper. It turned out that, in the transition from Python 2 to 3, we had lost some code: specifically, the code in agent_0.py that executes the ssh commands.

We have:

  • Committed a patch with the missing code.
  • Ran a test with the new code base.

The subagents are now coming up, but they are trying to write to a read-only file system (on Summit):

Traceback (most recent call last):
  File "/gpfs/alpine/scratch/mturilli1/csc343/radical.pilot.sandbox/rp.session.login2.mturilli1.018466.0008/pilot.0000/rp_install/bin/radical-pilot-agent", line 12, in <module>
    import radical.utils as ru
  File "/gpfs/alpine/scratch/mturilli1/csc343/radical.pilot.sandbox/rp.session.login2.mturilli1.018466.0008/pilot.0000/rp_install/lib/python3.7/site-packages/radical/utils/__init__.py", line 14, in <module>
    from .plugin_manager import PluginManager
  File "/gpfs/alpine/scratch/mturilli1/csc343/radical.pilot.sandbox/rp.session.login2.mturilli1.018466.0008/pilot.0000/rp_install/lib/python3.7/site-packages/radical/utils/plugin_manager.py", line 15, in <module>
    from .logger    import Logger
  File "/gpfs/alpine/scratch/mturilli1/csc343/radical.pilot.sandbox/rp.session.login2.mturilli1.018466.0008/pilot.0000/rp_install/lib/python3.7/site-packages/radical/utils/logger.py", line 45, in <module>
    from   .debug     import import_module    as ru_import_module
  File "/gpfs/alpine/scratch/mturilli1/csc343/radical.pilot.sandbox/rp.session.login2.mturilli1.018466.0008/pilot.0000/rp_install/lib/python3.7/site-packages/radical/utils/debug.py", line 16, in <module>
    from .ids     import generate_id
  File "/gpfs/alpine/scratch/mturilli1/csc343/radical.pilot.sandbox/rp.session.login2.mturilli1.018466.0008/pilot.0000/rp_install/lib/python3.7/site-packages/radical/utils/ids.py", line 89, in <module>
    _BASE        = get_radical_base('utils')
  File "/gpfs/alpine/scratch/mturilli1/csc343/radical.pilot.sandbox/rp.session.login2.mturilli1.018466.0008/pilot.0000/rp_install/lib/python3.7/site-packages/radical/utils/misc.py", line 880, in get_radical_base
    rec_makedir(base)
  File "/gpfs/alpine/scratch/mturilli1/csc343/radical.pilot.sandbox/rp.session.login2.mturilli1.018466.0008/pilot.0000/rp_install/lib/python3.7/site-packages/radical/utils/misc.py", line 893, in rec_makedir
    os.makedirs(target)
  File "/gpfs/alpine/csc343/scratch/mturilli1/radical.pilot.sandbox/ve.ornl.summit_prte.1.4.1/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/ccs/home/mturilli1/.radical/utils/'
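One plausible workaround, in line with the RADICAL_BASE export used earlier in this thread (that radical.utils honors RADICAL_BASE here is an assumption to verify): point the base at a writable scratch area before the agent starts, so nothing is created under $HOME.

```shell
# Hypothetical workaround sketch: point RADICAL_BASE at a writable scratch
# area so radical.utils creates its state directories there instead of under
# the read-only $HOME on Summit compute nodes. SCRATCH is a stand-in for the
# site scratch path (e.g. /gpfs/alpine/scratch/<user>/<project>).
export RADICAL_BASE="${SCRATCH:-$PWD}/radical.pilot.sandbox"
mkdir -p "$RADICAL_BASE"
```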

I am going to address that and report back. Meanwhile, here are some notes about why I found debugging the agent time consuming:

  • Runtime creation of multiple executables via bootstrap_0.sh and agent_0.py. Both codes do a lot and are somewhat organic; they end up hiding, in variables that need to be hunted down, multiple parameters used to create the executables. There is probably scope to add comments about links among parts of the code that reside in different files. For example, explicitly linking agent_0.py to launch_methods/* would have helped.
  • Logs do not contain information about which file created them. This is a major problem and cost multiple hours of grepping and dead ends to find out where the code to debug resides.
  • Multiple levels of caching, both in compiled Python that needs to be manually deleted and in the pre-existing environment for the agent.
  • The semantics of the logs are unclear: I assumed for a long time that the entry in <session_id>.log about the ssh command indicated that the command had been executed. It was instead just logging the command string, not its execution. Logs about the actual execution of that command were missing.
  • Together, the lack of information about the file in which a log entry was generated and the unclear semantics of what action an entry indicates point to the need to redefine the log format. I would be happy to propose a schema based on one of the (many) existing standards, especially one for which parsers already exist.

@mturilli
Contributor Author

This is now working.

@andre-merzky
Member

Thanks again for digging through this!
