ssh_exchange_identification: Connection closed by remote host (RP v0.40) #1214

Closed
mingtaiha opened this issue Feb 13, 2017 · 38 comments

@mingtaiha
Contributor

When I try to run on Comet, I always get a few units which fail. The failing units give the following error:

ssh_exchange_identification: Connection closed by remote host

While this issue has been addressed in issue #1105, the solution was to upgrade to a newer version of RP. However, I am using the experiment/aimes stack of RADICAL Cybertools to run XSEDE/OSG experiments. Is there a solution to this issue without using a new version of RP?

This is the RP stack I am running.

python            : 2.7.6
virtualenv        : /home/mingtha/ve/ve.exec_model_exp
radical.utils     : v0.44.RC1-11-ge108e90@experiment-aimes
saga-python       : v0.44-137-g98a692ed@experiment-aimes
radical.pilot     : v0.40.1-241-g8334834d@experiment-aimes
radical.analytics : v0.1-187-g56ecce7@experiment-aimes_nodata
@vivek-bala
Contributor

vivek-bala commented Feb 13, 2017

IIRC this is because of multiple CUs (and hence processes) running on the same node. If you increase the number of cores per CU, you shouldn't see this issue. I don't exactly remember what the number was (might have been 4 cores per CU), but I remember being able to avoid this issue by using an entire node for 1 CU (=24 cores per CU).

@iparask
Contributor

iparask commented Feb 13, 2017

At least 4 per node.

@andre-merzky
Member

Yeah, @vivek-bala is right, we are running out of ssh connections to start CUs when we run too many concurrent CUs per node.

But the agent's orte startup methods should actually resolve this already! For that, you should use xsede.comet_orte instead of xsede.comet as resource label. That should also be available in the experiment/aimes branch. I don't think it has seen much exposure there, so please let us know how that goes!

@vivek-bala
Contributor

Hey @iparask, just to confirm: the limit is 4 CUs per node (= 6 cores per CU)?

@iparask
Contributor

iparask commented Feb 13, 2017

I was doing 6 CUs per node (4 cores per CU)
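
The figures in this exchange are consistent with 24 cores per Comet node (the "entire node for 1 CU (=24 cores per CU)" remark above). A minimal sketch of that arithmetic, assuming 24 cores per node and that CUs do not share cores; `cus_per_node` is a hypothetical helper, not part of RP:

```python
def cus_per_node(cores_per_cu, cores_per_node=24):
    """Return how many CUs can run concurrently on one node."""
    if cores_per_cu < 1 or cores_per_cu > cores_per_node:
        raise ValueError("cores_per_cu must be between 1 and cores_per_node")
    # Integer division: leftover cores cannot host another full CU.
    return cores_per_node // cores_per_cu


for cores in (1, 4, 6, 24):
    print("%2d cores per CU: %2d CUs per node" % (cores, cus_per_node(cores)))
```

Fewer concurrent CUs per node means fewer simultaneous ssh connections to that node, which is why raising the cores per CU avoided the error here.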

@ibethune ibethune added this to the Future Release milestone Feb 14, 2017
@ibethune
Contributor

Just to be clear, if this concerns the experiment/aimes stack, then it's not a concern for the 0.45 release (correct me if wrong, please)!

@andre-merzky
Member

This likely also affects v0.45, in the sense that the ssh startup method has the same limitations there. But the xsede.comet_orte config is also available in v0.45, so the solution should be the same. It is not the default though, as it is less tested (on comet). I don't have an opinion on whether the default should be changed -- let's wait for Ming to confirm whether it works/helps in the first place...

@vivek-bala
Contributor

In v0.45, ORTE is the default (xsede.comet). The ssh label is xsede.comet_ssh.
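
To keep the labels straight, here is a small lookup that only restates what has been said in this thread; it is not read from RP's resource configs, and `comet_label` and the table are hypothetical names. The experiment/aimes entries assume that plain xsede.comet uses the ssh startup method there, which is implied but not stated outright above:

```python
# Resource labels for Comet as reported in this thread (not verified
# against the RP resource configuration files).
COMET_LABELS = {
    "experiment/aimes": {"ssh": "xsede.comet", "orte": "xsede.comet_orte"},
    "v0.45": {"orte": "xsede.comet", "ssh": "xsede.comet_ssh"},
}


def comet_label(stack, launch_method):
    """Return the Comet resource label for a stack and launch method."""
    try:
        return COMET_LABELS[stack][launch_method]
    except KeyError:
        raise ValueError("no label known for %r / %r" % (stack, launch_method))


print(comet_label("experiment/aimes", "orte"))
print(comet_label("v0.45", "ssh"))
```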

@mingtaiha
Contributor Author

I tried running using xsede.comet_orte config, but I got the following error from agent_0.log in the RP sandbox on Comet:

Traceback (most recent call last):
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 146, in __init__
    rp.agent.LM.lrms_config_hook(lm, self._cfg, self, self._log))
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/lm/base.py", line 153, in lrms_config_hook
    return impl.lrms_config_hook(name, cfg, lrms, logger)
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/lm/orte.py", line 46, in lrms_config_hook
    raise Exception("Couldn't find orte-dvm")
Exception: Couldn't find orte-dvm
2017-02-14 17:00:50,139: agent_0             : MainProcess                     : MainThread     : ERROR   : TERM : agent_0 except in start
Traceback (most recent call last):
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 713, in start
    self._initialize_parent()
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 412, in _initialize_parent
    self.initialize_parent()
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/agent_0.py", line 112, in initialize_parent
    session=self._session)
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 214, in create
    return impl(cfg, session)
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/slurm.py", line 20, in __init__
    LRMS.__init__(self, cfg, session)
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 146, in __init__
    rp.agent.LM.lrms_config_hook(lm, self._cfg, self, self._log))
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/lm/base.py", line 153, in lrms_config_hook
    return impl.lrms_config_hook(name, cfg, lrms, logger)
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/lm/orte.py", line 46, in lrms_config_hook
    raise Exception("Couldn't find orte-dvm")
Exception: Couldn't find orte-dvm

I also get the following error in the agent_0.err file:

Traceback (most recent call last):
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/bin/radical-pilot-agent", line 42, in <module>
    rp.agent.bootstrap_3(agent_name)
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/bootstrap_3.py", line 48, in bootstrap_3
    agent.start(spawn=False)
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 717, in start
    ru.cancel_main_thread()
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/utils/threads.py", line 285, in cancel_main_thread
    thread.interrupt_main()
TypeError: 'int' object is not callable

@andre-merzky
Member

Hmm, this can't be devel, that would have barfed earlier on the incorrect resource label, so I guess this is experiment/aimes? Either way, we are updating the ompi installation there, see #1218.

@andre-merzky
Member

PS.: the TypeError: 'int' object is not callable is an artifact of Python's way to terminate threads:
https://bugs.python.org/issue23395

@mturilli
Contributor

It should indeed be the experiment/aimes branch. Ming's experiments on XSEDE should be:

  • comparable to those I have been running on OSG;
  • usable for the AIMES-Experience paper;
  • useful to get the split-module branch to work on XSEDE.

@andre-merzky
Member

I updated the ORTE installation on comet, and the xsede resource config. Can you please try again? thanks!

@ibethune ibethune modified the milestones: 0.45, Future Release Feb 17, 2017
@mingtaiha
Contributor Author

I just tried it again, but I get the following errors. According to the JSON file, the pilots enter the AGENT_EXECUTING stage. The errors seem to indicate that the agent is looking for ORTE-specific commands in my sandbox but is unable to find them. I will move my sandbox into another directory and give that a try. What do you think it could be, Andre?

From agent_0.log:

Traceback (most recent call last):
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/lm/orte.py", line 168, in lrms_shutdown_hook
    raise Exception("Couldn't find orte-submit")
Exception: Couldn't find orte-submit

From agent_1.log:

Traceback (most recent call last):
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/lm/base.py", line 114, in create
    return impl(cfg, session)
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/lm/orte.py", line 24, in __init__
    LaunchMethod.__init__(self, cfg, session)
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/lm/base.py", line 60, in __init__
    raise RuntimeError("Launch command not found for LaunchMethod '%s'" % self.name)
RuntimeError: Launch command not found for LaunchMethod 'ORTE'
2017-02-16 18:48:54,486: agent_1             : agent.executing.0               : MainThread     : ERROR   : LaunchMethod cannot be used: Launch command not found for LaunchMethod 'ORTE'!
Traceback (most recent call last):
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/lm/base.py", line 114, in create
    return impl(cfg, session)
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/lm/orte.py", line 24, in __init__
    LaunchMethod.__init__(self, cfg, session)
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/lm/base.py", line 60, in __init__
    raise RuntimeError("Launch command not found for LaunchMethod '%s'" % self.name)

From agent_executing.0.child.log:

Traceback (most recent call last):
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/executing/popen.py", line 167, in _handle_unit
    raise RuntimeError("no launcher (mpi=%s)" % cu['description']['mpi'])
RuntimeError: no launcher (mpi=False)
2017-02-16 18:48:55,256: agent.executing.0.child: agent.executing.0               : MainThread     : ERROR   : worker <bound method Popen.work of <Popen(agent.executing.0, started)>> failed
Traceback (most recent call last):
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 1354, in run
    self._workers[state](things)
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/executing/popen.py", line 151, in work
    self._handle_unit(unit)
  File "/home/mingtha/radical.pilot.sandbox/ve_comet/rp_install/lib/python2.7/site-packages/radical/pilot/agent/executing/popen.py", line 184, in _handle_unit
    % (str(e), traceback.format_exc())
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'

@andre-merzky
Member

Thanks Ming, I'll look into it!

@andre-merzky
Member

So this is caused by the switch from orte-submit to orterun in the orte LM. I will need to cherry-pick some commits to get this working again in experiment/aimes with the current ORTE stack.
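
For anyone hitting the same "Couldn't find ..." errors, a quick way to see which of the ORTE commands named in this thread are visible is to search PATH for them in the environment the agent runs in (that is, after the relevant modules are loaded). A minimal sketch that works on Python 2.7 and 3; `find_on_path` and `locate_commands` are hypothetical helpers and do not reproduce how RP's orte LM performs its own lookup:

```python
import os

# Command names taken from the error messages and comments in this thread.
ORTE_COMMANDS = ("orte-dvm", "orte-submit", "orterun")


def find_on_path(name, path=None):
    """Return the first executable called `name` on the search path, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def locate_commands(names, path=None):
    """Map each command name to its location, or None if it was not found."""
    return dict((name, find_on_path(name, path)) for name in names)


if __name__ == "__main__":
    for name in ORTE_COMMANDS:
        print("%-12s %s" % (name, find_on_path(name) or "NOT FOUND"))
```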

@ibethune
Contributor

ibethune commented Feb 20, 2017

So given that the new OpenMPI stack is installed on Comet, please confirm whether the example scripts are working with 0.45.RC2 (and if so, this ticket can be retargeted to the next release to solve problems relating to the experiment/aimes branch code).

@mingtaiha
Contributor Author

Sure. I'm already doing the testing, so I'll let you know.

@mingtaiha
Contributor Author

@andre-merzky, is there anything I can help with?

@andre-merzky
Member

nah - but thanks for the kind reminder... :)

@ibethune
Contributor

So according to the testing spreadsheet, everything except MPI units ( #1239 ) is working on Comet with the new OpenMPI installation, so let's leave this ticket to get the split-module branch working on XSEDE.

@ibethune ibethune modified the milestones: Future Release, 0.45 Feb 21, 2017
@ibethune ibethune added this to the 0.46 milestone Mar 2, 2017
@ibethune ibethune removed this from the Future Release milestone Mar 2, 2017
@mturilli
Contributor

mturilli commented Mar 3, 2017

Ping. Ming reports that on the experiment/aimes branch, orte also fails for non-MPI units. This is now blocking Ming's experiments for his paper and for the AIMES Experience one. We may want to have a look at it relatively soon.

@andre-merzky
Member

Hey Ming, Matteo - I am not getting jobs through the queue on comet unfortunately. Will stay on it. I assume that this is either a configuration issue, or we hit a process limit when creating the orterun children. The first is hopefully easy to fix, the second will probably mean that we need to switch to ortelib on comet.

@mingtaiha
Contributor Author

mingtaiha commented Mar 7, 2017

Hey Andre, I am using the following stack to run my experiments:

python            : 2.7.6
radical.analytics : v0.1-137-g05f47f3@devel
radical.pilot     : split-18-gbf2e0a3b@devel
radical.utils     : v0.45-2-g82050c5@devel
saga-python       : split-4-gaa285ca4@devel
virtualenv        : /home/mingtha/ve/ve.exec_model_exp_devel

I was able to run my experiments successfully on SuperMIC, but they do not work on Comet (ORTE). I have a virtualenv which I source in the pre_exec in order to use radical.synapse. When the units begin Executing, however, there is an ImportError: radical.synapse cannot be imported. I was originally able to do this with SSH in RP 0.40. Is there a way to get around it?
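
One way to narrow down this kind of ImportError is to run a small check as the unit itself and see which interpreter is active and whether it can import the package. A minimal sketch that runs on Python 2.7 and 3; `module_available` is a hypothetical helper, not part of RP or radical.synapse:

```python
import sys


def module_available(name):
    """Return True if the running interpreter can import `name`."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # sys.executable and sys.prefix show which Python (and which
    # virtualenv, if any) is actually in effect for this process.
    print("interpreter: %s" % sys.executable)
    print("prefix     : %s" % sys.prefix)
    print("radical.synapse importable: %s" % module_available("radical.synapse"))
```

If the reported interpreter is not the one from the virtualenv sourced in pre_exec, the pre_exec step did not take effect for that unit.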

@iparask
Contributor

iparask commented Mar 7, 2017

Install it to radical.pilot.sandbox/ve_comet. It should work.

@mingtaiha
Contributor Author

I would prefer not to touch the sandbox on which RP runs if possible, lest the dependencies of radical.synapse not mesh well with those of RP.

@andre-merzky
Member

Ming,

please run the following commands on comet:

module load python
source ~/radical.pilot.sandbox/ve_comet/bin/activate
module use --append /home/amerzky/ompi/modules/
module load openmpi/2017_02_17_6da4dbb
pip install orte_cffi

Let me know if that gives any errors. Once done, please use the feature/comet_ortelib branch, and you should be able to use the xsede.comet_ortelib resource tag.

Re synapse: I usually create a separate virtualenv for synapse (~/ve_synapse/ or whatever), and activate that one in the unit's pre-exec. Now, ortelib does not support pre- and post-exec, as we discussed, so in that case, I usually create a small shell script like:

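#!/bin/sh
# load the system python, then activate the separate synapse virtualenv
module load python
. "$HOME/ve_synapse/bin/activate"
# forward all arguments unchanged to synapse
radical-synapse-sample "$@"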

and then call that via

import radical.pilot as rp

cud = rp.ComputeUnitDescription()
cud.executable = '/bin/sh'
cud.arguments = ['-i', '$HOME/profiles/95.json']

or whatever I want to emulate via synapse.
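
The wrapper approach can be sketched end to end as follows. This is only an illustration: `write_wrapper` and `unit_command` are hypothetical helpers, and passing the wrapper's path as the first argument to /bin/sh is an assumption about how the script is meant to be invoked, since the snippet above does not show where the script path goes.

```python
import os
import stat

# Wrapper content, following the shell script shown above.
WRAPPER = """#!/bin/sh
module load python
. "$HOME/ve_synapse/bin/activate"
radical-synapse-sample "$@"
"""


def write_wrapper(path, content=WRAPPER):
    """Write the wrapper script to `path` and make it executable."""
    with open(path, "w") as handle:
        handle.write(content)
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)
    return path


def unit_command(wrapper_path, synapse_args):
    """Return (executable, arguments) that run the wrapper through /bin/sh."""
    return "/bin/sh", [wrapper_path] + list(synapse_args)
```

The returned pair would then be assigned to `cud.executable` and `cud.arguments`, for example `unit_command('synapse_wrapper.sh', ['-i', '$HOME/profiles/95.json'])`, where the script name is again only a placeholder.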

@mingtaiha
Contributor Author

I can't run module use --append /home/amerzky/ompi/modules/ because I don't have an ompi/modules folder. Where did you get ompi/modules from?

@vivek-bala
Contributor

IIUC, /home/amerzky/ompi/modules/ is available in Andre's account. You don't need to have it in your specific account. module use --append /home/amerzky/ompi/modules/ just tells the system to look inside /home/amerzky/ompi/modules/ for any user-created modules.

@marksantcroos
Contributor

> I can't run module use --append /home/amerzky/ompi/modules/

Did you actually get an error, or did you not try it?

@mingtaiha
Contributor Author

I should have been clearer. I substituted my home directory for Andre's, so Vivek's comment addressed my problem.

@mingtaiha mingtaiha reopened this Mar 11, 2017
@mingtaiha
Contributor Author

I tested with 16 CUs on Comet and this branch works; I am going to submit 256 CUs to see how the branch performs. However, I now get the following problem on Stampede. It seems that Python was not loaded on Stampede:

/home1/03662/tg829619/bin/ve.synapse/bin/python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory\n

@andre-merzky
Member

Can you please open a new ticket for stampede and report your stack there?

@mingtaiha
Contributor Author

Done. See #1276.

@andre-merzky
Member

Great. Let us know how things scale on comet, and if we can close this ticket then. Thanks!

@mingtaiha
Contributor Author

I can run 256 CUs on Comet. Can we close this ticket after feature/comet_ortelib is merged back to devel?

@mingtaiha
Contributor Author

I managed to get up to 1024 CUs on Comet. When will feature/comet_ortelib be merged back to devel?

@andre-merzky
Member

This is merged now - thanks for testing!
