RP 0.45.RC1 ORTE failure on comet #1218

Closed · vivek-bala opened this issue Feb 14, 2017 · 23 comments
@vivek-bala (Contributor):

The CUs start executing and then fail with the following error:

/home/marksant/openmpi/installed/rhc/bin/orterun: Error: unknown option "--hnp"
Type '/home/marksant/openmpi/installed/rhc/bin/orterun --help' for usage.
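
For context: the --hnp option comes from the ORTE DVM workflow that RP uses (a persistent orte-dvm runs on the allocation and tasks are submitted to its head node process), so an orterun build that predates this feature rejects the flag; the generated CU command further down in this thread shows it in use. A quick check against a given installation could look roughly like the sketch below (the path is taken from the error above, purely as an example):

# Sketch: does this orterun build know about --hnp at all?
OMPI=/home/marksant/openmpi/installed/rhc
$OMPI/bin/orterun --version                       # which Open MPI build is this?
$OMPI/bin/orterun --help 2>&1 | grep -e '--hnp' \
    || echo "this build does not support --hnp"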
@andre-merzky (Member):

@marksantcroos : Mark, I guess this needs a new deployment of ompi. Should I use the same OMPI commit we use for STATIC-DEVEL on titan?

@marksantcroos (Contributor):

Mark, I guess this needs a new deployment of ompi.

Yes, indeed.

Should I use the same OMPI commit we use for STATIC-DEVEL on titan?

That's a safe bet.

@SrinivasMushnoori commented Feb 15, 2017:

The error is reproducible. Additionally, I see the following messages while the job is still waiting in the queue:

2017-02-15 15:42:27,005: radical.saga.cpi    : MainProcess                     : PilotLauncherWorker-1: ERROR   : set name 2: tmp_EKmNpf.slurm
2017-02-15 15:43:33,322: radical.saga.cpi    : MainProcess                     : PilotLauncherWorker-1: ERROR   : set name 2: tmp_EKmNpf.slurm
2017-02-15 15:44:27,442: radical.saga.cpi    : MainProcess                     : PilotLauncherWorker-1: ERROR   : set name 2: tmp_EKmNpf.slurm
2017-02-15 15:45:27,622: radical.saga.cpi    : MainProcess                     : PilotLauncherWorker-1: ERROR   : set name 2: tmp_EKmNpf.slurm

@andre-merzky (Member) commented Feb 16, 2017:

Sorry for the delay; the comet batch scheduler doesn't like me right now, so tests are still pending. If you want to give it a go in the meantime, please check out the fix/issue_1218 branch, which contains a config change to use a new OMPI installation. I'll ping back as soon as I get confirmation that this works.

@vivek-bala (Contributor, Author):

I can confirm that the examples started working on comet (they have not all finished yet), but I also get the same message that Srinivas posted:

2017-02-16 14:02:59,815: radical.saga.cpi    : MainProcess                     : PilotLauncherWorker-1: ERROR   : set name 2: tmp_uYEH0j.slurm

Note that it is labelled as an ERROR but does not lead to cancellation or termination.

@andre-merzky (Member) commented Feb 16, 2017:

That message has been removed in a different pull request and should be gone in the next release candidate; see https://github.com/radical-cybertools/saga-python/pull/616

@vivek-bala (Contributor, Author):

OK, all the examples worked on comet from the fix/issue_1218 branch.

This is surprising, though: the MPI example worked even though we don't have an mpi4py module (which the example uses) built against the RP openmpi. Any ideas how or why?

@marksantcroos (Contributor):

Probably because of the module load (which Andre has since removed)?

@vivek-bala (Contributor, Author):

Yes, but that would use the system mpi4py, which is probably linked against some MPI library other than the RP openmpi. Shouldn't that cause a conflict? I remember running into such a conflict on BW when I was using it. Maybe that is no longer the case.

@marksantcroos (Contributor):

No, the mpi4py that gets loaded is linked against openmpi.

@vivek-bala (Contributor, Author):

I don't think it is. Maybe I'm missing something. Please see:

[vivek91@comet-ln2 ~]$ module use --append /home/amerzky/ompi/modules
[vivek91@comet-ln2 ~]$ module load python
[vivek91@comet-ln2 ~]$ module load openmpi/2017_02_15
[vivek91@comet-ln2 ~]$ python
Python 2.7.10 (default, Feb  1 2016, 14:30:50) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mpi4py
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named mpi4py
>>> 
[vivek91@comet-ln2 ~]$ module load mpi4py
[vivek91@comet-ln2 ~]$ python
Python 2.7.10 (default, Feb  1 2016, 14:30:50) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mpi4py
>>> mpi4py.__file__
'/opt/mpi4py/lib/python2.7/site-packages/mpi4py/__init__.py'

I am assuming the same happens when this is done within a CU.
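
A hedged way to verify which MPI library that system mpi4py is actually linked against is to inspect the shared-library dependencies of its C extension (the site-packages path is the one printed above; the extension name MPI.so is the usual mpi4py layout, but may differ between versions):

# Sketch: check what the mpi4py extension links against.
module load python mpi4py
python -c "import mpi4py; print(mpi4py.__path__[0])"
ldd /opt/mpi4py/lib/python2.7/site-packages/mpi4py/MPI.so | grep -i mpi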

@vivek-bala (Contributor, Author):

Tested it within a CU:

Generated CU script:

# Change to working directory for unit
cd /home/vivek91/radical.pilot.sandbox/rp.session.radical.vivek.017213.0009-pilot.0000/unit.000020
# Environment variables
export RP_SESSION_ID=rp.session.radical.vivek.017213.0009
export RP_PILOT_ID=pilot.0000
export RP_AGENT_ID=agent_0
export RP_SPAWNER_ID=agent_0.AgentExecutingComponent.0.child
export RP_UNIT_ID=unit.000020

# The command to run
/home/amerzky/ompi//installed/2017_02_15/bin/orterun  --hnp "2841313280.0;tcp://10.22.252.156,198.202.117.135,10.21.252.156,192.168.36.231:44134;ud://63541.1024.1"  -np 2 -host comet-14-03,comet-14-03 python "helloworld_mpi.py"
RETVAL=$?
# Exit the script with the return code from the command
exit $RETVAL

stdout:

1/1/comet-14-03
mpi4py: /opt/mpi4py/lib/python2.7/site-packages/mpi4py/__init__.py
1/1/comet-14-03
mpi4py: /opt/mpi4py/lib/python2.7/site-packages/mpi4py/__init__.py
[ORTE] Task: 0 is launched! (Job ID: [43355,72])
[ORTE] Task: 0 returned: 0 (Job ID: [43355,72]

@andre-merzky (Member):

The pilot env should not leak into the CU env, right? So if the pilot pre-exec gets a CU to work, I would consider that a bug, really :/

But maybe the system module works because (a) the dynamic linker finds our openmpi libs suitable, or (b) the openmpi versions are, by chance, sufficiently compatible for our test?
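
A way to tell (a) and (b) apart would be to check, from inside a CU, which libmpi the dynamic linker actually resolves and what the loaded library reports about itself (sketch; MPI.Get_library_version() requires an MPI-3 library and a reasonably recent mpi4py):

# Sketch: run these as the CU executable or in a pre_exec to see what really gets loaded.
LD_DEBUG=libs python -c "import mpi4py.MPI" 2>&1 | grep -i libmpi
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"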

@vivek-bala (Contributor, Author):

The pilot env should not leak into the CU env, right? So if the pilot pre-exec gets a CU to work, I would consider that a bug, really :/

Yea, that seems to be the case here.

@marksantcroos (Contributor):

Should I use the same OMPI commit we use for STATIC-DEVEL on titan?

That's a safe bet.

How did you actually interpret that? I see the script mentions another tag; I consider 6da4dbb the last known good commit.

@andre-merzky (Member):

Oh, I see - my bad I guess...

@andre-merzky (Member):

That is confusing, though: DEVEL_STATIC on titan points to /lustre/atlas1/bip103/world-shared/openmpi/jan18-static-nodebug-nodstore, which is on c8768e3, yet another commit... :/

Can you reconfirm which of the three (4c9f7af, 6da4dbb, c8768e3), if any, we should use?

@marksantcroos (Contributor):

For the record, the current DEVEL-STATIC on Titan points to jan18-static-nodebug-nodstore, configured with an additional --disable-pmix-dstore to address open-mpi/ompi#2737.
6da4dbb is the last known good commit (it is semantically similar to c8768e3; sorry for the confusion, I might have reused a build directory). For completeness, 4c9f7af was the previous "stable" commit from last November.
We can't use the latest master because of open-mpi/ompi#2998; that's being worked on.
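
For anyone redeploying, the above boils down to roughly the following (the install prefix and build parallelism are placeholders; the commit and configure flag are the ones named in this thread):

# Hypothetical deployment sketch for the recommended OMPI build.
git clone https://github.com/open-mpi/ompi.git && cd ompi
git checkout 6da4dbb                # last known good commit, per above
./autogen.pl                        # required when building from a git checkout
./configure --prefix=$HOME/ompi/installed/2017_02_15 --disable-pmix-dstore
make -j 8 && make install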

@andre-merzky (Member):

Thanks for the clarification, that helps a lot.

So at this point the recommended ompi installation is 6da4dbb with --disable-pmix-dstore. I'll update the deployment script and documentation, and will update the comet install. For titan we'll tread carefully - if things work now for Alessio, I'd say we stick to what we have, clean out the other deployed versions, and then test carefully with the above one before switching.

Thanks!

@andre-merzky (Member):

The pilot env should not leak into the CU env, right? So if the pilot pre-exec gets a CU to work, I would consider that a bug, really :/
Yea, that seems to be the case here.

Hmm, wait: my understanding from the above was that the MPI example still works after removing the module load mpi4py from the pilot pre_exec, so there is nothing left there that could leak anymore? What am I missing?

@vivek-bala (Contributor, Author):

Ah. I assumed that it was leaking since I didn't have to do the following in the CUs (in any of the examples):

module use --append /home/amerzky/ompi/modules
module load python
module load openmpi/2017_02_15

but maybe /home/amerzky/ompi//installed/2017_02_15/bin/orterun is able to pick up or set the correct environment(?).
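
One way to test that hypothesis would be to launch a trivial task through the same orterun and dump the environment it hands to the spawned process (sketch; the --hnp URI is a placeholder for whatever the agent reports for the running DVM):

# Sketch: which PATH / LD_LIBRARY_PATH does an orterun-spawned process actually see?
/home/amerzky/ompi//installed/2017_02_15/bin/orterun \
    --hnp "<dvm-uri>" -np 1 /usr/bin/env | grep -E 'PATH|LD_LIBRARY'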

@ibethune (Contributor):

0.45.RC2 should now use the new OpenMPI installation; please try again with that and let us know how it goes.

@vivek-bala (Contributor, Author):

I don't face this issue anymore with RC2. Consider this resolved.
