SGE Tight MPICH2 Integration
Loose Integration vs. Tight Integration - TBD
Reuti wrote up a detailed HOWTO for enabling tight integration of the MPICH2 library with SGE, titled "MPICH2 Integration in Grid Engine".
Reuti's examples assume you're using a generic Linux environment with SGE installed under /usr/sge, and that you will install your own new, separate copy of the MPICH2 library somewhere. For Rocks, /share/apps is probably the easiest location for such a copy of MPICH2, since it is normally accessible to all compute nodes.
However, if you prefer to just use the MPICH2 that's already installed under /opt/mpich2/gnu by the HPC Roll, here are the Rocks-specific steps (read Reuti's HOWTO first, though, to understand what is being referred to):
Tested on Rocks 5.2 with HPC Roll (sge-V62u2_1-1 + mpich2-ethernet-gnu-1.0.8p1-0)
Tested on Rocks 5.3 with HPC Roll (sge-V62u4-1 + mpich2-ethernet-gnu-1.1.1p1-0)
1. As root, check that $SGE_ROOT is set up correctly on your frontend to point to your SGE installation:
# echo $SGE_ROOT
/opt/gridengine
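If echo prints an empty line, the SGE environment hasn't been loaded into your current shell. A quick way to pull it in (the path below assumes the standard Rocks install location under /opt/gridengine; adjust if yours differs):
# . /opt/gridengine/default/common/settings.sh
# echo $SGE_ROOT
/opt/gridengine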
2. Download the mpich2-62.tgz archive containing Reuti's tight integration scripts and extract it under the $SGE_ROOT directory:
# cd $SGE_ROOT
# wget http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-62.tgz
# tar xvzf mpich2-62.tgz
# ls
You should see 4 new directories:
mpich2_gforker mpich2_mpd mpich2_smpd mpich2_smpd_rsh
3. As described in the "Tight Integration of the mpd startup method" section of Reuti's HOWTO, compile and install the start_mpich2 helper program:
# cd $SGE_ROOT/mpich2_mpd/src
# ./aimk
# ./install.sh
When the install.sh script asks "Do you want beginn with the installation" (the prompt's own wording), answer y (for Yes).
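To sanity-check the build before moving on, you can look for the compiled helper under the mpich2_mpd tree (the exact architecture subdirectory it lands in depends on install.sh, so the find below simply locates it wherever it was put):
# find $SGE_ROOT/mpich2_mpd -name start_mpich2 -type f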
4. Edit the provided mpich2.template SGE parallel environment (PE) configuration file and fill in the correct values for <the_number_of_slots>, <your_sge_root>, and <your_mpich2_root>:
# cd $SGE_ROOT/mpich2_mpd
# vi mpich2.template
slots              9999
start_proc_args    /opt/gridengine/mpich2_mpd/startmpich2.sh -catch_rsh \
                   $pe_hostfile /opt/mpich2/gnu
stop_proc_args     /opt/gridengine/mpich2_mpd/stopmpich2.sh -catch_rsh \
                   /opt/mpich2/gnu
5. Save this updated mpich2.template file, then add it as a new PE to SGE:
# qconf -Ap mpich2.template
Verify that this newly-created mpich2_mpd PE definition is correct (especially double-check all the filepaths):
# qconf -sp mpich2_mpd
pe_name            mpich2_mpd
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpich2_mpd/startmpich2.sh -catch_rsh \
                   $pe_hostfile /opt/mpich2/gnu
stop_proc_args     /opt/gridengine/mpich2_mpd/stopmpich2.sh -catch_rsh \
                   /opt/mpich2/gnu
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
6. Add this new mpich2_mpd PE to the pe_list line of the SGE cluster queue(s) you want to use for running MPICH2 jobs. For example, if you want to add it to the default all.q queue:
# qconf -mq all.q
pe_list make mpich mpi orte mpich2_mpd
# qconf -sq all.q
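To confirm the change took without reading the whole queue definition, a quick grep is enough (the exact spacing of the output may differ):
# qconf -sq all.q | grep pe_list
pe_list               make mpich mpi orte mpich2_mpd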
7. Change the execd_params line of the global SGE configuration so that commands like qdel can properly kill off spawned processes:
# qconf -mconf
execd_params ENABLE_ADDGRP_KILL=TRUE
# qconf -sconf
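As with the queue change, you can spot-check just this one line instead of reading the whole configuration dump (again, spacing may differ):
# qconf -sconf | grep execd_params
execd_params                 ENABLE_ADDGRP_KILL=TRUE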
8. According to "http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=257043", Reuti says you also have to edit one line of the provided startmpich2.sh script to make it work correctly with Rocks:
# vi $SGE_ROOT/mpich2_mpd/startmpich2.sh
Jump down to line 176 where it says:
NODE=`hostname`
and change it to:
NODE=`hostname --short`
and save the updated file.
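If you'd rather not edit the file by hand, a one-line sed does the same substitution. Back up the script first, and use grep before and after to confirm you changed exactly the line quoted above (the line number 176 applies to the startmpich2.sh shipped in mpich2-62.tgz and may drift in other versions):
# cd $SGE_ROOT/mpich2_mpd
# cp -p startmpich2.sh startmpich2.sh.orig
# sed -i 's/NODE=`hostname`/NODE=`hostname --short`/' startmpich2.sh
# grep -n 'NODE=`hostname' startmpich2.sh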
9. Reuti's scripts are written for the bash shell. If you don't normally specify -S /bin/bash when running qsub, nor put a #$ -S /bin/bash line in all of your own SGE submission scripts for mpich2_mpd jobs, Reuti's HOWTO says you additionally need to make the following configuration change to your SGE cluster queue(s) so that SGE will correctly execute Reuti's scripts using the bash shell:
# qconf -mq all.q
shell            /bin/bash
shell_start_mode unix_behavior
NOTE: Some older Rocks installs (especially Rocks 5.2 with some version of the Service Pack roll installed) have a bug in their default SGE global configuration which will cause Reuti's scripts to fail when run by SGE. Check the SGE global configuration:
# qconf -sconf
and see if you have the following 3 lines present near the end:
qrsh_command   /usr/bin/ssh
rsh_command    /usr/bin/ssh
rlogin_command /usr/bin/ssh
If you have those 3 lines present near the end of the file, type qconf -mconf to edit the SGE global configuration and delete only those 3 lines. Do NOT change anything else in the file.
There should already be the following 2 lines:
rlogin_command builtin
rsh_command    builtin
listed earlier in the file. Those are correct and should remain intact. Do NOT change those earlier lines.
Type qconf -sconf to show the updated SGE global configuration and double-check that those 3 erroneous lines are gone. If you don't delete those 3 erroneous lines, submitted SGE jobs using the mpich2_mpd PE will fail to run properly on the compute nodes.
This bug was corrected in January 2010:
http://marc.info/?l=npaci-rocks-discussion&m=126411729709528
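A quick way to test for this problem without scrolling through the whole configuration is to grep the global configuration for ssh; on an unaffected (or already-fixed) system this should print nothing:
# qconf -sconf | grep /usr/bin/ssh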
10. With the above done, the one important thing not mentioned in Reuti's HOWTO for getting it to work in Rocks is to copy the now-customized contents of the $SGE_ROOT/mpich2_mpd directory to all of your compute nodes:
# cd $SGE_ROOT
# rocks iterate host compute command="scp -rp ./mpich2_mpd %:/opt/gridengine/."
or
# cd $SGE_ROOT
# scp -rp ./mpich2_mpd compute-0-0:/opt/gridengine/.
# ... (repeat for each compute node -- using a script?) ...
For future compute node installs/rebuilds, you'll probably want some sort of post-install script, or a custom RPM or Rocks roll, so this gets installed through the normal Rocks node-building process.
TBD
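Until that's documented, here is a rough re-sync sketch you could keep as a small script on the frontend and re-run whenever a compute node is reinstalled. It only uses the rocks and scp commands already shown above; the script name resync-mpich2_mpd.sh is just an illustration, not something Rocks provides:
#!/bin/bash
# resync-mpich2_mpd.sh -- push the customized mpich2_mpd directory
# to every compute node.  Run as root on the frontend.
set -e
: "${SGE_ROOT:=/opt/gridengine}"
cd "$SGE_ROOT"
# Rocks expands '%' to each compute node's name, as in step 10 above.
rocks iterate host compute command="scp -rp ./mpich2_mpd %:$SGE_ROOT/."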
11. As a regular user, create a valid $HOME/.mpd.conf file if you don't already have one:
$ touch $HOME/.mpd.conf
$ chmod 600 $HOME/.mpd.conf
$ echo "MPD_SECRETWORD=mr45-j9z" >> $HOME/.mpd.conf
Substitute your own unique secretword for the mr45-j9z part (that was just the example secretword listed in the Installer's Guide). Don't use your actual Linux shell account password, since this .mpd.conf file will be stored in cleartext in your NFS-served user home directory area. This secretword will only be used for MPICH2 jobs. If you want something that looks like a random password-like alphanumeric string, the command mkpasswd -l 8 -s 0 will generate one.
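Putting that together, a short sketch that creates the file with a freshly generated secretword (this assumes mkpasswd is the one from the expect package shipped with CentOS/Rocks; if your mkpasswd takes different options, just paste in a string of your own choosing):
$ SECRET=$(mkpasswd -l 8 -s 0)           # random 8-character string, no special characters
$ touch $HOME/.mpd.conf
$ chmod 600 $HOME/.mpd.conf              # mpd will refuse to start if the file is readable by others
$ echo "MPD_SECRETWORD=$SECRET" >> $HOME/.mpd.conf
$ unset SECRET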
12. To follow the rest of Reuti's HOWTO examples, which use the provided mpich2_mpd.sh SGE submission script with the mpihello program to show how this mpich2_mpd PE works, download and compile the mpihello.c program, and modify the appropriate line in the mpich2_mpd.sh script to point to the Rocks-provided MPICH2:
$ cd $HOME
$ wget http://gridengine.sunsource.net/howto/mpich2-integration/mpihello.tgz
$ tar xvzf mpihello.tgz
$ /opt/mpich2/gnu/bin/mpicc -o mpihello mpihello.c
$ cp -p $SGE_ROOT/mpich2_mpd/mpich2_mpd.sh .
$ vi ./mpich2_mpd.sh
    export MPICH2_ROOT=/opt/mpich2/gnu      (change this line inside mpich2_mpd.sh)
$ qsub -pe mpich2_mpd 4 -S /bin/bash ./mpich2_mpd.sh
$ qstat
Once qstat shows that the job is running, check the contents of the mpich2_mpd.sh.o## and mpich2_mpd.sh.po## output files to see that the mpd daemons are launched as described in Reuti's HOWTO.
While the job is running, you should also be able to view the running process tree on the compute nodes by using a command like:
$ ssh compute-0-0 ps -e f -o pid,ppid,pgrp,command --cols=120
as described in Reuti's HOWTO, to verify that the MPICH2 processes are all child processes of the master sge_shepherd process controlled by SGE.
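If the job is spread across several nodes and you don't want to ssh into each one, roughly the same check can be run everywhere in one shot from the frontend with rocks run host (this relies on the passwordless ssh that Rocks sets up for your account):
$ rocks run host compute command="ps -e f -o pid,ppid,pgrp,command --cols=120"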
13. For an alternate example test, compile the sample mpi-verify.c with the MPICH2 version of mpicc:
$ /opt/mpich2/gnu/bin/mpicc -o ./mpi-verify /opt/mpi-tests/src/mpi-verify.c
And create a sample SGE job submission script like the following:
$ vi run-myjob.qsub
#!/bin/bash
#$ -pe mpich2_mpd 4
#$ -N myjob
#$ -cwd
#$ -j y
#$ -S /bin/bash
MYPROG="./mpi-verify"
export MPICH2_ROOT="/opt/mpich2/gnu"
export PATH="$MPICH2_ROOT/bin:$PATH"
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
$MPICH2_ROOT/bin/mpiexec -machinefile $TMPDIR/machines -n $NSLOTS $MYPROG
exit 0
REMINDER: For your own SGE job submission scripts that will use the mpich2_mpd PE, don't forget to always include the line:
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
so that the $MPD_CON_EXT environment variable is set; it is what ties the job's mpd processes to this particular SGE job when using the mpich2_mpd PE.
14. Submit the test job to SGE:
$ qsub run-myjob.qsub
and run:
$ qstat
to track its progress in the queue. Also check the contents of the job's myjob.o### and myjob.po### output files.
If the job seems to get stuck in the SGE queue in the qw (queue waiting) state and never enters the r (running) state, here are some useful SGE diagnostic commands to help you figure out what's wrong:
frontend$ qstat -j
frontend$ qstat -j JOBID#
frontend$ qstat -f
frontend$ tail -25 $SGE_ROOT/default/spool/qmaster/messages
compute-0-0$ tail -25 $SGE_ROOT/default/spool/compute-0-0/messages
Should the queue get completely stuck in an E (error) state, it may be necessary for an SGE administrator (such as root) to manually clear the queue's error state:
frontend# qstat -explain E
frontend# qdel JOBID#
frontend# qmod -cq all.q
NOTE: According to "http://trac.mcs.anl.gov/projects/mpich2/milestone/mpich2-1.3", the upcoming MPICH2 1.3 version will use a different default MPI process manager (hydra instead of mpd) and is planned to include better support for tight integration with SGE, which I assume some future version of Rocks will take advantage of.