sge tight mpich2 integration



Introduction

Loose Integration vs. Tight Integration - TBD

Reuti wrote up a detailed HOWTO on enabling tight integration of the MPICH2 library with SGE: MPICH2 Integration in Grid Engine.

Reuti's examples assume you're using a generic Linux environment with SGE installed under /usr/sge and that you will install your own separate copy of the MPICH2 library somewhere.

For Rocks, installing your own copy of MPICH2 under /share/apps is probably the easiest location to use since it is normally accessible to all compute nodes.
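
If you go the /share/apps route, the build follows the usual MPICH2 pattern. Here is a minimal sketch, assuming you've downloaded an MPICH2 source tarball (the version number and --prefix value are illustrative, so adjust them to your download and preferred location):

 $ tar xvzf mpich2-1.1.1p1.tar.gz
 $ cd mpich2-1.1.1p1
 $ ./configure --prefix=/share/apps/mpich2
 $ make
 # make install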

However, if you prefer to just use the MPICH2 that's already installed under /opt/mpich2/gnu by the HPC Roll, here are the Rocks-specific steps (read Reuti's HOWTO first though to understand what is being referred to):

Tested on Rocks 5.2 with HPC Roll (sge-V62u2_1-1 + mpich2-ethernet-gnu-1.0.8p1-0)

Tested on Rocks 5.3 with HPC Roll (sge-V62u4-1 + mpich2-ethernet-gnu-1.1.1p1-0)

Install the Tight Integration Scripts

1. As root, check that $SGE_ROOT is set up correctly on your frontend to point to your SGE installation:

 # echo $SGE_ROOT
 /opt/gridengine
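
If the variable comes back empty, source the SGE settings file first (this path assumes the default Rocks SGE install location shown above):

 # . /opt/gridengine/default/common/settings.sh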

2. Download the mpich2-62.tgz archive containing Reuti's tight integration scripts and extract it under the $SGE_ROOT directory:

 # cd $SGE_ROOT
 # wget http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-62.tgz
 # tar xvzf mpich2-62.tgz
 # ls

You should see 4 new directories:

 mpich2_gforker
 mpich2_mpd
 mpich2_smpd
 mpich2_smpd_rsh

3. As described in the "Tight Integration of the mpd startup method" section of Reuti's HOWTO, compile and install the start_mpich2 helper program:

 # cd $SGE_ROOT/mpich2_mpd/src
 # ./aimk
 # ./install.sh

When the install.sh script asks "Do you want beginn with the installation" (the typo is in the script's own prompt), answer y (for Yes).
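
If you want to confirm where install.sh put the compiled helper binary, a quick search works regardless of which architecture subdirectory it lands in:

 # find $SGE_ROOT -name start_mpich2 -type f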

Set Up the mpich2_mpd PE

4. Edit the provided mpich2.template SGE parallel environment (PE) configuration file and fill in the correct values for <the_number_of_slots>, <your_sge_root>, and <your_mpich2_root>:

 # cd $SGE_ROOT/mpich2_mpd
 # vi mpich2.template
   slots              9999
   start_proc_args    /opt/gridengine/mpich2_mpd/startmpich2.sh -catch_rsh \
                      $pe_hostfile /opt/mpich2/gnu
   stop_proc_args     /opt/gridengine/mpich2_mpd/stopmpich2.sh -catch_rsh \
                      /opt/mpich2/gnu

5. Save this updated mpich2.template file then add it as a new PE to SGE:

 # qconf -Ap mpich2.template

Verify that this newly-created mpich2_mpd PE definition is correct (especially double-check all the filepaths):

 # qconf -sp mpich2_mpd
   pe_name            mpich2_mpd
   slots              9999
   user_lists         NONE
   xuser_lists        NONE
   start_proc_args    /opt/gridengine/mpich2_mpd/startmpich2.sh -catch_rsh \
                      $pe_hostfile /opt/mpich2/gnu
   stop_proc_args     /opt/gridengine/mpich2_mpd/stopmpich2.sh -catch_rsh \
                      /opt/mpich2/gnu
   allocation_rule    $round_robin
   control_slaves     TRUE
   job_is_first_task  FALSE
   urgency_slots      min
   accounting_summary FALSE

6. Add this new mpich2_mpd PE to the pe_list line of the SGE cluster queue(s) you want to use for running MPICH2 jobs.

For example, if you want to add it to the default all.q queue:

 # qconf -mq all.q
   pe_list               make mpich mpi orte mpich2_mpd
 # qconf -sq all.q
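
As a quick sanity check (not part of Reuti's HOWTO), you can filter the queue configuration for just the pe_list line:

 # qconf -sq all.q | grep pe_list
 pe_list               make mpich mpi orte mpich2_mpd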

7. Change the execd_params line of the global SGE configuration so that commands like qdel can properly kill off spawned processes:

 # qconf -mconf
   execd_params                ENABLE_ADDGRP_KILL=TRUE
 # qconf -sconf
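
Again, a quick filter (just a sanity check) confirms the setting is active:

 # qconf -sconf | grep execd_params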

8. According to http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=257043, Reuti says you also have to edit one line of the provided startmpich2.sh script to make it work correctly with Rocks:

 # vi $SGE_ROOT/mpich2_mpd/startmpich2.sh

Jump down to line 176 where it says:

 NODE=`hostname`

and change it to:

 NODE=`hostname --short`

and save the updated file.
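
If you'd rather make that one-line change with sed instead of an editor, something like this should work (verify the result afterwards, since the line number and exact contents may differ between versions of the script):

 # sed -i 's/NODE=`hostname`/NODE=`hostname --short`/' $SGE_ROOT/mpich2_mpd/startmpich2.sh
 # grep -n 'hostname --short' $SGE_ROOT/mpich2_mpd/startmpich2.sh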

9. Reuti's scripts are written for the bash shell. If you don't normally specify -S /bin/bash when running qsub, or don't normally put a #$ -S /bin/bash line in your own SGE submission scripts for mpich2_mpd jobs, Reuti's HOWTO says you additionally need to make the following configuration change to your SGE cluster queue(s) so that SGE will execute his scripts using the bash shell:

 # qconf -mq all.q
   shell                 /bin/bash
   shell_start_mode      unix_behavior

NOTE: Some older Rocks installs (especially Rocks 5.2 with some version of the Service Pack roll installed) have a bug in their default SGE global configuration which will cause Reuti's scripts to fail when run by SGE. Check the SGE global configuration:

 # qconf -sconf

and see if you have the following 3 lines present near the end:

   qrsh_command       /usr/bin/ssh
   rsh_command        /usr/bin/ssh
   rlogin_command     /usr/bin/ssh

If you have those 3 lines present near the end of the file, type qconf -mconf to edit the SGE global configuration and delete only those 3 lines. Do NOT change anything else in the file.

There should already be the following 2 lines:

   rlogin_command     builtin
   rsh_command        builtin

listed earlier in the file. Those are correct and should remain intact. Do NOT change those earlier lines.

Type qconf -sconf to show the updated SGE global configuration file and double-check that those 3 erroneous lines are gone.

If you don't delete those 3 erroneous lines, submitted SGE jobs using the mpich2_mpd PE will fail to run properly on the compute nodes.
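
A quick way to test for the erroneous lines (any output here means they still need to be deleted):

 # qconf -sconf | grep /usr/bin/ssh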

This bug was corrected in January 2010:

http://marc.info/?l=npaci-rocks-discussion&m=126411729709528

Distribute mpich2_mpd to All Compute Nodes

10. With the above done, the one important thing not mentioned in Reuti's HOWTO for getting it to work in Rocks is to copy the now-customized contents of the $SGE_ROOT/mpich2_mpd directory to all of your compute nodes.

 # cd $SGE_ROOT
 # rocks iterate host compute command="scp -rp ./mpich2_mpd %:/opt/gridengine/."

or

 # cd $SGE_ROOT
 # scp -rp ./mpich2_mpd compute-0-0:/opt/gridengine/.
 # ... (repeat for each compute node -- using a script?) ...
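
For example, a simple shell loop on the frontend covers a small cluster (a sketch only; substitute your actual compute node names):

 # cd $SGE_ROOT
 # for node in compute-0-0 compute-0-1 compute-1-0; do
 >   scp -rp ./mpich2_mpd $node:/opt/gridengine/.
 > done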

For future compute node install/rebuilds, you'll probably want to have some sort of post-install script or create a custom RPM or Rocks roll to make it easier to install using the normal Rocks node building processes.

TBD

Test the mpich2_mpd PE

11. As a regular user, create a valid $HOME/.mpd.conf file if you don't already have one:

 $ touch $HOME/.mpd.conf
 $ chmod 600 $HOME/.mpd.conf
 $ echo "MPD_SECRETWORD=mr45-j9z" >> $HOME/.mpd.conf

Substitute your own unique secretword for the mr45-j9z part. This was the example secretword listed in the Installer's Guide.

Don't use your actual Linux shell account password since this .mpd.conf file will be stored in cleartext in your NFS-served user home directory area. This secretword will only be used for MPICH2 jobs. If you want something that looks like a random password-like alphanumeric text string, type the command mkpasswd -l 8 -s 0 to generate such a string.
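
If mkpasswd isn't installed on your frontend, any random-string generator will do; for example (an alternative not mentioned in the original guide):

 $ openssl rand -base64 6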

12. The rest of Reuti's HOWTO demonstrates how the mpich2_mpd PE works by using the provided mpich2_mpd.sh SGE submission script with the mpihello program. To follow those examples, download and compile the mpihello.c program, and modify the appropriate line in the mpich2_mpd.sh script to point to the Rocks-provided MPICH2:

 $ cd $HOME
 $ wget http://gridengine.sunsource.net/howto/mpich2-integration/mpihello.tgz
 $ tar xvzf mpihello.tgz
 $ /opt/mpich2/gnu/bin/mpicc -o mpihello mpihello.c
 $ cp -p $SGE_ROOT/mpich2_mpd/mpich2_mpd.sh .
 $ vi ./mpich2_mpd.sh
   export MPICH2_ROOT=/opt/mpich2/gnu
 $ qsub -pe mpich2_mpd 4 -S /bin/bash ./mpich2_mpd.sh
 $ qstat

Once qstat shows that the job is running, check the contents of the mpich2_mpd.sh.o## and mpich2_mpd.sh.po## output files to see that the mpd daemons are launched as described in Reuti's HOWTO.

While the job is running, you should also be able to view the running process tree on the compute nodes by using a command like:

 $ ssh compute-0-0 ps -e f -o pid,ppid,pgrp,command --cols=120

as described in Reuti's HOWTO to verify that the MPICH2 processes are all child processes of the master sge_shepherd process controlled by SGE.

13. For an alternate example test, compile the sample mpi-verify.c with the MPICH2 version of mpicc:

 $ /opt/mpich2/gnu/bin/mpicc -o ./mpi-verify /opt/mpi-tests/src/mpi-verify.c

And create a sample SGE job submission script like the following:

 $ vi run-myjob.qsub
 #!/bin/bash
 #$ -pe mpich2_mpd 4
 #$ -N myjob
 #$ -cwd
 #$ -j y
 #$ -S /bin/bash
 MYPROG="./mpi-verify"
 export MPICH2_ROOT="/opt/mpich2/gnu"
 export PATH="$MPICH2_ROOT/bin:$PATH"
 export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
 $MPICH2_ROOT/bin/mpiexec -machinefile $TMPDIR/machines -n $NSLOTS $MYPROG
 exit 0

REMINDER: For your own SGE job submission scripts that will use the mpich2_mpd PE, don't forget to always include the line:

 export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"

so that the $MPD_CON_EXT environment variable is available for SGE to use to manage all the mpd processes when using the mpich2_mpd PE.

14. Submit the test job to SGE:

 $ qsub run-myjob.qsub

and run:

 $ qstat

to track its progress in the queue. Also check the contents of the job's myjob.o### and myjob.po### output files.

If the job seems to get stuck in the SGE queue in the qw (queue waiting) state and never enters the r (running) state, here are some useful SGE diagnostic commands to help you figure out what's wrong:

 frontend$ qstat -j
 frontend$ qstat -j JOBID#
 frontend$ qstat -f
 frontend$ tail -25 $SGE_ROOT/default/spool/qmaster/messages
 compute-0-0$ tail -25 $SGE_ROOT/default/spool/compute-0-0/messages

Should the queue get completely stuck in an E (error) state, it may be necessary for an SGE administrator (such as root) to manually clear the queue's error state:

 frontend# qstat -explain E
 frontend# qdel JOBID#
 frontend# qmod -cq all.q

NOTE: According to http://trac.mcs.anl.gov/projects/mpich2/milestone/mpich2-1.3, the upcoming MPICH2 1.3 release will use a different default MPI process manager (hydra instead of mpd) and is planned to include better support for tight integration with SGE, which I assume some future version of Rocks will take advantage of.