
Tight MPICH Integration with SGE

Rocks comes with a default parallel environment for SGE named mpich that facilitates tight integration of MPICH1. Unfortunately, it is not quite complete: for the ch_p4 MPICH device, the environment variable MPICH_PROCESS_GROUP must be set to no on both the frontend and the compute nodes in order for SGE to maintain itself as process group leader. These are the steps I took to get it working in Rocks 4.1 (I opted for the 2nd solution described in Tight MPICH Integration in Grid Engine, Nmichaud@jhu.edu, 14:11, 15 February 2006 EST).

1. Edit /opt/gridengine/default/common/sge_request and add the following line at the end:

 -v MPICH_PROCESS_GROUP=no

2. For a default Rocks setup, SGE calls /opt/gridengine/mpi/startmpi.sh when starting an MPI job, which in turn calls /opt/gridengine/mpi/rsh. Both files must be changed. However, each compute node has its own copy of them, so instead of editing them on the frontend and copying them out to all the compute nodes, I found it easier to place my own copies in a subdirectory of /share/apps called mpi and then change the mpich parallel environment to call my own copies of startmpi.sh and stopmpi.sh (and, by extension, rsh), as shown below. That way the single copy is exported to all the nodes and I don't have to worry about keeping them in sync.
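
For example, the one-time copy on the frontend could look like this (the source paths match a default Rocks install; adjust them if your layout differs):

 # mkdir -p /share/apps/mpi
 # cp -p /opt/gridengine/mpi/startmpi.sh /opt/gridengine/mpi/stopmpi.sh \
        /opt/gridengine/mpi/rsh /share/apps/mpi/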

3. Edit /share/apps/mpi/startmpi.sh. Change the line:

 rsh_wrapper=$SGE_ROOT/mpi/rsh

to:

 rsh_wrapper=/share/apps/mpi/rsh

4. Edit /share/apps/mpi/rsh. Change the following lines:

 echo $SGE_ROOT/bin/$ARC/qrsh -inherit -nostdin $rhost $cmd
 exec $SGE_ROOT/bin/$ARC/qrsh -inherit -nostdin $rhost $cmd
 else
 echo $SGE_ROOT/bin/$ARC/qrsh -inherit $rhost $cmd
 exec $SGE_ROOT/bin/$ARC/qrsh -inherit $rhost $cmd

to:

 echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
 exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
 else
 echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
 exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd

(The added -V flag makes qrsh export the full environment to the remote tasks, so settings such as MPICH_PROCESS_GROUP=no from step 1 reach every node.)

5. Finally, run qconf -mp mpich. Change it from:

 pe_name          mpich
 slots            9999
 user_lists       NONE
 xuser_lists      NONE
 start_proc_args  /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
 stop_proc_args   /opt/gridengine/mpi/stopmpi.sh
 allocation_rule  $fill_up
 control_slaves   TRUE
 job_is_first_task FALSE
 urgency_slots     min

to:

 pe_name          mpich
 slots            9999
 user_lists       NONE
 xuser_lists      NONE
 start_proc_args  /share/apps/mpi/startmpi.sh -catch_rsh $pe_hostfile
 stop_proc_args   /share/apps/mpi/stopmpi.sh
 allocation_rule  $fill_up
 control_slaves   TRUE
 job_is_first_task FALSE
 urgency_slots     min
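
To confirm the change, you can print the parallel environment back and submit a small test job that requests it (the job script name below is only a placeholder):

 # qconf -sp mpich
 # qsub -pe mpich 4 testjob.sh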

Prologue/Epilogue

This is a place for information about the prologue and epilogue scripts that run before and after a job, respectively.

I have found that SGE is not particularly good at cleaning up after MPI jobs; it does not keep track of which other nodes a job is using, nor does it kill leftover user processes on those nodes. If anyone has a good solution for this, I'd love to see it.

(Note: MPI tight integration is supposed to fix this; see http://web.archive.org/web/20080916135356/http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html)
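
One possible workaround, sketched below, is a small cleanup script that walks the slave nodes listed in the job's pe_hostfile and kills anything still owned by the job owner. None of this is part of a stock Rocks install: the path /share/apps/mpi/cleanup.sh, the use of ssh and pkill, and handing it the hostfile path as its first argument (for example via stop_proc_args /share/apps/mpi/cleanup.sh $pe_hostfile) are all assumptions:

 #!/bin/sh
 # Hypothetical /share/apps/mpi/cleanup.sh: kill leftover processes of the
 # job owner on every slave node named in the pe_hostfile passed as $1.
 pe_hostfile=$1
 [ -r "$pe_hostfile" ] || exit 0
 for node in $(cut -d' ' -f1 "$pe_hostfile" | sort -u); do
     # WARNING: pkill -u is indiscriminate; it also kills processes that
     # belong to the same user's other jobs running on that node.
     ssh "$node" "pkill -u $USER" < /dev/null
 done
 exit 0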

Add Frontend as a SGE Execution Host in Rocks

To set up the frontend node as an SGE execution host on which queued jobs can run (just like the compute nodes), do the following:

Quick Setup

 # cd /opt/gridengine
 # ./install_execd    (accept all of the default answers)
 # qconf -mq all.q    (if needed, adjust the number of slots for [Frontend.local=4] and other parameters)
 # /etc/init.d/sgemaster.Frontend stop
 # /etc/init.d/sgemaster.Frontend start
 # /etc/init.d/sgeexecd.Frontend stop
 # /etc/init.d/sgeexecd.Frontend start

Detailed Setup

1. As root, make sure $SGE_ROOT and the related SGE variables are set up correctly on the frontend:

 # env | grep SGE

It should return something like:

 SGE_CELL=default
 SGE_ARCH=lx26-amd64
 SGE_EXECD_PORT=537
 SGE_QMASTER_PORT=536
 SGE_ROOT=/opt/gridengine

If not, source the file /etc/profile.d/sge-binaries.[c]sh, or check whether the SGE Roll is properly installed and enabled:

 # rocks list roll
 NAME          VERSION ARCH   ENABLED
 sge:          5.2     x86_64 yes
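
If the roll is present but the SGE variables are missing from your shell, sourcing the profile script is enough; for example, in a bash shell:

 # . /etc/profile.d/sge-binaries.sh
 # env | grep SGE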

2. Run the install_execd script to set up the frontend as an SGE execution host:

 # cd $SGE_ROOT
 # ./install_execd 

Accept all of the default answers as suggested by the script.

  • NOTE: In the examples below, the text Frontend should be replaced with the actual "short hostname" of your frontend (as reported by the command hostname -s).
For example, if running the command hostname on your frontend returns the FQDN:
 # hostname
 mycluster.mydomain.org

then hostname -s should return just:

 # hostname -s
 mycluster

3. Verify that the number of job slots for the frontend equals the number of physical processors/cores on your frontend that you wish to make available for queued jobs, by checking the value of the slots parameter in the queue configuration for all.q:

 # qconf -sq all.q | grep slots
 slots                 1,[compute-0-0.local=4],[Frontend.local=4]

The [Frontend.local=4] entry means that SGE can run up to 4 jobs on the frontend. Since the frontend is normally used for other tasks besides running compute jobs, it is recommended not to make all of its installed physical processors/cores schedulable by SGE, to avoid overloading the frontend.

For example, on a 4-core frontend, to configure SGE to use only up to 3 of the 4 cores, you can modify the slots for Frontend.local from 4 to 3 by typing:

 # qconf -mattr queue slots '[Frontend.local=3]' all.q

If there are additional queues besides the default all.q one, repeat the above for each queue.
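
You can list all configured queues with:

 # qconf -sql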

Read "man queue_conf" for a list of resource limit parameters such as s_cpu, h_cpu, s_vmem, and h_vmem that can be adjusted to prevent jobs from overloading the frontend.

  • NOTE: For Rocks 5.2 or older, the frontend may have been configured by default during installation with only 1 job slot ([Frontend.local=1]) in the default all.q queue, which allows only 1 queued job at a time to run on the frontend. To check the value of the slots parameter of the queue configuration for all.q, type:
 # qconf -sq all.q | grep slots
 slots                 1,[compute-0-0.local=4],[Frontend.local=1] 

If needed, modify the slots for Frontend.local from 1 to 4 (or up to the maximum number of physical processors/cores on your frontend that you wish to use) by typing:

 # qconf -mattr queue slots '[Frontend.local=4]' all.q

  • NOTE: For Rocks 5.3 or older, create the file /opt/gridengine/default/common/host_aliases to contain both the .local hostname and the FQDN long hostname of your frontend:
 # vi $SGE_ROOT/default/common/host_aliases
 Frontend.local Frontend.mydomain.org

  • NOTE: For Rocks 5.3 or older, edit the file /opt/gridengine/default/common/act_qmaster to contain the .local hostname of your frontend:
 # vi $SGE_ROOT/default/common/act_qmaster
 Frontend.local

  • NOTE: For Rocks 5.3 or older, edit the file /etc/init.d/sgemaster.Frontend:
 # vi /etc/init.d/sgemaster.Frontend

and comment out the line:

 /bin/hostname --fqdn > $SGE_ROOT/default/common/act_qmaster

by inserting a # character at the beginning, so it becomes:

 #/bin/hostname --fqdn > $SGE_ROOT/default/common/act_qmaster

in order to prevent the file /opt/gridengine/default/common/act_qmaster from getting overwritten with incorrect data every time sgemaster.Frontend is run during bootup.

4. Restart both qmaster and execd for SGE on the frontend:

 # /etc/init.d/sgemaster.Frontend stop
 # /etc/init.d/sgemaster.Frontend start
 # /etc/init.d/sgeexecd.Frontend stop
 # /etc/init.d/sgeexecd.Frontend start

Everything should now be working. :)
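
To verify, check that the frontend shows up as an execution host and that its all.q queue instance is available:

 # qhost
 # qstat -f -q all.q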

References:

https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2009-September/042678.html

category:applications category:SGE
