sun gridengine problems

nadyawilliams edited this page Jun 11, 2014 · 2 revisions

Can't Get install_execd to run on Nodes

  • Make sure that all nodes are defined correctly in the /opt/gridengine/default/common/host_aliases file.
  • Make sure that the head node name is defined in /opt/gridengine/default/common/act_qmaster on all execute nodes. The head node name should be given as headnode_name.local, not as an FQDN (such as headnode_name.mynetwork.com), which does not work here.
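
The act_qmaster check can be scripted on each execute node. This is a minimal sketch, assuming the default cell path given above; the helper name is_local_name is hypothetical and not part of SGE.

```shell
# Hypothetical helper (not part of SGE): succeed only if the name ends in ".local".
is_local_name() {
    case "$1" in
        ?*.local) return 0 ;;
        *)        return 1 ;;
    esac
}

# Path taken from the text above; adjust it if your cell is not "default".
ACT_QMASTER=/opt/gridengine/default/common/act_qmaster

if [ -r "$ACT_QMASTER" ]; then
    name=$(head -n 1 "$ACT_QMASTER")
    if is_local_name "$name"; then
        echo "OK: act_qmaster is $name"
    else
        echo "WARNING: act_qmaster is '$name', expected headnode_name.local"
    fi
fi
```

On a node where the file is missing or unreadable the script prints nothing, which is itself worth investigating.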

Can't Get Certain Nodes to Work as Submit Hosts

  • Make certain that all desired nodes are defined as submit hosts by running qconf -ss on the head node (qconf -sh lists administrative hosts, not submit hosts). If desired nodes are missing, add them with something like this:
        for node in 1 2 3 5 6; do  qconf -as compute-1-${node}; done
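
To add only the nodes that are actually missing, the current list can be compared against the desired one first. This is a sketch under assumptions: the helper missing_submit_hosts is hypothetical, the node names are examples, and it assumes the names are typed exactly as qconf -ss prints them (the listing may show them with a domain suffix such as .local).

```shell
# Hypothetical helper: print each desired host that is missing from the current list.
#   $1      newline-separated list of current submit hosts
#   $2...   desired host names
missing_submit_hosts() {
    current=$1
    shift
    for host in "$@"; do
        if ! printf '%s\n' "$current" | grep -qxF -- "$host"; then
            printf '%s\n' "$host"
        fi
    done
}

# On the head node, add only the hosts that are not yet submit hosts.
# The compute-1-N names below are examples.
if command -v qconf >/dev/null 2>&1; then
    for host in $(missing_submit_hosts "$(qconf -ss)" compute-1-1 compute-1-2 compute-1-3); do
        qconf -as "$host"
    done
fi
```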

Problems when changing qmaster machine name

After re-running /opt/gridengine/install_qmaster to change the name of the qmaster system from cluster-rcn.mynetwork.com to cluster-rcn.local, got the following message near the end of the install:

Grid Engine qmaster and scheduler startup
-----------------------------------------

Starting qmaster and scheduler daemon. Please wait ...
   starting sge_qmaster
      starting sge_schedd
      error: commlib error: endpoint is not unique error (endpoint "cluster-rcn.mynetwork.com/schedd/1" is already connected)
      error: getting configuration: unable to contact qmaster using port 536 on host "cluster-rcn.mynetwork.com"
      error: there is already a client endpoint cluster-rcn.mynetwork.com/schedd/1 connected to qmaster service
      critical error: scheduler already running
      Hit <RETURN> to continue >>

It turned out that old sge_qmaster and sge_execd processes were still running. Killed them manually, and the install script then ran successfully to completion. While running /opt/gridengine/install_qmaster, defined cluster-rcn.local as the master host, defined no shadow hosts, answered 'y' to multiple DNS domains, defined ".local" as the desired domain, and took all other defaults.
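
Before re-running the installer, leftover daemons can be listed with a short check like the one below. It is a sketch only: the helper name find_stale_sge is hypothetical, and it matches the daemon names that appear in the notes and the log above.

```shell
# Hypothetical helper: filter a "PID COMMAND" listing down to SGE daemons.
find_stale_sge() {
    grep -E 'sge_(qmaster|schedd|execd)'
}

if command -v ps >/dev/null 2>&1; then
    # "|| true" keeps the script going when grep finds no match.
    stale=$(ps -e -o pid= -o comm= 2>/dev/null | find_stale_sge || true)
    if [ -n "$stale" ]; then
        echo "SGE daemons still running; stop them before re-running install_qmaster:"
        echo "$stale"
    fi
fi
```

The first column of the output is the PID to pass to kill.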

Problems Using ssh with SGE

There were errors in the SGE configuration; use 'qconf -mconf' to fix them. They were:

  • telnet, rsh, and qlogin daemons should have been set to /usr/sbin/sshd (they were set to /usr/bin/sshd).
  • X11 forwarding can be turned on for interactive sessions by appending -X to the /usr/bin/ssh commands used for qlogin, qsh, or qrsh.
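
The daemon paths can be checked without opening the editor. This is a sketch under assumptions: the helper check_sshd_paths is hypothetical, and the parameter names qlogin_daemon, rlogin_daemon, and rsh_daemon are assumed to be the configuration entries behind the daemons mentioned above. Confirm them against your own 'qconf -sconf' output.

```shell
# Hypothetical helper: read configuration text on stdin and report every
# *_daemon setting whose program path is not /usr/sbin/sshd.
check_sshd_paths() {
    awk '$1 ~ /^(qlogin|rlogin|rsh)_daemon$/ && $2 != "/usr/sbin/sshd" { print $1 ": " $2 }'
}

# Show the current configuration and flag suspicious daemon paths.
if command -v qconf >/dev/null 2>&1; then
    qconf -sconf | check_sshd_paths
fi
```

No output means every daemon entry found already points at /usr/sbin/sshd.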

Hung Jobs on One or Multiple Nodes

Hung jobs were being left in the job queue on the qmaster, and their processes were still running on the node even after a 'qdel -u <username>' was given. Had to restart /etc/init.d/sgeexecd on the node; for some reason it had died. On the first attempt to restart sgeexecd on the node, the error "not a submit host" came up. Did a 'qconf -as <nodename>' from the head node to fix this, and then successfully restarted /etc/init.d/sgeexecd on the affected node. The deleted jobs were then killed automatically, and the node proceeded to function normally.

Nodes are Dropped from Queue List

One by one, the cluster nodes stopped accepting jobs from the head node after it was rebooted. To discover what the problem was, I did a:

 qstat -explain a -q low.q | less

and found that sge_execd was in an unknown state on most of the nodes. Did a:

 cluster-fork 'cd /etc/init.d; ./sgeexecd softstop; ./sgeexecd start'

And that cleared up all nodes except c1-15. Did a:

 ssh c1-15
 cd $SGE_ROOT; ./install_execd    # accepted all defaults

At the end of the install, did a softstop and start as above - worked.
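
To get a plain list of the affected nodes instead of paging through the qstat output, the full listing can be filtered. This is a sketch under assumptions: the helper unknown_hosts is hypothetical, and it assumes that in 'qstat -f' output the states column is the sixth field of each queue@host line, with "u" marking an unknown state. Check this against your version's output before relying on it.

```shell
# Hypothetical helper: read `qstat -f` style output on stdin and print the
# host part of every queue instance whose states column contains "u".
# Assumes the states column is the sixth field of each queue@host line.
unknown_hosts() {
    awk '$1 ~ /@/ && $6 ~ /u/ { split($1, part, "@"); print part[2] }'
}

# low.q is the queue name used in the notes above.
if command -v qstat >/dev/null 2>&1; then
    qstat -f -q low.q | unknown_hosts
fi
```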

Sun Grid Engine won't work with LDAP

After setting up an existing SGE node with LDAP, you may find that Sun Grid Engine still does not work with users from your LDAP server.

When submitting jobs as an LDAP user (in this example jordan), you may see errors like:

  03/09/2008 17:45:08|qmaster|cluster1|W|job 34.1 failed on host compute-0-6.local general assumedly before job because: can’t get password entry for user “jordan”.
  Either the user does not exist or NIS error!

One solution is to restart /etc/init.d/sgeexecd on each client that you have recently pointed to LDAP, but the problem actually lies outside of SGE. Check the caching settings on the LDAP clients and server. The problem can also stem from unexpected sources, such as the permissions settings on the users' home directories.
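
A quick way to see whether a node can resolve the user at all is to ask the node's own name service. This is a minimal sketch: the helper user_resolves is hypothetical, and it assumes that a lookup with id reflects what sge_execd sees when it asks for the user's password entry.

```shell
# Hypothetical helper: succeed if the local name service can resolve the user.
user_resolves() {
    id -u "$1" >/dev/null 2>&1
}

user=jordan   # example user from the log message above
if user_resolves "$user"; then
    echo "$user resolves on $(uname -n)"
else
    echo "$user does NOT resolve on $(uname -n); check LDAP client settings and caching"
fi
```

Running the same check on every node (for example through cluster-fork) shows which clients still fail the lookup.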
