Sun Grid Engine problems
- Make sure that all nodes are defined correctly in the /opt/gridengine/default/common/host_aliases file.
- Make sure that the head node name is defined in /opt/gridengine/default/common/act_qmaster on all execute nodes. The head node name should be given as headnode_name.local, not as an FQDN (such as headnode_name.mynetwork.com), which does not work here.
- Make certain that all desired nodes are defined as submit hosts by running
qconf -ss
on the head node (qconf -sh lists the administrative hosts, not the submit hosts). If desired nodes are missing, add them with something like this:
for node in 1 2 3 5 6; do qconf -as compute-1-${node}; done
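The first two checks in the list above can be scripted. The following is a minimal sketch, not an SGE tool: the function names check_act_qmaster and missing_submit_hosts, and the files wanted.txt and current.txt, are hypothetical. It assumes that act_qmaster holds the master host name on its first line and that qconf -ss prints one host name per line.

```shell
# Hypothetical helpers; neither function is part of SGE.

# Report whether an act_qmaster file names the head node with a
# .local suffix. Assumes the host name is on the first line.
check_act_qmaster() {
    name=$(head -n 1 "$1" | tr -d '[:space:]')
    case $name in
        *.local) echo "ok: $name" ;;
        *)       echo "bad: $name (expected a .local name)"; return 1 ;;
    esac
}

# Print the hosts listed in file $1 (wanted) that do not appear in
# file $2 (current submit hosts), one host name per line in each.
missing_submit_hosts() {
    # -F fixed strings, -x whole lines, -v invert, -f patterns from file.
    # grep exits 1 when nothing is missing; treat that as success.
    grep -F -x -v -f "$2" "$1" || [ $? -eq 1 ]
}

# Example (not run here):
#   check_act_qmaster /opt/gridengine/default/common/act_qmaster
#   qconf -ss > current.txt
#   missing_submit_hosts wanted.txt current.txt |
#       while read -r host; do qconf -as "$host"; done
```

The comparison is an exact string match, so the names in the wanted list need to be written in the same form that qconf -ss prints them.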
After re-running /opt/gridengine/install_qmaster to change the name of the qmaster system from cluster-rcn.mynetwork.com to cluster-rcn.local, got the following message near the end of the install:
Grid Engine qmaster and scheduler startup
-----------------------------------------
Starting qmaster and scheduler daemon. Please wait ...
starting sge_qmaster
starting sge_schedd
error: commlib error: endpoint is not unique error (endpoint "cluster-rcn.mynetwork.com/schedd/1" is already connected)
error: getting configuration: unable to contact qmaster using port 536 on host "cluster-rcn.mynetwork.com"
error: there is already a client endpoint cluster-rcn.mynetwork.com/schedd/1 connected to qmaster service
critical error: scheduler already running
Hit <RETURN> to continue >>
It turned out that old sge_qmaster and sge_execd processes were still running. Killed them manually, and the install script then ran successfully to completion. While running /opt/gridengine/install_qmaster, defined cluster-rcn.local as the master host, configured no shadow hosts, answered 'y' to multiple DNS domains, defined ".local" as the desired domain, and took all other defaults.
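Leftover daemons like these can be spotted before re-running the installer. The sketch below is only an illustration: stale_sge_daemons is a hypothetical name, and the function simply filters a process listing read from standard input for the three daemon names mentioned in these notes.

```shell
# Hypothetical helper: filter a process listing (read from stdin) down
# to Grid Engine daemons that may be left over from an earlier run.
stale_sge_daemons() {
    # grep exits 1 when no daemon is found; treat that as success.
    grep -E 'sge_(qmaster|schedd|execd)' || [ $? -eq 1 ]
}

# Example (not run here):
#   ps -e -o pid,comm | stale_sge_daemons
```

Any process it prints should be stopped, or killed manually as was done here, before the install script is started again.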
Errors found in the SGE configuration (use 'qconf -mconf' to fix them):
- telnet, rsh, and qlogin daemons should have been set to /usr/sbin/sshd (they were set to /usr/bin/sshd).
- X11 forwarding can be turned on for interactive sessions by appending -X to the /usr/sbin/ssh commands used for qlogin, qsh, or qrsh commands.
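The daemon paths can also be checked without opening an editor by filtering the output of qconf -sconf. This is a rough sketch under assumptions: check_daemon_paths is a hypothetical name, and the parameters are assumed to appear as qlogin_daemon, rlogin_daemon and rsh_daemon, one per line, with the command path as the second field.

```shell
# Hypothetical helper: read an SGE configuration listing on stdin and
# print every *_daemon entry whose command is not /usr/sbin/sshd.
# Exits non-zero if at least one such entry is found.
check_daemon_paths() {
    awk '
        $1 ~ /^(qlogin|rlogin|rsh)_daemon$/ && $2 != "/usr/sbin/sshd" {
            print $1 ": " $2
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# Example (not run here):
#   qconf -sconf | check_daemon_paths
```

Only the path is compared, so any options that follow it on the same line are ignored.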
Hung jobs were being left in the job queue on the qmaster, and their processes were still running on the node even after a 'qdel -u <username>' was given. The sgeexecd service on the node had died for some reason and had to be restarted. The first attempt to restart it failed with the error "not a submit host". Running 'qconf -as <nodename>' from the head node fixed this, and /etc/init.d/sgeexecd then restarted successfully on the affected node. The deleted jobs were then killed automatically, and the node went back to functioning normally.
After the head node was rebooted, the cluster nodes stopped accepting jobs from it, one by one. To discover what the problem was, I did a:
qstat -explain a -q low.q | less
and found that sge_execd was in an unknown state on most of the nodes. Did a:
cluster-fork 'cd /etc/init.d; ./sgeexecd softstop; ./sgeexecd start'
And that cleared up all nodes except c1-15. Logged in to that node and re-ran the execd installer, accepting all defaults:
ssh c1-15
cd $SGE_ROOT; ./install_execd
At the end of the install, did a softstop and start as above, and the node worked again.
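To see at a glance which nodes still need attention, the queue listing can be reduced to the queue instances in an unknown state. This is a sketch under assumptions: unknown_queue_instances is a hypothetical name, and the code assumes the usual qstat -f layout in which the queue instance (queue@host) is the first field and the state letters, including u for unknown, are the sixth.

```shell
# Hypothetical helper: read qstat -f style output on stdin and print
# the queue instances whose state field contains "u" (unknown).
# Assumes the state letters are the sixth whitespace-separated field.
unknown_queue_instances() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /u/ { print $1 }'
}

# Example (not run here):
#   qstat -f -q low.q | unknown_queue_instances
```

Queue instances in a normal state have an empty state field, so their lines have only five fields and are skipped.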
After setting up an existing SGE node with LDAP, you may find that Sun Grid Engine still does not work with users from your LDAP server.
When submitting jobs as an LDAP user (in this example jordan), you may see errors like:
03/09/2008 17:45:08|qmaster|cluster1|W|job 34.1 failed on host compute-0-6.local general assumedly before job because: can't get password entry for user "jordan". Either the user does not exist or NIS error!
One solution is to restart /etc/init.d/sgeexecd on each client that you have recently pointed to LDAP, but the problem actually lies outside of SGE. Check the caching settings on the LDAP clients and server. The problem can also stem from unexpected sources, such as the permissions settings on the users' home directories.
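Since the error says the node cannot get a password entry for the user, a quick first check is whether the user name resolves on that node at all. The sketch below uses the standard id command for this; the function name user_resolves is hypothetical, and a successful lookup does not rule out the caching or home directory permission problems mentioned above.

```shell
# Hypothetical helper: succeed if the given user name can be resolved
# on this machine (local files, LDAP, or whatever the system uses).
user_resolves() {
    id -u "$1" >/dev/null 2>&1
}

# Example (not run here), using the user from the log message:
#   user_resolves jordan || echo "jordan does not resolve on $(hostname)"
```

Running the check on each compute node, for example through cluster-fork, shows which nodes still cannot see the LDAP users.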
© 2014 www.rocksclusters.org. All Rights Reserved.