About the frequency of polling job status #35

Open
lonsbio opened this issue Oct 20, 2014 · 7 comments

Comments

@lonsbio (Collaborator) commented Oct 20, 2014

From Albert.Z...@gmail.com on 2012-06-14T07:44:57Z

I'm using bpipe on TORQUE server.

I see that bpipe constantly checks the job status by calling 'bpipe-torque.sh status <job_id>'. From my point of view, bpipe is checking too often (every second).

So far I haven't seen a way to set this frequency. Would it be possible to add a parameter for this in the next release of bpipe?

Original issue: http://code.google.com/p/bpipe/issues/detail?id=35

@lonsbio (Collaborator, Author) commented Oct 20, 2014

From Albert.Z...@gmail.com on 2012-06-13T14:47:30Z

One downside of such frequent checking is large log files.

@lonsbio (Collaborator, Author) commented Oct 20, 2014

From Albert.Z...@gmail.com on 2012-06-20T13:24:02Z

And relatively high CPU usage...

@lonsbio (Collaborator, Author) commented Oct 20, 2014

From ssade...@gmail.com on 2012-06-21T17:40:22Z

I definitely agree this would be good to make customizable (and very easy to do!).

I wonder if you see this as a single pipeline-wide configuration setting or something that might change per-command?

Also, rather than a fixed interval, perhaps some kind of exponential backoff might be appropriate? The idea is that if it is a very short command, or a command that fails, it is good to avoid large latency in getting the status, especially since on some systems the status for jobs that have finished may not persist very long. So I'm thinking about two values, a "minimum" poll interval and a "maximum", with Bpipe doing an exponential backoff between the two; see the sketch below.
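
A minimal sketch of what such a backoff loop could look like, assuming illustrative interval values, a $job_id variable set by the caller, and the existing bpipe-torque.sh status command:

    min_interval=1       # seconds; illustrative starting value
    max_interval=60      # seconds; illustrative cap
    interval=$min_interval
    while true
    do
        state=$(bash bpipe-torque.sh status "$job_id")
        case "$state" in
            COMPLETE*) break ;;    # job finished; stop polling
        esac
        sleep "$interval"
        interval=$(( interval * 2 ))    # back off exponentially
        if [[ $interval -gt $max_interval ]]
        then
            interval=$max_interval
        fi
    done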

Let me know any thoughts, and thanks for the suggestion!

Status: Started

@lonsbio (Collaborator, Author) commented Oct 20, 2014

From Albert.Z...@gmail.com on 2012-06-21T22:13:44Z

As you know, to run bpipe on a TORQUE server we must create a file "bpipe.config" with one line: executor="torque". I think in my use case, adding one more line such as frequency=600 (meaning check the job's status every 10 minutes) would be good enough; see the example below.
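
So the proposed bpipe.config would look something like this (executor is the existing setting from the source; frequency is the suggested new key, in seconds, not an existing Bpipe option):

    executor="torque"
    frequency=600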

Yes, I know that in "bpipe-torque.sh" the status() function uses "qstat" to check the job status. And sometimes the system won't keep the status for completed jobs very long. In fact, the server my lab uses is configured to remove completed jobs from the queue immediately, so qstat doesn't work at all in my case. I asked the server support staff for help, and their main concern is the server's workload, since each qstat check sets up a connection with the job scheduler.

To solve this problem, I modified the status() function to use "tracejob" instead of "qstat" to keep track of the job status. The following is my status() code:


# get the status of a job given its id
status () {
    # make sure we have a job id on the command line
    if [[ $# -ge 1 ]]
    then
        # look at the output of tracejob
        trace_output=$(tracejob -a -l -m "$1")
        trace_success=$?
        if [[ $trace_success == 0 ]]
        then
            # XXX what to do if the awk fails?
            job_state=$(echo "$trace_output" | grep 'COMPLETE')
            if [[ -z $job_state ]]
            then
                job_state=$(echo "$trace_output" | grep 'Run')
                if [[ -z $job_state ]]
                then
                    echo WAITING
                else
                    echo RUNNING
                fi
            else
                # capture the numeric exit status; match() with an array
                # argument is a gawk extension
                job_state=$(echo "$trace_output" | awk 'match($0, /Exit_status=([0-9]+)/, a) {print a[1]}')
                echo "COMPLETE $job_state"
            fi
            exit $SUCCESS
        else
            exit $TRACE_FAILED
        fi
    else
        echo "$program_name ERROR: status requires a job identifier"
        exit $STATUS_MISSING_JOBID
    fi
}
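
With this version, the script prints one of the three states; for example, for a job that finished with exit status 0 (job id illustrative):

    $ bash bpipe-torque.sh status 12345
    COMPLETE 0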


The way tracejob works is that, instead of initiating a connection with the job scheduler on the server and checking the queue status, it reads the server log files, which are kept for several days to weeks even after the job is done. The downside is that only the head node of the server has access to those log files. However, this works for my specific situation.

Any thoughts?

@lonsbio (Collaborator, Author) commented Oct 20, 2014

From ssade...@gmail.com on 2012-06-22T19:57:54Z

Thanks for the follow-up thoughts - I will discuss this with the author of the Torque support in Bpipe (it wasn't me) and get him to follow up.

@lonsbio (Collaborator, Author) commented Oct 20, 2014

From bjp...@unimelb.edu.au on 2012-06-25T01:23:55Z

I agree that polling qstat is not ideal, and certainly no good if your Torque installation removes jobs immediately.

Our sys admins were kind enough to extend the time jobs are retained for a while after they complete (personally I think this is a reasonable thing to do, but it depends on the system you are using).

As you say, tracejob can be used as a workaround, but also as you say, it has issues with privileges.

The Torque manual says:

"To function properly, it must be run on a node and as a user which can access these files. By default, these files are all accessible by the user root and only available on the cluster management node. "

For example, regular users on our system cannot use tracejob.

We have found that 10 seconds is reasonable for polling qstat, as long as the job records are kept for a short time after the jobs have completed.

Perhaps we should bite the bullet and look at using a library such as DRMAA ( http://www.drmaa.org/ ) to launch jobs? I've had it on my whiteboard for months now :)

@lonsbio (Collaborator, Author) commented Oct 20, 2014

From bjp...@unimelb.edu.au on 2012-06-25T21:54:32Z

I've been advised by some knowledgeable folks of a couple of things that might help.

  1. qsub supports an "x" flag which can be used in addition to the -I (capital i) for interactive jobs.

This allows you to execute a command and wait for it to complete:

$ cat hostname.sh
#!/bin/bash
hostname

$ qsub -Ix ./hostname.sh
qsub: waiting for job 1026513 to start
qsub: job 1026513 ready

bruce009

qsub: job 1026513 completed

  2. Moab's showq supports a -c option which shows information about complete jobs for JOBCPURGETIME (default 5 minutes), including the exit code (and it has a --xml option). Of course this requires that the site is using Moab in addition to Torque; see the example below.
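
For example (flags as described above; the exact output shape, and whether the flags combine, depends on the Moab version):

    $ showq -c          # completed jobs retained for JOBCPURGETIME, with exit codes
    $ showq -c --xml    # the same data as XML, easier to parse from a script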
