About the frequency of polling job status #35
Comments
From Albert.Z...@gmail.com on 2012-06-13T14:47:30Z The downside of too-frequent checking is large log files.
From Albert.Z...@gmail.com on 2012-06-20T13:24:02Z And relatively high CPU usage...
From ssade...@gmail.com on 2012-06-21T17:40:22Z I definitely agree this would be good to make customizable (and very easy to do!). I wonder whether you see this as a single pipeline-wide configuration setting or something that might change per command? Also, rather than a fixed interval, perhaps some kind of exponential backoff would be appropriate. The idea is that for a very short command, or a command that fails, it is good to avoid large latency in getting the status, especially since on some systems the status of finished jobs may not persist very long. So I'm thinking of two values, a "minimum" poll interval and a "maximum", with Bpipe doing an exponential backoff between the two. Let me know any thoughts, and thanks for the suggestion! Status: Started
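A minimal sketch of what such a min/max exponential backoff could look like (the variable names and values are illustrative, not Bpipe's actual implementation):

```shell
#!/bin/sh
# Hypothetical sketch of backing off between a minimum and maximum
# poll interval; the values below are assumptions, not Bpipe defaults.
min_interval=1     # assumed: first check after 1 second
max_interval=600   # assumed: never wait more than 10 minutes
interval=$min_interval

for attempt in 1 2 3 4 5 6 7 8 9 10 11 12; do
    echo "poll attempt $attempt: next check in ${interval}s"
    # in a real poller this would be: sleep "$interval"; qstat "$job_id"
    interval=$(( interval * 2 ))
    if [ "$interval" -gt "$max_interval" ]; then
        interval=$max_interval
    fi
done
```

Short jobs get their status within a second or two, while long-running jobs settle at one check every ten minutes, which addresses both the latency and the scheduler-load concerns in this thread.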
From Albert.Z...@gmail.com on 2012-06-21T22:13:44Z You know, if we want to run bpipe on a TORQUE server, we must create a file "bpipe.config" with one line: executor="torque". I think in my use case, adding one more line like frequency=600 (meaning check the job's status every 10 minutes) should be good enough.

I know that in "bpipe-torque.sh" the status() function uses "qstat" to check the job status, and sometimes the system won't keep the status of completed jobs very long. In fact, the server my lab is using is configured to remove completed jobs from the queue immediately, so qstat doesn't work for me at all. I asked the server support people for help, and their major concern is the server's workload, since each qstat check sets up a connection with the job scheduler.

To solve this problem, I modified the status() function to use "tracejob" instead of "qstat" to keep track of the job status. Following is my status() code:

```sh
# get the status of a job given its id
status () {
    # make sure we have a job id on the command line
    if [[ $# -ge 1 ]]
    then
        tracejob "$1"
    fi
}
```

The way tracejob works is that, instead of initiating a connection with the job scheduler and checking the queue status, it reads the server log files, which are kept for days to weeks even after a job is done. The downside is that only the head node of the server has access to those log files. However, this works for my specific situation. Any thoughts?
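Under this proposal, the bpipe.config described above might look like the following (executor="torque" is the existing setting; the frequency key is the commenter's suggestion, not an option Bpipe actually supports here):

```groovy
// bpipe.config -- "frequency" is the proposed option, not yet in Bpipe
executor="torque"
frequency=600   // proposed: poll job status every 600 seconds (10 minutes)
```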
From ssade...@gmail.com on 2012-06-22T19:57:54Z Thanks for the follow-up thoughts - I will discuss this with the author of the Torque support in Bpipe (it wasn't me) and get him to follow up.
From bjp...@unimelb.edu.au on 2012-06-25T01:23:55Z I agree that polling qstat is not ideal, and certainly no good if your Torque installation removes jobs immediately. Our sysadmins were kind enough to extend the time jobs are retained for a while after they complete (personally I think this is a reasonable thing to do, but it depends on the system you are using).

As you say, tracejob can be used as a workaround, but also as you say, it has issues with privileges. The Torque manual says: "To function properly, it must be run on a node and as a user which can access these files. By default, these files are all accessible by the user root and only available on the cluster management node." For example, regular users on our system cannot use tracejob.

We have found that 10 seconds is reasonable for polling qstat, as long as the job records are kept for a short time after the jobs have completed. Perhaps we should bite the bullet and look at using a library such as DRMAA ( http://www.drmaa.org/ ) to launch jobs? I've had it on my whiteboard for months now :)
From bjp...@unimelb.edu.au on 2012-06-25T21:54:32Z I've been advised by some knowledgeable folks of a couple of things that might help.

This allows you to execute a command and wait for it to complete:

```
$ cat hostname.sh
$ qsub -Ix ./hostname.sh
bruce009
qsub: job 1026513 completed
```
From Albert.Z...@gmail.com on 2012-06-14T07:44:57Z
I'm using bpipe on a TORQUE server.
I see that bpipe constantly checks the job status by calling 'bpipe-torque.sh status <job_id>'. From my point of view, bpipe is checking far too often (every second).
So far I haven't seen a way to set this frequency. Is it possible to add a parameter for this in the next release of bpipe?
Original issue: http://code.google.com/p/bpipe/issues/detail?id=35