
Integration with batchJobs #46

Open
stephens999 opened this issue May 26, 2015 · 4 comments

@stephens999
Owner

We've had various discussions about how to provide better support for long-running jobs. It seems to me that by making use of the BatchJobs package, and particularly its waitForJobs function, we should be able to get something that works with relatively little code.

Currently we have, in run_dsc, the code:

runScenarios(dsc, scenariosubset, seedsubset)
runMethods(dsc, scenariosubset, methodsubset, seedsubset)
runOutputParsers(dsc)
runScores(dsc, scenariosubset, methodsubset)

The simplest approach that I can see would involve submitting jobs to do each
of these functions, and using waitForJobs to wait between each job set.

runScenarios (submitted through BatchJobs)
waitForJobs()
runMethods (again through BatchJobs, using a second registry of jobs this time)
waitForJobs()
runOutputParsers (again through BatchJobs, a third registry)
waitForJobs()
runScores (BatchJobs, a fourth registry)
waitForJobs()
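The pattern above could be sketched roughly as follows. This is a minimal sketch, not a working implementation: the per-item worker functions and their argument lists are assumptions (run_dsc's real internals may differ), while the registry/map/submit/wait calls are BatchJobs' standard workflow.

```r
# Sketch of one pipeline stage as a BatchJobs registry plus a barrier.
# ASSUMPTION: the per-item workers (runScenario etc.) are hypothetical
# stand-ins for whatever run_dsc actually calls per scenario/method.

run_stage <- function(id, fun, items) {
  # One registry per stage, as proposed above.
  reg <- BatchJobs::makeRegistry(id = id, file.dir = file.path("bj-files", id))
  BatchJobs::batchMap(reg, fun, items)  # one cluster job per element of items
  BatchJobs::submitJobs(reg)            # hand the jobs to the configured backend
  BatchJobs::waitForJobs(reg)           # global barrier: block until all finish
  invisible(reg)
}

# Usage sketch (hypothetical worker names and subsets):
# run_stage("scenarios", function(s) runScenario(dsc, s),     scenariosubset)
# run_stage("methods",   function(m) runMethod(dsc, m),       methodsubset)
# run_stage("parsers",   function(p) runOutputParser(dsc, p), dsc$parsers)
# run_stage("scores",    function(s) runScore(dsc, s),        scoresubset)
```

Each stage gets its own registry directory, so failed jobs in one stage can be inspected or resubmitted without touching the others; the waitForJobs call is what serializes the four stages.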

@ramanshah is there a reason you can see that this would not work?

@ramanshah
Contributor

I've thought about this strategy. Here are my reasons for hesitation:

  1. Some clusters have enormous latency between job submission and job execution. Much of my past research was on clusters where that wait tends to run in the 1-4 day range; with four barriers, quadrupling such latency would be painful.
  2. I know that in some fields (e.g. quantitative finance; Rick's experience seems to agree with this), a really slow "brute force" methodology involving some pedantic Monte Carlo simulation is often the benchmark for the cleverer methodology. This might also be true for our genomic work, but I'm not sure. A single extremely slow method in a dsc would hold up all of the faster methods.
  3. As you have mentioned in other issues, input parsers and the use of multiple pre/post-processing steps could make the maximum number of global barriers (waitForJobs() calls) even larger.

If you feel these aren't important, we can definitely do it this way. Your suggestion is probably the simplest implementation.

@stephens999
Owner Author

I think 1 will presumably depend on the cluster environment, but it isn't a problem I have come across in practice with the clusters we are using.

For 2, this scenario is indeed not out of the question, but it is easily dealt with: first run your dsc with all the fast methods, then add the slow method and run again.

For 3, I agree, but I actually suspect that in most use cases the methods will be the rate-determining step, not the waitForJobs() calls on parsers etc.

I think the issue is urgent enough, and this approach simple enough, that we would be best off implementing it first, and seeing what our next bottleneck turns out to be.

@ramanshah
Contributor

Sounds good.

@ramanshah
Contributor

Probably addresses #23 as well.
