
Graph of dependent jobs? #4

Open
mllg opened this issue Nov 13, 2013 · 9 comments

@mllg
Member

mllg commented Nov 13, 2013

SRC: https://code.google.com/p/batchjobs/issues/detail?id=19

For some experiments it MIGHT be useful to be able to specify a graph of dependent jobs, similar to how targets are defined in a Makefile.

This means that, for some jobs to start, others have to be fully completed first. The solution is probably a simple topological sort with respect to these preconditions.
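A minimal sketch of such a topological sort in base R (the encoding is purely illustrative: deps[[id]] lists the jobs that must finish before job id):

topo_sort <- function(deps) {
  sorted <- character(0)
  remaining <- names(deps)
  while (length(remaining) > 0) {
    # A job is runnable once all of its preconditions are already sorted.
    runnable <- remaining[vapply(remaining, function(id)
      all(deps[[id]] %in% sorted), logical(1))]
    if (length(runnable) == 0) stop("cycle in job dependencies")
    sorted <- c(sorted, runnable)
    remaining <- setdiff(remaining, runnable)
  }
  sorted
}

# Example: job "c" depends on "a" and "b", which have no preconditions.
topo_sort(list(a = character(0), b = character(0), c = c("a", "b")))
#> [1] "a" "b" "c"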

But I want to collect more use cases before we look into this again.

@seandavi

Is this still on the radar? I'm currently using the Python tool Snakemake for job submission and dependency management. There are many such workflow systems available now, but none in R that I know of.

@ramanshah

I second this. I am working on bringing parallelism to the dscr project (https://github.com/stephens999/dscr) and hope to use BatchJobs to abstract away the serial/multicore/cluster contexts. In our dscr workflows we cache objects at many stages, the costly parts of the computation vary, and intermediate objects can often be reused, so dependency management of some sort looks crucial.

The cluster engines I've investigated (TORQUE, SGE, SLURM) all appear to allow the user to specify dependencies based on completion of previous jobs (specified by the scheduler's job ID). My hope is for BatchJobs to be able to receive dependencies from the user, encode them in the registry, and emit the appropriate dependency clauses to the cluster system.
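For reference, these dependency clauses are just flags on the submit command; a hedged sketch of how they might be emitted (the helper and its arguments are hypothetical, the flags themselves are the schedulers' real ones):

# Hypothetical helper: build the after-ok dependency flag for a submit call,
# given the scheduler job IDs of the jobs that must finish first.
dependency_flag <- function(scheduler, dep_ids) {
  switch(scheduler,
    slurm  = sprintf("--dependency=afterok:%s", paste(dep_ids, collapse = ":")),  # sbatch
    torque = sprintf("-W depend=afterok:%s", paste(dep_ids, collapse = ":")),     # qsub
    # Note: SGE's -hold_jid releases on completion regardless of exit status.
    sge    = sprintf("-hold_jid %s", paste(dep_ids, collapse = ",")),             # qsub
    stop("unknown scheduler"))
}

dependency_flag("slurm", c(1001, 1002))
#> [1] "--dependency=afterok:1001:1002"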

If dependency management is considered to fit into the overall goals of BatchJobs, I'd be happy to look deeper into the implementation and work on a pull request.

@seandavi

Leaving dependency management to the scheduler has some disadvantages, including the inability to test for error conditions on exit of dependent jobs.

@ramanshah

Interesting point - is there a good alternative?

@seandavi

Managing the dependencies in R is much more flexible. The first pass is to simply make a graph of the job dependencies and then track completed jobs. A second step might include hooks to check for the appropriate completion of jobs. A third might include automated dependency checking to determine if a job needs to be re-run (based on inputs changing, etc.).
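A minimal sketch of that first pass, assuming a named-list dependency graph and a vector of completed job IDs (all names illustrative):

# Given the dependency graph and the set of jobs that finished successfully,
# return the pending jobs that can be submitted now.
runnable_jobs <- function(deps, done) {
  pending <- setdiff(names(deps), done)
  pending[vapply(pending, function(id) all(deps[[id]] %in% done), logical(1))]
}

deps <- list(fit = character(0), predict = "fit", report = c("fit", "predict"))
runnable_jobs(deps, done = character(0))
#> [1] "fit"
runnable_jobs(deps, done = "fit")
#> [1] "predict"

Driving this in a loop means polling for finished jobs and submitting whatever becomes runnable, which is the busy-waiting trade-off raised further down in this thread.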


@krlmlr

krlmlr commented Jan 7, 2016

👍

@seandavi: At least LSF (which I'm primarily interested in) can make a dependency conditional on the exit status. Isn't this true for other schedulers?

If the scheduler knows about dependencies, this is by far the easiest approach. The workflow you suggested -- reimplementing this in R -- sounds a bit like reimplementing make or SCons or whatnot. It's more flexible, for sure, but also much more tedious and error-prone. Also, if we do our own scheduling, we need a constantly running process that busy-waits in order to schedule runnable jobs.

Checking if a job needs to be re-run can be done as part of the job itself:

if (digest::digest(input) == digest::digest(last_good_input)) quit(save = "no", status = 0)
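Fleshed out slightly (the cached digest has to be persisted somewhere; the file path and the input object are illustrative):

# Hypothetical skip-if-unchanged guard at the top of a job script.
library(digest)
cache_file <- "last_input_digest.rds"
current <- digest(input)
if (file.exists(cache_file) && identical(current, readRDS(cache_file)))
  quit(save = "no", status = 0)  # input unchanged, nothing to do
# ... run the job, then record the digest it succeeded on:
saveRDS(current, cache_file)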

@ramanshah: Do you have any updates?

@krlmlr

krlmlr commented Jan 7, 2016

@mllg: My use case is a web of data pipelines: Each stage processes data and creates artifacts, some of which are processed in subsequent stages. Currently I'm using make (with an autogenerated Makefile), but BatchJobs scales so much better :-)

@seandavi: Of course, for the "multicore" schedulers we'll need our own dependency handling. Which, again, could happen with an autogenerated Makefile.
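A sketch of such a generator, assuming one sentinel file per job and a hypothetical run_job.R entry script:

# Write one Makefile rule per job, with its dependencies as prerequisites.
write_makefile <- function(deps, path = "Makefile") {
  rules <- vapply(names(deps), function(id)
    sprintf("%s.done: %s\n\tRscript run_job.R %s && touch %s.done",
            id, paste0(deps[[id]], ".done", collapse = " "), id, id),
    character(1))
  writeLines(c(sprintf("all: %s", paste0(names(deps), ".done", collapse = " ")),
               rules), path)
}
# Then `make -j4` gives multicore execution with dependency handling for free.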

@ramanshah

@krlmlr I left the position where I was working on this problem as part of my day job, so there likely won't be substantial news from me anymore. The group seems interested in building the benchmarking framework on top of a different foundation, possibly Snakemake or an implementation of the Common Workflow Language. Basically, the project involves executing a highly heterogeneous multi-step dependency graph, which is not really the kind of problem BatchJobs excels at, so we started going in a different direction as of last fall. But @road2stat, who is in the driver's seat for this project now, may have other thoughts.

@seandavi

seandavi commented Jan 8, 2016

@ramanshah, what you describe is exactly what I am interested in. There are many frameworks for doing this kind of thing:

https://github.com/pditommaso/awesome-pipeline

It would be great to do something in R related to common-workflow-language. I'd definitely be interested in working with you and @road2stat on this.
