
Create dask task graph across submitted calculations to enable further optimization #169

Open
spencerahill opened this issue Apr 3, 2017 · 6 comments

Comments

@spencerahill
Owner

Haven't thought carefully about this yet, but it would be great if we could shed the multiprocess dependency and do everything through dask.

Obviously dask.array isn't the right fit. But from the very top of their docs home page:

Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.

It looks like dask.Bag is more what we're looking for: "Dask.Bag implements operations like map, filter, fold, and groupby on collections of Python objects."

Ultimately all we use multiprocess for is a map:

        pool = multiprocess.Pool()
        return pool.map(lambda calc:
                        _compute_or_skip_on_error(calc, compute_kwargs),
                        calcs)
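
For comparison, here is a rough sketch (not aospy's actual code) of what that same map might look like with dask.bag, assuming the same `calcs`, `_compute_or_skip_on_error`, and `compute_kwargs` as above:

    import dask.bag as db

    # Build a bag from the submitted calculations, map the compute function
    # over it, and let the dask scheduler run the resulting task graph.
    bag = db.from_sequence(calcs)
    return bag.map(
        lambda calc: _compute_or_skip_on_error(calc, compute_kwargs)
    ).compute()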

Bonus points if we were able to get dask's awesome task progress visualizers working :)

@spencerkclark
Collaborator

I agree, this is ultimately where we want to be -- in particular I think it would be valuable to integrate all components of all submitted aospy calculations into a single task graph (including array operations decomposed by dask), which the dask-scheduler could then optimize appropriately. This would enable better support for larger-data calculations, because we wouldn't be limiting ourselves to doing each calculation on a single node/core.
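
Just to make the idea concrete, a hypothetical sketch of the per-calculation piece using dask.delayed (the names mirror the multiprocess snippet above and are assumptions, not a design proposal):

    import dask

    # Wrap each calculation lazily; nothing executes until dask.compute(),
    # at which point the scheduler sees every calculation in one task graph
    # and can optimize and share work across them.
    lazy = [
        dask.delayed(_compute_or_skip_on_error)(calc, compute_kwargs)
        for calc in calcs
    ]
    results = dask.compute(*lazy)

The bigger payoff would come from also keeping the array operations inside each calculation as dask graphs rather than computing them eagerly, so the scheduler can optimize across those as well.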

@darothen

darothen commented Apr 3, 2017

If I can jump in here -

I'm very interested in trying out aospy to replace my Snakemake workflow (like I mentioned at the workshop at Columbia, it's a traditional pipeline on steroids), and this is one of the features I'd be most interested in leveraging. It's hard to build pipelines that include steps which require computing with a cluster; it would be much nicer if dask were doing all the task management in the workflow.

I'm still reading code, but out of the box it seems this would be a really large project. But I think there's an intermediate step - why not target joblib as a drop-in replacement for where multiprocess is used? Once that's working, you can almost trivially run your analyses on a dask cluster - see dask's "Joblib Integration" page with its scikit-learn examples. In fact, if I understand things correctly, the way joblib is used to power grid searches in scikit-learn is pretty much identical to how you're handling multiprocessing here, so you'd get all those spiffy dask visualizers and whatnot immediately!
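
For what it's worth, the drop-in part could be as small as something like this (a sketch only, reusing the names from the multiprocess snippet above):

    from joblib import Parallel, delayed

    # Same map as the multiprocess.Pool version, expressed with joblib;
    # the backend can later be pointed at a dask.distributed cluster.
    return Parallel(n_jobs=-1)(
        delayed(_compute_or_skip_on_error)(calc, compute_kwargs)
        for calc in calcs
    )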

Not volunteering myself for this project just yet - I have to learn much more about aospy before I can bite off such a significant project. But maybe towards the summer?

@spencerahill
Owner Author

If I can jump in here

Absolutely!

why not target joblib as a drop-in replacement for where multiprocess is used? Once that's working, you can almost trivially run your analyses on a dask cluster

Thanks, this is really intriguing. I wasn't even aware of joblib. @spencerkclark has worked more on this part, so he can hopefully provide a better sense of how this would look.

Not volunteering myself for this project just yet

No worries and no rush. If/when you dig into aospy more, don't hesitate to reach out w/ any questions or comments.

@spencerkclark
Collaborator

@darothen thanks for jumping in here! It's great to have your input.

out-of-the-box it seems this would be a really large project.

I fully agree; it would take some serious work, and possibly some redesign of aospy's internals to be able to fully utilize dask. I still need to play around with dask a little more to get a sense for what the best approach here (in aospy) would be.

why not target joblib as a drop-in replacement for where multiprocess is used?

This is a really nice intermediate option: it removes the multiprocess dependency (albeit introducing a dependency on joblib) and enables integration with a dask.distributed scheduler (and its associated status page!). I have a crude version working locally, but I need to iron out some details regarding logging and how best to let the user set their scheduler options in aospy (e.g. the scheduler_host address). I'll try to get a PR up in the next few days.
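
Roughly, the distributed piece might look something like the following (a sketch, not the eventual PR; `scheduler_host` is just a placeholder for however aospy ends up exposing that option):

    from dask.distributed import Client
    import joblib

    # Connect to a running dask-scheduler and route joblib's work to it;
    # the cluster's status page then shows live task progress.
    client = Client(scheduler_host)  # e.g. 'tcp://127.0.0.1:8786'
    with joblib.parallel_backend('dask'):
        results = joblib.Parallel()(
            joblib.delayed(_compute_or_skip_on_error)(calc, compute_kwargs)
            for calc in calcs
        )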

@spencerahill
Owner Author

I have a crude version working locally

That was fast 😄

Looking forward to the PR! Happy to swap multiprocess for joblib as a dependency. And hopefully, once this intermediate solution is working, the route forward on deeper dask integration will become clearer.

@spencerahill changed the title from "Is it possible to replace multiprocess functionality using dask?" to "Create dask task graph across submitted calculations to enable further optimization" on Apr 24, 2017
@spencerahill
Owner Author

The multiprocessing removal was taken care of by #172. Re-naming this issue to reflect the outstanding (harder) part: optimizing the full task graph across calculations.
