
Create dask task graph across submitted calculations to enable further optimization #169

Open
spencerahill opened this issue Apr 3, 2017 · 6 comments

Comments

@spencerahill
Owner

Haven't thought carefully about this yet, but it would be great if we could shed the multiprocess dependency and do everything through dask.

Obviously dask.array isn't the right fit. But from the very top of their docs home page:

Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.

It looks like dask.Bag is more what we're looking for: "Dask.Bag implements operations like map, filter, fold, and groupby on collections of Python objects."

Ultimately all we use multiprocess for is a map:

        pool = multiprocess.Pool()
        return pool.map(lambda calc:
                        _compute_or_skip_on_error(calc, compute_kwargs),
                        calcs)
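
For comparison, here is a rough sketch (not aospy's actual code) of what that same map might look like with dask.bag, assuming the same `calcs`, `_compute_or_skip_on_error`, and `compute_kwargs` as above:

    import dask.bag as db

    # Build a bag from the submitted calculations, map the compute function
    # over it, and let the dask scheduler run the resulting task graph.
    bag = db.from_sequence(calcs)
    return bag.map(
        lambda calc: _compute_or_skip_on_error(calc, compute_kwargs)
    ).compute()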

Bonus points if we were able to get dask's awesome task progress visualizers working :)

@spencerkclark
Collaborator

I agree, this is ultimately where we want to be -- in particular I think it would be valuable to integrate all components of all submitted aospy calculations into a single task graph (including array operations decomposed by dask), which the dask-scheduler could then optimize appropriately. This would enable better support for larger-data calculations, because we wouldn't be limiting ourselves to doing each calculation on a single node/core.
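
Just to make the idea concrete, a hypothetical sketch of the per-calculation piece using dask.delayed (the names mirror the multiprocess snippet above and are assumptions, not a design proposal):

    import dask

    # Wrap each calculation lazily; nothing executes until dask.compute(),
    # at which point the scheduler sees every calculation in one task graph
    # and can optimize and share work across them.
    lazy = [
        dask.delayed(_compute_or_skip_on_error)(calc, compute_kwargs)
        for calc in calcs
    ]
    results = dask.compute(*lazy)

The bigger payoff would come from also keeping the array operations inside each calculation as dask graphs rather than computing them eagerly, so the scheduler can optimize across those as well.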

@darothen

darothen commented Apr 3, 2017

If I can jump in here -

I'm very interested in trying out aospy to replace my Snakemake workflow (like I mentioned at the workshop at Columbia, it's a traditional pipeline on steroids), and this is one of the features I'd be most interested in leveraging. It's hard to build pipelines that include steps which require computing with a cluster; it would be much nicer if dask were doing all the task management in the workflow.

I'm still reading code, but out of the box it seems this would be a really large project. But I think there's an intermediate step - why not target joblib as a drop-in replacement for where multiprocess is used? Once that's working, you can almost trivially run your analyses on a dask cluster - see dask's "Joblib Integration" page with its scikit-learn examples. In fact, if I understand things correctly, the way joblib is used to power grid searches in scikit-learn is pretty much identical to how you're handling multiprocessing here, so you'd get all those spiffy dask visualizers and whatnot immediately!
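
For what it's worth, the drop-in part could be as small as something like this (a sketch only, reusing the names from the multiprocess snippet above):

    from joblib import Parallel, delayed

    # Same map as the multiprocess.Pool version, expressed with joblib;
    # the backend can later be pointed at a dask.distributed cluster.
    return Parallel(n_jobs=-1)(
        delayed(_compute_or_skip_on_error)(calc, compute_kwargs)
        for calc in calcs
    )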

Not volunteering myself for this project just yet - I have to learn much more about aospy before I can bite off such a significant project. But maybe towards the summer?

@spencerahill
Owner Author

If I can jump in here

Absolutely!

why not target joblib as a drop-in replacement for where multiprocess is used? Once that's working, you can almost trivially run your analyses on a dask cluster

Thanks, this is really intriguing. I wasn't even aware of joblib. @spencerkclark has worked more on this part, so he can hopefully provide a better sense of how this would look.

Not volunteering myself for this project just yet

No worries and no rush. If/when you dig into aospy more, don't hesitate to reach out w/ any questions or comments.

@spencerkclark
Collaborator

@darothen thanks for jumping in here! It's great to have your input.

out-of-the-box it seems this would be a really large project.

I fully agree; it would take some serious work, and possibly some redesign of aospy's internals to be able to fully utilize dask. I still need to play around with dask a little more to get a sense for what the best approach here (in aospy) would be.

why not target joblib as a drop-in replacement for where multiprocess is used?

This is a really nice intermediate option: it removes the multiprocess dependency (albeit introducing a dependency on joblib) and enables integration with a dask.distributed scheduler (and its associated status page!). I have a crude version working locally, but I need to iron out some details regarding logging and how best to let the user set their scheduler options in aospy (e.g. the scheduler_host address). I'll try to get a PR up in the next few days.
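
Roughly, the distributed piece might look something like the following (a sketch, not the eventual PR; `scheduler_host` is just a placeholder for however aospy ends up exposing that option):

    from dask.distributed import Client
    import joblib

    # Connect to a running dask-scheduler and route joblib's work to it;
    # the cluster's status page then shows live task progress.
    client = Client(scheduler_host)  # e.g. 'tcp://127.0.0.1:8786'
    with joblib.parallel_backend('dask'):
        results = joblib.Parallel()(
            joblib.delayed(_compute_or_skip_on_error)(calc, compute_kwargs)
            for calc in calcs
        )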

@spencerahill
Owner Author

I have a crude version working locally

That was fast 😄

Looking forward to the PR! Happy to swap multiprocess for joblib as a dependency. And hopefully, once this intermediate solution is working, the route forward on deeper dask integration will become clearer.

@spencerahill changed the title from "Is it possible to replace multiprocess functionality using dask?" to "Create dask task graph across submitted calculations to enable further optimization" on Apr 24, 2017
@spencerahill
Owner Author

The multiprocessing removal was taken care of by #172. Re-naming this issue to reflect the outstanding (harder) part: optimizing the full task graph across calculations.
