
util.apply_parallel does not have "compute('processes')" embedded #5409

Open
alexdesiqueira opened this issue May 25, 2021 · 4 comments

@alexdesiqueira
Member

Description

Saw this question on Stack Overflow today. The answer says:

I looked at the source code, and apply_parallel relies on this Dask command
(...)
But I found that it needs .compute('processes') at the end of it to guarantee multiple cpu's. So now I'm just using Dask itself

Is there something missing on our end, or is this something we would solve with documentation?

Way to reproduce

The example from that SO question...

from numpy import random, ones, zeros_like
from skimage.util import apply_parallel

def f3(im):
    print('here')
    for _ in range(10000):
        u = random.random(100000)

    return zeros_like(im)

if __name__ == '__main__':

    im = ones((2, 4))

    f = lambda img: f3(img)
    im2 = apply_parallel(f, im, chunks=1)

... and the code that "solves" it:

import dask.array as da
im2 = da.from_array(im, chunks=2)
proc = im2.map_overlap(f, depth=0).compute(scheduler='processes')
@alexdesiqueira
Member Author

It seems that this could be solved using compute=True in apply_parallel:

compute : bool, optional
    If ``True``, compute eagerly returning a NumPy Array.
    If ``False``, compute lazily returning a Dask Array.
    If ``None`` (default), compute based on array type provided

@grlee77
Contributor

grlee77 commented May 26, 2021

Yes, if compute is False (or None when the input is a Dask array) you will need to call compute on the value returned by apply_parallel.

im2 = im2.compute(scheduler='processes')

Depending on the function being accelerated, you may also want to try scheduler='threads', as it has less overhead. If the code being run does not release the GIL, though, only one thread will run at a time. Using processes gets around limitations related to the GIL, but has the overhead of launching multiple Python processes.

If you use compute=True, apply_parallel will call compute internally. To control the scheduler used in that case, one can use dask.config.set.

@rfezzani
Member

@alexdesiqueira and @grlee77, can we consider this issue solved or do we plan any other action?

@grlee77
Contributor

grlee77 commented Oct 18, 2021

I think it can be resolved by improving the docs. @alexdesiqueira started a PR for that in #5407
