
Allow rank filters to work with nogil #5399

Merged (2 commits) on May 17, 2021

Conversation

klaussfreire (Contributor)

Description

Allow rank filters to work with nogil

For reviewers

  • Check that the PR title is short, concise, and will make sense 1 year
    later.
  • Check that new functions are imported in corresponding __init__.py.
  • Check that new features, API changes, and deprecations are mentioned in
    doc/release/release_dev.rst.

@grlee77 (Contributor) commented May 16, 2021

Thanks @klaussfreire, but I think these are already nogil. Does it seem that these are not releasing the gil? One way to try running these in parallel is via skimage.util.apply_parallel. I haven't tested whether or not that accelerates things much for these rank filters, but suspect it would provide some benefit on large enough images.

@hmaarrfk (Member) commented May 16, 2021

@grlee77 I was going through this and trying to think about how best to respond to this PR.

The _core function declaration takes a function pointer, kernel. That one is declared nogil, but _core itself is not declared nogil:

cdef void _core(void kernel(dtype_t_out*, Py_ssize_t, Py_ssize_t[::1], double,
                            dtype_t, Py_ssize_t, Py_ssize_t, double,
                            double, Py_ssize_t, Py_ssize_t) nogil,
                dtype_t[:, ::1] image,
                char[:, ::1] selem,
                char[:, ::1] mask,
                dtype_t_out[:, :, ::1] out,
                signed char shift_x, signed char shift_y,
                double p0, double p1,
                Py_ssize_t s0, Py_ssize_t s1,
                Py_ssize_t n_bins) except *:

Finally, the callers of _core are not declared nogil, and they don't call _core with:

with nogil:
    _core(....)

Therefore, I think that this PR is indeed adding multi-threaded capabilities to the rank filters.

@hmaarrfk (Member)

Personally, I prefer a code organization where low-level algorithmic functions are declared nogil and high-level functions call the low-level function inside a with nogil block.

In this case, I would prefer _core being declared nogil, and callers like _autolevel should issue the with nogil call:

https://github.com/scikit-image/scikit-image/blob/main/skimage/filters/rank/generic_cy.pyx#L432
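For illustration, a minimal sketch of that layout (simplified signatures and a placeholder body, not the actual generic_cy.pyx code; the real _core signature is the much longer one quoted above):

cdef void _core(unsigned char[:, ::1] image,
                unsigned char[:, ::1] out) nogil:
    # low-level loop working only on typed memoryviews and C values;
    # no Python/NumPy calls are allowed in here
    cdef Py_ssize_t r, c
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = image[r, c]


def _autolevel(unsigned char[:, ::1] image, unsigned char[:, ::1] out):
    # high-level wrapper: any NumPy allocations happen here, while the
    # GIL is held, and the GIL is released only around the hot loop
    with nogil:
        _core(image, out)

The point of this organization is that everything touching the Python C-API stays in the def wrapper, so the typed loop itself can run GIL-free.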

@grlee77 (Contributor) commented May 16, 2021

Yes, I just tried running a couple of the rank filter functions with apply_parallel and it does seem to use only 1 core and ends up being slower than just calling the function directly. This is in contrast with denoise_nl_means, for example, which shows large acceleration when using apply_parallel.

@grlee77 (Contributor) commented May 16, 2021

> The _core function declaration takes a function pointer, kernel. That one is declared nogil, but _core itself is not declared nogil.

Yes, I see that now. I was not looking closely enough!

@grlee77 (Contributor) commented May 16, 2021

> In this case, I would prefer _core being declared nogil, and callers like _autolevel should issue the with nogil call:

Right, but I think the issue here is that _core calls NumPy functions like np.empty and np.diff, so it can't currently be nogil.

@klaussfreire (Contributor, author)

> In this case, I would prefer _core being declared nogil, and callers like _autolevel should issue the with nogil call:

> Right, but I think the issue here is that _core calls NumPy functions like np.empty and np.diff, so it can't currently be nogil.

Exactly.

@grlee77 (Contributor) commented May 16, 2021

I compiled this branch, but using apply_parallel still doesn't seem to result in the expected multithreaded CPU usage. I don't immediately see the issue, though. @klaussfreire, were you able to observe an improvement in your application with this change?

@klaussfreire (Contributor, author) commented May 16, 2021

> I compiled this branch, but using apply_parallel still doesn't seem to result in the expected multithreaded CPU usage. I don't immediately see the issue, though. @klaussfreire, were you able to observe an improvement in your application with this change?

Yes. I have a program that uses local entropy to do HDR stretching on several images in parallel, and it does work effectively there.

Of course a single invocation isn't parallelized with this patch. Parallelization of a single invocation would require OpenMP or some other parallelization strategy inside _core, but in my use case (several invocations on several large images) it works fine.

Edit: let me gather some timings...
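For illustration, a rough sketch of that multi-invocation pattern (hypothetical image list and worker count; it assumes the local-entropy step corresponds to skimage.filters.rank.entropy). Because the kernel now releases the GIL, a plain thread pool can keep several cores busy at once:

from concurrent.futures import ThreadPoolExecutor

from skimage import data, img_as_ubyte
from skimage.filters import rank
from skimage.morphology import disk

# hypothetical stand-in for "several large images": the same test image
# reused a few times
images = [img_as_ubyte(data.camera())] * 8
selem = disk(5)

def local_entropy(img):
    # the Cython kernel releases the GIL, so these calls can run
    # concurrently on separate threads
    return rank.entropy(img, selem)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(local_entropy, images))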

@hmaarrfk (Member)

I also have a feeling that it won't show up in our tests, which use small images (and thus things are fast).

@klaussfreire (Contributor, author)

With patch:

(astro) claudiofreire@linux:~/Pictures/Astro/Cederblad122/2021-05-12> time python -m cvastrophoto combine --mode vvrgb --args lum_w=95,rgb_w=44 --luma-rops nr:tv:weight=0.0002,levels=2 stretch:hdr:steps=9 --color-rops color:wb:wb_set=qhy163m-orion5561 color:scnr nr:tv:weight=0.0003,levels=2 stretch:hdr:steps=9 -- cederblad122.vrgb.tiff cederblad122.{l.deconv,r,g,b}.tiff    

real    10m55.821s
user    72m40.140s
sys     1m41.398s

Without patch:

(astro) claudiofreire@linux:~/Pictures/Astro/Cederblad122/2021-05-12> time python -m cvastrophoto combine --mode vvrgb --args lum_w=95,rgb_w=44 --luma-rops nr:tv:weight=0.0002,levels=2 stretch:hdr:steps=9 --color-rops color:wb:wb_set=qhy163m-orion5561 color:scnr nr:tv:weight=0.0003,levels=2 stretch:hdr:steps=9 -- cederblad122.vrgb.tiff cederblad122.{l.deconv,r,g,b}.tiff

real    52m44.981s
user    58m24.521s
sys     1m47.336s

@grlee77 (Contributor) commented May 17, 2021

I tried again and did see the speedup this time, so I must not have refreshed something properly last time. I can confirm a substantial speedup for a simple example like the one below that uses apply_parallel with rank.median.

Computation time is 1.58 s when run in the normal fashion, but 324 ms when run with apply_parallel, so about a 5x speedup on the 10-core CPU I tested it on.

import dask
import numpy as np
from skimage.filters import rank
from skimage import data, img_as_float
from skimage.util import apply_parallel

# tile a small image to be able to see the benefit of multithreading
img = img_as_float(data.camera())
img = np.tile(img, (4, 8))
# use original non-tiled shape as the chunk size
chunks = data.camera().shape

selem = np.ones((5, 5), dtype=np.uint8)
depth = selem.shape[0] // 2
img_uint = (np.clip(img, 0, 1) * 255).astype(np.uint8)
extra_keywords = dict(selem=selem)
with dask.config.set(scheduler='threads'):  # or scheduler='processes'
    %timeit rank_median = apply_parallel(rank.median, img_uint, chunks=chunks, depth=depth, extra_keywords=extra_keywords, dtype=img.dtype, compute=True)
# %timeit rank.median(img_uint, **extra_keywords)

@grlee77 (Contributor) commented May 17, 2021

I just happened to be trying apply_parallel with some skimage.restoration functions today and noticed that two of those suffer from the same issue, so I will go ahead and open a similar PR for those.

@grlee77 (Contributor) left a comment

Thanks @klaussfreire, this looks good to me. I confirmed multithreading seems to be working properly now for both 2d and 3d (tested with rank.median and apply_parallel).

@jni (Member) left a comment

Thank you @klaussfreire!
