
WIP Accelerate fundamental functions with OpenCL using gputools #4215

Open
wants to merge 21 commits into main

Conversation

Dalordish

@Dalordish Dalordish commented Oct 8, 2019

I have been working with @jni to dispatch certain operations transparently to gputools if OpenCL is configured correctly on the host machine and the user opts in (see #3810 (comment)).

The only operation dispatched in this PR is the gabor filter, although many other functions could easily be accelerated the same way, especially those that use the convolve function, for which the shim is already there. (The GPU code lives in _shared; gabor is swapped over to using the GPU in filters.)

On @jni's recommendation I've vendored GPUtools (https://github.com/maweigert/gputools) because it is quite immature, and the lead author is not very responsive. However, it is mature enough to be useful for our purposes.

Motivation:

It is a recurring issue that skimage is slower than a few other libraries that offer similar functionality, partly by design, favouring instead a stable ecosystem, a convenient and Pythonic API, and good documentation. A good increase in speed could be had by dispatching common and critical functions to faster, hardware-accelerated code behind a transparent shim, thereby gaining lots of 'low-hanging fruit' speedups without changing the interface or increasing complexity significantly.

Additionally, gputools uses OpenCL, which means speedups can occur across all sorts of hardware (as opposed to CUDA, which is exclusive to NVIDIA). This means even low-power laptop computers could potentially see a big speed increase.

Other critical operations

Creating a dispatch for operations that underpin most others, like convolve and fft, would mean that a huge portion of the code could be sped up without much extra maintenance actually needing to be done. This means you get most of the benefit without having to write or maintain your own OpenCL code.

Minimally disruptive

Because of the way the PR is written, nothing should break backwards compatibility. Using any of the GPU functions is opt-in, via an environment variable checked at import time. Even once the decision to use GPU functions is made, all operations automatically fall back in cases where gputools doesn't support the same options. In future, different levels of 'eagerness' could be defined in an env var, since many of the incompatibilities involve options users may not care much about, like certain boundary-padding strategies.
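A minimal sketch of that opt-in shim (the SKIMAGE_USE_GPU variable and the fall-back behaviour come from this PR; the gputools import path and signature are assumptions, as is reading the flag as "any non-empty value"):

```python
import os

import numpy as np
import scipy.ndimage as ndi

# Opt-in flag, read once at import time as described above.
_USE_GPU = bool(os.environ.get("SKIMAGE_USE_GPU"))

def convolve(image, weights, mode="reflect", cval=0.0):
    """Dispatch to gputools when opted in; otherwise, or whenever the
    GPU path can't honour the requested options, fall back to ndi."""
    if _USE_GPU:
        try:
            from gputools import convolve as gpu_convolve  # assumed import path
            return np.asarray(gpu_convolve(image, weights))
        except (ImportError, NotImplementedError):
            pass  # gputools missing or option unsupported: use the CPU path
    return ndi.convolve(image, weights, mode=mode, cval=cval)
```

Callers keep the ndi.convolve signature, so a module like filters._gabor only needs its import swapped.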

Benchmarks

I have created a benchmark for the gabor filter in the same way that sobel is tested. Note that asv needs to be run with the --launch-method=spawn flag, or it will hang when using gputools functions (airspeed-velocity/asv#883; https://asv.readthedocs.io/en/stable/commands.html#asv-run).

On my laptop with integrated graphics (Linux: Ubuntu 18.04, CPU: Intel(R) Core(TM) i5-6200U @ 2.30GHz, num_cpu 4, RAM 7905384), using the same benchmark used for the sobel filter, GPU acceleration takes 18.5 seconds, whereas the gabor filter with ndi takes 53 seconds. This shows a clear speedup even on systems without powerful graphics cards, which means most users should see some benefit.

Usage

The only things that need to happen are for the dependencies to be installed and for the environment variable SKIMAGE_USE_GPU to be set.
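In shell terms (the variable name comes from this PR; using the value 1 is just a convention, since the PR only says the variable must be set):

```shell
# Opt in to GPU dispatch before launching Python.
export SKIMAGE_USE_GPU=1
```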

"wrap":"CLK_ADDRESS_REPEAT",
"nearest":"CLK_ADDRESS_CLAMP_TO_EDGE",
"reflect":"CLK_ADDRESS_MIRRORED_REPEAT"}
'''
Member

@Dalordish this string is totally useless, as far as I can tell. =)

Author

Yup, was an artifact from some mucking around that is irrelevant, fixed!

reikna==0.7.4
scikit-tensor-py3==0.4.1
scipy==1.3.1
six==1.12.0
Member

Probably best to make a new requirements/gpu.txt file and change the == pinnings to >=.
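A sketch of the suggested file, using the pinned packages quoted above (the real file would list every GPU-only dependency):

```
# requirements/gpu.txt
reikna>=0.7.4
scikit-tensor-py3>=0.4.1
scipy>=1.3.1
six>=1.12.0
```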

-from scipy.ndimage import convolve, binary_erosion, generate_binary_structure
+from scipy.ndimage import binary_erosion, generate_binary_structure
+from scipy.ndimage import convolve
Member

@Dalordish can you undo changes to this file?

Author

Done

@@ -491,7 +491,7 @@ def test_vertical_mask_line(grad_func):
     expected[1:-1, 4:7] = 0  # but line and neighbors masked

     result = grad_func(hgrad, mask)
-    assert_allclose(result, expected)
+    assert_allclose(result, expected, atol=1e-07, rtol=1e-08)
Member

And here.

return out

else:
convolve = ndi.convolve
Member

@scikit-image/core I want to draw your attention to this file, as well as the changes in skimage.filters._gabor.py, which only consist of changing the imported convolve version from ndimage to skimage._shared.gpu. (@Dalordish I would just make this skimage._shared, so the function is just skimage._shared.convolve.) imho it is worthwhile building this layer.

Contributor

@grlee77 grlee77 Oct 14, 2019

It could be nice to implement a context manager, as was done for the upcoming scipy.fft module, so that users could run selected code blocks on the GPU vs the CPU with code similar to the following.

with skimage.set_backend('gputools'):
    out = gabor(image, 0.1)

or

with skimage.set_backend('cpu'):
    out = gabor(image, 0.1)

scipy.fft also includes a set_global_backend function that can be used to set the backend globally. I think having something like that here would be nice as an alternative to only reading the global state once based on an environment variable.

@lagru
Member

lagru commented Oct 8, 2019

Thank you for your work on this! This looks interesting.

On @jni's recommendation I've vendored GPUtools (https://github.com/maweigert/gputools) because it is quite immature, and the lead author is not very responsive.

Could you explain what we gain from vendorizing this package and the connection to its maturity level? What would be the disadvantage of including this as an optional dependency or a git submodule?

@jni
Member

jni commented Oct 9, 2019

Could you explain what we gain from vendorizing this package and the connection to its maturity level? What would be the disadvantage of including this as an optional dependency or a git submodule?

@lagru I've found with Martin (whom I know personally) that even with nice PRs it's unclear whether you'll get something in for months on end. This is fine — I have zero expectation of free work from FOSS maintainers who might have moved on to bigger and better things! — but it means that even if we fix the things we need to fix in gputools, we can't count on the fixes making their way upstream on any specific timeframe, let alone onto PyPI. I've even offered to Martin to take over maintenance of gputools but I have not yet got a response (several months later).

Vendoring is one solution that gives us full control over the package. Thinking more about it, perhaps what we need to do is make a hard fork of it, with a name that we control on PyPI. Suggestions welcome. =)

I don't want to use git submodules, as they are not widely used and I find them hard to work with.

@jni jni changed the title Gpu gabor speedup WIP Accelerate fundamental functions with OpenCL using gputools Oct 14, 2019
@grlee77
Contributor

grlee77 commented Oct 14, 2019

I am in favor of working towards supporting other computational backends such as OpenCL on the GPU.

I know I have mentioned it in the past, but for GPU hardware CuPy is very nice for supporting functions like gabor that are built on simple numpy/scipy functions. The NumPy API there is much closer to complete than it is for PyOpenCL (which underlies gputools).

A few positive points for CuPy:

  1. It has had > 100 contributors and has a number of active maintainers (some of whom are paid to work on it).
  2. It is a dependency for the Chainer machine learning framework, so this gives an additional community of users that rely on it and contribute back to the toolkit.
  3. It has pretty extensive coverage of the NumPy API with support for a growing subset of SciPy
  4. Performance is generally pretty good.

The biggest drawback for CuPy is that it has been limited specifically to NVIDIA hardware. However, a PR implementing ROCm/HIP support was recently merged. HIP converts CUDA code to C++ code that can be compiled for either NVIDIA- or AMD-based GPUs. Not everything in CuPy will currently work on AMD hardware, but it can in principle be extended as needed. CUDA + HIP is not as flexible as OpenCL in that it is still limited to AMD or NVIDIA GPUs, while OpenCL can work on a wider range of devices, including x86 CPUs.

The biggest drawback for gputools is the fairly limited support / level of active development, as discussed above. Even the PyOpenCL and Reikna libraries it is built on top of are maintained by very few people.

As a point of comparison, here is a quick benchmark of gabor using CuPy (may require the master branch to have cupyx.scipy.ndimage.convolve available) adapted from the benchmark in this PR:

import time
import numpy as np
import skimage.filters

image = np.random.random((4000, 4000))
image[:200, :200] += 1
image[300:, 300] += 0.5
currtime = time.time()
output = skimage.filters.gabor(image, 0.1)
delta = time.time() - currtime
print('taken (CPU)', delta)

# define a version of gabor that can take cupy arrays as input
def gabor(image, frequency, theta=0, bandwidth=1, sigma_x=None,
          sigma_y=None, n_stds=3, offset=0, mode='reflect', cval=0):
    from skimage.filters import gabor_kernel
    from cupy import get_array_module

    g = gabor_kernel(frequency, theta, bandwidth, sigma_x, sigma_y, n_stds,
                     offset)
    xp = get_array_module(image)

    if xp == np:
        from scipy.ndimage import convolve
        from skimage._shared.utils import check_nD
        check_nD(image, 2)
    else:
        from cupyx.scipy.ndimage import convolve
        image = xp.asarray(image)
        if image.ndim != 2:
            raise ValueError("a 2D image is required")
        g = xp.asarray(g)
    filtered_real = convolve(image, g.real, mode=mode, cval=cval)
    filtered_imag = convolve(image, g.imag, mode=mode, cval=cval)

    return filtered_real, filtered_imag

import cupy
imageg = cupy.asarray(image)
output = gabor(imageg, 0.1)  # dry run to compile CUDA kernels

currtime = time.time()
output = gabor(imageg, 0.1)
delta = time.time() - currtime
# print(output)
print('taken (GPU)', delta)

On my system I get:

taken (CPU): 35.08 seconds
taken (gputools GPU from this branch, NVIDIA 1080 Ti): 1.78 seconds
taken (CuPy GPU, NVIDIA 1080 Ti): 1.11 seconds (on GPU) or 1.28 s (including GPU/host transfers)

So for this demo case, gputools/OpenCL is not too much slower than a CuPy/CUDA-based implementation. That is nice to see.

Note 1: It is possible to get another factor of ~two times faster using a CuPy branch with complex dtype support for cupyx.scipy.ndimage.convolve, but this is not yet available in the master branch.

Note 2: the if/else pattern in the demo above can be avoided in practice, but I wanted to make a simple standalone function for illustration here.

@grlee77
Contributor

grlee77 commented Oct 14, 2019

Re maintenance, I just had a nice conversation with @haesleinhuepf, author of CLIJ, which was inspired by gputools but goes far beyond it (currently it has ~200 kernels implemented, vs ~20 for gputools). He proposed to collaborate on a joint set of OpenCL kernels, from which we automatically generate Python, Java, Matlab, etc. bindings. (Note: on our side we would only worry about the Python ones, of course!) I think this could work quite well. @royerloic would probably also be interested in joining this initiative.

This sounds like a reasonable approach. I think having multiple maintainers contributing to the underlying GPU code is going to be key to long term maintainability.

@stefanv
Member

stefanv commented Oct 15, 2019

Once we have two different code paths (opencl, cpu) for calculating the same result, we should test those results against one another.

@stefanv
Member

stefanv commented Oct 15, 2019

@grlee77 For me, the killer in the CuPy approach is the increased complexity it will introduce into our build stack to support more platforms than just NVIDIA. And, of course, we will need to.

@grlee77
Contributor

grlee77 commented Oct 15, 2019

@grlee77 For me, the killer in the CuPy approach is the increased complexity it will introduce into our build stack to support more platforms than just NVIDIA. And, of course, we will need to.

Yeah, I don't think it is a good idea to add CuPy as a backend within skimage itself at the moment. I am thinking more along the lines of just maintaining a 3rd-party package (or even creating a cupyx.skimage module in CuPy itself) that has a small subset of skimage available and supports having the inputs/outputs be GPU arrays. I implemented a handful of skimage functions on CuPy for a research project, but it could be worth making something like that more broadly available. For things that are currently Cython-based in skimage, the conversion is much less trivial and would need substantially more time to implement.

@stefanv
Member

stefanv commented Oct 15, 2019

If people are going around making small subsets of skimage accelerated, shouldn't we think about a backend system so that we can at least a) make it visible to the user which backend is being run, should they be interested and b) support its usage through the standard API?

(For the first, you can imagine logging to a specific logger every time an skimage function is called with a declaration of the function name and backend being executed.)
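A sketch of that logging idea (the logger name, decorator, and stand-in function body are all hypothetical):

```python
import functools
import logging

# Hypothetical dedicated logger for backend dispatch events.
backend_log = logging.getLogger("skimage.backends")

def declare_backend(backend_name):
    """Wrap a function so every call logs the function name and
    which backend executed it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            backend_log.info("%s executed by backend %r",
                             func.__name__, backend_name)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@declare_backend("gputools")
def gabor(image, frequency):
    return image  # stand-in body for illustration
```

Because the decorator only emits a log record, users who don't attach a handler to the logger pay almost nothing.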

@jni
Member

jni commented Oct 15, 2019

For reference, here is the CLIJ paper, which presents some nice benchmarks on both intel integrated graphics on laptops and an nvidia gpu: https://www.biorxiv.org/content/10.1101/660704v2

@grlee77 I like your context manager + global setting proposal. Re CuPy, I'm less convinced that I want to spend any effort at all supporting it. My view is that this is a classic market failure where people are maximising their short-term interest while handing nvidia a monopoly on gpu array computing, which is undoubtedly a bad thing. I actually chatted with some Chainer/CuPy folks at SciPy Japan and they acknowledged the whole thing, but that they ultimately had to get on with their business rather than make a political stand. (And that AMD were, at the time that they approached them, not interested in supporting scientific computing, because they were raking it in with bitcoin mining rigs.) Also, as a user of light laptops, I want my code to be portable everywhere, rather than requiring nvidia hardware to even run.

IF the whole backend switching can be made super transparent from skimage's perspective (ie ~0 additional maintenance burden), then I of course would not object to a 3rd-party CuPy backend package. I'm not sure from a software engineering perspective just how feasible that is.

Finally, regarding your experimental CuPy ported functions: of course you should publish those as open source and announce on the scikit-image mailing list! One of the great things about open source is experimenting in the open and then seeing what kind of crazy stuff people come up with! =)

@haesleinhuepf
Contributor

haesleinhuepf commented Oct 15, 2019

Hi all,

thanks for inviting me to this discussion. Just to explain my rationale and motivation for bringing clij technology to the pythonverse: the primary goal of the cl-ij-ma-py project is making code snippets exchangeable between the ImageJ, Matlab and Python communities. I'm a bridge builder.

So this is ImageJ Java / Groovy / Jython:

clij.op().automaticThreshold(input, output, "Otsu");

This is Matlab:

clatlab.op().automaticThreshold(blurred, thresholded, "Otsu");

And finally python:

clijx.op.automaticThreshold(blurred, thresholded, "Otsu")

The current plan is to have this syntax in clij2, because it enables convenient auto-completion in Fiji's script editor and Matlab:

clij2.automaticThreshold(blurred, thresholded, "Otsu")

I guess you see that it's basically the same code. By making code exchangeable between the communities, we can spare efforts in documentation and maintenance. Furthermore, since the same OpenCL code always runs behind the scenes, we reduce differences in results between the communities, as discussed here. The life science community likes the idea pretty much:

[screenshot of community response]

The major imponderable at the moment is the installation routine for clijpy, as you need to install a JVM under the hood of Python. For some people this bridge may even be a no-go. In order to get a native Python OpenCL kernel executor that is compatible with clij kernels, one needs to translate this (99% finished) class to Python. It basically defines a lot of pre-compiler statements in order to translate clij's image-type-agnostic OpenCL to real OpenCL that can be executed by most Intel/AMD/NVidia GPUs produced since 2012. I think it's not too much work, but I lack Python coding and deployment skills, so I'd definitely need help on this part.

Last but not least: we thought a lot about having a context-switch-like thingy in ImageJ. We were wondering whether it makes sense for ImageJ to detect a powerful GPU and then run code on it instead of the old CPU implementations. However, we decided against it for the sake of reproducibility. Some ImageJ operations are inconsistent, e.g. between 2D and 3D, and we didn't want to replicate these inconsistencies. But we also didn't want to break old workflows. Furthermore, most GPUs don't support double precision, and thus minor differences between CPU and OpenCL implementations will always be present. So we ask users to actually translate their workflows from native ImageJ to clij. As it can spare hundreds of processing hours, nobody has really complained about the translation effort yet.

Let me know what you think. I'm happy to provide any level of detail to clij internals if you would like to learn more.

Cheers,
Robert

@jni
Copy link
Member

jni commented Oct 15, 2019

@haesleinhuepf We'd probably want a more heavyweight modification of the API for Python, and especially for scikit-image. We prefer functional-style APIs, with different functions for different methods; e.g. in scikit-image we have filters.threshold_otsu(), filters.threshold_li(), etc. Generally, it's output = function_name(input, param1=x, param2=y). When there are performance benefits to reusing an output array, we do output = function_name(input, param1=x, param2=y, out=output).

@haesleinhuepf
Contributor

If it's possible to put a clij-like layer in front, that sounds good. Again, my goal is enabling the communities to exchange code without hassle 😉

@royerloic
Copy link

I would love to see all the OpenCL kernels that @haesleinhuepf wrote accessible from Python. One approach would be to have a 'pythonic' API leveraging the already excellent PyOpenCL, and then, on top, something that bridges the syntax to the 'common' CLIJ API that Robert was outlining above. I think this deserves its own pip-installable repo. Then we can think about how to provide these functions to scikit-image, which is a whole topic on its own. In a nutshell, my suggestion is to tackle the different problems in isolation:

i) Bring CLIJ kernels to the python world (own pip installable, PyOpenCL compatible),
ii) Add a convenience/compatibility layer to have as an option the same API that CLIJ provides for other languages.
iii) Figure out how GPU powered algorithms can be integrated into scikit-image,
iv) Integrate CLIJ as possible GPU-backend in scikit-image (whatever that may mean)

Note: most people in this thread are interested in (iii), so sorry for the spamming. And (i) and (ii) are work, but pretty straightforward.

Now, my own two cents about (iii) is that you need an extensible and optional backend machinery to protect scikit-image from the rather (unfortunately) chaotic world of GPU computing. I would definitely not have GPU 'backends' installed by default, but perhaps enabled simply by installing packages, e.g. pip install scikit-image-clij.
This would give the opportunity to use a certain flavor of GPU acceleration. The truth is that there will be quite some flux for some time, so I would not exclude that several such extensions (scikit-image-pytorch, scikit-image-halide, scikit-image-cupy, ...) would need to exist concurrently. Maybe, one day, there will be one way to code for GPUs, and then this could really be integrated in the main repo, but I don't think we are there.

I love the idea of using context managers to switch 'backends'. Changing the particular imported modules might also be a good approach; CuPy uses the second approach and makes it very seamless to switch from numpy to cupy...

My own preference is OpenCL, because in practice it is the most portable form of GPU computing out there... despite recent insults to that noble open standard by Apple (who, ironically, created it in the first place).

Finally, I would just like to state again that the speed GPU processing affords is so significant that, for my work, it is simply not an option to not use it. We are talking about speedups of 2 orders of magnitude (between 50x and 100x, sometimes even more when you really know what you are doing). This is just to remind ourselves of why we should care about GPUs at all...

All very exciting in any case...

@haesleinhuepf
Contributor

I have found that for my work it is simply not an option to not use it.

Agreed! And a speedup factor of 200 means: What others do in 3 months, you do on a single day... before lunch 🤯😉

@grlee77
Contributor

grlee77 commented Oct 15, 2019

Now, my own two cents about (iii) is that you need an extensible and optional backend machinery to protect scikit-image from the rather (unfortunately) chaotic world of GPU computing. I would definitely not have GPU 'backends' installed by default, but perhaps enabled simply by installing packages, e.g. pip install scikit-image-clij.
This would give the opportunity to use a certain flavor of GPU acceleration. The truth is that there will be quite some flux for some time, so I would not exclude that several such extensions (scikit-image-pytorch, scikit-image-halide, scikit-image-cupy, ...) would need to exist concurrently. Maybe, one day, there will be one way to code for GPUs, and then this could really be integrated in the main repo, but I don't think we are there.

We implemented a way for 3rd parties to implement backends to scipy.fft by using uarray. SciPy bundles its own default backend implemented using NumPy and its own copy of the pypocketfft library.

Backends were also implemented for two 3rd party libraries: PyFFTW (see pyFFTW/pyFFTW#269, pyFFTW/pyFFTW#274) and CuPy (see cupy/cupy#2355). The 3rd party backends live within these third party libraries and there is no PyFFTW or CuPy-specific code needed in SciPy itself.

It seems like a similar approach could be applied to scikit-image. For a specific example of what defining the backend looks like, see specifically the following file defining a scipy.fft backend for pyFFTW: https://github.com/pyFFTW/pyFFTW/blob/d09015b8cfc461b640a4496afb5bc200dd689813/pyfftw/interfaces/scipy_fft.py. There is basically a decorator line added for each function to be registered with the backend.

The corresponding code on the scipy side that defines the backend and context managers for backend selection lives here.

The use of uarray adds a small amount of function call overhead (~1 microsecond), but this should not typically be a significant fraction of the run time for functions used in image processing.

@stefanv
Member

stefanv commented Oct 16, 2019

I bring you a beautiful historic artifact: PR 14!

We can look at uarray or some custom registration mechanism, but we need to define first what we need. I suggest at least the following attributes:

  1. It must be possible to get a log of executions, so that you can monitor which backends are being used.
  2. All functions must be overridable (we don't want to manually curate which functions do or do not make sense to speed up).
  3. We determine the user facing API; this is never determined by external libraries.
  4. All external implementations must conform to the scikit-image test suite. Not implementing full functionality (e.g., N-D, all values of a flag, etc.) is okay, in which case the plugin should raise NotImplementedError and the next plugin should be tried.
  5. Return images must be coercible into NumPy arrays.
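A toy sketch of point 4's fallback chain, using Python's built-in NotImplementedError (all class and function names here are invented for illustration):

```python
class GpuBackend:
    """Partial backend: only supports mode='reflect'."""
    def convolve(self, image, weights, mode="reflect"):
        if mode != "reflect":
            raise NotImplementedError(f"mode {mode!r} unsupported on GPU")
        return ("gpu", image)

class CpuBackend:
    """Reference backend: implements everything."""
    def convolve(self, image, weights, mode="reflect"):
        return ("cpu", image)

def dispatch(name, *args, plugins=(GpuBackend(), CpuBackend()), **kwargs):
    """Try each plugin in resolution order, falling through to the
    next one whenever a plugin raises NotImplementedError."""
    for plugin in plugins:
        try:
            return getattr(plugin, name)(*args, **kwargs)
        except NotImplementedError:
            continue
    raise NotImplementedError(name)

print(dispatch("convolve", "img", "k"))               # GPU plugin handles it
print(dispatch("convolve", "img", "k", mode="wrap"))  # falls back to CPU
```

How the `plugins` tuple gets populated and ordered is exactly question 2 below (resolution order).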

Questions that need to be answered:

  1. Should registration be automatic or explicit? E.g., would you ask that users use a context manager, or that they call a registration routine to install a faster interface? The benefit of having it happen automatically is that the same skimage code can execute on multiple platforms, just at different speeds. The disadvantage is that, just by looking at the code, you may not know why things are going wrong or giving different results.

  2. How do we determine the resolution order of plugins?

@jni
Member

jni commented Oct 16, 2019

@stefanv nice list!

a beautiful historic artifact

My favourite thing about that PR is that it remains open. 😂 The state of open source in a nutshell!

It must be possible to get a log of executions

This seems like a tall order... We don't do that for e.g. io, do we? I mean, yes, it's easy to produce a log, but it's not necessarily easy to do it without affecting performance.

Other than that, I very much like the principles. I'll make a side note, however: the functions that we are most interested in overriding (convolve, affine_transform, fft) are actually in SciPy, not scikit-image. That makes it a taller order to implement, and I wonder if we can override those principles for functions that we use, rather than provide... Just thinking out loud, not advocating for this.

@stefanv
Member

stefanv commented Oct 16, 2019

Internally we have a bit more flexibility, yes.

W.r.t. io, you can query which function is going to be used to read/write/display your image. Since we're only overriding three of them or so, it's an easier problem. Once you start overriding random parts of the library, having an easy way to follow execution flow becomes more important.

If we had delayed computation, inspecting would have been trivial (just look at the task graph), but I'm afraid we don't :)

@haesleinhuepf
Contributor

haesleinhuepf commented Oct 24, 2019

Hey @jni

@haesleinhuepf We'd probably want a more heavyweight modification of the API for Python, and especially for scikit-image. We prefer functional-style APIs, with different functions for different methods; e.g. in scikit-image we have filters.threshold_otsu(), filters.threshold_li(), etc. Generally, it's output = function_name(input, param1=x, param2=y). When there are performance benefits to reusing an output array, we do output = function_name(input, param1=x, param2=y, out=output).

I've been thinking about this. Keeping this well-known scikit-image API is obviously beneficial for users. However, I was wondering: what would be the image data type? Is it a numpy-like array or a pointer to GPU memory? I'm afraid only the pointer way would bring the best speedup. In clij, we push images to the GPU, keep only the pointer to the data, and process whole workflows before we pull the result back and convert it to an ImageJ- or Matlab-compatible image type. Have you been thinking about this? Is there any python-type magic I'm not aware of?

@jni
Member

jni commented Oct 25, 2019

Is there any python-type magic I'm not aware of?

Yes: the NumPy folks have been doing a lot of work to make non-NumPy arrays behave nicely with code designed for NumPy. See e.g. NEP-13, NEP-18, NEP-22/NEP-30, as well as the __array__ special method, which makes an array-like object easy to cast back to NumPy with np.asarray. So the idea would be for functions to return whatever GPU-native array (e.g. a pyopencl.array.Array) they want, but make that object NumPy-like in every way possible. We'd update the scikit-image API to return not necessarily a NumPy array, but whatever the input type was, and you could chain the methods arbitrarily without exiting the GPU. Since right now skimage will simply break with non-NumPy arrays, this isn't even a breaking change and wouldn't have to go through a deprecation cycle!

This is a little hand-wavy, but not extremely hand-wavy. I think it's feasible!
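A toy illustration of the __array__ mechanism mentioned above (FauxGPUArray is invented; a real backend would return e.g. a pyopencl.array.Array whose buffer lives on the device):

```python
import numpy as np

class FauxGPUArray:
    """Toy array-like standing in for a GPU-native array type."""
    def __init__(self, data):
        self._data = np.asarray(data)  # real code would hold a device buffer

    def __array__(self, dtype=None, copy=None):
        # np.asarray(obj) calls this, so NumPy-consuming code keeps working;
        # a real backend would do the device -> host transfer here.
        out = self._data
        return out.astype(dtype) if dtype is not None else out

a = FauxGPUArray([1.0, 2.0, 3.0])
print(np.asarray(a))        # -> [1. 2. 3.]
print(np.asarray(a).sum())  # -> 6.0
```

Until a consumer calls np.asarray (or the object's __array__), nothing forces the data off the hypothetical device, which is what lets chained GPU calls stay on the GPU.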

@royerloic
Copy link

Is there any python-type magic I'm not aware of?

Yes: the NumPy folks have been doing a lot of work to make non-NumPy arrays behave nicely with code designed for NumPy. See e.g. NEP-13, NEP-18, NEP-22/NEP-30, as well as the __array__ special method, which makes an array-like object easy to cast back to NumPy with np.asarray. So the idea would be for functions to return whatever GPU-native array (e.g. a pyopencl.array.Array) they want, but make that object NumPy-like in every way possible. We'd update the scikit-image API to return not necessarily a NumPy array, but whatever the input type was, and you could chain the methods arbitrarily without exiting the GPU. Since right now skimage will simply break with non-NumPy arrays, this isn't even a breaking change and wouldn't have to go through a deprecation cycle!

This is a little hand-wavy, but not extremely hand-wavy. I think it's feasible!

That's so cool... 😎

@stefanv
Member

stefanv commented Oct 25, 2019

Note that it will not be possible to update the scikit-image API to "return whatever the input type was". Imagine, e.g., you receive AlienArray as input, but need to perform optimized Cython operations on the underlying data. 1. You will not be able to get a pointer from AlienArray and 2. Even if we can do our operation we don't know anything about AlienArray, so we cannot guarantee that we can even construct one.

This is the reason the skimage API works the way it does: we accept anything we can understand, but we determine what gets produced as output. This is also the only way to guarantee that routines execute optimally: if we had to provide support for all output array types, we would not be able to select the most efficient code path.

@jni
Member

jni commented Oct 25, 2019

@stefanv

Imagine, e.g., you receive AlienArray as input, but need to perform optimized Cython operations on the underlying data.

The context is not for Cython operations, it's for OpenCL operations.

  1. You will not be able to get a pointer from AlienArray and

In the absolute worst case, assuming AlienArray can be coerced to a NumPy array, that's what we'll do.

  2. Even if we can do our operation we don't know anything about AlienArray, so we cannot guarantee that we can even construct one.

I'm not suggesting we construct one, I'm suggesting for a subset of arrays and a subset of functions, we dispatch appropriately. Functions with Cython internals are not in this category.

@stefanv
Member

stefanv commented Oct 25, 2019

I'm not suggesting we construct one, I'm suggesting for a subset of arrays and a subset of functions, we dispatch appropriately. Functions with Cython internals are not in this category.

What I'm trying to emphasize is that this would be a break from the way the rest of the skimage API works (and to motivate why it works that way). Having an inconsistent API, dependent on the underlying implementation, is not all that appealing.

To do what you're suggesting, you will need a way to examine input classes and know their capabilities, unless you just assume that OpenCL will know what to do with them, which I don't think is a given.

Note that the current pipeline model does not preclude what is needed here. You take an OpenCL-compatible array as input, and you produce an OpenCL array as output, if you want. You can even leave it on the GPU. That way, you can still construct fast pipelines. My nitpick was that you should never promise that you will return the input type: just that you will return something that can be interpreted as a NumPy array and that may or may not still live on the GPU.
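A minimal sketch of that output contract (the function name and the squaring op below are placeholders, not real skimage code): the routine decides the output type itself, and the only promise to the caller is that `np.asarray` on the result works.

```python
import numpy as np

# Sketch of the contract described above: accept anything we can understand,
# decide the output type ourselves, and only promise that the result is
# interpretable as a NumPy array (it may or may not still live on the GPU).
def some_filter(image):
    arr = np.asarray(image)   # worst case: pull the data off the device
    out = arr ** 2            # placeholder for the optimized code path
    # An OpenCL backend could return a device-resident array-like here
    # instead; callers should rely on np.asarray, not on type(out).
    return out

result = some_filter([1, 2, 3])
print(np.asarray(result))     # → [1 4 9]
```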

@jni
Member

jni commented Oct 25, 2019

My nitpick was that you should never promise that you will return the input type: just that you will return something that can be interpreted as a NumPy array and that may or may not still live on the GPU.

Oooooooh! Yes, sorry, my phrasing was far too restrictive. This was kind of what I meant. We are in vehement agreement. =P

My feeling is that right now, we guarantee that we return a NumPy array. My hope is that in the future, we will only guarantee it when the input is a NumPy array, otherwise, we only guarantee an array-like. Indeed, in some cases that's what we do already (see the work @GenevieveBuckley did making sure our functions worked with dask arrays), but this is undocumented.

@stefanv
Member

stefanv commented Oct 25, 2019

We are in vehement agreement. =P

😂 We do enjoy our intense discussions around vehement agreements, don't we.

@haesleinhuepf
Contributor

haesleinhuepf commented Mar 18, 2020

Hey @jni, @royerloic, @stefanv and others,

I started implementing a prototype for the python-side for CLIJ. I'm trying to set it up as community project, called clEsperanto. Loic and Juan already know about it.

I consider you Python software architecture and image processing experts. Thus, I was wondering whether one of you could guide us a bit in implementing clEsperanto in a good way, potentially compatible with scikit-image. We also have a discussion on goals and potential ways towards implementation in the image.sc forum.

Thanks for your support!

Cheers,
Robert

@FirefoxMetzger
Contributor

FirefoxMetzger commented Aug 30, 2020

This PR has gone a bit stale, but I would actually be pretty interested in easy-access GPU acceleration for common CV operations.

TL;DR: I think it would be nice not to take the "choose a backend for the entire library" approach, but to make the backend choice explicit for individual function calls, e.g., via from skimage.gpu.transform import rescale.

My thought here is that I would probably not go for the "choose a backend" philosophy that has been proposed here. For large pipelines, it makes sense to just specify the compute graph and then let the framework figure out where to place each op (this is essentially what TensorFlow does, and afaik PyTorch follows a similar philosophy). Typically, you specify which devices are available in a list-like structure and have a JIT-like tool that figures out which op goes where given those devices.

Relinquishing op placement trades control for convenience and delegates all responsibility to the library. This can be nice for large compute graphs (a compiler can be more efficient than manual optimization), but, from my personal experience with TensorFlow, this approach tends to get in the way once you start writing multi-threaded pipelines. The difficulty is that such libraries (rightfully so) assume they can use the entire device, which means you can't really use them inside your parallel sections, as the library will compete with itself. In that case, I'd expect manual/explicit op placement to be much more efficient.

Instead, I would probably advocate for manual op placement, putting the GPU-accelerated versions into skimage.gpu.<module-name>. That way it would be quite explicit what each function returns (a handle to an array-like object in GPU memory vs. a NumPy array) and what will happen when it is called.

It would also be easier to maintain, because we wouldn't have to worry about op placement but could make it the user's responsibility. Not sure if my assumption is correct, but if users worry about execution speed in their CV pipeline, they are usually competent enough to know (or learn) about different accelerators and choose the correct function for their needs.
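A rough sketch of what that explicit-placement API could look like. Note that skimage.gpu does not exist; `DeviceArray`, the `.get()` download method, and the nearest-neighbour 1-D rescale below are all illustrative assumptions (the `.get()` shape mirrors what pyopencl and CuPy device arrays do):

```python
import numpy as np

# Hypothetical skimage.gpu.transform.rescale with an explicit contract:
# it always returns a device-resident handle, never a NumPy array, so op
# placement is visible at the call site.
class DeviceArray:
    """Stand-in for an array living in GPU memory."""
    def __init__(self, host):
        self._host = np.asarray(host)

    def get(self):
        # Explicit download step, as in pyopencl.array.Array / cupy.ndarray.
        return self._host

def rescale(image, scale):
    # Placeholder nearest-neighbour 1-D rescale; a real backend would run
    # an OpenCL kernel and keep the result on the device.
    host = np.asarray(image)
    idx = (np.arange(int(host.shape[0] * scale)) / scale).astype(int)
    return DeviceArray(host[idx])

out = rescale(np.arange(4.0), 2.0)   # caller chose the "GPU" path explicitly
print(out.get())                     # → [0. 0. 1. 1. 2. 2. 3. 3.]
```

The design point is that nothing happens implicitly: importing from the (hypothetical) skimage.gpu namespace and calling .get() are the user's explicit decisions about where data lives.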
