WIP Accelerate fundamental functions with OpenCL using gputools #4215
base: main
Conversation
NOTE: GPUTOOLS doesn't seem to expose median, had to edit the package
skimage/_shared/gpu/convolve.py
"wrap":"CLK_ADDRESS_REPEAT", | ||
"nearest":"CLK_ADDRESS_CLAMP_TO_EDGE", | ||
"reflect":"CLK_ADDRESS_MIRRORED_REPEAT"} | ||
''' |
@Dalordish this string is totally useless, as far as I can tell. =)
Yup, was an artifact from some mucking around that is irrelevant, fixed!
reikna==0.7.4
scikit-tensor-py3==0.4.1
scipy==1.3.1
six==1.12.0
Probably best to make a new `requirements/gpu.txt` file and change the `==` pinnings to `>=`.
skimage/filters/edges.py
Before:
from scipy.ndimage import convolve, binary_erosion, generate_binary_structure
After:
from scipy.ndimage import binary_erosion, generate_binary_structure
from scipy.ndimage import convolve
@Dalordish can you undo changes to this file?
Done
skimage/filters/tests/test_edges.py
@@ -491,7 +491,7 @@ def test_vertical_mask_line(grad_func):
    expected[1:-1, 4:7] = 0  # but line and neighbors masked

    result = grad_func(hgrad, mask)
-   assert_allclose(result, expected)
+   assert_allclose(result, expected, atol=1e-07, rtol=1e-08)
And here.
        return out

else:
    convolve = ndi.convolve
@scikit-image/core I want to draw your attention to this file, as well as the changes in `skimage.filters._gabor.py`, which only consist of changing the imported convolve from `ndimage` to `skimage._shared.gpu`. (@Dalordish I would just make this `skimage._shared`, so the function is just `skimage._shared.convolve`.) IMHO it is worthwhile building this layer.
It could be nice to implement a context manager, as was done for the upcoming `scipy.fft` module, so that users could run selected code blocks on GPU vs CPU with code similar to the following:

```python
with skimage.set_backend('gputools'):
    out = gabor(image, 0.1)
```

or

```python
with skimage.set_backend('cpu'):
    out = gabor(image, 0.1)
```

`scipy.fft` also includes a `set_global_backend` function that can be used to set the backend globally. I think having something like that here would be nice as an alternative to only reading the global state once from an environment variable.
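A minimal sketch of how such a context manager could work, using thread-local state (all names are hypothetical — skimage has no such API today):

```python
import contextlib
import threading

# Thread-local so concurrent pipelines can select different backends.
_state = threading.local()

def get_backend():
    """Return the name of the active backend, defaulting to 'cpu'."""
    return getattr(_state, "backend", "cpu")

@contextlib.contextmanager
def set_backend(name):
    """Temporarily select a computational backend ('cpu', 'gputools', ...)."""
    previous = get_backend()
    _state.backend = name
    try:
        yield
    finally:
        _state.backend = previous

with set_backend('gputools'):
    assert get_backend() == 'gputools'  # dispatch points would check this
assert get_backend() == 'cpu'           # restored on exit
```

A `set_global_backend` equivalent would simply write a module-level default that `get_backend` falls back to.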
Thank you for your work on this! This looks interesting.
Could you explain what we gain from vendoring this package, and the connection to its maturity level? What would be the disadvantage of including this as an optional dependency or a git submodule?
@lagru I've found with Martin (whom I know personally) that even with nice PRs, it's unclear whether you'll get something in for months on end. This is fine — I have zero expectation of free work from FOSS maintainers who might have moved on to bigger and better things! — but it means that even if we fix the things we need to fix in gputools, we can't count on the fixes making their way upstream on any specific timeframe, let alone onto PyPI. I've even offered to Martin to take over maintenance of gputools, but I have not yet got a response (several months later).

Vendoring is one solution that gives us full control over the package. Thinking more about it, perhaps what we need to do is make a hard fork of it, with a name that we control on PyPI. Suggestions welcome. =)

I don't want to use git submodules, as they are not widely used and I find them hard to work with.
I am in favor of working towards supporting other computational backends, such as OpenCL on the GPU. I know I have mentioned it in the past, but for GPU hardware, CuPy is very nice for supporting functions like these. A few positive points for CuPy:
The biggest drawback for CuPy is that it has been limited specifically to NVIDIA hardware. However, recently a PR implementing ROCm/HIP support was merged. HIP converts CUDA code to C++ code that can be compiled for either NVIDIA- or AMD-based GPUs. Not everything in CuPy will currently work on AMD hardware, but it can in principle be extended as needed. CUDA + HIP is not as flexible as OpenCL in that it is still limited to either AMD or NVIDIA GPUs, while OpenCL can work on a wider range of devices, including x86 CPUs.

As a point of comparison, here is a quick benchmark of the CPU `gabor` filter against a CuPy-based version:

```python
import time

import numpy as np
import skimage.filters

image = np.random.random((4000, 4000))
image[:200, :200] += 1
image[300:, 300] += 0.5

currtime = time.time()
output = skimage.filters.gabor(image, 0.1)
delta = time.time() - currtime
print('taken (CPU)', delta)


# define a version of gabor that can take cupy arrays as input
def gabor(image, frequency, theta=0, bandwidth=1, sigma_x=None,
          sigma_y=None, n_stds=3, offset=0, mode='reflect', cval=0):
    from skimage.filters import gabor_kernel
    from cupy import get_array_module

    g = gabor_kernel(frequency, theta, bandwidth, sigma_x, sigma_y, n_stds,
                     offset)
    xp = get_array_module(image)
    if xp == np:
        from scipy.ndimage import convolve
        from skimage._shared.utils import check_nD
        check_nD(image, 2)
    else:
        from cupyx.scipy.ndimage import convolve
        image = xp.asarray(image)
        if image.ndim != 2:
            raise ValueError("a 2D image is required")
    g = xp.asarray(g)
    filtered_real = convolve(image, g.real, mode=mode, cval=cval)
    filtered_imag = convolve(image, g.imag, mode=mode, cval=cval)
    return filtered_real, filtered_imag


import cupy

imageg = cupy.asarray(image)
output = gabor(imageg, 0.1)  # dry run to compile CUDA kernels

currtime = time.time()
output = gabor(imageg, 0.1)
delta = time.time() - currtime
print('taken (GPU)', delta)
```

On my system, the GPU run is substantially faster than the CPU run for this demo case.

Note 1: It is possible to get another factor of roughly two using a CuPy branch with complex dtype support for `convolve`.

Note 2: the if/else pattern in the demo above can be avoided in practice, but I wanted to make a simple standalone function for illustration here.
This sounds like a reasonable approach. I think having multiple maintainers contributing to the underlying GPU code is going to be key to long-term maintainability.

Once we have two different code paths (OpenCL, CPU) for calculating the same result, we should test those results against one another.

@grlee77 For me, the killer in the CuPy approach is the increased complexity it will introduce into our build stack to support more platforms than just NVIDIA. And, of course, we will need to.

Yeah, I don't think it is a good idea to add CuPy as a backend within skimage itself at the moment. I am thinking more along the lines of just maintaining a 3rd-party package.

If people are going around making small subsets of skimage accelerated, shouldn't we think about a backend system, so that we can at least (a) make it visible to the user which backend is being run, should they be interested, and (b) support its usage through the standard API? (For the first, you can imagine logging to a specific logger every time a skimage function is called, with a declaration of the function name and the backend being executed.)
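The backend-visibility idea can be sketched with the standard `logging` module (the logger name, decorator, and `gabor` stand-in below are all hypothetical, not existing skimage API):

```python
import functools
import logging

# Hypothetical dedicated logger for backend dispatch events.
logger = logging.getLogger("skimage.dispatch")

def log_backend(backend):
    """Decorator that logs the function name and backend on every call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.info("%s executed on backend %r", func.__name__, backend)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@log_backend("gputools")
def gabor(image, frequency):
    return image  # stand-in body for illustration

# Capture the log records to show what an interested user would see.
messages = []

class ListHandler(logging.Handler):
    def emit(self, record):
        messages.append(record.getMessage())

logger.addHandler(ListHandler())
logger.setLevel(logging.INFO)

gabor(None, 0.1)
print(messages[0])  # gabor executed on backend 'gputools'
```

Users who don't attach a handler to the dispatch logger would see nothing, so the default behaviour stays quiet.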
For reference, here is the CLIJ paper, which presents some nice benchmarks on both Intel integrated graphics on laptops and an NVIDIA GPU: https://www.biorxiv.org/content/10.1101/660704v2

@grlee77 I like your context manager + global setting proposal.

Re CuPy, I'm less convinced that I want to spend any effort at all supporting it. My view is that this is a classic market failure, where people are maximising their short-term interest while handing NVIDIA a monopoly on GPU array computing, which is undoubtedly a bad thing. I actually chatted with some Chainer/CuPy folks at SciPy Japan, and they acknowledged the whole thing, but said that they ultimately had to get on with their business rather than make a political stand. (And that AMD were, at the time they were approached, not interested in supporting scientific computing, because they were raking it in with bitcoin mining rigs.) Also, as a user of light laptops, I want my code to be portable everywhere, rather than requiring NVIDIA hardware to even run.

If the whole backend switching can be made super transparent from skimage's perspective (i.e. ~0 additional maintenance burden), then I of course would not object to a 3rd-party CuPy backend package. I'm not sure from a software engineering perspective just how feasible that is.

Finally, regarding your experimental CuPy-ported functions: of course you should publish those as open source and announce them on the scikit-image mailing list! One of the great things about open source is experimenting in the open and then seeing what kind of crazy stuff people come up with! =)
Hi all, thanks for inviting me to this discussion. Just to explain my rationale and motivation for bringing clij technology to the pythonverse: the primary goal of the cl-ij-ma-py project is making code snippets exchangeable between the ImageJ, Matlab, and Python communities. I'm a bridge builder. So this is ImageJ Java / Groovy / Jython:
This is Matlab:
And finally, Python:
The current plan is to have this syntax in clij2, because it enables convenient auto-completion in Fiji's script editor and in Matlab:
I guess you see that it's basically the same code. By making code exchangeable between the communities, we can spare efforts in documentation and maintenance. Furthermore, as the same OpenCL code is always running behind the scenes, we reduce differences in results between the communities, as discussed here. The life science community likes the idea pretty much.

The major imponderable at the moment is the installation routine for clijpy, as you need to install a JVM under the hood of Python. For some people this bridge may even be a no-go. In order to get a native Python OpenCL kernel executor that is compatible with clij kernels, one needs to translate this (99% finished) class to Python. It basically defines a lot of pre-compiler statements in order to translate clij's image-type-agnostic OpenCL to real OpenCL that can be executed by most Intel/AMD/NVIDIA GPUs produced since 2012.

Last but not least: we thought a lot about having a context-switch-like thingy in ImageJ. We were wondering if it makes sense for ImageJ to detect a powerful GPU and then run code on it instead of the old CPU implementations. However, we decided against it for the sake of reproducibility. Some ImageJ operations are inconsistent, e.g. between 2D and 3D, and we didn't want to replicate these inconsistencies. But we also didn't want to break old workflows. Furthermore, most GPUs don't support double precision, and thus minor differences between CPU and OpenCL implementations will always be present. Thus, we ask the users to actually translate their workflows from native ImageJ to clij. As you can spare hundreds of processing hours, nobody has really complained about the translation effort yet.

Let me know what you think. I'm happy to provide any level of detail about clij internals if you would like to learn more. Cheers,
@haesleinhuepf We'd probably want a more heavyweight modification of the API for Python, and especially for scikit-image. We prefer functional-style APIs, and different functions for different methods, e.g. in scikit-image we have
If it's possible to put a clij-like layer in front, that sounds good. Again, my goal is enabling the communities to exchange code without hassle 😉
I would love to see all the OpenCL kernels that @haesleinhuepf wrote accessible from Python. One approach would be to have a 'pythonic' API leveraging the already excellent PyOpenCL, and then have on top something that would bridge the syntax to the 'common' CLIJ API that Robert was outlining above. I think this deserves its own pip-installable repo. Then, we can think about how to provide these functions to scikit-image, which is a whole topic on its own.

In a nutshell, my suggestion is to tackle the different problems in isolation: (i) bring CLIJ kernels to the Python world (own pip installable, PyOpenCL compatible), (ii) bridge that to the common CLIJ API, and (iii) work out how to provide these functions to scikit-image. Note: most people here in this thread are interested in (iii), so sorry for the spamming.

Now, my own two cents about (iii): you need an extensible and optional backend machinery to protect scikit-image from the rather (unfortunately) chaotic world of GPU computing. I would definitely not have GPU 'backends' installed by default, but perhaps enabled simply by installing packages, e.g.:

pip install scikit-image-clij

I love the idea of using context managers to switch 'backends'. My own preference is OpenCL, because in practice it is the most portable option. Finally, I would just state again that the speed that GPU processing affords is game-changing. All very exciting in any case...
Agreed! And a speedup factor of 200 means: what others do in 3 months, you do in a single day... before lunch 🤯😉
We implemented a way for 3rd parties to implement backends for `scipy.fft`. Backends were also implemented for two 3rd-party libraries: PyFFTW (see pyFFTW/pyFFTW#269, pyFFTW/pyFFTW#274) and CuPy (see cupy/cupy#2355). The 3rd-party backends live within those third-party libraries, and there is no PyFFTW- or CuPy-specific code needed in SciPy itself. It seems like a similar approach could be applied to scikit-image.

For a specific example of what defining a backend looks like, see the file in CuPy defining its `scipy.fft` backend. The corresponding code on the SciPy side that defines the backend and the context managers for backend selection lives here. The use of uarray is what makes this dispatch mechanism possible.
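For concreteness, here is what backend selection looks like with the released `scipy.fft` API (SciPy >= 1.4). The string `'scipy'` names the built-in backend; third-party backends such as CuPy's are objects implementing the uarray protocol and are passed to `set_backend` the same way:

```python
import numpy as np
import scipy.fft

# A pure complex tone: its DFT is zero everywhere except bin 1.
x = np.exp(2j * np.pi * np.arange(8) / 8)

# 'scipy' selects the default pocketfft-based backend. `only=True`
# disables fallback to other registered backends inside the block.
with scipy.fft.set_backend("scipy", only=True):
    y = scipy.fft.fft(x)

assert np.allclose(y[1], 8.0)
assert np.allclose(y[0], 0.0)
```

`scipy.fft.set_global_backend` and `scipy.fft.register_backend` provide the permanent equivalents of this scoped switch.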
I bring you a beautiful historic artifact: PR 14! We can look at
Questions that need to be answered:
@stefanv nice list!
My favourite thing about that PR is that it remains open. 😂 The state of open source in a nutshell!
This seems like a tall order... We don't do that for, e.g., io. Other than that, I very much like the principles. I'll make a side note, however, about the functions that we are most interested in overriding.
Internally we have a bit more flexibility, yes. W.r.t. io, you can query which function is going to be used to read/write/display your image. Since we're only overriding three of them or so, it's an easier problem. Once you start overriding random parts of the library, having an easy way to follow execution flow becomes more important. If we had delayed computation, inspecting would be trivial (just look at the task graph), but I'm afraid we don't :)
Hey @jni
I've been thinking about this. Keeping the well-known scikit-image API is obviously beneficial for users. However, I was wondering: what would the image data type be? Is it a numpy-like array or a pointer to GPU memory? I'm afraid only the pointer way would bring the best speedup. In clij, we push images to the GPU, keep only the pointer to the data, and process whole workflows before we pull the result back and convert it to an ImageJ- or Matlab-compatible image type. Have you been thinking about this? Is there any Python type magic I'm not aware of?
Yes: the NumPy folks have been doing a lot of work to make non-NumPy arrays behave nicely with code designed for NumPy. See e.g. NEP-13, NEP-18, NEP-22/NEP-30, as well as related efforts. This is a little hand-wavy, but not extremely hand-wavy. I think it's feasible!
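As a toy illustration of the NEP-18 (`__array_function__`) mechanism those proposals describe — the `UnitArray` class is invented for this sketch — a wrapper type can intercept NumPy functions called on it without ever materialising a plain ndarray at the call site:

```python
import numpy as np

class UnitArray:
    """Toy array wrapper: NumPy functions called on it are routed to
    __array_function__ (NEP-18, NumPy >= 1.17)."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap our wrapper, call the NumPy implementation, re-wrap,
        # so the output type matches the input type.
        unwrapped = [a.data if isinstance(a, UnitArray) else a for a in args]
        return UnitArray(func(*unwrapped, **kwargs))

a = UnitArray([1.0, 2.0, 3.0])
out = np.mean(a)  # dispatches to UnitArray.__array_function__
assert isinstance(out, UnitArray)
assert float(out.data) == 2.0
```

A GPU array library does the same thing, except the unwrap/re-wrap step keeps the data on the device.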
That's so cool... 😎 |
Note that it will not be possible to update the scikit-image API to "return whatever the input type was". Imagine, e.g., that you receive AlienArray as input but need to perform optimized Cython operations on the underlying data: (1) you will not be able to get a pointer from AlienArray, and (2) even if we can do our operation, we don't know anything about AlienArray, so we cannot guarantee that we can even construct one. This is the reason the skimage API works the way it does: we accept anything we can understand, but we determine what gets produced as output. This is also the only way to guarantee that routines execute optimally: if we had to provide support for all output array types, we would not be able to select the most efficient code path.
The context is not Cython operations, it's OpenCL operations.
In the absolute worst case, assuming AlienArray can be coerced to a NumPy array, that's what we'll do.
I'm not suggesting we construct one; I'm suggesting that for a subset of arrays and a subset of functions, we dispatch appropriately. Functions with Cython internals are not in this category.
What I'm trying to emphasize is that it would be a break from the way the rest of the skimage API works (and I'm motivating why it works that way). Having an inconsistent API, dependent on the underlying implementation, is not all that appealing. To do what you're suggesting, you will need a way to examine input classes and know their capabilities, unless you just assume that OpenCL will know what to do with them, which I don't think is a given. Note that the current pipeline model does not preclude what is needed here. You take an OpenCL-compatible array as input, and you produce an OpenCL array as output, if you want. You can even leave it on the GPU. That way, you can still construct fast pipelines. My nitpick was that you should never promise that you will return the input type: just that you will return something that can be interpreted as a NumPy array and that may or may not still live on the GPU.
Oooooooh! Yes, sorry, my phrasing was far too restrictive. This was kind of what I meant. We are in vehement agreement. =P My feeling is that right now, we guarantee that we return a NumPy array. My hope is that in the future, we will only guarantee it when the input is a NumPy array; otherwise, we only guarantee an array-like. Indeed, in some cases that's what we do already (see the work @GenevieveBuckley did making sure our functions work with dask arrays), but this is undocumented.
😂 We do enjoy our intense discussions around vehement agreements, don't we.
Hey @jni, @royerloic, @stefanv and others, I started implementing a prototype of the Python side for CLIJ. I'm trying to set it up as a community project, called clEsperanto. Loic and Juan already know about it. I consider you Python software architecture and image processing experts, so I was wondering if one of you could guide us a bit in implementing clEsperanto in a good way, potentially compatible with scikit-image. We also have a discussion on goals and potential ways towards implementation in the image.sc forum. Thanks for your support! Cheers,
This PR has gone a bit stale, but I would actually be pretty interested in easy-access GPU acceleration for common CV operations.

TL;DR: I think it would be nice not to do the "choose a backend for the entire library" approach, but to make the backend choice explicit for individual function calls.

My thought here is that I would probably not go for the "choose a backend" philosophy that has been proposed here. For large pipelines, it makes sense to just specify the compute graph and then let the framework figure out where to place each op (this is essentially what TensorFlow does, and afaik PyTorch follows a similar philosophy). Typically, you specify which devices are available in a list-like structure and have a JIT-like tool that figures out which op goes where given the available devices. Relinquishing op placement trades control for convenience and delegates all responsibility to the library. This can be nice for large compute graphs (a compiler can be more efficient than manual optimization), but, from my personal experience with TensorFlow, this approach tends to get in the way when you start writing multi-threaded pipelines. The difficulty is that such libraries (rightfully so) assume that they can use the entire device, which means that you can't really use them in your parallel sections, as the library will compete with itself. In this case, I'd expect manual/explicit op placement to be much more efficient.

Instead, I would probably advocate for manual op placement and putting the GPU-accelerated versions into a separate namespace. It will also be easier to maintain, because we don't have to worry about op placement but can make it the user's responsibility. Not sure if my assumption is correct, but if a user worries about execution speed for their CV pipeline, they are usually competent enough to know or learn about different accelerators and choose the correct function for their needs.
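The explicit-placement idea (together with the earlier point about testing the two code paths against one another) can be sketched as follows. Both function names are hypothetical, and the `median_gpu` body is a stand-in: a real version would call into an OpenCL/CUDA kernel, but we use the CPU path here so the sketch runs anywhere:

```python
import numpy as np
from scipy import ndimage as ndi

def median_cpu(image, size=3):
    """Reference CPU implementation."""
    return ndi.median_filter(image, size=size)

def median_gpu(image, size=3):
    """Hypothetical explicitly-chosen GPU variant (CPU stand-in body)."""
    return ndi.median_filter(image, size=size)

# The caller, not the library, decides where each op runs:
image = np.random.default_rng(0).random((32, 32))
cpu_out = median_cpu(image)
gpu_out = median_gpu(image)

# Two code paths computing the same result should be compared
# within floating-point tolerance.
np.testing.assert_allclose(cpu_out, gpu_out, rtol=1e-5, atol=1e-7)
```

The design choice here is that op placement is visible at every call site, at the cost of users having to edit code to move work between devices.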
I have been working with @jni to dispatch certain operations transparently to gputools if OpenCL is configured correctly on the host machine and the user opts in (see #3810 (comment)).
The only operation dispatched in this PR is the gabor filter, although many other functions, especially those that use the convolve function (for which the shim is already there), could easily be accelerated in the same way. (GPU code lives in _shared; gabor is swapped over to using the GPU in filters.)
On @jni's recommendation I've vendored gputools (https://github.com/maweigert/gputools) because it is quite immature and the lead author is not very responsive. However, it is mature enough to be useful for our purposes.
Motivation:
It is a recurring issue that skimage is slower than some other libraries that offer similar functionality; this is partly by design, favouring instead a stable ecosystem, a convenient and pythonic API, and good documentation. A good increase in speed could be had by dispatching common and critical functions to faster, hardware-accelerated code behind a transparent shim, thereby gaining lots of 'low-hanging fruit' speedups without changing the interface or increasing complexity significantly.
Additionally, gputools uses OpenCL, which means speedups can occur across all sorts of hardware (as opposed to CUDA, which is exclusively NVIDIA). This means even low-power laptop computers could potentially see a big speed increase.
Other critical operations
Creating a dispatch for operations that are used within most other operations, like convolve and fft, would mean that a huge portion of the code could be sped up without much extra maintenance. You get most of the benefit without having to write or maintain your own OpenCL code.
Minimally disruptive
Because of the way the PR is written, nothing should break backwards compatibility. Using any of the GPU functions is opt-in, via an environment variable check at import time. Even once the decision to use GPU functions is made, all operations automatically fall back in cases where gputools doesn't support the same options. In future, different levels of 'eagerness' could be defined in an env var, since many of the incompatibilities involve options users may not care much about, like certain boundary padding strategies.
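The opt-in-plus-fallback behaviour can be sketched roughly as follows. This is a simplification of the PR's shim, not its actual code: the gputools `convolve` call signature is assumed, and the supported boundary modes mirror the mapping shown in the `skimage/_shared/gpu/convolve.py` excerpt above:

```python
import os

import numpy as np
import scipy.ndimage as ndi

# Opt-in at import time via environment variable; silently stay on the
# CPU if gputools (or a working OpenCL stack) is unavailable.
if os.environ.get("SKIMAGE_USE_GPU"):
    try:
        from gputools import convolve as _gpu_convolve  # needs OpenCL
    except ImportError:
        _gpu_convolve = None
else:
    _gpu_convolve = None

def convolve(image, kernel, mode="reflect", cval=0.0):
    """Use the GPU implementation when available and when it supports
    the requested options; otherwise fall back to scipy.ndimage."""
    if _gpu_convolve is not None and mode in ("reflect", "wrap", "nearest"):
        try:
            return _gpu_convolve(image, kernel)
        except Exception:
            pass  # any GPU-side failure drops to the CPU path
    return ndi.convolve(image, kernel, mode=mode, cval=cval)
```

Callers import `convolve` from the shim and never need to know which path ran.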
Benchmarks
I have created a benchmark for the gabor filter along the same lines as the sobel benchmark. Note that asv needs to be run with the `--launch-method=spawn` flag, or it will hang when using gputools functions (airspeed-velocity/asv#883, https://asv.readthedocs.io/en/stable/commands.html#asv-run).

On my laptop with integrated graphics (Ubuntu 18.04, Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz, 4 CPUs, 7905384 kB RAM), using the same benchmark used for the sobel filter, the GPU-accelerated gabor filter takes 18.5 seconds, whereas the ndi version takes 53 seconds. This shows a clear speedup even on systems without powerful graphics cards, which means most users should see some benefit.
Usage
The only things that need to happen are for the dependencies to be installed and for the environment variable SKIMAGE_USE_GPU to be set.
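Putting it together, opting in looks something like this (assuming the dependencies, e.g. from a `requirements/gpu.txt` as suggested above, are already installed; the variable name is the one proposed in this PR):

```shell
# Opt in to the GPU code paths; the check happens once, at import time,
# so the variable must be set before skimage is imported.
export SKIMAGE_USE_GPU=1

# Sanity-check that the switch is visible to Python processes:
python -c "import os; assert os.environ.get('SKIMAGE_USE_GPU') == '1'"
```

Unsetting the variable (or never setting it) leaves every code path on the existing CPU implementations.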