Allow very large values in relabel_sequential #4612
Conversation
Co-authored-by: Juan Nunez-Iglesias <juan.nunez-iglesias@monash.edu>
Hello @VolkerH! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-05-04 08:48:01 UTC
The docs specify that only integer arrays are supported. This change was added in gh-811 without any discussion so imho safe to revert.
Initial PR mentioned other uses for ArrayMap. We have decided not to include these in this pull request, but to focus just on relabel_sequential.
@scikit-image/core this is ready for review! This code will allow a very easy fix for #1396, but since that requires adding a new API, which will need more discussion, we thought we should get this in first since it is already a major improvement. CC @uschmidt83 you might be interested in this change. =)
return out.reshape(orig_shape)

class ArrayMap:
I may have missed something, but is the definition of the ArrayMap object necessary?
@rfezzani I think the docstring does a decent job explaining things? It is a way to preserve the array indexing remapping API without creating enormous arrays.
OK, my question was not clear then, sorry. Let me reformulate: can't we use a function here instead of the class definition?
The class implements the __array__ method for backward compatibility, as relabel_sequential returns the forward and backward transformations in the form of ndarrays that are used as lookup tables. If we just returned simple functions for the forward and backward transformations, existing code that relies on arrays being returned would no longer work. Basically, we need the function to be triggered when [ ] indexing is used.
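To illustrate the idea, here is a minimal pure-Python sketch of such an array-like map. The class name and implementation here are illustrative only: the actual PR does the lookup in Cython with a C++ hashmap, whereas this sketch uses np.searchsorted and assumes in_values is sorted (as np.unique output would be).

```python
import numpy as np

class SparseArrayMap:
    """Illustrative sketch of an array-like label map.

    Stores only the (in_value -> out_value) pairs, yet keeps the
    ndarray lookup-table API: fw[labels] relabels an array, and
    np.asarray(fw) materializes the dense LUT for backward
    compatibility (potentially huge for large label values!).
    in_values must be sorted, e.g. as returned by np.unique.
    """

    def __init__(self, in_values, out_values):
        self.in_values = np.asarray(in_values)
        self.out_values = np.asarray(out_values)

    def __getitem__(self, arr):
        # Triggered by [] indexing: map each element through the
        # sparse table; np.searchsorted stands in for the hashmap.
        arr = np.asarray(arr)
        idx = np.searchsorted(self.in_values, arr)
        return self.out_values[idx].reshape(arr.shape)

    # fw(labels) behaves the same as fw[labels]
    __call__ = __getitem__

    def __array__(self, dtype=None, copy=None):
        # Dense LUT of size max(in_values) + 1, for backward
        # compatibility only: this is where memory can blow up.
        lut = np.zeros(int(self.in_values.max()) + 1,
                       dtype=self.out_values.dtype)
        lut[self.in_values] = self.out_values
        return lut if dtype is None else lut.astype(dtype)
```

For example, `SparseArrayMap([0, 1, 5], [0, 1, 2])[np.array([5, 1, 0])]` gives `array([2, 1, 0])` without ever materializing a dense LUT.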
OK, I was in fact missing something =), but I think that this approach does not address the problem of large memory allocation: it is simply postponed to the implementation of the __array__ method!
@rfezzani no, the class allows relabeling with [] using Cython. It never needs to instantiate the arrays.
I am confused here. If the __array__ method definition is required, the MemoryError pointed out in the PR description is not addressed...
You are correct: one can still trigger a MemoryError if one wants to instantiate the lookup table as a numpy array. However, instantiating the lookup table is not required to access the relabeling functionality. The same functionality is provided (without memory issues) through the __getitem__ and __call__ interfaces.
Example:

```python
labels = np.array([1, 1e16], dtype=np.int64)

# 1. call relabel_sequential
relab, fw, inv = relabel_sequential(labels)  # does not allocate tons of memory

# 2. use the returned forward transform to relabel the array again using fancy indexing:
_relab = fw[labels]  # does not allocate tons of memory ... handled by __getitem__

# 3. use the returned forward transform to relabel the array again using the call interface:
_relab = fw(labels)  # does not allocate tons of memory ... handled by __call__

# 4. turn the returned forward transform into a numpy array
fw_lut = np.array(fw)  # this will now try to allocate a huge array!
```
Use case 4 is the only one where we are still likely to get a MemoryError if we have large values in the input array. So in that sense one can still potentially hit this problem in the current code. However, this functionality is exclusively for backwards compatibility, in case anyone has code that used the returned transformations as an array (and not just for fancy indexing) for whatever reason. We provide the functionality for relabeling another array using the fw or inv transforms via methods 2 and 3.
I agree that this is not immediately obvious. To address this I have the following suggestions:

- spell out very clearly in the docstring that np.array(fw) is not recommended, or
- raise a warning in __array__ that states that this is not recommended and for backwards compatibility only, or
- remove the __array__ method altogether, which may be a breaking change but will likely not affect a lot of use cases.
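The second suggestion could look roughly like this. This is a hypothetical sketch: the class name, warning text, and dense-LUT construction are illustrative, not the final implementation.

```python
import warnings
import numpy as np

class WarnOnArrayMap:
    """Sketch of a label map whose dense conversion warns the user."""

    def __init__(self, in_values, out_values):
        self.in_values = np.asarray(in_values)
        self.out_values = np.asarray(out_values)

    def __array__(self, dtype=None, copy=None):
        # Warn before materializing the dense LUT, whose size is
        # proportional to the largest label value.
        warnings.warn(
            "Converting this map to a dense NumPy array is kept for "
            "backwards compatibility only and may allocate an array as "
            "large as the maximum label; prefer fw[labels] instead.",
            stacklevel=2,
        )
        lut = np.zeros(int(self.in_values.max()) + 1,
                       dtype=self.out_values.dtype)
        lut[self.in_values] = self.out_values
        return lut
```

With this in place, np.array(fw) still works for legacy code but makes the memory cost visible.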
required_type = np.min_scalar_type(new_max_label)
if np.dtype(required_type).itemsize > np.dtype(label_field.dtype).itemsize:
offset = int(offset)
in_vals = np.unique(label_field)
It seems that out_array is almost accessible via the return_inverse of np.unique:
```python
>>> import numpy as np
>>> from skimage.segmentation import relabel_sequential
>>>
>>> offset = 2
>>> label_field = np.array([[1, 1, 5, 5, 8, 99, 42],
...                         [1, 0, 0, 5, 99, 99, 42]])
>>> relab, fw, inv = relabel_sequential(label_field, offset=offset)
>>>
>>> in_vals, out_array = np.unique(label_field, return_inverse=True)
>>>
>>> out_array = out_array.reshape(label_field.shape) + offset - (0 in in_vals)
>>> out_array[label_field == 0] = 0
>>> relab
array([[2, 2, 3, 3, 4, 6, 5],
       [2, 0, 0, 3, 6, 6, 5]])
>>> out_array
array([[2, 2, 3, 3, 4, 6, 5],
       [2, 0, 0, 3, 6, 6, 5]])
```
These three lines are no longer needed.
@rfezzani Fascinating!!!

However, if we are building the map arrays anyway, this is actually faster than the overhead of running np.unique with return_inverse=True:
```
In [3]: %timeit np.unique(seg)
3.83 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit np.unique(seg, return_inverse=True)
7.12 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit j.relabel_sequential(seg)
6.2 ms ± 36.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: reseg, fw, inv = j.relabel_sequential(seg)

In [7]: %timeit fw[seg]
2.33 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit np.empty_like(seg)
547 ns ± 6.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Having said that, this would have been a useful refactor before this Cython code. 😅
it introduces a clear breaking change, and we are usually more careful in this case. Can't we think of a scenario with a deprecation cycle?

I finally took the time to read the entire PR and the comments. Would it be possible @VolkerH to return for now the arrays for the inverse map, unless a flag (for example

I am +1 for @emmanuelle's suggestion.
@emmanuelle my idea for this PR, which I suggested to @VolkerH, was to make a mock array-like object that would minimize, if not eliminate altogether, downstream issues with the API breakage. The problem I have with the

I just did a github code search for relabel_sequential (it includes some filename exclusions to avoid vendored copies of skimage), and it confirmed my suspicion that the vast majority of uses throw away the forward and inverse maps. In fact, I have to go to page 4 before I find one example (@uschmidt83's stardist), and in that example, our

In all, I went through 12 pages of results, and found just 4 cases where they didn't throw out the forward and inverse maps. This perhaps indicates that we should get rid of the forward and inverse maps altogether, but that is probably best reserved for the 1.0 transition, with the

Of the remaining 4 cases, one is the

Finally, on page 6, I see that @Borda in PyImSegm tries to fill our LUT. This is one case where a conversion to numpy array is indeed needed. This could actually be done by creating a

As a side note, I also found this example of @constantinpape avoiding

So, in short, the code as is would improve performance for 117/120 users without any break whatsoever. A very simple fix covers 2 of the remaining cases, and a slightly harder but still simple fix means that 120/120 surveyed uses of this function would not see a breaking change. Given all this, I think it's a pretty compelling case to not have a 4-release long deprecation cycle...
@emmanuelle @rfezzani I've tried to alleviate issues with API breakage. Currently, the new code covers 120/120 surveyed uses of relabel_sequential.
Great to see this happen! For me this actually is one of the major reasons to keep using

One more comment though: I find getting the forward and inverse map very useful. At least for me, there is often the situation where I relabel some small array (node labels used for a graph-based segmentation) and then need to apply this relabeling to some much larger array (the full pixel-wise segmentation). Is this still part of the new API, or are you deprecating the forward and backward map?
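For reference, that workflow can be sketched with plain NumPy. The arrays here are toy examples, and a dense LUT stands in for the forward map; the ArrayMap in this PR supports the same fancy indexing without the dense allocation.

```python
import numpy as np

# Relabel a small array of node labels, then apply the same mapping
# to a much larger pixel-wise segmentation.
node_labels = np.array([0, 4, 4, 7])            # small array
pixel_seg = np.array([[4, 4, 7, 0],
                      [4, 7, 7, 0]])            # big array, same label set

# Forward map: old label -> sequential new label (background 0 stays 0).
in_vals = np.unique(node_labels)                # array([0, 4, 7])
out_vals = np.arange(len(in_vals))              # array([0, 1, 2])
fw = np.zeros(in_vals.max() + 1, dtype=int)
fw[in_vals] = out_vals

relabeled_small = fw[node_labels]               # array([0, 1, 1, 2])
relabeled_big = fw[pixel_seg]                   # same relabeling on the big array
```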
@jni I'm quite impressed by your efforts to check how the community uses this function, and also happy that it could lead to improvements for the PR. Given the fact that the
Thanks for reviewing @emmanuelle and @rfezzani, and thanks @jni for burning the midnight oil to do the sleuth work which substantiates my suspicion that the forward and backward maps weren't actually used widely.

@constantinpape, there are still forward and inverse maps returned, just in the form of an ArrayMap object.
@constantinpape (:wave:!)
To clarify my comment, the forward and backward maps will never disappear. In the next release, they will be (:crossed_fingers:) the ArrayMap objects proposed in this PR, which achieve the exact same thing as the current fw/inv maps but in memory proportional to the number of distinct labels, rather than the max label. In skimage 1.0, we may switch to the syntax
So, I continued my search today and found someone slicing into the forward map and updating it with

@VolkerH yes, we are now very close to recreating NumPy arrays. 😂
After searching through 24 pages, I found just one more uncovered case: boolean indexing. This now covers 240/240 surveyed uses of relabel_sequential on GitHub.
This is really slick and well considered. Thanks to @jni for extensive work checking how this is used, and ensuring the improvements continue to serve the users.
I'm going to go ahead and merge this so we can have it for 0.17. The only CI failures appear to be the pre-release builds, with Cython being the culprit.
Wow, thanks for doing this extensive work to ensure compatibility!

Thanks everyone for reviewing and commenting, and thanks to @jni for his work on taking this from my proof of concept to something that is ready to merge.

Thanks, all, this is quite an epic piece of work. I love the simplicity of the C++ code too.
Description

This PR introduces:
- a new map_array function (described under Background), and
- a reimplementation of relabel_sequential built on it.

Background:
Storage requirements in relabel_sequential currently scale with the maximum value in the array, not with the size of the array:

The current implementation of relabel_sequential uses numpy's fancy indexing in a very clever way. As it leverages numpy's array operators it is very fast. However, the current implementation requires building a LUT as a numpy array that scales (in memory requirement) with the value of the largest label. This is a long-standing problem mentioned in #1349 (comment). Depending on the exact values in the array this can lead to MemoryError or ValueError exceptions. I have documented these failure modes in this notebook under the heading Storage Requirements.

Cython implementation of a sparse LUT using a hashmap:
To address the undesired memory scaling behaviour, a new function map_array is introduced that maps one array to another, with the LUT implemented as a hashmap. Numpy does not provide a hash-table data structure, and using a Python dict would be too slow. Instead, an unordered_map from the C++ STL is used via Cython. Initial benchmarking on my machine suggests that this is not dramatically slower than the fancy indexing (see this notebook for my initial experiments, including alternative implementations using numba, for additional background).

relabel_sequential is reimplemented using the new map_array.

Other uses: e.g. creating value-maps
I have recently played around with some visualizations for measurements returned by regionprops_table to create value maps, see here for an example. This has been suggested for skimage in #1396 and can be trivially implemented using map_array.

There is still plenty of work to do in terms of code tidying, documentation and testing.
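A toy sketch of this value-map idea in plain NumPy. The labels and "measurements" are invented for illustration, and a dense LUT stands in for what map_array would do without the dense allocation.

```python
import numpy as np

# Paint per-region measurements (e.g. from regionprops_table) onto the
# label image, so each region's pixels carry its measured value.
label_image = np.array([[1, 1, 2],
                        [0, 2, 2]])
region_labels = np.array([1, 2])      # labels present in the table
region_areas = np.array([2.0, 3.0])   # e.g. measured 'area' per label

# Dense-LUT stand-in for map_array(label_image, region_labels, region_areas)
lut = np.zeros(label_image.max() + 1, dtype=float)
lut[region_labels] = region_areas
value_map = lut[label_image]          # background (label 0) maps to 0.0
```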
Thanks to @jni for some help and encouragement with this.
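For readers following along, the core of the map_array approach can be sketched in pure Python. The function name follows the PR, but this dict-based body is a slow, illustrative stand-in for the Cython/std::unordered_map implementation, and the mapping of unlisted values to 0 is an assumption of this sketch.

```python
import numpy as np

def map_array_sketch(input_arr, input_vals, output_vals):
    """Map values of input_arr through the pairs input_vals[i] -> output_vals[i].

    Memory use scales with len(input_vals), not with input_arr.max(),
    which is the whole point of the hashmap-based LUT.
    """
    table = dict(zip(input_vals.tolist(), output_vals.tolist()))  # the "hashmap"
    out = np.empty_like(input_arr, dtype=np.asarray(output_vals).dtype)
    flat_in, flat_out = input_arr.ravel(), out.ravel()
    for i, v in enumerate(flat_in.tolist()):
        flat_out[i] = table.get(v, 0)  # values not in the table -> 0 (assumption)
    return out
```

Note that an input value as large as 10**12 costs no more memory than a small one, since only the observed values are stored.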
Checklist

- Gallery example in ./doc/examples (new features only)
- Benchmark in ./benchmarks, if your changes aren't covered by an existing benchmark

For reviewers

- Check that the PR title is short, concise, and will make sense 1 year later.
- Check that new functions are imported in corresponding __init__.py.
- Check that new features, API changes, and deprecations are mentioned in doc/release/release_dev.rst.