
Allow very large values in relabel_sequential #4612

Merged · 45 commits · May 4, 2020
Conversation

VolkerH
Contributor

@VolkerH VolkerH commented Apr 24, 2020

Description

This PR introduces:

  • an enhancement (I might even go as far as calling it a bug fix) to relabel_sequential, and
  • a generic way to remap integer arrays (one that can also remap from integers to other types)

Background:

Storage requirements in relabel_sequential currently scale with the maximum value in the array, not with the size of the array:

The current implementation of relabel_sequential uses numpy's fancy indexing in a very clever way. Because it leverages numpy's array operators, it is very fast. However, it requires building a LUT as a numpy array whose memory requirement scales with the value of the largest label. This is a long-standing problem mentioned in #1349 (comment). Depending on the exact values in the array, this can lead to MemoryError or ValueError exceptions. I have documented these failure modes in this notebook under the heading Storage Requirements.
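The scaling problem can be reproduced with a small sketch (pure NumPy, not the skimage source; the label values here are chosen for illustration):

```python
import numpy as np

# Sketch of the failure mode: a fancy-indexing LUT must span 0..max_label,
# so its memory scales with the largest label value, not the array size.
labels = np.array([1, 10**6], dtype=np.int64)  # only two labels, huge max

uniq = np.unique(labels)
lut = np.zeros(labels.max() + 1, dtype=np.int64)  # 10**6 + 1 entries!
lut[uniq] = np.arange(1, uniq.size + 1)
relabeled = lut[labels]  # fancy indexing applies the LUT in one shot

print(lut.nbytes)   # ~8 MB for a two-element input; at labels near 10**16
print(relabeled)    # the allocation fails outright -> [1 2]
```

Scaling the maximum label up to the 10**16 range makes the `np.zeros` allocation itself raise MemoryError, which is exactly the failure mode documented in the notebook.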

The Cython implementation provides a sparse LUT using a hash map

To address the undesired memory scaling behaviour, a new function map_array is introduced that maps one array to another, with the LUT implemented as a hash map. Numpy does not provide a hash-table data structure, and using a Python dict would be too slow. Instead, an unordered_map from the C++ STL is used via Cython. Initial benchmarking on my machine suggests that this is not dramatically slower than the fancy indexing (see this notebook for my initial experiments, including alternative implementations using numba, for additional background).
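The semantics of map_array can be sketched in pure NumPy (function name and the sorted-lookup trick are illustrative only; the PR itself uses a C++ unordered_map in Cython, and this sketch assumes every value in the input occurs in the key array):

```python
import numpy as np

def map_array_sketch(in_arr, in_vals, out_vals):
    """Map in_vals[i] -> out_vals[i] over in_arr (illustrative only).

    A sorted lookup stands in for the hash map here. Memory scales with
    the number of distinct labels, not with their magnitude.
    """
    in_vals = np.asarray(in_vals)
    out_vals = np.asarray(out_vals)
    order = np.argsort(in_vals)
    idx = np.searchsorted(in_vals[order], in_arr)
    return out_vals[order[idx]]

labels = np.array([[0, 1], [10**12, 1]])
out = map_array_sketch(labels,
                       np.array([0, 1, 10**12]),
                       np.array([0, 1, 2]))
print(out)  # [[0 1]
            #  [2 1]]
```

Note that no array of size 10**12 is ever allocated: the lookup structure holds just three entries.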

relabel_sequential is reimplemented using the new map_array.

Other uses: e.g. creating value-maps
I have recently played around with some visualizations for measurements returned by regionprops_table to create value maps, see here for an example. This has been suggested for skimage in #1396 and can be trivially implemented using map_array.
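A value map amounts to painting each labeled region with a per-region measurement. A naive sketch of the idea (labels and numbers invented for illustration; map_array collapses the loop into a single call):

```python
import numpy as np

# Paint each labeled region with a measured value (e.g. area) to produce
# a "value map" for visualization.
label_image = np.array([[1, 1, 0],
                        [0, 2, 2]])
region_labels = np.array([1, 2])       # labels present (0 = background)
measurements = np.array([10.0, 20.0])  # e.g. from regionprops_table

# With the new function this would be a one-liner:
#   value_map = map_array(label_image, region_labels, measurements)
value_map = np.zeros(label_image.shape)
for lab, value in zip(region_labels, measurements):
    value_map[label_image == lab] = value
print(value_map)
```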

There is still plenty of work to do in terms of code tidying, documentation and testing.

Thanks to @jni for some help and encouragement with this.

Checklist

For reviewers

  • Check that the PR title is short, concise, and will make sense 1 year
    later.
  • Check that new functions are imported in corresponding __init__.py.
  • Check that new features, API changes, and deprecations are mentioned in
    doc/release/release_dev.rst.

Volker Hilsenstein added 2 commits April 24, 2020 14:41
Co-authored-by: Juan Nunez-Iglesias <juan.nunez-iglesias@monash.edu>
@pep8speaks

pep8speaks commented Apr 24, 2020

Hello @VolkerH! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 127:43: E226 missing whitespace around arithmetic operator
Line 127:56: E226 missing whitespace around arithmetic operator
Line 130:44: E226 missing whitespace around arithmetic operator
Line 186:60: W291 trailing whitespace
Line 271:1: W293 blank line contains whitespace
Line 292:31: W504 line break after binary operator
Line 299:31: W504 line break after binary operator
Line 301:34: W504 line break after binary operator
Line 302:27: W504 line break after binary operator

Comment last updated at 2020-05-04 08:48:01 UTC

@VolkerH VolkerH changed the title WIP: Array map Allow very large values in relabel_sequential Apr 27, 2020
@VolkerH VolkerH marked this pull request as ready for review April 27, 2020 03:03
@VolkerH
Contributor Author

VolkerH commented Apr 27, 2020

Initial PR mentioned other uses for ArrayMap. We have decided to not include these in this pull request, but focus just on relabel_sequential.

@jni
Member

jni commented Apr 27, 2020

@scikit-image/core this is ready for review! This code will allow a very easy fix for #1396, but since that will require adding a new API, which will require more discussion, we thought we should get this in first since it is already a major improvement.

CC @uschmidt83 you might be interested in this change. =)

return out.reshape(orig_shape)


class ArrayMap:
Member

I may be missing something, but is the definition of the ArrayMap object necessary?

Member

@rfezzani I think the docstring does a decent job explaining things? It is a way to preserve the array indexing remapping API without creating enormous arrays.

Member

OK, my question was not clear then, sorry. Let me reformulate: can't we use a function here instead of a class definition?

Contributor Author

The class implements the __array__ method for backward compatibility as relabel_sequential returns the forward and backward transformations in the form of ndarrays that are used as lookup tables.
If we just returned simple functions for forward and backward transformations existing code that relies on arrays being returned would no longer work.
Basically we need the function to be triggered when [ ] indexing is used.
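The role of those dunder methods can be sketched with a toy class (hypothetical name, far simpler than the real ArrayMap in this PR; it assumes sorted keys that cover every queried label):

```python
import numpy as np

class ArrayMapSketch:
    """Hypothetical, heavily simplified stand-in for the PR's ArrayMap."""

    def __init__(self, in_values, out_values):
        # Assumes in_values is sorted and covers every queried label.
        self.in_values = np.asarray(in_values)
        self.out_values = np.asarray(out_values)

    def __getitem__(self, arr):
        # Sparse path: memory scales with the number of distinct labels.
        idx = np.searchsorted(self.in_values, arr)
        return self.out_values[idx]

    def __array__(self, dtype=None, copy=None):
        # Dense LUT for backward compatibility: the only path that still
        # scales with the maximum label value.
        lut = np.zeros(int(self.in_values.max()) + 1,
                       dtype=self.out_values.dtype)
        lut[self.in_values] = self.out_values
        return lut

fw = ArrayMapSketch([1, 5, 99], [1, 2, 3])
print(fw[np.array([5, 99, 1])])  # sparse lookup: [2 3 1]
print(np.array(fw).shape)        # dense LUT: (100,)
```

`fw[labels]` keeps working for existing fancy-indexing code without ever building the dense array; only an explicit `np.array(fw)` pays the old memory cost.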

Member

OK, I was in fact missing something =), but I think that this approach is not addressing the problem of large memory allocation: it is simply postponed to the implementation of the __array__ method!

Member

@rfezzani no, the class allows relabeling with [] using Cython. It never needs to instantiate the arrays.

Member

I am confused here. If the __array__ method definition is required, the MemoryError pointed out in the PR description is not addressed...

Contributor Author

You are correct, one can still trigger a MemoryError if one wants to instantiate the lookup-table as a numpy array. However, instantiating the lookup-table is not required to access the functionality of relabeling. The same functionality is provided (without memory issues) using the __getitem__ and __call__ interfaces.

Example:

labels = np.array([1, 10**16], dtype=np.int64)

# 1. call relabel_sequential
relab, fw, inv = relabel_sequential(labels)  # does not allocate tons of memory

# 2. use the returned forward transform to relabel the array again using fancy indexing:
_relab = fw[labels]  # does not allocate tons of memory ... handled by __getitem__

# 3. use the returned forward transform to relabel the array again using the call interface:
_relab = fw(labels)  # does not allocate tons of memory ... handled by __call__

# 4. turn the returned forward transform into a numpy array
fw_lut = np.array(fw)  # this will now try to allocate a huge array!

Use case 4 is the only one where we are still likely to get MemoryErrors if we have large values in the input array. So in that sense one can still potentially hit this problem in the current code. However, this functionality is exclusively for backwards compatibility, in case anyone has code that used the returned transformations as an array (and not just for fancy indexing) for whatever reason.
We provide the functionality for relabeling another array using the fw or inv transforms via methods 2 and 3.

I agree that this is not immediately obvious. To address this I have the following suggestions:

  1. spell out very clearly in the docstring that np.array(fw) is not recommended, or
  2. raise a warning in __array__ that states that this is not recommended and for backwards compatibility only, or
  3. remove the __array__ method altogether, which may be a breaking change but will likely not affect a lot of use cases.

required_type = np.min_scalar_type(new_max_label)
if np.dtype(required_type).itemsize > np.dtype(label_field.dtype).itemsize:
offset = int(offset)
in_vals = np.unique(label_field)
Member

It seems that out_array is almost accessible via the return_inverse of np.unique:

>>> import numpy as np
>>> from skimage.segmentation import relabel_sequential
>>> 
>>> offset = 2
>>> label_field = np.array([[1, 1, 5, 5, 8, 99, 42],
...                         [1, 0, 0, 5, 99, 99, 42]])
>>> relab, fw, inv = relabel_sequential(label_field, offset=offset)
>>> 
>>> in_vals, out_array = np.unique(label_field, return_inverse=True)
>>> 
>>> out_array = out_array.reshape(label_field.shape) + offset - (0 in in_vals)
>>> out_array[label_field == 0] = 0
>>> relab
array([[2, 2, 3, 3, 4, 6, 5],
       [2, 0, 0, 3, 6, 6, 5]])
>>> out_array
array([[2, 2, 3, 3, 4, 6, 5],
       [2, 0, 0, 3, 6, 6, 5]])

These three lines are no longer needed.

Member

@rfezzani Fascinating!!!

However, if we are building the map arrays anyway, this is actually faster than the overhead of running np.unique with return_inverse=True:

In [3]: %timeit np.unique(seg)                                                     
3.83 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: %timeit np.unique(seg, return_inverse=True)                                
7.12 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit j.relabel_sequential(seg)                                          
6.2 ms ± 36.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: reseg, fw, inv = j.relabel_sequential(seg)                                 
In [7]: %timeit fw[seg]                                                            
2.33 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: %timeit np.empty_like(seg)                                                 
547 ns ± 6.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Having said that, this would have been a useful refactor before this Cython code. 😅

@rfezzani
Member

rfezzani commented May 1, 2020

ArrayMap is fine for me, but, as @VolkerH mentioned

... in case anyone has code that used the returned transformations as an array (and not just for fancy indexing) for whatever reason.

it introduces a clear breaking change, and we are usually more careful in this case. Can't we think of a scenario with a deprecation cycle?

@emmanuelle
Member

I finally took the time to read the entire PR and the comments. Would it be possible, @VolkerH, to return the arrays for the inverse map for now, unless a flag (for example sparse) is True, with the flag defaulting to False for backwards-compatibility reasons? And to start a deprecation cycle to adopt sparse=True as the default in two releases? This way we can get the memory-saving behaviour by setting a flag while still abiding by our compatibility/deprecation policy. What do you think @jni @rfezzani?

@rfezzani
Member

rfezzani commented May 2, 2020

I am +1 for @emmanuelle suggestion.

@jni
Member

jni commented May 3, 2020

@emmanuelle my idea for this PR, which I suggested to @VolkerH, was to make a mock array-like object that would minimize, if not eliminate altogether, downstream issues with the API breakage. The problem I have with the sparse=True keyword is that it commits us to four versions of useless arguments — two with sparse=None -> False, and another two with sparse=None -> True but deprecated, because we don't want to have non-sparse in the future. @JDWarner has spoken out against this kind of transition in the past and I do agree with him at this point.

I just did a GitHub code search for relabel_sequential (it includes some filename exclusions to avoid vendored copies of skimage), and it confirmed my suspicion that the vast majority of uses throw away the forward and inverse maps. In fact, I had to go to page 4 before I found one example (@uschmidt83's stardist), and in that example, our __getitem__ implementation will continue to work.

In all, I went through 12 pages of results, and found just 4 cases where they didn't throw out the forward and inverse maps. This perhaps indicates that we should get rid of the forward and inverse maps altogether, but that is probably best reserved for the 1.0 transition, with the function.extra syntax.

Of the remaining 4 cases, one is the stardist case above (covered by current functionality). Another two require the addition of a __len__ method but would otherwise work (here and here).

Finally, on page 6, I see that @Borda in PyImSegm tries to fill our LUT. This is one case where a conversion to numpy array is indeed needed. This could actually be done by creating a __setitem__ method, at which point the full array is instantiated, cached, filled with the existing values, and then filled with the requested values. At any rate, it would be interesting to see what @Borda's preference would be for this PR. I think that the new ArrayMap object could simplify his code, which I believe is there as a workaround for negative labels?

As a side note, I also found this example of @constantinpape avoiding relabel_sequential because of the large array problem. It's unlikely that others avoiding our code would actually note it down so usefully. =)

So, in short, the code as is would improve performance for 117/120 users without any break whatsoever. A very simple fix covers 2 of the remaining cases, and a slightly harder but still simple fix means that 120/120 surveyed uses of this function would not see a breaking change. Given all this, I think it's a pretty compelling case to not have a 4-release long deprecation cycle...

@jni
Member

jni commented May 3, 2020

@emmanuelle @rfezzani I've tried to alleviate issues with API breakage. Currently, the new code covers 120/120 surveyed uses of relabel_sequential. I'm happy to go over another 12 pages of results tomorrow. I agree that API breakage is a concern, and this search was indeed a useful exercise that helped me find some extra corner cases! However, I also think that we can mitigate the risk and it is overall better for the vast majority of users to have improved default performance now.

@constantinpape
Contributor

constantinpape commented May 3, 2020

As a side note, I also found this example of @constantinpape avoiding relabel_sequential because of the large array problem. It's unlikely that others avoiding our code would actually note it down so usefully. =)

Great to see this happen! For me this actually is one of the major reasons to keep using vigra, which (currently) has a more efficient implementation. (As a side note, in the example you found, I can't use vigra either, because it assigns the new consecutive ids in scan order and not in the order of old labels, as is necessary in the function at hand). If possible, having only the skimage dependency is a big plus, because it is way more portable than vigra.

One more comment though: I find getting the forward and inverse map very useful. At least for me, there is often the situation where I relabel some small array (node labels used for a graph based segmentation) and then need to apply this relabeling to some much larger array (the full pixel-wise segmentation). Is this still part of the new API or are you deprecating forward and backward map?

@emmanuelle
Member

@jni I'm quite impressed by your efforts to check how the community uses this function, and also happy that it could lead to improvements for the PR. Given the fact that the ArrayMap now quacks really well like a numpy array, I'm going to approve!

Member

@emmanuelle emmanuelle left a comment

Thank you for all your work @VolkerH and @jni!

@VolkerH
Contributor Author

VolkerH commented May 4, 2020

Thanks for reviewing @emmanuelle and @rfezzani, and thanks @jni for burning the midnight oil to do the sleuth work, which substantiates my suspicion that the forward and backward maps weren't actually widely used. With __len__ and __setitem__, we are getting close to reimplementing ndarrays :-).
I have nothing to add to @jni 's comments.

@constantinpape, there are still forward and inverse maps returned, just in the form of an ArrayMap class and @jni has done most of the work that makes sure ArrayMap behaves pretty much the same as the arrays returned so far, minus the memory overhead (except and unavoidably when converting to an actual ndarray via the __array__ method).

@jni
Member

jni commented May 4, 2020

@constantinpape (:wave:!)

Is this still part of the new API or are you deprecating forward and backward map?

To clarify my comment, the forward and backward maps will never disappear. In the next release, they will be (:crossed_fingers:) the ArrayMap objects proposed in this PR, which achieve the exact same thing as the current fw/inv maps but in memory proportional to the number of distinct labels, rather than the max label. In skimage 1.0, we may switch to the syntax relabeled = relabel_sequential(labels) and relabeled, fw, inv = relabel_sequential.extra(labels), or similar. This has not been decided yet; it is just the leading one of several proposals.

@jni
Member

jni commented May 4, 2020

So, I continued my search today and found someone slicing into the forward map and updating it with +=. They are relabeling over MPI and combining the fw maps somehow 🤯. The EM segmentation crowd is very creative with this function. =D My latest update fixes that use case. I also verified that np.save() works!

@VolkerH yes, we are now very close to recreating NumPy arrays. 😂

@constantinpape
Contributor

@jni @VolkerH Thanks for clarifying. That's totally fine for my purposes.

@jni
Member

jni commented May 4, 2020

After searching through 24 pages, I found just one more uncovered case: boolean indexing. This now covers 240/240 surveyed uses of relabel_sequential on GitHub.

Contributor

@JDWarner JDWarner left a comment

This is really slick and well considered. Thanks to @jni for extensive work checking how this is used, and ensuring the improvements continue to serve the users.

@JDWarner
Contributor

JDWarner commented May 4, 2020

I'm going to go ahead and merge this so we can have it for 0.17. The only CI failures appear to be in pre-builds, with Cython the culprit.

@JDWarner JDWarner merged commit 4bff390 into scikit-image:master May 4, 2020
@rfezzani
Member

rfezzani commented May 4, 2020

Good job, thank you for all your efforts @VolkerH and @jni. I am happy to see this in 0.17 ;)

@grlee77
Contributor

grlee77 commented May 4, 2020

After searching through 24 pages, I found just one more uncovered case: boolean indexing. This now covers 240/240 surveyed uses of relabel_sequential on GitHub.

Wow, thanks for doing this extensive work to ensure compatibility!

@VolkerH
Contributor Author

VolkerH commented May 5, 2020

Thanks everyone for reviewing and commenting and thanks to @jni for his work on taking this from my proof of concept to something that is ready to merge.

@stefanv
Member

stefanv commented May 5, 2020

Thanks, all, this is quite an epic piece of work. I love the simplicity of the C++ code too.

9 participants