
Allow very large values in relabel_sequential #4612

Merged · 45 commits · May 4, 2020
Conversation

VolkerH
Contributor

@VolkerH VolkerH commented Apr 24, 2020

Description

This PR introduces:

  • an enhancement (I might even go as far as calling it a bug fix) to relabel_sequential, and
  • a generic way to remap integer arrays (one that can also remap from integers to other types)

Background:

Storage requirements in relabel_sequential currently scale with the maximum value in the array, not with the size of the array:

The current implementation of relabel_sequential uses numpy's fancy indexing in a very clever way. Because it leverages numpy's array operators, it is very fast. However, it requires building a LUT as a numpy array whose memory requirement scales with the value of the largest label. This is a long-standing problem mentioned in #1349 (comment). Depending on the exact values in the array, this can lead to MemoryError or ValueError exceptions. I have documented these failure modes in this notebook under the heading Storage Requirements.
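The scaling problem can be reproduced with a small sketch (pure NumPy, not the skimage source; the label values here are chosen for illustration):

```python
import numpy as np

# Sketch of the failure mode: a fancy-indexing LUT must span 0..max_label,
# so its memory scales with the largest label value, not the array size.
labels = np.array([1, 10**6], dtype=np.int64)  # only two labels, huge max

uniq = np.unique(labels)
lut = np.zeros(labels.max() + 1, dtype=np.int64)  # 10**6 + 1 entries!
lut[uniq] = np.arange(1, uniq.size + 1)
relabeled = lut[labels]  # fancy indexing applies the LUT in one shot

print(lut.nbytes)   # ~8 MB for a two-element input; at labels near 10**16
print(relabeled)    # the allocation fails outright -> [1 2]
```

Scaling the maximum label up to the 10**16 range makes the `np.zeros` allocation itself raise MemoryError, which is exactly the failure mode documented in the notebook.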

The Cython implementation provides a sparse LUT using a hash map

To address the undesired memory scaling behaviour, a new function map_array is introduced that maps one array to another, with the LUT implemented as a hash map. Numpy does not provide a hash-table data structure, and using a Python dict would be too slow. Instead, an unordered_map from the C++ STL is used via Cython. Initial benchmarking on my machine suggests that this is not dramatically slower than the fancy indexing (see this notebook for my initial experiments, including alternative implementations using numba, for additional background).
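The semantics of map_array can be sketched in pure NumPy (function name and the sorted-lookup trick are illustrative only; the PR itself uses a C++ unordered_map in Cython, and this sketch assumes every value in the input occurs in the key array):

```python
import numpy as np

def map_array_sketch(in_arr, in_vals, out_vals):
    """Map in_vals[i] -> out_vals[i] over in_arr (illustrative only).

    A sorted lookup stands in for the hash map here. Memory scales with
    the number of distinct labels, not with their magnitude.
    """
    in_vals = np.asarray(in_vals)
    out_vals = np.asarray(out_vals)
    order = np.argsort(in_vals)
    idx = np.searchsorted(in_vals[order], in_arr)
    return out_vals[order[idx]]

labels = np.array([[0, 1], [10**12, 1]])
out = map_array_sketch(labels,
                       np.array([0, 1, 10**12]),
                       np.array([0, 1, 2]))
print(out)  # [[0 1]
            #  [2 1]]
```

Note that no array of size 10**12 is ever allocated: the lookup structure holds just three entries.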

relabel_sequential is reimplemented using the new map_array.

Other uses: e.g. creating value-maps
I have recently played around with some visualizations for measurements returned by regionprops_table to create value maps, see here for an example. This has been suggested for skimage in #1396 and can be trivially implemented using map_array.
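A value map amounts to painting each labeled region with a per-region measurement. A naive sketch of the idea (labels and numbers invented for illustration; map_array collapses the loop into a single call):

```python
import numpy as np

# Paint each labeled region with a measured value (e.g. area) to produce
# a "value map" for visualization.
label_image = np.array([[1, 1, 0],
                        [0, 2, 2]])
region_labels = np.array([1, 2])       # labels present (0 = background)
measurements = np.array([10.0, 20.0])  # e.g. from regionprops_table

# With the new function this would be a one-liner:
#   value_map = map_array(label_image, region_labels, measurements)
value_map = np.zeros(label_image.shape)
for lab, value in zip(region_labels, measurements):
    value_map[label_image == lab] = value
print(value_map)
```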

There is still plenty of work to do in terms of code tidying, documentation and testing.

Thanks to @jni for some help and encouragement with this.

Checklist

For reviewers

  • Check that the PR title is short, concise, and will make sense 1 year
    later.
  • Check that new functions are imported in corresponding __init__.py.
  • Check that new features, API changes, and deprecations are mentioned in
    doc/release/release_dev.rst.

Volker Hilsenstein added 2 commits April 24, 2020 14:41
Co-authored-by: Juan Nunez-Iglesias <juan.nunez-iglesias@monash.edu>
@pep8speaks

pep8speaks commented Apr 24, 2020

Hello @VolkerH! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 127:43: E226 missing whitespace around arithmetic operator
Line 127:56: E226 missing whitespace around arithmetic operator
Line 130:44: E226 missing whitespace around arithmetic operator
Line 186:60: W291 trailing whitespace
Line 271:1: W293 blank line contains whitespace
Line 292:31: W504 line break after binary operator
Line 299:31: W504 line break after binary operator
Line 301:34: W504 line break after binary operator
Line 302:27: W504 line break after binary operator

Comment last updated at 2020-05-04 08:48:01 UTC

@VolkerH VolkerH changed the title WIP: Array map Allow very large values in relabel_sequential Apr 27, 2020
@VolkerH VolkerH marked this pull request as ready for review April 27, 2020 03:03
@VolkerH
Contributor Author

VolkerH commented Apr 27, 2020

Initial PR mentioned other uses for ArrayMap. We have decided to not include these in this pull request, but focus just on relabel_sequential.

@jni
Member

jni commented Apr 27, 2020

@scikit-image/core this is ready for review! This code will allow a very easy fix for #1396, but since that will require adding a new API, which will require more discussion, we thought we should get this in first since it is already a major improvement.

CC @uschmidt83 you might be interested in this change. =)

return out.reshape(orig_shape)


class ArrayMap:
Member

I may be missing something, but is the definition of the ArrayMap object necessary?

Member

@rfezzani I think the docstring does a decent job explaining things? It is a way to preserve the array indexing remapping API without creating enormous arrays.

Member

OK, my question was not clear then, sorry. Let me reformulate: can't we use a function here instead of a class definition?

Contributor Author

The class implements the __array__ method for backward compatibility as relabel_sequential returns the forward and backward transformations in the form of ndarrays that are used as lookup tables.
If we just returned simple functions for forward and backward transformations existing code that relies on arrays being returned would no longer work.
Basically we need the function to be triggered when [ ] indexing is used.
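The role of those dunder methods can be sketched with a toy class (hypothetical name, far simpler than the real ArrayMap in this PR; it assumes sorted keys that cover every queried label):

```python
import numpy as np

class ArrayMapSketch:
    """Hypothetical, heavily simplified stand-in for the PR's ArrayMap."""

    def __init__(self, in_values, out_values):
        # Assumes in_values is sorted and covers every queried label.
        self.in_values = np.asarray(in_values)
        self.out_values = np.asarray(out_values)

    def __getitem__(self, arr):
        # Sparse path: memory scales with the number of distinct labels.
        idx = np.searchsorted(self.in_values, arr)
        return self.out_values[idx]

    def __array__(self, dtype=None, copy=None):
        # Dense LUT for backward compatibility: the only path that still
        # scales with the maximum label value.
        lut = np.zeros(int(self.in_values.max()) + 1,
                       dtype=self.out_values.dtype)
        lut[self.in_values] = self.out_values
        return lut

fw = ArrayMapSketch([1, 5, 99], [1, 2, 3])
print(fw[np.array([5, 99, 1])])  # sparse lookup: [2 3 1]
print(np.array(fw).shape)        # dense LUT: (100,)
```

`fw[labels]` keeps working for existing fancy-indexing code without ever building the dense array; only an explicit `np.array(fw)` pays the old memory cost.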

Member

OK, I was in fact missing something =), but I think that this approach is not addressing the problem of large memory allocation: it is simply postponed to the implementation of the __array__ method!

Member

@rfezzani no, the class allows relabeling with [] using Cython. It never needs to instantiate the arrays.

Member

I am confused here. If the __array__ method definition is required, the MemoryError pointed out in the PR description is not addressed...

Contributor Author

You are correct, one can still trigger a MemoryError if one wants to instantiate the lookup-table as a numpy array. However, instantiating the lookup-table is not required to access the functionality of relabeling. The same functionality is provided (without memory issues) using the __getitem__ and __call__ interfaces.

Example:

labels = np.array([1, 10**16], dtype=np.int64)

# 1. call relabel_sequential
relab, fw, inv = relabel_sequential(labels)  # does not allocate tons of memory

# 2. use the returned forward transform to relabel the array again using fancy indexing:
_relab = fw[labels]  # does not allocate tons of memory ... handled by __getitem__

# 3. use the returned forward transform to relabel the array again using the call interface:
_relab = fw(labels)  # does not allocate tons of memory ... handled by __call__

# 4. turn the returned forward transform into a numpy array
fw_lut = np.array(fw)  # this will now try to allocate a huge array!

Use case 4 is the only one where we are still likely to get MemoryErrors if we have large values in the input array. So in that sense one can still potentially hit this problem in the current code. However, this functionality is exclusively for backwards compatibility, in case anyone has code that used the returned transformations as an array (and not just for fancy indexing) for whatever reason.
We provide the functionality for relabeling another array using the fw or inv transforms via methods 2 and 3.

I agree that this is not immediately obvious. To address this I have the following suggestions:

  1. spell out very clearly in the docstring that np.array(fw) is not recommended, or
  2. raise a warning in __array__ that states that this is not recommended and for backwards compatibility only, or
  3. remove the __array__ method altogether, which may be a breaking change but will likely not affect a lot of use cases.

required_type = np.min_scalar_type(new_max_label)
if np.dtype(required_type).itemsize > np.dtype(label_field.dtype).itemsize:
offset = int(offset)
in_vals = np.unique(label_field)
Member

It seems that out_array is almost accessible via the return_inverse of np.unique:

>>> import numpy as np
>>> from skimage.segmentation import relabel_sequential
>>> 
>>> offset = 2
>>> label_field = np.array([[1, 1, 5, 5, 8, 99, 42],
...                         [1, 0, 0, 5, 99, 99, 42]])
>>> relab, fw, inv = relabel_sequential(label_field, offset=offset)
>>> 
>>> in_vals, out_array = np.unique(label_field, return_inverse=True)
>>> 
>>> out_array = out_array.reshape(label_field.shape) + offset - (0 in in_vals)
>>> out_array[label_field == 0] = 0
>>> relab
array([[2, 2, 3, 3, 4, 6, 5],
       [2, 0, 0, 3, 6, 6, 5]])
>>> out_array
array([[2, 2, 3, 3, 4, 6, 5],
       [2, 0, 0, 3, 6, 6, 5]])

These three lines are no longer needed.

Member

@rfezzani Fascinating!!!

However, if we are building the map arrays anyway, this is actually faster than the overhead of running np.unique with return_inverse=True:

In [3]: %timeit np.unique(seg)                                                     
3.83 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: %timeit np.unique(seg, return_inverse=True)                                
7.12 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit j.relabel_sequential(seg)                                          
6.2 ms ± 36.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: reseg, fw, inv = j.relabel_sequential(seg)                                 
In [7]: %timeit fw[seg]                                                            
2.33 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: %timeit np.empty_like(seg)                                                 
547 ns ± 6.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Having said that, this would have been a useful refactor before this Cython code. 😅

@rfezzani
Member

rfezzani commented May 1, 2020

ArrayMap is fine for me, but, as @VolkerH mentioned

... in case anyone has code that used the returned transformations as an array (and not just for fancy indexing) for whatever reason.

it introduces a clear breaking change, and we are usually more careful in this case. Can't we think of a scenario with a deprecation cycle?

@emmanuelle
Member

I finally took the time to read the entire PR and the comments. Would it be possible, @VolkerH, to return the arrays for the inverse map for now, unless a flag (for example sparse) is True, with the flag defaulting to False for backwards-compatibility reasons? And to start a deprecation cycle to adopt sparse=True as the default in two releases? This way we can get the memory-saving behaviour by setting a flag while still abiding by our compatibility/deprecation policy. What do you think @jni @rfezzani?

@rfezzani
Member

rfezzani commented May 2, 2020

I am +1 for @emmanuelle suggestion.

@jni
Member

jni commented May 3, 2020

@emmanuelle my idea for this PR, which I suggested to @VolkerH, was to make a mock array-like object that would minimize, if not eliminate altogether, downstream issues with the API breakage. The problem I have with the sparse=True keyword is that it commits us to four versions of useless arguments — two with sparse=None -> False, and another two with sparse=None -> True but deprecated, because we don't want to have non-sparse in the future. @JDWarner has spoken out against this kind of transition in the past and I do agree with him at this point.

I just did a GitHub code search for relabel_sequential (it includes some filename exclusions to avoid vendored copies of skimage), and it confirmed my suspicion that the vast majority of uses throw away the forward and inverse maps. In fact, I had to go to page 4 before I found one example (@uschmidt83's stardist), and in that example, our __getitem__ implementation will continue to work.

In all, I went through 12 pages of results, and found just 4 cases where they didn't throw out the forward and inverse maps. This perhaps indicates that we should get rid of the forward and inverse maps altogether, but that is probably best reserved for the 1.0 transition, with the function.extra syntax.

Of the remaining 4 cases, one is the stardist case above (covered by current functionality). Another two require the addition of a __len__ method but would otherwise work (here and here).

Finally, on page 6, I see that @Borda in PyImSegm tries to fill our LUT. This is one case where a conversion to numpy array is indeed needed. This could actually be done by creating a __setitem__ method, at which point the full array is instantiated, cached, filled with the existing values, and then filled with the requested values. At any rate, it would be interesting to see what @Borda's preference would be for this PR. I think that the new ArrayMap object could simplify his code, which I believe is there as a workaround for negative labels?

As a side note, I also found this example of @constantinpape avoiding relabel_sequential because of the large array problem. It's unlikely that others avoiding our code would actually note it down so usefully. =)

So, in short, the code as is would improve performance for 117/120 users without any break whatsoever. A very simple fix covers 2 of the remaining cases, and a slightly harder but still simple fix means that 120/120 surveyed uses of this function would not see a breaking change. Given all this, I think it's a pretty compelling case to not have a 4-release long deprecation cycle...

@jni
Member

jni commented May 3, 2020

@emmanuelle @rfezzani I've tried to alleviate issues with API breakage. Currently, the new code covers 120/120 surveyed uses of relabel_sequential. I'm happy to go over another 12 pages of results tomorrow. I agree that API breakage is a concern, and this search was indeed a useful exercise that helped me find some extra corner cases! However, I also think that we can mitigate the risk and it is overall better for the vast majority of users to have improved default performance now.

@constantinpape
Contributor

constantinpape commented May 3, 2020

As a side note, I also found this example of @constantinpape avoiding relabel_sequential because of the large array problem. It's unlikely that others avoiding our code would actually note it down so usefully. =)

Great to see this happen! For me this actually is one of the major reasons to keep using vigra, which (currently) has a more efficient implementation. (As a side note, in the example you found, I can't use vigra either, because it assigns the new consecutive ids in scan order and not in the order of old labels, as is necessary in the function at hand). If possible, having only the skimage dependency is a big plus, because it is way more portable than vigra.

One more comment though: I find getting the forward and inverse map very useful. At least for me, there is often the situation where I relabel some small array (node labels used for a graph based segmentation) and then need to apply this relabeling to some much larger array (the full pixel-wise segmentation). Is this still part of the new API or are you deprecating forward and backward map?

@emmanuelle
Member

@jni I'm quite impressed by your efforts to check how the community uses this function, and also happy that it could lead to improvements for the PR. Given the fact that the ArrayMap now quacks really well like a numpy array, I'm going to approve!

Member

@emmanuelle emmanuelle left a comment

Thank you for all your work @VolkerH and @jni!

@VolkerH
Contributor Author

VolkerH commented May 4, 2020

Thanks for reviewing @emmanuelle and @rfezzani, and thanks @jni for burning the midnight oil to do the sleuth work, which substantiates my suspicion that the forward and backward maps weren't actually widely used. With __len__ and __setitem__, we are getting close to reimplementing ndarrays :-).
I have nothing to add to @jni 's comments.

@constantinpape, there are still forward and inverse maps returned, just in the form of an ArrayMap class and @jni has done most of the work that makes sure ArrayMap behaves pretty much the same as the arrays returned so far, minus the memory overhead (except and unavoidably when converting to an actual ndarray via the __array__ method).

@jni
Member

jni commented May 4, 2020

@constantinpape (:wave:!)

Is this still part of the new API or are you deprecating forward and backward map?

To clarify my comment, the forward and backward maps will never disappear. In the next release, they will be (:crossed_fingers:) the ArrayMap objects proposed in this PR, which achieve the exact same thing as the current fw/inv maps but in memory proportional to the number of distinct labels, rather than the max label. In skimage 1.0, we may switch to the syntax relabeled = relabel_sequential(labels) and relabeled, fw, inv = relabel_sequential.extra(labels), or similar. This has not been decided yet; it is just the leading one of several proposals.

@jni
Member

jni commented May 4, 2020

So, I continued my search today and found someone slicing into the forward map and updating it with +=. They are relabeling over MPI and combining the fw maps somehow 🤯. The EM segmentation crowd is very creative with this function. =D My latest update fixes that use case. I also verified that np.save() works!

@VolkerH yes, we are now very close to recreating NumPy arrays. 😂

@constantinpape
Contributor

@jni @VolkerH Thanks for clarifying. That's totally fine for my purposes.

@jni
Member

jni commented May 4, 2020

After searching through 24 pages, I found just one more uncovered case: boolean indexing. This now covers 240/240 surveyed uses of relabel_sequential on GitHub.

Contributor

@JDWarner JDWarner left a comment

This is really slick and well considered. Thanks to @jni for extensive work checking how this is used, and ensuring the improvements continue to serve the users.

@JDWarner
Contributor

JDWarner commented May 4, 2020

I'm going to go ahead and merge this so we can have it for 0.17. The only CI failures appear to be in pre-builds, with Cython the culprit.

@JDWarner JDWarner merged commit 4bff390 into scikit-image:master May 4, 2020
@rfezzani
Member

rfezzani commented May 4, 2020

Good job, thank you for all your efforts @VolkerH and @jni. I am happy to see this in 0.17 ;)

@grlee77
Contributor

grlee77 commented May 4, 2020

After searching through 24 pages, I found just one more uncovered case: boolean indexing. This now covers 240/240 surveyed uses of relabel_sequential on GitHub.

Wow, thanks for doing this extensive work to ensure compatibility!

@VolkerH
Contributor Author

VolkerH commented May 5, 2020

Thanks everyone for reviewing and commenting and thanks to @jni for his work on taking this from my proof of concept to something that is ready to merge.

@stefanv
Member

stefanv commented May 5, 2020

Thanks, all, this is quite an epic piece of work. I love the simplicity of the C++ code too.

9 participants