Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid uneccesary copies in skimage.morphology.label #2701

Merged
merged 2 commits into from
Jul 27, 2017

Conversation

grlee77
Copy link
Contributor

@grlee77 grlee77 commented Jul 12, 2017

Description

A user on the mailing list brought up an issue related to high memory usage in skimage.morphology.label
https://mail.python.org/pipermail/scikit-image/2017-July/005313.html

This PR makes minor tweaks to label_cython to avoid some unnecessary extra copies of the data.

Specifically, the following line will make three separate copies:

data = np.copy(input_corrected.flatten().astype(DTYPE))

(flatten returns a copy, astype returns a copy, and then np.copy will make another copy!)

I also renamed the variable input to input_ to avoid confusion with the built-in function of the same name. I don't think this should be a problem as label_cython is not exported as part of the public API.

For reviewers

(Don't remove the checklist below.)

  • Check that the PR title is short, concise, and will make sense 1 year
    later.
  • Check that new functions are imported in corresponding __init__.py.
  • Check that new features, API changes, and deprecations are mentioned in
    doc/release/release_dev.rst.

@grlee77
Copy link
Contributor Author

grlee77 commented Jul 13, 2017

As each of the copies above is replacing each other, I don't expect this PR improves memory usage much if at all. It should improve performance a bit.

I rescaled the cameraman image up to size 4096x4096 and ran skimage.measure.label on it. Prior to this PR it took 773 ms and afterwards it takes 626 ms (as reported by %timeit).

I think the main reason for the high memory usage reported in that mailing list thread is that the labels are stored in int64 dtype. This means that the labels are 8 times the memory usage of a uint8 input. I don't think there is much way around that without rewriting the routines to allow the label array to mach the integer dtype of the input (perhaps via fused types?).

I don't intend to pursue that in this PR, but it is probably worth still merging it for the minor performance boost.

@grlee77
Copy link
Contributor Author

grlee77 commented Jul 13, 2017

There is some sort of PIL-related failure in the doctests that don't seem related to this PR:
https://travis-ci.org/scikit-image/scikit-image/jobs/253015063#L3314

@jni
Copy link
Member

jni commented Jul 13, 2017

Note that the dtype of the labels in no way depends on the dtype of the input, but rather on the number of connected components of the input. This is bounded by input.size // 2, which is not much of a bound, especially if you are having memory issues in the first place. =P

@grlee77
Copy link
Contributor Author

grlee77 commented Jul 13, 2017

Note that the dtype of the labels in no way depends on the dtype of the input, but rather on the number of connected components of the input.

right, it is not related to input dtype. I misstated that above. I think 16 bit integers are probably sufficient for the majority of cases, but I guess the question would be whether to use a conservative value like input.size/2 to auto-select or expand the API to allow the user to select a dtype manually if desired.

@emmanuelle
Copy link
Member

I would be in favor of having a dtype keyword argument to impose the type, with some warning in the case of overflow. Another trick might be to check the number of connected components at the end of the function, and down-cast the type if int64 was not needed: it will not improve the memory usage during the execution of the function, but it's useful if the user is manipulating the image of labels later on.

@stefanv
Copy link
Member

stefanv commented Jul 25, 2017

This PR seems good to merge, and then we can start another issue for Emma's suggestion above (I think it's a good one).

@@ -343,49 +343,52 @@ def undo_reshape_array(arr, swaps):
return reshaped


def label_cython(input, neighbors=None, background=None, return_num=False,
def label_cython(input_, neighbors=None, background=None, return_num=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why rename input to input_?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input is a reserved Python keyword. I endorse this renaming.

@soupault
Copy link
Member

@grlee77 could you, please, rebase onto master so to check if the CI is green?

@soupault
Copy link
Member

Thanks @grlee77 !

@soupault soupault merged commit 2926ca7 into scikit-image:master Jul 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants