SKIP: The way `skimage` processes and stores metadata #2605
So far the typical pipeline of working with images in
In the past months we had a lot of discussions and issues connected to the way we handle image metadata. To be concrete (considering the steps above):
I hope that the above convinces you that metainformation is really important, and that most of the functions can greatly benefit from using it. Of course, there are a lot of metainfo entries, and we are not obliged to take care of all of them. What we should have, in my opinion, is a framework for handling the metadata, and a way to easily extend this support.
What we have now: See the snippet above: some of the functions have
What we could have: metadata passed to all the relevant functions either (a) via a number of keywords (multichannel, colorspace, names of the channels (for multispectral data), isotropic/anisotropic data), or (b) as a dictionary (a dictionary can store all of the fields plus the processing history).
Approach (a) is quite dirty and hard to extend (a lot of deprecation pain), but easy.
Approach (b) is tidy and easy to extend, but somewhat complicates usage in a mixed environment (alongside other libraries). It allows storing and passing through the history of image processing, and passing through metainfo that is not handled.
@scikit-image/core I'd like to ask your opinions on this topic. I'm personally in favour of approach (b) despite the fact that it implies some breaking changes. In any case, the goal here is to work out a common solution, and make the library more convenient for all users.
A third option is what
@soupault what a great, great issue. Thanks for compiling all the above links.
I favour option (b). I think it should be relatively straightforward to incorporate even as regards deprecations, because we can use the convention that keyword arguments override the meta dictionary, and meta can be the last kwarg to any function. (This is a common way to deal e.g. with config files, which are read in but individual values can be overwritten with command-line args.)
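The override convention described above can be sketched as follows. This is only an illustration of the "defaults, then meta, then keywords" precedence; `resolve_options` and the keys used are hypothetical names, not part of scikit-image.

```python
def resolve_options(defaults, meta=None, **overrides):
    """Merge option sources; later sources win: defaults < meta < keywords.

    Hypothetical helper for illustration only, not scikit-image API.
    """
    options = dict(defaults)
    if meta:
        # Pick up only the metadata keys this function knows about;
        # everything else in `meta` is ignored here.
        options.update({k: v for k, v in meta.items() if k in defaults})
    # Explicit keywords override metadata; None means "not given".
    options.update({k: v for k, v in overrides.items() if v is not None})
    return options


defaults = {"multichannel": False, "colorspace": "gray"}
meta = {"multichannel": True, "colorspace": "rgb", "camera": "unknown"}

# `colorspace` is given explicitly, so it wins over the value in `meta`.
opts = resolve_options(defaults, meta, colorspace="hsv")
```

This mirrors the config-file analogy: the dictionary plays the role of the config file, and individual keywords play the role of command-line arguments.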
@blink1073 That was my original preference a couple of years ago, but I think @stefanv pointed out that it is easily lost after a single round of processing. At least with a meta kwarg, you know explicitly whether it's there or not.
A couple more things to consider:
At any rate, I think this is one of just a few critical issues to figure out for an eventual 1.0 release. Thanks again @soupault for putting it together! =)
Yes, unfortunately the
Now, w.r.t. API: there's a subtle distinction between meta-data and keyword arguments. Keywords tell the function "this is how I want to look at the data", whereas meta-data says "this is what I know about my data". So, I recommend that keywords override meta-data (and, thus, it also implies that keywords should still be included in the function signature: it tells the user which parts of meta-data are important to the function, and allows explicitly overriding behavior as described above).
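A minimal sketch of that distinction, assuming a hypothetical function `spatial_shape` and a hypothetical `multichannel` metadata key (neither is taken from scikit-image): the keyword stays in the signature, defaults to `None`, and falls back to the metadata only when the caller does not state how they want to view the data.

```python
def spatial_shape(shape, multichannel=None, meta=None):
    """Return the spatial part of an image shape.

    Hypothetical example: `multichannel` stays in the signature, so the
    reader can see which piece of metadata matters to this function.
    """
    meta = meta or {}
    if multichannel is None:
        # No explicit request: fall back to what is known about the data.
        multichannel = meta.get("multichannel", False)
    return tuple(shape[:-1]) if multichannel else tuple(shape)


known = {"multichannel": True}

# "This is what I know about my data": the last axis holds channels.
as_color = spatial_shape((64, 64, 3), meta=known)

# "This is how I want to look at the data": treat it as a 3-D volume anyway.
as_volume = spatial_shape((64, 64, 3), multichannel=False, meta=known)
```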
Just to revive this discussion a bit: I had a couple of conversations at SciPy 2018 about this. I'll give one-sentence overviews, but I'm hoping that the people I ping will chime in with more details about their ideas.
First I'd like to point out that we've moved to ImageIO for our IO in #3126. I don't know currently whether we toss out the input
Second, @JDWarner proposed
But, third, as a counterpoint, @danielballan suggested that changing what is returned based on input values is a bad habit, and he's ok with scikit-image being the lowest-level layer in the image processing stack that others can build on (i.e., essentially the current status quo), and that if we want to support meta maybe we can do this in a parallel namespace (
To be honest, although I am very sympathetic to the principles of type purity promoted by this proposal, the end result makes me unhappy. "Practicality beats purity", and a
The truth is that the Pythonic way to do this is indeed a more complex object that builds on NumPy arrays (e.g. XArrays), and/or duck-typing (e.g. adding a
I wrote up some things from imageio's end here: imageio/imageio#362 In short, I'd be happy to accommodate skimage's needs, and if possible, maybe even get rid of that
I'm looking forward to seeing what you'll come up with. My 2ct for a new namespace; I agree that a
I'm not sure what value we can add by having yet another custom way of handling metadata. I'm of the mindset that we should not create another class/collection of objects that acts like
I also want to make sure we also don't get stuck on the "color conversion" issue too much. While it is an important issue, we should be able to TRUST the user of our library and not hand hold them too much. I agree it is the first issue people run into when they open their images for the first time, but it quickly becomes a trivial issue in my mind when users learn that their processing can probably be done on the gray image (or maybe just the red channel
Keep in mind that many functions in the color conv module are limited to 10 lines of code (but 30 lines of documentation). I'm willing to bet that many users probably wrote them themselves before finding scikit-image.
Finally, datatype, quantization, rounding errors, and vectorization are things you MUST learn (slowly) when doing data analysis. We can't shield users from that. What do we do when the user mangles the metadata dictionary and opens an issue?
Most functions only need one or two keyword arguments regarding the image's metadata. Asking a user to buy into a whole OO system is pretty costly. It would have probably turned me away from scikit-image. That is basically what turned me away from PIL's opaque images.
I've come to use
At the end of the day, I think the power of scikit-image will be to position itself as a powerful library for image processing, as opposed to metadata handling. I am of the mind that documentation and keywords like
Regarding the strange warnings, I think those would be best addressed by raising an error. I was bit by the
This is my position too. @jni accurately summarized my misgivings about breaking return type stability above. A more serious concern I have is that correctly propagating metadata through, and updating it accurately to reflect the action of a given function, is a hard problem. It also makes it harder to immediately see what a function's inputs are. If a function accepts a dictionary of metadata, how do I know which keys in that dictionary affect the function's behavior and which are just being passed through (and maybe updated)? I think it is better to force the user to handle these things and pass in exactly what a function needs, so both the author and future readers of the code will understand what the dependencies are, i.e. what choices are effectively being made.
To add onto @stefanv and @jni's remarks above about a
Similarly, the suggestion of capturing the "processing history" as part of this worries me. Many projects in the workflow-management/provenance-capturing space have tried to do this with mixed results, and I think it would be risky for scikit-image to attempt to adopt this goal as part of its core functionality.
But that doesn't mean scikit-image can't do anything to help. As a concrete step, would it make sense for the reading functions to return a
Hey @danielballan, nice to see you again! =P
Yes, but I will need to solve it at some point. So the question is whether it should live in scikit-image or in various field-specific subprojects. Over on the corresponding imageio issue, I kind of argued myself into the latter corner. =P A big advantage of the field-specific option is that one only needs to worry about some subset of metadata. For example, in light microscopy, one wants emission wavelengths for each channel, but these would be pretty strange to deal with in normal photographs.
Fair points. Just an idea. =) I don't necessarily mean to do something fancy. From the scikit-image perspective it could contain a list of strings containing the function name and the values of the arguments. Each function would just append itself to that list. (i.e., this would be intended for human consumption, not machine consumption. That's an easier problem, I think.)
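As a rough sketch of the "list of strings" idea, assuming a hypothetical decorator `record_history` and a hypothetical `"history"` key (neither exists in scikit-image): each wrapped function appends its own name and argument values to the list. Returning a `(result, meta)` pair is just one possible choice here, and it does change the return type, which is exactly the concern raised earlier in the thread.

```python
import functools


def record_history(func):
    """Append a human-readable call description to meta["history"].

    Hypothetical decorator for illustration; plain lists stand in for images.
    """
    @functools.wraps(func)
    def wrapper(image, *args, meta=None, **kwargs):
        result = func(image, *args, **kwargs)
        new_meta = dict(meta or {})  # copy, so the caller's dict is untouched
        arg_text = ", ".join(
            [repr(a) for a in args] + [f"{k}={v!r}" for k, v in kwargs.items()]
        )
        entry = f"{func.__name__}({arg_text})"
        new_meta["history"] = list(new_meta.get("history", [])) + [entry]
        return result, new_meta
    return wrapper


@record_history
def scale(image, factor=1.0):
    return [v * factor for v in image]


@record_history
def clip(image, low, high):
    return [min(max(v, low), high) for v in image]


pixels, meta = scale([1, 2, 3], factor=2.0)
pixels, meta = clip(pixels, 0, 5, meta=meta)
```

The entries are meant to be read by a person, not replayed by a machine, so no attempt is made to serialize arguments beyond their `repr`.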
Yes, although as I mentioned on imageio, this is probably something to do at that level, rather than skimage.
I think @stefanv's distinction is very good here: let's never alter a function's behaviour based on
This is a very good idea to grow from. I think if we can get imageio "standard metadata" working for a few image file types, then we can start to implement metadata handling not in code, but in documentation. As we write more docs about how to deal with metadata correctly, the right approach for code might become apparent.
I'm coming along to this point of view. =)
Thanks all, this is becoming a very useful thread indeed!