Make the metadata keys that uniquely identify a DataArray (DataID) configurable per reader #1088

Merged
merged 208 commits into pytroll:master from feature-dynamic-datasetids Aug 4, 2020

Conversation

mraspaud
Member

@mraspaud mraspaud commented Feb 27, 2020

This is an attempt at making DatasetIDs dynamic.

DatasetID objects are used in Satpy to index datasets in a Scene object. They are namedtuples containing a fixed set of properties that we considered a common denominator for all datasets readable by Satpy. However, it has become clear on multiple occasions that keeping all the properties relevant to all the datasets is increasingly difficult. For example, polarization and level each apply to only one dataset type (SAR-C and Geocat respectively). At the same time, more dataset types coming into Satpy need new properties, e.g. SLSTR, whose nadir and oblique views we need to include in the DatasetID if we want to discriminate the datasets efficiently.

To avoid having to extend the set of properties needed to identify all datasets in Satpy, I propose here to make the DatasetID objects dynamic, in that the set of properties they use to identify datasets can be tuned per reader. At the same time, we are still able to compare DatasetIDs from different datasets. Moreover, properties that are irrelevant to a given dataset are now omitted entirely. For example, wavelength doesn't make sense for a longitude dataset, and hence will not be part of the DatasetID needed to identify longitude, even if the reader configuration defines it as a possible property for identifying datasets.

For backwards compatibility, a default DatasetID is still available, but it should hopefully be removed soon.
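Concretely, the idea is that each reader's YAML configuration declares which metadata keys make up the identifier. The snippet below is an illustrative sketch only; the key names, types, and enum values approximate the shape of the final configuration and are not a verbatim copy of any shipped reader:

```yaml
# Illustrative sketch of per-reader identification keys (names approximate).
reader:
  name: slstr_l1b
  data_identification_keys:
    name:
      required: true
    wavelength:
      type: !!python/name:satpy.dataset.dataid.WavelengthRange
    view:
      enum:
        - nadir
        - oblique
    calibration:
      enum:
        - reflectance
        - brightness_temperature
        - radiance
```

A reader that never needs, say, polarization simply leaves it out, and it will never appear in that reader's DataIDs.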

@mraspaud mraspaud added the enhancement, component:dep_tree, refactor, and component:scene labels Feb 27, 2020
@mraspaud mraspaud self-assigned this Feb 27, 2020
@mraspaud
Member Author

PS: In its current shape, I'm expecting this PR to break a lot of stuff. I'm throwing it out here hoping for insightful comments that will show whether I'm going in the right direction before I spend more time on it. I.e., please give some feedback!

@mraspaud mraspaud added this to In progress in PCW Spring 2020 May 4, 2020
@gerritholl
Collaborator

I wonder if this might help make composite requirements more flexible. I have a use case in fogpy where I have defined a composite that needs:

  1. Channel data for 7 channels, which may come from SEVIRI, ABI, FCI, or others.
  2. Cloud microphysics data (3 parameters), which may come from NWCSAF, CLAAS/CMSAF, ABI L2, or others. They don't have the same name in different sources.

Currently, the former can match either by name ("IR_108", "C02") or by wavelength (10.8) (I'm not sure if this is the complete story, as I still don't have a full understanding of how compositors work; see also #714 and #161). The latter can match only by name, I think. With this PR, the cloud microphysics datasets could have a provides: field with a name matching the requires: field in the composite, as an alternative to matching non-channel datasets by name only.
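As a sketch of that idea, purely hypothetical syntax that this PR does not implement (the compositor path and field names below are made up for illustration):

```yaml
# Hypothetical: a composite declares abstract requirements, and reader
# datasets advertise what they satisfy via a matching "provides" field.
composites:
  fog:
    compositor: !!python/name:fogpy.composites.FogCompositor
    requires:
      - cloud_optical_depth
      - cloud_effective_radius
```

The reader-side dataset definitions would then carry e.g. `provides: cloud_optical_depth` regardless of whether the source calls the variable `cot`, `COT`, or something else.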

@djhoese
Member

djhoese commented May 12, 2020

I don't think dynamic datasets would help with this, and I wouldn't want them to evolve to support this functionality in composites. DatasetIDs are identifiers; this requires/provides system is something separate. To fully support it we'd have to add provides to all DatasetIDs from all readers. A similar approach that I could see is putting standard_name in the DatasetID and using that as a way to find something that would serve the purpose needed by the composite. If not that, then perhaps a completely separate interface for this.

@mraspaud
Member Author

I see two benefits off the top of my head:

  1. The list of calibration levels is fixed, and trying to ask for a wrong calibration will automatically fail, independently of the reader.
  2. The list is ordered, allowing Satpy to sort DataIDs by a predetermined calibration order.

As for serialisation, a couple of methods are already implemented; the idea is that the enum is converted back to text on dump and recreated on the fly on load.
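A minimal sketch of that round-trip using the stdlib Enum (the helper name here is illustrative, not Satpy's actual API):

```python
from enum import Enum

# Illustrative helper; the actual Satpy implementation differs in detail.
def make_calibration_enum(levels):
    """Build an ordered Enum from a reader-configured list of calibration names."""
    return Enum("calibration", levels)

cal = make_calibration_enum(["reflectance", "radiance", "counts"])

# The list is ordered: member values follow the configured preference order.
ordered = sorted(cal, key=lambda member: member.value)

# Dump: convert the enum back to text -- only the member name is stored.
serialized = cal["radiance"].name  # "radiance"

# Load: recreate the enum on the fly and look the member up by name.
cal_again = make_calibration_enum(["reflectance", "radiance", "counts"])
restored = cal_again[serialized]
```

Because only the name travels across the serialisation boundary, no Enum object ever needs to be pickled.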

@djhoese
Member

djhoese commented Jul 29, 2020

> The list of calibration levels is fixed

Fixed per reader though, right? What if a reader defines a new calibration that is in the middle of the existing/common calibrations? That means that calibration.reflectance from old code will no longer equal calibration.reflectance in new code. This is especially bad if you are serializing things between versions where the same reader has added a new calibration level in this way. This is something SIFT has run into time and time again because we store pickled enums in a database: if we change the enums by doing anything other than appending, things can no longer be unpickled.
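This failure mode can be reproduced with plain stdlib Enums (the level names below are made up for illustration):

```python
from enum import Enum

# "Old" calibration list for a reader.
CalOld = Enum("calibration", ["counts", "radiance", "reflectance"])
# "New" list with a level inserted in the middle:
CalNew = Enum("calibration", ["counts", "corrected_radiance", "radiance", "reflectance"])

# The auto-assigned value of every member after the insertion point shifts,
# so anything persisted by value (e.g. pickled members) no longer matches.
print(CalOld.reflectance.value, CalNew.reflectance.value)  # 3 4
```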

> The list is ordered, allowing Satpy to sort DataIDs by a predetermined calibration order.

That's cool, but it gets back to what was discussed on Slack: it is hard to determine the ordering/sorting of things in a DataID. What is best: min or max resolution, lower or higher calibration? Having this be explicit rather than the result of some dynamic Enum ordering seems like a more maintainable solution.

> As for serialisation, there are already a couple of methods for it implemented, the idea being that the enum is converted back to text on dump, and recreated on the fly on load.

This isn't useful if it only works inside the DataID class, since the enum also shows up in the .attrs of the DataArray.

@mraspaud
Member Author

mraspaud commented Jul 29, 2020

We can find a way to serialise that doesn't bring along any enum: just the name of the calibration level will be passed (also in the data attrs).

Adding flexibility by making the DataID configurable per reader does indeed make the code more abstract in some places. I'm trying to make this as clear as possible within the current code base, but I understand the code can be harder to follow. Still, I feel it's an improvement over having hardcoded DatasetID keys.

Also, having acceptance tests that verify what SIFT expects from Satpy would be useful to ensure that unacceptable changes don't go unnoticed.

Edit: remove grumpy comments, sorry about those.

@mraspaud
Member Author

> > The list of calibration levels is fixed
>
> Fixed per reader though, right? What if a reader defines a new calibration that is in the middle of the existing/common calibrations? That means that calibration.reflectance from old code will no longer equal calibration.reflectance in new code. This is especially bad if you are serializing things between versions where the same reader has added a new calibration level in this way. This is something SIFT has run into time and time again because we store pickled enums in a database: if we change the enums by doing anything other than appending, things can no longer be unpickled.

This is now addressed by passing only the name of the enum member to the DataArray attrs and when serializing the DataID.

> > The list is ordered, allowing Satpy to sort DataIDs by a predetermined calibration order.
>
> That's cool, but it gets back to what was discussed on Slack: it is hard to determine the ordering/sorting of things in a DataID. What is best: min or max resolution, lower or higher calibration? Having this be explicit rather than the result of some dynamic Enum ordering seems like a more maintainable solution.

It is explicit in the reader configs, where we give the calibration levels in order of preference. I don't see how we can make it clearer without hardcoding all the use cases.

> > As for serialisation, there are already a couple of methods for it implemented, the idea being that the enum is converted back to text on dump, and recreated on the fly on load.
>
> This isn't useful if it only works inside the DataID class, since the enum also shows up in the .attrs of the DataArray.

As mentioned above, this has been addressed in the latest commit.

@mraspaud
Member Author

mraspaud commented Jul 30, 2020

Also, just to clarify why I think the enums aren't so problematic here: the implementation of the dynamic enum in this PR implements __eq__ as checking name equality only (i.e. not the value). So if we define dataid1 with calibration levels [level1, level2, level3] and dataid2 with calibration levels [level1, level1.5, level2, level2.5, level3], with

dataid1 = (name='hi', calibration='level2')
dataid2 = (name='hi', calibration='level2')

they will be considered equal!
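A toy version of that name-only comparison (this is not Satpy's actual class, and the level names are simplified to valid Python identifiers):

```python
from enum import Enum

class NameComparableEnum(Enum):
    """Enum whose members compare by name only, ignoring position-derived values."""
    def __eq__(self, other):
        return isinstance(other, Enum) and self.name == other.name

    def __hash__(self):
        return hash(self.name)

# Two readers configure different calibration lists...
Cal1 = NameComparableEnum("calibration", ["level1", "level2", "level3"])
Cal2 = NameComparableEnum(
    "calibration", ["level1", "level1b", "level2", "level2b", "level3"]
)

# ...yet members with the same name still compare equal,
# even though their auto-assigned values differ (2 vs 3).
print(Cal1.level2 == Cal2.level2)  # True
```

Within one reader the values still give the configured ordering; across readers only the names matter for equality.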

Member

@djhoese djhoese left a comment


As I mentioned on slack:

  1. I have NOT read through the code.
  2. I think it is good to merge. Any additional changes can be moved to future pull requests. There may be too much code to review, and at this point you are the one defining what "good enough" code is anyway.

We've had multiple people test this PR, and I've been using it for a couple of days. Everything we've tested works. Anything we discover from now on will require an additional PR, and we are more likely to find issues if this is in the master branch.

Great job and thanks for taking the time and effort to get this done. I think this will really open the door for future flexibility in Satpy.

One thing: could you update the title so someone reading the changelog will get a good idea/summary of what has changed (maybe mention DataID)?

PCW Spring 2020 automation moved this from In progress to Ready to merge Aug 4, 2020
@mraspaud mraspaud changed the title Make DatasetIDs dynamic Make the metadata keys that uniquely identify a DataArray (DataID) configurable per reader Aug 4, 2020
@mraspaud
Member Author

mraspaud commented Aug 4, 2020

Thanks all for looking at this!

@mraspaud mraspaud merged commit 367d725 into pytroll:master Aug 4, 2020
PCW Spring 2020 automation moved this from Ready to merge to Done Aug 4, 2020
@sfinkens
Member

sfinkens commented Aug 4, 2020

🎉 Great effort! Probably the biggest PR in satpy's history.

@mraspaud mraspaud deleted the feature-dynamic-datasetids branch September 17, 2020 09:48
djhoese pushed a commit to ssec/sift that referenced this pull request May 13, 2022
This is mostly a commit from David Hoese (see
6a380df
- "Add support for Satpy 0.23+") which has been updated to match current
Satpy master, i.e., to take commit "Refactor scene to privatise some
attributes and methods"
pytroll/satpy@e8f2218
into account.

NOTE: From this commit MTG-SIFT doesn't work with Satpy versions <=
0.22 since it adapts to API changes introduced with the Satpy pull
request "Make the metadata keys that uniquely identify a DataArray
(DataID) configurable per reader #1088"
pytroll/satpy#1088

Successfully merging this pull request may close these issues.

Saving true color GOES image requires double-resampling if calibration='radiance'