
Make combine_arrays understand non-numpy arrays #1216

Merged
merged 9 commits on May 28, 2020

Conversation

gerritholl
Collaborator

@gerritholl gerritholl commented May 25, 2020

Make combine_arrays understand non-numpy arrays in attributes. I believe this should fix #1215.

- Added a regression test triggering the ValueError raised when trying to read an FCI composite (see pytroll#1215).
- In combine_metadata, replace an ndarray instance check by a check for the array interface (attribute __array__).
- Added another regression test to combine_metadata that better simulates the situation with the FCI reader. The ancillary_variables attribute is actually List[xarray.DataArray], so this needs to be handled as well.
- In combine_metadata, cover lists of arrays such as occur when trying to read an FCI composite.
- The list-of-arrays check was catching too much, such as [()]. Only catch lists of arrays when they are non-empty.
- PEP 8 fix; I never know where it wants those pesky list generators indented!
@gerritholl gerritholl marked this pull request as ready for review May 25, 2020 15:16
@coveralls

coveralls commented May 25, 2020

Coverage Status

Coverage increased (+0.01%) to 89.777% when pulling f307d98 on gerritholl:combine-metadata-array-interface into 836c657 on pytroll:master.

@codecov

codecov bot commented May 25, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@836c657).
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##             master    #1216   +/-   ##
=========================================
  Coverage          ?   89.77%           
=========================================
  Files             ?      202           
  Lines             ?    30031           
  Branches          ?        0           
=========================================
  Hits              ?    26961           
  Misses            ?     3070           
  Partials          ?        0           
Impacted Files Coverage Δ
satpy/dataset.py 93.75% <100.00%> (ø)
satpy/tests/test_dataset.py 100.00% <100.00%> (ø)
satpy/tests/reader_tests/test_viirs_compact.py 95.83% <0.00%> (ø)
satpy/tests/enhancement_tests/test_enhancements.py 100.00% <0.00%> (ø)
satpy/tests/test_writers.py 98.55% <0.00%> (ø)
satpy/tests/reader_tests/test_omps_edr.py 98.98% <0.00%> (ø)
satpy/readers/goes_imager_nc.py 65.72% <0.00%> (ø)
satpy/composites/crefl_utils.py 84.52% <0.00%> (ø)
satpy/readers/generic_image.py 93.33% <0.00%> (ø)
satpy/utils.py 70.90% <0.00%> (ø)
... and 194 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@gerritholl
Collaborator Author

gerritholl commented May 25, 2020

From the combine_metadata docstring:

If any keys are not equal or do not exist in all provided dictionaries
then they are not included in the returned dictionary.

  1. Is that correct? The implementation compares values as well as keys! Should this be "if any values are not equal"?
  2. If yes, how could this be implemented in a dask-friendly way? The value of the ancillary_variables attribute is 124 MB per channel for the highest-resolution FCI channels. How can we decide whether or not to include a key if this decision depends on the result of a postponed evaluation?

(I'd like to resolve these questions before I do more stylistic work, such as solving the issues codebeat is complaining about)

@gerritholl
Collaborator Author

One option would be, for arrays, to compare object identity instead of values, or perhaps to use numpy.shares_memory.
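The two candidates behave differently on views; a small numpy illustration (not satpy code):

```python
import numpy as np

a = np.arange(10)
b = a[2:5]    # a view: shares memory with a, but a different object
c = a.copy()  # equal values, separate memory

# Object identity is cheap and never touches the data,
# but only matches the very same Python object.
assert a is a and a is not b

# shares_memory also matches views/subsets of an array,
# which may be broader than intended here.
assert np.shares_memory(a, b)
assert not np.shares_memory(a, c)
```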

@djhoese
Member

djhoese commented May 26, 2020

is that correct? The implementation is comparing values as well as keys! Should this be "if any values are not equal"?

I think this was supposed to mean "if the values of any keys are not equal".

Regardless, would it just be safer/smarter to special-case ancillary_variables like we do "time" attributes? Or what about special-casing just DataArray objects specifically and using their .name property for comparison? Or shape, dtype, and name? Would shares_memory work for dask arrays? That function also returns True if one input is a subset of the other, right? That might not be what we want.

@gerritholl
Collaborator Author

I don't know what we want, because I don't have a good idea of the intention of this function, how it is used in practice, or what might break in case of a false positive or false negative equals matching. I'm fine with any solution that doesn't involve computing the arrays at this stage.

@djhoese
Member

djhoese commented May 26, 2020

I think the original use cases were things that are now included as ancillary_variables or as coordinates on a DataArray. For example, VIIRS has a moon illumination fraction that could be provided as a per-scan array. I think this is now requested/loaded as a separate dataset by the few composites that use it. Another example is pressure levels for the NUCAPS reader. I think these are still added to .attrs as an attribute. I'm not sure an identity check would work in this case, since the attribute might be recreated for every dataset loaded by the reader.

Off the top of my head, I think ancillary_variables is the only case where a reader developer should need to add delayed "things" (dask arrays) outside of the normal DataArray .coords interface. So a special case for ancillary_variables would be OK with me, but a blanket DataArray specific check would be fine too. In either case I think we can assume DataArrays and use the name and shape of the arrays as a basic equality check.

@gerritholl
Collaborator Author

@djhoese @mraspaud Do you have a preferred alternative?

@mraspaud
Member

I would use `is` as a comparator for DataArrays.

In combine_metadata, compare arrays with object identity rather than by
value, avoiding expensive computation.  Refactor the combine_metadata
function with three small helpers to reduce code complexity.  Expand
unit tests for combine_metadata.
PEP 8 fixes, forgot to remove #breakpoint() from test code.
@gerritholl
Collaborator Author

The test failures appear unrelated to this PR.

Member

@djhoese djhoese left a comment


I was hoping this new functionality would only apply to DataArrays and not numpy or dask arrays, but this is probably better overall.

Thanks!

Member

@mraspaud mraspaud left a comment


LGTM

Development

Successfully merging this pull request may close these issues.

FCI reader fails to load composites due to metadata issues
4 participants