Fix VIIRS SDR reader not handling multi-granule files with fewer scans #1771
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##             main    #1771      +/-   ##
==========================================
+ Coverage   92.82%   92.88%   +0.05%
==========================================
  Files         263      265       +2
  Lines       38717    38910     +193
==========================================
+ Hits        35940    36141     +201
+ Misses       2777     2769       -8
```
Thank you for putting in the effort to fix this bug and taking the time to do some refactoring along the way; it is highly appreciated. This PR LGTM, I just have a couple of comments that may or may not need action.
# The user may have requested a different chunking scheme, but we need
# per granule chunking right now so factor chunks map 1:1 to data chunks
Just a general comment here: I think the chunking scheme should be dependent on the data, not directly on what the user requested. So I'd like to see satpy accepting chunk sizes in MB, for example, instead of numbers of pixels, which could be wasteful. The developers of the file handler probably know best how to chunk the data for size constraints.
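For reference, dask already supports byte-based chunk sizing via its `array.chunk-size` configuration option combined with `chunks="auto"`. A minimal sketch of what size-based chunking could look like (the array shape here is illustrative, not tied to this reader):

```python
# Sketch: let dask pick chunk shapes under a byte-size limit instead of
# hard-coding pixel counts. Uses dask's documented "array.chunk-size"
# config option; the array dimensions below are made up for illustration.
import dask
import dask.array as da
import numpy as np

with dask.config.set({"array.chunk-size": "16MiB"}):
    # chunks="auto" makes dask choose chunk shapes that respect the limit
    arr = da.from_array(np.zeros((4576, 3200), dtype=np.float32),
                        chunks="auto")

print(arr.chunksize)  # each chunk stays roughly within ~16 MiB
```

Whether satpy should expose this to users is a separate design question, but the mechanism exists in dask today.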
I agree, but is that something I should try to fix in this PR? This reader uses the HDF5 utility handler so it would be difficult to do it cleanly for just this reader.
Nothing you need to fix here. I was just reacting to the comment.
LGTM! Do you plan on addressing the codebeat issues or should I just merge this?
Those are codebeat being confused. You merged my crefl optimization PR, so now codebeat thinks that this PR is introducing the issues I fixed in that PR. ...There are no new issues in this PR. Feel free to merge.
This was brought up by Artur on slack. He had some direct broadcast files from CSPP that had 6 granules in one file. It turns out they weren't readable by the `viirs_sdr` reader. The reader would fail when applying scale factors because it was calculating a different number of total rows. The data arrays in his files total 4608 rows for the M02 band. If you check the number of scans per granule (h5dump output + grep) you'll notice that the first and the last have one missing:
So [47, 48, 48, 48, 48, 47] scans per granule. The data array is truncated based on this information, so each of those scan counts is multiplied by 16, leading to a final data array with 4576 rows. For the scaling factors, previously we were taking `data.shape[0] // num_granules`. In this data case that is not a whole number (762.666... versus 762), which leads to 4572 rows for the expanded scaling factors. Finally, numpy/dask fails because it can't do arithmetic between 4576 and 4572 rows.

This overall situation for this particular data case (one with truncated granules at the beginning and end of the data file) would actually produce an invalid factor <-> granule data mapping and apply the wrong factors. I think we made the assumption that a truncated granule could only happen at the end of a pass... apparently that's not true.
The Solution
In this PR I change the file handler in two important ways:
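The core idea of the fix can be sketched with numpy: expand each granule's factors by that granule's *actual* row count instead of assuming all granules are equally sized. Variable names here are illustrative, not the reader's actual API:

```python
# Hedged sketch of per-granule factor expansion. Each granule contributes
# its own number of rows, so repeat each granule's factor by that count;
# the expanded factors then line up row-for-row with the data array.
import numpy as np

scans_per_granule = [47, 48, 48, 48, 48, 47]
rows_per_scan = 16
granule_rows = [s * rows_per_scan for s in scans_per_granule]

# one factor value per granule (faked here for illustration)
factors = np.arange(len(scans_per_granule), dtype=np.float32)

# per-granule repeat counts -> shapes now match the truncated data array
expanded = np.repeat(factors, granule_rows)
print(expanded.shape)  # (4576,)
```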
Other Info
I had to substantially improve the tests to create one that failed for this. Because the failing number of scans per granule depended on a non-whole integer division result, I discovered that this failure depends on the number of granules provided and their size. A lot of our fake data did the bare minimum needed to produce something that could be tested, but it wasn't accurate: the number of granules was always 1, and the data was always only a single scan (10 rows). I've updated this to be more realistic, with 32 rows per scan and the data array repeated for test cases with more than one granule.
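The shape of the more realistic fake data can be illustrated like this (the column count and granule count here are made up; these are not satpy's actual test helpers):

```python
# Illustrative sketch of the improved fake test data: a full 32-row scan
# per granule, tiled once per granule instead of a single 10-row array.
import numpy as np

rows_per_scan = 32
cols = 300        # hypothetical column count
num_granules = 3  # hypothetical granule count

one_granule = np.random.rand(rows_per_scan, cols).astype(np.float32)
fake_data = np.tile(one_granule, (num_granules, 1))

print(fake_data.shape)  # (96, 300): 3 granules of one 32-row scan each
```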