
Fix VIIRS SDR reader not handling multi-granule files with fewer scans #1771

Merged
merged 8 commits into pytroll:main from djhoese:bugfix-viirs-sdr-multigran on Jul 23, 2021

Conversation


@djhoese djhoese commented Jul 22, 2021

This was brought up by Artur on Slack. He had some direct broadcast files from CSPP that had 6 granules in one file. It turns out they weren't readable by the viirs_sdr reader: the reader would fail when applying scale factors because it calculated a different total number of rows than the data actually had. The data arrays in his files total 4608 rows for the M02 band. If you check the number of scans per granule you'll notice that the first and the last have one missing:

h5dump output + grep
$ h5dump -A SVM02_j01_d20210719_t0941272_e0949569_b18998_c20210719100133953612_cspp_dev.h5 | grep -A 4 "Number_Of_Scans"
            ATTRIBUTE "N_Number_Of_Scans" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
               DATA {
               (0,0): 47
--
            ATTRIBUTE "N_Number_Of_Scans" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
               DATA {
               (0,0): 48
--
            ATTRIBUTE "N_Number_Of_Scans" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
               DATA {
               (0,0): 48
--
            ATTRIBUTE "N_Number_Of_Scans" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
               DATA {
               (0,0): 48
--
            ATTRIBUTE "N_Number_Of_Scans" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
               DATA {
               (0,0): 48
--
            ATTRIBUTE "N_Number_Of_Scans" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
               DATA {
               (0,0): 47

So the file has [47, 48, 48, 48, 48, 47] scans per granule. The data array is truncated based on this information, so multiplying each scan count by 16 rows per scan gives a final data array with 4576 rows. For the scaling factors, however, we previously computed rows-per-granule as data.shape[0] // num_granules. In this case the true quotient is not a whole number (4576 / 6 = 762.666..., floored to 762), so the expanded scaling factors end up with 6 * 762 = 4572 rows. Finally, numpy/dask fails because it can't do arithmetic between arrays with 4576 and 4572 rows.
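
The mismatch can be reproduced with plain arithmetic. The numbers below are the ones from this file; this is just an illustration of the old logic, not the reader code itself:

    scans_per_granule = [47, 48, 48, 48, 48, 47]
    rows_per_scan = 16

    data_rows = sum(s * rows_per_scan for s in scans_per_granule)  # 4576
    num_granules = len(scans_per_granule)                          # 6

    # Old logic: assume every granule contributes the same number of rows.
    rows_per_granule = data_rows // num_granules                   # floor(762.67) == 762
    factor_rows = rows_per_granule * num_granules                  # 4572

    assert (data_rows, factor_rows) == (4576, 4572)  # shapes no longer line up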

Beyond the crash, this particular data case (truncated granules at both the beginning and end of the file) would also produce an invalid factor <-> granule mapping and apply the wrong factors to the data. I think we made the assumption that a truncated granule can only happen at the end of a pass; apparently that's not true.

This Solution

In this PR I change the file handler in two important ways:

  1. I update the factors to be chunked per granule.
  2. I also temporarily rechunk the data and use map_blocks to map each factor pair to its chunk of data (see the sketch after this list). This should perform much better because we aren't repeating factors (producing a large array that doesn't need to exist) and because we are only adding one high-level dask task.
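
A minimal sketch of the per-granule map_blocks idea from item 2, using made-up shapes and a hypothetical _apply_factors helper; this illustrates the approach and is not the actual satpy implementation:

    import dask.array as da
    import numpy as np

    # Rows per granule for the file above: [47, 48, 48, 48, 48, 47] scans * 16 rows each.
    rows_per_granule = [s * 16 for s in [47, 48, 48, 48, 48, 47]]
    num_granules = len(rows_per_granule)

    # Fake data rechunked so each dask chunk holds exactly one granule.
    data = da.ones((sum(rows_per_granule), 3200), dtype=np.float32,
                   chunks=(tuple(rows_per_granule), 3200))

    # One (scale, offset) pair per granule, chunked 1:1 with the data chunks.
    factors = da.from_array(
        np.array([[2.0, 1.0]] * num_granules, dtype=np.float32),
        chunks=((1,) * num_granules, 2))

    def _apply_factors(block, factor_block):
        # Each call gets one granule of data and its matching factor pair.
        scale, offset = factor_block[0, 0], factor_block[0, 1]
        return block * scale + offset

    scaled = da.map_blocks(_apply_factors, data, factors,
                           chunks=data.chunks, dtype=data.dtype)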

Other Info

I had to significantly improve the tests to write one that actually failed for this. Because the failure depends on a non-round rows-per-granule result, I discovered that it also depends on the number of granules provided and their size. A lot of our fake data did the bare minimum needed to produce something that could be tested, but it wasn't accurate: the number of granules was always 1 and the data was always only a single scan (10 rows). I've updated this to be more realistic, with 32 rows per scan and the data array repeated for test cases with more than one granule.
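
For context, this is roughly what the more realistic fake data generation looks like conceptually (a hypothetical helper with made-up sizes, not the actual code in test_viirs_sdr.py), assuming 32 rows per scan as described above:

    import numpy as np

    def make_fake_band_data(scans_per_granule, rows_per_scan=32, cols=300):
        # One array per granule; truncated granules simply have fewer scans.
        granules = [np.ones((scans * rows_per_scan, cols), dtype=np.float32)
                    for scans in scans_per_granule]
        return np.concatenate(granules, axis=0)

    # A multi-granule case with truncated first and last granules.
    fake_data = make_fake_band_data([47, 48, 48, 48, 48, 47])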

  • Tests added

@codecov codecov bot commented Jul 22, 2021

Codecov Report

Merging #1771 (c244955) into main (4d34625) will increase coverage by 0.05%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main    #1771      +/-   ##
==========================================
+ Coverage   92.82%   92.88%   +0.05%     
==========================================
  Files         263      265       +2     
  Lines       38717    38910     +193     
==========================================
+ Hits        35940    36141     +201     
+ Misses       2777     2769       -8     
Flag Coverage Δ
behaviourtests 4.78% <0.00%> (-0.02%) ⬇️
unittests 93.42% <100.00%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
satpy/readers/viirs_sdr.py 86.55% <100.00%> (+0.92%) ⬆️
satpy/tests/reader_tests/test_viirs_sdr.py 99.50% <100.00%> (+0.04%) ⬆️
satpy/demo/__init__.py 100.00% <0.00%> (ø)
satpy/demo/viirs_sdr.py 100.00% <0.00%> (ø)
satpy/composites/crefl_utils.py
satpy/modifiers/_crefl_utils.py 93.18% <0.00%> (ø)
satpy/demo/utils.py 100.00% <0.00%> (ø)
satpy/demo/seviri_hrit.py 100.00% <0.00%> (ø)
satpy/tests/test_demo.py 99.29% <0.00%> (+0.33%) ⬆️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@coveralls coveralls commented Jul 22, 2021

Coverage Status

Coverage remained the same at 93.366% when pulling c244955 on djhoese:bugfix-viirs-sdr-multigran into 4d34625 on pytroll:main.

@mraspaud mraspaud left a comment


Thank you for putting in the effort to fix this bug and for taking the time to do some refactoring along the way; it is highly appreciated. This PR LGTM, I just have a couple of comments that may or may not need action.

Resolved review comments (now outdated) on satpy/tests/reader_tests/test_viirs_sdr.py and satpy/readers/viirs_sdr.py.
Comment on lines +235 to +236
# The user may have requested a different chunking scheme, but we need
# per granule chunking right now so factor chunks map 1:1 to data chunks
Member

Just a general comment here: I think the chunking scheme should depend on the data, not directly on what the user requested. So I'd like to see satpy accepting chunk sizes in MB, for example, instead of numbers of pixels, which can be wasteful. The developers of the file handler probably know best how to chunk the data within size constraints.
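
For reference, dask already exposes a byte-based chunk-size hint that points in this direction; a minimal illustration of that idea (not something this PR changes):

    import dask
    import dask.array as da
    import numpy as np

    # Target roughly 64 MiB per chunk whenever "auto" chunking is requested.
    dask.config.set({"array.chunk-size": "64MiB"})

    arr = np.zeros((20000, 20000), dtype=np.float32)
    darr = da.from_array(arr, chunks="auto")
    print(darr.chunksize)  # chunk shape chosen to stay near the 64 MiB target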

Member Author

I agree, but is that something I should try to fix in this PR? This reader uses the HDF5 utility handler so it would be difficult to do it cleanly for just this reader.

Member

Nothing you need to fix here. I was just reacting to the comment.

@djhoese djhoese changed the title from "Fix VIIRS SDR reader not handling truncated multi-granule files during scaling" to "Fix VIIRS SDR reader not handling multi-granule files with fewer scans" on Jul 23, 2021
@djhoese djhoese requested a review from mraspaud July 23, 2021 11:35
@mraspaud mraspaud left a comment


LGTM! Do you plan on addressing the codebeat issues or should I just merge this?

@djhoese djhoese commented Jul 23, 2021

Those are codebeat being confused. You merged my crefl optimization PR, so now codebeat thinks that this PR is introducing the issues I fixed in that PR. There are no new issues in this PR. Feel free to merge.

@mraspaud mraspaud merged commit 2abebf1 into pytroll:main Jul 23, 2021
@djhoese djhoese deleted the bugfix-viirs-sdr-multigran branch July 23, 2021 12:07