
Fix VIIRS SDR reader not handling multi-granule files with fewer scans #1771

Merged
merged 8 commits into pytroll:main from djhoese:bugfix-viirs-sdr-multigran on Jul 23, 2021

Conversation


@djhoese djhoese commented Jul 22, 2021

This was brought up by Artur on Slack. He had some direct broadcast files from CSPP that had 6 granules in one file. It turns out they weren't readable by the viirs_sdr reader: the reader would fail when applying scale factors because it calculated a different total number of rows than the data actually had. The data arrays in his files total 4608 rows for the M02 band. If you check the number of scans per granule you'll notice that the first and the last have one missing:

h5dump output + grep
$ h5dump -A SVM02_j01_d20210719_t0941272_e0949569_b18998_c20210719100133953612_cspp_dev.h5 | grep -A 4 "Number_Of_Scans"
            ATTRIBUTE "N_Number_Of_Scans" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
               DATA {
               (0,0): 47
--
            ATTRIBUTE "N_Number_Of_Scans" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
               DATA {
               (0,0): 48
--
            ATTRIBUTE "N_Number_Of_Scans" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
               DATA {
               (0,0): 48
--
            ATTRIBUTE "N_Number_Of_Scans" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
               DATA {
               (0,0): 48
--
            ATTRIBUTE "N_Number_Of_Scans" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
               DATA {
               (0,0): 48
--
            ATTRIBUTE "N_Number_Of_Scans" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
               DATA {
               (0,0): 47

So the file has [47, 48, 48, 48, 48, 47] scans per granule. The data array is truncated based on this information, so multiplying each scan count by 16 rows per scan gives a final data array with 4576 rows. For the scaling factors, however, we previously computed rows-per-granule as data.shape[0] // num_granules. In this case the true quotient is not a whole number (4576 / 6 = 762.666..., floored to 762), so the expanded scaling factors end up with 6 * 762 = 4572 rows. Finally, numpy/dask fails because it can't do arithmetic between arrays with 4576 and 4572 rows.
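
The mismatch can be reproduced with plain arithmetic. The numbers below are the ones from this file; this is just an illustration of the old logic, not the reader code itself:

    scans_per_granule = [47, 48, 48, 48, 48, 47]
    rows_per_scan = 16

    data_rows = sum(s * rows_per_scan for s in scans_per_granule)  # 4576
    num_granules = len(scans_per_granule)                          # 6

    # Old logic: assume every granule contributes the same number of rows.
    rows_per_granule = data_rows // num_granules                   # floor(762.67) == 762
    factor_rows = rows_per_granule * num_granules                  # 4572

    assert (data_rows, factor_rows) == (4576, 4572)  # shapes no longer line up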

Beyond the crash, this particular data case (truncated granules at both the beginning and end of the file) would also produce an invalid factor <-> granule mapping and apply the wrong factors to the data. I think we made the assumption that a truncated granule can only happen at the end of a pass; apparently that's not true.

This Solution

In this PR I change the file handler in two important ways:

  1. I update the factors to be chunked per granule.
  2. I also temporarily rechunk the data and use map_blocks to map each factor pair to its chunk of data (see the sketch after this list). This should perform much better because we aren't repeating factors (producing a large array that doesn't need to exist) and because we are only adding one high-level dask task.
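
A minimal sketch of the per-granule map_blocks idea from item 2, using made-up shapes and a hypothetical _apply_factors helper; this illustrates the approach and is not the actual satpy implementation:

    import dask.array as da
    import numpy as np

    # Rows per granule for the file above: [47, 48, 48, 48, 48, 47] scans * 16 rows each.
    rows_per_granule = [s * 16 for s in [47, 48, 48, 48, 48, 47]]
    num_granules = len(rows_per_granule)

    # Fake data rechunked so each dask chunk holds exactly one granule.
    data = da.ones((sum(rows_per_granule), 3200), dtype=np.float32,
                   chunks=(tuple(rows_per_granule), 3200))

    # One (scale, offset) pair per granule, chunked 1:1 with the data chunks.
    factors = da.from_array(
        np.array([[2.0, 1.0]] * num_granules, dtype=np.float32),
        chunks=((1,) * num_granules, 2))

    def _apply_factors(block, factor_block):
        # Each call gets one granule of data and its matching factor pair.
        scale, offset = factor_block[0, 0], factor_block[0, 1]
        return block * scale + offset

    scaled = da.map_blocks(_apply_factors, data, factors,
                           chunks=data.chunks, dtype=data.dtype)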

Other Info

I had to significantly improve the tests to write one that actually failed for this. Because the failure depends on a non-round rows-per-granule result, I discovered that it also depends on the number of granules provided and their size. A lot of our fake data did the bare minimum needed to produce something that could be tested, but it wasn't accurate: the number of granules was always 1 and the data was always only a single scan (10 rows). I've updated this to be more realistic, with 32 rows per scan and the data array repeated for test cases with more than one granule.
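
For context, this is roughly what the more realistic fake data generation looks like conceptually (a hypothetical helper with made-up sizes, not the actual code in test_viirs_sdr.py), assuming 32 rows per scan as described above:

    import numpy as np

    def make_fake_band_data(scans_per_granule, rows_per_scan=32, cols=300):
        # One array per granule; truncated granules simply have fewer scans.
        granules = [np.ones((scans * rows_per_scan, cols), dtype=np.float32)
                    for scans in scans_per_granule]
        return np.concatenate(granules, axis=0)

    # A multi-granule case with truncated first and last granules.
    fake_data = make_fake_band_data([47, 48, 48, 48, 48, 47])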

  • Tests added

@codecov codecov bot commented Jul 22, 2021

Codecov Report

Merging #1771 (c244955) into main (4d34625) will increase coverage by 0.05%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main    #1771      +/-   ##
==========================================
+ Coverage   92.82%   92.88%   +0.05%     
==========================================
  Files         263      265       +2     
  Lines       38717    38910     +193     
==========================================
+ Hits        35940    36141     +201     
+ Misses       2777     2769       -8     
Flag Coverage Δ
behaviourtests 4.78% <0.00%> (-0.02%) ⬇️
unittests 93.42% <100.00%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
satpy/readers/viirs_sdr.py 86.55% <100.00%> (+0.92%) ⬆️
satpy/tests/reader_tests/test_viirs_sdr.py 99.50% <100.00%> (+0.04%) ⬆️
satpy/demo/__init__.py 100.00% <0.00%> (ø)
satpy/demo/viirs_sdr.py 100.00% <0.00%> (ø)
satpy/composites/crefl_utils.py
satpy/modifiers/_crefl_utils.py 93.18% <0.00%> (ø)
satpy/demo/utils.py 100.00% <0.00%> (ø)
satpy/demo/seviri_hrit.py 100.00% <0.00%> (ø)
satpy/tests/test_demo.py 99.29% <0.00%> (+0.33%) ⬆️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@coveralls coveralls commented Jul 22, 2021

Coverage Status

Coverage remained the same at 93.366% when pulling c244955 on djhoese:bugfix-viirs-sdr-multigran into 4d34625 on pytroll:main.

@mraspaud mraspaud left a comment


Thank you for putting in the effort to fix this bug and for taking the time to do some refactoring along the way; it is highly appreciated. This PR LGTM, I just have a couple of comments that may or may not need action.

Resolved review comments (now outdated) on satpy/tests/reader_tests/test_viirs_sdr.py and satpy/readers/viirs_sdr.py.
Comment on lines +235 to +236
# The user may have requested a different chunking scheme, but we need
# per granule chunking right now so factor chunks map 1:1 to data chunks
Member

Just a general comment here: I think the chunking scheme should depend on the data, not directly on what the user requested. So I'd like to see satpy accepting chunk sizes in MB, for example, instead of numbers of pixels, which can be wasteful. The developers of the file handler probably know best how to chunk the data within size constraints.
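
For reference, dask already exposes a byte-based chunk-size hint that points in this direction; a minimal illustration of that idea (not something this PR changes):

    import dask
    import dask.array as da
    import numpy as np

    # Target roughly 64 MiB per chunk whenever "auto" chunking is requested.
    dask.config.set({"array.chunk-size": "64MiB"})

    arr = np.zeros((20000, 20000), dtype=np.float32)
    darr = da.from_array(arr, chunks="auto")
    print(darr.chunksize)  # chunk shape chosen to stay near the 64 MiB target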

Member Author

I agree, but is that something I should try to fix in this PR? This reader uses the HDF5 utility handler so it would be difficult to do it cleanly for just this reader.

Member

Nothing you need to fix here. I was just reacting to the comment.

@djhoese djhoese changed the title from "Fix VIIRS SDR reader not handling truncated multi-granule files during scaling" to "Fix VIIRS SDR reader not handling multi-granule files with fewer scans" on Jul 23, 2021
@djhoese djhoese requested a review from mraspaud July 23, 2021 11:35
@mraspaud mraspaud left a comment


LGTM! Do you plan on addressing the codebeat issues or should I just merge this?

@djhoese djhoese commented Jul 23, 2021

Those are codebeat being confused. You merged my crefl optimization PR, so now codebeat thinks that this PR is introducing the issues I fixed in that PR. There are no new issues in this PR. Feel free to merge.

@mraspaud mraspaud merged commit 2abebf1 into pytroll:main Jul 23, 2021
@djhoese djhoese deleted the bugfix-viirs-sdr-multigran branch July 23, 2021 12:07