
Fix ABI readers using wrong dtype for resolution-based chunks #2627

Merged

djhoese merged 1 commit into pytroll:main on Nov 10, 2023

Conversation

@djhoese (Member) commented Nov 6, 2023

In-file data is 16-bit so our size has to be based on that. This is a continuation of #2621, where a user on Slack (Matthew Scutter) discovered that his performance got worse after that PR. This was because he was already using large chunk sizes, which made this bug (fixed in this PR) more apparent. As a summary of what's going on and how we compute chunks (a code sketch follows the list):

  1. Take current dask setting for number of bytes per chunk.
  2. Get number of elements in one "square" chunk in one dimension (N elements in an N * N chunk of B bytes for each element).
  3. Align this number of elements to a multiple of 4 * 226: 226 is the hardcoded on-disk chunk size in elements, and 4 is the highest resolution multiplier (4 == 500m => 2 for 1km => 1 for 2km).
  4. Scale this aligned number of elements to the specific resolution of the file type: divide by 2 for 1km data, by 4 for 2km data, or by 1 (a no-op) for 500m data (C02).
  5. Convert this number of elements back to an amount of memory by squaring it (a 2D square chunk) and then multiplying by the number of bytes for a single pixel. For our final data this is 4 bytes for a 32-bit float.
  6. Temporarily set the dask array.chunk-size setting to this number of bytes.
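
For concreteness, here is a minimal sketch of those six steps with this PR's fix already applied in step 5. The function name and constants are illustrative, not Satpy's actual internals (the real implementation is in satpy/readers/abi_base.py):

```python
import math

import dask.config
import dask.utils

ON_DISK_CHUNK = 226  # hardcoded on-disk chunk size in elements
MAX_RES_MULT = 4     # 500m is 4x finer than the 2km base resolution


def _chunk_bytes_for_resolution(res_multiplier: int) -> int:
    """Compute resolution-aligned chunk bytes (4 => 500m, 2 => 1km, 1 => 2km)."""
    # 1. current dask setting for number of bytes per chunk
    chunk_bytes = dask.utils.parse_bytes(dask.config.get("array.chunk-size"))
    # 2. elements in one dimension of a square chunk of 32-bit floats
    num_elements = int(math.sqrt(chunk_bytes / 4))
    # 3. align to 4 * 226 so chunks line up across all three resolutions
    align_to = MAX_RES_MULT * ON_DISK_CHUNK  # 904
    aligned = (num_elements // align_to) * align_to
    # 4. scale to this file type's resolution (a no-op for 500m)
    scaled = aligned // (MAX_RES_MULT // res_multiplier)
    # 5. square it (2D chunk) and convert back to bytes using the 2 bytes
    #    of the in-file 16-bit integers (this PR's fix; previously 4)
    return scaled * scaled * 2


# 6. temporarily apply the setting while opening the file
with dask.config.set({"array.chunk-size": _chunk_bytes_for_resolution(4)}):
    pass  # e.g. xr.open_dataset(...)
```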

This 4 in the second-to-last step is the problem, because the files actually contain 16-bit integers. If we do all our calculations for 32-bit floats and hand that number to dask when xarray's open_dataset is called, dask sees 16-bit integers and gives us chunks with roughly twice as many elements as intended; once those 16-bit integers are scaled to 32-bit floats with the scale factor and add offset in the file, the chunks are about 2x the size we wanted.
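
To see the mismatch in isolation, here is a small illustration (not Satpy code; the 21696-pixel 500m full-disk shape and the pre-fix byte target are only for demonstration) of dask chunking int16 data against a byte limit sized for float32:

```python
import numpy as np
from dask.array.core import normalize_chunks

# A 500m ABI full disk is 21696 x 21696 pixels of int16 on disk.
float32_target = 5424 * 5424 * 4  # 117679104 bytes, sized for 32-bit floats
chunks = normalize_chunks(
    "auto", shape=(21696, 21696), dtype=np.int16, limit=float32_target
)
# dask packs ~float32_target bytes of *int16* into each chunk, i.e. roughly
# twice the intended number of elements; after scaling to float32, each
# chunk occupies ~2x the memory we budgeted for.
print(chunks)
```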

This PR fixes that by changing the second-to-last step to multiply by 2 (2 bytes == 16-bit integers). This should also prevent the issue Matthew Scutter was seeing, where the number of chunks ended up unaligned depending on the rounding and dividing dask does. Here's what you see with dask's default of 128MB:

  1. The 4 * 226-aligned chunk size in bytes for the 500 meter band: 117679104.
  2. sqrt(117679104 / 2) for 16-bit integers gives 7670.6944 elements per dimension (not a whole number), which comes out to 33.94113 on-disk chunks to include. So dask needs to choose 33 or 34 on-disk chunks (it actually chooses 33).
  3. For the 1km band that would be 29419776 bytes => 3835.3472 elements per dimension => 16.9706 on-disk chunks (dask chooses 16).
  4. For the 2km band it chooses 8 on-disk chunks.

The above shows that because dask does not end up computing a nice round number of elements, we get inconsistent, non-resolution-aligned chunk sizes (33 on-disk chunks for 500m, 16 for 1km, 8 for 2km). The short arithmetic check below reproduces these numbers.
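
A quick arithmetic check (plain Python, independent of Satpy) that reproduces the numbers above and shows what the fix yields instead:

```python
import math

ON_DISK_CHUNK = 226
ALIGNED_500M = 5424  # elements per dimension after aligning to 4 * 226

for band, divisor in (("500m", 1), ("1km", 2), ("2km", 4)):
    elems = ALIGNED_500M // divisor
    for label, nbytes in (("pre-fix float32", 4), ("post-fix int16", 2)):
        target = elems * elems * nbytes
        # dask divides the byte target by the 2-byte in-file dtype
        per_dim = math.sqrt(target / 2)
        print(f"{band} {label}: {target} bytes -> {per_dim:.4f} elements/dim "
              f"-> {per_dim / ON_DISK_CHUNK:.4f} on-disk chunks")
```

Pre-fix, this prints the fractional 33.94 / 16.97 / 8.49 chunk counts described above; post-fix, every resolution lands on a whole number of on-disk chunks (24, 12, and 6 per dimension), keeping the resolutions aligned.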

If we still don't have consistent chunking after this fix, please tell me as soon as possible.

  • Closes #xxxx
  • Tests added
  • Fully documented
  • Add your name to AUTHORS.md if not there already

In-file data is 16-bit so our size has to be based on that

codecov bot commented Nov 6, 2023

Codecov Report

Merging #2627 (4e485e9) into main (ee63577) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #2627   +/-   ##
=======================================
  Coverage   95.18%   95.18%           
=======================================
  Files         354      354           
  Lines       51316    51318    +2     
=======================================
+ Hits        48846    48848    +2     
  Misses       2470     2470           
Flag            Coverage Δ
behaviourtests  4.24% <0.00%> (-0.01%) ⬇️
unittests       95.81% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files                                      Coverage Δ
satpy/readers/abi_base.py                  95.36% <100.00%> (ø)
satpy/tests/reader_tests/test_abi_l1b.py   98.96% <100.00%> (+0.01%) ⬆️

@coveralls

Pull Request Test Coverage Report for Build 6775731325

  • 11 of 11 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 95.763%

Totals Coverage Status
  • Change from base Build 6737851658: 0.0%
  • Covered Lines: 48974
  • Relevant Lines: 51141

💛 - Coveralls

@djhoese (Member, Author) commented Nov 10, 2023

This was tested by Matthew on Slack. Looks good to go. Merging...

@djhoese merged commit 25ef6f9 into pytroll:main on Nov 10, 2023 (18 of 19 checks passed)
@djhoese deleted the bugfix-abi-16bit-chunks branch on November 10, 2023 16:22