Rechunk derived #6516
Conversation
Codecov Report
Attention: Patch coverage is
@@            Coverage Diff             @@
##             main    #6516      +/-  ##
=========================================
+ Coverage   89.80%   89.88%   +0.07%
=========================================
  Files          90       90
  Lines       23752    23945     +193
  Branches     4418     4467      +49
=========================================
+ Hits        21331    21522     +191
+ Misses       1672     1670       -2
- Partials      749      753       +4
Update 2025-06-18
Thanks @trexfeathers @stephenworsley for offline discussions on this. I'm happy that tests now give full code coverage + I'm marking this ready for review.
⏱️ Performance Benchmark Report: c617e86
Performance shifts
Full benchmark results
Generated by GHA run
First pass through, overall looks good. For now, just a question about the test coverage.
A couple more comments now that I've read through this more thoroughly. I think test coverage is still basically my main concern here.
Hi @stephenworsley, I've finally tidied up + pushed my updates.
Looking good, just a minor suggestion and a bunch of pondering about the subtleties of dask rechunking.
Thanks @stephenworsley, I think the latest suggestion definitely enhances this -- it can now preserve pre-existing chunking better, so I think that puts this in a better place. Please re-review (again!).
Looking good, I'm a lot happier with the dask behaviour now. Just some minor docs fixes and I think this is good to merge.
Looks good!
Closes #6404
Automatic rechunking of derived coordinates
Investigation of the #6404 problem reveals that the points/bounds arrays of our derived (aka factory) coords are mostly single chunks, and so can be very large.
This is because they are effectively a broadcast product of several simple component coords (dim or aux), each spanning a different dim or two; those components are themselves quite small and so tend to all be single chunks.
When these are broadcast together, the result then tends to be one massive chunk, which can blow memory.
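As a rough illustration in plain dask (the names and shapes here are hypothetical, not taken from the PR code):

```python
# Hedged illustration only -- plain dask, not Iris internals.  Each component
# coord is a small, single-chunk array spanning only some dims, but their
# broadcast product covers the full outer product of those dims in one chunk.
import dask.array as da

# Hypothetical hybrid-height-style components over dims (Z, Y, X):
sigma = da.ones((70, 1, 1), dtype="float32")          # (Z,) term, ~0.3 KB
orography = da.ones((1, 1000, 500), dtype="float32")  # (Y, X) term, ~2 MB

print(sigma.numblocks, orography.numblocks)  # (1, 1, 1) (1, 1, 1) -- single chunks
product = sigma * orography                  # broadcast product spans (Z, Y, X)
print(product.numblocks)                     # still (1, 1, 1) ...
print(product.nbytes / 1e6)                  # ... one chunk of ~140 MB
```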
For example:
a result formed like A * B * C,
where A, B and C each span only one or two of the dims (T, Z, Y, X),
and so are all relatively small and can be single chunks.
Say NT, NZ, NY, NX = 100, 70, 1000, 500.
Then the result is 100 * 70 * 1000 * 500 = 3.5 Gpoints.
If the element size is a typical 4 bytes, and the dask chunksize a typical 200 MB, then we expect chunks of ~50M array elements.
An array of this size, loaded from an input netCDF file, might get chunked as (1, 70, 1000, 500), i.e. ~35M elements or 140 MB per chunk.
But our derived coord holds the whole array, ~3500 Melements, i.e. ~14 GB, in a single chunk.
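For concreteness, the same arithmetic as a dask sketch. This only illustrates the effect of rechunking the broadcast product; it is not the actual implementation added by this PR, which works out suitable target chunks automatically inside Iris:

```python
# Hedged sketch, not the Iris implementation: rechunking one component along T
# before the broadcast gives the product netCDF-style chunks instead of a
# single ~14 GB block.
import dask.array as da

# Components over dims (T, Z, Y, X), with NT, NZ, NY, NX = 100, 70, 1000, 500.
a = da.ones((100, 70, 1, 1), dtype="float32")    # e.g. a (T, Z) term
b = da.ones((1, 70, 1, 1), dtype="float32")      # e.g. a (Z,) term
c = da.ones((1, 1, 1000, 500), dtype="float32")  # e.g. a (Y, X) term

naive = a * b * c
print(naive.numblocks)      # (1, 1, 1, 1): 3.5e9 points, ~14 GB in one chunk

rechunked = a.rechunk((1, 70, 1, 1)) * b * c
print(rechunked.numblocks)  # (100, 1, 1, 1)
print(rechunked.chunksize)  # (1, 70, 1000, 500): ~35M points, ~140 MB per chunk
```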
It seems likely that this problem has been noticed more recently because, since #5369, we now have derived coordinates which are time-dependent, so that is multiplying up the total size where before it did not.
However, even before this, we were potentially multiplying e.g. the size of a field by the number of model levels, which already led to single-chunk arrays larger than ideal. Typical numbers: 70 * 1024 * 768 * 4 bytes ≈ 220 MB, already reaching the typical Dask chunksize of 200 MB (so hi-res or double-resolution fields will clearly exceed it).
Todo: