Rechunk derived #6516
Codecov Report
@@           Coverage Diff           @@
##             main    #6516    +/-  ##
========================================
+ Coverage   89.80%   89.87%   +0.07%
========================================
  Files          90       90
  Lines       23752    23927    +175
  Branches     4418     4463     +45
========================================
+ Hits        21331    21505    +174
+ Misses       1672     1670      -2
- Partials      749      752      +3
Update 2025-06-18: Thanks @trexfeathers @stephenworsley for offline discussions on this. I'm happy that the tests now give full code coverage, so I'm marking this ready for review.
First pass through, overall looks good. For now, just a question about the test coverage.
lazy_deps = [
    # Note: no attempt to make clever chunking choices here. If needed it
    # should get fixed later. Plus, single chunks keeps graph overhead small.
    dep if is_lazy_data(dep) else da.from_array(dep, chunks=-1)
It looks like this only guarantees a single chunk if the data was initially non-lazy. From what I can tell of the tests, it seems like you're only testing the case where a single chunk is given. I think it would be worth making sure there is testing for the case where lazy_deps contains chunked arrays.
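For reference, a quick standalone check of that line's behaviour (with `is_lazy_data` approximated here by a simple isinstance test, not the Iris helper itself) shows that an already-chunked lazy dep passes through with its chunking untouched:

```python
import dask.array as da
import numpy as np


def is_lazy_data(data):
    # Stand-in for the Iris helper; taken here to mean "is a dask array".
    return isinstance(data, da.Array)


real_dep = np.ones((70, 1000))                       # real (numpy) dependency
chunked_dep = da.ones((70, 1000), chunks=(10, 500))  # already-lazy, multi-chunk

lazy_deps = [
    dep if is_lazy_data(dep) else da.from_array(dep, chunks=-1)
    for dep in (real_dep, chunked_dep)
]
print(lazy_deps[0].chunks)  # ((70,), (1000,))         -- wrapped as one chunk
print(lazy_deps[1].chunks)  # ((10,) * 7, (500, 500))  -- existing chunks kept
```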
OK, I think the comment is really the problem here:
this "single chunks" statement really only applies to the real arrays which we wrap as lazy.
I will try and fix this ...
Background:
The initial calculation is supposed to produce a result that we can simply use, if its chunksize is OK, but we need it to be definitely lazy so that we can pre-check the chunksize before committing to do the calculation.
So we need to ensure that the initial 'test' calculation is lazy.
I did consider ensuring that just the first, or smallest, term was lazy, but I realised that in the calculation dask itself would then wrap any other real terms, using "auto" chunking by default, which is probably sub-optimal for our purposes.
If we were making our best single effort at producing a usable result array, we might logically use our "optimal_chunksize" scheme here when wrapping the real terms.
But in fact that is not a good approach, because the whole point is that you need to consider the terms (and especially their chunking) in alignment with all dimensions of the calculated result, not just in their own individual dimensions. That's effectively the whole problem here.
So, I chose to first wrap all real terms as single chunks, and then assess the chunksize of the calculated result.
Only if that simplistic approach produces a chunksize which is too large does the code then make a bigger effort to reconsider the chunking across all the terms, and rechunk everything in certain dimensions.
I thought it was probably "safer" not to do that co-optimisation unless it is clearly needed, as the results might be a bit sub-optimal.
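Just to illustrate that intended control flow in isolation, here is a standalone sketch (not the actual aux_factory code: `derive_lazily`, the 200 MiB threshold and the "one chunk per leading index" rechunk policy are all placeholder choices):

```python
import dask.array as da
import numpy as np


def derive_lazily(deps, calc, max_chunk_bytes=200 * 1024 * 1024):
    # 1) Ensure every dependency is lazy; wrap real arrays as single chunks.
    lazy_deps = [
        dep if isinstance(dep, da.Array) else da.from_array(dep, chunks=-1)
        for dep in deps
    ]
    # 2) Do the 'test' calculation lazily and measure its largest chunk.
    result = calc(*lazy_deps)
    chunk_bytes = np.prod([max(c) for c in result.chunks]) * result.dtype.itemsize
    if chunk_bytes <= max_chunk_bytes:
        # Simple case: the naive result chunking is acceptable, so use it as-is.
        return result
    # 3) Otherwise, reconsider chunking across *all* the terms and redo the
    #    calculation.  (Crude placeholder policy: one chunk per leading index.)
    rechunked_deps = [dep.rechunk({0: 1}) for dep in lazy_deps]
    return calc(*rechunked_deps)


# Example: three small single-chunk terms whose broadcast product is huge.
A = np.ones((100, 70, 1, 1), dtype=np.float32)
B = np.ones((1, 70, 1000, 1), dtype=np.float32)
C = np.ones((1, 1, 1000, 500), dtype=np.float32)
derived = derive_lazily([A, B, C], lambda a, b, c: a * b * c)
print(derived.chunks[0])  # 100 chunks of size 1 along T, not one ~14 GB chunk
```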
pts = np.ones(dims, dtype=np.int32)
bds = np.stack([pts - 0.5, pts + 0.5], axis=-1)
# Make them lazy with a single chunk in both cases
pts, bds = (da.from_array(x, chunks=-1) for x in (pts, bds))
As mentioned above, it looks like you're only testing the case where you're deriving from single chunked arrays. It may be worth checking what happens in the multi-chunk case.
OK I'll look into this.
Watch this space ...
In fact, for a proper stress test you could try introducing a chunked coordinate, which would end up getting rechunked (by line 154 of aux_factory.py). And for good measure, you could set it up so that the chunks were also of uneven size (say, due to slicing), so that the rechunked chunks don't quite line up with the original chunks.
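For reference, a standalone sketch of one cheap way to get such unevenly-sized chunks via slicing (plain dask/numpy, not tied to the Iris test code):

```python
import dask.array as da
import numpy as np

# A regularly-chunked lazy coordinate array ...
full = da.from_array(np.arange(12, dtype=np.float32), chunks=5)
print(full.chunks)    # ((5, 5, 2),)

# ... sliced, so the chunk boundaries are no longer evenly spaced.
uneven = full[1:]
print(uneven.chunks)  # ((4, 5, 2),)

# Rechunking to a regular size then produces chunks that don't line up
# with the originals, exercising the rechunk path discussed above.
print(uneven.rechunk(6).chunks)  # ((6, 5),)
```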
A couple more comments now that I've read through this more thoroughly. I think test coverage is still basically my main concern here.
# Create simple points + bounds arrays
pts = np.ones(dims, dtype=np.int32)
bds = np.stack([pts - 0.5, pts + 0.5], axis=-1)
# Make them lazy with a single chunk in both cases
It's not clear what "both cases" means here, since there are 3 coordinates being made.
if new_chunks != dep_chunks:
    # When dep chunksize needs to change, produce a rechunked version.
    if is_lazy_data(original_dep):
        dep = original_dep.rechunk(new_chunks)
It might be worthwhile ensuring that this line gets test coverage.
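As a standalone illustration of what that branch does (not the actual unit test; the shapes and `new_chunks` value here are arbitrary), an already-lazy dependency would get rechunked rather than re-wrapped:

```python
import dask.array as da
import numpy as np

# An already-lazy dependency, carrying its own single-chunk chunking.
original_dep = da.from_array(np.ones((70, 100, 50), dtype=np.float32), chunks=-1)
new_chunks = (10, 100, 50)  # some different chunking decided for the result

if new_chunks != original_dep.chunksize:
    # The branch under discussion: a lazy dep is rechunked, not re-wrapped.
    dep = original_dep.rechunk(new_chunks)

print(dep.chunks)  # ((10,) * 7, (100,), (50,))
```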
Closes #6404
Automatic rechunking of derived coordinates
Investigation of the #6404 problem reveals that the points/bounds arrays of our derived (aka factory) coords are mostly single chunks, which can thus be very large.
This is because they tend to be a broadcast product of several simple low-dimensional coords (dim or aux), each spanning only a dim or two, which are themselves quite small and so tend to all be single chunks.
When these are broadcast together, the result then tends to be one massive chunk, which can blow memory.
For example:
a result formed like A * B * C,
where A, B and C each span only one or two of the dims (T, Z, Y, X),
and are all relatively small, so can each be a single chunk.
Say NT, NZ, NY, NX = 100, 70, 1000, 500.
Then the result has 100 * 70 * 1000 * 500 = 3.5G points.
If the element size is a typical 4 bytes and the Dask chunksize a typical 200 MiB, then we expect chunks of ~50M array elements.
An array of this size, loaded from an input netCDF file, might get chunked as (1, 70, 1000, 500), i.e. ~35M elements or ~140 MB per chunk.
But our derived coord will have the whole array, 3,500M elements --> ~14 GB, in a single chunk.
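To make that arithmetic concrete, here is a small standalone dask sketch with the example numbers above; the particular dims assigned to A, B and C are illustrative assumptions, not taken from a real factory:

```python
import dask.array as da
import numpy as np

NT, NZ, NY, NX = 100, 70, 1000, 500

# Three small terms, each spanning only some of the dims, each a single chunk.
A = da.from_array(np.ones((NT, NZ, 1, 1), dtype=np.float32), chunks=-1)
B = da.from_array(np.ones((1, NZ, NY, 1), dtype=np.float32), chunks=-1)
C = da.from_array(np.ones((1, 1, NY, NX), dtype=np.float32), chunks=-1)

result = A * B * C
print(result.shape)         # (100, 70, 1000, 500)              : 3.5G points
print(result.chunks)        # ((100,), (70,), (1000,), (500,))  : one single chunk
print(result.nbytes / 1e9)  # ~14.0 GB, all in that single chunk
```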
It seems likely that this problem has been noticed more recently because, since #5369, we now have derived coordinates which are time-dependent, and that now multiplies up the total size where before it did not.
However, even before this we were potentially multiplying e.g. the size of a field by the number of model levels, which already led to single-chunk arrays larger than ideal. Typical numbers: 70 * 1024 * 768 * 4 bytes ≈ 220 MB, already reaching the standard Dask chunksize of 200 MiB (so hi-res or double-resolution fields will clearly exceed it).
Todo: