Support specifying chunk sizes using labels (e.g. frequency string) #7559

Closed
dcherian opened this issue Feb 24, 2023 · 5 comments · Fixed by #9109

Comments

@dcherian
Contributor

dcherian commented Feb 24, 2023

Is your feature request related to a problem?

dask.dataframe supports repartitioning or rechunking using a frequency string (freq kwarg).

I think this would be a useful addition to .chunk. It would help with some groupby problems (as suggested in this comment) and generally make a few problems amenable to blockwise/map_blocks solutions.

Describe the solution you'd like

  1. One solution is to allow .chunk(lon=5, time="MS"). There is some ugliness in that this syntax mixes integer index values (lon=5) with a label-based frequency string (time="MS").
  2. So perhaps a second method chunk_by_labels would be useful, where chunk_by_labels(lon=5, time="MS") would rechunk the data so that a single chunk contains 5° of longitude points and a month of time. Alternatively, this could be .chunk(lon=5, time="MS", by="labels").

Describe alternatives you've considered

Have the user do this manually but that's kind of annoying, and a bit advanced.
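
For reference, here is a minimal sketch of what the manual approach might look like. It assumes a sorted, daily time axis and uses only the standard library; the helper name `month_chunk_sizes` and the sample dates are hypothetical and not part of xarray.

```python
from datetime import date, timedelta
from itertools import groupby


def month_chunk_sizes(times):
    """Return one chunk size per calendar month for sorted date-like values."""
    # Consecutive values that share (year, month) end up in the same chunk.
    return tuple(
        sum(1 for _ in group)
        for _, group in groupby(times, key=lambda t: (t.year, t.month))
    )


# Hypothetical daily time axis covering January through March 2001.
start = date(2001, 1, 1)
times = [start + timedelta(days=i) for i in range(90)]

sizes = month_chunk_sizes(times)
print(sizes)
```

A tuple of explicit chunk sizes like this is something .chunk already accepts for a dimension, e.g. ds.chunk(time=sizes), so the request amounts to having xarray compute the tuple from the labels.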

Additional context

No response

@TomNicholas
Member

The chunk_by_labels functionality seems quite useful even when not talking about times, so I would be 👍 for that kind of option.

On the API question, is there anywhere else in xarray where we have made some choice about how to let the user choose between specifying via indexes or labels? Apart from just .isel vs .sel, I mean.

@dcherian
Contributor Author

dcherian commented Feb 24, 2023

is there anywhere else in xarray where we have made some choice about how to let the user choose between specifying via indexes or labels?

coarsen vs groupby/groupby_bins/resample.

I explored this idea in this tutorial

I think it may be a fundamental concept for labelled array analysis. You need to pick whether you're working in "index space" like unlabelled arrays, or in "label space". This also came up in this issue where shift (and roll) operate in "index space".

Another example: Alignment is in "label space", broadcasting seems like "index space" (you just change shapes, but it does use dimension names to do that so maybe 50/50).

@dcherian
Contributor Author

Now I think the way to generalize is to eventually support Resampler objects.

I think overloading the existing .chunk is nicer than a new chunk_by method, but could be convinced otherwise.

I put up #9109 which allows specifying frequency strings.

@dcherian
Contributor Author

dcherian commented Jun 13, 2024

Responding to @shoyer's comment:

Are frequency strings unambiguous? Rechunking already supports memory sizes for Dask using strings.

The table here doesn't seem to overlap with MB, KB etc. but clearly this behaviour isn't tested. I'll fix that.

I see at least two ways to proceed with a more explicit API:

  1. A more explicit opt-in could be using Resampler objects, which we are pretty close to making public.
  2. Alternatively we could add the more explicit chunk_by(time="5ME").

@shoyer
Member

shoyer commented Jun 13, 2024

  • A more explicit opt-in could be using Resampler objects, which we are pretty close to making public.

I like this option.
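
For readers landing here later, this is a sketch of the Resampler-based option, which the commit log for #9109 indicates was the direction taken ("Switch to TimeResampler objects"). It assumes an xarray release that includes that change, with dask installed. The import path `xarray.groupers.TimeResampler` and the `freq` keyword come from released xarray rather than from this thread, and the dataset and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr
from xarray.groupers import TimeResampler  # assumed: an xarray release that includes #9109

# Hypothetical daily dataset covering January through March 2001.
time = pd.date_range("2001-01-01", "2001-03-31", freq="D")
ds = xr.Dataset({"a": ("time", np.arange(time.size))}, coords={"time": time})

# Rechunk in "label space": one chunk per calendar month (month-start frequency).
chunked = ds.chunk(time=TimeResampler(freq="MS"))

month_sizes = tuple(chunked.chunksizes["time"])
print(month_sizes)
```

Passing an object instead of a bare string keeps label-based requests distinct from the memory-size strings that rechunking already accepts.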

dcherian added a commit to dcherian/xarray that referenced this issue Jun 22, 2024
dcherian added a commit to dcherian/xarray that referenced this issue Jul 18, 2024
dcherian added a commit that referenced this issue Jul 29, 2024
* Support rechunking to a frequency.

Closes #7559

* Updates

* Fix typing

* More typing fixes.

* Switch to TimeResampler objects

* small fix

* Add whats-new

* More test

* fix docs

* fix

* Update doc/user-guide/dask.rst

Co-authored-by: Spencer Clark <spencerkclark@gmail.com>

---------

Co-authored-by: Spencer Clark <spencerkclark@gmail.com>