
Fix numbagg aggregations #282

Merged: 25 commits into main, Nov 9, 2023
Conversation

@dcherian (Collaborator) commented on Nov 6, 2023

Closes #281

  1. Fix the numbagg version check.
  2. Handle fill_value for numbagg (see the sketch at the end of this description).
  3. Enable numbagg for count.

TODO:

  • add test to core.py or xarray.py
  • save seen_groups in factorize_
  • see if this generalizes to other engines
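
For item 2, here is a minimal NumPy sketch of the fill_value idea, under the assumption that the engine only returns results for groups actually present in the data; the helper name is made up and does not reflect flox's real internals.

```python
# Hedged illustration only: `reindex_to_expected` is a hypothetical helper,
# not a flox function. It shows the general fill_value handling idea.
import numpy as np

def reindex_to_expected(result, seen_groups, expected_ngroups, fill_value):
    # the engine returns one value per group that was actually seen;
    # expand to the full expected set, filling missing groups with fill_value
    out = np.full(expected_ngroups, fill_value, dtype=result.dtype)
    out[seen_groups] = result
    return out

# groups 0 and 2 were seen; group 1 had no members at all
partial = np.array([3.0, 7.0])
print(reindex_to_expected(partial, np.array([0, 2]), 3, fill_value=np.nan))
# [ 3. nan  7.]
```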

(One review discussion on flox/aggregate_numbagg.py, since resolved and outdated.)
@max-sixty (Contributor)

OK awesome, thanks a lot; let me know if there's anything you need on the numbagg side!

@dcherian (Collaborator, Author) commented on Nov 7, 2023

FYI I'm running into a major blocker with how Xarray handles groups with all-NaN entries and groups with no entries. I think you should just choose numbagg explicitly using the engine kwarg and move on.
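
For reference, picking numbagg explicitly looks roughly like this; a small usage sketch with toy data, assuming flox.groupby_reduce's (result, groups) return order.

```python
# Opt in to numbagg via the engine kwarg instead of relying on automatic
# engine selection. The data here is a toy example.
import numpy as np
import flox

array = np.array([1.0, 2.0, np.nan, 4.0])
by = np.array([0, 0, 1, 1])

# assuming the (result, groups) return order
result, groups = flox.groupby_reduce(array, by, func="nansum", engine="numbagg")
print(result, groups)  # [3. 4.] [0 1]
```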

dcherian changed the title from "Fix numbagg version check" to "Fix numbagg aggregations" on Nov 8, 2023
* upstream/main:
  Remove unused `Aggregation.aggregate` field. (#285)
  Actually optimize out multiple "nanlen" (#283)
* main:
  Set order='F' when raveling group_idx after broadcast (#286)
  Ignore benchmarks for codecov (#287)
@max-sixty (Contributor)

Thanks a lot for doing all these — it looks like difficult and finicky work.

Let me know what I can do on the numbagg side to make it less of a burden on your end, particularly anything that's bad or wrong in numbagg's behavior (I'm still happy to fix bad upstream conventions if it makes your job much easier).

@dcherian (Collaborator, Author) commented on Nov 8, 2023

I'm not sure you should; it's really an Xarray annoyance:

import pandas as pd
import numpy as np
from xarray import Dataset

# 6-hourly data with some all-NaN stretches
times = pd.date_range("2000-01-01", freq="6H", periods=10)
ds = Dataset(
    {
        "bar": ("time", [1, 2, 3, np.nan, np.nan, np.nan, 4, 5, np.nan, np.nan], {"meta": "data"}),
        "time": times,
    }
)

# resampling to 3-hourly doubles the number of windows; every other window is empty
expected_time = pd.date_range("2000-01-01", freq="3H", periods=19)
expected = ds.reindex(time=expected_time)

ds.resample(time="3H").sum().bar.data
# array([ 1., nan,  2., nan,  3., nan,  0., nan,  0., nan,  0., nan,  4., nan,  5., nan,  0., nan,  0.])

^ It's NaN when there are no observations in the window, and 0 if there are only NaNs in the window. Both numpy_groupies and numbagg would just give you all 0s (i.e. the identity element), which is sensible to me.

The Xarray behaviour is really an artifact of the fact that we accumulate np.nansum([np.nan]) in windows with all NaN observations, and then reindex with default fill_value=np.nan to the final time vector.
https://github.com/pydata/xarray/blob/feba6984aa914327408fee3c286dae15969d2a2f/xarray/core/groupby.py#L1435
I think the result above is an unintended consequence of the implementation.
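
Both ingredients are easy to see in isolation (plain NumPy/pandas, independent of xarray):

```python
# 1. nansum over an all-NaN window returns the identity, 0.0
# 2. reindexing to the finer time axis fills windows with no observations
#    using the default fill_value, NaN
import numpy as np
import pandas as pd

print(np.nansum([np.nan]))               # 0.0

s = pd.Series([1.0, 0.0], index=[0, 2])  # results for the windows that exist
print(s.reindex([0, 1, 2]).to_numpy())   # [ 1. nan  0.]
```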

@max-sixty (Contributor)

> The Xarray behaviour is really an artifact of the fact that we accumulate np.nansum([np.nan]) in windows with all NaN observations, and then reindex with default fill_value=np.nan to the final time vector. https://github.com/pydata/xarray/blob/feba6984aa914327408fee3c286dae15969d2a2f/xarray/core/groupby.py#L1435 I think the result above is an unintended consequence of the implementation.

Great, definitely agree. Possibly we could change that.

Pandas even does the arguably more logical thing!

ds.to_pandas().resample("3H").sum()
Out[5]:
                     bar
time
2000-01-01 00:00:00  1.0
2000-01-01 03:00:00  0.0
2000-01-01 06:00:00  2.0
2000-01-01 09:00:00  0.0
2000-01-01 12:00:00  3.0
2000-01-01 15:00:00  0.0
2000-01-01 18:00:00  0.0
2000-01-01 21:00:00  0.0
2000-01-02 00:00:00  0.0
2000-01-02 03:00:00  0.0
2000-01-02 06:00:00  0.0
2000-01-02 09:00:00  0.0
2000-01-02 12:00:00  4.0
2000-01-02 15:00:00  0.0
2000-01-02 18:00:00  5.0
2000-01-02 21:00:00  0.0
2000-01-03 00:00:00  0.0
2000-01-03 03:00:00  0.0
2000-01-03 06:00:00  0.0

dcherian merged commit 19db5b3 into main on Nov 9, 2023; 17 checks passed.
dcherian deleted the fix-numbagg-check branch on November 9, 2023.
@dcherian (Collaborator, Author) commented on Nov 9, 2023

Sweet, looks like we are actually using numbagg by default now.

I don't understand the first row for mean, but the rest are all functions I expect to be sent to numbagg:

| Before [273d319e] | After [54d57d73] | Ratio | Benchmark (Parameter)                                            |
|------------|-------------|-------|------------------------------------------------------------------|
| 297±0.7ms  | 149±0.4ms   | 0.5   | reduce.ChunkReduce2DAllAxes.time_reduce('mean', 'bins', None)    |
| 65.0±0.3ms | 28.1±0.3ms  | 0.43  | reduce.ChunkReduce2D.time_reduce('count', 'None', None)          |
| 144±0.5ms  | 45.0±0.4ms  | 0.31  | reduce.ChunkReduce2DAllAxes.time_reduce('nanmax', 'None', None)  |
| 137±0.4ms  | 41.6±0.4ms  | 0.3   | reduce.ChunkReduce2DAllAxes.time_reduce('count', 'None', None)   |
| 117±0.1ms  | 34.0±0.1ms  | 0.29  | reduce.ChunkReduce2D.time_reduce('count', 'bins', None)          |
| 144±0.6ms  | 42.0±0.3ms  | 0.29  | reduce.ChunkReduce2DAllAxes.time_reduce('nansum', 'None', None)  |
| 139±0.3ms  | 34.2±0.04ms | 0.25  | reduce.ChunkReduce2D.time_reduce('nansum', 'bins', None)         |
| 75.2±0.2ms | 17.1±0.2ms  | 0.23  | reduce.ChunkReduce2D.time_reduce('nansum', 'None', None)         |
| 271±1ms    | 58.2±0.1ms  | 0.22  | reduce.ChunkReduce2DAllAxes.time_reduce('count', 'bins', None)   |
| 205±0.3ms  | 44.5±0.4ms  | 0.22  | reduce.ChunkReduce2DAllAxes.time_reduce('nanmean', 'None', None) |
| 276±0.9ms  | 59.2±0.4ms  | 0.21  | reduce.ChunkReduce2DAllAxes.time_reduce('nansum', 'bins', None)  |
| 102±2ms    | 20.0±0.1ms  | 0.2   | reduce.ChunkReduce2D.time_reduce('nanmax', 'None', None)         |
| 337±0.8ms  | 59.7±0.4ms  | 0.18  | reduce.ChunkReduce2DAllAxes.time_reduce('nanmean', 'bins', None) |
| 208±0.3ms  | 35.1±0.1ms  | 0.17  | reduce.ChunkReduce2D.time_reduce('nanmean', 'bins', None)        |
| 153±0.4ms  | 23.8±0.1ms  | 0.16  | reduce.ChunkReduce2D.time_reduce('nanmean', 'None', None)        |
| 276±0.5ms  | 45.1±0.4ms  | 0.16  | reduce.ChunkReduce2DAllAxes.time_reduce('nanmax', 'bins', None)  |
| 205±0.3ms  | 19.0±0.1ms  | 0.09  | reduce.ChunkReduce2D.time_reduce('nanmax', 'bins', None)         |

@max-sixty (Contributor)

Nice!!

And I recently enabled parallel by default; it scales really well for multiple dims now if multiple cores are available.
