
Fix numbagg aggregations #282

Merged: 25 commits into main, Nov 9, 2023
Conversation

@dcherian (Collaborator) commented on Nov 6, 2023

Closes #281

  1. Fix the numbagg version check.
  2. Handle fill_value for numbagg (see the sketch at the end of this description).
  3. Enable numbagg for count.

TODO:

  • add test to core.py or xarray.py
  • save seen_groups in factorize_
  • see if this generalizes to other engines
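
For item 2, here is a minimal NumPy sketch of the fill_value idea, under the assumption that the engine only returns results for groups actually present in the data; the helper name is made up and does not reflect flox's real internals.

```python
# Hedged illustration only: `reindex_to_expected` is a hypothetical helper,
# not a flox function. It shows the general fill_value handling idea.
import numpy as np

def reindex_to_expected(result, seen_groups, expected_ngroups, fill_value):
    # the engine returns one value per group that was actually seen;
    # expand to the full expected set, filling missing groups with fill_value
    out = np.full(expected_ngroups, fill_value, dtype=result.dtype)
    out[seen_groups] = result
    return out

# groups 0 and 2 were seen; group 1 had no members at all
partial = np.array([3.0, 7.0])
print(reindex_to_expected(partial, np.array([0, 2]), 3, fill_value=np.nan))
# [ 3. nan  7.]
```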

(One review discussion on flox/aggregate_numbagg.py, since resolved and outdated.)
@max-sixty (Contributor)

OK awesome, thanks a lot; let me know if there's anything you need on the numbagg side!

@dcherian (Collaborator, Author) commented on Nov 7, 2023

FYI I'm running into a major blocker with how Xarray handles groups with all-NaN entries and groups with no entries. I think you should just choose numbagg explicitly using the engine kwarg and move on.
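
For reference, picking numbagg explicitly looks roughly like this; a small usage sketch with toy data, assuming flox.groupby_reduce's (result, groups) return order.

```python
# Opt in to numbagg via the engine kwarg instead of relying on automatic
# engine selection. The data here is a toy example.
import numpy as np
import flox

array = np.array([1.0, 2.0, np.nan, 4.0])
by = np.array([0, 0, 1, 1])

# assuming the (result, groups) return order
result, groups = flox.groupby_reduce(array, by, func="nansum", engine="numbagg")
print(result, groups)  # [3. 4.] [0 1]
```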

dcherian changed the title from "Fix numbagg version check" to "Fix numbagg aggregations" on Nov 8, 2023
* upstream/main:
  Remove unused `Aggregation.aggregate` field. (#285)
  Actually optimize out multiple "nanlen" (#283)
* main:
  Set order='F' when raveling group_idx after broadcast (#286)
  Ignore benchmarks for codecov (#287)
@max-sixty (Contributor)

Thanks a lot for doing all these — it looks like difficult and finicky work.

Let me know what I can do on the numbagg side to make it less of a burden on your end, particularly anything that's bad or wrong in numbagg's behavior (I'm still happy to fix bad upstream conventions if it makes your job much easier).

@dcherian (Collaborator, Author) commented on Nov 8, 2023

I'm not sure you should; it's really an Xarray annoyance:

import pandas as pd
import numpy as np
from xarray import Dataset

# 6-hourly data with some all-NaN stretches
times = pd.date_range("2000-01-01", freq="6H", periods=10)
ds = Dataset(
    {
        "bar": ("time", [1, 2, 3, np.nan, np.nan, np.nan, 4, 5, np.nan, np.nan], {"meta": "data"}),
        "time": times,
    }
)

# resampling to 3-hourly doubles the number of windows; every other window is empty
expected_time = pd.date_range("2000-01-01", freq="3H", periods=19)
expected = ds.reindex(time=expected_time)

ds.resample(time="3H").sum().bar.data
# array([ 1., nan,  2., nan,  3., nan,  0., nan,  0., nan,  0., nan,  4., nan,  5., nan,  0., nan,  0.])

^ It's NaN when there are no observations in the window, and 0 if there are only NaNs in the window. Both numpy_groupies and numbagg would just give you all 0s (i.e. the identity element), which is sensible to me.

The Xarray behaviour is really an artifact of the fact that we accumulate np.nansum([np.nan]) in windows with all NaN observations, and then reindex with default fill_value=np.nan to the final time vector.
https://github.com/pydata/xarray/blob/feba6984aa914327408fee3c286dae15969d2a2f/xarray/core/groupby.py#L1435
I think the result above is an unintended consequence of the implementation.
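
Both ingredients are easy to see in isolation (plain NumPy/pandas, independent of xarray):

```python
# 1. nansum over an all-NaN window returns the identity, 0.0
# 2. reindexing to the finer time axis fills windows with no observations
#    using the default fill_value, NaN
import numpy as np
import pandas as pd

print(np.nansum([np.nan]))               # 0.0

s = pd.Series([1.0, 0.0], index=[0, 2])  # results for the windows that exist
print(s.reindex([0, 1, 2]).to_numpy())   # [ 1. nan  0.]
```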

@max-sixty (Contributor)

> The Xarray behaviour is really an artifact of the fact that we accumulate np.nansum([np.nan]) in windows with all NaN observations, and then reindex with default fill_value=np.nan to the final time vector. https://github.com/pydata/xarray/blob/feba6984aa914327408fee3c286dae15969d2a2f/xarray/core/groupby.py#L1435 I think the result above is an unintended consequence of the implementation.

Great, definitely agree. Possibly we could change that.

Pandas even does the arguably more logical thing!

ds.to_pandas().resample("3H").sum()
Out[5]:
                     bar
time
2000-01-01 00:00:00  1.0
2000-01-01 03:00:00  0.0
2000-01-01 06:00:00  2.0
2000-01-01 09:00:00  0.0
2000-01-01 12:00:00  3.0
2000-01-01 15:00:00  0.0
2000-01-01 18:00:00  0.0
2000-01-01 21:00:00  0.0
2000-01-02 00:00:00  0.0
2000-01-02 03:00:00  0.0
2000-01-02 06:00:00  0.0
2000-01-02 09:00:00  0.0
2000-01-02 12:00:00  4.0
2000-01-02 15:00:00  0.0
2000-01-02 18:00:00  5.0
2000-01-02 21:00:00  0.0
2000-01-03 00:00:00  0.0
2000-01-03 03:00:00  0.0
2000-01-03 06:00:00  0.0

dcherian merged commit 19db5b3 into main on Nov 9, 2023; 17 checks passed.
dcherian deleted the fix-numbagg-check branch on November 9, 2023.
@dcherian (Collaborator, Author) commented on Nov 9, 2023

Sweet, looks like we are actually using numbagg by default now.

I don't understand the first row for mean, but the rest are all functions I expect to be sent to numbagg:

| Before [273d319e] | After [54d57d73] | Ratio | Benchmark (Parameter)                                            |
|------------|-------------|-------|------------------------------------------------------------------|
| 297±0.7ms  | 149±0.4ms   | 0.5   | reduce.ChunkReduce2DAllAxes.time_reduce('mean', 'bins', None)    |
| 65.0±0.3ms | 28.1±0.3ms  | 0.43  | reduce.ChunkReduce2D.time_reduce('count', 'None', None)          |
| 144±0.5ms  | 45.0±0.4ms  | 0.31  | reduce.ChunkReduce2DAllAxes.time_reduce('nanmax', 'None', None)  |
| 137±0.4ms  | 41.6±0.4ms  | 0.3   | reduce.ChunkReduce2DAllAxes.time_reduce('count', 'None', None)   |
| 117±0.1ms  | 34.0±0.1ms  | 0.29  | reduce.ChunkReduce2D.time_reduce('count', 'bins', None)          |
| 144±0.6ms  | 42.0±0.3ms  | 0.29  | reduce.ChunkReduce2DAllAxes.time_reduce('nansum', 'None', None)  |
| 139±0.3ms  | 34.2±0.04ms | 0.25  | reduce.ChunkReduce2D.time_reduce('nansum', 'bins', None)         |
| 75.2±0.2ms | 17.1±0.2ms  | 0.23  | reduce.ChunkReduce2D.time_reduce('nansum', 'None', None)         |
| 271±1ms    | 58.2±0.1ms  | 0.22  | reduce.ChunkReduce2DAllAxes.time_reduce('count', 'bins', None)   |
| 205±0.3ms  | 44.5±0.4ms  | 0.22  | reduce.ChunkReduce2DAllAxes.time_reduce('nanmean', 'None', None) |
| 276±0.9ms  | 59.2±0.4ms  | 0.21  | reduce.ChunkReduce2DAllAxes.time_reduce('nansum', 'bins', None)  |
| 102±2ms    | 20.0±0.1ms  | 0.2   | reduce.ChunkReduce2D.time_reduce('nanmax', 'None', None)         |
| 337±0.8ms  | 59.7±0.4ms  | 0.18  | reduce.ChunkReduce2DAllAxes.time_reduce('nanmean', 'bins', None) |
| 208±0.3ms  | 35.1±0.1ms  | 0.17  | reduce.ChunkReduce2D.time_reduce('nanmean', 'bins', None)        |
| 153±0.4ms  | 23.8±0.1ms  | 0.16  | reduce.ChunkReduce2D.time_reduce('nanmean', 'None', None)        |
| 276±0.5ms  | 45.1±0.4ms  | 0.16  | reduce.ChunkReduce2DAllAxes.time_reduce('nanmax', 'bins', None)  |
| 205±0.3ms  | 19.0±0.1ms  | 0.09  | reduce.ChunkReduce2D.time_reduce('nanmax', 'bins', None)         |

@max-sixty (Contributor)

Nice!!

And I recently enabled parallel by default; it scales really well for multiple dims now if multiple cores are available.
