
BUG: stats.binned_statistic_dd binnumber is not usable #16195

Open
rebelot opened this issue May 17, 2022 · 6 comments
Labels
enhancement A new feature or improvement scipy.stats

rebelot commented May 17, 2022

Describe your issue.

When using binned_statistic_dd, one may want to retrieve the indices/values of the data points that fall into certain bins, depending on the value of the computed statistic.

After computing the statistic, one could be tempted to use np.nonzero((hist >= x) & (hist < y)) paired with np.ravel_multi_index (or other indexer functions) and assume that the returned indices correspond, in principle, to binnumbers. Accessing the data would then be as easy as data[bs.binnumber == indices].

However, this is not the case. If you have (N, M) bins, binnumbers are counted as if there were 2 extra bins per data dimension, i.e. (N + 2, M + 2), despite the shape of statistic being (N, M).

So, in order to "fix" the mapping between binnumber and statistic indices, one has to re-ravel the binnumbers (or their indices) as if they came from the larger array, shifted by one along each dimension.

I guess this is needed internally when one wants to reuse previous results; however, I think it would be better for binned_statistic_dd to encode/decode binnumbers so that they are consistent with the "user-side" data shapes.

Reproducing Code Example

```python
import numpy as np
from scipy.stats import binned_statistic_dd

rng = np.random.default_rng(42)
n1, n2 = 10, 10
data = rng.normal(size=(100, 2))
stat = binned_statistic_dd(data, np.zeros(len(data)), statistic="count", bins=(n1, n2))
hist = stat.statistic
binnr = stat.binnumber

try:
    assert len(data[binnr == np.argmax(hist)]) == hist.max(), "this should not fail"
except AssertionError as e:
    print(e)

# "Fix" the binnumbers: re-ravel them as if they came from an
# (n1 + 2, n2 + 2) array, shifting each index down by one per dimension.
i1, i2 = np.unravel_index(binnr, (n1 + 2, n2 + 2))
binnr_fix = np.ravel_multi_index((i1 - 1, i2 - 1), hist.shape)

assert len(data[binnr_fix == np.argmax(hist)]) == hist.max()

# Equivalently, shift the statistic indices up by one instead:
max_ij = np.nonzero(hist == hist.max())
assert len(data[binnr == np.ravel_multi_index((max_ij[0] + 1, max_ij[1] + 1), (n1 + 2, n2 + 2))]) == hist.max()
```


Error message

-

SciPy/NumPy/Python version information

1.7.3 1.21.5 sys.version_info(major=3, minor=9, micro=12, releaselevel='final', serial=0)

@rebelot rebelot added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label May 17, 2022
jakevdp (Member) commented May 17, 2022

Have you tried setting expand_binnumbers=True? This is False by default to keep the default representation more compact, but I think by setting it to True you'll get the behavior that you're looking for.
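For reference, a minimal sketch of what that looks like with the toy data from the reproducer above: with expand_binnumbers=True, binnumber has shape (D, N), holding one 1-based bin index per dimension for each sample, so selecting the samples in a given statistic bin only requires shifting the indices by one.

```python
import numpy as np
from scipy.stats import binned_statistic_dd

rng = np.random.default_rng(42)
n1, n2 = 10, 10
data = rng.normal(size=(100, 2))

# With expand_binnumbers=True, binnumber has shape (D, N): one 1-based
# bin index per dimension for every sample (0 and nbins + 1 would mark
# out-of-range samples).
stat = binned_statistic_dd(data, np.zeros(len(data)), statistic="count",
                           bins=(n1, n2), expand_binnumbers=True)
hist = stat.statistic
binnr = stat.binnumber  # shape (2, 100)

# Select the samples that fell into the fullest bin, adding one to the
# statistic indices to skip the implicit below-range bin.
i, j = np.unravel_index(np.argmax(hist), hist.shape)
mask = (binnr[0] == i + 1) & (binnr[1] == j + 1)
assert len(data[mask]) == hist.max()
```

The per-dimension offset of one still has to be applied by hand, but at least it no longer depends on the raveled (N + 2, M + 2) layout.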

jakevdp (Member) commented May 17, 2022

(Also xref #5449 where this was discussed previously)

rebelot (Author) commented May 17, 2022

Expanded binnumbers make more sense, but:

  • you cannot reuse previous results
  • checking which data fell in some bin(s) is not straightforward with the expanded representation if the use case is to generalize a calculation over M dimensions

Overall I think the documentation needs some adjustments and/or the returned values should be friendlier to reuse.
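For what it's worth, the decoding can be generalized to M dimensions with a small helper. This is a hypothetical sketch, not part of SciPy: it maps raw binnumbers onto flat indices into statistic and marks out-of-range samples with -1.

```python
import numpy as np

def decode_binnumber(binnumber, statistic_shape):
    """Map raw binnumbers to flat indices into `statistic`.

    Hypothetical helper: works for any number of dimensions; samples
    that fell outside all bins get index -1.
    """
    # binnumbers are raveled over a grid with one extra bin per side
    outer_shape = tuple(n + 2 for n in statistic_shape)
    idx = np.stack(np.unravel_index(binnumber, outer_shape)) - 1  # drop outer bins
    dims = np.array(statistic_shape)[:, None]
    in_range = np.all((idx >= 0) & (idx < dims), axis=0)
    flat = np.full(np.shape(binnumber), -1)
    flat[in_range] = np.ravel_multi_index(tuple(idx[:, in_range]), statistic_shape)
    return flat
```

A sample flagged -1 fell below or above the given range in at least one dimension; everything else indexes directly into statistic.ravel().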

mdhaber (Contributor) commented May 18, 2022

@rebelot Forgive me if you already know this, but I thought it was worth mentioning since I don't see it explicitly above.

It appears that you are not providing the range argument or specifying the bin edges, so all of your data is guaranteed to be within one of the automatically-determined bins (since they are defined by the data). But consider the case in which range or bin edges are specified, in which case some of the data may lie outside of all bins. This is why "binnumbers are counted as if there were 2 extra bins per data dimension" - for each dimension, there is one extra bin below and one extra bin above.

Do you agree that it is reasonable to add these extra bins in this case?

The designer decided to make the behavior consistent regardless of whether the extra bins are needed or not. Would you suggest something different?

For instance, would it be reasonable to add a boolean option that by default maintains the current behavior but, when set manually, refuses to add these extra bins? Then, you'd get the indexing you expected, and perhaps the binnumbers for out-of-bounds data could be nan.
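To make the out-of-range bookkeeping concrete, here is a small sketch (1-D for brevity) in which range is narrower than the data:

```python
import numpy as np
from scipy.stats import binned_statistic_dd

# One sample below the range, one inside, one above.
data = np.array([[-5.0], [0.5], [5.0]])
stat = binned_statistic_dd(data, np.zeros(3), statistic="count",
                           bins=(4,), range=[(0.0, 1.0)])

# statistic covers only the 4 requested bins, but binnumber counts over
# 4 + 2 = 6 bins: 0 means "below range" and 5 means "above range".
print(stat.statistic.shape)  # (4,)
print(stat.binnumber)
```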

@mdhaber mdhaber added enhancement A new feature or improvement and removed defect A clear bug or issue that prevents SciPy from being installed or used as expected labels May 18, 2022
rebelot (Author) commented May 18, 2022

Thanks for your reply.
My question is: what is the point of numbering the bins this way if the two extra bins are never returned in statistic?
If a combination of edges or range makes some data fall outside the bin edges, those points should not be assigned to any bin, and I think it's reasonable to assign bin number nan to such outliers.

mdhaber (Contributor) commented May 19, 2022

> if the two extra bins are never returned in statistic

I see. Yeah, that's surprising to me.
