
BUG: stats.binned_statistic_dd binnumber is not usable #16195

Open
rebelot opened this issue May 17, 2022 · 6 comments
Labels
enhancement A new feature or improvement scipy.stats

rebelot commented May 17, 2022

Describe your issue.

When using binned_statistic_dd, one may want to retrieve the indices/values of the data points that fall into certain bins, depending on the value of the computed statistic.

After computing the statistic, one could be tempted to use np.nonzero((hist >= x) & (hist < y)) paired with np.ravel_multi_index (or other indexer functions) and assume that the returned indices correspond, in principle, to binnumbers. Accessing the data would then be as easy as data[bs.binnumber == indices].

However, this is not the case. If you have (N, M) bins, binnumbers are counted as if there were 2 extra bins per data dimension, i.e. (N + 2, M + 2), despite the shape of statistic being (N, M).

So, in order to "fix" the mapping between binnumber and statistic indices, one has to re-ravel the binnumbers (or their indices) as if they came from the larger array, shifted by one along each dimension.

I guess this is needed internally when one wants to reuse previous results; however, I think it would be better for binned_statistic_dd to encode/decode binnumbers so that they are consistent with the "user-side" data shapes.

Reproducing Code Example

```python
import numpy as np
from scipy.stats import binned_statistic_dd

rng = np.random.default_rng(42)
n1, n2 = 10, 10
data = rng.normal(size=(100, 2))
stat = binned_statistic_dd(data, np.zeros(len(data)), statistic="count", bins=(n1, n2))
hist = stat.statistic
binnr = stat.binnumber

try:
    assert len(data[binnr == np.argmax(hist)]) == hist.max(), "this should not fail"
except AssertionError as e:
    print(e)

# "Fix" the binnumbers: re-ravel them as if they came from an
# (n1 + 2, n2 + 2) array, shifting each index down by one per dimension.
i1, i2 = np.unravel_index(binnr, (n1 + 2, n2 + 2))
binnr_fix = np.ravel_multi_index((i1 - 1, i2 - 1), hist.shape)

assert len(data[binnr_fix == np.argmax(hist)]) == hist.max()

# Equivalently, shift the statistic indices up by one instead:
max_ij = np.nonzero(hist == hist.max())
assert len(data[binnr == np.ravel_multi_index((max_ij[0] + 1, max_ij[1] + 1), (n1 + 2, n2 + 2))]) == hist.max()
```


Error message

-

SciPy/NumPy/Python version information

1.7.3 1.21.5 sys.version_info(major=3, minor=9, micro=12, releaselevel='final', serial=0)

@rebelot rebelot added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label May 17, 2022
jakevdp (Member) commented May 17, 2022

Have you tried setting expand_binnumbers=True? This is False by default to keep the default representation more compact, but I think by setting it to True you'll get the behavior that you're looking for.
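For reference, a minimal sketch of what that looks like with the toy data from the reproducer above: with expand_binnumbers=True, binnumber has shape (D, N), holding one 1-based bin index per dimension for each sample, so selecting the samples in a given statistic bin only requires shifting the indices by one.

```python
import numpy as np
from scipy.stats import binned_statistic_dd

rng = np.random.default_rng(42)
n1, n2 = 10, 10
data = rng.normal(size=(100, 2))

# With expand_binnumbers=True, binnumber has shape (D, N): one 1-based
# bin index per dimension for every sample (0 and nbins + 1 would mark
# out-of-range samples).
stat = binned_statistic_dd(data, np.zeros(len(data)), statistic="count",
                           bins=(n1, n2), expand_binnumbers=True)
hist = stat.statistic
binnr = stat.binnumber  # shape (2, 100)

# Select the samples that fell into the fullest bin, adding one to the
# statistic indices to skip the implicit below-range bin.
i, j = np.unravel_index(np.argmax(hist), hist.shape)
mask = (binnr[0] == i + 1) & (binnr[1] == j + 1)
assert len(data[mask]) == hist.max()
```

The per-dimension offset of one still has to be applied by hand, but at least it no longer depends on the raveled (N + 2, M + 2) layout.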

jakevdp (Member) commented May 17, 2022

(Also xref #5449 where this was discussed previously)

rebelot (Author) commented May 17, 2022

Expanded binnumbers make more sense, but:

  • you cannot reuse previous results
  • checking which data fell in some bin(s) is not straightforward with the expanded representation if the use case is to generalize a calculation over M dimensions

Overall I think the documentation needs some adjustments and/or the returned values should be friendlier to reuse.
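For what it's worth, the decoding can be generalized to M dimensions with a small helper. This is a hypothetical sketch, not part of SciPy: it maps raw binnumbers onto flat indices into statistic and marks out-of-range samples with -1.

```python
import numpy as np

def decode_binnumber(binnumber, statistic_shape):
    """Map raw binnumbers to flat indices into `statistic`.

    Hypothetical helper: works for any number of dimensions; samples
    that fell outside all bins get index -1.
    """
    # binnumbers are raveled over a grid with one extra bin per side
    outer_shape = tuple(n + 2 for n in statistic_shape)
    idx = np.stack(np.unravel_index(binnumber, outer_shape)) - 1  # drop outer bins
    dims = np.array(statistic_shape)[:, None]
    in_range = np.all((idx >= 0) & (idx < dims), axis=0)
    flat = np.full(np.shape(binnumber), -1)
    flat[in_range] = np.ravel_multi_index(tuple(idx[:, in_range]), statistic_shape)
    return flat
```

A sample flagged -1 fell below or above the given range in at least one dimension; everything else indexes directly into statistic.ravel().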

mdhaber (Contributor) commented May 18, 2022

@rebelot Forgive me if you already know this, but I thought it was worth mentioning since I don't see it explicitly above.

It appears that you are not providing the range argument or specifying the bin edges, so all of your data is guaranteed to be within one of the automatically-determined bins (since they are defined by the data). But consider the case in which range or bin edges are specified, in which case some of the data may lie outside of all bins. This is why "binnumbers are counted as if there were 2 extra bins per data dimension" - for each dimension, there is one extra bin below and one extra bin above.

Do you agree that it is reasonable to add these extra bins in this case?

The designer decided to make the behavior consistent regardless of whether the extra bins are needed or not. Would you suggest something different?

For instance, would it be reasonable to add a boolean option that by default maintains the current behavior but, when set manually, refuses to add these extra bins? Then, you'd get the indexing you expected, and perhaps the binnumbers for out-of-bounds data could be nan.
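To make the out-of-range bookkeeping concrete, here is a small sketch (1-D for brevity) in which range is narrower than the data:

```python
import numpy as np
from scipy.stats import binned_statistic_dd

# One sample below the range, one inside, one above.
data = np.array([[-5.0], [0.5], [5.0]])
stat = binned_statistic_dd(data, np.zeros(3), statistic="count",
                           bins=(4,), range=[(0.0, 1.0)])

# statistic covers only the 4 requested bins, but binnumber counts over
# 4 + 2 = 6 bins: 0 means "below range" and 5 means "above range".
print(stat.statistic.shape)  # (4,)
print(stat.binnumber)
```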

@mdhaber mdhaber added enhancement A new feature or improvement and removed defect A clear bug or issue that prevents SciPy from being installed or used as expected labels May 18, 2022
rebelot (Author) commented May 18, 2022

Thanks for your reply.
My question is: what is the point of numbering the bins this way if the two extra bins are never returned in statistic?
If a combination of edges or range makes some data fall outside the bin edges, those points should not be assigned to any bin, and I think it's reasonable to assign bin number nan to such outliers.

mdhaber (Contributor) commented May 19, 2022

> if the two extra bins are never returned in statistic

I see. Yeah, that's surprising to me.
