binned_statistic: incorrect binnumber results #5449

Closed
lzkelley opened this issue Nov 2, 2015 · 13 comments
Labels
Documentation, good first issue, scipy.stats

Comments

@lzkelley
Contributor

lzkelley commented Nov 2, 2015

Perhaps I'm misunderstanding the meaning of the binnumber results, but it looks like they're not being cleaned up properly.

My understanding is that binnumber should provide a mapping from the values to the bins they belong in --- but the results don't seem to reflect that. Consider the following example:

import numpy as np
import scipy as sp
import scipy.stats

a1 = [0.1, 0.1, 0.1, 0.6]
a2 = [2.1, 2.6, 2.1, 2.1]

b1 = [0.0, 0.5, 1.0]
b2 = [2.0, 2.5, 3.0]

stats = sp.stats.binned_statistic_2d(a1, a2, np.arange(len(a1)), 'count', bins=[b1, b2])

The resulting 'statistic' array looks good:

[[ 2.,  1.],
 [ 1.,  0.]]

But the 'binnumber' array seems almost meaningless:

[5, 6, 5, 9]

It gets correct that the first and third elements belong in the same bin; that the second element is one bin higher; and that the fourth element should be offset by a 'row'. The resulting statistic (results) array is cleaned up just before being returned:

    # Shape into a proper matrix
    result = result.reshape(np.sort(nbin))
    for i in np.arange(nbin.size):
        j = ni.argsort()[i]
        result = result.swapaxes(i, j)
        ni[i], ni[j] = ni[j], ni[i]

    # Remove outliers (indices 0 and -1 for each dimension).
    core = D * [slice(1, -1)]
    result = result[core]

Should the same reshaping/cleaning process be happening to the binnumber array?


A couple of other minor points:

  • The docstring for binned_statistic_2d says that x and y can have different lengths. I think they should be the same.
  • When the statistic being used is just 'count', then the values array isn't used. Should it be made optional then? i.e. perhaps the default behavior should be values=None, statistic=None and a check can be made like:
if statistic is None:
    if values is None:
        statistic = 'count'
    else:
        statistic = 'mean'
@jakevdp
Member

jakevdp commented Nov 5, 2015

I agree it's very cryptic, but for what it's worth the meaning of binnumber in the 2D case is this:

import numpy as np
from scipy.stats import binned_statistic_2d

a1 = [0.1, 0.1, 0.1, 0.6]
a2 = [2.1, 2.6, 2.1, 2.1]

b1 = [0.0, 0.5, 1.0]
b2 = [2.0, 2.5, 3.0]

stats = binned_statistic_2d(a1, a2, np.arange(len(a1)), 'count', bins=[b1,b2])
x_ind, y_ind = np.unravel_index(stats.binnumber,
                                (len(stats.x_edge) + 1, len(stats.y_edge) + 1))
print(x_ind)
# [1 1 1 2]
print(y_ind)
# [1 2 1 1]

The index along each dimension indicates where in the bin edges array the value should be inserted so as to keep the array sorted. This should probably be better documented.
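The "insertion position" rule can be reproduced with the standard library alone. The following is only a sketch of that rule, not SciPy's implementation: it uses `bisect_right` on the edge lists and then flattens the two indices in row-major order over a grid padded with one outlier bin on each side. It does not reproduce SciPy's special handling of values that fall exactly on the last edge.

```python
from bisect import bisect_right

a1 = [0.1, 0.1, 0.1, 0.6]
a2 = [2.1, 2.6, 2.1, 2.1]
b1 = [0.0, 0.5, 1.0]
b2 = [2.0, 2.5, 3.0]

# Per-dimension index: where each value would be inserted into the edges.
x_ind = [bisect_right(b1, v) for v in a1]
y_ind = [bisect_right(b2, v) for v in a2]

# Padded grid: (number of edges + 1) positions along each dimension.
n_cols = len(b2) + 1

# Row-major flattening of (x_ind, y_ind) over the padded grid.
binnumber = [xi * n_cols + yi for xi, yi in zip(x_ind, y_ind)]

print(x_ind)      # [1, 1, 1, 2]
print(y_ind)      # [1, 2, 1, 1]
print(binnumber)  # [5, 6, 5, 9]
```

The flattened values match the `binnumber` array reported at the top of this issue.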

Perhaps we should add another output field that gives these computed indices.

@lzkelley
Contributor Author

lzkelley commented Nov 7, 2015

Interesting, thanks for the explanation @jakevdp. Is there any benefit to this representation (i.e. as opposed to just having the unraveled indices)?

@jakevdp
Member

jakevdp commented Nov 8, 2015

The benefit is that it's compact; for D dimensions, this representation is O(N) in space, while the expanded representation is O(ND) in space. That said, it would probably be more useful to return the expanded form.
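To make the size difference concrete, here is a small sketch using the `binnumber` values from the example above. The padded shape `(4, 4)` is an assumption that follows from the three edges per dimension in that example; the two NumPy helpers convert between the compact and expanded forms without losing information.

```python
import numpy as np

# Compact form: one integer per sample, shape (N,).
binnumber = np.array([5, 6, 5, 9])

# Padded grid shape for the example: (len(edges) + 1) per dimension.
padded_shape = (4, 4)

# Expanded form: one row of indices per dimension, shape (D, N).
expanded = np.vstack(np.unravel_index(binnumber, padded_shape))

# The compact form can be recovered from the expanded one.
roundtrip = np.ravel_multi_index(tuple(expanded), padded_shape)

print(binnumber.shape)  # (4,)
print(expanded.shape)   # (2, 4)
```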

@ev-br
Member

ev-br commented Nov 12, 2015

@lzkelley would you be interested in contributing this into an Examples section of the docstring of binned_statistic_2d?

ev-br added the scipy.stats, Documentation, and good first issue labels Nov 12, 2015
@lzkelley
Contributor Author

@ev-br absolutely, I was also thinking about putting in a PR replacing the compact version with the unraveled version. I was hoping more people would chime in with their opinions (no one really did on the scipy-dev list) on whether the 'compact' or 'unraveled' was better. Incorporating an enhancement suggested in numpy #4718 (allowing for multiple values to be binned simultaneously) was also something I was looking at.

@jakevdp
Member

jakevdp commented Nov 12, 2015

How about adding an argument expand_binnumber=False to optionally compute & return the expanded version? The change would be backward compatible, would not use extra memory & CPUs unless the user wanted it, and would remind us to actually document what binnumber means 😄

@lzkelley
Contributor Author

The above-linked PR includes the expand_binnumber=False argument, and seems to be working properly. This is my first PR (with actual code-changes), so any comments/critiques would be quite welcome.
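For reference, a short usage sketch of the option on the original example. Note that in released SciPy the keyword is spelled `expand_binnumbers` (plural), as used in the later comments in this thread. With it enabled, `binnumber` comes back as one row of indices per dimension instead of a single flattened array.

```python
import numpy as np
from scipy.stats import binned_statistic_2d

a1 = [0.1, 0.1, 0.1, 0.6]
a2 = [2.1, 2.6, 2.1, 2.1]
b1 = [0.0, 0.5, 1.0]
b2 = [2.0, 2.5, 3.0]

res = binned_statistic_2d(a1, a2, np.arange(len(a1)), 'count',
                          bins=[b1, b2], expand_binnumbers=True)

# Row 0 holds the x-bin index of each sample, row 1 the y-bin index.
# Index 0 and index len(edges) are reserved for out-of-range values.
print(res.binnumber.shape)  # (2, 4)
print(res.binnumber)
```

Subtracting 1 from each row gives 0-based indices into `res.statistic` for samples that fall inside the bin ranges.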

@rgommers
Member

gh-5497 is merged, so closing this.

@pbranson

I think there may be a bug somewhere in this function. Sorry, I haven't been able to have a look, but the output from the following provides an example:

import numpy as np
from scipy.stats import binned_statistic_2d

xEdges = np.arange(79950.,500050.,100.)
yEdges = np.arange(7489950.,7860050.,100.)

x = 356643.378
y = 7813944.500

binned, xedges, yedges, binnums = binned_statistic_2d((x,), (y,), (0.5,), 'mean', bins=[xEdges,yEdges],expand_binnumbers=True)

print(binnums[0])
#3678
print(np.argmax(xedges>x))
#2767

Any idea why this may be?

@lzkelley
Contributor Author

To expand on @pbranson's issue (which should probably be opened as a new issue):

xEdges = np.arange(79950.,500050.,100.)
yEdges = np.arange(7489950.,7860050.,100.)
x = 356643.378
y = 7813944.500
binned, xedges, yedges, binnums = binned_statistic_2d((x,), (y,), (0.5,), 'mean', bins=[xEdges,yEdges],expand_binnumbers=True)

The binnums seem to be incorrect:

> binnums
array([[3678],
       [1291]])
> np.where(np.isfinite(binned))
(array([2766]), array([3239]))
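As an independent cross-check, the expected bin numbers can be computed directly from the edges with `np.searchsorted`. This is only a sketch of what the per-dimension indices should be under the "insertion position" convention described earlier in the thread; it does not call `binned_statistic_2d` and does not diagnose the cause of the discrepancy.

```python
import numpy as np

xEdges = np.arange(79950., 500050., 100.)
yEdges = np.arange(7489950., 7860050., 100.)
x = 356643.378
y = 7813944.500

# 1-based bin number = number of edges less than or equal to the value.
x_bin = int(np.searchsorted(xEdges, x, side='right'))
y_bin = int(np.searchsorted(yEdges, y, side='right'))

print(x_bin, y_bin)  # 2767 3240
```

These values are one more than the 0-based position `(2766, 3239)` of the only finite entry in `binned`, which is what the expanded `binnums` would be expected to report.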

@rgommers
Member

Does indeed look like a bug. Would you mind opening a new issue?

@pbranson

pbranson commented Feb 7, 2017 via email

@calquigs

calquigs commented Aug 5, 2020

Hi, I'm aware it's been 3+ years since this was closed. From my understanding, the binnumbers also include a border of bins that are outside the ranges of the given bin edges. So when given edges for a 2x2 set of bins, it returns a total of 4x4 bins:

0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

with anything within the range being assigned to 5, 6, 9, or 10.

My question: is there any way to tell binned_statistic_2d not to do that, so I can get more useful row-major ordered binnumbers? Or do I have to work out the math using the x and y bin indices to get the true bin number?
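The arithmetic asked about here is short. A minimal sketch follows, assuming the padded layout described above (one outlier bin on each side of every dimension, x as the slow axis). The helper name `inner_bin_number` is hypothetical and not part of SciPy.

```python
def inner_bin_number(binnumber, nx, ny):
    """Map a padded linear binnumber to a 0-based row-major index
    over the nx * ny in-range bins, or None for an outlier bin."""
    # The padded grid has ny + 2 positions along the fast (y) axis.
    x_ind, y_ind = divmod(binnumber, ny + 2)
    if not (1 <= x_ind <= nx and 1 <= y_ind <= ny):
        return None
    return (x_ind - 1) * ny + (y_ind - 1)

# The four in-range bins of the 2x2 example map to 0..3.
print([inner_bin_number(b, 2, 2) for b in (5, 6, 9, 10)])  # [0, 1, 2, 3]

# Border bins, such as 0 or 7, are reported as None.
print(inner_bin_number(0, 2, 2), inner_bin_number(7, 2, 2))  # None None
```

Alternatively, passing `expand_binnumbers=True` returns the per-dimension indices directly, which avoids the `divmod` step.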

6 participants