binned_statistic: incorrect binnumber results #5449

lzkelley · 2015-11-02T20:59:32Z

Perhaps I'm misunderstanding the meaning of the binnumber results, but it looks like they're not being cleaned up properly.

My understanding is that binnumber should provide a mapping from the values to the bins they belong in --- but the results don't seem to reflect that. Consider the following example:

import numpy as np
import scipy as sp
import scipy.stats  # makes sp.stats available

a1 = [0.1, 0.1, 0.1, 0.6]
a2 = [2.1, 2.6, 2.1, 2.1]

b1 = [0.0, 0.5, 1.0]
b2 = [2.0, 2.5, 3.0]

stats = sp.stats.binned_statistic_2d(a1, a2, np.arange(len(a1)), 'count', bins=[b1, b2])

The resulting 'statistic' array looks good:

[[ 2.,  1.],
 [ 1.,  0.]]

But the 'binnumber' array seems almost meaningless:

[5, 6, 5, 9]

It gets correct that the first and third elements belong in the same bin; that the second element is one bin higher; and that the fourth element should be offset by a 'row'. The resulting statistic (results) array is cleaned up just before being returned:

    # Shape into a proper matrix
    result = result.reshape(np.sort(nbin))
    for i in np.arange(nbin.size):
        j = ni.argsort()[i]
        result = result.swapaxes(i, j)
        ni[i], ni[j] = ni[j], ni[i]

    # Remove outliers (indices 0 and -1 for each dimension).
    core = D * [slice(1, -1)]
    result = result[core]

Should the same reshaping/cleaning process be happening to the binnumber array?

A couple of other minor points:

The docstring for binned_statistic_2d says that x and y can have different lengths. I think they should be the same.
When the statistic being used is just 'count', then the values array isn't used. Should it be made optional then? i.e. perhaps the default behavior should be values=None, statistic=None and a check can be made like:

if statistic is None:
    if values is None:
        statistic = 'count'
    else:
        statistic = 'mean'


jakevdp · 2015-11-05T20:41:10Z

I agree it's very cryptic, but for what it's worth the meaning of binnumber in the 2D case is this:

import numpy as np
from scipy.stats import binned_statistic_2d

a1 = [0.1, 0.1, 0.1, 0.6]
a2 = [2.1, 2.6, 2.1, 2.1]

b1 = [0.0, 0.5, 1.0]
b2 = [2.0, 2.5, 3.0]

stats = binned_statistic_2d(a1, a2, np.arange(len(a1)), 'count', bins=[b1,b2])
x_ind, y_ind = np.unravel_index(stats.binnumber,
                                (len(stats.x_edge) + 1, len(stats.y_edge) + 1))
print(x_ind)
# [1 1 1 2]
print(y_ind)
# [1 2 1 1]

The index along each dimension indicates where in the bin edges array the value should be inserted so as to keep the array sorted. This should probably be better documented.

Perhaps we should add another output field that gives these computed indices.
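As a rough cross-check of the description above, the same numbers can be reproduced with NumPy alone. This is only a sketch: it assumes every value lies strictly inside the outermost edges (values sitting exactly on the last edge may be treated differently by SciPy), and the variable names are illustrative.

```python
import numpy as np

a1 = [0.1, 0.1, 0.1, 0.6]
a2 = [2.1, 2.6, 2.1, 2.1]
b1 = [0.0, 0.5, 1.0]
b2 = [2.0, 2.5, 3.0]

# Per-dimension insertion index into the edge arrays.
# Index 0 and index len(edges) are the "outlier" slots on either side.
x_ind = np.digitize(a1, b1)
y_ind = np.digitize(a2, b2)

# Each dimension has len(edges) + 1 slots: the real bins plus two outlier slots.
shape = (len(b1) + 1, len(b2) + 1)

# Flatten the index pairs in row-major order to get the compact form.
binnumber = np.ravel_multi_index((x_ind, y_ind), shape)
print(binnumber)  # [5 6 5 9]
```

The result matches the binnumber array reported in the original post, which suggests the values are linear indices into the padded 4x4 grid rather than into the 2x2 statistic array.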

lzkelley · 2015-11-07T21:32:53Z

Interesting, thanks for the explanation @jakevdp. Is there any benefit to this representation (i.e. as opposed to just having the unraveled indices)?

jakevdp · 2015-11-08T03:13:38Z

The benefit is that it's compact; for D dimensions, this representation is O[N] in space, while the expanded representation is O[ND] in space. That said, it would probably be more useful to return the expanded form.

ev-br · 2015-11-12T13:19:41Z

@lzkelley would you be interested in contributing this into an Examples section of the docstring of binned_statistic_2d?

lzkelley · 2015-11-12T15:22:12Z

@ev-br absolutely, I was also thinking about putting in a PR replacing the compact version with the unraveled version. I was hoping more people would chime in with their opinions (no one really did on the scipy-dev list) on whether the 'compact' or 'unraveled' was better. Incorporating an enhancement suggested in numpy #4718 (allowing for multiple values to be binned simultaneously) was also something I was looking at.

jakevdp · 2015-11-12T19:20:53Z

How about adding an argument expand_binnumber=False to optionally compute & return the expanded version? The change would be backward compatible, would not use extra memory & CPUs unless the user wanted it, and would remind us to actually document what binnumber means 😄

lzkelley · 2015-11-13T21:49:21Z

The above-linked PR includes the expand_binnumber=False argument, and seems to be working properly. This is my first PR (with actual code-changes), so any comments/critiques would be quite welcome.

rgommers · 2015-11-24T19:55:48Z

gh-5497 is merged, so closing this.

pbranson · 2017-01-25T06:53:46Z

I think there may be a bug somewhere in this function. Sorry, I haven't been able to have a look, but the output from the following provides an example:

import numpy as np
from scipy.stats import binned_statistic_2d

xEdges = np.arange(79950., 500050., 100.)
yEdges = np.arange(7489950., 7860050., 100.)

x = 356643.378
y = 7813944.500

binned, xedges, yedges, binnums = binned_statistic_2d((x,), (y,), (0.5,), 'mean', bins=[xEdges, yEdges], expand_binnumbers=True)

print(binnums[0])
#3678
print(np.argmax(xedges > x))
#2767

Any idea why this may be?

lzkelley · 2017-01-25T15:48:43Z

To expand on @pbranson's issue (which should probably be opened as a new issue):

xEdges = np.arange(79950.,500050.,100.)
yEdges = np.arange(7489950.,7860050.,100.)
x = 356643.378
y = 7813944.500
binned, xedges, yedges, binnums = binned_statistic_2d((x,), (y,), (0.5,), 'mean', bins=[xEdges,yEdges],expand_binnumbers=True)

The binnums seem to be incorrect:

> binnums
array([[3678],
       [1291]])
> np.where(np.isfinite(binned))
(array([2766]), array([3239]))
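For reference, the per-dimension indices one would expect here can be worked out independently with np.digitize. This sketch assumes the convention described earlier in the thread (index 0 is the outlier slot below the first edge, so the expanded bin number is the zero-based index into binned plus one). It does not call SciPy and says nothing about where the discrepancy arises.

```python
import numpy as np

xEdges = np.arange(79950., 500050., 100.)
yEdges = np.arange(7489950., 7860050., 100.)
x = 356643.378
y = 7813944.500

# Insertion index of each coordinate into its edge array.
expected_x = np.digitize([x], xEdges)[0]
expected_y = np.digitize([y], yEdges)[0]

print(expected_x, expected_y)          # 2767 3240
# Subtracting the outlier slot gives the position in the statistic array.
print(expected_x - 1, expected_y - 1)  # 2766 3239
```

Under that assumption the expected expanded bin numbers would be 2767 and 3240, consistent with the np.where result above and not with the returned binnums.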

rgommers · 2017-01-25T19:20:45Z

Does indeed look like a bug. Would you mind opening a new issue?

pbranson · 2017-02-07T06:02:23Z

I have opened up a new issue for this here: #7010


calquigs · 2020-08-05T21:34:20Z

Hi, I'm aware it's been 3+ years since this was closed. From my understanding, the binnumbers also include a border of bins that lie outside the range of the given bin edges. So when given edges for a 2x2 set of bins, it returns a total of 4x4 bins:

0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

with anything within the range being assigned to 5, 6, 9, or 10.

My question is: is there any way to tell binned_statistic_2d not to do that, so I can get more useful row-major ordered binnumbers? Or do I have to figure out how to do the math using the x and y bin indices to get the true bin number?
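One way to do that arithmetic, offered as a sketch and not as an option of binned_statistic_2d itself: unravel the padded bin number, drop the one-bin border, and renumber against the shape of the statistic array. The helper name core_binnumber is made up for illustration, and marking border values with -1 is an arbitrary choice.

```python
import numpy as np

def core_binnumber(binnumber, nx, ny):
    """Map padded bin numbers to row-major numbers over the nx-by-ny core bins.

    Values that fall in the outlier border are returned as -1.
    """
    binnumber = np.asarray(binnumber)
    # The padded grid has one extra slot on each side of every dimension.
    x_ind, y_ind = np.unravel_index(binnumber, (nx + 2, ny + 2))
    inside = (x_ind >= 1) & (x_ind <= nx) & (y_ind >= 1) & (y_ind <= ny)
    # Shift by one to drop the border, then number the core bins row by row.
    return np.where(inside, (x_ind - 1) * ny + (y_ind - 1), -1)

print(core_binnumber([5, 6, 9, 10], 2, 2))  # [0 1 2 3]
print(core_binnumber([5, 6, 5, 9], 2, 2))   # [0 1 0 2]
print(core_binnumber([0, 7, 13], 2, 2))     # [-1 -1 -1]
```

With the 4x4 layout shown above, bins 5, 6, 9 and 10 map to 0, 1, 2 and 3, which index the flattened 2x2 statistic array directly.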

ev-br added the scipy.stats, Documentation, and good first issue labels Nov 12, 2015

lzkelley mentioned this issue Nov 13, 2015

Enhancement to binned_statistic: option to unraveled returned bin-number mappings, and able to pass multiple data arrays at once #5497

Closed

rgommers closed this as completed Nov 24, 2015

lzkelley mentioned this issue Feb 12, 2017

scipy.stats.binned_statistic_2d: incorrect binnumbers returned #7010

Closed

jakevdp mentioned this issue May 17, 2022

BUG: stats.binned_statistic_dd binnumber is not usable #16195

Open
