binned_statistic: incorrect binnumber results #5449
Comments
I agree it's very cryptic, but for what it's worth, the meaning of binnumber can be unpacked like this:
import numpy as np
from scipy.stats import binned_statistic_2d
a1 = [0.1, 0.1, 0.1, 0.6]
a2 = [2.1, 2.6, 2.1, 2.1]
b1 = [0.0, 0.5, 1.0]
b2 = [2.0, 2.5, 3.0]
stats = binned_statistic_2d(a1, a2, np.arange(len(a1)), 'count', bins=[b1,b2])
x_ind, y_ind = np.unravel_index(stats.binnumber,
(len(stats.x_edge) + 1, len(stats.y_edge) + 1))
print(x_ind)
# [1 1 1 2]
print(y_ind)
# [1 2 1 1]
The index along each dimension indicates where in the bin edges array the value should be inserted so as to keep the array sorted. This should probably be better documented. Perhaps we should add another output field that gives these computed indices. |
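As a cross-check on the explanation above, here is a minimal sketch (not SciPy's actual implementation) that reproduces the same per-dimension indices with np.digitize and recombines them with np.ravel_multi_index. It reuses a1, a2, b1 and b2 from the example; the names shape and flat are only illustrative.

```python
import numpy as np

a1 = [0.1, 0.1, 0.1, 0.6]
a2 = [2.1, 2.6, 2.1, 2.1]
b1 = [0.0, 0.5, 1.0]
b2 = [2.0, 2.5, 3.0]

# Per-dimension insertion positions into the sorted edge arrays.
# 0 means "below the first edge"; len(edges) means "at or above the last edge".
x_ind = np.digitize(a1, b1)
y_ind = np.digitize(a2, b2)

# Each dimension has len(edges) + 1 possible positions (the real bins plus
# one outlier slot on either side), so the flattened grid here is 4 x 4.
shape = (len(b1) + 1, len(b2) + 1)
flat = np.ravel_multi_index((x_ind, y_ind), shape)

print(x_ind)  # [1 1 1 2]
print(y_ind)  # [1 2 1 1]
print(flat)   # [5 6 5 9]
```

As far as I know, SciPy counts a value lying exactly on the last edge as part of the last bin, which plain np.digitize does not, so this only illustrates the indexing scheme.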
Interesting, thanks for the explanation @jakevdp. Is there any benefit to this representation (i.e. as opposed to just having the unraveled indices)? |
The benefit is that it's compact; for D dimensions, this representation is O[N] in space, while the expanded representation is O[ND] in space. That said, it would probably be more useful to return the expanded form. |
@lzkelley would you be interested in contributing this into an |
@ev-br absolutely, I was also thinking about putting in a PR replacing the compact version with the unraveled version. I was hoping more people would chime in with their opinions (no one really did on the scipy-dev list) on whether the 'compact' or 'unraveled' was better. Incorporating an enhancement suggested in numpy #4718 (allowing for multiple values to be binned simultaneously) was also something I was looking at. |
How about adding an argument |
The above-linked PR includes the |
gh-5497 is merged, so closing this. |
I think there may be a bug somewhere in this function. Sorry, I haven't been able to have a look, but the output from the following provides an example:
xEdges = np.arange(79950., 500050., 100.)
x = 356643.378
binned, xedges, yedges, binnums = binned_statistic_2d((x,), (y,), (0.5,), 'mean', bins=[xEdges, yEdges], expand_binnumbers=True)
print(binnums[0])
Any idea why this may be? |
To expand on @pbranson's issue (which should probably be opened as a new issue):
The
|
Does indeed look like a bug. Would you mind opening a new issue? |
I have opened up a new issue for this here:
#7010
|
Hi, I'm aware it's been 3+ years since this was closed. From my understanding, the binnumbers also include a border of bins that lie outside the ranges of the given bin edges. So when given edges for a 2x2 set of bins, it returns a total of 4x4 bins: 0 1 2 3 with anything within the range being assigned to 5, 6, 9, or 10. My question is: is there any way to tell binned_statistic_2d not to do that, so I can get more useful row-major ordered binnumbers? Or do I have to figure out how to do the math using the x and y bin indices to get the true bin number? |
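One way to do that math, sketched under the assumption that the compact binnumber values are laid out row-major on an (nx + 2) x (ny + 2) grid as described above: unravel them, drop the one-bin border, and re-ravel on the nx x ny grid. The helper name interior_binnumber is hypothetical, not part of SciPy, and marking out-of-range points with -1 is just one possible convention.

```python
import numpy as np

def interior_binnumber(binnumber, nx, ny):
    """Map compact bin numbers on the padded (nx+2, ny+2) grid to
    row-major numbers on the (nx, ny) grid; outliers become -1."""
    binnumber = np.asarray(binnumber)
    # Recover per-dimension indices on the padded grid.
    ix, iy = np.unravel_index(binnumber, (nx + 2, ny + 2))
    # Shift so that the first real bin has index 0 in each dimension.
    ix = ix - 1
    iy = iy - 1
    inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    out = np.full(binnumber.shape, -1, dtype=np.intp)
    out[inside] = np.ravel_multi_index((ix[inside], iy[inside]), (nx, ny))
    return out

result = interior_binnumber([5, 6, 9, 10, 0, 7], nx=2, ny=2)
print(result.tolist())  # [0, 1, 2, 3, -1, -1]
```

Passing expand_binnumbers=True (used earlier in this thread) returns the per-dimension indices directly, which should make the unravel step unnecessary.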
Perhaps I'm misunderstanding the meaning of the binnumber results, but it looks like they're not being cleaned up properly.
My understanding is that binnumber should provide a mapping from the values to the bins they belong in, but the results don't seem to reflect that. Consider the following example:
The resulting 'statistic' array looks good:
But the 'binnumber' array seems almost meaningless:
It gets correct that the first and third elements belong in the same bin; that the second element is one bin higher; and that the fourth element should be offset by a 'row'. The resulting statistic (results) array is cleaned up just before being returned:
Should the same reshaping/cleaning process be happening to the binnumber array?
A couple of other minor points:
binned_statistic_2d says that x and y can have different lengths. I think they should be the same.
values array isn't used. Should it be made optional then? i.e. perhaps the default behavior should be values=None, statistic=None and a check can be made like: