New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug in linspace optimization of bin
and hist
#2923
Conversation
// may run into an integer overflow when converting the "bin" | ||
// computation result to `Index`. | ||
if (x < edges.front() || x >= edges.back()) | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this fix is nearly identical in bin
and histogram
, is it possible to extract the common code into a function that computes the target bin
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried now, this causes a single-threaded performance regression of
- 20-30% for
hist
- 5-10% for
bin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* 5-10% for `bin`
Scratch that, the implementation in this PR has the 5% regression (no visible difference when multi-threading is on), but extracting the function seems to perform as the original. Need to investigate more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rewrote it slightly, and see no significant performance impact. There may be a very small slowdown, in particular for hist
of the order of 5-10%, but I doubt it would be relevant in practice (i.e., with multi-threading).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
size = da.bins.size().sum().value | ||
table.coords['x'].values[0] = 2.0 * np.finfo(np.float32).max | ||
da = table.bin(x=x) | ||
assert da.bins.size().sum().value == size - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of these tests check that the correct element was dropped, only that the size was reduced by 1. Instead, you could slice out the bad element and group/bin/hist the slice and use that as the expected result.
Fixes #2922. This was caused by int overflows, wrapping, e.g., a large positive to a negative.
Please have a very close look, this is not the first time this code piece is buggy.