Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in linspace optimization of bin and hist #2923

Merged
merged 6 commits into from Jan 5, 2023

Conversation

SimonHeybrock
Copy link
Member

@SimonHeybrock SimonHeybrock commented Jan 4, 2023

Fixes #2922. This was caused by int overflows, wrapping, e.g., a large positive to a negative.

Please have a very close look, this is not the first time this code piece is buggy.

// may run into an integer overflow when converting the "bin"
// computation result to `Index`.
if (x < edges.front() || x >= edges.back())
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this fix is nearly identical in bin and histogram, is it possible to extract the common code into a function that computes the target bin?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried now, this causes a single-threaded performance regression of

  • 20-30% for hist
  • 5-10% for bin

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* 5-10% for `bin`

Scratch that, the implementation in this PR has the 5% regression (no visible difference when multi-threading is on), but extracting the function seems to perform as the original. Need to investigate more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote it slightly, and see no significant performance impact. There may be a very small slowdown, in particular for hist of the order of 5-10%, but I doubt it would be relevant in practice (i.e., with multi-threading).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

size = da.bins.size().sum().value
table.coords['x'].values[0] = 2.0 * np.finfo(np.float32).max
da = table.bin(x=x)
assert da.bins.size().sum().value == size - 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of these tests check that the correct element was dropped, only that the size was reduced by 1. Instead, you could slice out the bad element and group/bin/hist the slice and use that as the expected result.

@SimonHeybrock SimonHeybrock merged commit 06b62f7 into main Jan 5, 2023
@SimonHeybrock SimonHeybrock deleted the fix-bin-int-overflow branch January 5, 2023 12:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bin with "linspace" edges broken for large values outside bin bounds
2 participants