Overflow in Int Column Fails Silently #41
Comments
Hi waylonflinn, overflow checking is deactivated by default in Cython code, |
Thanks for building an amazingly fast library on top of bcolz! I'll try to put together a simple example that illustrates the problem. |
This library was possible thanks to @CarstVaartjes & @esc. |
Okay, here's a minimal working example that reproduces the problem. The following code creates a table with two columns, one of which is an int64, and adds two rows. The first row has the maximum signed integer (9,223,372,036,854,775,807) as its value; the second has a one (1). It then sums that column over the two rows, producing the minimum signed integer (−9,223,372,036,854,775,808) as the result, which demonstrates integer overflow.
import numpy as np
import bquery

# create table with two ints
SIGNED_INT_MAX = 9223372036854775807
dtypes = [('test_category', 'S3'), ('test_sum_column', 'i8')]
tuples = [
    ('foo', SIGNED_INT_MAX),
    ('foo', 1)]
data = np.fromiter(tuples, dtype=dtypes, count=2)
overflow_example_table = bquery.ctable(data)
# sum column of ints
overflow_result = overflow_example_table.groupby(['test_category'], ['test_sum_column'],
                                                  agg_method='sum')
(b'foo', -9223372036854775808) |
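For comparison, the same wrap-around can be seen with plain NumPy, without bquery involved. This is only a sketch for illustration; it assumes nothing about bquery's internals and simply shows that a fixed-width int64 sum wraps silently.

```python
import numpy as np

# Largest value an int64 can hold (9223372036854775807)
SIGNED_INT_MAX = np.iinfo(np.int64).max

# Same two values as in the table above, held in an int64 array
values = np.array([SIGNED_INT_MAX, 1], dtype=np.int64)

# The reduction accumulates in int64 and wraps around
# without raising an error or emitting a warning.
wrapped = values.sum()
print(wrapped)
```

The printed value is the minimum signed 64-bit integer, matching the groupby result above.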
Hi, thanks for your example, I get exactly the same results. It seems setting this feature to True will incur a run-time penalty. In my opinion it is something that should be added, but it will require some discussion. |
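To make concrete what an overflow check adds, here is a rough pure-Python sketch of a checked int64 sum. It is not the Cython feature discussed above and not bquery code; `checked_sum_int64` is a hypothetical helper, and the per-element comparison is the kind of extra work that would cause a run-time penalty.

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -2**63


def checked_sum_int64(values):
    """Sum values, raising OverflowError once the running total leaves int64 range.

    Hypothetical helper for illustration only.
    """
    total = 0  # Python ints are arbitrary precision, so this never wraps
    for v in values:
        total += int(v)
        # The extra comparison on every element is the cost of checking.
        if total > INT64_MAX or total < INT64_MIN:
            raise OverflowError("int64 overflow while summing")
    return total


try:
    checked_sum_int64([INT64_MAX, 1])
    overflow_detected = False
except OverflowError:
    overflow_detected = True

print(overflow_detected)
```

With the two values from the example, the helper raises instead of returning a wrapped result.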
I'm not sure what the magnitude of the performance penalty is, but I have a couple of alternative solutions I'd like to throw into the mix. I ran into this issue when I was doing sums on an
My workaround in the situation I mentioned was to (implicitly) use option 2 above. I first created a new column with a Both of these solutions probably have performance penalties of their own. I'm curious about how they compare to one another. |
I think we should run some performance tests to find out how much it is really affected. The int64 and float variants are more user-side options for workarounds, I think. Most important is that silent fails are not great :) |
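As a rough sketch of how the int64 and float variants might look from the user side, the snippet below uses plain NumPy arrays rather than bquery columns (an assumption, so that it runs standalone). The int32 input and all variable names are made up for illustration.

```python
import numpy as np

# A narrow integer column: summing in int32 wraps around silently.
small = np.array([np.iinfo(np.int32).max, 1], dtype=np.int32)
wrapped_int32 = small.sum(dtype=np.int32)

# int64 variant: widen the column first, then sum.
widened_total = small.astype(np.int64).sum()

# float variant: no wrap-around, but float64 is only exact up to 2**53.
big = np.array([2**53, 1], dtype=np.int64)
float_total = big.astype(np.float64).sum()
exact_total = sum(int(v) for v in big)  # Python ints, always exact

print(wrapped_int32, widened_total)
print(int(float_total), exact_total)
```

The widened sum gives the expected total, while the float sum avoids wrapping at the cost of dropping the trailing 1 once values exceed 2**53. Neither variant warns the user, which is why the silent failure itself still matters.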
When using the sum aggregation on a column with an integer datatype, overflows can occur. No warning or error is generated when this happens.