Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overflow in Int Column Fails Silently #41

Open
waylonflinn opened this issue Jun 3, 2015 · 7 comments
Open

Overflow in Int Column Fails Silently #41

waylonflinn opened this issue Jun 3, 2015 · 7 comments

Comments

@waylonflinn
Copy link
Contributor

When using the sum aggregation on a column with an integer datatype, overflows can occur. No warning or error is generated when this happens.

@FrancescElies
Copy link
Contributor

Hi waylonflinn,

overflow-check is per default in cython code deactivated,
I guess this might be cause for this issue. If you could provide an example that would be great.

@waylonflinn
Copy link
Contributor Author

Thanks for building an amazingly fast library on top of bcolz! I'll try to put together a simple example that illustrates the problem.

@FrancescElies
Copy link
Contributor

This library was possible thanks to @CarstVaartjes & @esc.
Glad to see you like it :) and looking forward to your example

@waylonflinn
Copy link
Contributor Author

Okay here's a minimum functional example that reproduces the problem.

The following code creates a table with two columns, one of which is an int64. It adds two rows. The first has the maximum signed integer (9,223,372,036,854,775,807) for it's value. The second, a one (1).

It then sums over these two columns producing the minimum signed integer as a result (−9,223,372,036,854,775,808), demonstrating integer overflow.

# create table with two ints

SIGNED_INT_MAX = 9223372036854775807

dtypes = [('test_category', 'S3'), ('test_sum_column', 'i8')]
tuples = [
('foo', SIGNED_INT_MAX), 
('foo', 1)]

data = np.fromiter(tuples, dtype=dtypes, count=2)
overflow_example_table = bquery.ctable(data)

# sum column of ints
overflow_result = overflow_example_table.groupby(['test_category'], ['test_sum_column'],
                                                 agg_method='sum')

overflow_result[0] looks like this:

(b'foo', -9223372036854775808)

@FrancescElies
Copy link
Contributor

Hi,

Thanks for your example, I get exactly the same results.
Setting overflowcheck to True should do the trick, but our actual code base raised no Exception, weird.

It seems cython.overflowcheck does not work for compound statements like j += k
For this to work some refactoring would be needed: out_buffer[current_index] += in_buffer[i] as out_buffer[current_index] = out_buffer[current_index] + in_buffer[i].

Setting this feature to True will incur a run-time penalty, in my opinion is something that should be added but it will require some discussion.

@waylonflinn
Copy link
Contributor Author

Not sure what the magnitude of the performance penalties are, but I have a couple of alternative solutions I'd like to throw into the mix.

I ran into this issue when I was doing sums on an int32 column, rather than an int64. It's overflows on that field size that concern me most. There are two alternate solutions for that problem that I see:

  1. use int64 for all calculations
  2. use float32 or float64 for all calculations

My workaround in the situation I mentioned was to (implicity) use option 2 above. I first created a new column with a float64 type and copied that data into that. Since my eventual goal was to calculate the mean, anyway, this was the natural choice.

Both of these solutions probably have performance penalties of their own. I'm curious about how they compare to one another.

@CarstVaartjes
Copy link
Member

i think we should try some performance tests to find out how much it really is affected. the int64 and float variants are more user options for workarounds I think. most important that silents fails are not great :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants