New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A range of fixes relating to small coords dtypes #158

Merged
merged 1 commit into from Jun 12, 2018

Conversation

Projects
None yet
2 participants
@hameerabbasi
Collaborator

hameerabbasi commented Jun 12, 2018

There was a range of bugs relating to small coords dtypes (used to save memory) often causing overflows and other bugs, also in interaction with dask.array

I removed all np.min_scalar_type instances as well as custom small coords instances.

This led to a few other bugs, which were also fixed. First seen in scipy-conference/scipy_proceedings#388 (comment)

Example from that here: https://gist.github.com/hameerabbasi/0230eb7b6359416d1db4167f802ada83

@codecov

This comment has been minimized.

codecov bot commented Jun 12, 2018

Codecov Report

Merging #158 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #158      +/-   ##
=========================================
- Coverage   96.92%   96.9%   -0.03%     
=========================================
  Files          11      11              
  Lines        1205    1196       -9     
=========================================
- Hits         1168    1159       -9     
  Misses         37      37
Impacted Files Coverage Δ
sparse/coo/common.py 96.95% <100%> (-0.1%) ⬇️
sparse/coo/umath.py 96.82% <100%> (-0.04%) ⬇️
sparse/utils.py 97.11% <100%> (+0.02%) ⬆️
sparse/coo/core.py 96.31% <100%> (-0.05%) ⬇️
sparse/coo/indexing.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4104eab...4d8c1fc. Read the comment docs.

@mrocklin

This comment has been minimized.

Collaborator

mrocklin commented Jun 12, 2018

I'm a little sad to see the smaller dtypes go away, but I agree that they have caused enough small and hard-to-detect errors that removing them is probably the right way to go short-term. I would not be surprised if a future performance-oriented effort lead to them being reintroduced. Are there tests that we can do to ensure that these bugs don't get added back in in the future by enthusiastic devs?

@mrocklin

This comment has been minimized.

Collaborator

mrocklin commented Jun 12, 2018

Thanks for making these fixes by the way. It's surprisingly nice to discover bugs in the evening and then learn that someone else has fixed them when you wake up :)

@hameerabbasi

This comment has been minimized.

Collaborator

hameerabbasi commented Jun 12, 2018

I'm a little sad to see the smaller dtypes go away, but I agree that they have caused enough small and hard-to-detect errors that removing them is probably the right way to go short-term. I would not be surprised if a future performance-oriented effort lead to them being reintroduced.

I hope so too, but I wouldn't be surprised if it didn't happen. IIRC, this is the third time someone has reported bugs caused by this. Overflows can be surprisingly hard to catch, you have to go through every line for a possible overflow. Given the nature of sparse arrays, I'd like to see us switch to bigints or something similar in the future, for really large sparse arrays.

Are there tests that we can do to ensure that these bugs don't get added back in in the future by enthusiastic devs?

The tests are already there. They weren't being caught because (surprise, surprise) we were using unsigned dtypes. Now that we're using np.intp, they should be caught.

@mrocklin

This comment has been minimized.

Collaborator

mrocklin commented Jun 12, 2018

The tests are already there. They weren't being caught because (surprise, surprise) we were using unsigned dtypes. Now that we're using np.intp, they should be caught.

I'm not sure I understand. If someone were to undo the source code changes that you're making here and go back to using dtypes as we were using them before, they wouldn't get a signal that they were breaking things.

@hameerabbasi

This comment has been minimized.

Collaborator

hameerabbasi commented Jun 12, 2018

I'm not sure I understand. If someone were to undo the source code changes that you're making here and go back to using dtypes as we were using them before, they wouldn't get a signal that they were breaking things.

I just did a quick test, adding back small dtypes without making the fixes required to actually support them. A lot of things do break.

It's hard to come up with a comprehensive test suite that tests for overflows. You have to test for (for example) when you go from np.uint8 to np.uint16, np.int8 to np.int16 , etc. (for someone who decides to use signed types), find all of those thresholds, find what triggers them in each case. This can be challenging in general, about as challenging as actually finding the cases that would cause overflows in the first place.

@mrocklin

This comment has been minimized.

Collaborator

mrocklin commented Jun 12, 2018

@hameerabbasi hameerabbasi merged commit 1447e0f into pydata:master Jun 12, 2018

4 checks passed

ci/circleci: build_27 Your tests passed on CircleCI!
Details
ci/circleci: build_36 Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 96.92%)
Details
codecov/project Absolute coverage decreased by -0.02% but relative coverage increased by +3.07% compared to 4104eab
Details

@hameerabbasi hameerabbasi deleted the hameerabbasi:various-fixes branch Jul 26, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment