Floating point exception in GradientBoostingClassifier during nosetests on macosx 10.8.2 64bit (and others?) #1406

Closed
erg opened this Issue Nov 24, 2012 · 19 comments

Comments

Projects
None yet
8 participants
Contributor

erg commented Nov 24, 2012

Fails every time. Using current master and either nosetests or make. Git id 3c46eba.

sklearn.decomposition.tests.test_sparse_pca.test_initialization ... ok
sklearn.decomposition.tests.test_sparse_pca.test_mini_batch_correct_shapes ... ok
sklearn.decomposition.tests.test_sparse_pca.test_mini_batch_fit_transform ... SKIP
Doctest: sklearn.ensemble.gradient_boosting.GradientBoostingClassifier ... Floating point exception: 8
modern:scikit-learn erg$ [master*] 
Owner

mblondel commented Nov 25, 2012

Seems like your computer is catching lots of errors :)

Owner

pprett commented Nov 25, 2012

@erg I cannot reproduce (neither does our buildbot)... can you try to pin point the error? what numpy/blas version are your running?

thx,
Peter

Contributor

erg commented Nov 25, 2012

Blas is the builtin one for mac. How do I find the version?

All you need is a new Macbook or like 10.8.2 64bit. Might get time to debug later...

In [2]: np.__version__
Out[2]: '1.8.0.dev-20224ea'

modern:scikit-learn erg$ [master*] file /usr/lib/libblas.dylib
/usr/lib/libblas.dylib: Mach-O universal binary with 2 architectures
/usr/lib/libblas.dylib (for architecture i386): Mach-O dynamically linked shared library i386
/usr/lib/libblas.dylib (for architecture x86_64):   Mach-O 64-bit dynamically linked shared library x86_64
Owner

amueller commented Nov 25, 2012

On 11/25/2012 11:06 PM, erg wrote:

Blas is the builtin one for mac. How do I find the version?

All you need is a new Macbook or like 10.8.2 64bit. Might get time to
debug later...

@erg if you send me one, I'll to the debugging for you ;)

Owner

ogrisel commented Nov 25, 2012

I cannot reproduce either, although I am running OS X 10.8.2 (12C3006):

$ otool -L /usr/local/lib/python2.7/site-packages/numpy/core/_dotblas.so 
/usr/local/lib/python2.7/site-packages/numpy/core/_dotblas.so:
    /System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate (compatibility version 1.0.0, current version 4.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)
Owner

GaelVaroquaux commented Nov 26, 2012

I can reproduce with my system with the hand-compiled atlas library.

Owner

GaelVaroquaux commented Nov 26, 2012

Hum, I can even reproduce with 0.12.... I am pretty certain that the tests used to pass. Thus I am starting to think that it is not a problem in the scikit.

Owner

GaelVaroquaux commented Nov 26, 2012

OK, with numpy 1.6.2 it works fine...

Owner

pprett commented Nov 26, 2012

I see - I'm using 1.6.2 - I'll hunt down the problem with 1.7

Owner

GaelVaroquaux commented Nov 26, 2012

I did a bisect on numpy, and the numpy commit that induces the breakage is:
numpy/numpy@c48156d

Owner

amueller commented Nov 26, 2012

@GaelVaroquaux nice job :)

Owner

ogrisel commented Nov 26, 2012

Now we need a volunteer to qualify the bug in a minimalistic reproduction script and report it to the numpy github issue tracker if it's indeed a numpy regression rather than a misuse from our side.

Owner

GaelVaroquaux commented Nov 26, 2012

Now we need a volunteer to qualify the bug in a minimalistic reproduction
script and report it to the numpy github issue tracker if it's indeed a numpy
regression rather than a misuse from our side.

I am still looking at it. We might have something fishy. The segfault is
in _tree.pyx, in recursive_partition, line 437:
cdef int y_stride = y.strides[0] / y.strides[1]

I suspect that y.strides[1] is now 0.

Owner

GaelVaroquaux commented Nov 26, 2012

cdef int y_stride = y.strides[0] / y.strides[1]

I suspect that y.strides[1] is now 0.

Confirmed.

OK, there's been an API change at the numpy level: a C-contiguous array
can have zero-length strides. It's a change in behavior, and we can
report it. But we can easily be robust to it. Is this a bug on the numpy
side? I wonder.

Owner

GaelVaroquaux commented Nov 26, 2012

OK, there's been an API change at the numpy level: a C-contiguous array
can have zero-length strides. It's a change in behavior, and we can
report it. But we can easily be robust to it.

This change of behavior has a wider impact. I have fixed the segfault in
my fix_1406 branch, but I am now getting a lot of failing tests. I'll
report on the numpy tracker and see what follows.

Owner

pprett commented Nov 26, 2012

@GaelVaroquaux thanks for taking care of that

Owner

GaelVaroquaux commented Nov 26, 2012

This change of behavior has a wider impact. I have fixed the segfault in
my fix_1406 branch, but I am now getting a lot of failing tests. I'll
report on the numpy tracker and see what follows.

numpy/numpy#2770

We'll see what they reply.

Contributor

seberg commented Nov 28, 2012

I had already mentioned this in numpy/numpy#2694 (comment) (there is a second occurance that I saw back then outside of tree in case you did not yet) the fix is simply to use .itemsize for the division. The use before the division is OK, since it seems only used for that given dimension.

I have not tried this specifically for this code, but with previous numpy you should be able to break it for example with some (possibly not easy to create) 0-sized or 1-sized arrays.

Btw. sorry if you guys spend so much time on it, I somewhat thought it was seen (and also thought by then it would be reverted in numpy faster, though thats no reason not to fix it here as well)

Owner

glouppe commented Dec 12, 2012

Fixed in #1458.

@glouppe glouppe closed this Dec 12, 2012

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment