Ward test crashes on Windows with Python 3.3.2 from Anaconda and MSVC 2008 and 2010 #2322

Closed
vene opened this Issue Jul 29, 2013 · 30 comments

Comments

Projects
None yet
5 participants
@vene
Owner

vene commented Jul 29, 2013

The test at fault is Check that we obtain the correct number of clusters with Ward clustering. ...

I need a break from this and will not pursue at the moment. I think it works fine with mingw, but binaries can't be released this way.

cc @ogrisel

@ogrisel

This comment has been minimized.

Show comment Hide comment
@ogrisel

ogrisel Jul 29, 2013

Owner

but binaries can't be released this way.

What problems have been identified if we release scikit-learn binaries built with MinGW?

Owner

ogrisel commented Jul 29, 2013

but binaries can't be released this way.

What problems have been identified if we release scikit-learn binaries built with MinGW?

@vene

This comment has been minimized.

Show comment Hide comment
@vene

vene Jul 29, 2013

Owner

They were working for me and (i suspect) other people with a similar
MinGW-based environment, but not outside of one. Maybe it's not the case
anymore or something else was the problem? We can try; I have a 100%
mingw-free VM, you can send me some mingw-built binaries.

On Mon, Jul 29, 2013 at 4:26 PM, Olivier Grisel notifications@github.comwrote:

but binaries can't be released this way.

What problems have been identified if we release scikit-learn binaries
built with MinGW?


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-21723345
.

Owner

vene commented Jul 29, 2013

They were working for me and (i suspect) other people with a similar
MinGW-based environment, but not outside of one. Maybe it's not the case
anymore or something else was the problem? We can try; I have a 100%
mingw-free VM, you can send me some mingw-built binaries.

On Mon, Jul 29, 2013 at 4:26 PM, Olivier Grisel notifications@github.comwrote:

but binaries can't be released this way.

What problems have been identified if we release scikit-learn binaries
built with MinGW?


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-21723345
.

@ogrisel

This comment has been minimized.

Show comment Hide comment
@ogrisel

ogrisel Jul 29, 2013

Owner

Ok let's do that.

Owner

ogrisel commented Jul 29, 2013

Ok let's do that.

@ogrisel

This comment has been minimized.

Show comment Hide comment
@ogrisel

ogrisel Jul 29, 2013

Owner

Just for the record: installing MinGW based scikit-learn builds on non MinGW machines yield an import error for the __check_build module: apparently we need to distribute binaries that are built with visual C++ express as Python (MSVC 2008 for python 2.7 and MSVC 2010 for python 3.3).

Owner

ogrisel commented Jul 29, 2013

Just for the record: installing MinGW based scikit-learn builds on non MinGW machines yield an import error for the __check_build module: apparently we need to distribute binaries that are built with visual C++ express as Python (MSVC 2008 for python 2.7 and MSVC 2010 for python 3.3).

@vene

This comment has been minimized.

Show comment Hide comment
@vene

vene Jul 29, 2013

Owner

The last issue (MSVC version) is not sure: at the moment both 2008 and 2010
work ok to build scikit-learn under Python 3.3 and import works, but both
variants crash in the aforementioned test.

On Mon, Jul 29, 2013 at 6:49 PM, Olivier Grisel notifications@github.comwrote:

Just for the record: installing MinGW based scikit-learn builds on non
MinGW machines yield an import error for the __check_build module:
apparently we need to distribute binaries that are built with visual C++
express as Python (MSVC 2008 for python 2.7 and MSVC 2010 for python 3.3).


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-21733679
.

Owner

vene commented Jul 29, 2013

The last issue (MSVC version) is not sure: at the moment both 2008 and 2010
work ok to build scikit-learn under Python 3.3 and import works, but both
variants crash in the aforementioned test.

On Mon, Jul 29, 2013 at 6:49 PM, Olivier Grisel notifications@github.comwrote:

Just for the record: installing MinGW based scikit-learn builds on non
MinGW machines yield an import error for the __check_build module:
apparently we need to distribute binaries that are built with visual C++
express as Python (MSVC 2008 for python 2.7 and MSVC 2010 for python 3.3).


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-21733679
.

@GaelVaroquaux

This comment has been minimized.

Show comment Hide comment
@GaelVaroquaux

GaelVaroquaux Aug 4, 2013

Owner

@vene : can you give us a C stack trace?

I believe that this guy is quite nasty, and may be a release blocker.

Owner

GaelVaroquaux commented Aug 4, 2013

@vene : can you give us a C stack trace?

I believe that this guy is quite nasty, and may be a release blocker.

@vene

This comment has been minimized.

Show comment Hide comment
@vene

vene Aug 5, 2013

Owner

I am travelling and I might not find the time. I'll try to reproduce
tonight and get a stack trace.

Vlad

On Mon, Aug 5, 2013 at 12:24 AM, Gael Varoquaux notifications@github.comwrote:

@vene https://github.com/vene : can you give us a C stack trace?

I believe that this guy is quite nasty, and may be a release blocker.


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-22079300
.

Owner

vene commented Aug 5, 2013

I am travelling and I might not find the time. I'll try to reproduce
tonight and get a stack trace.

Vlad

On Mon, Aug 5, 2013 at 12:24 AM, Gael Varoquaux notifications@github.comwrote:

@vene https://github.com/vene : can you give us a C stack trace?

I believe that this guy is quite nasty, and may be a release blocker.


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-22079300
.

@GaelVaroquaux

This comment has been minimized.

Show comment Hide comment
@GaelVaroquaux

GaelVaroquaux Aug 6, 2013

Owner

@ogrisel : the ImportError might be due to old files leftover from a previous build. I had that. I am commiting a custom 'clean' distutils command to fix that.

Edit scrap that: your problem is unrelated, I believe.

Owner

GaelVaroquaux commented Aug 6, 2013

@ogrisel : the ImportError might be due to old files leftover from a previous build. I had that. I am commiting a custom 'clean' distutils command to fix that.

Edit scrap that: your problem is unrelated, I believe.

@larsmans

This comment has been minimized.

Show comment Hide comment
@larsmans

larsmans Aug 7, 2013

Owner

A quick look at the Ward clustering code shows undefined behavior in a few places.

Shall I try to clean up the code and push a version that should work? I don't have a Windows build environment to test it, so someone else will need to do that.

Owner

larsmans commented Aug 7, 2013

A quick look at the Ward clustering code shows undefined behavior in a few places.

Shall I try to clean up the code and push a version that should work? I don't have a Windows build environment to test it, so someone else will need to do that.

@vene

This comment has been minimized.

Show comment Hide comment
@vene

vene Aug 7, 2013

Owner

I can handle the testing, thanks so much Lars!

On Wed, Aug 7, 2013 at 2:26 PM, Lars Buitinck notifications@github.comwrote:

A quick look at the Ward clustering code shows undefined behavior in
various places, e.g.:

hc_get_heads

cdef unsigned int parent, node0, node, size
size = parents.size
for node0 in range(size):
node0 = size - node0 - 1
node = node0
parent = parents[node]

On the very first iteration, this sets node0 to (unsigned)(-1). I'm
surprised this works anywhere.

Shall I try to clean up the code and push a version that should work? I
don't have a Windows build environment to test it, so someone else will
need to do that.


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-22244595
.

Owner

vene commented Aug 7, 2013

I can handle the testing, thanks so much Lars!

On Wed, Aug 7, 2013 at 2:26 PM, Lars Buitinck notifications@github.comwrote:

A quick look at the Ward clustering code shows undefined behavior in
various places, e.g.:

hc_get_heads

cdef unsigned int parent, node0, node, size
size = parents.size
for node0 in range(size):
node0 = size - node0 - 1
node = node0
parent = parents[node]

On the very first iteration, this sets node0 to (unsigned)(-1). I'm
surprised this works anywhere.

Shall I try to clean up the code and push a version that should work? I
don't have a Windows build environment to test it, so someone else will
need to do that.


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-22244595
.

@GaelVaroquaux

This comment has been minimized.

Show comment Hide comment
@GaelVaroquaux

GaelVaroquaux Aug 7, 2013

Owner

On Wed, Aug 07, 2013 at 04:26:52AM -0700, Lars Buitinck wrote:

A quick look at the Ward clustering code shows undefined behavior in various
places, e.g.:

hc_get_heads

cdef unsigned int parent, node0, node, size
size = parents.size
for node0 in range(size):
node0 = size - node0 - 1
node = node0
parent = parents[node]

On the very first iteration, this sets node0 to (unsigned)(-1).

I am not sure I get your point: on the first iteration, node0 is 0 at the
begining of the loop, so at the first line it becomes "size - 1", which
is positive. On the last iteration, node0 is (size -1) entering the loop,
and thus becomes 0 at the first line. Or am I missing something? This can
go wrong is size is zero, but then I believe that no iteration should
happen.

Shall I try to clean up the code and push a version that should work?

Your welcome to do so, but I haven't understood the problem.

Owner

GaelVaroquaux commented Aug 7, 2013

On Wed, Aug 07, 2013 at 04:26:52AM -0700, Lars Buitinck wrote:

A quick look at the Ward clustering code shows undefined behavior in various
places, e.g.:

hc_get_heads

cdef unsigned int parent, node0, node, size
size = parents.size
for node0 in range(size):
node0 = size - node0 - 1
node = node0
parent = parents[node]

On the very first iteration, this sets node0 to (unsigned)(-1).

I am not sure I get your point: on the first iteration, node0 is 0 at the
begining of the loop, so at the first line it becomes "size - 1", which
is positive. On the last iteration, node0 is (size -1) entering the loop,
and thus becomes 0 at the first line. Or am I missing something? This can
go wrong is size is zero, but then I believe that no iteration should
happen.

Shall I try to clean up the code and push a version that should work?

Your welcome to do so, but I haven't understood the problem.

@larsmans

This comment has been minimized.

Show comment Hide comment
@larsmans

larsmans Aug 7, 2013

Owner

You're right, I misread that part of the code. But there are other places were signed and unsigned arithmetic are mixed, and that's not a safe thing to do. I'm replacing all the integers with np.npy_intp just to be on the safe side.

Owner

larsmans commented Aug 7, 2013

You're right, I misread that part of the code. But there are other places were signed and unsigned arithmetic are mixed, and that's not a safe thing to do. I'm replacing all the integers with np.npy_intp just to be on the safe side.

@GaelVaroquaux

This comment has been minimized.

Show comment Hide comment
@GaelVaroquaux

GaelVaroquaux Aug 7, 2013

Owner

I'm replacing all the integers with np.npy_intp just to be on the safe side.

OK, thanks.

Similar problems were already pointed out in
#2313 . I didn't feel
authoritative to review this PR, but if you have feedback, it would be
great.

Owner

GaelVaroquaux commented Aug 7, 2013

I'm replacing all the integers with np.npy_intp just to be on the safe side.

OK, thanks.

Similar problems were already pointed out in
#2313 . I didn't feel
authoritative to review this PR, but if you have feedback, it would be
great.

@ogrisel

This comment has been minimized.

Show comment Hide comment
@ogrisel

ogrisel Aug 7, 2013

Owner

I just built the 0.14.X branch on windows 7 with Visual C++ Express 2008 and I only get the following joblib related heisenfailure when running the tests:

======================================================================
FAIL: Check the performance of hashing numpy arrays:
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
AssertionError: False is not true

----------------------------------------------------------------------

+1 for releasing.

Please feel free to tag the 0.14 and I can regenerate the windows installer from the tag.

Owner

ogrisel commented Aug 7, 2013

I just built the 0.14.X branch on windows 7 with Visual C++ Express 2008 and I only get the following joblib related heisenfailure when running the tests:

======================================================================
FAIL: Check the performance of hashing numpy arrays:
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
AssertionError: False is not true

----------------------------------------------------------------------

+1 for releasing.

Please feel free to tag the 0.14 and I can regenerate the windows installer from the tag.

@GaelVaroquaux

This comment has been minimized.

Show comment Hide comment
@GaelVaroquaux

GaelVaroquaux Aug 7, 2013

Owner

I just built the 0.14.X branch on windows 7 with Visual C++ Express 2008 and I only get the following joblib related heisenfailure when running the tests:

OK, this is fantastic news. Give me 2 hours, if you can...

G

Owner

GaelVaroquaux commented Aug 7, 2013

I just built the 0.14.X branch on windows 7 with Visual C++ Express 2008 and I only get the following joblib related heisenfailure when running the tests:

OK, this is fantastic news. Give me 2 hours, if you can...

G

@GaelVaroquaux

This comment has been minimized.

Show comment Hide comment
@GaelVaroquaux

GaelVaroquaux Aug 7, 2013

Owner

Give me 2 hours, if you can...

Actually, I can tag pretty much now. I probably need 20-30mn before I
tag, and then we can start building the binaries.

Owner

GaelVaroquaux commented Aug 7, 2013

Give me 2 hours, if you can...

Actually, I can tag pretty much now. I probably need 20-30mn before I
tag, and then we can start building the binaries.

@vene

This comment has been minimized.

Show comment Hide comment
@vene

vene Aug 7, 2013

Owner

The crash is still happening. I don't know how to get a stack trace, I
think I need to install something. If Federico beats me to it, I won't
even be mad :)

On Wed, Aug 7, 2013 at 4:47 PM, Gael Varoquaux notifications@github.comwrote:

Give me 2 hours, if you can...

Actually, I can tag pretty much now. I probably need 20-30mn before I
tag, and then we can start building the binaries.


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-22252128
.

Owner

vene commented Aug 7, 2013

The crash is still happening. I don't know how to get a stack trace, I
think I need to install something. If Federico beats me to it, I won't
even be mad :)

On Wed, Aug 7, 2013 at 4:47 PM, Gael Varoquaux notifications@github.comwrote:

Give me 2 hours, if you can...

Actually, I can tag pretty much now. I probably need 20-30mn before I
tag, and then we can start building the binaries.


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-22252128
.

@larsmans

This comment has been minimized.

Show comment Hide comment
@larsmans

larsmans Aug 7, 2013

Owner

Can you find out which line of the test it's occurring in?

Owner

larsmans commented Aug 7, 2013

Can you find out which line of the test it's occurring in?

@GaelVaroquaux

This comment has been minimized.

Show comment Hide comment
@GaelVaroquaux

GaelVaroquaux Aug 7, 2013

Owner

After a lot of investigation, it seems that this is due to joblib, it particular caching (or maybe loading).

Owner

GaelVaroquaux commented Aug 7, 2013

After a lot of investigation, it seems that this is due to joblib, it particular caching (or maybe loading).

@GaelVaroquaux

This comment has been minimized.

Show comment Hide comment
@GaelVaroquaux

GaelVaroquaux Aug 7, 2013

Owner

Whoot! This is a bug in np.load. Neither sklearn, nor joblib.

I can easily work around this in the test. I will report the bug upstream.

Owner

GaelVaroquaux commented Aug 7, 2013

Whoot! This is a bug in np.load. Neither sklearn, nor joblib.

I can easily work around this in the test. I will report the bug upstream.

@ogrisel

This comment has been minimized.

Show comment Hide comment
@ogrisel

ogrisel Aug 7, 2013

Owner

Great news! Thanks for investigating.

Owner

ogrisel commented Aug 7, 2013

Great news! Thanks for investigating.

@GaelVaroquaux

This comment has been minimized.

Show comment Hide comment
@GaelVaroquaux

GaelVaroquaux Aug 7, 2013

Owner

Great news! Thanks for investigating.

Upon more investigation, np.load is completely broken under that
platform. Which means that the test suite segfaults, as it calls the
joblib test suite, which heavily uses np.load.

I am passed caring: I am jumping in a bus, going back home and tagging.

Owner

GaelVaroquaux commented Aug 7, 2013

Great news! Thanks for investigating.

Upon more investigation, np.load is completely broken under that
platform. Which means that the test suite segfaults, as it calls the
joblib test suite, which heavily uses np.load.

I am passed caring: I am jumping in a bus, going back home and tagging.

@vene

This comment has been minimized.

Show comment Hide comment
@vene

vene Aug 7, 2013

Owner

Skip the tests on Windows?

Gael, did you tag? Should I build the binaries?

Vlad

On Wed, Aug 7, 2013 at 9:04 PM, Gael Varoquaux notifications@github.comwrote:

Great news! Thanks for investigating.

Upon more investigation, np.load is completely broken under that
platform. Which means that the test suite segfaults, as it calls the
joblib test suite, which heavily uses np.load.

I am passed caring: I am jumping in a bus, going back home and tagging.


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-22271023
.

Owner

vene commented Aug 7, 2013

Skip the tests on Windows?

Gael, did you tag? Should I build the binaries?

Vlad

On Wed, Aug 7, 2013 at 9:04 PM, Gael Varoquaux notifications@github.comwrote:

Great news! Thanks for investigating.

Upon more investigation, np.load is completely broken under that
platform. Which means that the test suite segfaults, as it calls the
joblib test suite, which heavily uses np.load.

I am passed caring: I am jumping in a bus, going back home and tagging.


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-22271023
.

@GaelVaroquaux

This comment has been minimized.

Show comment Hide comment
@GaelVaroquaux

GaelVaroquaux Aug 7, 2013

Owner

Skip the tests on Windows?

There are a lot.

Gael, did you tag?

Not yet, but I am going through the release process. It should be there
anytime.

Should I build the binaries?

I would say either tonight late (in 30mn) or tomorrow.

Owner

GaelVaroquaux commented Aug 7, 2013

Skip the tests on Windows?

There are a lot.

Gael, did you tag?

Not yet, but I am going through the release process. It should be there
anytime.

Should I build the binaries?

I would say either tonight late (in 30mn) or tomorrow.

@vene

This comment has been minimized.

Show comment Hide comment
@vene

vene Aug 7, 2013

Owner

I could do it ASAP, but ideally I would prefer tomorrow afternoon, as I
have a talk and I need to finish (actually start) the slides for it :-/

Would this be fine?

On Thu, Aug 8, 2013 at 12:41 AM, Gael Varoquaux notifications@github.comwrote:

Skip the tests on Windows?

There are a lot.

Gael, did you tag?

Not yet, but I am going through the release process. It should be there
anytime.

Should I build the binaries?

I would say either tonight late (in 30mn) or tomorrow.


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-22286533
.

Owner

vene commented Aug 7, 2013

I could do it ASAP, but ideally I would prefer tomorrow afternoon, as I
have a talk and I need to finish (actually start) the slides for it :-/

Would this be fine?

On Thu, Aug 8, 2013 at 12:41 AM, Gael Varoquaux notifications@github.comwrote:

Skip the tests on Windows?

There are a lot.

Gael, did you tag?

Not yet, but I am going through the release process. It should be there
anytime.

Should I build the binaries?

I would say either tonight late (in 30mn) or tomorrow.


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-22286533
.

@GaelVaroquaux

This comment has been minimized.

Show comment Hide comment
@GaelVaroquaux

GaelVaroquaux Aug 7, 2013

Owner

I could do it ASAP, but ideally I would prefer tomorrow afternoon, as I
have a talk and I need to finish (actually start) the slides for it :-/

Would this be fine?

Yes. I think that people can live without binaries for a day. I'd rather
do that than rush the release.

Owner

GaelVaroquaux commented Aug 7, 2013

I could do it ASAP, but ideally I would prefer tomorrow afternoon, as I
have a talk and I need to finish (actually start) the slides for it :-/

Would this be fine?

Yes. I think that people can live without binaries for a day. I'd rather
do that than rush the release.

@vene

This comment has been minimized.

Show comment Hide comment
@vene

vene Aug 7, 2013

Owner

Maybe there is a way to skip a whole file conditionally?

Thanks for understanding. I'd do it tonight but I could be sleeping in 30
minutes so I can do a proper job tomorrow at both things.

On Thu, Aug 8, 2013 at 12:45 AM, Gael Varoquaux notifications@github.comwrote:

I could do it ASAP, but ideally I would prefer tomorrow afternoon, as I
have a talk and I need to finish (actually start) the slides for it :-/

Would this be fine?

Yes. I think that people can live without binaries for a day. I'd rather
do that than rush the release.


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-22286776
.

Owner

vene commented Aug 7, 2013

Maybe there is a way to skip a whole file conditionally?

Thanks for understanding. I'd do it tonight but I could be sleeping in 30
minutes so I can do a proper job tomorrow at both things.

On Thu, Aug 8, 2013 at 12:45 AM, Gael Varoquaux notifications@github.comwrote:

I could do it ASAP, but ideally I would prefer tomorrow afternoon, as I
have a talk and I need to finish (actually start) the slides for it :-/

Would this be fine?

Yes. I think that people can live without binaries for a day. I'd rather
do that than rush the release.


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/issues/2322#issuecomment-22286776
.

@ogrisel

This comment has been minimized.

Show comment Hide comment
@ogrisel

ogrisel Aug 20, 2013

Owner

Has this been fixed?

Owner

ogrisel commented Aug 20, 2013

Has this been fixed?

@GaelVaroquaux

This comment has been minimized.

Show comment Hide comment
@GaelVaroquaux

GaelVaroquaux Aug 20, 2013

Owner

Has this been fixed?

No, but it's a numpy bug (that hasn't been confirmed by anyone, although
I reported it on the mailing list).

G

Owner

GaelVaroquaux commented Aug 20, 2013

Has this been fixed?

No, but it's a numpy bug (that hasn't been confirmed by anyone, although
I reported it on the mailing list).

G

@amueller amueller modified the milestones: 0.15.1, 0.14 Jul 18, 2014

@ogrisel

This comment has been minimized.

Show comment Hide comment
@ogrisel

ogrisel Jul 29, 2014

Owner

So based on @GaelVaroquaux's last comment, this is not a scikit-learn bug. Anyway we don't do the release with mingw32 but msvc.

Closing.

Owner

ogrisel commented Jul 29, 2014

So based on @GaelVaroquaux's last comment, this is not a scikit-learn bug. Anyway we don't do the release with mingw32 but msvc.

Closing.

@ogrisel ogrisel closed this Jul 29, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment