[MRG] fix malloc errors in trees #2715

Merged
merged 3 commits into from Jan 11, 2014

Projects

None yet

6 participants

@larsmans
Member
larsmans commented Jan 5, 2014

There were still a few unchecked mallocs and constructs were the checks would actually cause double free. I also fixed some other minor stuff. See commit log for details.

I was actually trying to optimize the sorting by implementing three-way quicksort, but when I tried to cache some stuff in a new malloc'd array, the tree code started crashing.

Ping @glouppe.

@glouppe
Member
glouppe commented Jan 5, 2014

Thanks! However, we touched to a lot of things during #2570. Could this wait for @pprett PR to be merged first?

@glouppe
Member
glouppe commented Jan 5, 2014

Regarding the sorting algorithm, this is really something we have to be cautious about. I experimented with several sorting algorithms during the rewrite this summer and it can easily slow down the whole algorithm (by a factor from 1 to 10 depending on the distribution of the feature values). In particular, I had first used the Quicksort implementation from the Numerical Recipes book, but the current Heapsort happens to be quite faster.

@glouppe
Member
glouppe commented Jan 5, 2014

(I would be glad to do more experiments on this topic though, it is still not exactly clear to me which sorting algorithm is better in which cases.)

@larsmans
Member
larsmans commented Jan 5, 2014

@glouppe I was actually going to ask you about the heapsort. I'd never try that algo myself, but now it seems I can't beat it.

I tried a vanilla randomized quicksort too, and it slowed the code down 1/3; I then tried quicksort with backoff to heap sort or insertion sort and it was still slower. I'm not sure if 3-way quicksort will be better, but there's one other trick I want to try, which is to copy the features used for sorting into a contiguous array to make the sorting cache-friendly. That also means less state has to be passed around in a recursive sort.

@larsmans
Member
larsmans commented Jan 5, 2014

Hmm, test failure in gradient boosting...

@pprett
Member
pprett commented Jan 5, 2014

infinite values in the predictions on the boston housing regression test.. that's strange... need to look into the changes

@larsmans
Member
larsmans commented Jan 5, 2014

@pprett I haven't followed #2570, but judging from the diff it mostly adds stuff in isolated places. Do you think it'll be merged soon?

@glouppe glouppe and 1 other commented on an outdated diff Jan 5, 2014
sklearn/tree/_tree.pyx
self.n_total_samples = X.shape[0]
- self.sample_mask = <SIZE_t*> calloc(self.n_total_samples,
@glouppe
glouppe Jan 5, 2014 Member

I think the error comes from here: sample_mask should indeed be initialized with zeroes. It is important in PresortBestSplitter (the one used by GBRT).

@larsmans
larsmans Jan 5, 2014 Member

Missed that. I'll memset it.

@larsmans
Member
larsmans commented Jan 5, 2014

Ok, let's see if it passes now; I do get an exception from joblib on my own box.

@coveralls

Coverage Status

Coverage remained the same when pulling d748453 on larsmans:tree-malloc into 9e9a656 on scikit-learn:master.

@glouppe
Member
glouppe commented Jan 5, 2014

Yay :) The exception from joblib that you have is the same as the one we have on Jenkins? cc: @ogrisel

@larsmans
Member
larsmans commented Jan 5, 2014

Something about "EOF in multi-line statement" and pages full of traceback info.

@ogrisel
Member
ogrisel commented Jan 6, 2014

I think I could reproduce that exception on master once but it does not seem to be deterministic. I will open a new issue for this one.

@ogrisel
Member
ogrisel commented Jan 6, 2014

I opened #2721 to track the resolution of the JoblibIndexError heisen-error.

@larsmans
Member

Actually never mind, I think I can manage the rebase. But again, @jnothman is this good as a stopgap measure?

@larsmans
Member

Rebased on master, waiting for tests.

@larsmans
Member

Ok, the whole thing segfaults now. Back to the drawing board.

@jnothman
Member

At the moment #2732 only covers the storage in Tree, not any of the Criterion implementations. So if my making the small changes there means we're likely to merge that soon, you can limit the scope of this patch, perhaps.

@glouppe
Member
glouppe commented Jan 11, 2014

@jnothman What are the issues with the criteria? This is private interface which is totally opaque for users. I don't understand why there would be a need to change it...

@jnothman
Member

As I understand, this PR is modifying crtieria mallocing for the most part,
which is completely independent of my PR...

On 11 January 2014 23:22, Gilles Louppe notifications@github.com wrote:

@jnothman https://github.com/jnothman What are the issues with the
criteria? This is private interface which is totally opaque for users. I
don't understand why there would be a need to change it...


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/pull/2715#issuecomment-32094688
.

@glouppe glouppe and 1 other commented on an outdated diff Jan 11, 2014
sklearn/tree/_tree.pyx
self.n_total_samples = X.shape[0]
- self.sample_mask = <SIZE_t*> calloc(self.n_total_samples,
- sizeof(SIZE_t))
@glouppe
glouppe Jan 11, 2014 Member

Where do you allocate sample_mask now? I can't find it in your changes.

@glouppe
glouppe Jan 11, 2014 Member

(I suspect the segfault is due to this omission)

@larsmans
larsmans Jan 11, 2014 Member

Ah, crap, this died during the rebase. Thanks heaps! (Pun intended.)

@larsmans
larsmans Jan 11, 2014 Member

Ok, that pushed the segfault 6 tests further.

@glouppe
Member
glouppe commented Jan 11, 2014

@jnothman Indeed, sorry, I misunderstood.

@larsmans Either way, I think the first of two PRs to be merged will make (some) conflicts with the other. I am fine with merging this one if it is ready in first.

@glouppe
Member
glouppe commented Jan 11, 2014

... and regarding your expected changes on the sorting algorithm, I would rather make that in a new PR. Otherwise, I fear it'll delay the whole thing (both this and @jnothman PR).

@larsmans
Member

Yes, sorting will be a new PR anyway. I'll have to see whether I want to use memoryviews to handle that elegantly.

@larsmans
Member

Got it now. Tests succeed on my box, let's wait for Travis.

@glouppe
Member
glouppe commented Jan 11, 2014

Tests also fail on my box. I get double-free or memory corruption errors.

@larsmans
Member

With the most recent code?

@glouppe
Member
glouppe commented Jan 11, 2014

Yes.

@glouppe glouppe and 1 other commented on an outdated diff Jan 11, 2014
sklearn/tree/_tree.pyx
- free(self.sq_sum_left)
- free(self.sq_sum_right)
- free(self.sq_sum_total)
- free(self.var_left)
- free(self.var_right)
- free(self.sum_left)
- free(self.sum_right)
- free(self.sum_total)
+ if (self.mean_left == NULL
+ or self.mean_right == NULL
+ or self.mean_total == NULL
+ or self.sq_sum_left == NULL
+ or self.sq_sum_right == NULL
+ or self.sq_sum_total == NULL
+ or self.var_left == NULL
+ or self.var_right == NULL):
@glouppe
glouppe Jan 11, 2014 Member

Just noticed that sum_left, sum_right and sum_total are missing in this test. (They also are missing in master)

@larsmans
larsmans Jan 11, 2014 Member

Good catch, fixing.

@larsmans
Member

It's a bug in the new _utils.pyx, I'm on it.

@glouppe glouppe and 1 other commented on an outdated diff Jan 11, 2014
sklearn/tree/_utils.pyx
@@ -65,8 +65,9 @@ cdef class Stack:
# Resize if capacity not sufficient
if top >= self.capacity:
self.capacity *= 2
- self.stack_ = <StackRecord*> realloc(self.stack_, self.capacity * sizeof(StackRecord))
- if self.stack_ == NULL:
+ stack = <StackRecord*> realloc(self.stack_,
+ self.capacity * sizeof(StackRecord))
+ if stack == NULL:
@glouppe
glouppe Jan 11, 2014 Member

You don't set self.stack_.

@larsmans
larsmans Jan 11, 2014 Member

That seems to be the whole issue.

@glouppe glouppe commented on an outdated diff Jan 11, 2014
sklearn/tree/_utils.pyx
@@ -198,10 +199,10 @@ cdef class PriorityHeap:
# Resize if capacity not sufficient
if heap_ptr >= self.capacity:
self.capacity *= 2
- self.heap_ = <PriorityHeapRecord*> realloc(self.heap_,
- self.capacity *
- sizeof(PriorityHeapRecord))
- if self.heap_ == NULL:
+ heap = <PriorityHeapRecord*> realloc(self.heap_,
+ self.capacity *
+ sizeof(PriorityHeapRecord))
+ if heap == NULL:
@glouppe
glouppe Jan 11, 2014 Member

Same for self.heap_.

@larsmans
Member

Ok, story so far: the new DepthFirstTreeBuilder is reading unitialized memory, passing bogus start and end values to Splitter.node_reset which then goes out of bounds.

@glouppe
Member
glouppe commented Jan 11, 2014

Ok, story so far: the new DepthFirstTreeBuilder is reading unitialized memory, passing bogus start and end values to Splitter.node_reset which then goes out of bounds.

See my comments above. Setting self.stack_ and self.heap_ fix the issues on my box. (This is because realloc might return a new pointer value.)

larsmans added some commits Dec 29, 2013
@larsmans larsmans COSMIT tree: unused variable warnings and use for/range
Not recompiled with Cython, as line numbers still match up.
b83e74d
@larsmans larsmans MAINT ignore sklearn/tree/_utils.c in diff f309023
@larsmans larsmans BUG fix unchecked mallocs in trees
All mallocs should be checked now. Some notes for the next person to touch
this code, or do anything with malloc in Cython code:

* with gil: raise MemoryError() didn't work as expected, as a surrounding
  nogil block catches and logs the exception, then proceeds as if nothing
  happened.
* There were several opportunities for double free in the code. This is
  handled by letting the destructor handle the freeing.
* Similarly, constructors failed to set pointers to NULL so __dealloc__
  would try to free unitialized pointers.
* Out of memory might leave trees in an inconsistent state, but will not
  cause a double free. A subsequent successful fit should produce a
  consistent state, although I didn't test this.
* Replaced a free/calloc with a realloc; realloc(NULL, n) = malloc(n).
* free(NULL) is a harmless no-op per ISO C89.

Also performed an optimization: the pre-sort splitter was using size_t for
an array of booleans. unsigned char is faster and smaller.

Finally, fixed some cosmetic issues (long lines).
7fc52ff
@larsmans
Member

Fixed the reassignment bug, let's wait for Travis once more.

@coveralls

Coverage Status

Coverage remained the same when pulling 7fc52ff on larsmans:tree-malloc into c77255f on scikit-learn:master.

@larsmans
Member

Travis is happy :)

@glouppe
Member
glouppe commented Jan 11, 2014

Great! Thanks for making the code safer. +1 for merge.

@larsmans larsmans merged commit 9d5d2da into scikit-learn:master Jan 11, 2014

1 check passed

default The Travis CI build passed
Details
@larsmans
Member

Thanks for the help!

@larsmans larsmans deleted the larsmans:tree-malloc branch Jan 13, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment