[MRG] Sparse input support for decision tree and forest #3173

Merged
merged 24 commits into from
@arjoly
Owner

This pull request aims to finish the work in #2984.
For reference: #655, #2848

TODO list:

  • clean rebase (remove unwanted generated file, github merge commit, ...)
  • simpler input validation for tree base method (safe_array, check_array?)
  • reduce testing times and rationalize tests
  • use the new safe_realloc
  • document fit, ... with the sparse format support
  • fix bug in algorithm selection binary_search
  • raise error if int64 sparse index matrix
  • test oob scores
  • test random tree embedding with sparse input data
  • ensure that X is 2d
  • test forest apply (sparse)
  • test error raised if bootstrap=False and oob_score=True
  • test not fitted and feature_importances_
  • Check warning oob_score
  • Let tree estimator handle int64 based sparse matrices
  • switch xrange -> range
  • Update adaboost for dtype
  • divide sparse and dense splitter correctly
  • Test min weight leaf
  • pull out unrelated improvements
  • SparseSplitter is only (mainly) for csc.
  • rebench the constant

Todo later:

  • update what's new
@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling ee99297 on arjoly:sparse-tree into d9cc662 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 924ee60 on arjoly:sparse-tree into d9cc662 on scikit-learn:master.

@arjoly arjoly referenced this pull request
Closed

Heisen-bug with omp_cv #3190

@coveralls

Coverage Status

Coverage increased (+0.06%) when pulling cce08a3 on arjoly:sparse-tree into 95619a9 on scikit-learn:master.

@arjoly arjoly referenced this pull request in joblib/joblib
Closed

Joblib heisen bug failure #136

@coveralls

Coverage Status

Coverage increased (+0.06%) when pulling cce08a3 on arjoly:sparse-tree into 95619a9 on scikit-learn:master.

@arjoly
Owner

I think this pull request is ready to be reviewed. I have switched it to MRG.

@arjoly arjoly changed the title from [WIP] Sparse input support for decision tree and forest to [MRG] Sparse input support for decision tree and forest
@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 2e9c6db on arjoly:sparse-tree into 95619a9 on scikit-learn:master.

@arjoly
Owner

@fareshedayati I have added your name to the author lists to acknowledge your work.

@arjoly
Owner

@ogrisel Could it be that this failure https://travis-ci.org/scikit-learn/scikit-learn/jobs/26057366 (from commit 2e9c6db) is related to joblib?

@fareshedayati

@arjoly thanks a lot.

@ogrisel
Owner

The timeout failure is probably caused by an overloaded travis worker. I relaunched it to confirm.

sklearn/ensemble/forest.py
@@ -188,10 +193,16 @@ def apply(self, X):
For each datapoint x in X and for each tree in the forest,
return the index of the leaf x ends up in.
"""
- X = array2d(X, dtype=DTYPE)
+ if issparse(X):
+ X, = check_arrays(X, dtype=DTYPE, sparse_format='csr')
+
+ else:
+ X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
@larsmans Owner

Can't we do this in one check_arrays call?

sklearn/ensemble/forest.py
@@ -221,7 +233,12 @@ def fit(self, X, y, sample_weight=None):
random_state = check_random_state(self.random_state)
# Convert data
- X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
+ if issparse(X):
+ X, = check_arrays(X, dtype=DTYPE, sparse_format='csc')
+ X.sort_indices()
@larsmans Owner

Since sort_indices changes the input, don't we need an explicit copy here?

@arjoly Owner
arjoly added a note

Both are fine for me.
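
For reference, a minimal sketch of the explicit-copy variant being discussed (whether the copy is actually needed here depends on how fit documents its side effects):

from scipy.sparse import csc_matrix
import numpy as np

X_user = csc_matrix(np.eye(3))   # the caller's matrix
X = X_user.copy()                # explicit copy keeps the caller's data untouched
X.sort_indices()                 # in-place: only the copy is modified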

sklearn/ensemble/forest.py
@@ -451,8 +471,11 @@ def predict_proba(self, X):
classes corresponds to that in the attribute `classes_`.
"""
# Check data
- if getattr(X, "dtype", None) != DTYPE or X.ndim != 2:
- X = array2d(X, dtype=DTYPE)
+ if issparse(X):
+ X, = check_arrays(X, dtype=DTYPE, sparse_format='csr')
+
+ else:
+ X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
@larsmans Owner

Same as above. If one check_arrays call is impossible, can we use atleast2d_or_csr? Or factor out the logic to a local validation helper?
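
A minimal sketch of the kind of local validation helper suggested here (the name _check_X and the exact conversions are hypothetical, not necessarily what the PR ends up doing):

from scipy.sparse import issparse
import numpy as np

def _check_X(X, sparse_format="csr", dtype=np.float32):
    """Accept dense or sparse input; return float32 data, with sparse input
    converted to the requested format."""
    if issparse(X):
        X = X.asformat(sparse_format)
        if X.data.dtype != dtype:
            X.data = X.data.astype(dtype)
        return X
    return np.atleast_2d(np.asarray(X, dtype=dtype))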

sklearn/tree/_tree.pyx
((170 lines not shown))
+ start_positive[0] = start_positive_
+
+
+cdef inline void extract_nnz(INT32_t* X_indices,
+ DTYPE_t* X_data,
+ INT32_t indptr_start,
+ INT32_t indptr_end,
+ SIZE_t* samples,
+ SIZE_t start,
+ SIZE_t end,
+ SIZE_t* index_to_samples,
+ DTYPE_t* Xf,
+ SIZE_t* end_negative,
+ SIZE_t* start_positive,
+ SIZE_t* sorted_samples,
+ bint* is_samples_sorted) nogil:
@larsmans Owner

This is missing a comment explaining the parameters, or even what it does.

@arjoly Owner
arjoly added a note

I have updated the documentation.

sklearn/tree/_tree.pyx
((179 lines not shown))
+ SIZE_t end,
+ SIZE_t* index_to_samples,
+ DTYPE_t* Xf,
+ SIZE_t* end_negative,
+ SIZE_t* start_positive,
+ SIZE_t* sorted_samples,
+ bint* is_samples_sorted) nogil:
+
+ cdef SIZE_t n_indices = <SIZE_t>(indptr_end - indptr_start)
+ cdef SIZE_t n_samples = end - start
+
+ # Use binary search to if n_samples * log(n_indices) <
+ # n_indices and coloring technique otherwise.
+ # O(n_samples * log(n_indices)) is the running time of binary
+ # search and O(n_indices) is the running time of coloring
+ # technique.
@larsmans Owner

Do we really need all this complexity? Has the alternative (always use coloring) been benchmarked? (If so and it's been discussed, sorry that I missed it, I stopped following the discussion at some point.)

@arjoly Owner
arjoly added a note

This guards against the worst case where n_indices >>> n_node_samples, which mainly occurs near the leaves (bottom of the tree). On a not-so-sparse dataset like covertype, this behavior shows up clearly.

# 20 news group (density ~ 0.0012)
rf = RandomForestClassifier(max_features=0.1, random_state=0)
rf without                88.6212s   0.0903s     0.3586  
rf with 0.1 * n_indices   95.1268s   0.1226s     0.3586  
rf with 0.2 * n_indices   89.6108s   0.1077s     0.3586  

# Covertype (density ~ 0.22)
dt without               436.5677s   0.0164s     0.0426  
dt with 0.1 * n_indices   68.9361s   0.0193s     0.0426 
dt with 0.2 * n_indices   60.6809s   0.0147s     0.0426

Though the 0.1 factor for algorithm switching might be too small.

@larsmans Owner

Convincing.
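
For readers following along, a schematic Python restatement of the switching rule from the diff (the 0.1 threshold is the constant still being re-benchmarked):

from math import log2

def use_binary_search(n_samples, n_indices, is_samples_sorted, threshold=0.1):
    """True when binary search over a feature's non-zeros is expected to beat
    the O(n_indices) index_to_samples ("coloring") scan."""
    sort_cost = 0 if is_samples_sorted else n_samples * log2(n_samples)
    search_cost = n_samples * log2(n_indices)
    return sort_cost + search_cost < threshold * n_indices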

sklearn/tree/_utils.c
@@ -1,4 +1,4 @@
-/* Generated by Cython 0.20.1 on Thu Apr 10 20:15:12 2014 */
+/* Generated by Cython 0.20.1 on Thu May 22 14:53:34 2014 */
@larsmans Owner

_utils.pyx hasn't changed, so you don't need to regenerate this file.

@arjoly Owner
arjoly added a note

Do you know how to revert those modifications using git?

@jnothman Owner

To revert the modifications you could use git checkout <SHA> <PATH> then commit and push that... Or you can eradicate the change from the history by interactive rebasing, amending the relevant commit and force pushing.

@arjoly Owner
arjoly added a note

thanks, done with

git filter-branch --force --index-filter \
'git checkout master sklearn/tree/_utils.c' \
--prune-empty master..sparse-tree
@larsmans Owner

Never mind, that's much smarter than what I would have done :)

@coveralls

Coverage Status

Coverage increased (+0.06%) when pulling 9f05d1c on arjoly:sparse-tree into 95619a9 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling a532cef on arjoly:sparse-tree into daa1dba on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling cd7d576 on arjoly:sparse-tree into daa1dba on scikit-learn:master.

sklearn/tree/tree.py
((7 lines not shown))
+DENSE_SPLITTERS = {"best": _tree.BestSplitter,
+ "presort-best": _tree.PresortBestSplitter,
+ "random": _tree.RandomSplitter}
+
+SPARSE_SPLITTER = {"best": _tree.BestSparseSplitter,
@jnothman Owner

This should be SPARSE_SPLITTERS

@arjoly Owner
arjoly added a note

Thanks, fixed in b1d6030

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling b1d6030 on arjoly:sparse-tree into daa1dba on scikit-learn:master.

@ogrisel
Owner

It seems that this PR is ready for merge. Any more comments @jnothman @larsmans @glouppe?

@glouppe
Owner

I would really like to make a proper review of this. Unfortunately, I am afraid I will be busy for two more weeks on my thesis. If others could have a look, that would be great.

doc/modules/tree.rst
@@ -326,6 +334,13 @@ Tips on practical use
* All decision trees use ``np.float32`` arrays internally.
If training data is not in this format, a copy of the dataset will be made.
+ * If the input matrix X is very sparse, it is highly recommend to convert the
+ matrix into sparse `csc_matrix` format before feeding it to `fit`.
+ For testing, it is recommendable to convert the data into `csr_matrix`
+ format before passing it to `predict` function. Training time is between
+ 10 to 40 times faster with sparse matrix input compared to dense matrix in
+ case of sparse data.
@ogrisel Owner
ogrisel added a note

I think this depends on the number of non-zeros. I would rather say "Training time can be orders of magnitude faster for a sparse matrix input compared to a dense matrix when features have zero values in most of the samples", or something in that spirit but with better phrasing.

@ogrisel Owner
ogrisel added a note

@arjoly WDYT about that suggested change? I am pretty sure that you can get more than a 40x speedup on very, very sparse datasets, so I would rather remove misleading absolute speedup bounds from this paragraph.

@arjoly Owner
arjoly added a note

You are right. I am going to make the proposed modification. I haven't found the time to make it yet. :-)

Thanks for the review!!!

sklearn/ensemble/forest.py
((5 lines not shown))
for estimator in self.estimators_:
- mask = np.ones(n_samples, dtype=np.bool)
+ mask = np.ones((n_samples, ), dtype=np.bool)
@ogrisel Owner
ogrisel added a note

style => (n_samples,)

sklearn/ensemble/tests/test_forest.py
((22 lines not shown))
+
+ if name in FOREST_TRANSFORMERS:
+ assert_array_almost_equal(sparse.transform(X).toarray(),
+ dense.transform(X).toarray())
+ assert_array_almost_equal(sparse.fit_transform(X).toarray(),
+ dense.fit_transform(X).toarray())
+
+
+def test_sparse_input():
+ X, y = datasets.make_multilabel_classification(return_indicator=True,
+ random_state=0,
+ n_samples=40)
+
+ for name, sparse_matrix in product(FOREST_ESTIMATORS,
+ (csr_matrix, csc_matrix)):
+ yield check_sparse_input, name, X, sparse_matrix(X), y
@ogrisel Owner
ogrisel added a note

I don't really like passing data arrays / matrices to yield statements in tests, as the nose verbose output gets very verbose. Wouldn't it be possible to generate the datasets once and for all at the module level (as constants) instead?

@arjoly Owner
arjoly added a note

If you prefer, I could add a dict of (name, bunch).

@ogrisel Owner
ogrisel added a note

You mean a global, module level dict? or passing a dict as argument to the yield statement? The latter will not solve the verbosity problem.

@arjoly Owner
arjoly added a note

I meant a global variable as ALL_TREES.

@ogrisel Owner
ogrisel added a note

alright.

@ogrisel Owner
ogrisel added a note

I just checked and the output is not that verbose, so I am fine with leaving those tests as they are.

@arjoly Owner
arjoly added a note

Anyway, I will do it. Whenever you try to find the slowest tests with nose-timer, the output is too verbose. :-(

@amueller Owner

You can also overwrite the output produced, so it doesn't show the whole arrays. From the docs:

By default, the test name output for a generated test in verbose mode will be the name of the generator function or method, followed by the args passed to the yielded callable. If you want to show a different test name, set the description attribute of the yielded callable.

So you can do something like

check_sparse_input.description = "check_sparse_input"

to get rid of the verbosity.

@ogrisel
Owner

@glouppe I am fine with waiting for the end of your thesis prior to merging this if you wish.

@glouppe
Owner

@ogrisel I would prefer that, but on the other hand, I don't want to delay the PR any further. In particular, @arjoly, is this a roadblock for your student's GSoC?

@arjoly
Owner

In particular, @arjoly, is this a roadblock for your student's GSoC?

Yes, this is a blocker. Once this pull request is merged, @hamsal will be able to tackle sparse input support for gradient boosting (the last estimator of the ensemble module without sparse input support).

@ogrisel
Owner

+1 for merging on my side once the suggested doc fix is in. @larsmans @jnothman any further comments?

Maybe @ndawe would also be interested in having a look at this PR?

@jnothman
Owner

Sorry, I'm not really able to put aside time for a full review atm.

sklearn/tree/_tree.pyx
((40 lines not shown))
+ DTYPE_t* X_data,
+ INT32_t indptr_start,
+ INT32_t indptr_end,
+ SIZE_t* samples,
+ SIZE_t start,
+ SIZE_t end,
+ SIZE_t* index_to_samples,
+ DTYPE_t* Xf,
+ SIZE_t* end_negative,
+ SIZE_t* start_positive) nogil:
+ """intersection between X_indices[indptr_start:indptr_end]
+ and samples[start:end] using a index_to_samples approach.
+
+ Complexity is O(indptr_end - indptr_start).
+ """
+ cdef INT32_t k_
@jnothman Owner

Why is the underscore helpful? I find it vastly reduces readability.

sklearn/tree/_tree.pyx
((42 lines not shown))
+ INT32_t indptr_end,
+ SIZE_t* samples,
+ SIZE_t start,
+ SIZE_t end,
+ SIZE_t* index_to_samples,
+ DTYPE_t* Xf,
+ SIZE_t* end_negative,
+ SIZE_t* start_positive) nogil:
+ """intersection between X_indices[indptr_start:indptr_end]
+ and samples[start:end] using a index_to_samples approach.
+
+ Complexity is O(indptr_end - indptr_start).
+ """
+ cdef INT32_t k_
+ cdef SIZE_t index
+ cdef SIZE_t end_negative_ = start
@jnothman Owner

Very minor point, but end_negative_, start_positive_ seem very long names for frequently-used local variables.

sklearn/tree/_tree.pyx
((49 lines not shown))
+ SIZE_t* start_positive) nogil:
+ """intersection between X_indices[indptr_start:indptr_end]
+ and samples[start:end] using a index_to_samples approach.
+
+ Complexity is O(indptr_end - indptr_start).
+ """
+ cdef INT32_t k_
+ cdef SIZE_t index
+ cdef SIZE_t end_negative_ = start
+ cdef SIZE_t start_positive_ = end
+
+ for k_ in range(indptr_start, indptr_end):
+ if start <= index_to_samples[X_indices[k_]] < end:
+ if X_data[k_] > 0:
+ start_positive_ -= 1
+ Xf[start_positive_] = X_data[k_]
@jnothman Owner

It looks like this passage of code is repeated four times in extra_nnz_index_to_samples and extract_nnz_binary_search, sometimes with formatting inconsistencies. A private inline function like _swap_samples(k_, start_positive_, X_indices, X_data, Xf, samples, index_to_samples) might be helpful, although many of those arguments could be dropped if extract_nnz and its subsidiaries were members of SparseSplitter (with tiny-if-any overhead for dereferencing self->member).

sklearn/tree/_tree.pyx
((52 lines not shown))
+
+ Complexity is O(indptr_end - indptr_start).
+ """
+ cdef INT32_t k_
+ cdef SIZE_t index
+ cdef SIZE_t end_negative_ = start
+ cdef SIZE_t start_positive_ = end
+
+ for k_ in range(indptr_start, indptr_end):
+ if start <= index_to_samples[X_indices[k_]] < end:
+ if X_data[k_] > 0:
+ start_positive_ -= 1
+ Xf[start_positive_] = X_data[k_]
+ index = index_to_samples[X_indices[k_]]
+
+ tmp = samples[index]
@jnothman Owner

I would find something like the following clearer to read, and I'm pretty sure Cython recognises this notation.

                samples[index], samples[start_positive_] = samples[start_positive_], samples[index]
sklearn/tree/_tree.pyx
((178 lines not shown))
+
+
+cdef inline void extract_nnz(INT32_t* X_indices,
+ DTYPE_t* X_data,
+ INT32_t indptr_start,
+ INT32_t indptr_end,
+ SIZE_t* samples,
+ SIZE_t start,
+ SIZE_t end,
+ SIZE_t* index_to_samples,
+ DTYPE_t* Xf,
+ SIZE_t* end_negative,
+ SIZE_t* start_positive,
+ SIZE_t* sorted_samples,
+ bint* is_samples_sorted) nogil:
+ """ Extract non zero values of X (csc format) in samples[start:end]
@jnothman Owner

This should note that it is extracting and partitioning values for a particular feature, and should say into not in samples[start:end]. The subsidiary functions say they do "Intersection" which is not helpful.

sklearn/tree/_tree.pyx
((56 lines not shown))
+ cdef SIZE_t index
+ cdef SIZE_t end_negative_ = start
+ cdef SIZE_t start_positive_ = end
+
+ for k_ in range(indptr_start, indptr_end):
+ if start <= index_to_samples[X_indices[k_]] < end:
+ if X_data[k_] > 0:
+ start_positive_ -= 1
+ Xf[start_positive_] = X_data[k_]
+ index = index_to_samples[X_indices[k_]]
+
+ tmp = samples[index]
+ samples[index] = samples[start_positive_]
+ samples[start_positive_] = tmp
+ index_to_samples[samples[index]] = index
+ index_to_samples[samples[start_positive_]] = start_positive_
@jnothman Owner

I haven't got to reviewing tests, but please make sure all sparse tree splitters are tested for sparse matrices containing explicit 0 values. You might need >= instead of > for example.

@arjoly Owner
arjoly added a note

I have added a test for explicit zeros and it works. See 9cf6687

@arjoly
Owner

@jnothman Thanks Joel! I will have time next week to address your suggestions.

@jnothman
Owner

Looking further through the code, I think there's a lot of repetition, much of which is likely to impede maintainability/readability more than it saves in runtime. For example, at jnothman@3ed1325, I have factored out the (pos,feature,threshold,improvement,impurity_left,impurity_right) data into a struct (SplitInfo), which deletes 60 lines of repetitious Cython and 360 lines of generated C. Splitter.node_split could also take a pointer to one of these structs, but that would be changing a somewhat-public interface, so I have not implemented it.

sklearn/tree/_tree.pyx
((280 lines not shown))
+ cdef double current_threshold
+
+ cdef SIZE_t f_i = n_features
+ cdef SIZE_t f_j, p, tmp
+ cdef SIZE_t n_visited_features = 0
+ # Number of features discovered to be constant during the split search
+ cdef SIZE_t n_found_constants = 0
+ # Number of features known to be constant and drawn without replacement
+ cdef SIZE_t n_drawn_constants = 0
+ cdef SIZE_t n_known_constants = n_constant_features[0]
+ # n_total_constants = n_known_constants + n_found_constants
+ cdef SIZE_t n_total_constants = n_known_constants
+ cdef DTYPE_t current_feature_value
+ cdef SIZE_t partition_end
+
+ cdef SIZE_t k_
@jnothman Owner

k_ is unused

sklearn/tree/_tree.pyx
((550 lines not shown))
+ cdef SIZE_t f_j, p, tmp
+ cdef SIZE_t n_visited_features = 0
+ # Number of features discovered to be constant during the split search
+ cdef SIZE_t n_found_constants = 0
+ # Number of features known to be constant and drawn without replacement
+ cdef SIZE_t n_drawn_constants = 0
+ cdef SIZE_t n_known_constants = n_constant_features[0]
+ # n_total_constants = n_known_constants + n_found_constants
+ cdef SIZE_t n_total_constants = n_known_constants
+ cdef SIZE_t partition_end
+
+ cdef DTYPE_t min_feature_value
+ cdef DTYPE_t max_feature_value
+
+ cdef SIZE_t k_
+ cdef SIZE_t p_next
@jnothman Owner

k_ and p_next are unused.

sklearn/tree/_tree.pyx
((260 lines not shown))
+ cdef DTYPE_t* Xf = self.feature_values
+ cdef SIZE_t* sorted_samples = self.sorted_samples
+ cdef SIZE_t* index_to_samples = self.index_to_samples
+ cdef SIZE_t max_features = self.max_features
+ cdef SIZE_t min_samples_leaf = self.min_samples_leaf
+ cdef UINT32_t* random_state = &self.rand_r_state
+
+ cdef double best_impurity_left = INFINITY
+ cdef double best_impurity_right = INFINITY
+ cdef SIZE_t best_pos = end
+ cdef SIZE_t best_feature = 0
+ cdef double best_threshold = 0.
+ cdef double best_improvement = -INFINITY
+
+ cdef double current_improvement
+ cdef double current_impurity
@jnothman Owner

current_impurity is unused

sklearn/tree/_tree.pyx
((527 lines not shown))
+ cdef SIZE_t* sorted_samples = self.sorted_samples
+ cdef SIZE_t* index_to_samples = self.index_to_samples
+ cdef SIZE_t max_features = self.max_features
+ cdef SIZE_t min_samples_leaf = self.min_samples_leaf
+ cdef UINT32_t* random_state = &self.rand_r_state
+
+ cdef double best_impurity_left = INFINITY
+ cdef double best_impurity_right = INFINITY
+ cdef SIZE_t best_pos = end
+ cdef SIZE_t best_feature = 0
+ cdef double best_threshold = 0.
+ cdef double best_improvement = -INFINITY
+
+ cdef DTYPE_t current_feature_value
+ cdef double current_improvement
+ cdef double current_impurity
@jnothman Owner

current_impurity is unused

@jnothman
Owner

PS: I realise these refactors seem a little off-topic given the hurry to merge this and other GSOC-related PRs, but I can't understand/verify the code without doing them (mentally at least), so I might as well offer them back.

sklearn/tree/_tree.pyx
((669 lines not shown))
-cdef class Splitter:
- def __cinit__(self, Criterion criterion, SIZE_t max_features,
- SIZE_t min_samples_leaf, object random_state):
- self.criterion = criterion
+ # Draw a random threshold
@jnothman Owner

Please fix indentation.

@jnothman Owner

Also, everything from here to the end of the best-split update could be refactored together with the dense RandomSplitter: the code is identical except for the partition, which can be factored out. I imagine that Cython will repeat the implementation anyway as part of the inheritance (which is good for inlining, but bad for the number of lines of C code). The most annoying part is that the dense version would need to include dummy values for start_positive and end_negative because C doesn't do closures.

So I'm not certain this is worthwhile, but it may be worth considering. Once _partition is factored out, it might be no big deal.

sklearn/tree/_tree.pyx
((679 lines not shown))
- self.samples = NULL
- self.n_samples = 0
- self.features = NULL
- self.n_features = 0
- self.feature_values = NULL
+ if current_threshold == max_feature_value:
@jnothman Owner

The implementation of rand_double is commented to say it generates in the range [0; 1), so this shouldn't be necessary. I think it's a good idea to reimplement rand_int and rand_double to take (lo, hi, random_state), seeing as this is how it is used in the module. Since it's inline, the lo == 0 case should be handled by the optimizer.

@jnothman Owner

If you'd rather I put that into a separate PR, I can.

@arjoly Owner
arjoly added a note

+1 for a separate PR to keep this one focused.

@arjoly Owner
arjoly added a note

In the end, I will do it myself.

@jnothman Owner

It's also fine if you focus on the remainder of this PR and ignore that cosmit.

@jnothman
Owner

Apart from the minor changes I have noted or contributed, _tree.pyx appears correct to me (but I am relying on the fact that the tests pass to some extent, as I haven't validated every update to index_to_samples, etc., nor have I worked out exactly when they are necessary). I will have a further look at the tests and the docs shortly. Then I will try to look at @hamsal's PRs.

@jnothman
Owner

Also, this PR includes an apparently unnecessary recompilation of _gradient_boosting.c.

sklearn/tree/_tree.pyx
((9 lines not shown))
# check if dtype is correct
- if X.dtype != DTYPE:
+ if issparse(X):
@jnothman Owner

Now that it is more substantial, could you please factor out this data validation?

@jnothman jnothman commented on the diff
sklearn/tree/_tree.pyx
((33 lines not shown))
+ for i in range(n_samples):
+ node = self.nodes
+ # While node not a leaf
+ while node.left_child != _TREE_LEAF:
+ # ... and node.right_child != _TREE_LEAF:
+ if X_ptr[X_sample_stride * i +
+ X_fx_stride * node.feature] <= node.threshold:
+ node = &self.nodes[node.left_child]
+ else:
+ node = &self.nodes[node.right_child]
+
+ out_ptr[i] = <SIZE_t>(node - self.nodes) # node offset
+
+ return out
+
+ cdef inline np.ndarray _apply_sparse_csr(self, object X):
@jnothman Owner

Please assert X.format == 'csr' in case someone decides to run this directly with a csc.

@arjoly Owner
arjoly added a note

Done with some input checks.

sklearn/tree/_tree.pyx
((58 lines not shown))
+ cdef INT32_t* X_indptr = <INT32_t*>X_indptr_ndarray.data
+
+ cdef SIZE_t n_samples = X.shape[0]
+ cdef SIZE_t n_features = X.shape[1]
+
+ # Initialize output
+ cdef np.ndarray[SIZE_t, ndim=1] out = np.zeros((n_samples,),
+ dtype=np.intp)
+ cdef SIZE_t* out_ptr = <SIZE_t*> out.data
+
+ # Initialize auxiliary data-structure
+ cdef DTYPE_t feature_value = 0
+ cdef Node* node = NULL
+
+ cdef DTYPE_t* X_sample = NULL
+ cdef SIZE_t* feature_to_sample = NULL
@jnothman Owner

These are not especially clear names, so they at least deserve a comment, if not renaming. feature_to_sample as a data structure records the last seen sample for each feature; functionally, it is an efficient way to identify which features are nonzero in the present sample.
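
A hypothetical pure-Python illustration of the feature_to_sample bookkeeping described above (the helper itself is made up; only the array names follow the diff):

import numpy as np
from scipy.sparse import csr_matrix

def iter_dense_lookups(X_csr):
    """For each CSR row, yield a function giving the dense value of any feature,
    without resetting a buffer between rows."""
    n_samples, n_features = X_csr.shape
    feature_to_sample = np.full(n_features, -1, dtype=np.intp)
    X_sample = np.zeros(n_features, dtype=X_csr.dtype)
    for i in range(n_samples):
        for k in range(X_csr.indptr[i], X_csr.indptr[i + 1]):
            feature_to_sample[X_csr.indices[k]] = i
            X_sample[X_csr.indices[k]] = X_csr.data[k]
        # feature f is non-zero in row i iff feature_to_sample[f] == i;
        # stale entries left over from earlier rows read as implicit zeros.
        yield i, (lambda f, i=i: X_sample[f] if feature_to_sample[f] == i else 0.0)

X = csr_matrix(np.array([[0., 2., 0.], [1., 0., 0.]]))
for i, value_of in iter_dense_lookups(X):
    print(i, [value_of(f) for f in range(3)])   # [0.0, 2.0, 0.0] then [1.0, 0.0, 0.0]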

sklearn/ensemble/forest.py
((11 lines not shown))
for estimator in self.estimators_:
- mask = np.ones(n_samples, dtype=np.bool)
+ mask = np.ones((n_samples, ), dtype=np.bool)
@jnothman Owner

Even if you prefer a tuple here, I'm not sure why there should be a space between , and )

@jnothman Owner

Ah. Olivier caught this below.

sklearn/ensemble/tests/test_forest.py
@@ -231,14 +249,51 @@ def check_oob_score(name, X, y, n_estimators=20):
assert_greater(test_score, est.oob_score_)
assert_greater(est.oob_score_, .8)
+ # Check warning if not enought estimator
@jnothman Owner

"enough"

@jnothman Owner

"estimators"

sklearn/ensemble/tests/test_forest.py
((21 lines not shown))
+ for name in FOREST_REGRESSORS:
+ yield check_oob_score, name, boston.data, boston.target, 50
+
+
+def check_oob_score_raise_error(name):
+ ForestEstimator = FOREST_ESTIMATORS[name]
+
+ if name in FOREST_TRANSFORMERS:
+ for oob_score in [True, False]:
+ assert_raises(TypeError, ForestEstimator, oob_score=oob_score)
+
+ assert_raises(NotImplementedError, ForestEstimator()._set_oob_score,
+ X, y)
+
+ else:
+ # Unfitted / no bootrapt / no oob_score
@jnothman Owner

"bootstrap"

doc/modules/tree.rst
@@ -195,6 +199,10 @@ instead of integer values::
>>> clf.predict([[1, 1]])
array([ 0.5])
+Note that sparsity is also supported by :class:`DecisionTreeClassifier` as in
@jnothman Owner

what do you mean by "as in classification"?

@jnothman
Owner

I have completed my review, and apart from that backlog of comments, and a glorious what's new entry, this is looking great.

sklearn/ensemble/forest.py
@@ -179,8 +186,9 @@ def apply(self, X):
Parameters
----------
- X : array-like, shape = [n_samples, n_features]
- Input data.
+ X : array-like or sparse matrix, shape = [n_samples, n_features]
+ Input data. Use csr sparse matrix and ``dtype=np.float32``
@amueller Owner

the np.float32 comment also applies to dense data, right?

@arjoly Owner
arjoly added a note

Yes. Thanks @amueller for reviewing!

@vene Owner
vene added a note

I'd say "sparse csr_matrix" instead of "csr sparse matrix".

sklearn/ensemble/forest.py
((14 lines not shown))
mask[estimator.indices_] = False
+ mask = sample_indices[mask]
@amueller Owner

nitpick: I would call this indices or mask_indices as I would expect mask to always be a boolean.

sklearn/ensemble/forest.py
((9 lines not shown))
Returns
-------
y : array of shape = [n_samples] or [n_samples, n_outputs]
The predicted classes.
"""
- n_samples = len(X)
@amueller Owner

FYI, there is a helper function to get the number of samples in a safe way in utils ;) I think it is called _num_samples or something. Your solution is obviously also fine.

sklearn/ensemble/forest.py
((8 lines not shown))
mask[estimator.indices_] = False
+ mask = sample_indices[mask]
@amueller Owner

same as a above.

sklearn/tree/tests/test_tree.py
((137 lines not shown))
+
+
+def check_explicit_sparse_zeros(tree, dataset, max_depth=5):
+ TreeEstimator = ALL_TREES[tree]
+ X = DATASETS[dataset]["X"]
+ y = DATASETS[dataset]["y"]
+ X_sparse = DATASETS[dataset]["X_sparse"]
+
+ n_samples, n_features = X_sparse.shape
+ n_explicit_zeros = n_samples * n_features / 5
+
+ random_state = check_random_state(0)
+ i = random_state.randint(0, n_samples, size=(n_explicit_zeros,))
+ j = random_state.randint(0, n_features, size=(n_explicit_zeros,))
+
+ X_sparse[i, j] = 0.
@jnothman Owner

Unfortunately, this doesn't ensure that a 0 ends up in X_sparse.data, which is what we want. (To be more precise, scipy explicit zero support is largely undocumented and untested, so we can't be sure whether this works, and it currently does for some sparse matrix types and does not for others!)

To be sure, modify X_sparse.data directly (or construct the matrix with the (data, indices, indptr) containing 0s).

@jnothman Owner

Also, I think you might need to make a copy of what you pull out of DATASETS or you will affect other tests.

@ogrisel Owner
ogrisel added a note

You have the X_sparse.eliminate_zeros() method (at least on CSR, I just checked).

@jnothman Owner

Yes, I don't think it should be necessary to use eliminate_zeros in the tree implementation. But we need to test that explicit zeros don't break things in any case.

@ogrisel Owner
ogrisel added a note

Ah alright I read too quickly and understood the opposite.
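
A small sketch of the construction jnothman suggests, with a zero placed directly in the data array so it cannot be silently dropped:

import numpy as np
from scipy.sparse import csr_matrix

data = np.array([1.0, 0.0, 2.0], dtype=np.float32)   # note the explicit 0.0
indices = np.array([0, 2, 1], dtype=np.int32)
indptr = np.array([0, 2, 3], dtype=np.int32)
X = csr_matrix((data, indices, indptr), shape=(2, 3))
assert 0.0 in X.data   # the zero is stored explicitly, not eliminated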

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 6230d0e on arjoly:sparse-tree into 5cab6c9 on scikit-learn:master.

doc/modules/tree.rst
@@ -326,6 +325,11 @@ Tips on practical use
* All decision trees use ``np.float32`` arrays internally.
If training data is not in this format, a copy of the dataset will be made.
+ * If the input matrix X is very sparse, it is recommended to convert
+ the array into a sparse `csc_matrix` format applying `fit` and `csr_matrix`
+ format before predicting. Training time can be orders of magnitude faster
+ for a sparse matrix input compare to a dense matrix when features have
@glouppe Owner
glouppe added a note

compared

doc/modules/tree.rst
@@ -326,6 +325,11 @@ Tips on practical use
* All decision trees use ``np.float32`` arrays internally.
If training data is not in this format, a copy of the dataset will be made.
+ * If the input matrix X is very sparse, it is recommended to convert
+ the array into a sparse `csc_matrix` format applying `fit` and `csr_matrix`
+ format before predicting. Training time can be orders of magnitude faster
@glouppe Owner
glouppe added a note

"it is recommended to convert the array into a sparse csc_matrix format applying fit and csr_matrix format before predicting"

The sentence is confusing. I would say in simpler terms: It is recommended to convert the array into a sparse csc_matrix format before calling fit or predict.

@glouppe Owner
glouppe added a note

Hmm never mind, I got it now, CSC is recommended for fitting and CSR for predictions. The difference is subtle and should be highlighted more clearly.

@vene Owner
vene added a note

Why not: csc_matrix before calling fit and csr_matrix before calling predict?
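
A short usage sketch of that recommendation (assuming the sparse support added by this PR): convert to CSC before fit and to CSR before predict.

import numpy as np
from scipy.sparse import csc_matrix, csr_matrix
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
X[np.abs(X) < 1.0] = 0.0                   # make the data mostly zeros

clf = DecisionTreeClassifier(random_state=0)
clf.fit(csc_matrix(X), y)                  # CSC is the efficient format for fit
pred = clf.predict(csr_matrix(X))          # CSR is the efficient format for predict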

sklearn/ensemble/forest.py
@@ -34,28 +34,35 @@ class calls the ``fit`` method of each sub-estimator on random samples
# Authors: Gilles Louppe <g.louppe@gmail.com>
# Brian Holt <bdholt1@gmail.com>
+# Joly Arnaud <arnaud.v.joly@gmail.com>
+# Fares Hedayati
@glouppe Owner
glouppe added a note

no email for Fares?

sklearn/ensemble/forest.py
((19 lines not shown))
from ..base import ClassifierMixin, RegressorMixin
from ..externals.joblib import Parallel, delayed
from ..externals import six
-from ..externals.six.moves import xrange
+from ..externals.six.moves import xrange as range
@glouppe Owner
glouppe added a note

Is this necessary at all? We could drop this import.

@arjoly Owner
arjoly added a note

+1

sklearn/tree/_tree.pyx
@@ -10,21 +10,26 @@
# Lars Buitinck <L.J.Buitinck@uva.nl>
# Arnaud Joly <arnaud.v.joly@gmail.com>
# Joel Nothman <joel.nothman@gmail.com>
+# Fares Hedayati
@glouppe Owner
glouppe added a note
  • email
sklearn/tree/_tree.pyx
((20 lines not shown))
+ np.ndarray[DOUBLE_t, ndim=2, mode="c"] y,
+ DOUBLE_t* sample_weight) except *:
+ """Initialize the splitter."""
+
+ # Call parent init
+ Splitter.init(self, X, y, sample_weight)
+
+ # Initialize X
+ cdef np.ndarray X_ndarray = X
+
+ self.X = <DTYPE_t*> X_ndarray.data
+ self.X_sample_stride = <SIZE_t> X.strides[0] / <SIZE_t> X.itemsize
+ self.X_fx_stride = <SIZE_t> X.strides[1] / <SIZE_t> X.itemsize
+
+
+cdef class SparseSplitter(Splitter):
@glouppe Owner
glouppe added a note

I would put all sparse splitters after dense splitters. They are mixed right now in the source code.

@amueller Owner

I would add at least a comment that this is for CSC only.

sklearn/tree/_tree.pyx
((273 lines not shown))
+ features[f_j] = features[n_total_constants]
+ features[n_total_constants] = current.feature
+
+ n_found_constants += 1
+ n_total_constants += 1
+
+ else:
+ f_i -= 1
+ features[f_i], features[f_j] = features[f_j], features[f_i]
+
+ # Evaluate all splits
+ self.criterion.reset()
+ p = start
+
+ while p < end:
+ p_next = (p + 1 if p + 1 != end_negative
@glouppe Owner
glouppe added a note

these notations are confusing without checking the precedence of operations. Is it (p+1) if or p + (1 if ...)?

@arjoly Owner
arjoly added a note

The ternary operator has the second-lowest precedence of all Python operators (only lambda binds less tightly).

Thus, it is equivalent to (p + 1) if ...
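
A quick check of that claim with hypothetical values:

p, end_negative, start_positive = 3, 7, 9
p_next = p + 1 if p + 1 != end_negative else start_positive
assert p_next == ((p + 1) if (p + 1 != end_negative) else start_positive)   # both give 4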

sklearn/tree/tree.py
@@ -9,21 +9,27 @@
# Noel Dawe <noel@dawe.me>
# Satrajit Gosh <satrajit.ghosh@gmail.com>
# Joly Arnaud <arnaud.v.joly@gmail.com>
+# Fares Hedayati
@glouppe Owner
glouppe added a note
  • email
@larsmans Owner

Not required, it's in the commit log. In fact, I've been thinking of removing my own email address from the source to prevent people from contacting me personally for support.

@GaelVaroquaux Owner
@arjoly Owner
arjoly added a note

Maybe this should be done uniformly everywhere. We have the convention, for now, that it's in every source file.

sklearn/tree/tree.py
((6 lines not shown))
if check_input:
- X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
+ X = atleast2d_or_csc(X, dtype=DTYPE)
+ if issparse(X):
+ X.sort_indices()
+
+ if X.indices.dtype != np.int32 or X.indptr.dtype != np.int32:
+ raise ValueError("64 bit index based sparse matrix are "
+ "not supported")
@glouppe Owner
glouppe added a note

matrices

sklearn/tree/tree.py
((9 lines not shown))
Returns
-------
y : array of shape = [n_samples] or [n_samples, n_outputs]
The predicted classes, or the predict values.
"""
- if getattr(X, "dtype", None) != DTYPE or X.ndim != 2:
- X = array2d(X, dtype=DTYPE)
+ X = atleast2d_or_csr(X, dtype=DTYPE)
+ if issparse(X) and (X.indices.dtype != np.int32 or
+ X.indptr.dtype != np.int32):
+ raise ValueError("No support for int64 index based sparse "
+ "matrix ")
@glouppe Owner
glouppe added a note

Use the same error message as in fit, for consistency.

sklearn/tree/tree.py
@@ -505,8 +529,12 @@ def predict_proba(self, X):
The class probabilities of the input samples. The order of the
classes corresponds to that in the attribute `classes_`.
"""
- if getattr(X, "dtype", None) != DTYPE or X.ndim != 2:
- X = array2d(X, dtype=DTYPE)
+ X = atleast2d_or_csr(X, dtype=DTYPE)
+ if issparse(X) and (X.indices.dtype != np.int32 or
+ X.indptr.dtype != np.int32):
+ raise ValueError("No support for int64 index based sparse "
@glouppe Owner
glouppe added a note

Same here

@larsmans Owner

Strictly speaking the type should be np.intc, not np.int32; scipy.sparse uses a plain C int.

sklearn/ensemble/weight_boosting.py
@@ -1096,7 +1132,12 @@ def staged_predict(self, X):
The predicted regression values.
"""
self._check_fitted()
- X = safe_asarray(X)
+ if (self.base_estimator is None or
+ isinstance(self.base_estimator,
+ (BaseDecisionTree, BaseForest))):
+ X = atleast2d_or_csr(X, dtype=DTYPE)
+ else:
+ X = safe_asarray(X)
@glouppe Owner
glouppe added a note

This code is duplicated 6 times. Could you factor it out within a _validate_X method? (Just like we have _validate_y in forest)

@vene Owner
vene added a note

+1, and I'd comment that this is about which base estimators support sparse data (hope I'm not reading the code wrong)

@amueller Owner

Actually I'm not sure what the reason for this is. I thought it was for converting csc to csr only once, and not in every tree?

@arjoly Owner
arjoly added a note

This is the idea, but the staged_* functions are exposed to the user, so we have to ensure the input has the proper format.

sklearn/tree/_tree.pyx
((58 lines not shown))
+ def __dealloc__(self):
+ """Deallocate memory"""
+ free(self.index_to_samples)
+ free(self.sorted_samples)
+
+ cdef void init(self,
+ object X,
+ np.ndarray[DOUBLE_t, ndim=2, mode="c"] y,
+ DOUBLE_t* sample_weight) except *:
+ """Initialize the splitter."""
+
+ # Call parent init
+ Splitter.init(self, X, y, sample_weight)
+
+ if not isinstance(X, csc_matrix):
+ raise ValueError("X should in csc format")
@glouppe Owner
glouppe added a note

+be

sklearn/tree/_tree.pyx
((6 lines not shown))
np.ndarray sample_weight=None):
"""Build a decision tree from the training set (X, y)."""
pass
+ cdef inline _check_input(self, object X, np.ndarray y,
+ np.ndarray sample_weight):
+ """Check input dtype, layout and format"""
+ if issparse(X):
+ X = X.tocsc()
+ X.sort_indices()
+
+ if X.data.dtype != DTYPE:
+ X.data = np.ascontiguousarray(X.data, dtype=DTYPE)
+
+ if X.indices.dtype != np.int32 or X.indptr.dtype != np.int32:
+ raise ValueError("64 bit index based sparse matrix are not "
+ "supported")
@glouppe Owner
glouppe added a note

same here, please be consistent with the error messages

sklearn/tree/_tree.pyx
((90 lines not shown))
+ safe_realloc(&self.index_to_samples, n_total_samples * sizeof(SIZE_t))
+ safe_realloc(&self.sorted_samples, n_samples * sizeof(SIZE_t))
+
+ cdef SIZE_t* index_to_samples = self.index_to_samples
+ cdef SIZE_t p
+ for p in range(n_total_samples):
+ index_to_samples[p] = -1
+
+ for p in range(n_samples):
+ index_to_samples[samples[p]] = p
+
+ cdef inline SIZE_t _partition(self, DTYPE_t* Xf, double threshold,
+ SIZE_t start, SIZE_t end,
+ SIZE_t end_negative, SIZE_t start_positive,
+ SIZE_t zero_pos) nogil:
+ """Perform a partition of samples based on the threshold"""
@glouppe Owner
glouppe added a note

I would rather say: "Partition samples[start:end] based on threshold."

sklearn/tree/_tree.pyx
((186 lines not shown))
+ Parameters
+ ----------
+ X_indices : c-array of INT32_t,
+ Indices of the csc matrix which are in sorted order
+
+ X_data : c-array of INT32_t,
+ Data of the csc matrix
+
+ indptr_start, indptr_end : INT32_t,
+ indptr_start, indptr_end = X_indptr[feature], X_indptr[feature + 1]
+ where X_indptr would be the indptr of the csc matrix.
+
+ samples, start, end : c-array of SIZE_t, SIZE_t, SIZE_t
+ samples[start:end] is the subset of samples to split
+
+ index_to_samples : c-arrayof SIZE_t,
@glouppe Owner
glouppe added a note

missing whitespace

sklearn/tree/_tree.pyx
((209 lines not shown))
+ sorted_samples, is_samples_sorted : c-array of SIZE_t, bint,
+ If is_samples_sorted, then sorted_samples[start:end] will be the sorted
+ version of samples[start:end], else is_samples_sorted is set to True
+ and samples[start:end]
+
+ """
+ cdef SIZE_t n_indices = <SIZE_t>(indptr_end - indptr_start)
+ cdef SIZE_t n_samples = end - start
+
+ # Use binary search if n_samples * log(n_indices) <
+ # n_indices and index_to_samples approach otherwise.
+ # O(n_samples * log(n_indices)) is the running time of binary
+ # search and O(n_indices) is the running time of coloring
+ # technique.
+ if ((1 - is_samples_sorted[0]) * n_samples * log(n_samples) +
+ n_samples * log(n_indices) < 0.1 * n_indices):
@glouppe Owner
glouppe added a note

How did you set the 0.1 constant? What is the impact of this value?

@arjoly Owner
arjoly added a note

I benchmarked a bit on 20 newsgroups to get that value.

@arjoly Owner
arjoly added a note

Though this might be optimised further.
Here is the benchmark I made to address one of @larsmans's comments.

# 20 news group (density ~ 0.0012)
rf = RandomForestClassifier(max_features=0.1, random_state=0)
rf without                88.6212s   0.0903s     0.3586  
rf with 0.1 * n_indices   95.1268s   0.1226s     0.3586  
rf with 0.2 * n_indices   89.6108s   0.1077s     0.3586  

# Covertype (density ~ 0.22)
dt without               436.5677s   0.0164s     0.0426  
dt with 0.1 * n_indices   68.9361s   0.0193s     0.0426 
dt with 0.2 * n_indices   60.6809s   0.0147s     0.0426
@glouppe Owner
glouppe added a note

0.2 appears to be a better value then, doesn't it?

@arjoly Owner
arjoly added a note

True, I will re-bench this constant.

sklearn/tree/_tree.pyx
((87 lines not shown))
+
+
+cdef inline void extract_nnz_binary_search(INT32_t* X_indices,
+ DTYPE_t* X_data,
+ INT32_t indptr_start,
+ INT32_t indptr_end,
+ SIZE_t* samples,
+ SIZE_t start,
+ SIZE_t end,
+ SIZE_t* index_to_samples,
+ DTYPE_t* Xf,
+ SIZE_t* end_negative,
+ SIZE_t* start_positive,
+ SIZE_t* sorted_samples,
+ bint* is_samples_sorted) nogil:
+ """Extracting and partitioning values for a given feature using binary search
@glouppe Owner
glouppe added a note

Extract and partition

sklearn/tree/_tree.pyx
((159 lines not shown))
+
+
+cdef inline void extract_nnz(INT32_t* X_indices,
+ DTYPE_t* X_data,
+ INT32_t indptr_start,
+ INT32_t indptr_end,
+ SIZE_t* samples,
+ SIZE_t start,
+ SIZE_t end,
+ SIZE_t* index_to_samples,
+ DTYPE_t* Xf,
+ SIZE_t* end_negative,
+ SIZE_t* start_positive,
+ SIZE_t* sorted_samples,
+ bint* is_samples_sorted) nogil:
+ """Extracting and partitioning values for a given feature
@glouppe Owner
glouppe added a note

Extract and partition

sklearn/tree/_tree.pyx
((11 lines not shown))
cdef inline double log(double x) nogil:
return ln(x) / ln(2.0)
+
+
+# =============================================================================
+# Non zero value extraction with sparse matrices
+# =============================================================================
+cdef int compare_SIZE_t(const void * a, const void * b) nogil:
@glouppe Owner
glouppe added a note

don't insert whitespaces before *

@glouppe
Owner

I just did a first round of review. Apart from my comments above, mostly regarding style consistency, I don't have much to say about the content itself. It is quite understandable and appears to do the job efficiently.

When comparing Random Forests and Extra-Trees with linear models in document_classification_20newsgroups.py, the performance of forests appears to be worse than that of linear models (from 0 to 10% worse in terms of f1-score) while taking a lot longer to build (20-40s for 100 trees on my machine, vs. less than 1s for all linear models). I am not that surprised though; I did not expect forests to do very well on this kind of data. But at least they are nearly as good. I'm looking forward to seeing GBRT in action though!

All in all, once my comments are fixed, I am +1 for merge.

sklearn/ensemble/forest.py
@@ -199,8 +209,9 @@ def fit(self, X, y, sample_weight=None):
Parameters
----------
- X : array-like of shape = [n_samples, n_features]
- The training input samples.
+ X : array-like or sparse matrix of shape = [n_samples, n_features]
+ The training input samples. Use csc sparse matrix with
@vene Owner
vene added a note

Same comment (Andy's should also apply)

sklearn/tree/_tree.pyx
((338 lines not shown))
+ # element in features[:n_known_constants] must be preserved for sibling
+ # and child nodes
+ memcpy(features, constant_features, sizeof(SIZE_t) * n_known_constants)
+
+ # Copy newly found constant features
+ memcpy(constant_features + n_known_constants,
+ features + n_known_constants,
+ sizeof(SIZE_t) * n_found_constants)
+
+ # Return values
+ split[0] = best
+ n_constant_features[0] = n_total_constants
+
+
+cdef class RandomSparseSplitter(SparseSplitter):
+ """Splitter for finding the best split, using the sparse data."""
@vene Owner
vene added a note

This docstring is the same as BestSparseSplitter. Is it accurate?

sklearn/tree/_tree.pyx
((348 lines not shown))
+ split[0] = best
+ n_constant_features[0] = n_total_constants
+
+
+cdef class RandomSparseSplitter(SparseSplitter):
+ """Splitter for finding the best split, using the sparse data."""
+
+ def __reduce__(self):
+ return (RandomSparseSplitter, (self.criterion,
+ self.max_features,
+ self.min_samples_leaf,
+ self.random_state), self.__getstate__())
+
+ cdef void node_split(self, double impurity, SplitRecord* split,
+ SIZE_t* n_constant_features) nogil:
+ """Find the best split on node samples[start:end], using sparse
@vene Owner
vene added a note

Same, is this "best random split"?

sklearn/tree/_tree.pyx
((34 lines not shown))
+ pivot = start + (end - start) / 2
+
+ if sorted_array[pivot] == value:
+ index[0] = pivot
+ start = pivot + 1
+ break
+
+ if sorted_array[pivot] < value:
+ start = pivot + 1
+ else:
+ end = pivot
+ new_start[0] = start
+
+
+
+cdef inline void extra_nnz_index_to_samples(INT32_t* X_indices,
@vene Owner
vene added a note

two spaces between void and function name

@vene Owner
vene added a note

Is this supposed to be "extra" or "extract"? The docstring confused me.

@vene Owner
vene added a note

Maybe add "See extra_nnz" to the docstring? Should these functions (this and the next) be underscored?

sklearn/tree/_tree.pyx
((206 lines not shown))
+ negative values are in [start:end_negative[0]] and positive values
+ are in [start_positive[0]:end].
+
+ sorted_samples, is_samples_sorted : c-array of SIZE_t, bint,
+ If is_samples_sorted, then sorted_samples[start:end] will be the sorted
+ version of samples[start:end], else is_samples_sorted is set to True
+ and samples[start:end]
+
+ """
+ cdef SIZE_t n_indices = <SIZE_t>(indptr_end - indptr_start)
+ cdef SIZE_t n_samples = end - start
+
+ # Use binary search if n_samples * log(n_indices) <
+ # n_indices and index_to_samples approach otherwise.
+ # O(n_samples * log(n_indices)) is the running time of binary
+ # search and O(n_indices) is the running time of coloring
@vene Owner
vene added a note

"coloring technique" is never mentioned before. It would be better to introduce the term in the docstring.

sklearn/tree/tests/test_tree.py
((14 lines not shown))
+X_sparse_pos[X_sparse_pos <= 0.8] = 0.
+y_random = random_state.randint(0, 4, size=(20, ))
+X_sparse_mix = sparse_random_matrix(20, 10, density=0.25, random_state=0)
+
+
+DATASETS = {
+ "iris": {"X": iris.data, "y": iris.target},
+ "boston": {"X": boston.data, "y": boston.target},
+ "digits": {"X": digits.data, "y": digits.target},
+ "toy": {"X": X, "y": y},
+ "clf_small": {"X": X_small, "y": y_small},
+ "reg_small": {"X": X_small, "y": y_small_reg},
+ "multilabel": {"X": X_multilabel, "y": y_multilabel},
+ "sparse-pos": {"X": X_sparse_pos, "y": y_random},
+ "sparse-neg": {"X": - X_sparse_pos, "y": y_random},
+ "sparse-mix": {"X": - X_sparse_mix, "y": y_random},
@vene Owner
vene added a note

Why is there a negative sign here?

sklearn/tree/tests/test_tree.py
((128 lines not shown))
+ "{0} with dense and sparse format gave different "
+ "trees".format(tree))
+ assert_array_almost_equal(s.predict(X), d.predict(X))
+
+def test_sparse_criterion():
+ for tree, dataset in product(SPARSE_TREES,
+ ["sparse-pos", "sparse-neg", "sparse-mix",
+ "zeros"]):
+ yield (check_sparse_criterion, tree, dataset)
+
+
+def check_explicit_sparse_zeros(tree, max_depth=3,
+ n_features=10):
+ TreeEstimator = ALL_TREES[tree]
+
+ # n_samples is set n_feature to ease construction of a simultaneous
@vene Owner
vene added a note

set to n_features

to ease simultaneous construction of a csr and csc matrix

@amueller Owner

why don't you just convert them? Or do you want really big matrices? Usually I prefer n_samples != n_features.

@vene vene commented on the diff
sklearn/tree/tests/test_tree.py
((159 lines not shown))
+ data.append(data_i)
+ offset += n_nonzero_i
+ indptr.append(offset)
+
+ indices = np.concatenate(indices)
+ data = np.array(np.concatenate(data), dtype=np.float32)
+ X_sparse = csc_matrix((data, indices, indptr),
+ shape=(n_samples, n_features))
+ X = X_sparse.toarray()
+ X_sparse_test = csr_matrix((data, indices, indptr),
+ shape=(n_samples, n_features))
+ X_test = X_sparse_test.toarray()
+ y = random_state.randint(0, 3, size=(n_samples, ))
+
+ # Ensure that X_sparse_test owns its data, indices and indptr array
+ X_sparse_test = X_sparse_test.copy()
@vene Owner
vene added a note

Just curious, since you just explicitly built it, how can this not be the case?

@larsmans Owner

If the values and indices vectors are of the right type and contiguity, the CSR constructor doesn't copy them:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> A = csr_matrix(np.arange(9).reshape(3, 3))  # to get these vectors quickly
>>> data, indices, indptr = A.data, A.indices, A.indptr
>>> A = csr_matrix((data, indices, indptr))
>>> data[:] = 1
>>> A.toarray()
array([[0, 1, 1],
       [1, 1, 1],
       [1, 1, 1]])
@larsmans larsmans commented on the diff
sklearn/tree/_tree.pyx
((14 lines not shown))
+
+
+# =============================================================================
+# Non zero value extraction with sparse matrices
+# =============================================================================
+cdef int compare_SIZE_t(const void * a, const void * b) nogil:
+ """Comparison function for sort"""
+ return <int>((<SIZE_t*>a)[0] - (<SIZE_t*>b)[0])
+
+
+cdef inline void binary_search(INT32_t* sorted_array, INT32_t start, INT32_t end,
+ SIZE_t value, SIZE_t* index,
+ INT32_t* new_start) nogil:
+ """Return the index of value in the sorted array
+
+ If not found, return -1. new_start is the last pivot + 1
@larsmans Owner

Why return by pointer in a void function?

Also, why not just roll sorted_array and start into one, i.e. call with sorted_array + start?

Finally, if you're using qsort, why not reuse its comparison function with the standard C function bsearch?

@amueller Owner

+1

@arjoly Owner
arjoly added a note

You can make things faster by taking into account that you want to perform the intersection between two arrays instead of repeating the search for one element in an array.
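
A sketch of that idea: one merge-style pass over two sorted index arrays instead of an independent binary search per element (illustrative only, not the PR's code):

def intersect_sorted(a, b):
    """Two-pointer intersection of two sorted integer sequences,
    O(len(a) + len(b)) instead of len(a) separate binary searches."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

intersect_sorted([1, 3, 5, 8], [2, 3, 8, 9])   # [3, 8]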

@larsmans larsmans commented on the diff
sklearn/tree/_tree.pyx
@@ -2379,28 +2955,129 @@ cdef class Tree:
out = out.reshape(X.shape[0], self.max_n_classes)
return out
@larsmans Owner

[Maybe not for this PR] I have the feeling that it's rather senseless to have this function in Cython. It's just calling NumPy functions.

@arjoly Owner
arjoly added a note

+1 for factoring this out in another PR.

sklearn/ensemble/weight_boosting.py
((5 lines not shown))
- if (X.ndim != 2 and not issparse(X)):
+ X = safe_asarray(X)
+ if X.ndim != 2:
@amueller Owner

why not use atleast_2d_or_csr? Because we don't want to convert the csc ones? The helper functions seem a bit confusing after not having looked at them for a while ;)

@arjoly Owner
arjoly added a note

The base estimator might require data in csc rather than csr format, so you don't want to impose the csr format here.

@amueller Owner

Sorry, can you explain again? I didn't get this.

@arjoly Owner
arjoly added a note

I don't know for sure, but there might be some other algorithm in scikit-learn that works only with csc, and you don't want to enforce the csr format on it.

sklearn/ensemble/weight_boosting.py
((9 lines not shown))
X = array2d(X)
- if(self.base_estimator is None or
- isinstance(self.base_estimator, (BaseDecisionTree, BaseForest))):
- X, = check_arrays(X, dtype=DTYPE)
+ if (self.base_estimator is None or
@amueller Owner

maybe put these lines in a small helper?

@arjoly Owner
arjoly added a note

It's done when this is needed (predict / predict_proba / decision_function)

@amueller amueller commented on the diff
sklearn/tree/tests/test_tree.py
@@ -54,6 +62,39 @@
ALL_TREES.update(CLF_TREES)
ALL_TREES.update(REG_TREES)
+SPARSE_TREES = [name for name, Tree in ALL_TREES.items()
+ if Tree().splitter in SPARSE_SPLITTERS]
+
+
+X_small = np.array([
@amueller Owner

What is the motivation for this data? Can you explain how it was generated or why it is used?

@arjoly Owner
arjoly added a note

This was first brought in by @fareshedayati. I don't know where it comes from.

@amueller Owner

Ok, but what is the motivation? Anything that makes this good data for testing? I feel it is kinda weird to hard-code a largish matrix in the tests.

@arjoly Owner
arjoly added a note

I think it's what you have described. I can remove that dataset.

@amueller
Owner

There are some enhancements in the tests that are unrelated to the sparse support, right? Could you maybe put them in a separate PR? This one is kinda big ;)

@amueller
Owner

Btw, have you checked whether the trees on dense input are as fast as before? I would imagine the small changes don't make any difference but it would be good to double-check.

sklearn/tree/_tree.pyx
@@ -2531,9 +3207,239 @@ cdef inline SIZE_t rand_int(SIZE_t end, UINT32_t* random_state) nogil:
"""Generate a random integer in [0; end)."""
return our_rand_r(random_state) % end
-cdef inline double rand_double(UINT32_t* random_state) nogil:
- """Generate a random double in [0; 1)."""
- return <double> our_rand_r(random_state) / <double> RAND_R_MAX
+cdef inline double rand_uniform(double low, double high,
+ UINT32_t* random_state) nogil:
+ """Generate a random double in [low; high)."""
+ return low + (high - low) * <double> our_rand_r(random_state) / <double> RAND_R_MAX
@jnothman Owner

Split this onto two lines?

@jnothman Owner

And I think rand_int should have the same interface as rand_uniform (i.e. give it low and high).

@arjoly Owner
arjoly added a note

OK, I will extract this refactoring from this PR.

sklearn/tree/_tree.pyx
((74 lines not shown))
+ index = index_to_samples[X_indices[k]]
+ sparse_swap(index_to_samples, samples, index, start_positive_)
+
+
+ elif X_data[k] < 0:
+ Xf[end_negative_] = X_data[k]
+ index = index_to_samples[X_indices[k]]
+ sparse_swap(index_to_samples, samples, index, end_negative_)
+ end_negative_ += 1
+
+ # Returned values
+ end_negative[0] = end_negative_
+ start_positive[0] = start_positive_
+
+
+cdef inline void extract_nnz_binary_search(INT32_t* X_indices,
@amueller Owner

I don't want to burden you too much, but do you think it makes sense to unit-test these helper functions? I believe that they work but fine-grained unit tests make refactoring easier.

sklearn/tree/tree.py
((6 lines not shown))
The training input samples. Use ``dtype=np.float32`` for maximum
- efficiency.
+ efficiency. Sparse matrices are also supported, use sparse
@amueller Owner

From what I understood, it will be converted to csc for training, right? From the current docstring, a user might think using csr is just slower, not that it uses more memory.

sklearn/tree/tree.py
@@ -281,16 +299,21 @@ def predict(self, X):
Parameters
----------
- X : array-like of shape = [n_samples, n_features]
- The input samples.
+ X : array-like or sparse matrix of shape = [n_samples, n_features]
+ The input samples. Use ``dtype=np.float32`` for maximum
+ efficiency. Sparse matrices are also supported, use sparse
@amueller Owner

Same as above. Maybe say "will be converted to np.float32 and csr internally"

sklearn/tree/tree.py
@@ -546,8 +574,10 @@ def predict_log_proba(self, X):
Parameters
----------
- X : array-like of shape = [n_samples, n_features]
- The input samples.
+ X : array-like or sparse matrix of shape = [n_samples, n_features]
+ The input samples. Use ``dtype=np.float32`` for maximum
@amueller Owner

same as above.

sklearn/ensemble/weight_boosting.py
((5 lines not shown))
from .base import BaseEnsemble
from ..base import ClassifierMixin, RegressorMixin
from ..externals import six
-from ..externals.six.moves import xrange, zip
+from ..externals.six.moves import zip
+from ..external.six.moves import xrange as range

Isn't external missing a trailing s?

@arjoly Owner
arjoly added a note

indeed.

@amueller Owner
@amueller
Owner

Looks good apart from the minor comments. I can't claim I went through it line-by-line, though.

@arjoly
Owner

Btw, have you checked whether the trees on dense input are as fast as before? I would imagine the small changes don't make any difference but it would be good to double-check.

I can if you point me to which one.

@amueller
Owner
@arjoly
Owner

Sorry @amueller, I meant: please point out which lines of code could be extracted from this PR.
:-)

There are some enhancements in the tests that are unrelated to the sparse support, right? Could you maybe put them in a separate PR? This one is kinda big ;)

sklearn/ensemble/tests/test_forest.py
((33 lines not shown))
+ X, y)
+
+ else:
+ # Unfitted / no bootstrap / no oob_score
+ for oob_score, bootstrap in [(True, False), (False, True),
+ (False, False)]:
+ est = ForestEstimator(oob_score=oob_score, bootstrap=bootstrap,
+ random_state=0)
+ assert_false(hasattr(est, "oob_score_"))
+
+ # No bootstrap
+ assert_raises(ValueError, ForestEstimator(oob_score=True,
+ bootstrap=False).fit, X, y)
+
+
+def test_oob_score_raise_error():
@amueller Owner

These tests are not sparse-related, right?

@fareshedayati

@arjoly
I really would like to see this merged, can I help with anything?
thanks

@arjoly
Owner

Not much remains to be done; I will try to work on this.
Mostly a lot of rebase conflicts to handle and some puzzling things to do.
I will try to finish this soon.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 196a211 on arjoly:sparse-tree into fd4ba4d on scikit-learn:master.

@hamilyon

Any news on this?

I am sure this is quite an important pull request and many are watching it.

@arjoly
Owner

I still have to do some benchmarks, and then it will be ready for a final round of review.

@arjoly
Owner

The benchmarking is in progress.

@arjoly
Owner

I have benchmarked the constant and it appears that the current one is not that bad, though it's hard to get an optimal value since it depends on the algorithm parameters.

@arjoly
Owner

I think it's ready for a final round of review.

sklearn/tree/_tree.pyx
@@ -51,6 +56,12 @@ cdef DTYPE_t MIN_IMPURITY_SPLIT = 1e-7
# Mitigate precision differences between 32 bit and 64 bit
cdef DTYPE_t FEATURE_THRESHOLD = 1e-7
+# Constant to switch between algorithm non zero value extract algorithm
+# in SparseSplitter
+import os
+cdef DTYPE_t EXTRACT_NNZ_SWITCH = float(os.environ.get("NNZ_SWITCH", 0.1))
+print("NNZ_SWITCH = %s" % EXTRACT_NNZ_SWITCH)
@arjoly Owner
arjoly added a note

Note to myself: change this before merging.

benchmarks/bench_20newsgroups.py
((2 lines not shown))
+from time import time
+import argparse
+import numpy as np
+
+from sklearn.dummy import DummyClassifier
+
+from sklearn.datasets import fetch_20newsgroups_vectorized
+from sklearn.metrics import accuracy_score
+from sklearn.utils.validation import check_array
+
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.ensemble import ExtraTreesClassifier
+
+
+ESTIMATORS = {
+ "dummy": DummyClassifier(),
@ogrisel Owner
ogrisel added a note

I would add a linear model such as LogisticRegression and MultinomialNB as common baselines for text classification here.
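A small illustrative sketch of that suggestion, applied to the benchmark's ESTIMATORS dict (the keys and default parameters here are assumptions, not the PR's final code):

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

ESTIMATORS.update({
    "logistic_regression": LogisticRegression(),  # linear baseline
    "multinomial_nb": MultinomialNB(),            # classic text-classification baseline
})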

benchmarks/bench_20newsgroups.py
((46 lines not shown))
+
+ # print("20 newsgroups")
+ # print("=============")
+ # print("X_train.shape = {0}".format(X_train.shape))
+ # print("X_train.format = {0}".format(X_train.format))
+ # print("X_train.dtype = {0}".format(X_train.dtype))
+ # print("X_train density = {0}"
+ # "".format(X_train.nnz / np.product(X_train.shape)))
+ # print("y_train {0}".format(y_train.shape))
+ # print("X_test {0}".format(X_test.shape))
+ # print("X_test.format = {0}".format(X_test.format))
+ # print("X_test.dtype = {0}".format(X_test.dtype))
+ # print("y_test {0}".format(y_test.shape))
+ # print()
+ # print("Classifier Training")
+ # print("===================")
@ogrisel Owner
ogrisel added a note

Please either remove the comment lines or keep them but uncommented.

sklearn/ensemble/forest.py
@@ -182,7 +188,9 @@ def fit(self, X, y, sample_weight=None):
# Convert data
# ensure_2d=False because there are actually unit test checking we fail
# for 1d. FIXME make this consistent in the future.
- X = check_array(X, dtype=DTYPE, ensure_2d=False)
+ X = check_array(X, dtype=DTYPE, ensure_2d=False, accept_sparse="csc")
+ if issparse(X):
+ X.sort_indices()
@ogrisel Owner
ogrisel added a note

Maybe you can add an inline comment to explain why this is required?
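For illustration only (my reading of why the call is needed, not text from the PR): a CSC matrix built by hand can carry unsorted row indices within a column, while the sparse splitter assumes sorted indices, e.g.:

import numpy as np
from scipy.sparse import csc_matrix

data = np.array([3.0, 1.0], dtype=np.float32)
indices = np.array([2, 0])               # row indices deliberately out of order
indptr = np.array([0, 2])
X = csc_matrix((data, indices, indptr), shape=(3, 1))

X.sort_indices()                         # sort row indices (and data) per column
print(X.indices)                         # [0 2]
print(X.data)                            # [1. 3.]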

sklearn/ensemble/forest.py
((9 lines not shown))
rnd = check_random_state(self.random_state)
y = rnd.uniform(size=X.shape[0])
super(RandomTreesEmbedding, self).fit(X, y,
sample_weight=sample_weight)
+ if issparse(X):
+ X = X.tocsr()
@ogrisel Owner
ogrisel added a note

Why is it necessary to do that here? Won't OneHotEncoder.fit_transform do that conversion internally if necessary?

@ogrisel Owner
ogrisel added a note

Actually it's the apply method that should do that internally, not OneHotEncoder.fit_transform.
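A minimal sketch of the layout trade-off under discussion (the helper name is made up for illustration; this is not the PR's code): CSC suits training because the splitter scans feature columns, while apply() visits samples row by row, so CSR is the better layout afterwards.

from scipy.sparse import issparse

def _fit_then_apply(forest, X, y):
    forest.fit(X, y)             # fit works on column-oriented (csc) data
    if issparse(X):
        X = X.tocsr()            # cheap per-row access for tree traversal
    return forest.apply(X)       # leaf index of each sample in each tree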

sklearn/ensemble/forest.py
@@ -1386,8 +1415,10 @@ def transform(self, X):
Parameters
----------
- X : array-like, shape=(n_samples, n_features)
- Input data to be transformed.
+ X : array-like or sparse matrix, shape=(n_samples, n_features)
+ Input data to be transformed. Use ``dtype=np.float32`` for maximum
+ efficiency. Sparse matrices are also supported, use sparse
+ ``csc_matrix`` for maximum efficiency.
@ogrisel Owner
ogrisel added a note

I am confused now: is CSC or CSR more efficient for the internal call to apply?

@ogrisel Owner
ogrisel added a note

I think it should be CSR here.

benchmarks/bench_20newsgroups.py
((10 lines not shown))
+from sklearn.utils.validation import check_array
+
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.ensemble import ExtraTreesClassifier
+
+
+ESTIMATORS = {
+ "dummy": DummyClassifier(),
+ "random_forest": RandomForestClassifier(n_estimators=100,
+ max_features="sqrt",
+ # min_samples_split=10
+ ),
+ "extra_trees": ExtraTreesClassifier(n_estimators=100,
+ max_features="sqrt",
+ # min_samples_split=10
+ ),
@ogrisel Owner
ogrisel added a note

Could you also add an AdaBoostClassifier(n_estimators=10) here?
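Again as an illustrative sketch (the dict key is an assumption; the base estimator is left at its default decision stump):

from sklearn.ensemble import AdaBoostClassifier

ESTIMATORS["adaboost"] = AdaBoostClassifier(n_estimators=10)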

@arjoly
Owner

@ogrisel All your comments have been taken into account.

@arjoly
Owner

rebased on top of master

@ogrisel
Owner

Thanks, it looks good to me. Would be great to get another round of reviews. Maybe @amueller @glouppe @pprett @ndawe?

@jnothman
Owner

I was fairly happy with it a while ago. My plate is a bit full this week, but perhaps next week I'll give this another look-in.

@arjoly
Owner

Thanks @ogrisel !

@arjoly
Owner

And also thanks Joel, I would really appreciate your review if you have some time.

@jnothman jnothman commented on the diff
sklearn/tree/_tree.pyx
((133 lines not shown))
+ negative values are in self.feature_values[start:end_negative[0]]
+ and positive values are in
+ self.feature_values[start_positive[0]:end].
+
+ is_samples_sorted : bint*,
+ If is_samples_sorted, then self.sorted_samples[start:end] will be
+ the sorted version of self.samples[start:end].
+
+ """
+ cdef SIZE_t indptr_start = self.X_indptr[feature],
+ cdef SIZE_t indptr_end = self.X_indptr[feature + 1]
+ cdef SIZE_t n_indices = <SIZE_t>(indptr_end - indptr_start)
+ cdef SIZE_t n_samples = self.end - self.start
+
+ # Use binary search if n_samples * log(n_indices) <
+ # n_indices and index_to_samples approach otherwise.
@jnothman Owner

n_indices * EXTRACT_NNZ_SWITCH would be more precise... but it's repeated in the code, so I'm not sure this comment needs to be so verbose in any case.
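A simplified Python rendering of that switching rule (the real implementation is Cython in _tree.pyx; this sketch ignores the extra cost of sorting the samples when they are not sorted yet):

from math import log

EXTRACT_NNZ_SWITCH = 0.1   # the constant benchmarked earlier in this thread

def use_binary_search(n_samples, n_indices):
    # binary search over the column's nonzeros costs ~ n_samples * log(n_indices);
    # the index_to_samples scan costs ~ n_indices
    return n_samples * log(n_indices) < EXTRACT_NNZ_SWITCH * n_indices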

@jnothman
Owner

You've effectively got 3 +1s on an earlier version of this code (from me, @amueller and @glouppe), so asking for another review is likely to lead to over-optimisation.

On which basis:

  • Should {Dense,Sparse}Splitter be called Base{Dense,Sparse}Splitter?
  • Does extract_nnz take up a substantial portion of runtime? (If not, why do you require two implementations of it?) Given how frequently sparse datasets are nonnegative, should we have a way to short-circuit if we know this is the case (i.e. keep a record that the entire dataset or each feature is nonnegative).
  • Since I last reviewed this you changed the min-weight handling so that the criterion update happened before the min-weight check. Is this something (i.e. its former failure) that is currently tested?

I've not given this a thorough pass now, but gather that mostly cosmetic changes have been applied since the last review. I think this PR is of a maturity that it needs no further review to be merged.

@arjoly
Owner

Should {Dense,Sparse}Splitter be called Base{Dense,Sparse}Splitter?

Sound good.

Does extract_nnz take up a substantial portion of runtime? (If not, why do you require two implementations of it?)

The second implementation is there to cope with a degenerate column that wouldn't be sparse.

Given how frequently sparse datasets are nonnegative, should we have a way to short-circuit if we know this is the case (i.e. keep a record that the entire dataset or each feature is nonnegative).

Do you think there would be a high gain in doing that?

Since I last reviewed this you changed the min-weight handling so that the criterion update happened before the min-weight check. Is this something (i.e. its former failure) that is currently tested?

There are tests for this, though those could be better.

@jnothman
Owner

Do you think there would be a high gain in doing that?

That's why I asked "Does extract_nnz take up a substantial portion of runtime?" Obviously testing the non-negativity of a sparse matrix takes O(nnz) too.

But in short, I think you shouldn't be looking for more reviews. The code can be tightened once it's merged.
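For illustration (not from the PR), the non-negativity check only needs to inspect the stored values, hence O(nnz):

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> X = csr_matrix(np.array([[0.0, 2.0], [3.0, 0.0]]))
>>> bool((X.data >= 0).all())   # only the nnz stored entries are inspected
True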

@arjoly
Owner

"Does extract_nnz take up a substantial portion of runtime?"

Formally, I haven't tested it. It should take most of the time, especially near the bottom of the tree.

@jnothman
Owner

Are we okay to merge?

@ogrisel
Owner

Yes, let's merge. This is already useful as it is, and the public API should not change, so we can always decide later whether or not to special-case the handling of non-negative sparse data as an implementation detail.

@ogrisel ogrisel merged commit 8621f00 into scikit-learn:master

1 check passed

continuous-integration/travis-ci: The Travis CI build passed
@ogrisel
Owner

Thanks @arjoly! I think many users were waiting for this!

@ogrisel
Owner

I will add an entry to what's new.

@glouppe
Owner

Great job everyone :)

@ogrisel
Owner

:beers:

@GaelVaroquaux
@pprett
Owner

congrats @arjoly and everybody involved -- didn't expect sparse matrix support for the decision trees -- great job!

@arjoly
Owner

Finally, this is in!!! :-D Thanks to all the people who have contributed to this feature!

@arjoly arjoly deleted the arjoly:sparse-tree branch
@dan-blanchard dan-blanchard referenced this pull request in EducationalTestingService/skll
Closed

Remove decision tree and random forest learners from requires_dense set #207

@dan-blanchard dan-blanchard commented on the diff
sklearn/ensemble/forest.py
((20 lines not shown))
from ..base import ClassifierMixin, RegressorMixin
from ..externals.joblib import Parallel, delayed
from ..externals import six
-from ..externals.six.moves import xrange

I was just looking through this and noticed that this file is missing

from ..externals.six.moves import xrange as range

This is also true of tree.py, so maybe this was intentional. It just seems strange from an outsider's perspective to make 2.7 memory usage worse.

@ogrisel Owner
ogrisel added a note

Indeed, but as far as I can see this is only used for iterations over n_outputs_, which should be much smaller than n_samples in typical cases, so users won't notice it in practice.

@amueller
Owner

Wohoo congrats @arjoly! This is awesome!

@fareshedayati

Great work @arjoly thank you for all your hard work!

@jnothman
Owner

congrats, Arnaud and Fares!

@pprett

@arjoly aren't we missing a self.min_weight_leaf here?

Owner

Indeed. I will make a patch.

Owner

Actually, I don't think it's possible to make it raise an error, since the splitter is never saved with the tree estimator, except in GBRT.
Though I am going to fix it, as it is still a bug.

Owner

FIX in c95703a

@pprett

@arjoly aren't we missing a self.min_weight_leaf here?

Owner

Fix in c95703a

Commits on Nov 6, 2014
  1. @arjoly

    ENH Bring sparse input support to tree-based methods

    arjoly authored
    Author:     Arnaud Joly <arnaud.v.joly@gmail.com>
                Fares Hedayati <fares.hedayati@gmail.com>
  2. @arjoly
  3. @arjoly
  4. @arjoly
  5. @arjoly

    ENH while -> for loop

    arjoly authored
  6. @arjoly

    ENH reduce number of parameters

    arjoly authored
  7. @arjoly
  8. @arjoly
  9. @arjoly

    ENH remove spurious code

    arjoly authored
  10. @arjoly

    cosmit

    arjoly authored
  11. @arjoly
  12. @arjoly

    COSMIT simplify function call

    arjoly authored
  13. @arjoly

    ENH expand ternary operator

    arjoly authored
  14. @arjoly

    Revert previous version

    arjoly authored
  15. @arjoly

    ENH move utils near its use

    arjoly authored
  16. @arjoly
  17. @arjoly
  18. @arjoly

    Lower number of trees

    arjoly authored
  19. @arjoly

    wip benchmark

    arjoly authored
  20. @arjoly
  21. @arjoly
  22. @arjoly