[MRG] Sparse input support for decision tree and forest #3173
Conversation
I think this pull request is ready to be reviewed. I switched it to MRG.
@fareshedayati I have added your name to the authors list to acknowledge your work.
@ogrisel Could it be that this failure https://travis-ci.org/scikit-learn/scikit-learn/jobs/26057366 (from commit 2e9c6db) is related to joblib?
@arjoly thanks a lot.
The timeout failure is probably caused by an overloaded Travis worker. I relaunched it to confirm.
    X, = check_arrays(X, dtype=DTYPE, sparse_format='csr')
else:
    X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
Can't we do this in one `check_arrays` call?
                   "presort-best": _tree.PresortBestSplitter,
                   "random": _tree.RandomSplitter}

SPARSE_SPLITTER = {"best": _tree.BestSparseSplitter,
This should be `SPARSE_SPLITTERS`.
Thanks, fixed in b1d6030.
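The dict-based dispatch being set up here can be sketched in plain Python; the classes below are hypothetical stand-ins for the Cython splitters in `_tree`, and `make_splitter` is an illustrative helper, not code from the PR:

```python
# Hypothetical stand-ins for the Cython splitter classes in sklearn.tree._tree.
class BestSplitter:
    pass

class BestSparseSplitter:
    pass

DENSE_SPLITTERS = {"best": BestSplitter}
SPARSE_SPLITTERS = {"best": BestSparseSplitter}  # plural, per the review comment

def make_splitter(name, is_sparse):
    # Select the implementation registry based on the input's sparsity,
    # then look the splitter up by its user-facing name.
    registry = SPARSE_SPLITTERS if is_sparse else DENSE_SPLITTERS
    return registry[name]()
```

Keeping two parallel registries lets the estimator swap in a sparse-aware implementation without changing the user-facing `splitter` parameter.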
I would really like to make a proper review of this. Unfortunately, I am afraid I will be busy for two more weeks on my thesis. If others could have a look, that would be great.
Rebased on top of master.
I was fairly happy with it a while ago. My plate is a bit full this week, but perhaps next week I'll give this another look.
Thanks @ogrisel!
And also thanks Joel, I would really appreciate your review if you have some time.
cdef SIZE_t n_samples = self.end - self.start

# Use binary search if n_samples * log(n_indices) <
# n_indices and index_to_samples approach otherwise.
`n_indices * EXTRACT_NNZ_SWITCH` would be more precise... but it's repeated in the code, so I'm not sure this comment needs to be so verbose in any case.
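For context, here is a rough Python rendering of the two extraction strategies and the switch the comment describes; the value of `EXTRACT_NNZ_SWITCH` below is illustrative only, not the constant in the Cython source, and the function signature is a simplified sketch:

```python
import math
import numpy as np

EXTRACT_NNZ_SWITCH = 0.1  # illustrative threshold, not the library's value

def extract_nnz(col_indices, col_data, samples):
    """Sketch: extract a CSC column's nonzeros restricted to `samples`.

    col_indices: sorted row indices with a stored value in this column.
    col_data:    the matching values.
    samples:     row indices currently in the tree node.
    Returns {row: value} for rows in `samples` that have a stored value.
    """
    n_samples, n_indices = len(samples), len(col_indices)
    if n_indices > 0 and n_samples * math.log(n_indices) < n_indices * EXTRACT_NNZ_SWITCH:
        # Binary-search path: look up each node sample in the sorted
        # index array, O(n_samples * log(n_indices)).
        out = {}
        for s in samples:
            k = np.searchsorted(col_indices, s)
            if k < n_indices and col_indices[k] == s:
                out[s] = col_data[k]
        return out
    # index_to_samples path: scan every stored nonzero once, O(n_indices),
    # testing membership against the node's sample set.
    in_node = set(samples)
    return {r: v for r, v in zip(col_indices, col_data) if r in in_node}
```

Near the bottom of the tree `n_samples` is small, so the binary-search path wins; near the root the full scan is cheaper, which is what the heuristic captures.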
You've effectively got 3 +1s on an earlier version of this code (from me, @amueller and @glouppe), so asking for another review is likely to lead to over-optimisation. On which basis:
I've not given this a thorough pass now, but gather that mostly cosmetic changes have been applied since the last review. I think this PR is of a maturity that it needs no further review to be merged.
Sounds good.
The second implementation is there to cope with a degenerate column that isn't actually sparse.
Do you think there would be much gain in doing that?
There are tests for this, though they could be better.
That's why I asked "Does extract_nnz take up a substantial portion of runtime?" Obviously testing the non-negativity of a sparse matrix takes O(nnz) too. But in short, I think you shouldn't be looking for more reviews. The code can be tightened once it's merged.
Formally, I haven't tested it. It should take most of the time, especially near the bottom of the tree.
Are we okay to merge?
Yes, let's merge. This is already useful as it is, and the public API should not change, so we can always decide later whether to special-case the handling of non-negative sparse data.
Thanks @arjoly! I think many users were waiting for this!
I will add an entry to what's new.
Great job everyone :)
🍻
F* yeah! Awesome
Congrats @arjoly and everybody involved -- I didn't expect sparse matrix support for the decision trees -- great job!
Finally, this is in!!! :-D Thanks to everyone who contributed to this feature!
from ..base import ClassifierMixin, RegressorMixin
from ..externals.joblib import Parallel, delayed
from ..externals import six
from ..externals.six.moves import xrange
I was just looking through this and noticed that this file is missing `from ..externals.six.moves import xrange as range`.
This is also true of tree.py, so maybe this was intentional. It just seems strange from an outsider's perspective to make 2.7 memory usage worse.
Indeed, but as far as I can see this is only for iterations over `n_outputs_`, which should be much smaller than `n_samples` in typical cases, so users won't notice it in practice.
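The alias under discussion can be written so it degrades gracefully when six is absent; this is a sketch of the pattern, assuming six may or may not be installed:

```python
try:
    # On Python 2, six.moves.xrange is the lazy xrange builtin; binding it
    # to the name `range` avoids materialising a full list for each loop.
    # On Python 3, six.moves.xrange is simply the built-in range.
    from six.moves import xrange as range
except ImportError:
    pass  # No six available: on Python 3 the built-in range is already lazy.

total = sum(i for i in range(5))  # iterates lazily on both Python versions
```

Since the loops here run over `n_outputs_` rather than `n_samples`, the memory saved on Python 2 is small, which matches the reply above.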
Wohoo, congrats @arjoly! This is awesome!
Great work @arjoly, thank you for all your hard work!
Congrats, Arnaud and Fares!
This pull request aims to finish the work in #2984.
For reference: #655, #2848
TODO list:
feature_importances_
Todo later: