
[MRG] Sparse input support for decision tree and forest #3173

Merged
merged 24 commits into scikit-learn:master from arjoly:sparse-tree on Nov 21, 2014

Conversation

@arjoly (Member) commented May 21, 2014

This pull request aims to finish the work in #2984.
For reference: #655, #2848

TODO list:

  • clean rebase (remove unwanted generated file, github merge commit, ...)
  • simpler input validation for tree base method (safe_array, check_array?)
  • reduce testing times and rationalize tests
  • use the new safe_realloc
  • document fit, ... with the sparse format support
  • fix bug in algorithm selection binary_search
  • raise error if int64 sparse index matrix
  • test oob scores
  • test random tree embedding with sparse input data
  • ensure that X is 2d
  • test forest apply (sparse)
  • test error raised if bootstrap=False and oob_score=True
  • test not fitted and feature_importances_
  • Check warning oob_score
  • Let tree estimator handle int64 based sparse matrices
  • switch xrange -> range
  • Update adaboost for dtype
  • divide sparse and dense splitter correctly
  • Test min weight leaf
  • pull out unrelated improvements
  • SparseSplitter is only (mainly) for csc.
  • rebench the constant

Todo later:

  • update what's new

@coveralls
Coverage increased (+0.03%) when pulling ee99297 on arjoly:sparse-tree into d9cc662 on scikit-learn:master.

@coveralls
Coverage increased (+0.03%) when pulling 924ee60 on arjoly:sparse-tree into d9cc662 on scikit-learn:master.

@arjoly mentioned this pull request May 23, 2014
@coveralls
Coverage increased (+0.06%) when pulling cce08a3 on arjoly:sparse-tree into 95619a9 on scikit-learn:master.

@arjoly (Member Author) commented May 26, 2014

I think this pull request is ready to be reviewed. I have switched it to MRG.

@arjoly changed the title from [WIP] Sparse input support for decision tree and forest to [MRG] Sparse input support for decision tree and forest on May 26, 2014
@coveralls
Coverage increased (+0.05%) when pulling 2e9c6db on arjoly:sparse-tree into 95619a9 on scikit-learn:master.

@arjoly (Member Author) commented May 26, 2014

@fareshedayati I have added your name to the authors list to acknowledge your work.

@arjoly (Member Author) commented May 26, 2014

@ogrisel Could it be that this failure https://travis-ci.org/scikit-learn/scikit-learn/jobs/26057366 (from commit 2e9c6db) is related to joblib?

@fareshedayati

@arjoly thanks a lot.

@ogrisel (Member) commented May 26, 2014

The timeout failure is probably caused by an overloaded Travis worker. I relaunched it to confirm.

X, = check_arrays(X, dtype=DTYPE, sparse_format='csr')

else:
X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
Member

Can't we do this in one check_arrays call?
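
For reference, a minimal sketch of the single-call validation the reviewer is suggesting, written against the later check_array API (accept_sparse) rather than the check_arrays helper used at the time; validate_X and the surrounding names are placeholders, not code from this PR.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.utils import check_array

DTYPE = np.float32  # the trees' internal dtype

def validate_X(X):
    # One call handles both cases: sparse input is accepted as CSR,
    # dense input is converted to DTYPE as before.
    return check_array(X, dtype=DTYPE, accept_sparse="csr")

X_dense = validate_X(np.random.rand(5, 3))
X_sparse = validate_X(csr_matrix(np.random.rand(5, 3)))
print(type(X_dense), type(X_sparse))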

@coveralls
Coverage increased (+0.06%) when pulling 9f05d1c on arjoly:sparse-tree into 95619a9 on scikit-learn:master.

@coveralls
Coverage increased (+0.04%) when pulling a532cef on arjoly:sparse-tree into daa1dba on scikit-learn:master.

@coveralls
Coverage increased (+0.04%) when pulling cd7d576 on arjoly:sparse-tree into daa1dba on scikit-learn:master.

"presort-best": _tree.PresortBestSplitter,
"random": _tree.RandomSplitter}

SPARSE_SPLITTER = {"best": _tree.BestSparseSplitter,
Member

This should be SPARSE_SPLITTERS

Member Author

Thanks, fixed in b1d6030
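
For context, a minimal sketch (placeholder class names, not the actual scikit-learn source) of how the dense and sparse splitter registries above are typically consumed: the fitting code picks the dictionary matching the input's sparsity, then looks the requested splitter up by name.

from scipy.sparse import issparse

class BestSplitter: pass
class RandomSplitter: pass
class BestSparseSplitter: pass
class RandomSparseSplitter: pass

DENSE_SPLITTERS = {"best": BestSplitter, "random": RandomSplitter}
SPARSE_SPLITTERS = {"best": BestSparseSplitter, "random": RandomSparseSplitter}

def make_splitter(X, splitter="best"):
    # Select the registry based on the training data, then instantiate by name.
    registry = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS
    return registry[splitter]()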

@coveralls
Coverage increased (+0.04%) when pulling b1d6030 on arjoly:sparse-tree into daa1dba on scikit-learn:master.

@ogrisel (Member) commented Jun 4, 2014

It seems that this PR is ready for merge. Any more comments @jnothman @larsmans @glouppe?

@glouppe (Contributor) commented Jun 4, 2014

I would really like to make a proper review of this. Unfortunately, I am afraid I will be busy for two more weeks on my thesis. If others could have a look, that would be great.

@arjoly (Member Author) commented Nov 7, 2014

rebased on top of master

@ogrisel (Member) commented Nov 12, 2014

Thanks, it looks good to me. Would be great to get another round of reviews. Maybe @amueller @glouppe @pprett @ndawe?

@jnothman (Member)

I was fairly happy with it a while ago. My plate is a bit full this week, but perhaps next week I'll give this another look-in.

@arjoly (Member Author) commented Nov 14, 2014

Thanks @ogrisel !

@arjoly (Member Author) commented Nov 14, 2014

And thanks also, Joel; I would really appreciate your review if you have some time.

cdef SIZE_t n_samples = self.end - self.start

# Use binary search if n_samples * log(n_indices) <
# n_indices and index_to_samples approach otherwise.
Member

n_indices * EXTRACT_NNZ_SWITCH would be more precise... but it's repeated in the code, so I'm not sure this comment needs to be so verbose in any case.
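
For readers following along, a rough Python sketch (not the Cython source) of the heuristic being discussed: the sparse splitter extracts a feature column's non-zeros for the current node either by binary search over the column's row indices or by a full column scan with an index_to_samples lookup. The EXTRACT_NNZ_SWITCH value below is a placeholder; the actual constant was benchmarked in the PR.

from math import log

EXTRACT_NNZ_SWITCH = 0.1  # placeholder value, not the benchmarked constant

def choose_extract_strategy(n_samples, n_indices):
    # Binary search wins when looking up each node sample in the column's
    # sorted indices is cheaper than scanning all of the column's non-zeros.
    if n_samples * log(n_indices) < EXTRACT_NNZ_SWITCH * n_indices:
        return "binary_search"
    return "index_to_samples"

print(choose_extract_strategy(n_samples=50, n_indices=100000))   # binary_search
print(choose_extract_strategy(n_samples=5000, n_indices=6000))   # index_to_samples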

@jnothman (Member)

You've effectively got 3 +1s on an earlier version of this code (from me, @amueller and @glouppe), so asking for another review is likely to lead to over-optimisation.

On which basis:

  • Should {Dense,Sparse}Splitter be called Base{Dense,Sparse}Splitter?
  • Does extract_nnz take up a substantial portion of runtime? (If not, why do you require two implementations of it?) Given how frequently sparse datasets are nonnegative, should we have a way to short-circuit if we know this is the case (i.e. keep a record that the entire dataset or each feature is nonnegative).
  • Since I last reviewed this you changed the min-weight handling so that the criterion update happened before the min-weight check. Is this something (i.e. its former failure) that is currently tested?

I've not given this a thorough pass now, but gather that mostly cosmetic changes have been applied since the last review. I think this PR is of a maturity that it needs no further review to be merged.

@arjoly (Member Author) commented Nov 17, 2014

Should {Dense,Sparse}Splitter be called Base{Dense,Sparse}Splitter?

Sounds good.

Does extract_nnz take up a substantial portion of runtime? (If not, why do you require two implementations of it?)

The second implementation is there to cope with a degenerate column that isn't actually sparse.

Given how frequently sparse datasets are nonnegative, should we have a way to short-circuit if we know this is the case (i.e. keep a record that the entire dataset or each feature is nonnegative).

Do you think there would be much gain from doing that?

Since I last reviewed this you changed the min-weight handling so that the criterion update happened before the min-weight check. Is this something (i.e. its former failure) that is currently tested?

There are tests for this, though they could be better.

@jnothman (Member)

Do you think there would be much gain from doing that?

That's why I asked "Does extract_nnz take up a substantial portion of runtime?" Obviously testing the non-negativity of a sparse matrix takes O(nnz) too.

But in short, I think you shouldn't be looking for more reviews. The code can be tightened once it's merged.
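
To illustrate the cost being weighed above (the names here are just an example, not code from the PR): for a CSR or CSC matrix, checking non-negativity is a single scan over the stored values, i.e. O(nnz), the same order as one full extract_nnz pass.

import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[0.0, 2.0], [3.0, 0.0]]))
is_nonnegative = bool((X.data >= 0).all())  # O(nnz): only stored entries are scanned
print(is_nonnegative)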

@arjoly (Member Author) commented Nov 17, 2014

"Does extract_nnz take up a substantial portion of runtime?"

Formally, I haven't tested it. It should take most of the time, especially near the bottom of the tree.

@jnothman (Member)

Are we okay to merge?

@ogrisel (Member) commented Nov 21, 2014

Yes, let's merge. This is already useful as it is, and the public API should not change, so we can always decide later whether or not to special-case the handling of non-negative sparse data as an implementation detail.

@ogrisel added a commit that referenced this pull request on Nov 21, 2014: [MRG] Sparse input support for decision tree and forest
@ogrisel merged commit 8621f00 into scikit-learn:master on Nov 21, 2014
@ogrisel (Member) commented Nov 21, 2014

Thanks @arjoly! I think many users were waiting for this!

@ogrisel (Member) commented Nov 21, 2014

I will add an entry to what's new.

@glouppe (Contributor) commented Nov 21, 2014

Great job everyone :)

@ogrisel (Member) commented Nov 21, 2014

🍻

@GaelVaroquaux (Member)

🍻

F* yeah! Awesome

@pprett (Member) commented Nov 21, 2014

Congrats @arjoly and everybody involved -- didn't expect sparse matrix support for the decision trees -- great job!

@arjoly (Member Author) commented Nov 21, 2014

Finally, this is in!!! :-D Thanks to everyone who contributed to this feature!

from ..base import ClassifierMixin, RegressorMixin
from ..externals.joblib import Parallel, delayed
from ..externals import six
from ..externals.six.moves import xrange
Contributor

I was just looking through this and noticed that this file is missing:

from ..externals.six.moves import xrange as range

Contributor

This is also true of tree.py, so maybe this was intentional. It just seems strange from an outsider's perspective to make Python 2.7 memory usage worse.

Member

Indeed, but as far as I can see this is only used for iterations over n_outputs_, which should be much smaller than n_samples in typical cases, so users won't notice it in practice.
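
A tiny illustration of the point: under Python 2, the built-in range(n) builds a full list while xrange(n) is lazy, which is why other modules alias it as in the import shown above. The loops in question run over n_outputs_ (typically a handful of values), so skipping the alias costs almost nothing. The names below are placeholders, not code from this PR.

try:
    from sklearn.externals.six.moves import xrange as range  # old Python 2/3 compat idiom
except ImportError:
    pass  # recent scikit-learn / Python 3: the built-in range is already lazy

n_outputs = 3  # stands in for estimator.n_outputs_, usually small
for k in range(n_outputs):  # a few iterations at most, so list vs. iterator is moot
    print("output", k)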

@amueller (Member)

Woohoo, congrats @arjoly! This is awesome!

@fareshedayati

Great work @arjoly, thank you for all your hard work!

@jnothman (Member)

Congrats, Arnaud and Fares!
