[MRG] Sparse input support for decision tree and forest #3173

Merged
merged 24 commits into from
@arjoly
Owner

This pull request aims to finish the work in #2984.
For reference: #655, #2848

TODO list:

  • clean rebase (remove unwanted generated file, github merge commit, ...)
  • simpler input validation for tree base method (safe_array, check_array?)
  • reduce testing times and rationalize tests
  • use the new safe_realloc
  • document fit, ... with the sparse format support
  • fix bug in algorithm selection binary_search
  • raise error if int64 sparse index matrix
  • test oob scores
  • test random tree embedding with sparse input data
  • ensure that X is 2d
  • test forest apply (sparse)
  • test error raised if bootstrap=False and oob_score=True
  • test not fitted and feature_importances_
  • Check warning oob_score
  • Let tree estimator handle int64 based sparse matrices
  • switch xrange -> range
  • Update adaboost for dtype
  • divide sparse and dense splitter correctly
  • Test min weight leaf
  • pull out unrelated improvements
  • SparseSplitter is only (mainly) for csc.
  • rebench the constant

Todo later:

  • update what's new
@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling ee99297 on arjoly:sparse-tree into d9cc662 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 924ee60 on arjoly:sparse-tree into d9cc662 on scikit-learn:master.

@arjoly arjoly referenced this pull request
Closed

Heisen-bug with omp_cv #3190

@coveralls

Coverage Status

Coverage increased (+0.06%) when pulling cce08a3 on arjoly:sparse-tree into 95619a9 on scikit-learn:master.

@arjoly arjoly referenced this pull request in joblib/joblib
Closed

Joblib heisen bug failure #136

@coveralls

Coverage Status

Coverage increased (+0.06%) when pulling cce08a3 on arjoly:sparse-tree into 95619a9 on scikit-learn:master.

@arjoly
Owner

I think this pull request is ready to be reviewed. I have switched it to MRG.

@arjoly arjoly changed the title from [WIP] Sparse input support for decision tree and forest to [MRG] Sparse input support for decision tree and forest
@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 2e9c6db on arjoly:sparse-tree into 95619a9 on scikit-learn:master.

@arjoly
Owner

@fareshedayati I have added your name to the author lists to acknowledge your work.

@arjoly
Owner

@ogrisel Could it be that this failure https://travis-ci.org/scikit-learn/scikit-learn/jobs/26057366 (from commit 2e9c6db) is related to joblib?

@fareshedayati

@arjoly thanks a lot.

@ogrisel
Owner

The timeout failure is probably caused by an overloaded travis worker. I relaunched it to confirm.

sklearn/ensemble/forest.py
@@ -188,10 +193,16 @@ def apply(self, X):
For each datapoint x in X and for each tree in the forest,
return the index of the leaf x ends up in.
"""
- X = array2d(X, dtype=DTYPE)
+ if issparse(X):
+ X, = check_arrays(X, dtype=DTYPE, sparse_format='csr')
+
+ else:
+ X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
@larsmans Owner

Can't we do this in one check_arrays call?

sklearn/ensemble/forest.py
@@ -221,7 +233,12 @@ def fit(self, X, y, sample_weight=None):
random_state = check_random_state(self.random_state)
# Convert data
- X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
+ if issparse(X):
+ X, = check_arrays(X, dtype=DTYPE, sparse_format='csc')
+ X.sort_indices()
@larsmans Owner

Since sort_indices changes the input, don't we need an explicit copy here?

@arjoly Owner
arjoly added a note

Both are fine for me.
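
For reference, a minimal sketch of the explicit-copy variant being discussed (whether the copy is actually needed here depends on how fit documents its side effects):

from scipy.sparse import csc_matrix
import numpy as np

X_user = csc_matrix(np.eye(3))   # the caller's matrix
X = X_user.copy()                # explicit copy keeps the caller's data untouched
X.sort_indices()                 # in-place: only the copy is modified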

sklearn/ensemble/forest.py
@@ -451,8 +471,11 @@ def predict_proba(self, X):
classes corresponds to that in the attribute `classes_`.
"""
# Check data
- if getattr(X, "dtype", None) != DTYPE or X.ndim != 2:
- X = array2d(X, dtype=DTYPE)
+ if issparse(X):
+ X, = check_arrays(X, dtype=DTYPE, sparse_format='csr')
+
+ else:
+ X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
@larsmans Owner

Same as above. If one check_arrays call is impossible, can we use atleast2d_or_csr? Or factor out the logic to a local validation helper?
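
A minimal sketch of the kind of local validation helper suggested here (the name _check_X and the exact conversions are hypothetical, not necessarily what the PR ends up doing):

from scipy.sparse import issparse
import numpy as np

def _check_X(X, sparse_format="csr", dtype=np.float32):
    """Accept dense or sparse input; return float32 data, with sparse input
    converted to the requested format."""
    if issparse(X):
        X = X.asformat(sparse_format)
        if X.data.dtype != dtype:
            X.data = X.data.astype(dtype)
        return X
    return np.atleast_2d(np.asarray(X, dtype=dtype))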

sklearn/tree/_tree.pyx
((170 lines not shown))
+ start_positive[0] = start_positive_
+
+
+cdef inline void extract_nnz(INT32_t* X_indices,
+ DTYPE_t* X_data,
+ INT32_t indptr_start,
+ INT32_t indptr_end,
+ SIZE_t* samples,
+ SIZE_t start,
+ SIZE_t end,
+ SIZE_t* index_to_samples,
+ DTYPE_t* Xf,
+ SIZE_t* end_negative,
+ SIZE_t* start_positive,
+ SIZE_t* sorted_samples,
+ bint* is_samples_sorted) nogil:
@larsmans Owner

This is missing a comment explaining the parameters, or even what it does.

@arjoly Owner
arjoly added a note

I have updated the documentation.

sklearn/tree/_tree.pyx
((179 lines not shown))
+ SIZE_t end,
+ SIZE_t* index_to_samples,
+ DTYPE_t* Xf,
+ SIZE_t* end_negative,
+ SIZE_t* start_positive,
+ SIZE_t* sorted_samples,
+ bint* is_samples_sorted) nogil:
+
+ cdef SIZE_t n_indices = <SIZE_t>(indptr_end - indptr_start)
+ cdef SIZE_t n_samples = end - start
+
+ # Use binary search to if n_samples * log(n_indices) <
+ # n_indices and coloring technique otherwise.
+ # O(n_samples * log(n_indices)) is the running time of binary
+ # search and O(n_indices) is the running time of coloring
+ # technique.
@larsmans Owner

Do we really need all this complexity? Has the alternative (always use coloring) been benchmarked? (If so and it's been discussed, sorry that I missed it, I stopped following the discussion at some point.)

@arjoly Owner
arjoly added a note

This guards against the worst case where n_indices >>> n_node_samples, which mainly occurs near the leaves (bottom of the tree). On a not-so-sparse dataset like covertype, this behavior shows up clearly.

# 20 news group (density ~ 0.0012)
rf = RandomForestClassifier(max_features=0.1, random_state=0)
rf without                88.6212s   0.0903s     0.3586  
rf with 0.1 * n_indices   95.1268s   0.1226s     0.3586  
rf with 0.2 * n_indices   89.6108s   0.1077s     0.3586  

# Covertype (density ~ 0.22)
dt without               436.5677s   0.0164s     0.0426  
dt with 0.1 * n_indices   68.9361s   0.0193s     0.0426 
dt with 0.2 * n_indices   60.6809s   0.0147s     0.0426

Though the 0.1 factor for algorithm switching might be too small.

@larsmans Owner

Convincing.
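
For readers following along, a schematic Python restatement of the switching rule from the diff (the 0.1 threshold is the constant still being re-benchmarked):

from math import log2

def use_binary_search(n_samples, n_indices, is_samples_sorted, threshold=0.1):
    """True when binary search over a feature's non-zeros is expected to beat
    the O(n_indices) index_to_samples ("coloring") scan."""
    sort_cost = 0 if is_samples_sorted else n_samples * log2(n_samples)
    search_cost = n_samples * log2(n_indices)
    return sort_cost + search_cost < threshold * n_indices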

sklearn/tree/_utils.c
@@ -1,4 +1,4 @@
-/* Generated by Cython 0.20.1 on Thu Apr 10 20:15:12 2014 */
+/* Generated by Cython 0.20.1 on Thu May 22 14:53:34 2014 */
@larsmans Owner

_utils.pyx hasn't changed, so you don't need to regenerate this file.

@arjoly Owner
arjoly added a note

Do you know how to revert those modifications using git?

@jnothman Owner

To revert the modifications you could use git checkout <SHA> <PATH> then commit and push that... Or you can eradicate the change from the history by interactive rebasing, amending the relevant commit and force pushing.

@arjoly Owner
arjoly added a note

thanks, done with

git filter-branch --force --index-filter \
'git checkout master sklearn/tree/_utils.c' \
--prune-empty master..sparse-tree
@larsmans Owner

Never mind, that's much smarter than what I would have done :)

@coveralls

Coverage Status

Coverage increased (+0.06%) when pulling 9f05d1c on arjoly:sparse-tree into 95619a9 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling a532cef on arjoly:sparse-tree into daa1dba on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling cd7d576 on arjoly:sparse-tree into daa1dba on scikit-learn:master.

sklearn/tree/tree.py
((7 lines not shown))
+DENSE_SPLITTERS = {"best": _tree.BestSplitter,
+ "presort-best": _tree.PresortBestSplitter,
+ "random": _tree.RandomSplitter}
+
+SPARSE_SPLITTER = {"best": _tree.BestSparseSplitter,
@jnothman Owner

This should be SPARSE_SPLITTERS

@arjoly Owner
arjoly added a note

Thanks, fixed in b1d6030

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling b1d6030 on arjoly:sparse-tree into daa1dba on scikit-learn:master.

@ogrisel
Owner

It seems that this PR is ready for merge. Any more comments @jnothman @larsmans @glouppe?

@glouppe
Owner

I would really like to make a proper review of this. Unfortunately, I am afraid I will be busy for two more weeks on my thesis. If others could have a look, that would be great.

doc/modules/tree.rst
@@ -326,6 +334,13 @@ Tips on practical use
* All decision trees use ``np.float32`` arrays internally.
If training data is not in this format, a copy of the dataset will be made.
+ * If the input matrix X is very sparse, it is highly recommend to convert the
+ matrix into sparse `csc_matrix` format before feeding it to `fit`.
+ For testing, it is recommendable to convert the data into `csr_matrix`
+ format before passing it to `predict` function. Training time is between
+ 10 to 40 times faster with sparse matrix input compared to dense matrix in
+ case of sparse data.
@ogrisel Owner
ogrisel added a note

I think this depends on the number of non-zeros. I would rather say "Training time can be orders of magnitude faster for a sparse matrix input compared to a dense matrix when features have zero values in most of the samples", or something in that spirit but with better phrasing.

@ogrisel Owner
ogrisel added a note

@arjoly WDYT about that suggested change? I am pretty sure that you can get more than a 40x speedup on very, very sparse datasets, so I would rather remove misleading absolute speedup bounds from this paragraph.

@arjoly Owner
arjoly added a note

You are right. I am going to make the proposed modification. I haven't found the time to make it yet. :-)

Thanks for the review!!!

sklearn/ensemble/forest.py
((5 lines not shown))
for estimator in self.estimators_:
- mask = np.ones(n_samples, dtype=np.bool)
+ mask = np.ones((n_samples, ), dtype=np.bool)
@ogrisel Owner
ogrisel added a note

style => (n_samples,)

sklearn/ensemble/tests/test_forest.py
((22 lines not shown))
+
+ if name in FOREST_TRANSFORMERS:
+ assert_array_almost_equal(sparse.transform(X).toarray(),
+ dense.transform(X).toarray())
+ assert_array_almost_equal(sparse.fit_transform(X).toarray(),
+ dense.fit_transform(X).toarray())
+
+
+def test_sparse_input():
+ X, y = datasets.make_multilabel_classification(return_indicator=True,
+ random_state=0,
+ n_samples=40)
+
+ for name, sparse_matrix in product(FOREST_ESTIMATORS,
+ (csr_matrix, csc_matrix)):
+ yield check_sparse_input, name, X, sparse_matrix(X), y
@ogrisel Owner
ogrisel added a note

I don't really like passing data arrays / matrices to yield statements in tests, as the nose verbose output gets very verbose. Wouldn't it be possible to generate the datasets once and for all at the module level (as constants) instead?

@arjoly Owner
arjoly added a note

If you prefer, I could add a dict of (name, bunch).

@ogrisel Owner
ogrisel added a note

You mean a global, module level dict? or passing a dict as argument to the yield statement? The latter will not solve the verbosity problem.

@arjoly Owner
arjoly added a note

I meant a global variable as ALL_TREES.

@ogrisel Owner
ogrisel added a note

alright.

@ogrisel Owner
ogrisel added a note

I just checked and the output is not that verbose, so I am fine with leaving those tests as they are.

@arjoly Owner
arjoly added a note

Anyway, I will do it. Whenever you try to find the slowest tests with nose-timer, the output is too verbose. :-(

@amueller Owner

You can also overwrite the output produced, so it doesn't show the whole arrays. From the docs:

By default, the test name output for a generated test in verbose mode will be the name of the generator function or method, followed by the args passed to the yielded callable. If you want to show a different test name, set the description attribute of the yielded callable.

So you can do something like

check_sparse_input.description = "check_sparse_input"

to get rid of the verbosity.

@ogrisel
Owner

@glouppe I am fine with waiting for the end of your thesis prior to merging this if you wish.

@glouppe
Owner

@ogrisel I would prefer that, but on the other hand, I don't want to delay the PR any further. In particular, @arjoly, is this a roadblock for your student's GSoC?

@arjoly
Owner

In particular, @arjoly, is this a roadblock for your student's GSoC?

Yes, this is a blocker. Once this pull request is merged, @hamsal will be able to tackle sparse input support for gradient boosting (the last estimator of the ensemble module without sparse input support).

@ogrisel
Owner

+1 for merging on my side once the suggested doc fix is in. @larsmans @jnothman any further comments?

Maybe @ndawe would also be interested in having a look at this PR?

@jnothman
Owner

Sorry, I'm not really able to put aside time for a full review atm.

sklearn/tree/_tree.pyx
((40 lines not shown))
+ DTYPE_t* X_data,
+ INT32_t indptr_start,
+ INT32_t indptr_end,
+ SIZE_t* samples,
+ SIZE_t start,
+ SIZE_t end,
+ SIZE_t* index_to_samples,
+ DTYPE_t* Xf,
+ SIZE_t* end_negative,
+ SIZE_t* start_positive) nogil:
+ """intersection between X_indices[indptr_start:indptr_end]
+ and samples[start:end] using a index_to_samples approach.
+
+ Complexity is O(indptr_end - indptr_start).
+ """
+ cdef INT32_t k_
@jnothman Owner

Why is the underscore helpful? I find it vastly reduces readability.

sklearn/tree/_tree.pyx
((42 lines not shown))
+ INT32_t indptr_end,
+ SIZE_t* samples,
+ SIZE_t start,
+ SIZE_t end,
+ SIZE_t* index_to_samples,
+ DTYPE_t* Xf,
+ SIZE_t* end_negative,
+ SIZE_t* start_positive) nogil:
+ """intersection between X_indices[indptr_start:indptr_end]
+ and samples[start:end] using a index_to_samples approach.
+
+ Complexity is O(indptr_end - indptr_start).
+ """
+ cdef INT32_t k_
+ cdef SIZE_t index
+ cdef SIZE_t end_negative_ = start
@jnothman Owner

Very minor point, but end_negative_, start_positive_ seem very long names for frequently-used local variables.

sklearn/tree/_tree.pyx
((49 lines not shown))
+ SIZE_t* start_positive) nogil:
+ """intersection between X_indices[indptr_start:indptr_end]
+ and samples[start:end] using a index_to_samples approach.
+
+ Complexity is O(indptr_end - indptr_start).
+ """
+ cdef INT32_t k_
+ cdef SIZE_t index
+ cdef SIZE_t end_negative_ = start
+ cdef SIZE_t start_positive_ = end
+
+ for k_ in range(indptr_start, indptr_end):
+ if start <= index_to_samples[X_indices[k_]] < end:
+ if X_data[k_] > 0:
+ start_positive_ -= 1
+ Xf[start_positive_] = X_data[k_]
@jnothman Owner

It looks like this passage of code is repeated four times in extra_nnz_index_to_samples and extract_nnz_binary_search, sometimes with formatting inconsistencies. A private inline function like _swap_samples(k_, start_positive_, X_indices, X_data, Xf, samples, index_to_samples) might be helpful, although many of those arguments could be dropped if extract_nnz and its subsidiaries were members of SparseSplitter (with tiny-if-any overhead for dereferencing self->member).

sklearn/tree/_tree.pyx
((52 lines not shown))
+
+ Complexity is O(indptr_end - indptr_start).
+ """
+ cdef INT32_t k_
+ cdef SIZE_t index
+ cdef SIZE_t end_negative_ = start
+ cdef SIZE_t start_positive_ = end
+
+ for k_ in range(indptr_start, indptr_end):
+ if start <= index_to_samples[X_indices[k_]] < end:
+ if X_data[k_] > 0:
+ start_positive_ -= 1
+ Xf[start_positive_] = X_data[k_]
+ index = index_to_samples[X_indices[k_]]
+
+ tmp = samples[index]
@jnothman Owner

I would find something like the following clearer to read, and I'm pretty sure Cython recognises this notation.

                samples[index], samples[start_positive_] = samples[start_positive_], samples[index]
sklearn/tree/_tree.pyx
((178 lines not shown))
+
+
+cdef inline void extract_nnz(INT32_t* X_indices,
+ DTYPE_t* X_data,
+ INT32_t indptr_start,
+ INT32_t indptr_end,
+ SIZE_t* samples,
+ SIZE_t start,
+ SIZE_t end,
+ SIZE_t* index_to_samples,
+ DTYPE_t* Xf,
+ SIZE_t* end_negative,
+ SIZE_t* start_positive,
+ SIZE_t* sorted_samples,
+ bint* is_samples_sorted) nogil:
+ """ Extract non zero values of X (csc format) in samples[start:end]
@jnothman Owner

This should note that it is extracting and partitioning values for a particular feature, and should say into not in samples[start:end]. The subsidiary functions say they do "Intersection" which is not helpful.

sklearn/tree/_tree.pyx
((56 lines not shown))
+ cdef SIZE_t index
+ cdef SIZE_t end_negative_ = start
+ cdef SIZE_t start_positive_ = end
+
+ for k_ in range(indptr_start, indptr_end):
+ if start <= index_to_samples[X_indices[k_]] < end:
+ if X_data[k_] > 0:
+ start_positive_ -= 1
+ Xf[start_positive_] = X_data[k_]
+ index = index_to_samples[X_indices[k_]]
+
+ tmp = samples[index]
+ samples[index] = samples[start_positive_]
+ samples[start_positive_] = tmp
+ index_to_samples[samples[index]] = index
+ index_to_samples[samples[start_positive_]] = start_positive_
@jnothman Owner

I haven't got to reviewing tests, but please make sure all sparse tree splitters are tested for sparse matrices containing explicit 0 values. You might need >= instead of > for example.

@arjoly Owner
arjoly added a note

I have added a test for explicit zeros and it works. See 9cf6687

@arjoly
Owner

@jnothman Thanks Joel! I will have time next week to address your suggestions.

@jnothman
Owner

Looking further through the code, I think there's a lot of repetition, much of which is likely to impede maintainability/readability more than it saves in runtime. For example, at jnothman@3ed1325, I have factored out the (pos,feature,threshold,improvement,impurity_left,impurity_right) data into a struct (SplitInfo), which deletes 60 lines of repetitious Cython and 360 lines of generated C. Splitter.node_split could also take a pointer to one of these structs, but that would be changing a somewhat-public interface, so I have not implemented it.

sklearn/tree/_tree.pyx
((280 lines not shown))
+ cdef double current_threshold
+
+ cdef SIZE_t f_i = n_features
+ cdef SIZE_t f_j, p, tmp
+ cdef SIZE_t n_visited_features = 0
+ # Number of features discovered to be constant during the split search
+ cdef SIZE_t n_found_constants = 0
+ # Number of features known to be constant and drawn without replacement
+ cdef SIZE_t n_drawn_constants = 0
+ cdef SIZE_t n_known_constants = n_constant_features[0]
+ # n_total_constants = n_known_constants + n_found_constants
+ cdef SIZE_t n_total_constants = n_known_constants
+ cdef DTYPE_t current_feature_value
+ cdef SIZE_t partition_end
+
+ cdef SIZE_t k_
@jnothman Owner

k_ is unused

sklearn/tree/_tree.pyx
((550 lines not shown))
+ cdef SIZE_t f_j, p, tmp
+ cdef SIZE_t n_visited_features = 0
+ # Number of features discovered to be constant during the split search
+ cdef SIZE_t n_found_constants = 0
+ # Number of features known to be constant and drawn without replacement
+ cdef SIZE_t n_drawn_constants = 0
+ cdef SIZE_t n_known_constants = n_constant_features[0]
+ # n_total_constants = n_known_constants + n_found_constants
+ cdef SIZE_t n_total_constants = n_known_constants
+ cdef SIZE_t partition_end
+
+ cdef DTYPE_t min_feature_value
+ cdef DTYPE_t max_feature_value
+
+ cdef SIZE_t k_
+ cdef SIZE_t p_next
@jnothman Owner

k_ and p_next are unused.

sklearn/tree/_tree.pyx
((260 lines not shown))
+ cdef DTYPE_t* Xf = self.feature_values
+ cdef SIZE_t* sorted_samples = self.sorted_samples
+ cdef SIZE_t* index_to_samples = self.index_to_samples
+ cdef SIZE_t max_features = self.max_features
+ cdef SIZE_t min_samples_leaf = self.min_samples_leaf
+ cdef UINT32_t* random_state = &self.rand_r_state
+
+ cdef double best_impurity_left = INFINITY
+ cdef double best_impurity_right = INFINITY
+ cdef SIZE_t best_pos = end
+ cdef SIZE_t best_feature = 0
+ cdef double best_threshold = 0.
+ cdef double best_improvement = -INFINITY
+
+ cdef double current_improvement
+ cdef double current_impurity
@jnothman Owner

current_impurity is unused

sklearn/tree/_tree.pyx
((527 lines not shown))
+ cdef SIZE_t* sorted_samples = self.sorted_samples
+ cdef SIZE_t* index_to_samples = self.index_to_samples
+ cdef SIZE_t max_features = self.max_features
+ cdef SIZE_t min_samples_leaf = self.min_samples_leaf
+ cdef UINT32_t* random_state = &self.rand_r_state
+
+ cdef double best_impurity_left = INFINITY
+ cdef double best_impurity_right = INFINITY
+ cdef SIZE_t best_pos = end
+ cdef SIZE_t best_feature = 0
+ cdef double best_threshold = 0.
+ cdef double best_improvement = -INFINITY
+
+ cdef DTYPE_t current_feature_value
+ cdef double current_improvement
+ cdef double current_impurity
@jnothman Owner

current_impurity is unused

@jnothman
Owner

PS: I realise these refactors seem a little off-topic given the hurry to merge this and other GSOC-related PRs, but I can't understand/verify the code without doing them (mentally at least), so I might as well offer them back.

sklearn/tree/_tree.pyx
((669 lines not shown))
-cdef class Splitter:
- def __cinit__(self, Criterion criterion, SIZE_t max_features,
- SIZE_t min_samples_leaf, object random_state):
- self.criterion = criterion
+ # Draw a random threshold
@jnothman Owner

Please fix indentation.

@jnothman Owner

Also, everything from here to the end of the best-split update could be refactored together with the dense RandomSplitter: the code is identical except for the partition, which can be factored out. I imagine that Cython will repeat the implementation anyway as part of the inheritance (which is good for inlining, but bad for the number of lines of C code). The most annoying part is that the dense version would need to include dummy values for start_positive and end_negative because C doesn't do closures.

So I'm not certain this is worthwhile, but it may be worth considering. Once _partition is factored out, it might be no big deal.

sklearn/tree/_tree.pyx
((679 lines not shown))
- self.samples = NULL
- self.n_samples = 0
- self.features = NULL
- self.n_features = 0
- self.feature_values = NULL
+ if current_threshold == max_feature_value:
@jnothman Owner

The implementation of rand_double is commented to say it generates in the range [0; 1), so this shouldn't be necessary. I think it's a good idea to reimplement rand_int and rand_double to take (lo, hi, random_state), seeing as this is how it is used in the module. Since it's inline, the lo == 0 case should be handled by the optimizer.

@jnothman Owner

If you'd rather I put that into a separate PR, I can.

@arjoly Owner
arjoly added a note

+1 for a separate PR to keep this one focused.

@arjoly Owner
arjoly added a note

In the end, I will do it myself.

@jnothman Owner

It's also fine if you focus on the remainder of this PR and ignore that cosmit.

@jnothman
Owner

Apart from the minor changes I have noted or contributed, _tree.pyx appears correct to me (but I am relying on the fact that the tests pass to some extent, as I haven't validated every update to index_to_samples, etc., nor have I worked out exactly when they are necessary). I will have a further look at the tests and the docs shortly. Then I will try to look at @hamsal's PRs.

@jnothman
Owner

Also, this PR includes an apparently unnecessary recompilation of _gradient_boosting.c.

sklearn/tree/_tree.pyx
((9 lines not shown))
# check if dtype is correct
- if X.dtype != DTYPE:
+ if issparse(X):
@jnothman Owner

Now that it is more substantial, could you please factor out this data validation?

@jnothman jnothman commented on the diff
sklearn/tree/_tree.pyx
((33 lines not shown))
+ for i in range(n_samples):
+ node = self.nodes
+ # While node not a leaf
+ while node.left_child != _TREE_LEAF:
+ # ... and node.right_child != _TREE_LEAF:
+ if X_ptr[X_sample_stride * i +
+ X_fx_stride * node.feature] <= node.threshold:
+ node = &self.nodes[node.left_child]
+ else:
+ node = &self.nodes[node.right_child]
+
+ out_ptr[i] = <SIZE_t>(node - self.nodes) # node offset
+
+ return out
+
+ cdef inline np.ndarray _apply_sparse_csr(self, object X):
@jnothman Owner

Please assert X.format == 'csr' in case someone decides to run this directly with a csc.

@arjoly Owner
arjoly added a note

Done with some input checks.

sklearn/tree/_tree.pyx
((58 lines not shown))
+ cdef INT32_t* X_indptr = <INT32_t*>X_indptr_ndarray.data
+
+ cdef SIZE_t n_samples = X.shape[0]
+ cdef SIZE_t n_features = X.shape[1]
+
+ # Initialize output
+ cdef np.ndarray[SIZE_t, ndim=1] out = np.zeros((n_samples,),
+ dtype=np.intp)
+ cdef SIZE_t* out_ptr = <SIZE_t*> out.data
+
+ # Initialize auxiliary data-structure
+ cdef DTYPE_t feature_value = 0
+ cdef Node* node = NULL
+
+ cdef DTYPE_t* X_sample = NULL
+ cdef SIZE_t* feature_to_sample = NULL
@jnothman Owner

These are not especially clear names, so they at least deserve a comment, if not renaming. feature_to_sample as a data structure records the last seen sample for each feature; functionally, it is an efficient way to identify which features are nonzero in the present sample.
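
A hypothetical pure-Python illustration of the feature_to_sample bookkeeping described above (the helper itself is made up; only the array names follow the diff):

import numpy as np
from scipy.sparse import csr_matrix

def iter_dense_lookups(X_csr):
    """For each CSR row, yield a function giving the dense value of any feature,
    without resetting a buffer between rows."""
    n_samples, n_features = X_csr.shape
    feature_to_sample = np.full(n_features, -1, dtype=np.intp)
    X_sample = np.zeros(n_features, dtype=X_csr.dtype)
    for i in range(n_samples):
        for k in range(X_csr.indptr[i], X_csr.indptr[i + 1]):
            feature_to_sample[X_csr.indices[k]] = i
            X_sample[X_csr.indices[k]] = X_csr.data[k]
        # feature f is non-zero in row i iff feature_to_sample[f] == i;
        # stale entries left over from earlier rows read as implicit zeros.
        yield i, (lambda f, i=i: X_sample[f] if feature_to_sample[f] == i else 0.0)

X = csr_matrix(np.array([[0., 2., 0.], [1., 0., 0.]]))
for i, value_of in iter_dense_lookups(X):
    print(i, [value_of(f) for f in range(3)])   # [0.0, 2.0, 0.0] then [1.0, 0.0, 0.0]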

sklearn/ensemble/forest.py
((11 lines not shown))
for estimator in self.estimators_:
- mask = np.ones(n_samples, dtype=np.bool)
+ mask = np.ones((n_samples, ), dtype=np.bool)
@jnothman Owner

Even if you prefer a tuple here, I'm not sure why there should be a space between , and )

@jnothman Owner

Ah. Olivier caught this below.

sklearn/ensemble/tests/test_forest.py
@@ -231,14 +249,51 @@ def check_oob_score(name, X, y, n_estimators=20):
assert_greater(test_score, est.oob_score_)
assert_greater(est.oob_score_, .8)
+ # Check warning if not enought estimator
@jnothman Owner

"enough"

@jnothman Owner

"estimators"

sklearn/ensemble/tests/test_forest.py
((21 lines not shown))
+ for name in FOREST_REGRESSORS:
+ yield check_oob_score, name, boston.data, boston.target, 50
+
+
+def check_oob_score_raise_error(name):
+ ForestEstimator = FOREST_ESTIMATORS[name]
+
+ if name in FOREST_TRANSFORMERS:
+ for oob_score in [True, False]:
+ assert_raises(TypeError, ForestEstimator, oob_score=oob_score)
+
+ assert_raises(NotImplementedError, ForestEstimator()._set_oob_score,
+ X, y)
+
+ else:
+ # Unfitted / no bootrapt / no oob_score
@jnothman Owner

"bootstrap"

doc/modules/tree.rst
@@ -195,6 +199,10 @@ instead of integer values::
>>> clf.predict([[1, 1]])
array([ 0.5])
+Note that sparsity is also supported by :class:`DecisionTreeClassifier` as in
@jnothman Owner

what do you mean by "as in classification"?

@jnothman
Owner

I have completed my review, and apart from that backlog of comments, and a glorious what's new entry, this is looking great.

sklearn/ensemble/forest.py
@@ -179,8 +186,9 @@ def apply(self, X):
Parameters
----------
- X : array-like, shape = [n_samples, n_features]
- Input data.
+ X : array-like or sparse matrix, shape = [n_samples, n_features]
+ Input data. Use csr sparse matrix and ``dtype=np.float32``
@amueller Owner

the np.float32 comment also applies to dense data, right?

@arjoly Owner
arjoly added a note

Yes. Thanks @amueller for reviewing!

@vene Owner
vene added a note

I'd say "sparse csr_matrix" instead of "csr sparse matrix".

sklearn/ensemble/forest.py
((14 lines not shown))
mask[estimator.indices_] = False
+ mask = sample_indices[mask]
@amueller Owner

nitpick: I would call this indices or mask_indices as I would expect mask to always be a boolean.

sklearn/ensemble/forest.py
((9 lines not shown))
Returns
-------
y : array of shape = [n_samples] or [n_samples, n_outputs]
The predicted classes.
"""
- n_samples = len(X)
@amueller Owner

FYI, there is a helper function to get the number of samples in a safe way in utils ;) I think it is called _num_samples or something. Your solution is obviously also fine.

sklearn/ensemble/forest.py
((8 lines not shown))
mask[estimator.indices_] = False
+ mask = sample_indices[mask]
@amueller Owner

same as a above.

sklearn/tree/tests/test_tree.py
((137 lines not shown))
+
+
+def check_explicit_sparse_zeros(tree, dataset, max_depth=5):
+ TreeEstimator = ALL_TREES[tree]
+ X = DATASETS[dataset]["X"]
+ y = DATASETS[dataset]["y"]
+ X_sparse = DATASETS[dataset]["X_sparse"]
+
+ n_samples, n_features = X_sparse.shape
+ n_explicit_zeros = n_samples * n_features / 5
+
+ random_state = check_random_state(0)
+ i = random_state.randint(0, n_samples, size=(n_explicit_zeros,))
+ j = random_state.randint(0, n_features, size=(n_explicit_zeros,))
+
+ X_sparse[i, j] = 0.
@jnothman Owner

Unfortunately, this doesn't ensure that a 0 ends up in X_sparse.data, which is what we want. (To be more precise, scipy explicit zero support is largely undocumented and untested, so we can't be sure whether this works, and it currently does for some sparse matrix types and does not for others!)

To be sure, modify X_sparse.data directly (or construct the matrix with the (data, indices, indptr) containing 0s).

@jnothman Owner

Also, I think you might need to make a copy of what you pull out of DATASETS or you will affect other tests.

@ogrisel Owner
ogrisel added a note

You have the X_sparse.eliminate_zeros() method (at least on CSR, I just checked).

@jnothman Owner

Yes, I don't think it should be necessary to use eliminate_zeros in the tree implementation. But we need to test that explicit zeros don't break things in any case.

@ogrisel Owner
ogrisel added a note

Ah alright I read too quickly and understood the opposite.
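
A small sketch of the construction jnothman suggests, with a zero placed directly in the data array so it cannot be silently dropped:

import numpy as np
from scipy.sparse import csr_matrix

data = np.array([1.0, 0.0, 2.0], dtype=np.float32)   # note the explicit 0.0
indices = np.array([0, 2, 1], dtype=np.int32)
indptr = np.array([0, 2, 3], dtype=np.int32)
X = csr_matrix((data, indices, indptr), shape=(2, 3))
assert 0.0 in X.data   # the zero is stored explicitly, not eliminated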

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 6230d0e on arjoly:sparse-tree into 5cab6c9 on scikit-learn:master.

doc/modules/tree.rst
@@ -326,6 +325,11 @@ Tips on practical use
* All decision trees use ``np.float32`` arrays internally.
If training data is not in this format, a copy of the dataset will be made.
+ * If the input matrix X is very sparse, it is recommended to convert
+ the array into a sparse `csc_matrix` format applying `fit` and `csr_matrix`
+ format before predicting. Training time can be orders of magnitude faster
+ for a sparse matrix input compare to a dense matrix when features have
@glouppe Owner
glouppe added a note

compared

doc/modules/tree.rst
@@ -326,6 +325,11 @@ Tips on practical use
* All decision trees use ``np.float32`` arrays internally.
If training data is not in this format, a copy of the dataset will be made.
+ * If the input matrix X is very sparse, it is recommended to convert
+ the array into a sparse `csc_matrix` format applying `fit` and `csr_matrix`
+ format before predicting. Training time can be orders of magnitude faster
@glouppe Owner
glouppe added a note

"it is recommended to convert the array into a sparse csc_matrix format applying fit and csr_matrix format before predicting"

The sentence is confusing. I would say in simpler terms: It is recommended to convert the array into a sparse csc_matrix format before calling fit or predict.

@glouppe Owner
glouppe added a note

Hmm never mind, I got it now, CSC is recommended for fitting and CSR for predictions. The difference is subtle and should be highlighted more clearly.

@vene Owner
vene added a note

Why not: csc_matrix before calling fit and csr_matrix before calling predict?
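
A short usage sketch of that recommendation (assuming the sparse support added by this PR): convert to CSC before fit and to CSR before predict.

import numpy as np
from scipy.sparse import csc_matrix, csr_matrix
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
X[np.abs(X) < 1.0] = 0.0                   # make the data mostly zeros

clf = DecisionTreeClassifier(random_state=0)
clf.fit(csc_matrix(X), y)                  # CSC is the efficient format for fit
pred = clf.predict(csr_matrix(X))          # CSR is the efficient format for predict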

sklearn/ensemble/forest.py
@@ -34,28 +34,35 @@ class calls the ``fit`` method of each sub-estimator on random samples
# Authors: Gilles Louppe <g.louppe@gmail.com>
# Brian Holt <bdholt1@gmail.com>
+# Joly Arnaud <arnaud.v.joly@gmail.com>
+# Fares Hedayati
@glouppe Owner
glouppe added a note

no email for Fares?

sklearn/ensemble/forest.py
((19 lines not shown))
from ..base import ClassifierMixin, RegressorMixin
from ..externals.joblib import Parallel, delayed
from ..externals import six
-from ..externals.six.moves import xrange
+from ..externals.six.moves import xrange as range
@glouppe Owner
glouppe added a note

Is this necessary at all? We could drop this import.

@arjoly Owner
arjoly added a note

+1

sklearn/tree/_tree.pyx
@@ -10,21 +10,26 @@
# Lars Buitinck <L.J.Buitinck@uva.nl>
# Arnaud Joly <arnaud.v.joly@gmail.com>
# Joel Nothman <joel.nothman@gmail.com>
+# Fares Hedayati
@glouppe Owner
glouppe added a note
  • email
sklearn/tree/_tree.pyx
((20 lines not shown))
+ np.ndarray[DOUBLE_t, ndim=2, mode="c"] y,
+ DOUBLE_t* sample_weight) except *:
+ """Initialize the splitter."""
+
+ # Call parent init
+ Splitter.init(self, X, y, sample_weight)
+
+ # Initialize X
+ cdef np.ndarray X_ndarray = X
+
+ self.X = <DTYPE_t*> X_ndarray.data
+ self.X_sample_stride = <SIZE_t> X.strides[0] / <SIZE_t> X.itemsize
+ self.X_fx_stride = <SIZE_t> X.strides[1] / <SIZE_t> X.itemsize
+
+
+cdef class SparseSplitter(Splitter):
@glouppe Owner
glouppe added a note

I would put all sparse splitters after dense splitters. They are mixed right now in the source code.

@amueller Owner

I would add at least a comment that this is for CSC only.

sklearn/tree/_tree.pyx
((273 lines not shown))
+ features[f_j] = features[n_total_constants]
+ features[n_total_constants] = current.feature
+
+ n_found_constants += 1
+ n_total_constants += 1
+
+ else:
+ f_i -= 1
+ features[f_i], features[f_j] = features[f_j], features[f_i]
+
+ # Evaluate all splits
+ self.criterion.reset()
+ p = start
+
+ while p < end:
+ p_next = (p + 1 if p + 1 != end_negative
@glouppe Owner
glouppe added a note

these notations are confusing without checking the precedence of operations. Is it (p+1) if or p + (1 if ...)?

@arjoly Owner
arjoly added a note

The ternary operator has the second-lowest precedence of all Python operators (only lambda binds less tightly).

Thus, it is equivalent to (p + 1) if ...
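
A quick check of that claim with hypothetical values:

p, end_negative, start_positive = 3, 7, 9
p_next = p + 1 if p + 1 != end_negative else start_positive
assert p_next == ((p + 1) if (p + 1 != end_negative) else start_positive)   # both give 4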

sklearn/tree/tree.py
@@ -9,21 +9,27 @@
# Noel Dawe <noel@dawe.me>
# Satrajit Gosh <satrajit.ghosh@gmail.com>
# Joly Arnaud <arnaud.v.joly@gmail.com>
+# Fares Hedayati
@glouppe Owner
glouppe added a note
  • email
@larsmans Owner

Not required, it's in the commit log. In fact, I've been thinking of removing my own email address from the source to prevent people from contacting me personally for support.

@GaelVaroquaux Owner
@arjoly Owner
arjoly added a note

Maybe this should be done uniformly everywhere. We have the convention, for now, that it's in every source file.

sklearn/tree/tree.py
((6 lines not shown))
if check_input:
- X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
+ X = atleast2d_or_csc(X, dtype=DTYPE)
+ if issparse(X):
+ X.sort_indices()
+
+ if X.indices.dtype != np.int32 or X.indptr.dtype != np.int32:
+ raise ValueError("64 bit index based sparse matrix are "
+ "not supported")
@glouppe Owner
glouppe added a note

matrices

sklearn/tree/tree.py
((9 lines not shown))
Returns
-------
y : array of shape = [n_samples] or [n_samples, n_outputs]
The predicted classes, or the predict values.
"""
- if getattr(X, "dtype", None) != DTYPE or X.ndim != 2:
- X = array2d(X, dtype=DTYPE)
+ X = atleast2d_or_csr(X, dtype=DTYPE)
+ if issparse(X) and (X.indices.dtype != np.int32 or
+ X.indptr.dtype != np.int32):
+ raise ValueError("No support for int64 index based sparse "
+ "matrix ")
@glouppe Owner
glouppe added a note

Use the same error message as in fit, for consistency.

sklearn/tree/tree.py
@@ -505,8 +529,12 @@ def predict_proba(self, X):
The class probabilities of the input samples. The order of the
classes corresponds to that in the attribute `classes_`.
"""
- if getattr(X, "dtype", None) != DTYPE or X.ndim != 2:
- X = array2d(X, dtype=DTYPE)
+ X = atleast2d_or_csr(X, dtype=DTYPE)
+ if issparse(X) and (X.indices.dtype != np.int32 or
+ X.indptr.dtype != np.int32):
+ raise ValueError("No support for int64 index based sparse "
@glouppe Owner
glouppe added a note

Same here

@larsmans Owner

Strictly speaking the type should be np.intc, not np.int32; scipy.sparse uses a plain C int.

sklearn/ensemble/weight_boosting.py
@@ -1096,7 +1132,12 @@ def staged_predict(self, X):
The predicted regression values.
"""
self._check_fitted()
- X = safe_asarray(X)
+ if (self.base_estimator is None or
+ isinstance(self.base_estimator,
+ (BaseDecisionTree, BaseForest))):
+ X = atleast2d_or_csr(X, dtype=DTYPE)
+ else:
+ X = safe_asarray(X)
@glouppe Owner
glouppe added a note

This code is duplicated 6 times. Could you factor it out within a _validate_X method? (Just like we have _validate_y in forest)

@vene Owner
vene added a note

+1, and I'd comment that this is about which base estimators support sparse data (hope I'm not reading the code wrong)

@amueller Owner

Actually I'm not sure what the reason for this is. I thought it was for converting csc to csr only once, and not in every tree?

@arjoly Owner
arjoly added a note

This is the idea, but the staged_* functions are exposed to the user, so we have to ensure the input has the proper format.

sklearn/tree/_tree.pyx
((58 lines not shown))
+ def __dealloc__(self):
+ """Deallocate memory"""
+ free(self.index_to_samples)
+ free(self.sorted_samples)
+
+ cdef void init(self,
+ object X,
+ np.ndarray[DOUBLE_t, ndim=2, mode="c"] y,
+ DOUBLE_t* sample_weight) except *:
+ """Initialize the splitter."""
+
+ # Call parent init
+ Splitter.init(self, X, y, sample_weight)
+
+ if not isinstance(X, csc_matrix):
+ raise ValueError("X should in csc format")
@glouppe Owner
glouppe added a note

+be

sklearn/tree/_tree.pyx
((6 lines not shown))
np.ndarray sample_weight=None):
"""Build a decision tree from the training set (X, y)."""
pass
+ cdef inline _check_input(self, object X, np.ndarray y,
+ np.ndarray sample_weight):
+ """Check input dtype, layout and format"""
+ if issparse(X):
+ X = X.tocsc()
+ X.sort_indices()
+
+ if X.data.dtype != DTYPE:
+ X.data = np.ascontiguousarray(X.data, dtype=DTYPE)
+
+ if X.indices.dtype != np.int32 or X.indptr.dtype != np.int32:
+ raise ValueError("64 bit index based sparse matrix are not "
+ "supported")
@glouppe Owner
glouppe added a note

same here, please be consistent with the error messages

sklearn/tree/_tree.pyx
((90 lines not shown))
+ safe_realloc(&self.index_to_samples, n_total_samples * sizeof(SIZE_t))
+ safe_realloc(&self.sorted_samples, n_samples * sizeof(SIZE_t))
+
+ cdef SIZE_t* index_to_samples = self.index_to_samples
+ cdef SIZE_t p
+ for p in range(n_total_samples):
+ index_to_samples[p] = -1
+
+ for p in range(n_samples):
+ index_to_samples[samples[p]] = p
+
+ cdef inline SIZE_t _partition(self, DTYPE_t* Xf, double threshold,
+ SIZE_t start, SIZE_t end,
+ SIZE_t end_negative, SIZE_t start_positive,
+ SIZE_t zero_pos) nogil:
+ """Perform a partition of samples based on the threshold"""
@glouppe Owner
glouppe added a note

I would rather say: "Partition samples[start:end] based on threshold."

sklearn/tree/_tree.pyx
((186 lines not shown))
+ Parameters
+ ----------
+ X_indices : c-array of INT32_t,
+ Indices of the csc matrix which are in sorted order
+
+ X_data : c-array of INT32_t,
+ Data of the csc matrix
+
+ indptr_start, indptr_end : INT32_t,
+ indptr_start, indptr_end = X_indptr[feature], X_indptr[feature + 1]
+ where X_indptr would be the indptr of the csc matrix.
+
+ samples, start, end : c-array of SIZE_t, SIZE_t, SIZE_t
+ samples[start:end] is the subset of samples to split
+
+ index_to_samples : c-arrayof SIZE_t,
@glouppe Owner
glouppe added a note

missing whitespace

sklearn/tree/_tree.pyx
((209 lines not shown))
+ sorted_samples, is_samples_sorted : c-array of SIZE_t, bint,
+ If is_samples_sorted, then sorted_samples[start:end] will be the sorted
+ version of samples[start:end], else is_samples_sorted is set to True
+ and samples[start:end]
+
+ """
+ cdef SIZE_t n_indices = <SIZE_t>(indptr_end - indptr_start)
+ cdef SIZE_t n_samples = end - start
+
+ # Use binary search if n_samples * log(n_indices) <
+ # n_indices and index_to_samples approach otherwise.
+ # O(n_samples * log(n_indices)) is the running time of binary
+ # search and O(n_indices) is the running time of coloring
+ # technique.
+ if ((1 - is_samples_sorted[0]) * n_samples * log(n_samples) +
+ n_samples * log(n_indices) < 0.1 * n_indices):
@glouppe Owner
glouppe added a note

How did you set the 0.1 constant? What is the impact of this value?

@arjoly Owner
arjoly added a note

I benchmarked a bit on 20 newsgroups to get that value.

@arjoly Owner
arjoly added a note

Though this might be optimised further.
Here is the benchmark I made to address one of @larsmans's comments.

# 20 news group (density ~ 0.0012)
rf = RandomForestClassifier(max_features=0.1, random_state=0)
rf without                88.6212s   0.0903s     0.3586  
rf with 0.1 * n_indices   95.1268s   0.1226s     0.3586  
rf with 0.2 * n_indices   89.6108s   0.1077s     0.3586  

# Covertype (density ~ 0.22)
dt without               436.5677s   0.0164s     0.0426  
dt with 0.1 * n_indices   68.9361s   0.0193s     0.0426 
dt with 0.2 * n_indices   60.6809s   0.0147s     0.0426
@glouppe Owner
glouppe added a note

0.2 appears to be a better value then, doesn't it?

@arjoly Owner
arjoly added a note

True, I will re-bench this constant.

sklearn/tree/_tree.pyx
((87 lines not shown))
+
+
+cdef inline void extract_nnz_binary_search(INT32_t* X_indices,
+ DTYPE_t* X_data,
+ INT32_t indptr_start,
+ INT32_t indptr_end,
+ SIZE_t* samples,
+ SIZE_t start,
+ SIZE_t end,
+ SIZE_t* index_to_samples,
+ DTYPE_t* Xf,
+ SIZE_t* end_negative,
+ SIZE_t* start_positive,
+ SIZE_t* sorted_samples,
+ bint* is_samples_sorted) nogil:
+ """Extracting and partitioning values for a given feature using binary search
@glouppe Owner
glouppe added a note

Extract and partition

sklearn/tree/_tree.pyx
((159 lines not shown))
+
+
+cdef inline void extract_nnz(INT32_t* X_indices,
+ DTYPE_t* X_data,
+ INT32_t indptr_start,
+ INT32_t indptr_end,
+ SIZE_t* samples,
+ SIZE_t start,
+ SIZE_t end,
+ SIZE_t* index_to_samples,
+ DTYPE_t* Xf,
+ SIZE_t* end_negative,
+ SIZE_t* start_positive,
+ SIZE_t* sorted_samples,
+ bint* is_samples_sorted) nogil:
+ """Extracting and partitioning values for a given feature
@glouppe Owner
glouppe added a note

Extract and partition

sklearn/tree/_tree.pyx
((11 lines not shown))
cdef inline double log(double x) nogil:
return ln(x) / ln(2.0)
+
+
+# =============================================================================
+# Non zero value extraction with sparse matrices
+# =============================================================================
+cdef int compare_SIZE_t(const void * a, const void * b) nogil:
@glouppe Owner
glouppe added a note

don't insert whitespaces before *

@glouppe
Owner

I just did a first round of review. Apart from my comments above, mostly regarding style consistency, I don't have much to say about the content itself. It is quite understandable and appears to do the job efficiently.

When comparing Random Forests and Extra-Trees with linear models in document_classification_20newsgroups.py, the performance of forests appears to be worse than that of linear models (from 0 to 10% worse in terms of f1-score) while taking a lot longer to build (20-40s for 100 trees on my machine, vs. less than 1s for all linear models). I am not that surprised though; I did not expect forests to do very well on this kind of data. But at least they are nearly as good. I'm looking forward to seeing GBRT in action though!

All in all, once my comments are fixed, I am +1 for merge.

sklearn/ensemble/forest.py
@@ -199,8 +209,9 @@ def fit(self, X, y, sample_weight=None):
Parameters
----------
- X : array-like of shape = [n_samples, n_features]
- The training input samples.
+ X : array-like or sparse matrix of shape = [n_samples, n_features]
+ The training input samples. Use csc sparse matrix with
@vene Owner
vene added a note

Same comment (Andy's should also apply)

sklearn/tree/_tree.pyx
((338 lines not shown))
+ # element in features[:n_known_constants] must be preserved for sibling
+ # and child nodes
+ memcpy(features, constant_features, sizeof(SIZE_t) * n_known_constants)
+
+ # Copy newly found constant features
+ memcpy(constant_features + n_known_constants,
+ features + n_known_constants,
+ sizeof(SIZE_t) * n_found_constants)
+
+ # Return values
+ split[0] = best
+ n_constant_features[0] = n_total_constants
+
+
+cdef class RandomSparseSplitter(SparseSplitter):
+ """Splitter for finding the best split, using the sparse data."""
@vene Owner
vene added a note

This docstring is the same as BestSparseSplitter. Is it accurate?

sklearn/tree/_tree.pyx
((348 lines not shown))
+ split[0] = best
+ n_constant_features[0] = n_total_constants
+
+
+cdef class RandomSparseSplitter(SparseSplitter):
+ """Splitter for finding the best split, using the sparse data."""
+
+ def __reduce__(self):
+ return (RandomSparseSplitter, (self.criterion,
+ self.max_features,
+ self.min_samples_leaf,
+ self.random_state), self.__getstate__())
+
+ cdef void node_split(self, double impurity, SplitRecord* split,
+ SIZE_t* n_constant_features) nogil:
+ """Find the best split on node samples[start:end], using sparse
@vene Owner
vene added a note

Same, is this "best random split"?

sklearn/tree/_tree.pyx
((34 lines not shown))
+ pivot = start + (end - start) / 2
+
+ if sorted_array[pivot] == value:
+ index[0] = pivot
+ start = pivot + 1
+ break
+
+ if sorted_array[pivot] < value:
+ start = pivot + 1
+ else:
+ end = pivot
+ new_start[0] = start
+
+
+
+cdef inline void extra_nnz_index_to_samples(INT32_t* X_indices,
@vene Owner
vene added a note

two spaces between void and function name

@vene Owner
vene added a note

Is this supposed to be "extra" or "extract"? The docstring confused me.

@vene Owner
vene added a note

Maybe add "See extra_nnz" to the docstring? Should these functions (this and the next) be underscored?

sklearn/tree/_tree.pyx
((206 lines not shown))
+ negative values are in [start:end_negative[0]] and positive values
+ are in [start_positive[0]:end].
+
+ sorted_samples, is_samples_sorted : c-array of SIZE_t, bint,
+ If is_samples_sorted, then sorted_samples[start:end] will be the sorted
+ version of samples[start:end], else is_samples_sorted is set to True
+ and samples[start:end]
+
+ """
+ cdef SIZE_t n_indices = <SIZE_t>(indptr_end - indptr_start)
+ cdef SIZE_t n_samples = end - start
+
+ # Use binary search if n_samples * log(n_indices) <
+ # n_indices and index_to_samples approach otherwise.
+ # O(n_samples * log(n_indices)) is the running time of binary
+ # search and O(n_indices) is the running time of coloring
@vene Owner
vene added a note

"coloring technique" is never mentioned before. It would be better to introduce the term in the docstring.

sklearn/tree/tests/test_tree.py
((14 lines not shown))
+X_sparse_pos[X_sparse_pos <= 0.8] = 0.
+y_random = random_state.randint(0, 4, size=(20, ))
+X_sparse_mix = sparse_random_matrix(20, 10, density=0.25, random_state=0)
+
+
+DATASETS = {
+ "iris": {"X": iris.data, "y": iris.target},
+ "boston": {"X": boston.data, "y": boston.target},
+ "digits": {"X": digits.data, "y": digits.target},
+ "toy": {"X": X, "y": y},
+ "clf_small": {"X": X_small, "y": y_small},
+ "reg_small": {"X": X_small, "y": y_small_reg},
+ "multilabel": {"X": X_multilabel, "y": y_multilabel},
+ "sparse-pos": {"X": X_sparse_pos, "y": y_random},
+ "sparse-neg": {"X": - X_sparse_pos, "y": y_random},
+ "sparse-mix": {"X": - X_sparse_mix, "y": y_random},
@vene Owner
vene added a note

Why is there a negative sign here?

sklearn/tree/tests/test_tree.py
((128 lines not shown))
+ "{0} with dense and sparse format gave different "
+ "trees".format(tree))
+ assert_array_almost_equal(s.predict(X), d.predict(X))
+
+def test_sparse_criterion():
+ for tree, dataset in product(SPARSE_TREES,
+ ["sparse-pos", "sparse-neg", "sparse-mix",
+ "zeros"]):
+ yield (check_sparse_criterion, tree, dataset)
+
+
+def check_explicit_sparse_zeros(tree, max_depth=3,
+ n_features=10):
+ TreeEstimator = ALL_TREES[tree]
+
+ # n_samples is set n_feature to ease construction of a simultaneous
@vene Owner
vene added a note

set to n_features

to ease simultaneous construction of a csr and csc matrix

@amueller Owner

why don't you just convert them? Or do you want really big matrices? Usually I prefer n_samples != n_features.

@vene vene commented on the diff
sklearn/tree/tests/test_tree.py
((159 lines not shown))
+ data.append(data_i)
+ offset += n_nonzero_i
+ indptr.append(offset)
+
+ indices = np.concatenate(indices)
+ data = np.array(np.concatenate(data), dtype=np.float32)
+ X_sparse = csc_matrix((data, indices, indptr),
+ shape=(n_samples, n_features))
+ X = X_sparse.toarray()
+ X_sparse_test = csr_matrix((data, indices, indptr),
+ shape=(n_samples, n_features))
+ X_test = X_sparse_test.toarray()
+ y = random_state.randint(0, 3, size=(n_samples, ))
+
+ # Ensure that X_sparse_test owns its data, indices and indptr array
+ X_sparse_test = X_sparse_test.copy()
@vene Owner
vene added a note

Just curious, since you just explicitly built it, how can this not be the case?

@larsmans Owner

If the values and indices vectors are of the right type and contiguity, the CSR constructor doesn't copy them:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> A = csr_matrix(np.arange(9).reshape(3, 3))  # to get these vectors quickly
>>> data, indices, indptr = A.data, A.indices, A.indptr
>>> A = csr_matrix((data, indices, indptr))
>>> data[:] = 1
>>> A.toarray()
array([[0, 1, 1],
       [1, 1, 1],
       [1, 1, 1]])
@larsmans larsmans commented on the diff
sklearn/tree/_tree.pyx
((14 lines not shown))
+
+
+# =============================================================================
+# Non zero value extraction with sparse matrices
+# =============================================================================
+cdef int compare_SIZE_t(const void * a, const void * b) nogil:
+ """Comparison function for sort"""
+ return <int>((<SIZE_t*>a)[0] - (<SIZE_t*>b)[0])
+
+
+cdef inline void binary_search(INT32_t* sorted_array, INT32_t start, INT32_t end,
+ SIZE_t value, SIZE_t* index,
+ INT32_t* new_start) nogil:
+ """Return the index of value in the sorted array
+
+ If not found, return -1. new_start is the last pivot + 1
@larsmans Owner

Why return by pointer in a void function?

Also, why not just roll sorted_array and start into one, i.e. call with sorted_array + start?

Finally, if you're using qsort, why not reuse its comparison function with the standard C function bsearch?

@amueller Owner

+1

@arjoly Owner
arjoly added a note

You can make things faster by taking into account that you want to perform the intersection between two arrays instead of repeating the search for one element in an array.
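
A sketch of that idea: one merge-style pass over two sorted index arrays instead of an independent binary search per element (illustrative only, not the PR's code):

def intersect_sorted(a, b):
    """Two-pointer intersection of two sorted integer sequences,
    O(len(a) + len(b)) instead of len(a) separate binary searches."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

intersect_sorted([1, 3, 5, 8], [2, 3, 8, 9])   # [3, 8]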

@larsmans larsmans commented on the diff
sklearn/tree/_tree.pyx
@@ -2379,28 +2955,129 @@ cdef class Tree:
out = out.reshape(X.shape[0], self.max_n_classes)
return out
@larsmans Owner

[Maybe not for this PR] I have the feeling that it's rather senseless to have this function in Cython. It's just calling NumPy functions.

@arjoly Owner
arjoly added a note

+1 for factoring this out in another PR.

sklearn/ensemble/weight_boosting.py
((5 lines not shown))
- if (X.ndim != 2 and not issparse(X)):
+ X = safe_asarray(X)
+ if X.ndim != 2:
@amueller Owner

why not use atleast_2d_or_csr? Because we don't want to convert the csc ones? The helper functions seem a bit confusing after not having looked at them for a while ;)

@arjoly Owner
arjoly added a note

The base estimator might require data in csc rather than csr format, so you don't want to impose the csr format here.

@amueller Owner

Sorry, can you explain again? I didn't get this.

@arjoly Owner
arjoly added a note

I don't know for sure, but there might be some other algorithm in scikit-learn that works only with csc, and you don't want to enforce the csr format on it.

sklearn/ensemble/weight_boosting.py
((9 lines not shown))
X = array2d(X)
- if(self.base_estimator is None or
- isinstance(self.base_estimator, (BaseDecisionTree, BaseForest))):
- X, = check_arrays(X, dtype=DTYPE)
+ if (self.base_estimator is None or
@amueller Owner

maybe put these lines in a small helper?

@arjoly Owner
arjoly added a note

It's done when this is needed (predict / predict_proba / decision_function)

@amueller amueller commented on the diff
sklearn/tree/tests/test_tree.py
@@ -54,6 +62,39 @@
ALL_TREES.update(CLF_TREES)
ALL_TREES.update(REG_TREES)
+SPARSE_TREES = [name for name, Tree in ALL_TREES.items()
+ if Tree().splitter in SPARSE_SPLITTERS]
+
+
+X_small = np.array([
@amueller Owner

What is the motivation for this data? Can you explain how it was generated or why it is used?

@arjoly Owner
arjoly added a note

This was first brought in by @fareshedayati. I don't know where it comes from.

@amueller Owner

Ok, but what is the motivation? Anything that makes this good data for testing? I feel it is kinda weird to hard-code a largish matrix in the tests.

@arjoly Owner
arjoly added a note

I think it's what you have described. I can remove that dataset.

@amueller
Owner

There are some enhancements in the tests that are unrelated to the sparse support, right? Could you maybe put them in a separate PR? This one is kinda big ;)

@amueller
Owner

Btw, have you checked whether the trees on dense input are as fast as before? I would imagine the small changes don't make any difference but it would be good to double-check.

sklearn/tree/_tree.pyx
@@ -2531,9 +3207,239 @@ cdef inline SIZE_t rand_int(SIZE_t end, UINT32_t* random_state) nogil:
"""Generate a random integer in [0; end)."""
return our_rand_r(random_state) % end
-cdef inline double rand_double(UINT32_t* random_state) nogil:
- """Generate a random double in [0; 1)."""
- return <double> our_rand_r(random_state) / <double> RAND_R_MAX
+cdef inline double rand_uniform(double low, double high,
+ UINT32_t* random_state) nogil:
+ """Generate a random double in [low; high)."""
+ return low + (high - low) * <double> our_rand_r(random_state) / <double> RAND_R_MAX
@jnothman Owner

Split this onto two lines?

@jnothman Owner

And I think rand_int should have the same interface as rand_uniform (i.e. give it low and high).

@arjoly Owner
arjoly added a note

OK, I will extract this refactoring from this PR.

sklearn/tree/_tree.pyx
((74 lines not shown))
+ index = index_to_samples[X_indices[k]]
+ sparse_swap(index_to_samples, samples, index, start_positive_)
+
+
+ elif X_data[k] < 0:
+ Xf[end_negative_] = X_data[k]
+ index = index_to_samples[X_indices[k]]
+ sparse_swap(index_to_samples, samples, index, end_negative_)
+ end_negative_ += 1
+
+ # Returned values
+ end_negative[0] = end_negative_
+ start_positive[0] = start_positive_
+
+
+cdef inline void extract_nnz_binary_search(INT32_t* X_indices,
@amueller Owner

I don't want to burden you too much, but do you think it makes sense to unit-test these helper functions? I believe that they work but fine-grained unit tests make refactoring easier.

sklearn/tree/tree.py
((6 lines not shown))
The training input samples. Use ``dtype=np.float32`` for maximum
- efficiency.
+ efficiency. Sparse matrices are also supported, use sparse
@amueller Owner

From what I understood, it will be converted to csc for training, right? From the current docstring, a user might think using csr is just slower, not that it uses more memory.

sklearn/tree/tree.py
@@ -281,16 +299,21 @@ def predict(self, X):
Parameters
----------
- X : array-like of shape = [n_samples, n_features]
- The input samples.
+ X : array-like or sparse matrix of shape = [n_samples, n_features]
+ The input samples. Use ``dtype=np.float32`` for maximum
+ efficiency. Sparse matrices are also supported, use sparse
@amueller Owner

Same as above. Maybe say "will be converted to np.float32 and csr internally"

sklearn/tree/tree.py
@@ -546,8 +574,10 @@ def predict_log_proba(self, X):
Parameters
----------
- X : array-like of shape = [n_samples, n_features]
- The input samples.
+ X : array-like or sparse matrix of shape = [n_samples, n_features]
+ The input samples. Use ``dtype=np.float32`` for maximum
@amueller Owner

same as above.

sklearn/ensemble/weight_boosting.py
((5 lines not shown))
from .base import BaseEnsemble
from ..base import ClassifierMixin, RegressorMixin
from ..externals import six
-from ..externals.six.moves import xrange, zip
+from ..externals.six.moves import zip
+from ..external.six.moves import xrange as range

Isn't external missing a trailing s?

@arjoly Owner
arjoly added a note

indeed.

@amueller Owner
@amueller
Owner

Looks good apart from the minor comments. I can't claim I went through it line-by-line, though.

@arjoly
Owner

Btw, have you checked whether the trees on dense input are as fast as before? I would imagine the small changes don't make any difference but it would be good to double-check.

I can if you point me to which one.

@amueller
Owner
@arjoly
Owner

Sorry @amueller, I meant: please point out which lines of code could be extracted from this PR.
:-)

There are some enhancements in the tests that are unrelated to the sparse support, right? Could you maybe put them in a separate PR? This one is kinda big ;)

sklearn/ensemble/tests/test_forest.py
((33 lines not shown))
+ X, y)
+
+ else:
+ # Unfitted / no bootstrap / no oob_score
+ for oob_score, bootstrap in [(True, False), (False, True),
+ (False, False)]:
+ est = ForestEstimator(oob_score=oob_score, bootstrap=bootstrap,
+ random_state=0)
+ assert_false(hasattr(est, "oob_score_"))
+
+ # No bootstrap
+ assert_raises(ValueError, ForestEstimator(oob_score=True,
+ bootstrap=False).fit, X, y)
+
+
+def test_oob_score_raise_error():
@amueller Owner

These tests are not sparse-related, right?

@fareshedayati

@arjoly
I really would like to see this merged, can I help with anything?
thanks

@arjoly
Owner

Not much remains to be done; I will try to work on this.
Mostly a lot of rebase conflicts to handle and some puzzling things to do.
I will try to finish this soon.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 196a211 on arjoly:sparse-tree into fd4ba4d on scikit-learn:master.

@hamilyon

Any news on this?

I am sure this is quite an important pull request and many are watching it.

@arjoly
Owner

I still have to do some benchmarks, and then it will be ready for a final round of review.

@arjoly
Owner

The benchmarking is in progress.

@arjoly
Owner

I have benchmarked the constant and it appears that the current one is not that bad, though it's hard to get an optimal value since it depends on the algorithm parameters.

@arjoly
Owner

I think it's ready for a final round of review.

sklearn/tree/_tree.pyx
@@ -51,6 +56,12 @@ cdef DTYPE_t MIN_IMPURITY_SPLIT = 1e-7
# Mitigate precision differences between 32 bit and 64 bit
cdef DTYPE_t FEATURE_THRESHOLD = 1e-7
+# Constant to switch between algorithm non zero value extract algorithm
+# in SparseSplitter
+import os
+cdef DTYPE_t EXTRACT_NNZ_SWITCH = float(os.environ.get("NNZ_SWITCH", 0.1))
+print("NNZ_SWITCH = %s" % EXTRACT_NNZ_SWITCH)
@arjoly Owner
arjoly added a note

Note to myself: change this before merging.

benchmarks/bench_20newsgroups.py
((2 lines not shown))
+from time import time
+import argparse
+import numpy as np
+
+from sklearn.dummy import DummyClassifier
+
+from sklearn.datasets import fetch_20newsgroups_vectorized
+from sklearn.metrics import accuracy_score
+from sklearn.utils.validation import check_array
+
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.ensemble import ExtraTreesClassifier
+
+
+ESTIMATORS = {
+ "dummy": DummyClassifier(),
@ogrisel Owner
ogrisel added a note

I would add a linear model such as LogisticRegression and MultinomialNB as common baselines for text classification here.
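A small illustrative sketch of that suggestion, applied to the benchmark's ESTIMATORS dict (the keys and default parameters here are assumptions, not the PR's final code):

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

ESTIMATORS.update({
    "logistic_regression": LogisticRegression(),  # linear baseline
    "multinomial_nb": MultinomialNB(),            # classic text-classification baseline
})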

benchmarks/bench_20newsgroups.py
((46 lines not shown))
+
+ # print("20 newsgroups")
+ # print("=============")
+ # print("X_train.shape = {0}".format(X_train.shape))
+ # print("X_train.format = {0}".format(X_train.format))
+ # print("X_train.dtype = {0}".format(X_train.dtype))
+ # print("X_train density = {0}"
+ # "".format(X_train.nnz / np.product(X_train.shape)))
+ # print("y_train {0}".format(y_train.shape))
+ # print("X_test {0}".format(X_test.shape))
+ # print("X_test.format = {0}".format(X_test.format))
+ # print("X_test.dtype = {0}".format(X_test.dtype))
+ # print("y_test {0}".format(y_test.shape))
+ # print()
+ # print("Classifier Training")
+ # print("===================")
@ogrisel Owner
ogrisel added a note

Please either remove the comment lines or keep them but uncommented.

sklearn/ensemble/forest.py
@@ -182,7 +188,9 @@ def fit(self, X, y, sample_weight=None):
# Convert data
# ensure_2d=False because there are actually unit test checking we fail
# for 1d. FIXME make this consistent in the future.
- X = check_array(X, dtype=DTYPE, ensure_2d=False)
+ X = check_array(X, dtype=DTYPE, ensure_2d=False, accept_sparse="csc")
+ if issparse(X):
+ X.sort_indices()
@ogrisel Owner
ogrisel added a note

Maybe you can add an inline comment to explain why this is required?
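For illustration only (my reading of why the call is needed, not text from the PR): a CSC matrix built by hand can carry unsorted row indices within a column, while the sparse splitter assumes sorted indices, e.g.:

import numpy as np
from scipy.sparse import csc_matrix

data = np.array([3.0, 1.0], dtype=np.float32)
indices = np.array([2, 0])               # row indices deliberately out of order
indptr = np.array([0, 2])
X = csc_matrix((data, indices, indptr), shape=(3, 1))

X.sort_indices()                         # sort row indices (and data) per column
print(X.indices)                         # [0 2]
print(X.data)                            # [1. 3.]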

sklearn/ensemble/forest.py
((9 lines not shown))
rnd = check_random_state(self.random_state)
y = rnd.uniform(size=X.shape[0])
super(RandomTreesEmbedding, self).fit(X, y,
sample_weight=sample_weight)
+ if issparse(X):
+ X = X.tocsr()
@ogrisel Owner
ogrisel added a note

Why is it necessary to do that here? Won't OneHotEncoder.fit_transform do that conversion internally if necessary?

@ogrisel Owner
ogrisel added a note

Actually it's the apply method that should do that internally, not OneHotEncoder.fit_transform.
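A minimal sketch of the layout trade-off under discussion (the helper name is made up for illustration; this is not the PR's code): CSC suits training because the splitter scans feature columns, while apply() visits samples row by row, so CSR is the better layout afterwards.

from scipy.sparse import issparse

def _fit_then_apply(forest, X, y):
    forest.fit(X, y)             # fit works on column-oriented (csc) data
    if issparse(X):
        X = X.tocsr()            # cheap per-row access for tree traversal
    return forest.apply(X)       # leaf index of each sample in each tree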

sklearn/ensemble/forest.py
@@ -1386,8 +1415,10 @@ def transform(self, X):
Parameters
----------
- X : array-like, shape=(n_samples, n_features)
- Input data to be transformed.
+ X : array-like or sparse matrix, shape=(n_samples, n_features)
+ Input data to be transformed. Use ``dtype=np.float32`` for maximum
+ efficiency. Sparse matrices are also supported, use sparse
+ ``csc_matrix`` for maximum efficiency.
@ogrisel Owner
ogrisel added a note

I am confused now: is CSC or CSR more efficient for the internal call to apply?

@ogrisel Owner
ogrisel added a note

I think it should be CSR here.

benchmarks/bench_20newsgroups.py
((10 lines not shown))
+from sklearn.utils.validation import check_array
+
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.ensemble import ExtraTreesClassifier
+
+
+ESTIMATORS = {
+ "dummy": DummyClassifier(),
+ "random_forest": RandomForestClassifier(n_estimators=100,
+ max_features="sqrt",
+ # min_samples_split=10
+ ),
+ "extra_trees": ExtraTreesClassifier(n_estimators=100,
+ max_features="sqrt",
+ # min_samples_split=10
+ ),
@ogrisel Owner
ogrisel added a note

Could you also add an AdaBoostClassifier(n_estimators=10) here?
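Again as an illustrative sketch (the dict key is an assumption; the base estimator is left at its default decision stump):

from sklearn.ensemble import AdaBoostClassifier

ESTIMATORS["adaboost"] = AdaBoostClassifier(n_estimators=10)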

@arjoly
Owner

@ogrisel All your comments have been taken into account.

@arjoly
Owner

rebased on top of master

@ogrisel
Owner

Thanks, it looks good to me. Would be great to get another round of reviews. Maybe @amueller @glouppe @pprett @ndawe?

@jnothman
Owner

I was fairly happy with it a while ago. My plate is a bit full this week, but perhaps next week I'll give this another look-in.

@arjoly
Owner

Thanks @ogrisel !

@arjoly
Owner

And also thanks Joel, I would really appreciate your review if you have some time.

@jnothman jnothman commented on the diff
sklearn/tree/_tree.pyx
((133 lines not shown))
+ negative values are in self.feature_values[start:end_negative[0]]
+ and positive values are in
+ self.feature_values[start_positive[0]:end].
+
+ is_samples_sorted : bint*,
+ If is_samples_sorted, then self.sorted_samples[start:end] will be
+ the sorted version of self.samples[start:end].
+
+ """
+ cdef SIZE_t indptr_start = self.X_indptr[feature],
+ cdef SIZE_t indptr_end = self.X_indptr[feature + 1]
+ cdef SIZE_t n_indices = <SIZE_t>(indptr_end - indptr_start)
+ cdef SIZE_t n_samples = self.end - self.start
+
+ # Use binary search if n_samples * log(n_indices) <
+ # n_indices and index_to_samples approach otherwise.
@jnothman Owner

n_indices * EXTRACT_NNZ_SWITCH would be more precise... but it's repeated in the code, so I'm not sure this comment needs to be so verbose in any case.
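A simplified Python rendering of that switching rule (the real implementation is Cython in _tree.pyx; this sketch ignores the extra cost of sorting the samples when they are not sorted yet):

from math import log

EXTRACT_NNZ_SWITCH = 0.1   # the constant benchmarked earlier in this thread

def use_binary_search(n_samples, n_indices):
    # binary search over the column's nonzeros costs ~ n_samples * log(n_indices);
    # the index_to_samples scan costs ~ n_indices
    return n_samples * log(n_indices) < EXTRACT_NNZ_SWITCH * n_indices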

@jnothman
Owner

You've effectively got 3 +1s on an earlier version of this code (from me, @amueller and @glouppe), so asking for another review is likely to lead to over-optimisation.

On which basis:

  • Should {Dense,Sparse}Splitter be called Base{Dense,Sparse}Splitter?
  • Does extract_nnz take up a substantial portion of runtime? (If not, why do you require two implementations of it?) Given how frequently sparse datasets are nonnegative, should we have a way to short-circuit if we know this is the case (i.e. keep a record that the entire dataset or each feature is nonnegative).
  • Since I last reviewed this you changed the min-weight handling so that the criterion update happened before the min-weight check. Is this something (i.e. its former failure) that is currently tested?

I've not given this a thorough pass now, but gather that mostly cosmetic changes have been applied since the last review. I think this PR is of a maturity that it needs no further review to be merged.

@arjoly
Owner

Should {Dense,Sparse}Splitter be called Base{Dense,Sparse}Splitter?

Sound good.

Does extract_nnz take up a substantial portion of runtime? (If not, why do you require two implementations of it?)

The second implementation is there to cope with a degenerate column that wouldn't be sparse.

Given how frequently sparse datasets are nonnegative, should we have a way to short-circuit if we know this is the case (i.e. keep a record that the entire dataset or each feature is nonnegative).

Do you think there would be a high gain in doing that?

Since I last reviewed this you changed the min-weight handling so that the criterion update happened before the min-weight check. Is this something (i.e. its former failure) that is currently tested?

There are tests for this, though those could be better.

@jnothman
Owner

Do you think there would be a high gain in doing that?

That's why I asked "Does extract_nnz take up a substantial portion of runtime?" Obviously testing the non-negativity of a sparse matrix takes O(nnz) too.

But in short, I think you shouldn't be looking for more reviews. The code can be tightened once it's merged.
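For illustration (not from the PR), the non-negativity check only needs to inspect the stored values, hence O(nnz):

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> X = csr_matrix(np.array([[0.0, 2.0], [3.0, 0.0]]))
>>> bool((X.data >= 0).all())   # only the nnz stored entries are inspected
True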

@arjoly
Owner

"Does extract_nnz take up a substantial portion of runtime?"

Formally, I haven't tested it. It should take most of the time, especially near the bottom of the tree.

@jnothman
Owner

Are we okay to merge?

@ogrisel
Owner

Yes, let's merge. This is already useful as it is, and the public API should not change, so we can always decide later whether or not to special-case the handling of non-negative sparse data as an implementation detail.

@ogrisel ogrisel merged commit 8621f00 into scikit-learn:master

1 check passed

continuous-integration/travis-ci: The Travis CI build passed
@ogrisel
Owner

Thanks @arjoly! I think many users were waiting for this!

@ogrisel
Owner

I will add an entry to what's new.

@glouppe
Owner

Great job everyone :)

@ogrisel
Owner

:beers:

@GaelVaroquaux
@pprett
Owner

congrats @arjoly and everybody involved -- didn't expect sparse matrix support for the decision trees -- great job!

@arjoly
Owner

Finally, this is in!!! :-D Thanks to all the people who have contributed to this feature!

@arjoly arjoly deleted the arjoly:sparse-tree branch
@dan-blanchard dan-blanchard referenced this pull request in EducationalTestingService/skll
Closed

Remove decision tree and random forest learners from requires_dense set #207

@dan-blanchard dan-blanchard commented on the diff
sklearn/ensemble/forest.py
((20 lines not shown))
from ..base import ClassifierMixin, RegressorMixin
from ..externals.joblib import Parallel, delayed
from ..externals import six
-from ..externals.six.moves import xrange

I was just looking through this and noticed that this file is missing

from ..externals.six.moves import xrange as range

This is also true of tree.py, so maybe this was intentional. It just seems strange from an outsider's perspective to make 2.7 memory usage worse.

@ogrisel Owner
ogrisel added a note

Indeed, but as far as I can see this is only used for iterations over n_outputs_, which should be much smaller than n_samples in typical cases, so users won't notice it in practice.

@amueller
Owner

Wohoo congrats @arjoly! This is awesome!

@fareshedayati

Great work @arjoly thank you for all your hard work!

@jnothman
Owner

congrats, Arnaud and Fares!

@pprett

@arjoly aren't we missing a self.min_weight_leaf here?

Owner

Indeed. I will make a patch.

Owner

Actually, I don't think it's possible to make it raise an error, since the splitter is never saved with the tree estimator, except in GBRT.
Though I am going to fix it, as it is still a bug.

Owner

FIX in c95703a

@pprett

@arjoly aren't we missing a self.min_weight_leaf here?

Owner

Fix in c95703a

Commits on Nov 6, 2014
  1. @arjoly

    ENH Bring sparse input support to tree-based methods

    arjoly authored
    Author:     Arnaud Joly <arnaud.v.joly@gmail.com>
                Fares Hedayati <fares.hedayati@gmail.com>
  2. @arjoly
  3. @arjoly
  4. @arjoly
  5. @arjoly

    ENH while -> for loop

    arjoly authored
  6. @arjoly

    ENH reduce number of parameters

    arjoly authored
  7. @arjoly
  8. @arjoly
  9. @arjoly

    ENH remove spurious code

    arjoly authored
  10. @arjoly

    cosmit

    arjoly authored
  11. @arjoly
  12. @arjoly

    COSMIT simplify function call

    arjoly authored
  13. @arjoly

    ENH expand ternary operator

    arjoly authored
  14. @arjoly

    Revert previous version

    arjoly authored
  15. @arjoly

    ENH move utils near its use

    arjoly authored
  16. @arjoly
  17. @arjoly
  18. @arjoly

    Lower number of trees

    arjoly authored
  19. @arjoly

    wip benchmark

    arjoly authored
  20. @arjoly
  21. @arjoly
  22. @arjoly