
Commit

ENH Bring sparse input support to tree-based methods
Author:     Arnaud Joly <arnaud.v.joly@gmail.com>
            Fares Hedayati <fares.hedayati@gmail.com>
arjoly committed Nov 6, 2014
1 parent e01f225 commit c03c01a
Showing 11 changed files with 18,442 additions and 11,538 deletions.
2 changes: 1 addition & 1 deletion doc/modules/ensemble.rst
@@ -108,7 +108,7 @@ construction. The prediction of the ensemble is given as the averaged
prediction of the individual classifiers.

As other classifiers, forest classifiers have to be fitted with two
arrays: an array X of size ``[n_samples, n_features]`` holding the
arrays: a sparse or dense array X of size ``[n_samples, n_features]`` holding the
training samples, and an array Y of size ``[n_samples]`` holding the
target values (class labels) for the training samples::

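A minimal sketch of the new sparse support for forests (illustrative only, not taken from the patch; assumes SciPy and a scikit-learn version with this commit's sparse support):

```python
# Fit a forest on a sparse CSC matrix instead of a dense array.
from scipy.sparse import csc_matrix
from sklearn.ensemble import RandomForestClassifier

# Toy data mirroring the dense docs example, stored sparsely.
X = csc_matrix([[0.0, 0.0], [1.0, 1.0]])
y = [0, 1]

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)          # fit accepts sparse input
pred = clf.predict(X)  # predict accepts sparse input as well
```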
18 changes: 12 additions & 6 deletions doc/modules/tree.rst
@@ -90,10 +90,10 @@ Classification
:class:`DecisionTreeClassifier` is a class capable of performing multi-class
classification on a dataset.

As other classifiers, :class:`DecisionTreeClassifier` take as input two
arrays: an array X of size ``[n_samples, n_features]`` holding the training
samples, and an array Y of integer values, size ``[n_samples]``, holding
the class labels for the training samples::
As with other classifiers, :class:`DecisionTreeClassifier` takes as input two arrays:
an array X, sparse or dense, of size ``[n_samples, n_features]`` holding the
training samples, and an array Y of integer values, size ``[n_samples]``,
holding the class labels for the training samples::

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
@@ -157,7 +157,7 @@ a PDF file (or any other supported file type) directly in Python::

After being fitted, the model can then be used to predict new values::

>>> clf.predict(iris.data[0, :])
>>> clf.predict(iris.data[:1, :])
array([0])

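The corrected call above passes a 2D slice. A short sketch of why the ``[:1, :]`` slice matters (illustrative, assuming the standard iris loader):

```python
# predict expects a 2D array of shape (n_samples, n_features), so a single
# sample is selected with iris.data[:1, :] (shape (1, 4)) rather than
# iris.data[0, :] (shape (4,)).
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)

one_sample = iris.data[:1, :]  # still 2D: one row, four features
pred = clf.predict(one_sample)
```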
.. figure:: ../auto_examples/tree/images/plot_iris_001.png
@@ -195,7 +195,6 @@ instead of integer values::
>>> clf.predict([[1, 1]])
array([ 0.5])


.. topic:: Examples:

* :ref:`example_tree_plot_tree_regression.py`
@@ -337,6 +336,13 @@ Tips on practical use
* All decision trees use ``np.float32`` arrays internally.
If training data is not in this format, a copy of the dataset will be made.

* If the input matrix X is very sparse, it is recommended to convert it to a
  sparse ``csc_matrix`` before calling ``fit`` and to a sparse ``csr_matrix``
  before calling ``predict``. Training time can be orders of magnitude faster
  for sparse matrix input compared to a dense matrix when features have zero
  values in most of the samples.

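The tip above can be sketched as follows (illustrative data and names, not from the patch; assumes SciPy is available):

```python
# Convert very sparse input to CSC for fit and CSR for predict.
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_dense = (rng.random_sample((200, 50)) < 0.05).astype(np.float64)  # ~95% zeros
y = rng.randint(0, 2, size=200)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(csc_matrix(X_dense), y)          # CSC: fast column (feature) access during training
pred = clf.predict(csr_matrix(X_dense))  # CSR: fast row (sample) access during prediction
```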


.. _tree_algorithms:

413 changes: 207 additions & 206 deletions sklearn/ensemble/_gradient_boosting.c

Large diffs are not rendered by default.
