Merge pull request #478 from glouppe/tree

MRG: Feature importances with trees
commit 886a00440996afe753d7ae5baa4c01435a58f29e (2 parents: bdfc019 + 244dcab)
@GaelVaroquaux authored
2  doc/modules/ensemble.rst
@@ -31,7 +31,7 @@ Two families of ensemble methods are usually distinguished:
Forests of randomized trees
===========================
-The ``sklearn.ensemble`` module includes two averaging algorithms based on
+The :mod:`sklearn.ensemble` module includes two averaging algorithms based on
randomized :ref:`decision trees <tree>`: the RandomForest algorithm and the
Extra-Trees method. Both algorithms are perturb-and-combine techniques
specifically designed for trees::
27 doc/modules/feature_selection.rst
@@ -7,7 +7,7 @@ Feature selection
.. currentmodule:: sklearn.feature_selection
-The classes in the ``sklearn.feature_selection`` module can be used
+The classes in the :mod:`sklearn.feature_selection` module can be used
for feature selection/dimensionality reduction on sample sets, either to
improve estimators' accuracy scores or to boost their performance on very
high-dimensional datasets.
@@ -19,7 +19,7 @@ Univariate feature selection
Univariate feature selection works by selecting the best features based on
univariate statistical tests. It can be seen as a preprocessing step
-to an estimator. The `scikit.learn` exposes feature selection routines
+to an estimator. Scikit-Learn exposes feature selection routines
as objects that implement the `transform` method. The k-best features
can be selected based on:
@@ -91,6 +91,7 @@ select is eventually reached.
elimination example with automatic tuning of the number of features
selected with cross-validation.
+
L1-based feature selection
==========================
@@ -100,9 +101,9 @@ Linear models penalized with the L1 norm have sparse solutions. When the goal
is to reduce the dimensionality of the data to use with another classifier, the
`transform` method of `LogisticRegression` and `LinearSVC` can be used::
- >>> from sklearn import datasets
>>> from sklearn.svm import LinearSVC
- >>> iris = datasets.load_iris()
+ >>> from sklearn.datasets import load_iris
+ >>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
@@ -117,3 +118,21 @@ The parameter C controls the sparsity: the smaller the fewer features.
* :ref:`example_document_classification_20newsgroups.py`: Comparison
of different algorithms for document classification including L1-based
feature selection.
+
+
+Tree-based feature selection
+============================
+
+Tree-based estimators (see the :mod:`sklearn.tree` module and forests of trees in
+the :mod:`sklearn.ensemble` module) can be used to compute feature importances,
+which in turn can be used to discard irrelevant features::
+
+ >>> from sklearn.ensemble import ExtraTreesClassifier
+ >>> from sklearn.datasets import load_iris
+ >>> iris = load_iris()
+ >>> X, y = iris.data, iris.target
+ >>> X.shape
+ (150, 4)
+ >>> X_new = ExtraTreesClassifier(compute_importances=True).fit(X, y).transform(X)
+ >>> X_new.shape
+ (150, 2)
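
The ``transform`` method used above also accepts a ``threshold`` argument (see the ``SelectorMixin`` added to ``sklearn/feature_selection/__init__.py`` later in this diff). As a rough sketch, assuming the string forms ``"mean"``, ``"median"`` and scaled variants such as ``"1.25*mean"`` behave as described there, the iris example could spell out both the flag and the threshold explicitly:

    >>> from sklearn.datasets import load_iris
    >>> from sklearn.ensemble import ExtraTreesClassifier
    >>> iris = load_iris()
    >>> clf = ExtraTreesClassifier(compute_importances=True, random_state=0)
    >>> clf = clf.fit(iris.data, iris.target)
    >>> X_reduced = clf.transform(iris.data, threshold="1.25*mean")
    >>> X_reduced.shape[1] < iris.data.shape[1]
    True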
5 doc/whats_new.rst
@@ -52,7 +52,7 @@ Changelog
approximation for fast SGD on non-linear kernels by
`Andreas Müller`_.
- - Fix a bug due to atom swapping in :ref:`OMP` by `Vlad Niculae`_.
+ - Fixed a bug due to atom swapping in :ref:`OMP` by `Vlad Niculae`_.
- :ref:`SparseCoder` by `Vlad Niculae`_.
@@ -76,6 +76,9 @@ Changelog
:class:`sklearn.preprocessing.Scaler` work on sparse matrices by
`Olivier Grisel`_
+ - Feature importances using decision trees and/or forest of trees,
+ by `Gilles Louppe`_.
+
API changes summary
-------------------
55 examples/ensemble/plot_forest_importances.py
@@ -0,0 +1,55 @@
+"""
+=========================================
+Feature importances with forests of trees
+=========================================
+
+This example shows the use of forests of trees to evaluate the importance of
+features on an artificial classification task. The red plots are the feature
+importances of each individual tree, and the blue plot is the feature importance
+of the whole forest.
+
+As expected, the knee in the blue plot suggests that 3 features are informative,
+while the remaining ones are not.
+"""
+print __doc__
+
+import numpy as np
+
+from sklearn.datasets import make_classification
+from sklearn.ensemble import ExtraTreesClassifier
+
+# Build a classification task using 3 informative features
+X, y = make_classification(n_samples=1000,
+ n_features=10,
+ n_informative=3,
+ n_redundant=0,
+ n_repeated=0,
+ n_classes=2,
+ random_state=0,
+ shuffle=False)
+
+# Build a forest and compute the feature importances
+forest = ExtraTreesClassifier(n_estimators=250,
+ compute_importances=True,
+ random_state=0)
+
+forest.fit(X, y)
+importances = forest.feature_importances_
+indices = np.argsort(importances)[::-1]
+
+# Print the feature ranking
+print "Feature ranking:"
+
+for f in xrange(10):
+ print "%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]])
+
+# Plot the feature importances of the trees and of the forest
+import pylab as pl
+pl.figure()
+pl.title("Feature importances")
+
+for tree in forest.estimators_:
+ pl.plot(xrange(10), tree.feature_importances_[indices], "r")
+
+pl.plot(xrange(10), importances[indices], "b")
+pl.show()
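
A possible variant of the plotting step above, summarizing the spread of importances across trees with a bar chart and error bars instead of one red curve per tree; the standard deviation over ``forest.estimators_`` is an assumption of this sketch and is not computed in the example itself:

    # Continuation of the example above: bar chart with inter-tree variability
    std = np.std([tree.feature_importances_ for tree in forest.estimators_],
                 axis=0)

    pl.figure()
    pl.title("Feature importances (mean over trees, +/- one std)")
    pl.bar(range(10), importances[indices], color="b",
           yerr=std[indices], align="center")
    pl.xticks(range(10), indices)
    pl.xlim([-1, 10])
    pl.show()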
40 examples/ensemble/plot_forest_importances_faces.py
@@ -0,0 +1,40 @@
+"""
+=======================================
+Pixel importances with forests of trees
+=======================================
+
+This example shows the use of forests of trees to evaluate the importance
+of the pixels in an image classification task (faces). The hotter the pixel,
+the more important it is.
+"""
+print __doc__
+
+import pylab as pl
+
+from sklearn.datasets import fetch_olivetti_faces
+from sklearn.ensemble import ExtraTreesClassifier
+
+# Load the Olivetti faces dataset
+data = fetch_olivetti_faces()
+X = data.images.reshape((len(data.images), -1))
+y = data.target
+
+mask = y < 5 # Limit to 5 classes
+X = X[mask]
+y = y[mask]
+
+# Build a forest and compute the pixel importances
+forest = ExtraTreesClassifier(n_estimators=1000,
+ max_features=128,
+ compute_importances=True,
+ random_state=0)
+
+forest.fit(X, y)
+importances = forest.feature_importances_
+importances = importances.reshape(data.images[0].shape)
+
+# Plot pixel importances
+pl.matshow(importances, cmap=pl.cm.hot)
+pl.colorbar()
+pl.title("Pixel importances with forests of trees")
+pl.show()
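
Because ``BaseForest`` gains the ``SelectorMixin`` ``transform`` method in this pull request (see ``sklearn/ensemble/forest.py`` below), the fitted forest above could also be used directly to discard low-importance pixels. A minimal sketch, with the ``"mean"`` threshold string taken from the ``SelectorMixin`` docstring; how many pixels survive depends on the data:

    # Keep only the pixels whose importance is at least the mean importance
    X_important = forest.transform(X, threshold="mean")
    print "Kept %d of %d pixels" % (X_important.shape[1], X.shape[1])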
106 sklearn/ensemble/forest.py
@@ -4,7 +4,7 @@
The module structure is the following:
-- The ``Forest`` base class implements a common ``fit`` method for all
+- The ``BaseForest`` base class implements a common ``fit`` method for all
the estimators in the module. The ``fit`` method of the base ``Forest``
class calls the ``fit`` method of each sub-estimator on random samples
(with replacement, a.k.a. bootstrap) of the training set.
@@ -36,6 +36,7 @@ class calls the ``fit`` method of each sub-estimator on random samples
import numpy as np
from ..base import ClassifierMixin, RegressorMixin
+from ..feature_selection import SelectorMixin
from ..tree import DecisionTreeClassifier, DecisionTreeRegressor, \
ExtraTreeClassifier, ExtraTreeRegressor
from ..utils import check_random_state
@@ -48,7 +49,7 @@ class calls the ``fit`` method of each sub-estimator on random samples
"ExtraTreesRegressor"]
-class Forest(BaseEnsemble):
+class BaseForest(BaseEnsemble, SelectorMixin):
"""Base class for forests of trees.
Warning: This class should not be used directly. Use derived classes
@@ -58,15 +59,19 @@ def __init__(self, base_estimator,
n_estimators=10,
estimator_params=[],
bootstrap=False,
+ compute_importances=False,
random_state=None):
- super(Forest, self).__init__(
+ super(BaseForest, self).__init__(
base_estimator=base_estimator,
n_estimators=n_estimators,
estimator_params=estimator_params)
self.bootstrap = bootstrap
+ self.compute_importances = compute_importances
self.random_state = check_random_state(random_state)
+ self.feature_importances_ = None
+
def fit(self, X, y):
"""Build a forest of trees from the training set (X, y).
@@ -88,6 +93,10 @@ def fit(self, X, y):
X = np.atleast_2d(X)
y = np.atleast_1d(y)
+ sample_mask = np.ones((X.shape[0],), dtype=np.bool)
+ X_argsorted = np.asfortranarray(
+ np.argsort(X.T, axis=1).astype(np.int32).T)
+
if isinstance(self.base_estimator, ClassifierMixin):
self.classes_ = np.unique(y)
self.n_classes_ = len(self.classes_)
@@ -95,19 +104,32 @@ def fit(self, X, y):
for i in xrange(self.n_estimators):
tree = self._make_estimator()
+ tree.set_params(compute_importances=self.compute_importances)
if self.bootstrap:
n_samples = X.shape[0]
indices = self.random_state.randint(0, n_samples, n_samples)
- tree.fit(X[indices], y[indices])
+ tree.fit(X[indices], y[indices],
+ sample_mask=None, X_argsorted=None)
else:
- tree.fit(X, y)
+ tree.fit(X, y,
+ sample_mask=sample_mask, X_argsorted=X_argsorted)
+
+ # Build the importances
+ if self.compute_importances:
+ importances = np.zeros(self.estimators_[0].n_features_)
+
+ for tree in self.estimators_:
+ importances += tree.feature_importances_
+
+ importances /= self.n_estimators
+ self.feature_importances_ = importances
return self
-class ForestClassifier(Forest, ClassifierMixin):
+class ForestClassifier(BaseForest, ClassifierMixin):
"""Base class for forest of trees-based classifiers.
Warning: This class should not be used directly. Use derived classes
@@ -117,12 +139,14 @@ def __init__(self, base_estimator,
n_estimators=10,
estimator_params=[],
bootstrap=False,
+ compute_importances=False,
random_state=None):
super(ForestClassifier, self).__init__(
base_estimator,
n_estimators=n_estimators,
estimator_params=estimator_params,
bootstrap=bootstrap,
+ compute_importances=compute_importances,
random_state=random_state)
def predict(self, X):
@@ -198,7 +222,7 @@ def predict_log_proba(self, X):
return np.log(self.predict_proba(X))
-class ForestRegressor(Forest, RegressorMixin):
+class ForestRegressor(BaseForest, RegressorMixin):
"""Base class for forest of trees-based regressors.
Warning: This class should not be used directly. Use derived classes
@@ -208,12 +232,14 @@ def __init__(self, base_estimator,
n_estimators=10,
estimator_params=[],
bootstrap=False,
+ compute_importances=False,
random_state=None):
super(ForestRegressor, self).__init__(
base_estimator,
n_estimators=n_estimators,
estimator_params=estimator_params,
bootstrap=bootstrap,
+ compute_importances=compute_importances,
random_state=random_state)
def predict(self, X):
@@ -283,14 +309,25 @@ class RandomForestClassifier(ForestClassifier):
bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.
+ compute_importances : boolean, optional (default=False)
+ Whether feature importances are computed and stored into the
+ ``feature_importances_`` attribute when calling fit.
+
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
+ Attributes
+ ----------
+ feature_importances_ : array of shape = [n_features]
+ The feature importances (the higher, the more important the feature).
+
Notes
-----
+ **References**:
+
.. [1] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.
See also
@@ -304,6 +341,7 @@ def __init__(self, n_estimators=10,
min_density=0.1,
max_features=None,
bootstrap=True,
+ compute_importances=False,
random_state=None):
super(RandomForestClassifier, self).__init__(
base_estimator=DecisionTreeClassifier(),
@@ -311,6 +349,7 @@ def __init__(self, n_estimators=10,
estimator_params=("criterion", "max_depth", "min_split",
"min_density", "max_features", "random_state"),
bootstrap=bootstrap,
+ compute_importances=compute_importances,
random_state=random_state)
self.criterion = criterion
@@ -360,14 +399,25 @@ class RandomForestRegressor(ForestRegressor):
bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.
+ compute_importances : boolean, optional (default=False)
+ Whether feature importances are computed and stored into the
+ ``feature_importances_`` attribute when calling fit.
+
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
+ Attributes
+ ----------
+ feature_importances_ : array of shape = [n_features]
+ The feature importances (the higher, the more important the feature).
+
Notes
-----
+ **References**:
+
.. [1] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.
See also
@@ -381,6 +431,7 @@ def __init__(self, n_estimators=10,
min_density=0.1,
max_features=None,
bootstrap=True,
+ compute_importances=False,
random_state=None):
super(RandomForestRegressor, self).__init__(
base_estimator=DecisionTreeRegressor(),
@@ -388,6 +439,7 @@ def __init__(self, n_estimators=10,
estimator_params=("criterion", "max_depth", "min_split",
"min_density", "max_features", "random_state"),
bootstrap=bootstrap,
+ compute_importances=compute_importances,
random_state=random_state)
self.criterion = criterion
@@ -435,23 +487,35 @@ class ExtraTreesClassifier(ForestClassifier):
If None, all features are considered, otherwise max_features are chosen
at random.
- bootstrap : boolean, optional (default=True)
+ bootstrap : boolean, optional (default=False)
Whether bootstrap samples are used when building trees.
+ compute_importances : boolean, optional (default=False)
+ Whether feature importances are computed and stored into the
+ ``feature_importances_`` attribute when calling fit.
+
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
+ Attributes
+ ----------
+ feature_importances_ : array of shape = [n_features]
+ The feature importances (the higher, the more important the feature).
+
Notes
-----
+ **References**:
+
.. [1] P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized trees",
+ Machine Learning, 63(1), 3-42, 2006.
See also
--------
ExtraTreesRegressor, RandomForestClassifier, RandomForestRegressor
- Machine Learning, 63(1), 3-42, 2006.
+
"""
def __init__(self, n_estimators=10,
criterion="gini",
@@ -459,7 +523,8 @@ def __init__(self, n_estimators=10,
min_split=1,
min_density=0.1,
max_features=None,
- bootstrap=True,
+ bootstrap=False,
+ compute_importances=False,
random_state=None):
super(ExtraTreesClassifier, self).__init__(
base_estimator=ExtraTreeClassifier(),
@@ -467,6 +532,7 @@ def __init__(self, n_estimators=10,
estimator_params=("criterion", "max_depth", "min_split",
"min_density", "max_features", "random_state"),
bootstrap=bootstrap,
+ compute_importances=compute_importances,
random_state=random_state)
self.criterion = criterion
@@ -514,23 +580,35 @@ class ExtraTreesRegressor(ForestRegressor):
If None, all features are considered, otherwise max_features are chosen
at random.
- bootstrap : boolean, optional (default=True)
+ bootstrap : boolean, optional (default=False)
Whether bootstrap samples are used when building trees.
+ compute_importances : boolean, optional (default=False)
+ Whether feature importances are computed and stored into the
+ ``feature_importances_`` attribute when calling fit.
+
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
+ Attributes
+ ----------
+ feature_importances_ : array of shape = [n_features]
+ The feature importances (the higher, the more important the feature).
+
Notes
-----
+ **References**:
+
.. [1] P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized trees",
+ Machine Learning, 63(1), 3-42, 2006.
See also
--------
ExtraTreesRegressor, RandomForestClassifier, RandomForestRegressor
- Machine Learning, 63(1), 3-42, 2006.
+
"""
def __init__(self, n_estimators=10,
criterion="mse",
@@ -538,7 +616,8 @@ def __init__(self, n_estimators=10,
min_split=1,
min_density=0.1,
max_features=None,
- bootstrap=True,
+ bootstrap=False,
+ compute_importances=False,
random_state=None):
super(ExtraTreesRegressor, self).__init__(
base_estimator=ExtraTreeRegressor(),
@@ -546,6 +625,7 @@ def __init__(self, n_estimators=10,
estimator_params=("criterion", "max_depth", "min_split",
"min_density", "max_features", "random_state"),
bootstrap=bootstrap,
+ compute_importances=compute_importances,
random_state=random_state)
self.criterion = criterion
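
A minimal usage sketch of the new ``compute_importances`` flag and of the inherited ``transform`` method, essentially what ``test_importances`` in the next file exercises; the exact importance values are data- and seed-dependent and are not asserted here:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic task with 3 informative features out of 10
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                               n_redundant=0, n_repeated=0, shuffle=False,
                               random_state=0)

    forest = RandomForestClassifier(n_estimators=10, compute_importances=True,
                                    random_state=0)
    forest.fit(X, y)

    # Importances averaged over the 10 trees, one value per feature
    print forest.feature_importances_

    # SelectorMixin keeps the features whose importance is at least the mean
    X_reduced = forest.transform(X, threshold="mean")
    print X_reduced.shape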
21 sklearn/ensemble/tests/test_forest.py
@@ -149,6 +149,27 @@ def test_probability():
np.exp(clf.predict_log_proba(iris.data)))
+def test_importances():
+ """Check variable importances."""
+ X, y = datasets.make_classification(n_samples=1000,
+ n_features=10,
+ n_informative=3,
+ n_redundant=0,
+ n_repeated=0,
+ shuffle=False,
+ random_state=0)
+
+ clf = RandomForestClassifier(n_estimators=10, compute_importances=True)
+ clf.fit(X, y)
+ importances = clf.feature_importances_
+ n_important = sum(importances > 0.1)
+
+ assert_equal(importances.shape[0], 10)
+ assert_equal(n_important, 3)
+
+ X_new = clf.transform(X, threshold="mean")
+ assert 0 < X_new.shape[1] < X.shape[1]
+
def test_gridsearch():
"""Check that base trees can be grid-searched."""
# Random forest
86 sklearn/feature_selection/__init__.py
@@ -16,3 +16,89 @@
from .rfe import RFE
from .rfe import RFECV
+
+import numpy as np
+
+from ..base import TransformerMixin
+
+
+class SelectorMixin(TransformerMixin):
+ """Transformer mixin selecting features based on importance weights.
+
+ This mixin can be used with any estimator that exposes a
+ ``feature_importances_`` or ``coef_`` attribute to evaluate the relative
+ importance of individual features for feature selection.
+ """
+ def transform(self, X, threshold=None):
+ """Reduce X to its most important features.
+
+ Parameters
+ ----------
+ X : array of shape [n_samples, n_features]
+ The input samples.
+
+ threshold : string, float or None, optional (default=None)
+ The threshold value to use for feature selection. Features whose
+ importance is greater or equal are kept while the others are
+ discarded. If "median" (resp. "mean"), then the threshold value is
+ the median (resp. the mean) of the feature importances. A scaling
+ factor (e.g., "1.25*mean") may also be used. If None and if
+ available, the object attribute ``threshold`` is used. Otherwise,
+ "mean" is used by default.
+
+ Returns
+ -------
+ X_r : array of shape [n_samples, n_selected_features]
+ The input samples with only the selected features.
+ """
+ # Retrieve importance vector
+ if hasattr(self, "feature_importances_"):
+ importances = self.feature_importances_
+
+ elif hasattr(self, "coef_"):
+ if self.coef_.ndim == 1:
+ importances = np.abs(self.coef_)
+
+ else:
+ importances = np.sum(np.abs(self.coef_), axis=0)
+
+ else:
+ raise ValueError("Missing `feature_importances_` or `coef_`"
+ " attribute.")
+
+ # Retrieve threshold
+ if threshold is None:
+ threshold = getattr(self, "threshold", "mean")
+
+ if isinstance(threshold, basestring):
+ if "*" in threshold:
+ scale, reference = threshold.split("*")
+ scale = float(scale.strip())
+ reference = reference.strip()
+
+ if reference == "median":
+ reference = np.median(importances)
+ elif reference == "mean":
+ reference = np.mean(importances)
+ else:
+ raise ValueError("Unknown reference: " + reference)
+
+ threshold = scale * reference
+
+ elif threshold == "median":
+ threshold = np.median(importances)
+
+ elif threshold == "mean":
+ threshold = np.mean(importances)
+
+ else:
+ threshold = float(threshold)
+
+ # Selection
+ mask = importances >= threshold
+
+ if np.any(mask):
+ return X[:, mask]
+
+ else:
+ raise ValueError("Invalid threshold: all features are discarded.")
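
To make the threshold resolution above concrete, a small illustration on a toy importance vector; the values are made up and the loop simply re-applies the same rules that ``transform`` implements:

    import numpy as np

    # Hypothetical importances for five features
    importances = np.array([0.05, 0.30, 0.02, 0.40, 0.23])

    # Resolve the three supported string forms the way transform() does
    thresholds = {
        "mean": np.mean(importances),              # 0.20
        "median": np.median(importances),          # 0.23
        "1.25*mean": 1.25 * np.mean(importances),  # 0.25
    }

    for name, value in thresholds.items():
        kept = np.flatnonzero(importances >= value)
        print "%-10s -> threshold %.2f, keeps features %s" % (name, value, kept)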
22 sklearn/tree/tests/test_tree.py
@@ -202,6 +202,28 @@ def test_numerical_stability():
np.seterr(**old_settings)
+def test_importances():
+ """Check variable importances."""
+ X, y = datasets.make_classification(n_samples=1000,
+ n_features=10,
+ n_informative=3,
+ n_redundant=0,
+ n_repeated=0,
+ shuffle=False,
+ random_state=0)
+
+ clf = tree.DecisionTreeClassifier(compute_importances=True)
+ clf.fit(X, y)
+ importances = clf.feature_importances_
+ n_important = sum(importances > 0.1)
+
+ assert_equal(importances.shape[0], 10)
+ assert_equal(n_important, 3)
+
+ X_new = clf.transform(X, threshold="mean")
+ assert 0 < X_new.shape[1] < X.shape[1]
+
+
def test_error():
"""Test that it gives proper exception on deficient input."""
# Invalid values for parameters
100 sklearn/tree/tree.py
@@ -14,6 +14,7 @@
import numpy as np
from ..base import BaseEstimator, ClassifierMixin, RegressorMixin
+from ..feature_selection import SelectorMixin
from ..utils import array2d, check_random_state
from . import _tree
@@ -148,9 +149,9 @@ class Tree(object):
Number of nodes (internal nodes + leaves) in the tree.
children : np.ndarray, shape=(node_count, 2), dtype=int32
- `children[i,0]` holds the node id of the left child of node `i`.
- `children[i,1]` holds the node id of the right child of node `i`.
- For leaves `children[i,0] == children[i, 1] == Tree.LEAF == -1`.
+ `children[i, 0]` holds the node id of the left child of node `i`.
+ `children[i, 1]` holds the node id of the right child of node `i`.
+ For leaves `children[i, 0] == children[i, 1] == Tree.LEAF == -1`.
feature : np.ndarray of int32
The feature to split on (only for internal nodes).
@@ -356,12 +357,12 @@ def recursive_partition(X, X_argsorted, y, sample_mask, depth,
np.argsort(X.T, axis=1).astype(np.int32).T)
recursive_partition(X, X_argsorted, y, sample_mask, 0, -1, False)
-
tree.resize(tree.node_count)
+
return tree
-class BaseDecisionTree(BaseEstimator):
+class BaseDecisionTree(BaseEstimator, SelectorMixin):
"""Base class for decision trees.
Warning: This class should not be used directly.
@@ -372,12 +373,14 @@ def __init__(self, criterion,
min_split,
min_density,
max_features,
+ compute_importances,
random_state):
self.criterion = criterion
self.max_depth = max_depth
self.min_split = min_split
self.min_density = min_density
self.max_features = max_features
+ self.compute_importances = compute_importances
self.random_state = check_random_state(random_state)
self.n_features_ = None
@@ -386,8 +389,9 @@ def __init__(self, criterion,
self.find_split_ = _tree._find_best_split
self.tree_ = None
+ self.feature_importances_ = None
- def fit(self, X, y):
+ def fit(self, X, y, sample_mask=None, X_argsorted=None):
"""Build a decision tree from the training set (X, y).
Parameters
@@ -445,7 +449,27 @@ def fit(self, X, y):
max_depth, self.min_split,
self.min_density, max_features,
self.random_state, self.n_classes_,
- self.find_split_)
+ self.find_split_, sample_mask=sample_mask,
+ X_argsorted=X_argsorted)
+
+ # Compute feature importances
+ if self.compute_importances:
+ importances = np.zeros(self.n_features_)
+
+ for node in xrange(self.tree_.node_count):
+ if (self.tree_.children[node, 0]
+ == self.tree_.children[node, 1]
+ == Tree.LEAF):
+ continue
+
+ else:
+ importances[self.tree_.feature[node]] += \
+ self.tree_.n_samples[node] * \
+ (self.tree_.init_error[node] -
+ self.tree_.best_error[node])
+
+ importances /= np.sum(importances)
+ self.feature_importances_ = importances
return self
@@ -517,12 +541,31 @@ class DecisionTreeClassifier(BaseDecisionTree, ClassifierMixin):
If None, all features are considered, otherwise max_features are chosen
at random.
+ compute_importances : boolean, optional (default=False)
+ Whether feature importances are computed and stored into the
+ ``feature_importances_`` attribute when calling fit.
+
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
+ Attributes
+ ----------
+ tree_ : Tree object
+ The underlying Tree object.
+
+ feature_importances_ : array of shape = [n_features]
+ The feature importances (the higher, the more important the feature).
+ The importance I(f) of a feature f is computed as the (normalized)
+ total reduction of error brought by that feature. It is also known as
+ the Gini importance [4].
+
+ .. math::
+
+ I(f) = \sum_{\text{nodes } A \text{ that split on } f} n\_samples(A) \cdot \Delta \mathrm{err}(A)
+
See also
--------
DecisionTreeRegressor
@@ -539,6 +582,13 @@ class DecisionTreeClassifier(BaseDecisionTree, ClassifierMixin):
.. [3] T. Hastie, R. Tibshirani and J. Friedman. "Elements of Statistical
Learning", Springer, 2009.
+ .. [4] L. Breiman, and A. Cutler, "Random Forests",
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
+
+ See also
+ --------
+ DecisionTreeRegressor
+
Examples
--------
>>> from sklearn.datasets import load_iris
@@ -559,12 +609,14 @@ def __init__(self, criterion="gini",
min_split=1,
min_density=0.1,
max_features=None,
+ compute_importances=False,
random_state=None):
super(DecisionTreeClassifier, self).__init__(criterion,
max_depth,
min_split,
min_density,
max_features,
+ compute_importances,
random_state)
def predict_proba(self, X):
@@ -644,15 +696,30 @@ class DecisionTreeRegressor(BaseDecisionTree, RegressorMixin):
If None, all features are considered, otherwise max_features are chosen
at random.
+ compute_importances : boolean, optional (default=False)
+ Whether feature importances are computed and stored into the
+ ``feature_importances_`` attribute when calling fit.
+
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
- See also
- --------
- DecisionTreeClassifier
+ Attributes
+ ----------
+ tree_ : Tree object
+ The underlying Tree object.
+
+ feature_importances_ : array of shape = [n_features]
+ The feature importances (the higher, the more important the feature).
+ The importance I(f) of a feature f is computed as the (normalized)
+ total reduction of error brought by that feature. It is also known as
+ the Gini importance [4].
+
+ .. math::
+
+ I(f) = \sum_{\text{nodes } A \text{ that split on } f} n\_samples(A) \cdot \Delta \mathrm{err}(A)
Notes
-----
@@ -666,6 +733,13 @@ class DecisionTreeRegressor(BaseDecisionTree, RegressorMixin):
.. [3] T. Hastie, R. Tibshirani and J. Friedman. "Elements of Statistical
Learning", Springer, 2009.
+ .. [4] L. Breiman, and A. Cutler, "Random Forests",
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
+
+ See also
+ --------
+ DecisionTreeClassifier
+
Examples
--------
>>> from sklearn.datasets import load_boston
@@ -688,12 +762,14 @@ def __init__(self, criterion="mse",
min_split=1,
min_density=0.1,
max_features=None,
+ compute_importances=False,
random_state=None):
super(DecisionTreeRegressor, self).__init__(criterion,
max_depth,
min_split,
min_density,
max_features,
+ compute_importances,
random_state)
@@ -725,12 +801,14 @@ def __init__(self, criterion="gini",
min_split=1,
min_density=0.1,
max_features=None,
+ compute_importances=False,
random_state=None):
super(ExtraTreeClassifier, self).__init__(criterion,
max_depth,
min_split,
min_density,
max_features,
+ compute_importances,
random_state)
self.find_split_ = _tree._find_best_random_split
@@ -764,12 +842,14 @@ def __init__(self, criterion="mse",
min_split=1,
min_density=0.1,
max_features=None,
+ compute_importances=False,
random_state=None):
super(ExtraTreeRegressor, self).__init__(criterion,
max_depth,
min_split,
min_density,
max_features,
+ compute_importances,
random_state)
self.find_split_ = _tree._find_best_random_split
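
As a cross-check of the importance computation added to ``fit`` above, a sketch that re-derives ``feature_importances_`` from the ``Tree`` arrays named in the docstring (``children``, ``feature``, ``n_samples``, ``init_error``, ``best_error``); the iris data and the seed are arbitrary choices for illustration:

    import numpy as np

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    clf = DecisionTreeClassifier(compute_importances=True, random_state=0)
    clf.fit(iris.data, iris.target)

    # Repeat the loop from BaseDecisionTree.fit on the fitted tree
    tree = clf.tree_
    importances = np.zeros(clf.n_features_)

    for node in xrange(tree.node_count):
        left, right = tree.children[node]
        if left == right:  # leaf: both children are Tree.LEAF == -1
            continue
        # error reduction of the split, weighted by the samples reaching it
        importances[tree.feature[node]] += \
            tree.n_samples[node] * (tree.init_error[node] -
                                    tree.best_error[node])

    importances /= np.sum(importances)

    # Should match the attribute computed during fit
    print np.allclose(importances, clf.feature_importances_)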