Adding a pruning method to the tree #941

Open
wants to merge 14 commits

9 participants

@sgenoud

I have added a pruning method to the decision tree module. The idea with decision trees is to build a large one and then prune it with a weakest-link algorithm until it reaches a reasonable size (neither overfitting nor underfitting).

I have also built a helper function, cv_scores_vs_n_leaves, that computes the cross-validated scores for different sizes of the tree. The result can be plotted with a function such as

def plot_cross_validated_scores(scores):
    """Plots the cross validated scores versus the number of leaves of trees"""
    import numpy as np
    import matplotlib.pyplot as plt
    means = np.array([np.mean(s) for s in scores])
    stds = np.array([np.std(s) for s in scores])

    x = range(len(scores)+1, 1, -1)

    plt.plot(x, means)
    plt.plot(x, means+stds, lw=1, c='0.7')
    plt.plot(x, means-stds, lw=1, c='0.7')

    plt.xlabel('Number of leaves')
    plt.ylabel('Cross validated score')

Then we choose which size is the best for the data.
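For readers skimming the thread, here is a minimal sketch of the intended workflow, assuming the helper keeps the signature shown in the documentation further down (Boston is just an example dataset; cv_scores_vs_n_leaves is later renamed prune_path in the commits):

    from sklearn.datasets import load_boston
    from sklearn import tree

    boston = load_boston()

    # Grow a deliberately large tree first...
    clf = tree.DecisionTreeRegressor(max_depth=8)
    clf.fit(boston.data, boston.target)

    # ...compute cross validated scores for each candidate size...
    scores = tree.cv_scores_vs_n_leaves(clf, boston.data, boston.target,
                                        max_n_leaves=20)
    plot_cross_validated_scores(scores)

    # ...then prune the fitted tree back to the size chosen from the plot.
    clf = clf.prune(8)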

Just a couple of notes:

  • I need to add some tests (I am not sure how to write them)
  • This also needs to be documented

Before doing that work, though, I would gladly take some feedback on these modifications.

sklearn/tree/tree.py
((15 lines not shown))
+ return new_tree
+
+
+ def pruning_order(self):
+ """
+ Compute the order for which the tree should be pruned.
+
+ The algorithm used is weakest link pruning. It removes first the nodes
+ that improve the tree the least.
+
+
+ Parameters
+ ----------
+ tree : binary tree object
+ The binary tree for which to compute the complexity costs.
+
@jakevdp Collaborator
jakevdp added a note

The object doesn't need to be listed as a parameter here

@sgenoud
sgenoud added a note

sorry, I forgot to remove that when I refactored my code.

@jakevdp
Collaborator

Nice addition. I haven't had a chance to try it out in detail, but I read over the code and it looks good.
One suggestion: in the pruning_order function, it would be more efficient if we were able to pass an argument like max_to_prune which would allow you to specify the maximum number of nodes you're interested in pruning. As it's currently written, I think the function will sort all the nodes each time it's called.
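Roughly, the early exit being suggested would look like this (a sketch only, using Tree, np and _next_to_prune from sklearn/tree/tree.py in this pull request; the pruning_order in the diff below ends up taking this shape):

    def pruning_order(self, max_to_prune=None):
        """Nodes to prune, weakest link first, at most max_to_prune of them."""
        if max_to_prune is None:
            max_to_prune = self.node_count

        children = self.children.copy()
        nodes = []
        while True:
            node = _next_to_prune(self, children)
            nodes.append(node)
            # Stop as soon as the caller has enough nodes, instead of
            # ranking every node in the tree on each call.
            if len(nodes) == max_to_prune or node == 0:
                return np.array(nodes)
            # Turn the pruned node into a leaf before the next iteration.
            children[children[node], :] = Tree.UNDEFINED
            children[node, :] = Tree.LEAF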

@amueller amueller commented on the diff
sklearn/tree/tree.py
@@ -266,6 +291,105 @@ def _add_leaf(self, parent, is_left_child, value, error, n_samples):
return node_id
+ def _copy(self):
@amueller Owner

Is there a reason not to use clone here? Not sure if that copies the arrays, though. But I'm sure there is a method that does (for serialization for example).

@sgenoud
sgenoud added a note

clone copies the parameters of an Estimator object. The Tree results from fitting a DecisionTree -- it would not be copied. Also, a DecisionTree has a Tree and is an Estimator, but a Tree is not an estimator (it inherits directly from object).

If someone knows a copying function, I will gladly use it.

@glouppe Owner
glouppe added a note

copy or deepcopy from the copy module.
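For reference, going through the copy module would reduce _copy to roughly this (a sketch; whether deepcopy of the underlying numpy arrays is fast enough here has not been checked):

    from copy import deepcopy

    def _copy(self):
        # deepcopy recursively copies the numpy arrays held by the Tree,
        # so the pruned tree never shares state with the original.
        return deepcopy(self)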

@amueller amueller commented on the diff
sklearn/tree/tree.py
((4 lines not shown))
+ def prune(self, n_leaves):
+ """
+ Prunes the decision tree
+
+ This method is necessary to avoid overfitting tree models. While broad
+ decision trees should be computed in the first place, pruning them
+ allows for optimal, smaller trees.
+
+ Parameters
+ ----------
+ n_leaves : int
+ the number of leaves of the pruned tree
+
+ """
+ self.tree_ = self.tree_.prune(n_leaves)
+
@amueller Owner

should probably return self.

@amueller
Owner

Thanks for your contribution.
I didn't look at the code in detail, so just some very general remarks.

It would be nice to have some mention in the narrative docs, and maybe extend an already existing example to show how the pruning works and what effect it has.

I'm not sure what would be a good test for the pruning but maybe you can come up with something, for example testing on some toy data with known expected outcome.

Looking forward to seeing this in action :)

sklearn/tree/tree.py
((9 lines not shown))
+def _get_terminal_nodes(children):
+ """Lists the nodes that only have leaves as children"""
+ leaves = _get_leaves(children)
+ child_is_leaf = np.in1d(children, leaves).reshape(children.shape)
+ return np.where(np.all(child_is_leaf, axis=1))[0]
+
+def _next_to_prune(tree, children=None):
+ """Weakest link pruning for the subtree defined by children"""
+
+ if children is None:
+ children = tree.children
+
+ t_nodes = _get_terminal_nodes(children)
+ g_i = tree.init_error[t_nodes] - tree.best_error[t_nodes]
+
+ #alpha_i = np.min(g_i) / len(t_nodes)
@mrjbq7
mrjbq7 added a note

If this isn't needed, should remove it...

@jakevdp
Collaborator

I agree narrative docs and an example would be very helpful. You could modify these:

You could add a fit to the plot which uses a pruned tree. Hopefully it would do better than the green line, but not over-fit as much as the red line.

Also, a new example with your plot_cross_validated_scores() function would be very useful. It could go in the same directory, and be included in the same narrative document.

Sorry to ask for more work when you've already done so much! Let us know if you need any help. I'm excited to see the results! :grin:

sklearn/tree/tree.py
((58 lines not shown))
+ if (len(nodes) == max_to_prune) or (node == 0):
+ return np.array(nodes)
+
+ #Remove the subtree from the children array
+ children[children[node], :] = Tree.UNDEFINED
+ children[node, :] = Tree.LEAF
+
+ def prune(self, n_leaves):
+ """
+ Prunes the tree in order to obtain the optimal tree with n_leaves
+ leaves.
+
+
+ Parameters
+ ----------
+ n_leaves : binary tree object
@mrjbq7
mrjbq7 added a note

Isn't this an int? What does it mean for it to be a binary tree object?

sklearn/tree/tree.py
@@ -130,6 +130,31 @@ def recurse(tree, node_id, parent=None):
return out_file
+#Helper functions for the pruning algorithm
@mrjbq7
mrjbq7 added a note

Might be cleaner to make these functions local to tree_.prune, sort of similar to how build has a recursive_partition.

@glouppe Owner
glouppe added a note

+1. I would pack all this into a single prune method in the Tree class.

@jaquesgrobler

Seeing as _get_leaves(..) and _get_terminal_nodes(..) aren't very big functions, you could condense them into _next_to_prune(..), as they aren't used anywhere else. IMHO, it just makes the code a bit easier to maintain and read. _next_to_prune(..) is still small enough to simply include the code of the two smaller helper functions, rather than being broken up into smaller parts with more functions to maintain. I hope I'm making sense here :)
Nice work on this, btw!
:v:
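Concretely, the condensed version being suggested would look roughly like this (assembled from the helper bodies in the diff; an untested sketch):

    def _next_to_prune(tree, children=None):
        """Weakest link pruning for the subtree defined by children."""
        if children is None:
            children = tree.children

        # Former _get_leaves: rows whose two children slots are both LEAF.
        leaves = np.where(np.all(children == Tree.LEAF, axis=1))[0]

        # Former _get_terminal_nodes: nodes whose children are all leaves.
        child_is_leaf = np.in1d(children, leaves).reshape(children.shape)
        t_nodes = np.where(np.all(child_is_leaf, axis=1))[0]

        # The weakest link is the terminal node whose removal costs the least.
        g_i = tree.init_error[t_nodes] - tree.best_error[t_nodes]
        return t_nodes[np.argmin(g_i)]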

@sgenoud

Thanks for all your feedback!

I will definitely write some docs, but I wanted to see whether what I did was worth pulling before doing more work.

About my helper functions, one little detail: I use _get_leaves in the new leaves property that I added to Tree (and use it to compute the number of nodes to prune). I usually like to have small helper functions; it makes things easier for me to read afterwards. Nevertheless, if merging them in seems like a better design to most of you, I'll do it.

Also, it would be nice if one of you played a bit with the feature, as an external "test" of the function. I have used it for my own needs (and in one case compared it with an equivalent R function), but external confirmation is usually a good thing.

Steve Genoud added some commits
Steve Genoud Incorporated feedback
Correction of small details (documentation, commented code and return value)
35a8b3d
Steve Genoud Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 4310876
Steve Genoud Made n_output an optional value 0409c05
sklearn/tree/tree.py
((52 lines not shown))
+ nodes = list()
+
+ while True:
+ node = _next_to_prune(self, children)
+ nodes.append(node)
+
+ if (len(nodes) == max_to_prune) or (node == 0):
+ return np.array(nodes)
+
+ #Remove the subtree from the children array
+ children[children[node], :] = Tree.UNDEFINED
+ children[node, :] = Tree.LEAF
+
+ def prune(self, n_leaves):
+ """
+ Prunes the tree in order to obtain the optimal tree with n_leaves
@glouppe Owner
glouppe added a note

I would rather say the optimal subtree.

@amueller Owner

+1

sklearn/tree/tree.py
@@ -453,6 +575,23 @@ def __init__(self, criterion,
self.tree_ = None
self.feature_importances_ = None
+ def prune(self, n_leaves):
+ """
+ Prunes the decision tree
+
+ This method is necessary to avoid overfitting tree models. While broad
+ decision trees should be computed in the first place, pruning them
+ allows for optimal, smaller trees.
@glouppe Owner
glouppe added a note

They are suboptimal.

@amueller
Owner

I just had a quick look at the plot_prune_boston.py example.
I noticed a couple of things:

  • You should add plt.show() at the end, so that people that just call the example also see something.
  • From the plot it is not immediately clear what one should see. I think you should write a small paragraph about what the example shows.
  • If I interpret the example correctly, the more leaves the better. This means not pruning is better than pruning (maybe 20 leaves is still pruned, but I don't see performance deteriorating), right?
  • Finally, please run pep8 over it, there are some minor issues.
@amueller
Owner

The synthetic example is pretty cool on the other hand and nicely illustrates the effect of pruning.

It leaves me with one question, though: are there cases when pruning is better than regularizing via "min_leaf" or "max_depth"?
The example shows nicely that regularizing is better than not regularizing, but it is not immediately clear what the benefit of one way of regularizing over the other is. Do you think it is possible to include this in the example somehow? At the moment, the example reads to me as "if you don't prune, you overfit".

@amueller
Owner

I don't know what the other tree-growers think, but I feel having a separate "prune" method breaks the API pretty badly. I think n_leaves should just be an additional parameter and prune should be _prune. If one wants more details, one can use cv_scores_vs_n_leaves, which I would by the way rename to something like prune_path or similar, as it seems quite related to the other regularization path functions.

@amueller
Owner

I just realized that having a separate prune method means that the pruning cannot be used together with GridSearchCV, which is basically a no-go.
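To make that point concrete, with n_leaves as a constructor parameter (as proposed above; the parameter name is the one this PR later adds) a grid search could look like the following sketch:

    from sklearn.datasets import load_boston
    from sklearn.grid_search import GridSearchCV
    from sklearn import tree

    boston = load_boston()
    search = GridSearchCV(tree.DecisionTreeRegressor(max_depth=8),
                          param_grid={'n_leaves': range(2, 21)})
    search.fit(boston.data, boston.target)
    print search.best_estimator_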

@sgenoud

If I interpret the example correctly, the more leaves the better. This means not pruning is better then pruning (maybe 20 leaves is still pruned but I don't see performance deteriorating), right?

I think the heuristic in this case is to go for the smallest tree that reaches the plateau value. R actually applies such a heuristic automatically. As you said, I should explain better how to read the graphs.
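One way to encode that heuristic, assuming scores as returned by the helper (ordered from max_n_leaves down to 2 leaves) and using what amounts to the one-standard-error rule, could be:

    import numpy as np

    def smallest_tree_on_plateau(scores, max_n_leaves):
        """Smallest n_leaves whose mean CV score is within one standard
        error of the best mean score."""
        means = np.array([np.mean(s) for s in scores])
        sems = np.array([np.std(s) / np.sqrt(len(s)) for s in scores])
        n_leaves = np.arange(max_n_leaves, 1, -1)  # same ordering as scores
        best = np.argmax(means)
        on_plateau = means >= means[best] - sems[best]
        return n_leaves[on_plateau].min()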

It leaves me with one question, though: are there cases when pruning is better than regularizing via "min_leaf" or "max_depth"?

I should probably show how the trees can differ between the pruned and unpruned cases. With pruning, the resulting tree is "more optimal": it is possible that, while growing, we do not go into a branch because the next split does not improve the score much, even though the split after that would improve it greatly -- but it never gets the chance, because that path is not chosen.

I think n_leaves should just be an additional parameter and prune should be _prune.

What do you think of a mixed approach? We could add an auto_prune argument to the fit method (in addition to n_leaves) and keep prune a method. This would mean that by default trees are grown and pruned, but users who want to be more picky about it can prune the trees themselves. Also, GridSearchCV would work.

Note that this would change the interface of this particular method (by that I mean that the default behaviour would become linked with pruning, while it has not been so far). We would need to discuss a bit further what the default behaviour should be (do we ask only for n_leaves and by default grow a tree with 3 more depth levels?).

OK, to sum up: to better follow the API we need to change prune from an action performed on a fitted Estimator into the default way trees are sized. This would have consequences for how we present it in the docs. I would propose the following:

  • Introducing a section that discusses methods of sizing a tree
  • Another section about the prune_path with the synthetic example
@amueller
Owner

@sgenoud Thanks for your quick feedback. Before you get to work, I would prefer that someone else also voice their opinion, so that you don't do any unnecessary refactoring work.

If you could find an example that illustrates the benefits of pruning compared to other regularization methods, that would be great. For the synthetic example that you made, I fear that "more optimal" means "more overfitted" and that, for example, "min_samples_leaf" does better. Maybe you should also mention somewhere that pruning fits the data more strongly.

As a default behavior, I would have imagined to have no pruning, to be backward compatible and also because the ensemble module doesn't need pruning.

Steve Genoud pep8 corrections 92c6afb
@amueller
Owner

@glouppe @pprett @bdholt1 comments on this?

@glouppe
Owner

By default, I would indeed turn post-pruning off.

Regarding the API, I have to review the code before saying anything. I'll do that on Monday (sorry for the delay).

@bdholt1
Collaborator

I agree to leave pruning off by default.

Will take a closer look on monday as well...

@glouppe glouppe commented on the diff
doc/modules/tree.rst
@@ -183,6 +183,75 @@ instead of integer values::
* :ref:`example_tree_plot_tree_regression.py`
+
+.. _tree_pruning:
+
+Pruning
@glouppe Owner
glouppe added a note

To me, this section looks written as if post-pruning was the only way to prune a tree. I would rather introduce both pre-pruning (using min_samples_split or min_samples_leaf) and post-pruning and compare them both. What do you think?

@amueller Owner

I wouldn't call "min_samples_split" pruning, but I think comparing these different regularization methods would be good.

@glouppe Owner
glouppe added a note

Well, it is called a pre-pruning method in the literature.

@amueller Owner

Really? Ok then.

@glouppe
Owner

Regarding the API I was thinking maybe we could go for something like this:

1) Add an n_leaves parameter into the constructor of decision trees. Then n_leaves could either be

  • None (default value) to turn post-pruning off,
  • an integer corresponding to the (maximum) number of leaves we want the tree to have, or
  • "auto" to let the algorithm automatically find the best number of leaves. In that case, a part the training set should be put away to determine the best value of n_leaves.

2) Have a prune(self, X, y, n_leaves=auto|int) method to prune the tree on the set (X, y). That way, users could still prune a tree after fitting, on a new independent test set of their choice.

From an implementation point of view, fit in 1) should re-use prune of 2).

What do you think?

@amueller
Owner

I think 1) a) and b) are a must. This is the way to go to provide the usual interface.

for 1 c), I'm not entirely sure. This basically computes the regularization path and then takes the estimator that is best on a hold-out set, right? This is somewhat different from other regularization path functions. But maybe this would be better here? It certainly seems convenient.

Having 2) allows for maximum flexibility and I don't see a reason not to include it.

@glouppe
Owner

From an algorithmic point of view, wouldn't it also be better to implement the full usual weakest-link algorithm?

It is common to define a cost-complexity function C(alpha, T) = L(T) + alpha * |T| where

  • L(T) is the loss associated with the tree T
  • |T| is the number of leaves in T
  • alpha is a regularization parameter controlling the trade-off between L and the number of leaves.

and then to find the subtree that minimizes C(alpha, T).

(If we go for that, then that n_leaves parameter disappears in favor of alpha.)
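For concreteness, the selection rule that goes with that definition is small; here is a hedged sketch, assuming we already have the nested sequence of subtrees from weakest-link pruning together with their training losses and leaf counts:

    def best_subtree(subtrees, losses, n_leaves_list, alpha):
        """Pick the subtree T minimizing C(alpha, T) = L(T) + alpha * |T|.

        subtrees, losses and n_leaves_list are parallel lists describing
        the nested sequence produced by weakest-link pruning.
        """
        costs = [loss + alpha * n for loss, n in zip(losses, n_leaves_list)]
        return subtrees[costs.index(min(costs))]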

@amueller
Owner

@glouppe sorry, I don't have my trusty ESL with me. In what way is the algorithm different?

@glouppe
Owner

I have been quickly re-reading the literature on the weakest-link algorithm and it actually does not use any independent pruning set. Sorry for the confusion. Hence prune does not need X and y.

(I had in mind another post-pruning algorithm that uses an independent pruning set to find the subtree that minimizes some cost function over that set.)

@sgenoud

In a first version of the algorithm, I let the user choose between alpha and n_leaves. Then I realised that alpha is just a mathematical trick used to formulate the weakest-link algorithm.

Practically, it carries no more information than n_leaves. As it is not dimensionless (it is a cost per leaf), we don't know a priori what a good absolute value is.

The algorithm I used is the same as in ESL (as long as I didn't make a mistake); only the stopping parameter (n_leaves or alpha) is different. You could compute alpha as well and use that as a stopping parameter, but you would still need to build the cv_score... curve to find out what a good value of alpha is. It seemed to me that introducing a new concept, while mathematically purer, makes the algorithm more difficult to learn.

@glouppe
Owner

@amueller Not that much: basically, instead of stopping when the tree has exactly n_leaves, you pick the subtree that minimizes C(alpha, T). The same sequence of subtrees can be used, if I recall correctly. (I admit, though, that I am not really an expert in post-pruning methods; I never use them.)

@glouppe
Owner

@sgenoud Yes, I fully agree with you. However, don't you think it would be better to implement the method that the textbooks describe, if only not to confuse those who may have some background knowledge?

@sgenoud

@glouppe It is probably a question of taste in the end. Is there a scikit-learn policy for this kind of choice?

I would propose making the point in the documentation that alpha and n_leaves are equivalent, but keeping a simpler interface for the function, based only on the number of leaves. Users familiar with the method will quickly understand that there is not much difference; others might be confused until they read the docs.

Finally, we could also argue that it would make pre- and post-pruning more similar.

@amueller
Owner

I would favor n_leaves as it is much more intuitive. Also it is clear that the smallest change that has an effect is 1, and as @sgenoud said, it is more interpretable in the context of the other parameters.

@glouppe
Owner

Okay then, I am fine with that.

@sgenoud

I have tried to integrate your feedback; there are still two things to do:

  • edit the documentation
  • merge it with master (the Tree object has been moved to a Cython file, I think)
@sgenoud

@glouppe would you mind merging my modifications of the Tree object with your refactoring? I am no Cython expert and you are the one who refactored the object (and therefore know it well).

@glouppe
Owner

@sgenoud I can do it, but it'll unfortunately have to wait until mid-August. I am leaving for holidays soon and have a few remaining things to handle before then.

@amueller
Owner

@sgenoud maybe you can work on the doc in the meantime?

@amueller
Owner

Anything new?

@mrjbq7

Need help with this? @sgenoud, what is the status of the patch?

@sgenoud

Hi guys, sorry, I have started a new job that takes up a lot of my time. I should have more time for this in December. I'll keep you posted.

@sgenoud sgenoud closed this
@sgenoud sgenoud reopened this
@erg
erg commented

This patch is basically completely broken now, because it works against the old Tree code from before things were ported to Cython trees.

@aflaxman

I would like to do some pruning, so I've made an attempt to update the code in this pull request. I'm not sure if it is the best way to do things now, but at least it brings us close to where we were before.

The updated code is here: https://github.com/aflaxman/scikit-learn/tree/tree_pruning

Is there still interest in this?

@aflaxman aflaxman referenced this pull request from a commit
@glouppe glouppe Tree refactoring (19) 2fe48dc
@amueller
Owner

Yes, I think there is interest as long as it doesn't add overhead that can't be avoided for the forests.
I think having a pruning option for single trees would be great.
Maybe you should submit your own pull request with a reference to this one?

Commits on Jul 10, 2012
  1. Introduced pruning to decision trees

    Steve Genoud authored
  2. Fixed documentation of pruning_order function

    Steve Genoud authored
  3. Added a max_to_prune argument to pruning_order

    Steve Genoud authored
Commits on Jul 11, 2012
  1. Incorporated feedback

    Steve Genoud authored
    Correction of small details (documentation, commented code and return value)
  2. Made n_output an optional value

    Steve Genoud authored
  3. Cleaned the function documentation (again)

    Steve Genoud authored
  4. Use shuffle and split to cross validate

    Steve Genoud authored
  5. First draft of the documentation

    Steve Genoud authored
Commits on Jul 12, 2012
  1. pep8 corrections

    Steve Genoud authored
Commits on Jul 24, 2012
  1. @sgenoud

    Renamed cv_scores_vs_n_leaves as pruned_path

    sgenoud authored
    Also renamed the helper function that plot the result of the function
  2. @sgenoud
  3. @sgenoud
doc/modules/tree.rst
@@ -55,10 +55,10 @@ Some advantages of decision trees are:
The disadvantages of decision trees include:
- Decision-tree learners can create over-complex trees that do not
- generalise the data well. This is called overfitting. Mechanisms
- such as pruning (not currently supported), setting the minimum
- number of samples required at a leaf node or setting the maximum
- depth of the tree are necessary to avoid this problem.
+ generalise the data well. This is called overfitting. Mechanisms such as
+ pruning, setting the minimum number of samples required at a leaf node or
+ setting the maximum depth of the tree are necessary to avoid this
+ problem.
- Decision trees can be unstable because small variations in the
data might result in a completely different tree being generated.
@@ -183,6 +183,75 @@ instead of integer values::
* :ref:`example_tree_plot_tree_regression.py`
+
+.. _tree_pruning:
+
+Pruning
+=======
+
+A common approach to get the best possible tree is to grow a huge tree (for
+instance with ``max_depth=8``) and then prune it to an optimum size. As well as
+providing a `prune` method for both :class:`DecisionTreeRegressor` and
+:class:`DecisionTreeClassifier`, the function ``prune_path`` is useful
+to find what the optimum size is for a tree.
+
+The prune method just takes as argument the number of leaves the fitted tree
+should have (an int)::
+
+ >>> from sklearn.datasets import load_boston
+ >>> from sklearn import tree
+ >>> boston = load_boston()
+ >>> clf = tree.DecisionTreeRegressor(max_depth=8)
+ >>> clf = clf.fit(boston.data, boston.target)
+ >>> clf = clf.prune(8)
+
+In order to find the optimal number of leaves we can use cross validated scores
+on the data::
+
+ >>> from sklearn.datasets import load_boston
+ >>> from sklearn import tree
+ >>> boston = load_boston()
+ >>> clf = tree.DecisionTreeRegressor(max_depth=8)
+ >>> scores = tree.prune_path(clf, boston.data, boston.target,
+ ... max_n_leaves=20, n_iterations=10, random_state=0)
+
+In order to plot the scores one can use the following function::
+
+ def plot_pruned_path(scores, with_std=True):
+ """Plots the cross validated scores versus the number of leaves of trees"""
+ import matplotlib.pyplot as plt
+ means = np.array([np.mean(s) for s in scores])
+ stds = np.array([np.std(s) for s in scores]) / np.sqrt(len(scores[1]))
+
+ x = range(len(scores) + 1, 1, -1)
+
+ plt.plot(x, means)
+ if with_std:
+ plt.plot(x, means + 2 * stds, lw=1, c='0.7')
+ plt.plot(x, means - 2 * stds, lw=1, c='0.7')
+
+ plt.xlabel('Number of leaves')
+ plt.ylabel('Cross validated score')
+
+
+For instance, using the Boston dataset we obtain such a graph
+
+.. figure:: ../auto_examples/tree/images/plot_prune_boston_1.png
+ :target: ../auto_examples/tree/plot_prune_boston.html
+ :align: center
+ :scale: 75
+
+Here we see clearly that the optimum number of leaves is between 6 and 9. After
+that additional leaves do not improve (or diminish) the score of the cross
+validation.
+
+.. topic:: Examples:
+
+ * :ref:`example_tree_plot_prune_boston.py`
+ * :ref:`example_tree_plot_overfitting_cv.py`
+
+
+
.. _tree_multioutput:
Multi-output problems
examples/tree/plot_overfitting_cv.py
@@ -0,0 +1,72 @@
+"""
+====================================================
+Comparison of cross validated score with overfitting
+====================================================
+
+These two plots compare the cross validated score of the regression of
+a simple function. We see that below the optimal value of 7 leaves the
+regression is far from the real function. On the other hand, for higher
+numbers of leaves we clearly overfit.
+
+"""
+print __doc__
+
+import numpy as np
+from sklearn import tree
+
+
+def plot_pruned_path(scores, with_std=True):
+ """Plots the cross validated scores versus the number of leaves of trees"""
+ import matplotlib.pyplot as plt
+ means = np.array([np.mean(s) for s in scores])
+ stds = np.array([np.std(s) for s in scores]) / np.sqrt(len(scores[1]))
+
+ x = range(len(scores) + 1, 1, -1)
+
+ plt.plot(x, means)
+ if with_std:
+ plt.plot(x, means + 2 * stds, lw=1, c='0.7')
+ plt.plot(x, means - 2 * stds, lw=1, c='0.7')
+
+ plt.xlabel('Number of leaves')
+ plt.ylabel('Cross validated score')
+
+
+# Create a random dataset
+rng = np.random.RandomState(1)
+X = np.sort(5 * rng.rand(80, 1), axis=0)
+y = np.sin(X).ravel()
+y[1::5] += 3 * (0.5 - rng.rand(16))
+
+
+clf = tree.DecisionTreeRegressor(max_depth=20)
+scores = tree.prune_path(clf, X, y, max_n_leaves=20,
+ n_iterations=100, random_state=0)
+plot_pruned_path(scores)
+
+clf = tree.DecisionTreeRegressor(max_depth=20, n_leaves=15)
+clf.fit(X, y)
+X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
+
+#Prepare the different pruned level
+y_15 = clf.predict(X_test)
+
+clf = clf.prune(6)
+y_7 = clf.predict(X_test)
+
+clf = clf.prune(2)
+y_2 = clf.predict(X_test)
+
+# Plot the results
+import pylab as pl
+
+pl.figure()
+pl.scatter(X, y, c="k", label="data")
+pl.plot(X_test, y_2, c="g", label="n_leaves=2", linewidth=2)
+pl.plot(X_test, y_7, c="b", label="n_leaves=7", linewidth=2)
+pl.plot(X_test, y_15, c="r", label="n_leaves=15", linewidth=2)
+pl.xlabel("data")
+pl.ylabel("target")
+pl.title("Decision Tree Regression with levels of pruning")
+pl.legend()
+pl.show()
examples/tree/plot_prune_boston.py
@@ -0,0 +1,39 @@
+"""
+============================================
+Cross validated scores of the boston dataset
+============================================
+
+"""
+print __doc__
+
+import numpy as np
+from sklearn.datasets import load_boston
+from sklearn import tree
+
+
+def plot_pruned_path(scores, with_std=True):
+ """Plots the cross validated scores versus the number of leaves of trees"""
+ import matplotlib.pyplot as plt
+ means = np.array([np.mean(s) for s in scores])
+ stds = np.array([np.std(s) for s in scores]) / np.sqrt(len(scores[1]))
+
+ x = range(len(scores) + 1, 1, -1)
+
+ plt.plot(x, means)
+ if with_std:
+ plt.plot(x, means + 2 * stds, lw=1, c='0.7')
+ plt.plot(x, means - 2 * stds, lw=1, c='0.7')
+
+ plt.xlabel('Number of leaves')
+ plt.ylabel('Cross validated score')
+
+
+boston = load_boston()
+clf = tree.DecisionTreeRegressor(max_depth=8)
+
+#Compute the cross validated scores
+scores = tree.prune_path(clf, boston.data, boston.target,
+ max_n_leaves=20, n_iterations=10,
+ random_state=0)
+
+plot_pruned_path(scores)
sklearn/tree/__init__.py
@@ -8,3 +8,4 @@
from .tree import ExtraTreeClassifier
from .tree import ExtraTreeRegressor
from .tree import export_graphviz
+from .tree import prune_path
sklearn/tree/tree.py
@@ -174,7 +174,7 @@ class Tree(object):
LEAF = -1
UNDEFINED = -2
- def __init__(self, n_classes, n_features, n_outputs, capacity=3):
+ def __init__(self, n_classes, n_features, n_outputs=1, capacity=3):
self.n_classes = n_classes
self.n_features = n_features
self.n_outputs = n_outputs
@@ -266,6 +266,122 @@ def _add_leaf(self, parent, is_left_child, value, error, n_samples):
return node_id
+ def _copy(self):
+ new_tree = Tree(self.n_classes, self.n_features, self.n_outputs)
+ new_tree.node_count = self.node_count
+ new_tree.children = self.children.copy()
+ new_tree.feature = self.feature.copy()
+ new_tree.threshold = self.threshold.copy()
+ new_tree.value = self.value.copy()
+ new_tree.best_error = self.best_error.copy()
+ new_tree.init_error = self.init_error.copy()
+ new_tree.n_samples = self.n_samples.copy()
+
+ return new_tree
+
+ @staticmethod
+ def _get_leaves(children):
+ """Lists the leaves from the children array of a tree object"""
+ return np.where(np.all(children == Tree.LEAF, axis=1))[0]
+
+ @property
+ def leaves(self):
+ return self._get_leaves(self.children)
+
+ def pruning_order(self, max_to_prune=None):
+ """Compute the order for which the tree should be pruned.
+
+ The algorithm used is weakest link pruning. It removes first the nodes
+ that improve the tree the least.
+
+
+ Parameters
+ ----------
+ max_to_prune : int, optional (default=all the nodes)
+ maximum number of nodes to prune
+
+ Returns
+ -------
+ nodes : numpy array
+ list of the nodes to remove to get to the optimal subtree.
+
+ References
+ ----------
+
+ .. [1] J. Friedman and T. Hastie, "The elements of statistical
+ learning", 2001, section 9.2.1
+
+ """
+
+ def _get_terminal_nodes(children):
+ """Lists the nodes that only have leaves as children"""
+ leaves = self._get_leaves(children)
+ child_is_leaf = np.in1d(children, leaves).reshape(children.shape)
+ return np.where(np.all(child_is_leaf, axis=1))[0]
+
+ def _next_to_prune(tree, children=None):
+ """Weakest link pruning for the subtree defined by children"""
+
+ if children is None:
+ children = tree.children
+
+ t_nodes = _get_terminal_nodes(children)
+ g_i = tree.init_error[t_nodes] - tree.best_error[t_nodes]
+
+ return t_nodes[np.argmin(g_i)]
+
+ if max_to_prune is None:
+ max_to_prune = self.node_count
+
+ children = self.children.copy()
+ nodes = list()
+
+ while True:
+ node = _next_to_prune(self, children)
+ nodes.append(node)
+
+ if (len(nodes) == max_to_prune) or (node == 0):
+ return np.array(nodes)
+
+ #Remove the subtree from the children array
+ children[children[node], :] = Tree.UNDEFINED
+ children[node, :] = Tree.LEAF
+
+ def prune(self, n_leaves):
+ """Prunes the tree to obtain the optimal subtree with n_leaves leaves.
+
+
+ Parameters
+ ----------
+ n_leaves : int
+ The final number of leaves the algorithm should bring
+
+ Returns
+ -------
+ tree : a Tree object
+ returns a new, pruned, tree
+
+ References
+ ----------
+
+ .. [1] J. Friedman and T. Hastie, "The elements of statistical
+ learning", 2001, section 9.2.1
+
+ """
+
+ to_remove_count = self.node_count - len(self.leaves) - n_leaves + 1
+ nodes_to_remove = self.pruning_order(to_remove_count)
+
+ out_tree = self._copy()
+
+ for node in nodes_to_remove:
+ #TODO: Add a Tree method to remove a branch of a tree
+ out_tree.children[out_tree.children[node], :] = Tree.UNDEFINED
+ out_tree.children[node, :] = Tree.LEAF
+ out_tree.node_count -= 2
+
+ return out_tree
+
def build(self, X, y, criterion, max_depth, min_samples_split,
min_samples_leaf, min_density, max_features, random_state,
find_split, sample_mask=None, X_argsorted=None):
@@ -440,6 +556,7 @@ def __init__(self, criterion,
max_depth,
min_samples_split,
min_samples_leaf,
+ n_leaves,
min_density,
max_features,
compute_importances,
@@ -448,6 +565,7 @@ def __init__(self, criterion,
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
+ self.n_leaves = n_leaves
self.min_density = min_density
self.max_features = max_features
self.compute_importances = compute_importances
@@ -462,6 +580,22 @@ def __init__(self, criterion,
self.tree_ = None
self.feature_importances_ = None
+ def prune(self, n_leaves):
+ """Prunes the decision tree
+
+ This method is necessary to avoid overfitting tree models. While broad
+ decision trees should be computed in the first place, pruning them
+ allows for smaller trees.
+
+ Parameters
+ ----------
+ n_leaves : int
+ the number of leaves of the pruned tree
+
+ """
+ self.tree_ = self.tree_.prune(n_leaves)
+ return self
+
def fit(self, X, y, sample_mask=None, X_argsorted=None):
"""Build a decision tree from the training set (X, y).
@@ -565,6 +699,9 @@ def fit(self, X, y, sample_mask=None, X_argsorted=None):
self.feature_importances_ = \
self.tree_.compute_feature_importances()
+ if self.n_leaves is not None:
+ self.prune(self.n_leaves)
+
return self
def predict(self, X):
@@ -634,6 +771,10 @@ class DecisionTreeClassifier(BaseDecisionTree, ClassifierMixin):
min_samples_leaf : integer, optional (default=1)
The minimum number of samples required to be at a leaf node.
+ n_leaves : integer, optional (default=None)
+ The number of leaves of the post-pruned tree. If None, no post-pruning
+ will be run.
+
min_density : float, optional (default=0.1)
This parameter controls a trade-off in an optimization heuristic. It
controls the minimum density of the `sample_mask` (i.e. the
@@ -711,6 +852,7 @@ def __init__(self, criterion="gini",
max_depth=None,
min_samples_split=1,
min_samples_leaf=1,
+ n_leaves=None,
min_density=0.1,
max_features=None,
compute_importances=False,
@@ -719,6 +861,7 @@ def __init__(self, criterion="gini",
max_depth,
min_samples_split,
min_samples_leaf,
+ n_leaves,
min_density,
max_features,
compute_importances,
@@ -814,6 +957,10 @@ class DecisionTreeRegressor(BaseDecisionTree, RegressorMixin):
min_samples_leaf : integer, optional (default=1)
The minimum number of samples required to be at a leaf node.
+ n_leaves : integer, optional (default=None)
+ The number of leaves of the post-pruned tree. If None, no post-pruning
+ will be run.
+
min_density : float, optional (default=0.1)
This parameter controls a trade-off in an optimization heuristic. It
controls the minimum density of the `sample_mask` (i.e. the
@@ -893,6 +1040,7 @@ def __init__(self, criterion="mse",
max_depth=None,
min_samples_split=1,
min_samples_leaf=1,
+ n_leaves=None,
min_density=0.1,
max_features=None,
compute_importances=False,
@@ -901,6 +1049,7 @@ def __init__(self, criterion="mse",
max_depth,
min_samples_split,
min_samples_leaf,
+ n_leaves,
min_density,
max_features,
compute_importances,
@@ -933,6 +1082,7 @@ def __init__(self, criterion="gini",
max_depth=None,
min_samples_split=1,
min_samples_leaf=1,
+ n_leaves=None,
min_density=0.1,
max_features="auto",
compute_importances=False,
@@ -941,6 +1091,7 @@ def __init__(self, criterion="gini",
max_depth,
min_samples_split,
min_samples_leaf,
+ n_leaves,
min_density,
max_features,
compute_importances,
@@ -979,6 +1130,7 @@ def __init__(self, criterion="mse",
max_depth=None,
min_samples_split=1,
min_samples_leaf=1,
+ n_leaves=None,
min_density=0.1,
max_features="auto",
compute_importances=False,
@@ -987,9 +1139,75 @@ def __init__(self, criterion="mse",
max_depth,
min_samples_split,
min_samples_leaf,
+ n_leaves,
min_density,
max_features,
compute_importances,
random_state)
self.find_split_ = _tree._find_best_random_split
+
+
+def prune_path(clf, X, y, max_n_leaves=10, n_iterations=10,
+ test_size=0.1, random_state=None):
+ """Cross validation of scores for different values of the decision tree.
+
+ This function allows testing what the optimal size of the post-pruned
+ decision tree should be. It computes cross validated scores for different
+ sizes of the tree.
+
+ Parameters
+ ----------
+ clf: decision tree estimator object
+ The object to use to fit the data
+
+ X: array-like of shape at least 2D
+ The data to fit.
+
+ y: array-like
+ The target variable to try to predict.
+
+ max_n_leaves : int, optional (default=10)
+ maximum number of leaves of the tree to prune
+
+ n_iterations : int, optional (default=10)
+ Number of re-shuffling & splitting iterations.
+
+ test_size : float (default=0.1) or int
+ If float, should be between 0.0 and 1.0 and represent the
+ proportion of the dataset to include in the test split. If
+ int, represents the absolute number of test samples.
+
+ random_state : int or RandomState
+ Pseudo-random number generator state used for random sampling.
+
+ Returns
+ -------
+ scores : list of list of floats
+ The scores of the computed cross validated trees grouped by tree size.
+ scores[0] correspond to the values of trees of size max_n_leaves and
+ scores[-1] to the tree with just two leaves.
+
+ """
+
+ from ..base import clone
+ from ..cross_validation import ShuffleSplit
+
+ scores = list()
+
+ kf = ShuffleSplit(len(y), n_iterations, test_size,
+ random_state=random_state)
+ for train, test in kf:
+ estimator = clone(clf)
+ fitted = estimator.fit(X[train], y[train])
+
+ loc_scores = list()
+ for i in range(max_n_leaves, 1, -1):
+ #We loop from the bigger values to the smaller ones in order to be
+ #able to compute the original tree once, and then make it smaller
+ fitted.prune(n_leaves=i)
+ loc_scores.append(fitted.score(X[test], y[test]))
+
+ scores.append(loc_scores)
+
+ return zip(*scores)