# scikit-learn/scikit-learn


# Adding a pruning method to the tree #941

Open
wants to merge 14 commits into from
+404 −5

### 9 participants

I have added a pruning method to the decision tree module. The idea with decision trees is to build a large one, then prune it via a weakest-link algorithm until it reaches a reasonable size (neither overfitting nor underfitting).

I have also built a helper function, cv_scores_vs_n_leaves, that computes the cross-validated scores for different sizes of the tree. These can be plotted with a function such as:

def plot_cross_validated_scores(scores):
    """Plot the cross-validated scores versus the number of leaves of trees."""
    import matplotlib.pyplot as plt
    import numpy as np

    means = np.array([np.mean(s) for s in scores])
    stds = np.array([np.std(s) for s in scores])

    # one x value per entry in scores, from len(scores) + 1 leaves down to 2
    x = range(len(scores) + 1, 1, -1)

    plt.plot(x, means)
    plt.plot(x, means + stds, lw=1, c='0.7')
    plt.plot(x, means - stds, lw=1, c='0.7')

    plt.xlabel('Number of leaves')
    plt.ylabel('Cross validated score')

Then we choose which size is the best for the data.

Just a couple of notes:

• I need to add some tests (I am not sure how to make them)
• This also needs to be documented

Before doing that work, I would gladly get some feedback on these modifications.

 Steve Genoud Introduced pruning to decision trees 4a75a4a
sklearn/tree/tree.py
 ((15 lines not shown))
+        return new_tree
+
+
+    def pruning_order(self):
+        """
+        Compute the order for which the tree should be pruned.
+
+        The algorithm used is weakest link pruning. It removes first the nodes
+        that improve the tree the least.
+
+
+        Parameters
+        ----------
+        tree : binary tree object
+            The binary tree for which to compute the complexity costs.
+
Collaborator jakevdp added a note Jul 10, 2012: The object doesn't need to be listed as a parameter here.
sgenoud added a note Jul 10, 2012: Sorry, I forgot to remove that when I refactored my code.
Collaborator

Nice addition. I haven't had a chance to try it out in detail, but I read over the code and it looks good.
One suggestion: in the pruning_order function, it would be more efficient if we were able to pass an argument like max_to_prune which would allow you to specify the maximum number of nodes you're interested in pruning. As it's currently written, I think the function will sort all the nodes each time it's called.
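The efficiency point can be illustrated with a minimal, self-contained sketch (not the PR's code; the variable names and toy numbers are made up): selecting only the k weakest links does not require sorting every node.

```python
import heapq

g = [5.0, 0.2, 3.1, 0.9, 2.4]  # toy link strengths g_i, one per node

# full sort: O(n log n) work even if only a few nodes will be pruned
full_order = sorted(range(len(g)), key=g.__getitem__)

# with a max_to_prune-style bound, only the k smallest are needed: O(n log k)
k = 2
partial = heapq.nsmallest(k, range(len(g)), key=g.__getitem__)
```

Both approaches agree on the first k nodes to prune; the heap-based version simply stops early.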

 Steve Genoud Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 48fad0c Steve Genoud Fixed documentation of pruning_order function c51698a Steve Genoud Added a max_to_prune argument to pruning_order c139f6a
commented on the diff
sklearn/tree/tree.py
 @@ -266,6 +291,105 @@ def _add_leaf(self, parent, is_left_child, value, error, n_samples):
         return node_id
+    def _copy(self):
Owner amueller added a note Jul 10, 2012: Is there a reason not to use clone here? Not sure if that copies the arrays, though. But I'm sure there is a method that does (for serialization, for example).
sgenoud added a note Jul 11, 2012: clone clones the parameters of the Estimator object. Tree results from the fit of a DecisionTree -- it is not copied. Also, DecisionTree has a Tree and is an Estimator, but a Tree is not an estimator (it inherits directly from object). If someone knows a copying function, I will gladly use it.
Owner glouppe added a note Jul 11, 2012: copy or deepcopy from the copy module.
commented on the diff
sklearn/tree/tree.py
 ((4 lines not shown))
+    def prune(self, n_leaves):
+        """
+        Prunes the decision tree
+
+        This method is necessary to avoid overfitting tree models. While broad
+        decision trees should be computed in the first place, pruning them
+        allows for optimal, smaller trees.
+
+        Parameters
+        ----------
+        n_leaves : int
+            the number of leaves of the pruned tree
+
+        """
+        self.tree_ = self.tree_.prune(n_leaves)
+
Owner amueller added a note Jul 10, 2012: Should probably return self.
Owner

I didn't look at the code in detail, so just some very general remarks.

It would be nice to have some mention of this in the narrative docs, and maybe extend an already existing example to show how the pruning works and what effect it has.

I'm not sure what would be a good test for the pruning, but maybe you can come up with something, for example testing on some toy data with a known expected outcome.

Looking forward to seeing this in action :)

sklearn/tree/tree.py
 ((9 lines not shown))
+def _get_terminal_nodes(children):
+    """Lists the nodes that only have leaves as children"""
+    leaves = _get_leaves(children)
+    child_is_leaf = np.in1d(children, leaves).reshape(children.shape)
+    return np.where(np.all(child_is_leaf, axis=1))[0]
+
+def _next_to_prune(tree, children=None):
+    """Weakest link pruning for the subtree defined by children"""
+
+    if children is None:
+        children = tree.children
+
+    t_nodes = _get_terminal_nodes(children)
+    g_i = tree.init_error[t_nodes] - tree.best_error[t_nodes]
+
+    #alpha_i = np.min(g_i) / len(t_nodes)
mrjbq7 added a note Jul 10, 2012: If this isn't needed, it should be removed...
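To make the diff above easier to follow, here is a self-contained sketch of the terminal-node search those helpers perform, on a toy tree. The sentinel value and the error arrays are assumptions for illustration (the PR uses Tree.LEAF and the tree's init_error/best_error attributes):

```python
import numpy as np

LEAF = -1  # assumed sentinel for "no child"; the PR uses Tree.LEAF

# children[i] = [left_child, right_child] of node i; a toy 5-node tree:
#        0
#       / \
#      1   2
#     / \
#    3   4          (nodes 2, 3 and 4 are leaves)
children = np.array([[1, 2],
                     [3, 4],
                     [LEAF, LEAF],
                     [LEAF, LEAF],
                     [LEAF, LEAF]])

def get_leaves(children):
    """Nodes with no children at all."""
    return np.where(np.all(children == LEAF, axis=1))[0]

def get_terminal_nodes(children):
    """Internal nodes whose children are all leaves."""
    leaves = get_leaves(children)
    child_is_leaf = np.in1d(children, leaves).reshape(children.shape)
    return np.where(np.all(child_is_leaf, axis=1))[0]

# weakest link: the terminal node with the smallest error improvement g_i
init_error = np.array([10.0, 4.0, 2.0, 1.0, 1.0])  # error if node were a leaf
best_error = np.array([2.0, 2.0, 2.0, 1.0, 1.0])   # error of its best subtree
t_nodes = get_terminal_nodes(children)
weakest = t_nodes[np.argmin(init_error[t_nodes] - best_error[t_nodes])]
```

Here node 1 is the only node whose children are both leaves, so it is the next candidate to collapse.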
Collaborator

I agree narrative docs and an example would be very helpful. You could modify these:

You could add a fit to the plot which uses a pruned tree. Hopefully it would do better than the green line, but not over-fit as much as the red line.

Also, a new example with your plot_cross_validated_scores() function would be very useful. It could go in the same directory, and be included in the same narrative document.

Sorry to ask for more work when you've already done so much! Let us know if you need any help. I'm excited to see the results!

sklearn/tree/tree.py
 ((58 lines not shown))
+            if (len(nodes) == max_to_prune) or (node == 0):
+                return np.array(nodes)
+
+            #Remove the subtree from the children array
+            children[children[node], :] = Tree.UNDEFINED
+            children[node, :] = Tree.LEAF
+
+    def prune(self, n_leaves):
+        """
+        Prunes the tree in order to obtain the optimal tree with n_leaves
+        leaves.
+
+
+        Parameters
+        ----------
+        n_leaves : binary tree object
mrjbq7 added a note Jul 10, 2012: Isn't this an int? What does it mean to be a binary tree object?
sklearn/tree/tree.py
 @@ -130,6 +130,31 @@ def recurse(tree, node_id, parent=None):
     return out_file
+#Helper functions for the pruning algorithm
mrjbq7 added a note Jul 10, 2012: Might be cleaner to make these functions local to tree_.prune, sort of similar to how build has a recursive_partition.
Owner glouppe added a note Jul 23, 2012: +1. I would pack all this into a single prune method in the Tree class.

Seeing as _get_leaves(..) and _get_terminal_nodes(..) aren't very big functions, you could condense them into _next_to_prune(..), since they aren't used anywhere else. IMHO, it just makes the code a bit easier to maintain and read. _next_to_prune(..) is still small enough to simply absorb the code of the two smaller helpers, rather than being broken up into more functions to maintain. I hope I'm making sense here :)
Nice work on this, btw!

I will for sure write some docs, but I wanted to see whether what I did was worth pulling before doing more work.

About my helper functions, one little detail: I use _get_leaves in the new leaves property that I added to Tree (and use it to compute the number of nodes to prune). I usually like to have small helper functions; it makes things easier for me to read afterwards. Nevertheless, if merging them in seems the better design to most of you, I'll do it.

Also, it would be nice if one of you played a bit with the feature, as an external "test" of the function. I have used it for my own needs (and compared it in one case with an equivalent R function), but external confirmation is usually a good thing.

 Steve Genoud Incorporated feedback Correction of small details (documentation, commented code and return value) 35a8b3d Steve Genoud Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 4310876 Steve Genoud Made n_output an optional value 0409c05
sklearn/tree/tree.py
 ((52 lines not shown))
+        nodes = list()
+
+        while True:
+            node = _next_to_prune(self, children)
+            nodes.append(node)
+
+            if (len(nodes) == max_to_prune) or (node == 0):
+                return np.array(nodes)
+
+            #Remove the subtree from the children array
+            children[children[node], :] = Tree.UNDEFINED
+            children[node, :] = Tree.LEAF
+
+    def prune(self, n_leaves):
+        """
+        Prunes the tree in order to obtain the optimal tree with n_leaves
Owner glouppe added a note Jul 11, 2012: I would rather say the optimal subtree.
Owner amueller added a note Jul 11, 2012: +1
sklearn/tree/tree.py
 @@ -453,6 +575,23 @@ def __init__(self, criterion,
         self.tree_ = None
         self.feature_importances_ = None
+    def prune(self, n_leaves):
+        """
+        Prunes the decision tree
+
+        This method is necessary to avoid overfitting tree models. While broad
+        decision trees should be computed in the first place, pruning them
+        allows for optimal, smaller trees.
Owner glouppe added a note Jul 11, 2012: They are suboptimal.
 Steve Genoud Cleaned the function documentation (again) 89af373 Steve Genoud Use shuffle and split to cross validate 45adf40 Steve Genoud First draft of the documentation 9e41d88
Owner

I just had a quick look at the plot_prune_boston.py example.
I noticed a couple of things:

• You should add plt.show() at the end, so that people that just call the example also see something.
• From the plot it is not immediately clear what one should see. I think you should write a small paragraph about what the example shows.
• If I interpret the example correctly, the more leaves the better. This means not pruning is better than pruning (maybe 20 leaves is still pruned, but I don't see performance deteriorating), right?
• Finally, please run pep8 over it, there are some minor issues.
Owner

The synthetic example is pretty cool on the other hand and nicely illustrates the effect of pruning.

It leaves me with one question, though: are there cases when pruning is better than regularizing via "min_leaf" or "max_depth"?
The example nicely shows that regularizing is better than not regularizing, but it is not immediately clear what the benefit of one or the other way of regularizing is. Do you think it is possible to include this in the example somehow? At the moment, the example reads like "if you don't prune, you overfit" to me.

Owner

I don't know what the other tree-growers think, but I feel having a separate "prune" method breaks the API pretty badly. I think n_leaves should just be an additional parameter and prune should be _prune. If one wants more details, one can use cv_scores_vs_n_leaves. Which I would, by the way, rename to something like prune_path or similar, as this seems quite related to other regularization path functions.

Owner

I just realized that having a separate prune method means that the pruning can not be used together with GridSearchCV, which is basically a no-go.

If I interpret the example correctly, the more leaves the better. This means not pruning is better then pruning (maybe 20 leaves is still pruned but I don't see performance deteriorating), right?

I think the heuristic in this case is to go for the smallest tree that reaches the plateau value. Actually, R automatically uses such a heuristic. As you said, I should explain more how to read the graphs.

It leaves me with one question, though: are there cases when pruning is better than regularizing via "min_leaf" or "max_depth"?

I should probably show how the trees can differ between the pruned and unpruned cases. Via pruning the resulting tree is "more optimal". It is possible that, while growing, we do not go into a branch because the next step does not improve the score much, even though the step after it would improve the score greatly -- but it never will, because that path is not chosen.

I think n_leaves should just be an additional parameter and prune should be _prune.

What do you think of a mixed approach? We could add an auto_prune argument to the fit method (in addition to n_leaves) and keep prune as a method. This would mean that by default trees are grown and pruned, but users who want to be more picky about it can prune the trees themselves. Also, GridSearchCV would work.

Note that this would change the interface of this particular method (by that I mean that the default behaviour would become linked with pruning, while it was not so far). We would need to discuss a bit further what the default behaviour should be (do we ask only for n_leaves and by default grow a tree with 3 more depth levels?).

Ok, if I sum it up: to better follow the API, we need to change prune from an action performed on a fitted Estimator to the default way in which trees are sized. This would have consequences on how we present it in the docs. I would propose the following:

• Introducing a section that discusses methods of sizing a tree
• Another section about the prune_path with the synthetic example
Owner

@sgenoud Thanks for your quick feedback. Before you get to work, I would prefer if someone else would also voice their opinion, so that you don't do any unnecessary refactoring work.

If you could find an example that illustrates the benefits of pruning compared to other regularization methods, that would be great. For the synthetic example that you made, I fear that "more optimal" means "more overfitted" and for example "min_samples_leaf" does better. Maybe you should also mention somewhere that the pruning fits more strongly to the data.

As a default behavior, I would have imagined to have no pruning, to be backward compatible and also because the ensemble module doesn't need pruning.

 Steve Genoud pep8 corrections 92c6afb
Owner

@glouppe @pprett @bdholt1 comments on this?

Owner

By default, I would indeed turn post-pruning off.

Regarding the API, I have to review the code before saying anything. I'll do that on Monday (sorry for the delay).

Collaborator

I agree to leave pruning off by default.

Will take a closer look on monday as well...

commented on the diff
doc/modules/tree.rst
 @@ -183,6 +183,75 @@ instead of integer values::
 * :ref:`example_tree_plot_tree_regression.py`
+
+.. _tree_pruning:
+
+Pruning
Owner glouppe added a note Jul 23, 2012: To me, this section reads as if post-pruning were the only way to prune a tree. I would rather introduce both pre-pruning (using min_samples_split or min_samples_leaf) and post-pruning, and compare them. What do you think?
Owner amueller added a note Jul 23, 2012: I wouldn't call "min_samples_split" pruning, but I think comparing these different regularization methods would be good.
Owner glouppe added a note Jul 23, 2012: Well, it is called a pre-pruning method in the literature.
Owner amueller added a note Jul 23, 2012: Really? Ok then.
Owner

Regarding the API I was thinking maybe we could go for something like this:

1) Add an n_leaves parameter into the constructor of decision trees. Then n_leaves could either be

• None (default value) to turn post-pruning off,
• an integer corresponding to the (maximum) number of leaves we want the tree to have, or
• "auto" to let the algorithm automatically find the best number of leaves. In that case, a part of the training set should be put aside to determine the best value of n_leaves.

2) Have a prune(self, X, y, n_leaves=auto|int) method to prune the tree on the set (X, y). That way, users could still prune a tree after fitting, on a new independent test set of their choice.

From an implementation point of view, fit in 1) should re-use prune of 2).

What do you think?
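The proposal above can be sketched as an interface skeleton. This is purely hypothetical illustration, not actual scikit-learn code; the class name, the _grow placeholder, and the defaults are assumptions drawn from the discussion:

```python
# Hypothetical sketch of the proposed interface; names and defaults
# are assumptions, not actual scikit-learn API.
class PrunableTreeEstimator:
    def __init__(self, n_leaves=None):
        # None: post-pruning off (default); int: target leaf count;
        # "auto": pick n_leaves using a held-out part of the training set
        self.n_leaves = n_leaves

    def fit(self, X, y):
        self._grow(X, y)  # grow the full tree first
        if self.n_leaves is not None:
            self.prune(X, y, n_leaves=self.n_leaves)  # fit re-uses prune
        return self

    def prune(self, X, y, n_leaves="auto"):
        # weakest-link pruning on (X, y) would happen here
        return self

    def _grow(self, X, y):
        pass  # stand-in for the usual tree-growing step
```

Because n_leaves lives in the constructor and fit returns self, such an estimator would also work with GridSearchCV, addressing the earlier objection.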

Owner

I think 1a) and 1b) are a must. This is the way to go to provide the usual interface.

for 1 c), I'm not entirely sure. This basically computes the regularization path and then takes the estimator that is best on a hold-out set, right? This is somewhat different from other regularization path functions. But maybe this would be better here? It certainly seems convenient.

Having 2) allows for maximum flexibility and I don't see a reason not to include it.

Owner

From an algorithmic point of view, wouldn't it also be better to implement the full usual weakest-link algorithm?

It is common to define a cost-complexity function C(alpha, T) = L(T) + alpha * |T| where

• L(T) is the loss associated with the tree T
• |T| is the number of leaves in T
• alpha is a regularization parameter controlling the trade-off between L and the number of leaves.

and then to find the subtree that minimizes C(alpha, T).

(If we go for that, then that n_leaves parameter disappears in favor of alpha.)

Owner

@glouppe Sorry, I don't have my trusty ESL with me. How exactly is the algorithm different?

Owner

I have been quickly re-reading the literature on the weakest-link algorithm and it actually does not use any independent pruning set. Sorry for the confusion. Hence prune does not need X and y.

(I had in mind another post-pruning algorithm that uses an independent pruning set to find the subtree that minimizes some cost function over that set.)

In a first version of the algorithm, I let the user choose between alpha and n_leaves. Then I realised that alpha is just a mathematical trick to formulate the weakest link algorithm.

Practically, it has no more importance than n_leaves. As it is not dimensionless (it is a cost per leaf), we don't know a priori what a good absolute value is.

The algorithm I used is the same as in ESL (as far as I can tell, barring mistakes). The stopping parameter (n_leaves or alpha) is different. What you can do is compute alpha as well and use that as a stopping parameter. But you still need to build the cv_score... to find out what a good value of alpha is. It seemed to me that introducing a new concept, while mathematically purer, makes the algorithm more difficult to learn.

Owner

@amueller Not that much; basically, instead of stopping when the tree has exactly n_leaves, you instead pick the subtree that minimizes C(alpha, T). The same sequence of subtrees can be used, if I recall correctly. (I admit, though, that I am not really an expert on post-pruning methods; I never use them.)

Owner

@sgenoud Yes, I fully agree with you. However, don't you think it would be better to implement the method that textbooks describe? If only not to confuse those who may have some background knowledge.

@glouppe It is probably a question of taste in the end. Is there a scikit-learn policy for this kind of choice?

I would propose to make the point in the documentation that alpha and n_leaves are equivalent, but to keep a simpler interface based only on the number of leaves. Users familiar with the method will quickly understand that there is not much difference; others may be confused until they read the docs.

Finally, we could also argue that it would make pre and post pruning more similar.

Owner

I would favor n_leaves as it is much more intuitive. Also it is clear that the smallest change that has an effect is 1, and as @sgenoud said, it is more interpretable in the context of the other parameters.

Owner

Okay then, I am fine with that.

 sgenoud Renamed cv_scores_vs_n_leaves as pruned_path Also renamed the helper function that plot the result of the function acd4048 sgenoud Added n_leaves to the DecisionTree estimator 98255d8 sgenoud Moved the helper functions inside the Tree object ea8e460

I have tried to integrate your feedback; there are still two things to do:

• edit the documentation
• merge it with the master (the Tree object has been moved to a cython file I think)

@glouppe would you mind merging my modifications of the Tree object with your refactoring? I am no Cython expert and you are the one who refactored the object (and therefore know it well).

Owner

@sgenoud I can do it, but it'll unfortunately have to wait until mid-August. I am soon leaving for holidays and have a few remaining things to handle before then.

Owner

@sgenoud maybe you can work on the doc in the meantime?

Owner

Anything new?

Need help with this? @sgenoud, what is the status of the patch?

Hi guys, sorry, I have started a new job that takes a lot of my time. I should have more time for this in December. I'll keep you posted.

closed this
reopened this
commented

This patch is basically completely busted now, because it works on the old tree implementation from before things were ported to Cython trees.

I would like to do some pruning, so I've made an attempt to update the code in this pull request. I'm not sure if it is the best way to do things now, but at least it brings us close to where we were before.

The updated code is here: https://github.com/aflaxman/scikit-learn/tree/tree_pruning

Is there still interest in this?

referenced this pull request from a commit
 glouppe Tree refactoring (19) 2fe48dc
Owner

Yes, I think there is interest as long as it doesn't add overhead that can't be avoided for the forests.
I think having a pruning option for single trees would be great.
Maybe you should submit your own pull request with a reference to this one?
