
# [MRG] Adds Minimal Cost-Complexity Pruning to Decision Trees #12887

Merged
merged 94 commits into from Aug 20, 2019

## Conversation


### thomasjpfan commented Dec 29, 2018

Fixes #6557

#### What does this implement/fix? Explain your changes.

This PR implements Minimal Cost-Complexity Pruning based on L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification and Regression Trees", Wadsworth, Belmont, CA, 1984.

Most of this implementation is the same as in the literature. There are two differences:

1. In Breiman, r(t) is an estimate of the probability of misclassification. This PR uses the impurity as r(t).
2. The weighted number of samples is used to compute the probability of a point landing on a node.

A cost-complexity parameter, alpha, was added to __init__ to control cost-complexity pruning. Post-pruning is done at the end of fit.

The code performing Minimal Cost-Complexity Pruning is mostly done in Python. The Python part produces the node ids that will become the leaves of the new subtree. These leaves are passed to a Cython function called build_pruned_tree that builds a tree. This was written in Cython since the tree building API is in Cython.

In Cython, the Stack class is used to traverse the tree. Not all fields of StackRecord are used. This is a trade-off between the code complexity of adding yet another Stack class and being a little memory inefficient.

Currently, prune_tree is public, which allows for the following use case:

clf = DecisionTreeClassifier(alpha=0.0)
clf.fit(X, y)
clf.set_params(alpha=0.1)
clf.prune_tree()

If we prefer, we can make prune_tree private and not encourage this use case.
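For reference, the parameter names changed during review: the merged version of this PR exposes pruning through the ccp_alpha constructor parameter and a cost_complexity_pruning_path method rather than the alpha / prune_tree spelling shown above. A minimal sketch of that final usage (assuming scikit-learn >= 0.22):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Effective alphas of the fully grown tree, from the weakest link up to the root
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# One pruned tree per candidate alpha; larger alphas give smaller trees
for ccp_alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    print(ccp_alpha, clf.tree_.node_count, clf.score(X_test, y_test))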

added 15 commits Dec 28, 2018
 ENH: Adds cost complexity pruning 
 a5f295a 
 Merge remote-tracking branch 'upstream/master' into ccp_prune_tree 
 9569e9f 
 DOC: Update 
 1a554f6 
 DOC: Adds comments to algorithm 
 84dbc05 
 RFC: Small 
 5e10962 
 RFC: Moves some logic to cython 
 745cd18 
 DOC: More comments 
 c1cd149 
 Merge remote-tracking branch 'upstream/master' into ccp_prune_tree 
 90c294e 
 DOC: Removes unused parameter 
 5c36185 
 DOC: Rewords 
 4b277b9 
 Merge remote-tracking branch 'upstream/master' into ccp_prune_tree 
 b83b135 
 ENH: Adds support for extra trees 
 ffece26 
 DOC: Updates whats_new 
 b2e2a52 
 RFC: Makes prune_tree public 
 e95829f 
 RFC: Less diffs 
 c313151 
reviewed

### jnothman left a comment

Some questions for the novice:

• does calling prune_tree with the same alpha repeatedly return the same tree?
• does calling prune_tree with increasing alpha return a strict sub-tree?

sklearn/tree/tree.py Outdated Show resolved Hide resolved
sklearn/tree/tree.py Outdated Show resolved Hide resolved
@@ -510,6 +515,110 @@ def decision_path(self, X, check_input=True):
        X = self._validate_X_predict(X, check_input)
        return self.tree_.decision_path(X)

    def prune_tree(self):

### jnothman Jan 1, 2019

I think this needs to be called after fit automatically to facilitate cross validation etc.

I wonder if this should instead be a public function?

### thomasjpfan Jan 1, 2019

Currently, prune_tree is a public function that is called at the end of fit. It should work with our cross validation classes/functions.

### thomasjpfan commented Jan 1, 2019

does calling prune_tree with the same alpha repeatedly return the same tree?

As long as the original tree is the same, using the same alpha will return the same tree. I will add a test for this behavior.

does calling prune_tree with increasing alpha return a strict sub-tree?

When alpha gets high enough, the entire tree can be pruned, leaving just the root node.
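A rough sketch of the determinism test promised above, written against the merged ccp_alpha spelling rather than the alpha / prune_tree API discussed earlier in this thread:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Growing and pruning twice with identical settings should give identical trees
clf1 = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
clf2 = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

assert clf1.tree_.node_count == clf2.tree_.node_count
np.testing.assert_array_equal(clf1.tree_.feature, clf2.tree_.feature)
np.testing.assert_array_equal(clf1.tree_.threshold, clf2.tree_.threshold)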

added 2 commits Jan 1, 2019
 RFC: Moves prune_tree closer to the end of fit 
 fd5be88 
 Merge remote-tracking branch 'upstream/master' into ccp_prune_tree 
 2e348db 
reviewed
###############################################################################
# Plot training and test scores vs alpha
# --------------------------------------
# Calcuate and plot the the training scores and test accuracy scores

### NicolasHug Jan 1, 2019

Just a few typos (I'll document myself on tree pruning and will try to provide a more in-depth review later)

• Calcuate
• the the
• above "a smaller trees"

also I think you should avoid the :math: notation in comments

### thomasjpfan Jan 1, 2019

The :math: notation is currently used in other examples such as: https://github.com/scikit-learn/scikit-learn/blob/master/examples/svm/plot_svm_scale_c.py. Are we discouraging the usage of :math: in our examples?

### NicolasHug Jan 1, 2019

It's OK in the docstrings since it will be rendered like regular rst by sphinx, but in the comments it is not necessary.

### NicolasHug Jan 1, 2019

Ooh ok I didn't know it worked like that, sorry

added 2 commits Jan 1, 2019
 BUG: Fix 
 75709a0 
 BUG: Fix 
 efe9793 
reviewed

### NicolasHug left a comment

A first pass of cosmetic comments.

@thomasjpfan it seems to me that in general, and unless there's a compelling reason not to, sklearn code uses (potentially long) descriptive variable names.

For example par_idx could be renamed to parent_idx.
Same for cur_alpha, cur_idx, etc.

# bubble up values to ancestor nodes
for idx in leaf_idicies:
    cur_R = r_node[idx]

### NicolasHug Jan 1, 2019

Avoid upper-case in variable names (same for R_diff)

leaves_in_subtree = np.zeros(shape=n_nodes, dtype=np.uint8)
stack = [(0, -1)]
while len(stack) > 0:

### NicolasHug Jan 1, 2019

while stack is more pythonic (same below)

stack = [(0, -1)]
while len(stack) > 0:
    node_id, parent = stack.pop()

### NicolasHug Jan 1, 2019

node_idx to stay consistent with the rest of the function.
I would also suggest parent_idx instead of parent,

and the parents array could just be named parent.

# computes number of leaves in all branches and the overall impurity of
# the branch. The overall impurity is the sum of r_node in its leaves.
n_leaves = np.zeros(shape=n_nodes, dtype=np.int32)
leaf_idicies, = np.where(leaves_in_subtree)

### NicolasHug Jan 1, 2019

leaf_indices

r_branch[leaf_idicies] = r_node[leaf_idicies]
# bubble up values to ancestor nodes
for idx in leaf_idicies:

### NicolasHug Jan 1, 2019

for leaf_idx in...?

# descendants of branch are not in subtree
stack = [cur_idx]
while len(stack) > 0:

### NicolasHug Jan 1, 2019

while stack

inner_nodes[idx] = False
leaves_in_subtree[idx] = 0
in_subtree[idx] = False
n_left, n_right = child_l[idx], child_r[idx]

### NicolasHug Jan 1, 2019

Usually n_something denotes a count. Here those are just indices, right?

leaves_in_subtree[cur_idx] = 1
# updates number of leaves
cur_leaves, n_leaves[cur_idx] = n_leaves[cur_idx], 0

### NicolasHug Jan 1, 2019 • edited

I would propose

n_pruned_leaves = n_leaves[cur_idx] - 1
n_leaves[cur_idx] = 0


and accordingly update n_leaves[cur_idx] below

# bubble up values to ancestors
cur_idx = parents[cur_idx]
while cur_idx != _tree.TREE_LEAF:

### NicolasHug Jan 1, 2019 • edited

It's a bit weird to bubble up to a leaf.

Whatever you're comparing to here should explicitly be the same value as what you used for defining the root's parent above (stack = [(0, -1)])

I would simply use while cur_idx != -1:

### adrinjalali commented Jan 1, 2019

Some random thoughts:

• In the context of ensembles and random forests, the parameter needs to be exposed there as well.
• It'd be nice if we knew the overhead of the pruning. Especially once the user uses it in the context of random forests, the overhead is multiplied by the number of trees, and that times the number of fits in a grid search might be significant. Related to that, two points come to mind: having some numbers related to the overhead would be nice, and a potential warm_start for the pruning, maybe (because the rest of fit doesn't have to be run again for different alpha values).
• Contemplating moving the code to Cython might be an idea.
• I'm not sure if it's necessary to create a copy of the tree for the pruned one. Probably having a masked version of the tree would be optimal for trying out multiple alpha values and a potential warm_start. That also depends on how much overhead that copying has.
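A crude way to get the kind of overhead numbers asked for above, assuming the parameter is exposed on the forest as ccp_alpha (as it eventually was). Every tree in the ensemble is post-pruned, so the fit-time difference approximates the pruning overhead times n_estimators:

from time import perf_counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for ccp_alpha in (0.0, 0.001):
    forest = RandomForestClassifier(n_estimators=100, ccp_alpha=ccp_alpha, random_state=0)
    tic = perf_counter()
    forest.fit(X, y)
    # with ccp_alpha=0.0 no pruning happens, so the gap between the two timings
    # is (roughly) the cost of pruning 100 trees
    print(ccp_alpha, perf_counter() - tic)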

### NicolasHug commented Jan 1, 2019

@jnothman, to add to @thomasjpfan's answers:

does calling prune_tree with the same alpha repeatedly return the same tree?

The procedure is deterministic, so calling prune_tree with the same alpha and the same original tree will give you the same pruned tree. Also, as far as I understand, tree.prune_tree(alpha) == tree.prune_tree(alpha).prune_tree(alpha).

does calling prune_tree with increasing alpha return a strict sub-tree?

A subtree yes, but not necessarily a strict one: with alpha1 > alpha2, tree.prune_tree(alpha1) is a subtree of tree.prune_tree(alpha2), but they may also be equal. This is because the alpha parameter is only used as a threshold here.

reviewed
in_subtree = np.ones(shape=n_nodes, dtype=np.bool)
cur_alpha = 0
while cur_alpha < self.alpha:

### NicolasHug Jan 1, 2019

2 thoughts:

• on the resources that I read (here and here), the very first pruning step is to remove all the pure leaves (equivalent to using alpha=0 apparently). This is not done here since cur_alpha is immediately overwritten. I wonder if this is done by default in the tree growing algorithm.
• As you check for cur_alpha < self.alpha and cur_alpha is computed before the tree is pruned in the loop, this means that the alpha of the returned pruned tree will be greater than self.alpha. It would seem more natural to me to return a tree whose alpha is less than self.alpha. In any case we would need to explain how alpha is used in the docs, something like "subtrees whose scores are less than alpha are discarded. The score is computed as ..."

### thomasjpfan Jan 7, 2019 • edited

• When alpha = 0, the number of leaves does not contribute to the cost-complexity measure, which I interpreted as "do not prune". Removing the leaves when alpha=0 will increase the cost-complexity measure.

• Returning a tree whose alpha is less than self.alpha makes sense and should be documented.

### NicolasHug Jan 7, 2019

Removing the leaves when alpha=0 will increase the cost-complexity measure.

It cannot increase the cost-complexity: the first step is to prune (some of) the pure leaves. That is, if a node N has 2 child leaves where all the samples in both leaves belong to the same class, then the first step will remove those 2 leaves and make N a leaf (which will still be pure). The process is repeated with N and its sibling if needed.

### thomasjpfan Jan 7, 2019

That is if a node N has 2 child leaves where all the samples in both leaves belong to the same class, then the first step will remove those 2 leaves and make N a leaf

This makes sense. I will review the tree building code to see if this can happen.

To prevent future confusion, I want to get on the same page with our definition of a pure leaf. From my understanding, a pure leaf is a leaf whose samples belong to the same class, independent of all other leaves. From reading your response, you consider two leaves to be pure if they are siblings and their samples belong to the same class. Is this correct?

### NicolasHug Jan 7, 2019

From my understanding, a pure leaf is a leaf whose samples belong to the same class, independent of all other leaves

I meant this as well

### NicolasHug Jan 7, 2019

I just looked at the tree code; I think we can assume that this "first step" is not needed here after all, since a node is made a leaf according to the min_impurity_decrease (or deprecated min_impurity_split) parameter.

That is, if a node is pure according to min_impurity_decrease, it will be made a leaf, and thus the example case I mentioned above (a node with 2 pure leaves) cannot exist.
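To make the threshold semantics discussed in this thread concrete, here is a small worked example with made-up numbers (R(t) is the node impurity weighted by the fraction of samples reaching the node, as described in the PR summary):

# Hypothetical node t: R(t) = 0.10; its branch T_t has 4 leaves with total R(T_t) = 0.04
r_node = 0.10
r_branch = 0.04
n_leaves = 4

# Effective alpha of t: the alpha at which collapsing the branch into a leaf
# stops increasing the regularized cost R_alpha(T) = R(T) + alpha * (number of leaves)
alpha_eff = (r_node - r_branch) / (n_leaves - 1)
print(alpha_eff)  # 0.02

# Pruning repeatedly collapses the node with the smallest effective alpha
# (the "weakest link") until the smallest remaining effective alpha exceeds
# the requested threshold: with alpha = 0.025 this branch is pruned, with
# alpha = 0.015 it is kept.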

### jnothman commented Jan 2, 2019

 Where I was going with my questions was the idea of warm_start as well...

### thomasjpfan commented Jan 7, 2019

Exposing it to the ensemble trees makes sense. I will do some experimenting to benchmark the overhead of pruning, in its current form and a Cython version of it. I'll post the results here. The masked tree together with the warm_start parameter are great ideas. A masked tree would allow the level of pruning to be adjusted without growing the tree again, which looks really nice. The current copying approach allows the original tree to be deleted, and the pruned tree will take up less space.

### jnothman commented Jan 8, 2019

 I'm not sure if we want masking, but further pruning an existing tree might be reasonable.

 RFC: Addresses code review 
 568eb04 
added 6 commits Aug 16, 2019
 CLN Address NicolasHug's comments 
 7994897 
 CLN Refactors tests to use pruning_path 
 2a42e0c 
 TST Adds single node tree test 
 e8e3967 
 STY flake8 
 17b4112 
 TST Adds test on impurities from path 
 1a8f07e 
 DOC Adds words 
 17d3888 
approved these changes

### NicolasHug left a comment

Thanks Thomas, last minor comments but LGTM!

 :math:t, and its branch, :math:T_t, can be equal depending on :math:\alpha. We define the effective :math:\alpha of a node to be the value where they are equal, :math:R_\alpha(T_t)=R_\alpha(t) or :math:\alpha_{eff}(t)=(R(t)-R(T_t))/(|\tilde{T}|-1). A non-terminal node

### NicolasHug Aug 16, 2019

Suggested change
- :math:\alpha_{eff}(t)=(R(t)-R(T_t))/(|\tilde{T}|-1). A non-terminal node
+ :math:\alpha_{eff}(t)=(R(t)-R(T_t))/(|T|-1). A non-terminal node

removed tilde

Minimal Cost-Complexity Pruning
===============================

Minimal cost-complexity pruning is an algorithm used to prune a tree after it

### NicolasHug Aug 16, 2019

Please add ref here to L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees (Chapter 3)

ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, label="train", drawstyle="steps-post")

### NicolasHug Aug 16, 2019

Suggested change
- ax.plot(ccp_alphas, train_scores, label="train", drawstyle="steps-post")
+ ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post")

### NicolasHug Aug 16, 2019

same below

        Tree orig_tree,
        DOUBLE_t ccp_alpha):
    """Build a pruned tree from the original tree by transforming the nodes
    in leaves_in_subtree into leaves.

### NicolasHug Aug 16, 2019

Remove this one??

cdef:
    UINT32_t total_items = path_finder.count
    np.ndarray ccp_alphas = np.empty(shape=total_items,

### NicolasHug Aug 16, 2019

ping but not super important I think

added 5 commits Aug 16, 2019
 DOC Adds words 
 9b01fc8 
 Merge remote-tracking branch 'upstream/master' into ccp_prune_tree 
 073fd00 
 DOC Better words 
 1774b8c 
 DOC Adds docstring to ccp_pruning_path 
 73cdf1e 
 DOC Uses new standrad 
 a688f60 
reviewed

### jnothman left a comment

 where :math:|\tilde{T}| is the number of terminal nodes in :math:T and :math:R(T) is traditionally defined as the total misclassification rate of the terminal nodes. Alternatively, scikit-learn uses the total sample weighted impurity of the terminal nodes for :math:R(T). As shown in the previous

### jnothman Aug 18, 2019

best to say "above" rather than "in the previous section" or link to it so it can withstand change.

 :math:t, and its branch, :math:T_t, can be equal depending on :math:\alpha. We define the effective :math:\alpha of a node to be the value where they are equal, :math:R_\alpha(T_t)=R_\alpha(t) or :math:\alpha_{eff}=(R(t)-R(T_t))/(|\tilde{T}|-1). A non-terminal node with

### jnothman Aug 18, 2019

use \frac since this is not easily readable anyway.

===============================

Minimal cost-complexity pruning is an algorithm used to prune a tree, described in Chapter 3 of [BRE]_. This algorithm is parameterized by :math:\alpha\ge0

### jnothman Aug 18, 2019

It might be worth adding a small note to say that this is one method used to avoid over-fitting in trees.

:mod:sklearn.tree
...................

- |Feature| Adds minimal cost complexity pruning to

### jnothman Aug 18, 2019

might be worth mentioning what the public api is... i.e. is it just ccp_alpha?

 CLN Address joels comments 
 82f3aa1 
merged commit 67c94c7 into scikit-learn:master Aug 20, 2019
15 of 17 checks passed

### NicolasHug commented Aug 20, 2019

 Failure is unrelated and Joel's comments were addressed so I guess it's good to merge 🎉 Thanks @thomasjpfan !

### jnothman commented Aug 20, 2019

 Exciting! Great work!!

added a commit to sebp/scikit-survival that referenced this issue Apr 9, 2020
 WIP: Add support for scikit-learn 0.22 
 c374789 
- Deprecate presort (scikit-learn/scikit-learn#14907)
- Add Minimal Cost-Complexity Pruning to Decision Trees (scikit-learn/scikit-learn#12887)
- Add bootstrap sample size limit to forest ensembles (scikit-learn/scikit-learn#14682)
added a commit to sebp/scikit-survival that referenced this issue Apr 9, 2020
 WIP: Add support for scikit-learn 0.22 
 4fcfd88 
- Deprecate presort (scikit-learn/scikit-learn#14907)
- Add Minimal Cost-Complexity Pruning to Decision Trees (scikit-learn/scikit-learn#12887)
- Add bootstrap sample size limit to forest ensembles (scikit-learn/scikit-learn#14682)
- Fix deprecated imports
added a commit to sebp/scikit-survival that referenced this issue Apr 10, 2020
 WIP: Add support for scikit-learn 0.22 
 8d09e75 
- Deprecate presort (scikit-learn/scikit-learn#14907)
- Add Minimal Cost-Complexity Pruning to Decision Trees (scikit-learn/scikit-learn#12887)
- Add bootstrap sample size limit to forest ensembles (scikit-learn/scikit-learn#14682)
- Fix deprecated imports (scikit-learn/scikit-learn#9250)
added a commit to sebp/scikit-survival that referenced this issue Apr 10, 2020
 Add support for scikit-learn 0.22 
 2171032 
- Deprecate presort (scikit-learn/scikit-learn#14907)
- Add Minimal Cost-Complexity Pruning to Decision Trees (scikit-learn/scikit-learn#12887)
- Add bootstrap sample size limit to forest ensembles (scikit-learn/scikit-learn#14682)
- Fix deprecated imports (scikit-learn/scikit-learn#9250)

Do not add ccp_alpha to SurvivalTree, because
it relies on node_impurity, which is not set for SurvivalTree.
added a commit to sebp/scikit-survival that referenced this issue Apr 10, 2020
 Add support for scikit-learn 0.22 
 cc71565 
- Deprecate presort (scikit-learn/scikit-learn#14907)
- Add Minimal Cost-Complexity Pruning to Decision Trees (scikit-learn/scikit-learn#12887)
- Add bootstrap sample size limit to forest ensembles (scikit-learn/scikit-learn#14682)
- Fix deprecated imports (scikit-learn/scikit-learn#9250)

Do not add ccp_alpha to SurvivalTree, because
it relies on node_impurity, which is not set for SurvivalTree.
added a commit to sebp/scikit-survival that referenced this issue Apr 10, 2020
 Add support for scikit-learn 0.22 
 f11d047 
- Deprecate presort (scikit-learn/scikit-learn#14907)
- Add Minimal Cost-Complexity Pruning to Decision Trees (scikit-learn/scikit-learn#12887)
- Add bootstrap sample size limit to forest ensembles (scikit-learn/scikit-learn#14682)
- Fix deprecated imports (scikit-learn/scikit-learn#9250)

Do not add ccp_alpha to SurvivalTree, because
it relies on node_impurity, which is not set for SurvivalTree.

### TrigonaMinima commented May 14, 2020 • edited

I have a question: in the literature [1], the authors first grow the maximal tree and then prune it according to different alphas. Following that, they either use a test set or cross-validation to find the best alpha or the corresponding "best pruned tree". Here, we have selected the tree before alpha crosses ccp_alpha. Am I right, or did I miss something? Is the activity of selecting the "best" ccp_alpha left to the user?

1: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.

### thomasjpfan commented May 14, 2020

Am I right or did I miss something? Is the activity of selecting the "best" ccp_alpha left to the user?

Yes, this needs to be done with our cross-validation tools. There is a more interesting way to do this by setting aside some of the training data for validation, so that the tree can automatically find an alpha. This has not been implemented here.
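A sketch of doing that selection with the existing cross-validation tools, using the candidate alphas from cost_complexity_pruning_path as the grid (parameter names follow the merged API):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas: the effective alphas of the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": path.ccp_alphas},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)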

### LEEPEIQIN commented Jun 6, 2020

Thank you for your great work; it benefits me a lot. However, I am curious why you use Gini impurity instead of the misclassification rate as the cost function. Could you give a reference or any keywords to let me search on Google? Thank you very much!

### LEEPEIQIN commented Jun 6, 2020

I found some further discussion in "performance learning" (Johannes Fürnkranz, Eyke Hüllermeier), pp. 87-88. Thank you very much.

### thomasjpfan commented Jun 6, 2020

Using the criterion allows pruning to be extended to regression trees. (The criterion for classification defaults to Gini impurity.)
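For instance, the same parameter prunes regression trees, where the criterion is mean squared error by default; a quick sketch (the alpha value is arbitrary):

from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# ccp_alpha works unchanged for regression; R(t) is built from the MSE criterion
full = DecisionTreeRegressor(random_state=0).fit(X, y)
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=5.0).fit(X, y)
print(full.tree_.node_count, pruned.tree_.node_count)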

### LEEPEIQIN commented Jun 7, 2020

 Thank you very much.

mentioned this pull request Apr 24, 2022