After training a model, there is typically a fair amount of threshold tuning needed to select an appropriate decision rule for the application. In general, the goal is to identify rules (based either on a threshold or on other properties of the algorithm) that prevent the model from making bad decisions while still remaining applicable to the majority of the dataset.
For decision trees, one such possible rule is to exclude predictions that fall into leaves lacking a sufficient number of datapoints, where sufficient (> N) is specified by the user. For example, a leaf may have been created during training based on only 2 datapoints. During testing/scoring, we could skip samples that fall into this leaf to prevent low-confidence decision making.
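For concreteness, here is a minimal sketch of how this rule can be emulated today with the existing public API, assuming a fitted `DecisionTreeClassifier`. The threshold `N`, the toy dataset, and the use of NaN as an "abstain" marker are all illustrative choices, not part of any existing or proposed API:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

N = 5  # user-specified minimum number of training samples per leaf (illustrative)

# apply() maps each sample to the index of the leaf it lands in;
# tree_.n_node_samples records how many training samples reached each node.
leaf_ids = clf.apply(X)
leaf_sizes = clf.tree_.n_node_samples[leaf_ids]

preds = clf.predict(X).astype(float)
preds[leaf_sizes <= N] = np.nan  # abstain on low-support leaves
```

Having this as a built-in option would mainly save the bookkeeping above and make the rule available uniformly across tree-based estimators.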
Alternatives/nuance: rather than the raw number of samples, we may also want to consider the impurity within the leaf, or perhaps a combined metric that trades impurity off against the number of samples.
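Extending the sketch above to this variant, `tree_.impurity` exposes each node's impurity, so a combined score could gate predictions instead of the raw leaf size. The specific weighting and the 0.3 cutoff below are arbitrary choices for demonstration only:

```python
# Continues from the previous snippet (reuses clf, leaf_ids, leaf_sizes, preds).
leaf_impurity = clf.tree_.impurity[leaf_ids]
score = leaf_impurity / np.log1p(leaf_sizes)  # illustrative impurity-vs-support trade-off
preds[score > 0.3] = np.nan  # abstain when impurity is high relative to leaf support
```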
Feedback on whether this is something others have considered (I could not find any similar issues in the tracker) or whether there are clear drawbacks to this approach would be appreciated.