
DecisionTreeClassifier max_leaf_nodes: tree.tree_.value is set to 0 for non-leaf nodes. Normal behaviour? #4644

Closed
robdempsey opened this issue Apr 29, 2015 · 8 comments


@robdempsey

Hi

>>> sklearn.__version__
'0.15.2'

If I leave out max_leaf_nodes (or set it to None), then when I access tree.tree_.value all non-leaf node values are set to zero and only the leaf values are correct. If I set max_leaf_nodes to something like 10, the values are all present. Is this normal behaviour?

dtree = tree.DecisionTreeClassifier(criterion="entropy", min_samples_leaf=1000,
                                    max_depth=4, max_leaf_nodes=None)
dtree.fit(X, y)  # X, y: my training data (not shown here)

values = dtree.tree_.value

Iterating through a branch:

[[ 0.  0.]]
[[ 0.  0.]]
[[ 0.  0.]]
[[ 1767.  6014.]]
18

dtree = tree.DecisionTreeClassifier(criterion="entropy", min_samples_leaf=1000,
                                    max_depth=4, max_leaf_nodes=10)
dtree.fit(X, y)

values = dtree.tree_.value

Iterating through a branch:

[[  45455.  166666.]]
[[  41958.  142857.]]
[[  8392.  28572.]]
18
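
For reference, a self-contained script that shows the same behaviour (the data below is made up, since my actual dataset isn't included):

import numpy as np
from sklearn import tree

# Made-up two-class data, only so the script runs end to end;
# noisy labels so the tree actually makes several splits.
rng = np.random.RandomState(0)
X = rng.rand(20000, 5)
y = (X[:, 0] + 0.3 * rng.randn(20000) > 0.5).astype(int)

for mln in (None, 10):
    dtree = tree.DecisionTreeClassifier(criterion="entropy", min_samples_leaf=1000,
                                        max_depth=4, max_leaf_nodes=mln)
    dtree.fit(X, y)
    print("max_leaf_nodes =", mln)
    print(dtree.tree_.value[:4])  # first few nodes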

Rob

@amueller
Member

I think that is because different splitting strategies are used depending on whether or not you set max_leaf_nodes. Some of them don't store the internal counts, as they are not needed for prediction.
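
To illustrate why prediction doesn't need them: predict only consults the value array of the leaf a sample lands in. A rough Python sketch of the traversal over the public tree_ arrays (the real implementation is in Cython; this is just the idea):

import numpy as np

def predict_one(tree_, x):
    # Walk from the root (node 0) down to a leaf;
    # children_left == -1 marks a leaf.
    node = 0
    while tree_.children_left[node] != -1:
        if x[tree_.feature[node]] <= tree_.threshold[node]:
            node = tree_.children_left[node]
        else:
            node = tree_.children_right[node]
    # Only the leaf's class counts are read, so zeroed
    # internal-node values never affect the result.
    return np.argmax(tree_.value[node, 0])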

@glouppe
Contributor

glouppe commented Apr 30, 2015

The reason was only to speed things up.

@arjoly I wonder, however, whether this was needed at all. It seems that these counts should be there already (since we need them to compute the impurity decrease), hence storing them shouldn't hurt much. What do you think?

@arjoly
Member

arjoly commented Apr 30, 2015

If benchmarks show that the tree-growing procedure is not significantly slower, I am +1.
Note that there are more memory-efficient alternatives for storing the node values of multi-output decision trees.

> since we need them to compute the impurity decrease

At the moment, we don't need those. Could you clarify?

@robdempsey
Author

They are a useful output if you want to present the results to a non-technical audience and don’t want to explain entropy :-)

@amueller
Member

Well, you could reconstruct them from the output you have; there is a 1:1 mapping between leaves and paths ;)
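
For instance, something along these lines should recover the internal counts from the leaves (a sketch against the public tree_ arrays, assuming the leaf values are populated, as they are here):

def reconstruct_values(tree_):
    # Fill internal nodes with the sum of their children's class
    # counts, recursing down from the root (node 0).
    value = tree_.value.copy()
    left, right = tree_.children_left, tree_.children_right

    def fill(node):
        if left[node] == -1:  # leaf: counts are already stored
            return value[node]
        value[node] = fill(left[node]) + fill(right[node])
        return value[node]

    fill(0)
    return value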

@trevorstephens
Contributor

#3735 ... just sayin... ;-)

@andosa

andosa commented Apr 30, 2015

PR created #4655

@glouppe
Contributor

glouppe commented May 4, 2015

This is now fixed :)

@glouppe glouppe closed this as completed May 4, 2015