Description
When using the DecisionTreeClassifier() sample_weight parameter and weighting examples by class label, such as 2:1 for class A versus class B, the value counts in the leaves of the tree produced by tree.export_graphviz() show a doubled count for the number of examples in the weighted class.
This is misleading because the sum of the per-class values should equal the total number of examples in the parent node. It seems there is a bug where the sample weighting factor is not removed when exporting the decision tree.
Here is the classifier with weighting:
from sklearn import tree

clf = tree.DecisionTreeClassifier(
    criterion='gini', splitter='best', max_leaf_nodes=10,
    min_samples_split=10, min_samples_leaf=2, max_features=None,
    random_state=None).fit(X, y, sample_weight=df.LABEL.map({'A': 2, 'B': 1}))
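For context, here is a minimal self-contained sketch that reproduces the behavior; the synthetic X, y, and sample_weight below are stand-ins for the original data, which is not included in this report:

import numpy as np
from sklearn import tree

# Synthetic two-class stand-in for the original data (an assumption, not the
# reporter's dataset).
rng = np.random.RandomState(0)
X = rng.rand(1000, 5)
y = np.where(X[:, 0] > 0.5, 'A', 'B')
sample_weight = np.where(y == 'A', 2.0, 1.0)  # weight class A 2:1 over class B

clf = tree.DecisionTreeClassifier(max_leaf_nodes=4, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)

# In the scikit-learn version this report was written against, the value
# entries for class A in the exported .dot text come out as roughly twice
# the raw sample counts; newer releases may display these values differently.
print(tree.export_graphviz(clf, out_file=None))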
Here is the produced tree.dot file; the doubled counts for the weighted class appear in the value arrays of the leaf nodes:
digraph Tree {
0 [label="X[106] <= 0.0203\ngini = 0.0731494237327\nsamples = 119377", shape="box"] ;
1 [label="X[34] <= 1266.6650\ngini = 0.0661567536396\nsamples = 118307", shape="box"] ;
0 -> 1 ;
3 [label="gini = 0.0578\nsamples = 111941\nvalue = [ 210926. 6478.]", shape="box"] ;
1 -> 3 ;
4 [label="gini = 0.2103\nsamples = 6366\nvalue = [ 10016. 1358.]", shape="box"] ;
1 -> 4 ;
2 [label="X[83] <= 6728.4551\ngini = 0.386307949063\nsamples = 1070", shape="box"] ;
0 -> 2 ;
5 [label="gini = 0.4994\nsamples = 364\nvalue = [ 254. 237.]", shape="box"] ;
2 -> 5 ;
6 [label="gini = 0.1669\nsamples = 706\nvalue = [ 68. 672.]", shape="box"] ;
2 -> 6 ;
}
Here is what the rendered tree looks like, with the doubled leaf example counts for the weighted class:
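As a cross-check, one way to see the true (unweighted) per-leaf class counts, independent of what export_graphviz() prints, is to map each training sample to its leaf; a sketch, reusing the clf, X, and y names from above:

import numpy as np

# For every training sample, find the leaf it ends up in, then count the
# raw (unweighted) examples of each class per leaf.
leaf_ids = clf.apply(X)
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    classes, counts = np.unique(np.asarray(y)[mask], return_counts=True)
    print(leaf, dict(zip(classes, counts)))

These per-class counts sum to the samples figure shown for each leaf, whereas the exported value array reflects the weighted totals.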