Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modified BaseDecisionTree so that min_weight_fraction_leaf works when… #6947

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
15 changes: 10 additions & 5 deletions sklearn/ensemble/forest.py
Original file line number Diff line number Diff line change
Expand Up @@ -797,7 +797,8 @@ class RandomForestClassifier(ForestClassifier):

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
leaf node where weights are determined by `sample_weight` in the fit
method.

max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
Expand Down Expand Up @@ -991,7 +992,8 @@ class RandomForestRegressor(ForestRegressor):

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
leaf node where weights are determined by `sample_weight` in the fit
method.

max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
Expand Down Expand Up @@ -1150,7 +1152,8 @@ class ExtraTreesClassifier(ForestClassifier):

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
leaf node where weights are determined by `sample_weight` in the fit
method.

max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
Expand Down Expand Up @@ -1343,7 +1346,8 @@ class ExtraTreesRegressor(ForestRegressor):

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
leaf node where weights are determined by `sample_weight` in the fit
method.

max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
Expand Down Expand Up @@ -1488,7 +1492,8 @@ class RandomTreesEmbedding(BaseForest):

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
leaf node where weights are determined by `sample_weight` in the fit
method.

max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
Expand Down
6 changes: 4 additions & 2 deletions sklearn/ensemble/gradient_boosting.py
Original file line number Diff line number Diff line change
Expand Up @@ -1315,7 +1315,8 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
leaf node where weights are determined by `sample_weight` in the fit
method.

subsample : float, optional (default=1.0)
The fraction of samples to be used for fitting the individual base
Expand Down Expand Up @@ -1661,7 +1662,8 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
leaf node where weights are determined by `sample_weight` in the fit
method.

subsample : float, optional (default=1.0)
The fraction of samples to be used for fitting the individual base
Expand Down
10 changes: 7 additions & 3 deletions sklearn/tree/tree.py
Original file line number Diff line number Diff line change
Expand Up @@ -297,9 +297,13 @@ def fit(self, X, y, sample_weight=None, check_input=True,
sample_weight = expanded_class_weight

# Set min_weight_leaf from min_weight_fraction_leaf
if self.min_weight_fraction_leaf != 0. and sample_weight is not None:
if self.min_weight_fraction_leaf != 0.:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't actually need this if

if sample_weight is None:
sample_weight = np.repeat(1., n_samples)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, we don't need to explicitly do this, since all we want is the sum of weights, i.e. "total weight". So you could just do min_weight_leaf = self.min_weight_fraction_leaf * n_samples.

Secondly, I see why you might want this, but it now replicates the function of min_samples_leaf. So if this is the way we go, i.e. rather than a warning, you can actually just use

min_samples_leaf = max(min_samples_leaf, int(ceil(self.min_weight_fraction_leaf * n_samples)))

In terms of testing, you should look at existing tests for min_samples_leaf and check that min_weight_fraction_leaf offers the same behaviour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When sample_weight is None, then samples are equally weighted in the Cython code. Does this addition really change anything? I believe behaviours should be identical. Do you have counter-examples where the trees that are built are actually different?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If sample_weight is None then min_weight_fraction_leaf has no effect. I think that's quite clear from the else case here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh my bad, I was not aware of that else clause. Indeed. Ouch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we fix this in the cython code?


min_weight_leaf = (self.min_weight_fraction_leaf *
np.sum(sample_weight))

else:
min_weight_leaf = 0.

Expand Down Expand Up @@ -577,7 +581,7 @@ class DecisionTreeClassifier(BaseDecisionTree, ClassifierMixin):

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
leaf node where weights are determined by sample_weight in the fit method.

max_leaf_nodes : int or None, optional (default=None)
Grow a tree with ``max_leaf_nodes`` in best-first fashion.
Expand Down Expand Up @@ -831,7 +835,7 @@ class DecisionTreeRegressor(BaseDecisionTree, RegressorMixin):

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
leaf node where weights are determined by sample_weight in the fit method.

max_leaf_nodes : int or None, optional (default=None)
Grow a tree with ``max_leaf_nodes`` in best-first fashion.
Expand Down