Use float as percentage in min_samples_split and min_samples_leaf of DecisionTree #3359

Closed. Wanted to merge 13 commits into scikit-learn:master from yelite:tree-min-param-enh.

Conversation

7 participants
Contributor

yelite commented Jul 10, 2014

A float can be used as a percentage for these two parameters, which is useful when the sample size varies greatly.

I see that a new parameter, min_weight_fraction_leaf, was added recently and works in a similar way. But it is still convenient to accept a float in min_samples_split for samples without weighting.
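
For illustration, a minimal sketch of the conversion this PR proposes (the helper name resolve_min_samples is hypothetical; the ceil rounding follows the diff discussed below):

import numpy as np

def resolve_min_samples(value, n_samples):
    # A float is read as a fraction of the dataset;
    # an int is an absolute sample count.
    if isinstance(value, float):
        return int(np.ceil(value * n_samples))
    return value

resolve_min_samples(0.01, 1000)  # -> 10
resolve_min_samples(0.01, 50)    # -> 1
resolve_min_samples(10, 1000)    # -> 10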

Coverage decreased (-0.0%) when pulling 8f715dd on yelite:tree-min-param-enh into 3acda36 on scikit-learn:master.

Coverage decreased (-0.0%) when pulling db60daf on yelite:tree-min-param-enh into 3acda36 on scikit-learn:master.

Coverage increased (+0.0%) when pulling db60daf on yelite:tree-min-param-enh into 3acda36 on scikit-learn:master.

Coverage increased (+0.0%) when pulling fbd8f60 on yelite:tree-min-param-enh into 3acda36 on scikit-learn:master.

Contributor

yelite commented Jul 10, 2014

It's quite strange... All tests passed in my forked repo but failed here.

https://travis-ci.org/yelite/scikit-learn/builds/29584547
https://travis-ci.org/scikit-learn/scikit-learn/builds/29584562

Coverage increased (+0.0%) when pulling b3f9d65 on yelite:tree-min-param-enh into 3acda36 on scikit-learn:master.

Owner

arjoly commented Jul 10, 2014

The failing tests seem to be due to joblib and not this PR.

Thanks for this PR!

@arjoly arjoly commented on an outdated diff Jul 10, 2014

sklearn/tree/tests/test_tree.py
@@ -463,6 +463,20 @@ def test_min_samples_leaf():
assert_greater(np.min(leaf_count), 4,
"Failed with {0}".format(name))
+ for max_leaf_nodes in (None, 1000):
+ for name, TreeEstimator in ALL_TREES.items():
+ est = TreeEstimator(min_samples_leaf=0.1,
+ min_samples_split=0.01,
@arjoly

arjoly Jul 10, 2014

Owner

Can you separate the tests for min_samples_leaf and min_samples_split?

Owner

arjoly commented Jul 10, 2014

Could you update the documentation and tests in the forest and gradient tree boosting modules?

Owner

arjoly commented Jul 10, 2014

It would be great to test edge cases for those two parameters. For instance, I would test negative values, a value of 0., and values greater than 1.

Contributor

yelite commented Jul 11, 2014

@arjoly
I did a git rebase. Other commits can be seen in this PR. Is that OK?

Owner

arjoly commented Jul 11, 2014

This would make review difficult. Could you open a new pull request?

Contributor

yelite commented Jul 11, 2014

@arjoly
Learned some git tricks :)

Coverage decreased (-0.03%) when pulling 6ba7b5a on yelite:tree-min-param-enh into 1fe371e on scikit-learn:master.

Owner

arjoly commented Jul 11, 2014

great :-) !

@arjoly arjoly and 1 other commented on an outdated diff Jul 12, 2014

sklearn/tree/tree.py
@@ -244,8 +244,21 @@ def fit(self, X, y, sample_mask=None, X_argsorted=None, check_input=True,
else:
min_weight_leaf = 0.
+ if isinstance(self.min_samples_leaf, float):
+ if self.min_samples_leaf > 1:
+ raise ValueError("The float min_samples_leaf must be less than 1.0")
@arjoly

arjoly Jul 12, 2014

Owner

Is it necessary to raise an error? This will lead to making only a root node, as is already the case.

@yelite

yelite Jul 15, 2014

Contributor

What if a user passes 10.0 and expects it to work as if min_samples_leaf were 10? I think we should do something. What about a warning?
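
A hedged sketch of what such a warning could look like (illustrative only, not what the PR finally adopted):

import warnings

min_samples_leaf = 10.0  # a value the user probably meant as an int
if isinstance(min_samples_leaf, float) and min_samples_leaf > 1.0:
    # A float like 10.0 is probably a mistyped integer count.
    warnings.warn("min_samples_leaf=%r looks like an integer count passed "
                  "as a float; floats are interpreted as a fraction of "
                  "n_samples." % min_samples_leaf)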

@arjoly arjoly commented on an outdated diff Jul 12, 2014

sklearn/tree/tree.py
# Set min_samples_split sensibly
- min_samples_split = max(self.min_samples_split,
+ if isinstance(self.min_samples_split, float):
+ if self.min_samples_split > 1:
+ raise ValueError("The float min_samples_split must be less than 1.0")
@arjoly

arjoly Jul 12, 2014

Owner

Is it necessary to raise an error? This will lead to making only a root node, as is already the case.

@arjoly arjoly commented on an outdated diff Jul 12, 2014

sklearn/tree/tree.py
# Set min_samples_split sensibly
- min_samples_split = max(self.min_samples_split,
+ if isinstance(self.min_samples_split, float):
+ if self.min_samples_split > 1:
+ raise ValueError("The float min_samples_split must be less than 1.0")
+ min_samples_split = int(np.ceil(self.min_samples_split * n_samples))
+ else:
+ min_samples_split = self.min_samples_split
+ min_samples_split = max(min_samples_split,
2 * self.min_samples_leaf)
@arjoly

arjoly Jul 12, 2014

Owner

I think this could stand on one line.
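
Presumably meaning the wrapped call collapsed onto one line:

min_samples_split = max(min_samples_split, 2 * self.min_samples_leaf)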

@arjoly arjoly commented on an outdated diff Jul 12, 2014

sklearn/tree/tree.py
# Set min_samples_split sensibly
- min_samples_split = max(self.min_samples_split,
+ if isinstance(self.min_samples_split, float):
+ if self.min_samples_split > 1:
+ raise ValueError("The float min_samples_split must be less than 1.0")
+ min_samples_split = int(np.ceil(self.min_samples_split * n_samples))
+ else:
+ min_samples_split = self.min_samples_split
+ min_samples_split = max(min_samples_split,
@arjoly

arjoly Jul 12, 2014

Owner

Those lines should go after the inference of the max_features parameter.

@arjoly

arjoly Jul 12, 2014

Owner

Or better just after max_leaf_nodes = (-1 if self.max_leaf_nodes is None else self.max_leaf_nodes)

@arjoly

arjoly Jul 12, 2014

Owner

This would be more consistent with the overall code structure and would reuse the parameter check that is already present.

@arjoly arjoly commented on an outdated diff Jul 12, 2014

sklearn/tree/tree.py
# Set min_samples_split sensibly
- min_samples_split = max(self.min_samples_split,
+ if isinstance(self.min_samples_split, float):
+ if self.min_samples_split > 1:
@arjoly

arjoly Jul 12, 2014

Owner

The maximum would be 0.5: a split creates two leaves, each of which must hold at least min_samples_leaf samples, so a fraction above 0.5 can never be satisfied.

Owner

arjoly commented Jul 12, 2014

In test_gradient_boosting.py, I would add a test that checks that it works.
In test_forest.py, I would add a test that checks that it fails with boundary values.

Owner

arjoly commented Jul 12, 2014

I think a note in the narrative doc explaining why it's useful would be great.

Coverage increased (+0.01%) when pulling 3747522 on yelite:tree-min-param-enh into f7e9527 on scikit-learn:master.

Contributor

yelite commented Jul 15, 2014

@arjoly Thanks for your advice!

Can you explain what you mean by 'it doesn't work with boundary values'? Do you mean the test on parameter checking?

Owner

arjoly commented Jul 16, 2014

In test_forest.py, I would add a test that checks that it fails with boundary values.

By boundary values, I meant values that are illegal, e.g. -1.

Owner

amueller commented Jul 16, 2014

This PR has the same issue that @GaelVaroquaux didn't like in #2861, right?

Owner

arjoly commented Jul 16, 2014

This PR has the same issue that @GaelVaroquaux didn't like in #2861, right?

No, the intervals are (0, 1.] for min_samples_split and (0., 0.5] for min_samples_leaf.
So only min_samples_split=1. might be problematic.

Owner

GaelVaroquaux commented Jul 16, 2014

No, the intervals are (0, 1.] for min_samples_split and (0., 0.5] for min_samples_leaf.

Yes, I agree, it is more explicit.

So only min_samples_split=1. might be problematic.

I would suggest raising an error and recommending .999 instead (or would that be a stupid suggestion?). I don't like the possibility of ambiguity in specifying parameters.

Contributor

yelite commented Jul 21, 2014

@arjoly
The range of a float min_samples_split is (0, 1) in my commit.

This will lead to making only a root node, as is already the case.

Is a decision tree with only a root node useful? I can hardly imagine a circumstance in which people would want such a tree.

Coverage increased (+0.01%) when pulling 04012d0 on yelite:tree-min-param-enh into 7a7fca8 on scikit-learn:master.

@arjoly arjoly and 1 other commented on an outdated diff Jul 22, 2014

sklearn/ensemble/tests/test_forest.py
@@ -547,10 +547,61 @@ def test_max_leaf_nodes_max_depth():
yield check_max_leaf_nodes_max_depth, name, X, y
+def check_min_samples_split(name, X, y):
+ """Test min_samples_split parameter"""
+ ForestEstimator = FOREST_ESTIMATORS[name]
+
+ # test boundary value
+ assert_raises(ValueError,
+ ForestEstimator(min_samples_split=-1).fit, X, y)
+ assert_raises(ValueError,
+ ForestEstimator(min_samples_split=0).fit, X, y)
+ assert_raises(ValueError,
+ ForestEstimator(min_samples_split=1.0).fit, X, y)
@arjoly

arjoly Jul 22, 2014

Owner

Unfortunately, this will create a regression

In [4]: RandomForestClassifier(min_samples_split=1.).fit(np.arange(6).reshape((3, 2)), np.arange(3))
Out[4]: 
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=1.0, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0)
@arjoly

arjoly Jul 22, 2014

Owner

Should we raise an error when min_samples_split < 2?

@yelite

yelite Jul 26, 2014

Contributor

Unfortunately, this will create a regression

Isn't it a classifier?

Should we raise an error when min_samples_split < 2?

I think we should. min_samples_split < 2 is meaningless, because we set min_samples_split = max(min_samples_split, 2 * self.min_samples_leaf) during inference.

@arjoly arjoly commented on an outdated diff Jul 22, 2014

sklearn/ensemble/tests/test_forest.py
+ "Failed with {0}".format(name))
+
+ est = ForestEstimator(min_samples_split=0.1,
+ max_leaf_nodes=max_leaf_nodes,
+ random_state=0)
+ est.fit(X, y)
+ node_idx = est.estimators_[0].tree_.children_left != -1
+ node_samples = est.estimators_[0].tree_.n_node_samples[node_idx]
+
+ assert_greater(np.min(node_samples), 9,
+ "Failed with {0}".format(name))
+
+
+def test_min_samples_split():
+ X, y = datasets.make_hastie_10_2(n_samples=100, random_state=1)
+ X = X.astype(np.float32)
@arjoly

arjoly Jul 22, 2014

Owner

Could you generate the data inside the check function? This gives less verbose output when calling nose.
n_samples could be set to a lower value, such as 20.
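
Concretely, the suggestion amounts to something like this sketch (make_hastie_10_2, the float32 cast, and FOREST_ESTIMATORS are from the test under review):

import numpy as np
from sklearn import datasets

def check_min_samples_split(name):
    # Building the data inside the check keeps nose's generated
    # test descriptions short and lets n_samples stay small.
    X, y = datasets.make_hastie_10_2(n_samples=20, random_state=1)
    X = X.astype(np.float32)
    # ... fit FOREST_ESTIMATORS[name] and assert on node sample
    # counts as above ...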

@arjoly

arjoly Jul 22, 2014

Owner

This could also be generated once as a module-level constant, to avoid recreating the hastie_10_2 dataset, as is done for boston and iris.

@arjoly arjoly commented on an outdated diff Jul 22, 2014

sklearn/tree/tests/test_tree.py
@@ -443,6 +446,39 @@ def test_error():
assert_raises(ValueError, clf.predict, Xt)
+def test_min_samples_split():
+ """Test min_samples_split parameter"""
+ X = np.asfortranarray(iris.data.astype(tree._tree.DTYPE))
+ y = iris.target
+
+ # test both DepthFirstTreeBuilder and BestFirstTreeBuilder
+ # by setting max_leaf_nodes
+ for max_leaf_nodes in (None, 1000):
+ for name, TreeEstimator in ALL_TREES.items():
@arjoly

arjoly Jul 22, 2014

Owner

Here (and in other places) you could use the product function from itertools to reduce the number of nesting levels:

from itertools import product

for max_leaf_nodes, name in product((None, 1000), ALL_TREES):
    TreeEstimator = ALL_TREES[name]

@arjoly arjoly and 1 other commented on an outdated diff Jul 22, 2014

sklearn/tree/tree.py
@@ -209,6 +221,12 @@ def fit(self, X, y, sample_weight=None, check_input=True, sample_mask=None,
raise ValueError("min_samples_split must be greater than zero.")
if self.min_samples_leaf <= 0:
raise ValueError("min_samples_leaf must be greater than zero.")
+ if isinstance(self.min_samples_leaf, float) \
+ and not 0 < self.min_samples_leaf <= 0.5:
@arjoly

arjoly Jul 22, 2014

Owner

We try to avoid the backslash line continuation. You can use parentheses to achieve the same thing without it.

Why not check the inferred value of min_samples_leaf directly?
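
That is, the quoted condition with parentheses instead of the backslash:

if (isinstance(self.min_samples_leaf, float)
        and not 0 < self.min_samples_leaf <= 0.5):
    # same error as before
    raise ValueError("min_samples_leaf must be in (0, 0.5]")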

@glouppe

glouppe Jul 22, 2014

Owner

Why force 0.5 as an upper bound instead of 1.0? Values from 0.5 to 1.0 are valid and would correspond to the case where we have a single node which is a leaf. I would document this behaviour but not prevent it.

@glouppe

glouppe Jul 22, 2014

Owner

In the same way, for integer values, we don't force 1 <= min_samples_leaf < n_samples/2.

@arjoly

arjoly Jul 22, 2014

Owner

Hm, I think this is enforced for min_weight_fraction_leaf. I am fine either way if it doesn't lead to ambiguity.

@glouppe

glouppe Jul 22, 2014

Owner

Either way, the behaviour should be consistent across all criteria. I am +1 for allowing a single root node (maybe with a warning if you feel it is necessary).

Owner

GaelVaroquaux commented Jul 22, 2014

Is a decision tree with only a root node useful? I can hardly imagine a circumstance in which people would want such a tree.

Isn't that the definition of a decision stump, which is widely used?

Owner

glouppe commented Jul 22, 2014

Isn't that the definition of a decision stump, which is widely used?

A stump is one internal node plus two leaves.

A single root node would correspond to a single leaf, with no decision. It might be useful in some cases, since the corresponding return value is the most likely class over the dataset (resp. the mean output value in regression).

Contributor

yelite commented Jul 26, 2014

In my latest commit:

the range of min_samples_leaf is [1, n_samples], and the float value 1.0 is forbidden to avoid ambiguity;

the range of min_samples_split is [2, n_samples], which likely breaks the original API (I found a lot of test code that sets min_samples_split=1). But I think it is fine, because the default value is 2 and min_samples_split=1 makes no sense. It also eliminates the confusion between 1.0 and 1.
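
Taken together, a standalone sketch of the validation described here (names and ranges follow this comment and the diff reviewed below; this is not the exact PR code):

import numpy as np

def infer_and_check(min_samples_leaf, min_samples_split, n_samples):
    if isinstance(min_samples_leaf, float):
        if min_samples_leaf == 1.0:
            # Ambiguous with the int 1; reject the float form.
            raise ValueError("min_samples_leaf cannot be the float 1.0; "
                             "pass the int 1 or a fraction below 1.0")
        min_samples_leaf = int(np.ceil(min_samples_leaf * n_samples))
    if isinstance(min_samples_split, float):
        min_samples_split = int(np.ceil(min_samples_split * n_samples))
    # Check the inferred values, so the same bounds cover int and float input.
    if not 0 < min_samples_leaf <= n_samples:
        raise ValueError("min_samples_leaf must be in (0, n_samples]")
    if not 1 < min_samples_split <= n_samples:
        raise ValueError("min_samples_split must be in [2, n_samples]")
    return min_samples_leaf, min_samples_split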

Coverage increased (+0.0%) when pulling 8ec747c on yelite:tree-min-param-enh into c0afd46 on scikit-learn:master.

Coverage increased (+0.0%) when pulling 9a85b7f on yelite:tree-min-param-enh into c0afd46 on scikit-learn:master.

@jnothman jnothman commented on an outdated diff Jul 27, 2014

doc/modules/tree.rst
@@ -314,7 +314,9 @@ Tips on practical use
* Use ``min_samples_split`` or ``min_samples_leaf`` to control the number of
samples at a leaf node. A very small number will usually mean the tree
will overfit, whereas a large number will prevent the tree from learning
- the data. Try ``min_samples_leaf=5`` as an initial value.
+ the data. Try ``min_samples_leaf=5`` as an initial value. If the sample size
+ varies greatly, a float number can be used as percentage in these two
@jnothman

jnothman Jul 27, 2014

Owner

You repeat "sample size varies greatly".

@jnothman jnothman commented on an outdated diff Jul 27, 2014

sklearn/ensemble/forest.py
Note: this parameter is tree-specific.
- min_samples_leaf : integer, optional (default=1)
- The minimum number of samples in newly created leaves. A split is
- discarded if after the split, one of the leaves would contain less then
- ``min_samples_leaf`` samples.
+ min_samples_leaf : int, float, optional (default=1)
+ The minimum number of samples required to be at a leaf node:
+ - If int, then consider `min_samples_leaf` as the minimum number.
@jnothman

jnothman Jul 27, 2014

Owner

Please follow the formatting at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/forest.py#L674, including the blank line before and after the list and the unindented list items.

@jnothman jnothman commented on an outdated diff Jul 27, 2014

sklearn/ensemble/forest.py
@@ -854,14 +860,20 @@ class RandomForestRegressor(ForestRegressor):
Ignored if ``max_samples_leaf`` is not None.
Note: this parameter is tree-specific.
- min_samples_split : integer, optional (default=2)
- The minimum number of samples required to split an internal node.
+ min_samples_split : int, float, optional (default=2)
+ The minimum number of samples required to split an internal node:
+ - If int, then consider `min_samples_split` as the minimum number.
@jnothman

jnothman Jul 27, 2014

Owner

Same formatting here

@jnothman jnothman commented on an outdated diff Jul 27, 2014

sklearn/ensemble/forest.py
@@ -692,14 +692,20 @@ class RandomForestClassifier(ForestClassifier):
Ignored if ``max_samples_leaf`` is not None.
Note: this parameter is tree-specific.
- min_samples_split : integer, optional (default=2)
- The minimum number of samples required to split an internal node.
+ min_samples_split : int, float, optional (default=2)
+ The minimum number of samples required to split an internal node:
+ - If int, then consider `min_samples_split` as the minimum number.
@jnothman

jnothman Jul 27, 2014

Owner

Same formatting here

@jnothman jnothman commented on the diff Jul 27, 2014

sklearn/ensemble/forest.py
Note: this parameter is tree-specific.
- min_samples_leaf : integer, optional (default=1)
- The minimum number of samples in newly created leaves. A split is
- discarded if after the split, one of the leaves would contain less then
- ``min_samples_leaf`` samples.
+ min_samples_leaf : int, float, optional (default=1)
+ The minimum number of samples required to be at a leaf node:
@jnothman

jnothman Jul 27, 2014

Owner

Same formatting here

@jnothman jnothman commented on an outdated diff Jul 27, 2014

sklearn/ensemble/forest.py
@@ -1006,14 +1018,20 @@ class ExtraTreesClassifier(ForestClassifier):
Ignored if ``max_samples_leaf`` is not None.
Note: this parameter is tree-specific.
- min_samples_split : integer, optional (default=2)
- The minimum number of samples required to split an internal node.
+ min_samples_split : int, float, optional (default=2)
+ The minimum number of samples required to split an internal node:
+ - If int, then consider `min_samples_split` as the minimum number.
@jnothman

jnothman Jul 27, 2014

Owner

Same formatting here

@jnothman jnothman commented on an outdated diff Jul 27, 2014

sklearn/ensemble/forest.py
Note: this parameter is tree-specific.
- min_samples_leaf : integer, optional (default=1)
- The minimum number of samples in newly created leaves. A split is
- discarded if after the split, one of the leaves would contain less then
- ``min_samples_leaf`` samples.
+ min_samples_leaf : int, float, optional (default=1)
+ The minimum number of samples required to be at a leaf node:
+ - If int, then consider `min_samples_leaf` as the minimum number.
@jnothman

jnothman Jul 27, 2014

Owner

Same formatting here

@jnothman jnothman commented on an outdated diff Jul 27, 2014

sklearn/ensemble/forest.py
@@ -1171,14 +1189,20 @@ class ExtraTreesRegressor(ForestRegressor):
Ignored if ``max_samples_leaf`` is not None.
Note: this parameter is tree-specific.
- min_samples_split : integer, optional (default=2)
- The minimum number of samples required to split an internal node.
+ min_samples_split : int, float, optional (default=2)
+ The minimum number of samples required to split an internal node:
+ - If int, then consider `min_samples_split` as the minimum number.
@jnothman

jnothman Jul 27, 2014

Owner

Same formatting here

@jnothman jnothman commented on an outdated diff Jul 27, 2014

sklearn/ensemble/forest.py
Note: this parameter is tree-specific.
- min_samples_leaf : integer, optional (default=1)
- The minimum number of samples in newly created leaves. A split is
- discarded if after the split, one of the leaves would contain less then
- ``min_samples_leaf`` samples.
+ min_samples_leaf : int, float, optional (default=1)
+ The minimum number of samples required to be at a leaf node:
+ - If int, then consider `min_samples_leaf` as the minimum number.
@jnothman

jnothman Jul 27, 2014

Owner

Same formatting here

@jnothman jnothman commented on an outdated diff Jul 27, 2014

sklearn/ensemble/forest.py
@@ -1306,14 +1330,20 @@ class RandomTreesEmbedding(BaseForest):
min_samples_split samples.
Ignored if ``max_samples_leaf`` is not None.
- min_samples_split : integer, optional (default=2)
- The minimum number of samples required to split an internal node.
+ min_samples_split : int, float, optional (default=2)
+ The minimum number of samples required to split an internal node:
+ - If int, then consider `min_samples_split` as the minimum number.
@jnothman

jnothman Jul 27, 2014

Owner

Same formatting here

@jnothman jnothman commented on an outdated diff Jul 27, 2014

sklearn/ensemble/forest.py
Note: this parameter is tree-specific.
- min_samples_leaf : integer, optional (default=1)
- The minimum number of samples in newly created leaves. A split is
- discarded if after the split, one of the leaves would contain less then
- ``min_samples_leaf`` samples.
+ min_samples_leaf : int, float, optional (default=1)
+ The minimum number of samples required to be at a leaf node:
+ - If int, then consider `min_samples_leaf` as the minimum number.
@jnothman

jnothman Jul 27, 2014

Owner

Same formatting here

@jnothman jnothman commented on an outdated diff Jul 27, 2014

sklearn/ensemble/gradient_boosting.py
@@ -992,11 +992,20 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
of the input variables.
Ignored if ``max_samples_leaf`` is not None.
- min_samples_split : integer, optional (default=2)
- The minimum number of samples required to split an internal node.
+ min_samples_split : int, float, optional (default=2)
+ The minimum number of samples required to split an internal node:
+ - If int, then consider `min_samples_split` as the minimum number.
@jnothman

jnothman Jul 27, 2014

Owner

Same formatting here

@jnothman jnothman commented on an outdated diff Jul 27, 2014

sklearn/ensemble/gradient_boosting.py
@@ -992,11 +992,20 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
of the input variables.
Ignored if ``max_samples_leaf`` is not None.
- min_samples_split : integer, optional (default=2)
- The minimum number of samples required to split an internal node.
+ min_samples_split : int, float, optional (default=2)
+ The minimum number of samples required to split an internal node:
+ - If int, then consider `min_samples_split` as the minimum number.
+ - If float, then `min_samples_split` is a percentage and
+ `int(min_samples_split * n_samples)` are the minimum
+ number of samples for each split.
+
+ min_samples_leaf : int, float, optional (default=1)
+ The minimum number of samples required to be at a leaf node:
+ - If int, then consider `min_samples_leaf` as the minimum number.
@jnothman

jnothman Jul 27, 2014

Owner

Same formatting here

@jnothman jnothman commented on an outdated diff Jul 27, 2014

sklearn/ensemble/gradient_boosting.py
@@ -1260,11 +1269,19 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
for best performance; the best value depends on the interaction
of the input variables.
- min_samples_split : integer, optional (default=2)
- The minimum number of samples required to split an internal node.
-
- min_samples_leaf : integer, optional (default=1)
- The minimum number of samples required to be at a leaf node.
+ min_samples_split : int, float, optional (default=2)
+ The minimum number of samples required to split an internal node:
+ - If int, then consider `min_samples_split` as the minimum number.
@jnothman

jnothman Jul 27, 2014

Owner

Same formatting here

@jnothman jnothman commented on an outdated diff Jul 27, 2014

sklearn/ensemble/gradient_boosting.py
@@ -1260,11 +1269,19 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
for best performance; the best value depends on the interaction
of the input variables.
- min_samples_split : integer, optional (default=2)
- The minimum number of samples required to split an internal node.
-
- min_samples_leaf : integer, optional (default=1)
- The minimum number of samples required to be at a leaf node.
+ min_samples_split : int, float, optional (default=2)
+ The minimum number of samples required to split an internal node:
+ - If int, then consider `min_samples_split` as the minimum number.
+ - If float, then `min_samples_split` is a percentage and
+ `int(min_samples_split * n_samples)` are the minimum
+ number of samples for each split.
+
+ min_samples_leaf : int, float, optional (default=1)
+ The minimum number of samples required to be at a leaf node:
+ - If int, then consider `min_samples_leaf` as the minimum number.
@jnothman

jnothman Jul 27, 2014

Owner

Same formatting here

Coverage increased (+0.0%) when pulling 6ad5a72 on yelite:tree-min-param-enh into c0afd46 on scikit-learn:master.

Owner

jnothman commented Jul 27, 2014

I had also hoped you would not indent the list, but that is less important from the perspective of documentation generation.

Coverage increased (+0.0%) when pulling 01eb915 on yelite:tree-min-param-enh into e181932 on scikit-learn:master.

Coverage increased (+0.0%) when pulling c57ca4c on yelite:tree-min-param-enh into d5430e3 on scikit-learn:master.

Contributor

yelite commented Jul 30, 2014

@arjoly
I found that test_max_leaf_nodes_max_depth in test_forest.py fails randomly on CI but passes locally.

File "/home/travis/build/yelite/scikit-learn/sklearn/ensemble/tests/test_forest.py", line 543, in check_max_leaf_nodes_max_depth
    assert_greater(est.estimators_[0].tree_.max_depth, 1)
nose.proxy.AssertionError: 1 not greater than 1
Owner

arjoly commented Jul 30, 2014

Can you compute the proportion of each class in the dataset?
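
e.g. with a quick snippet like this (a sketch, assuming y holds the labels):

from collections import Counter

counts = Counter(y)
# Fraction of the dataset belonging to each class label.
proportions = {label: n / float(len(y)) for label, n in counts.items()}
print(proportions)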

@jnothman jnothman commented on the diff Jul 31, 2014

sklearn/tree/tree.py
@@ -165,6 +165,15 @@ def fit(self, X, y, sample_weight=None, check_input=True):
max_leaf_nodes = (-1 if self.max_leaf_nodes is None
else self.max_leaf_nodes)
+ min_samples_leaf = (int(np.ceil(self.min_samples_leaf * n_samples))
+ if isinstance(self.min_samples_leaf, float)
@jnothman

jnothman Jul 31, 2014

Owner

This, together with your validation below, will let through floats > 1, which makes no sense. Why not just use 0 < self.min_samples_leaf < 1?

@yelite

yelite Jul 31, 2014

Contributor

@jnothman
I use 1 < min_samples_split <= n_samples, so a float > 1 will raise an error. I think this way is better because we don't need to check the type when validating.

And there are tests for parameter validation:

assert_raises(ValueError,
              ForestEstimator(min_samples_split=1.1).fit, X, y)

I found that I had forgotten to add such a test in test_tree.py. Now it's added.

@jnothman

jnothman Jul 31, 2014

Owner

True, I did not think it through. Still, I think it is more obfuscating than necessary.

Coverage decreased (-0.05%) when pulling 5cc6352 on yelite:tree-min-param-enh into d5430e3 on scikit-learn:master.

Contributor

yelite commented Jul 31, 2014

@arjoly

X = rs.normal(size=shape).reshape(shape)
y = ((X ** 2.0).sum(axis=1) > 9.34).astype(np.float64)

The two labels seem to have the same proportion. Currently n_samples=20; should we raise it?

Owner

arjoly commented Jul 31, 2014

It might be due to a perfect split with only one internal node, making the minimal depth one. If it fails randomly, this might be due to an unset random_state.

Contributor

yelite commented Jul 31, 2014

hastie_X, hastie_y = datasets.make_hastie_10_2(n_samples=20, random_state=1)

random_state is set. Will a different OS or hardware make the result different in this case?

Owner

arjoly commented Jul 31, 2014

I was thinking of this line: est = TreeEstimator(max_depth=1, max_leaf_nodes=k).fit(X, y).

Owner

arjoly commented Jul 31, 2014

Will a different OS or hardware make result different in this case?

Not with the random number generator of numpy.

Coverage increased (+0.0%) when pulling 9fc1f78 on yelite:tree-min-param-enh into ffd13c3 on scikit-learn:master.

Coverage increased (+0.0%) when pulling 4df74bd on yelite:tree-min-param-enh into ad5773c on scikit-learn:master.

Contributor

yelite commented Aug 1, 2014

@arjoly Weird. It doesn't happen anymore... I have tested it three times on Travis and it's fine. Should we merge this PR?

@arjoly arjoly commented on the diff Aug 28, 2014

sklearn/tree/tree.py
@@ -198,6 +209,16 @@ def fit(self, X, y, sample_weight=None, check_input=True):
raise ValueError("min_samples_split must be greater than zero.")
if self.min_samples_leaf <= 0:
raise ValueError("min_samples_leaf must be greater than zero.")
+ if (isinstance(self.min_samples_leaf, float)
+ and self.min_samples_leaf == 1.0):
@arjoly

arjoly Aug 28, 2014

Owner

PEP 8: this line is not visually distinguished from the next one.

@arjoly arjoly commented on the diff Aug 28, 2014

sklearn/tree/tree.py
@@ -198,6 +209,16 @@ def fit(self, X, y, sample_weight=None, check_input=True):
raise ValueError("min_samples_split must be greater than zero.")
if self.min_samples_leaf <= 0:
raise ValueError("min_samples_leaf must be greater than zero.")
+ if (isinstance(self.min_samples_leaf, float)
+ and self.min_samples_leaf == 1.0):
+ raise ValueError("min_samples_leaf cannot be 1.0 "
+ "in order to avoid ambiguity")
+ if not 0 < min_samples_leaf <= n_samples:
+ raise ValueError("min_samples_leaf must in (0, n_samples], "
+ "check both the value and type of input parameter")
@arjoly

arjoly Aug 28, 2014

Owner

Could you report what the user entered as a value and what was inferred?
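
i.e. a message along these lines (a sketch, reusing the names from the diff above):

raise ValueError("min_samples_leaf must be in (0, n_samples]; got %r, "
                 "which was inferred as min_samples_leaf=%d"
                 % (self.min_samples_leaf, min_samples_leaf))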

@arjoly arjoly commented on the diff Aug 28, 2014

sklearn/tree/tree.py
@@ -198,6 +209,16 @@ def fit(self, X, y, sample_weight=None, check_input=True):
raise ValueError("min_samples_split must be greater than zero.")
if self.min_samples_leaf <= 0:
raise ValueError("min_samples_leaf must be greater than zero.")
+ if (isinstance(self.min_samples_leaf, float)
+ and self.min_samples_leaf == 1.0):
+ raise ValueError("min_samples_leaf cannot be 1.0 "
+ "in order to avoid ambiguity")
+ if not 0 < min_samples_leaf <= n_samples:
+ raise ValueError("min_samples_leaf must in (0, n_samples], "
+ "check both the value and type of input parameter")
+ if not 1 < min_samples_split <= n_samples:
+ raise ValueError("min_samples_split must in [2, n_samples], "
+ "check both the value and type of input parameter")
@arjoly

arjoly Aug 28, 2014

Owner

Could you report what the user entered as a value and what was inferred?

@arjoly arjoly commented on the diff Aug 28, 2014

sklearn/tree/tree.py
@@ -232,9 +253,7 @@ def fit(self, X, y, sample_weight=None, check_input=True):
else:
min_weight_leaf = 0.
- # Set min_samples_split sensibly
@arjoly

arjoly Aug 28, 2014

Owner

Can you keep this comment?

@arjoly arjoly commented on the diff Aug 28, 2014

sklearn/tree/tree.py
@@ -191,6 +200,8 @@ def fit(self, X, y, sample_weight=None, check_input=True):
self.max_features_ = max_features
+ if n_samples < 2:
+ raise ValueError("number of samples should be greater than one")
@arjoly

arjoly Aug 28, 2014

Owner

Having only one sample should work. It would be a one-node tree.

@arjoly arjoly commented on the diff Aug 28, 2014

doc/modules/ensemble.rst
@@ -163,20 +163,20 @@ in bias::
>>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
... random_state=0)
- >>> clf = DecisionTreeClassifier(max_depth=None, min_samples_split=1,
+ >>> clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
@arjoly

arjoly Aug 28, 2014

Owner

Why do you need to change this?

Owner

arjoly commented Oct 22, 2015

Closed; superseded by #5531.

@arjoly arjoly closed this Oct 22, 2015
