
make min_density parameter for gradient boosting #1689

Closed
wants to merge 1 commit into from

6 participants

@jwkvam

I'm not sure why min_density is handled differently for gradient boosting compared to the random forest estimators. This commit simply makes gradient boosting treat min_density the same way the random forests do. I ran into this in a recent Kaggle competition: I set max_depth < 6 and training slowed to a crawl; setting min_density to 0.1 fixed that. I'm not sure whether the tests are still necessary, but I updated them regardless.

I don't know the history of this, but perhaps there is a good reason, @pprett?
2b56ee1

@pprett
Owner
@jwkvam

The question is whether it's a better idea to use 0.1 as our new default setting or whether we should stick to our old default (via min_density=None).

I don't mind at all keeping the old heuristic. In fact I tried keeping it in, but ran into issues with certain tests (something about overriding parameter values; I forget exactly, I could take a second look). I just wanted to be able to override the heuristic, so I submitted this in the meantime to get the ball rolling.

This was the data for the GE Flight Quest, so I'm not allowed to distribute it. I can say that I had in excess of 100k samples, 0.1 subsampling, and default parameters for min_samples_leaf|split. I'm not familiar with the standard datasets, but perhaps I can find something to mimic the behavior I saw.

min_density should be smaller than 1

You mean greater than 0? min_density is currently set to 0 for small trees.

@erikbern

Thanks for including me. I struggled with this because it's not documented.

Adding it to the constructor sounds great. I still think it makes sense to change the heuristic so that it's smaller for shallow large trees. From my limited understanding, the tradeoff basically involves comparing something that's

  • O(depth * len(X)), if you don't copy, and use the mask
  • O(depth * len(X) * subsample), with copying/sampling

Doesn't the depth cancel out here? I.e. the main criterion should be that subsample is smaller than some (small) threshold.

@pprett
Owner

O(depth * len(X)) is the ideal runtime: at each depth level you have len(X) examples to process (assuming for now that leaves are at the bottom layer exclusively). However, our tree growing algorithm is not ideal because it grows the tree in a depth-first manner: at each split node all samples are processed, but those which are not in the sample mask are skipped. Since there are at most 2 ** depth - 1 splitting nodes, runtime is actually O(2 ** depth * len(X)). When len(X) is large but only a few examples are relevant for the current splitting node (i.e. a sparse sample_mask), the bulk of the time is spent checking the sample_mask and skipping.
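The gap between the two costs can be made concrete with a back-of-the-envelope sketch (illustrative only, not scikit-learn's actual code): a depth-first grower that scans the full array at every split node performs far more sample_mask checks than the ideal one-pass-per-level algorithm.

```python
# Count sample_mask checks for a complete tree of a given depth.

def mask_checks_depth_first(n_samples, depth):
    # A complete tree of the given depth has 2**depth - 1 split nodes,
    # and in the depth-first scheme each split node scans all
    # n_samples entries of the sample mask (skipping inactive ones).
    return (2 ** depth - 1) * n_samples

def mask_checks_ideal(n_samples, depth):
    # Ideally each level touches each sample exactly once.
    return depth * n_samples

n = 100_000
for d in (3, 6, 10):
    print(d, mask_checks_depth_first(n, d), mask_checks_ideal(n, d))
```

For depth 10 this is roughly a 100x blow-up over the ideal cost, which is why a sparse mask (deep trees, small subsample) hurts so much.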

The problem with subsample is that it makes the sample mask even sparser, so you run into this issue much more quickly.

The fundamental problem with the depth < 6 heuristic is that it doesn't take the size of X into account: if len(X) is large and min_samples_split is sufficiently small, the growing procedure might create very sparse splits too, compromising performance.
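A purely hypothetical sketch of what a size-aware rule could look like (none of these names exist in scikit-learn; the threshold is arbitrary): instead of switching on max_depth alone, estimate how sparse the mask will get at the bottom level and switch to copying when that estimate is small.

```python
# Hypothetical replacement for "min_density = 0.0 if max_depth < 6 else 0.1":
# estimate the expected mask density at the deepest level and enable
# copying (min_density=0.1) whenever it falls below a threshold.

def pick_min_density(n_samples, max_depth, subsample=1.0, threshold=0.01):
    # With balanced splits, each bottom-level node sees roughly
    # subsample / 2**max_depth of the full sample mask.
    expected_density = subsample / (2 ** max_depth)
    return 0.1 if expected_density < threshold else 0.0

pick_min_density(100_000, max_depth=5, subsample=0.05)  # sparse: copying pays off
pick_min_density(1_000, max_depth=3, subsample=1.0)     # dense: keep the mask
```

This captures both points raised in the thread: subsample and depth jointly determine sparsity, not depth alone.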

We should definitely change this. I will work on this PR over the weekend; if the drop in performance on smallish datasets is not too large, I'm tempted to set min_density to 0.1 as our default, just as for RandomForest and DecisionTree.

Thinking about our tree growing procedure: the sample mask is a neat idea if you grow your trees the way GBM does, i.e. you grow only one branch at each depth; this way you only have to process len(X) samples per layer, because each layer comprises one split node and one leaf. If you grow complete trees, as we do, checking the sample mask might indeed be too costly. I'll try to forge a PR that allows building trees with exactly J leaves; I'm curious how this will affect both the efficiency and effectiveness of GBRT.
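The repacking rule that min_density controls can be sketched in a few lines (illustrative names, not scikit-learn internals): partitions are tracked with a boolean sample mask, and once the mask's density drops below min_density the active rows are copied into a dense array so later splits stop paying for the skips.

```python
# Sketch of the min_density repacking rule: pack X down to the active
# rows whenever the sample mask gets sparser than min_density.

def maybe_pack(X, sample_mask, min_density):
    """Return (X, sample_mask), packing X if the mask is too sparse."""
    density = sum(sample_mask) / len(sample_mask)
    if density < min_density:
        # Copy only the active rows; the new mask is all-True.
        X = [row for row, keep in zip(X, sample_mask) if keep]
        sample_mask = [True] * len(X)
    return X, sample_mask

X = [[i] for i in range(10)]
mask = [i < 2 for i in range(10)]                # density 0.2

Xp, mp = maybe_pack(X, mask, min_density=0.1)    # 0.2 >= 0.1: no copy
Xq, mq = maybe_pack(X, mask, min_density=0.5)    # 0.2 < 0.5: packed to 2 rows
```

min_density=1 therefore means "always copy" and min_density=0 means "never copy", matching the docstring added in this PR.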

@erikbern

Your explanation makes a lot of sense, thanks for the background. I totally agree that the fundamental problem with the heuristic is that it ignores len(X).

@amueller
Owner

This still needs to implement the heuristic, right? That should be fairly easy to do, shouldn't it?

@jaquesgrobler

@amueller you refer to

...if the drop in performance on smallish datasets is not too large I'm tempted to set min_density to 0.1 as our default...

?

Doesn't look implemented yet. Will that be a separate PR or done here? @pprett

Thanks for the work on this, and the enlightening discussion :)

@amueller
Owner

I didn't see the last reply by @pprett. So we need to do some benchmarking to see if 0.1 is slower than 1 on small datasets. If not (or not much), we can have 0.1 as default. Otherwise we need to implement the current heuristic.

@jwkvam

For what it's worth, I used make_friedman1 to simulate the size of the dataset I experienced this on. In IPython I tried the following.

X, y = make_friedman1(100000, 1000, 0.5)
clf1 = GradientBoostingRegressor(n_estimators=10, min_density=0.1, max_depth=5, subsample=0.05)
clf2 = GradientBoostingRegressor(n_estimators=10, min_density=0, max_depth=5, subsample=0.05)
clf3 = GradientBoostingRegressor(n_estimators=10, min_density=1, max_depth=5, subsample=0.05)

Then timing them:

timeit clf1.fit(X, y)  # min_density=0.1: 1 loops, best of 3: 66.8 s per loop
timeit clf2.fit(X, y)  # min_density=0:   1 loops, best of 3: 319 s per loop
timeit clf3.fit(X, y)  # min_density=1:   1 loops, best of 3: 70 s per loop

If this type of benchmark is sufficient, perhaps I can try this on some smaller datasets.

@glouppe
Owner

This is not relevant anymore. The new tree implementation removes min_density, sample_mask and X_argsorted parameters.

@glouppe glouppe closed this
Commits on Feb 17, 2013
  1. @jwkvam
42 sklearn/ensemble/gradient_boosting.py
@@ -426,14 +426,15 @@ class BaseGradientBoosting(BaseEnsemble):
@abstractmethod
def __init__(self, loss, learning_rate, n_estimators, min_samples_split,
- min_samples_leaf, max_depth, init, subsample, max_features,
- random_state, alpha=0.9, verbose=0):
+ min_samples_leaf, min_density, max_depth, init, subsample,
+ max_features, random_state, alpha=0.9, verbose=0):
self.n_estimators = n_estimators
self.learning_rate = learning_rate
self.loss = loss
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
+ self.min_density = min_density
self.subsample = subsample
self.max_features = max_features
self.max_depth = max_depth
@@ -560,9 +561,6 @@ def fit(self, X, y):
random_state = check_random_state(self.random_state)
- # use default min_density (0.1) only for deep trees
- self.min_density = 0.0 if self.max_depth < 6 else 0.1
-
# create argsorted X for fast tree induction
X_argsorted = np.asfortranarray(
np.argsort(X.T, axis=1).astype(np.int32).T)
@@ -739,6 +737,17 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
min_samples_leaf : integer, optional (default=1)
The minimum number of samples required to be at a leaf node.
+ min_density : float, optional (default=0.1)
+ This parameter controls a trade-off in an optimization heuristic. It
+ controls the minimum density of the `sample_mask` (i.e. the
+ fraction of samples in the mask). If the density falls below this
+ threshold the mask is recomputed and the input data is packed
+ which results in data copying. If `min_density` equals to one,
+ the partitions are always represented as copies of the original
+ data. Otherwise, partitions are represented as bit masks (aka
+ sample masks).
+ Note: this parameter is tree-specific.
+
subsample : float, optional (default=1.0)
The fraction of samples to be used for fitting the individual base
learners. If smaller than 1.0 this results in Stochastic Gradient
@@ -813,13 +822,13 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
subsample=1.0, min_samples_split=2, min_samples_leaf=1,
- max_depth=3, init=None, random_state=None,
+ min_density=0.1, max_depth=3, init=None, random_state=None,
max_features=None, verbose=0):
super(GradientBoostingClassifier, self).__init__(
loss, learning_rate, n_estimators, min_samples_split,
- min_samples_leaf, max_depth, init, subsample, max_features,
- random_state, verbose=verbose)
+ min_samples_leaf, min_density, max_depth, init, subsample,
+ max_features, random_state, verbose=verbose)
def fit(self, X, y):
"""Fit the gradient boosting model.
@@ -969,6 +978,17 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
min_samples_leaf : integer, optional (default=1)
The minimum number of samples required to be at a leaf node.
+ min_density : float, optional (default=0.1)
+ This parameter controls a trade-off in an optimization heuristic. It
+ controls the minimum density of the `sample_mask` (i.e. the
+ fraction of samples in the mask). If the density falls below this
+ threshold the mask is recomputed and the input data is packed
+ which results in data copying. If `min_density` equals to one,
+ the partitions are always represented as copies of the original
+ data. Otherwise, partitions are represented as bit masks (aka
+ sample masks).
+ Note: this parameter is tree-specific.
+
subsample : float, optional (default=1.0)
The fraction of samples to be used for fitting the individual base
learners. If smaller than 1.0 this results in Stochastic Gradient
@@ -1048,13 +1068,13 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
def __init__(self, loss='ls', learning_rate=0.1, n_estimators=100,
subsample=1.0, min_samples_split=2, min_samples_leaf=1,
- max_depth=3, init=None, random_state=None,
+ min_density=0.1, max_depth=3, init=None, random_state=None,
max_features=None, alpha=0.9, verbose=0):
super(GradientBoostingRegressor, self).__init__(
loss, learning_rate, n_estimators, min_samples_split,
- min_samples_leaf, max_depth, init, subsample, max_features,
- random_state, alpha, verbose)
+ min_samples_leaf, min_density, max_depth, init, subsample,
+ max_features, random_state, alpha, verbose)
def fit(self, X, y):
"""Fit the gradient boosting model.
16 sklearn/ensemble/tests/test_gradient_boosting.py
@@ -477,11 +477,19 @@ def test_mem_layout():
def test_min_density():
- """Check if min_density is properly set when growing deep trees."""
- clf = GradientBoostingClassifier(max_depth=6)
+ """Check if setting min_density works and default is 0.1."""
+ clf = GradientBoostingClassifier()
clf.fit(X, y)
assert clf.min_density == 0.1
- clf = GradientBoostingClassifier(max_depth=5)
+ clf = GradientBoostingClassifier(min_density=0.5)
clf.fit(X, y)
- assert clf.min_density == 0.0
+ assert clf.min_density == 0.5
+
+ clf = GradientBoostingRegressor()
+ clf.fit(X, y)
+ assert clf.min_density == 0.1
+
+ clf = GradientBoostingRegressor(min_density=0.5)
+ clf.fit(X, y)
+ assert clf.min_density == 0.5