make min_density parameter for gradient boosting #1689


6 participants


I'm not sure why min_density is handled differently for gradient boosting compared to the random forest estimators. This commit simply makes gradient boosting treat min_density the same way random forests do. I ran into this in a recent Kaggle competition: I set max_depth < 6 and training slowed to a crawl. Setting min_density to 0.1 fixed that. I'm not sure if the tests are still necessary, but I updated them regardless.

I don't know the history of this, but perhaps there is a good reason, @pprett?

scikit-learn member

The question is whether it's a better idea to use 0.1 as our new
default setting or whether we should stick to our old default (via

I don't mind at all keeping the old heuristic. In fact I tried keeping it in, but ran into issues with certain tests (something about overriding parameter values; I forget exactly, but I could take a second look). I just wanted to be able to override the heuristic, so I submitted this in the meantime to get the ball rolling.

This was the data for the GE Flight Quest, so I'm not allowed to distribute it. I can say that I had in excess of 100k samples, 0.1 subsampling, and default parameters for min_samples_leaf|split. I'm not familiar with the standard datasets, but perhaps I can find something to mimic the behavior I saw.

min_density should be smaller than 1

You mean greater than 0? min_density is currently set to 0 for small trees.
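To make the semantics concrete, here is a minimal sketch of the copy-or-mask rule as I understand it from the old tree code (the function name and numbers are illustrative, not the actual implementation): the node's data is copied only when the fraction of active samples falls below min_density, so min_density=0 means "never copy" and min_density=1 means "always copy".

```python
import numpy as np

# Hypothetical sketch of the min_density rule (not scikit-learn's real code):
# copy the active rows out of X only when the sample-mask density drops
# below the threshold; otherwise keep scanning the full-length mask.

def should_copy(sample_mask, min_density):
    density = sample_mask.sum() / sample_mask.shape[0]
    return density < min_density

mask = np.zeros(1000, dtype=bool)
mask[:50] = True                      # only 5% of samples reach this node

print(should_copy(mask, 0.1))         # density 0.05 < 0.1 -> copy
print(should_copy(mask, 0.0))         # 0.05 < 0.0 is never true -> no copy
```

Under this reading, min_density=0 (the current small-tree setting) disables copying entirely, which is exactly the slow case reported above.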


Thanks for including me. I struggled with this because it's not documented.

Adding it to the constructor sounds great. I still think it makes sense to change the heuristic so that it's smaller for shallow trees on large datasets. From my limited understanding, the tradeoff basically involves comparing something that's

  • O(depth * len(X)), if you don't copy, and use the mask
  • O(depth * len(X) * subsample), with copying/sampling

Doesn't the depth cancel out here? I.e. the main criterion should be that subsample is smaller than some (small) threshold.

scikit-learn member

O(depth * len(X)) is the ideal runtime - at each depth level you have len(X) examples to process (assuming for now that leaves are at the bottom layer exclusively). However, our tree growing algorithm is not ideal because it grows the tree in a depth-first manner: at each split node all samples are processed, but those which are not in the sample mask are skipped. Since there are at most 2 ** depth - 1 splitting nodes, runtime is actually O(2 ** depth * len(X)). When len(X) is large but only a few examples are relevant for the current splitting node (i.e. sparse sample_mask), the bulk of time is spent checking the sample_mask and skipping.
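The "checking and skipping" cost is easy to see in a toy loop (a pure-Python stand-in, assumed for illustration; the real splitter is Cython): every node walks the full-length array even when almost every row is masked out, whereas copying the active rows first makes later loops proportionally shorter.

```python
import numpy as np

# Minimal sketch of why a sparse sample_mask is costly: the masked loop
# touches all len(values) entries, skipping most of them, while the
# copied ("densified") array only contains the active rows.

def masked_sum(values, sample_mask):
    total = 0.0
    for i in range(len(values)):      # full-length scan every time...
        if not sample_mask[i]:        # ...even if 95% of rows are skipped
            continue
        total += values[i]
    return total

rng = np.random.default_rng(0)
values = rng.standard_normal(1000)
mask = np.zeros(1000, dtype=bool)
mask[:50] = True                      # only 5% of rows are "active"

dense = values[mask]                  # one copy, then a 20x shorter loop
assert np.isclose(masked_sum(values, mask), dense.sum())
```

Both paths compute the same statistic; only the amount of work per split node differs.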

The problem with subsample is that it makes the sample mask even sparser - so you run into this issue much more quickly.

The fundamental problem with the depth < 6 heuristic is that it doesn't take the size of X into account - if len(X) is large and min_samples_split sufficiently small - the growing procedure might create very sparse splits too - compromising the performance.
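Putting the two arguments above into a rough cost model (my own back-of-the-envelope assumption, not measured constants) shows why the depth-only heuristic breaks down: with 2 ** depth - 1 split nodes, mask scanning costs roughly that many full-length passes, while copying pays one O(len(X)) copy and then only touches the subsampled rows.

```python
# Back-of-the-envelope cost model (an assumption for illustration,
# not scikit-learn's actual accounting).

def mask_scan_cost(n_samples, depth):
    # Depth-first growth visits up to 2**depth - 1 split nodes; each one
    # scans the full-length sample mask, however sparse it is.
    return (2 ** depth - 1) * n_samples

def copy_cost(n_samples, depth, subsample):
    # With copying, each node only touches the rows that survived the
    # subsample, at the price of one initial O(n) copy.
    n_active = int(n_samples * subsample)
    return (2 ** depth - 1) * n_active + n_samples

n, depth, sub = 100_000, 5, 0.05
print(mask_scan_cost(n, depth))       # work when keeping the sparse mask
print(copy_cost(n, depth, sub))       # work when copying the 5% subsample
```

Even at the modest depth of 5, the copying path wins by an order of magnitude here because the cost depends on len(X) and subsample, which the depth < 6 heuristic never looks at.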

We should definitely change this - I will work on this PR on the weekend - if the drop in performance on smallish datasets is not too large I'm tempted to set min_density to 0.1 as our default - just as for RandomForest and DecisionTree.

Thinking about our tree growing procedure: the sample mask is a neat idea if you grow your trees the way GBM does, i.e. you grow only one branch at each depth; this way you only have to process len(X) samples per layer because each layer comprises a split node and a leaf. If you grow complete trees - as we do - checking the sample mask might indeed be too costly. I'll try to forge a PR that allows building trees with exactly J leaves - I'm curious how this will affect both the efficiency and effectiveness of GBRT.


Your explanation makes a lot of sense, thanks for the background. I totally agree that the fundamental problem with the heuristic is that it ignores len(X).

scikit-learn member

This still needs to implement the heuristic, right? That should be fairly easy to do, shouldn't it?

scikit-learn member

@amueller you refer to

...if the drop in performance on smallish datasets is not too large I'm tempted to set min_density to 0.1 as our default...


Doesn't look implemented yet. Will that be a separate PR or done here? @pprett

Thanks for the work on this, and the enlightening discussion :)

scikit-learn member

I didn't see the last reply by @pprett. So we need to do some benchmarking to see if 0.1 is slower than 1 on small datasets. If not (or not much), we can have 0.1 as default. Otherwise we need to implement the current heuristic.


For what it's worth, I used make_friedman1 to simulate the size of the dataset I experienced this on. Using IPython I tried the following.

X, y = make_friedman1(100000, 1000, 0.5)
clf1 = GradientBoostingRegressor(n_estimators=10, min_density=0.1, max_depth=5, subsample=0.05)
clf2 = GradientBoostingRegressor(n_estimators=10, min_density=0, max_depth=5, subsample=0.05)
clf3 = GradientBoostingRegressor(n_estimators=10, min_density=1, max_depth=5, subsample=0.05)

Then timing them

%timeit clf1.fit(X, y)

1 loops, best of 3: 66.8 s per loop

%timeit clf2.fit(X, y)

1 loops, best of 3: 319 s per loop

%timeit clf3.fit(X, y)

1 loops, best of 3: 70 s per loop

If this type of benchmark is sufficient, perhaps I can try this on some smaller datasets.

scikit-learn member

This is not relevant anymore. The new tree implementation removes min_density, sample_mask and X_argsorted parameters.

@glouppe closed this Jul 22, 2013