
[MRG] Bagging meta-estimator #2375

Closed · wants to merge 86 commits

Conversation

@glouppe (Contributor) commented Aug 20, 2013

Git history in #2198 was messed up, so I made a new pull request. Sorry for the noise...

TODO:

  • narrative documentation
  • add an example

@ogrisel (Member) commented Aug 20, 2013

For the example it would be very interesting to try BaggingClassifier with a kernel SVC as the base estimator, for instance on the digits dataset. As SVC has more than quadratic time complexity w.r.t. n_samples, we can expect bagging to actually improve speed at the same level of accuracy as a regular SVC model trained on the full dataset.
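A rough sketch of that experiment (not part of this PR; the split, n_estimators=10 and max_samples=0.1 are illustrative choices, assuming the final BaggingClassifier API):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each SVC is fit on a 10% subsample, so the super-quadratic cost per
# estimator drops sharply; predictions are then aggregated over the ensemble.
clf = BaggingClassifier(SVC(), n_estimators=10, max_samples=0.1,
                        random_state=0)
clf.fit(X_train, y_train)
print("test accuracy: %.3f" % clf.score(X_test, y_test))
```

Timing this against a single `SVC().fit(X_train, y_train)` on the full training set would show whether the speed-up materializes at a comparable accuracy.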

@glouppe (Contributor, Author) commented Aug 20, 2013

That might be a good idea, but the digits dataset is actually quite small: training an SVC on it takes no more than a second. At that scale, I'd rather not draw any conclusions if one approach appears faster than the other.

@ogrisel (Member) commented Aug 20, 2013

You can nudge the dataset, as done in the RBM example, to make it both larger and harder.

@glouppe (Contributor, Author) commented Aug 20, 2013

In another direction, I was thinking about a figure like the ones I had done in my paper (see http://orbi.ulg.ac.be/bitstream/2268/130099/1/glouppe12.pdf page 11): it can be used to show the effect of max_samples and max_features (either alone or combined).

Another great example would be a bias-variance decomposition of the error, illustrating what happens when base estimators are averaged together. (No matter what we choose here, such an example should anyway be in our documentation in my opinion...)
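A minimal way to produce the data behind such a max_samples figure, sketched with an illustrative synthetic dataset and a plain cross-validation loop (the dataset and constants are arbitrary, not the ones used in the paper):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)

# One point per curve position: mean cross-validated R^2 of the bagged
# trees as the fraction of samples drawn for each estimator grows.
scores = {}
for max_samples in (0.2, 0.5, 1.0):
    reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=20,
                           max_samples=max_samples, random_state=0)
    scores[max_samples] = cross_val_score(reg, X, y, cv=3).mean()
print(scores)
```

The same loop over `max_features` (or a grid over both) gives the combined-effect plot.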

@ogrisel (Member) commented Aug 20, 2013

In another direction, I was thinking about a figure like the ones I had done in my paper (see http://orbi.ulg.ac.be/bitstream/2268/130099/1/glouppe12.pdf page 11): it can be used to show the effect of max_samples and max_features (either alone or combined).

+1. It would be interesting to do this kind of plot with a bagged GBRT regressor on a non-trivial regression dataset.

@ogrisel (Member) commented Aug 20, 2013

Another great example would be a bias-variance decomposition of the error, illustrating what happens when base estimators are averaged together. (No matter what we choose here, such an example should anyway be in our documentation in my opinion...)

+1 as well.

@glouppe (Contributor, Author) commented Aug 21, 2013

I have got a working example of the bias-variance decomposition of the mean squared error of a single estimator versus bagging. It still needs some work and documentation, but here is how it renders on a toy 1d regression problem. The first plot displays the function to predict, the predictions of single estimators over several instances of the problem, and the mean prediction. The second plot is a decomposition of the mean squared error at point x.

[figure: single-estimator predictions vs. bagging, and error decomposition]

Script output:
Single estimator: 0.003498 (mse) = 0.000079 (bias^2) + 0.003420 (var)
Bagging: 0.001900 (mse) = 0.000074 (bias^2) + 0.001825 (var)

In particular, one can see from the lower plot (compare the solid green line with the dashed green line), or from the script output, that bagging mainly affects - and reduces - the variance part of the mean squared error.
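The estimation behind such a figure can be sketched as follows; the toy target function and constants here are illustrative assumptions, not the actual example script:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
f = lambda x: np.exp(-x ** 2) + 1.5 * np.exp(-(x - 2) ** 2)  # toy target

X_test = np.linspace(-5, 5, 200).reshape(-1, 1)
n_repeat, n_train, noise = 50, 80, 0.1

# Fit one tree per independently drawn noisy training set, then decompose
# the error of the collected predictions at each test point.
y_predict = np.empty((len(X_test), n_repeat))
for i in range(n_repeat):
    X = rng.uniform(-5, 5, size=(n_train, 1))
    y = f(X).ravel() + rng.normal(0, noise, size=n_train)
    y_predict[:, i] = DecisionTreeRegressor().fit(X, y).predict(X_test)

bias2 = (f(X_test).ravel() - y_predict.mean(axis=1)) ** 2
var = y_predict.var(axis=1)
print("bias^2 = %.4f, var = %.4f" % (bias2.mean(), var.mean()))
```

Repeating the loop with a bagged tree in place of the single tree yields the second row of numbers; for fully grown trees, variance dominates the bias term.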

@arjoly (Member) commented Aug 22, 2013

Nice plot, but I can't discern the different curves. What do you think of breaking it into three plots (mse, bias^2, variance) using the same scale? Would it be interesting to add some noise?

@glouppe (Contributor, Author) commented Aug 22, 2013

Nice plot, but I can't discern the different curves. What do you think of breaking it into three plots (mse, bias^2, variance) using the same scale? Would it be interesting to add some noise?

The plot is not up to date, see the next commits :) I'll refresh it when I'm done.

@glouppe (Contributor, Author) commented Aug 22, 2013

Here is an updated version of the example. See the explanations in the docstring for details.

[figure: updated bias-variance decomposition example]

Output:
Tree: 0.025531 (error) = 0.000308 (bias^2) + 0.015245 (var) + 0.009761 (noise)
Bagging(Tree): 0.019596 (error) = 0.000437 (bias^2) + 0.009164 (var) + 0.009761 (noise)

I think this makes quite a nice example overall, illustrating both the bias-variance decomposition and the benefits of bagging. What do you think? @ogrisel @arjoly

@ogrisel (Member) commented Aug 22, 2013

Very nice!

@glouppe (Contributor, Author) commented Aug 22, 2013

It is also quite interesting to explore other base estimators (KNN, SVR, etc) :)

estimators. The larger the variance, the more sensitive are the predictions for
`x` to small changes in the training set. The bias term corresponds to the
difference between the average prediction of the estimator (in cyan) and the
best possible model (in dark blue). On this problem, we can thus observe than

Review comment (Member): than => that?

@arjoly (Member) commented Aug 22, 2013

The plot is a lot nicer !!!

@ogrisel (Member) commented Aug 22, 2013

It is also quite interesting to explore other base estimators (KNN, SVR, etc) :)

Yes, and the GBRT model as well, although this problem might be too easy to show the benefit of bagging GBRT models.

@glouppe (Contributor, Author) commented Aug 22, 2013

I have completed the example and added a new section in the narrative documentation.

This looks ready to me, going from [WIP] to [MRG].

A first round of reviews is more than welcome! Random pings: @pprett @ogrisel @arjoly

y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2
y_var = np.var(y_predict, axis=1)

print("{0}: {1} (error) = {2} (bias^2) + {3} (var) + {4} (noise)".format(

Review comment (Member): You can use {1:.4f} to limit the precision to 4 decimal places and make the output easier to read.

Reply (Contributor, Author): Thanks, I was looking for that. Still not used to this Python 3 way of formatting :-)

Reply (Member): It's been there for quite some time, at least since Python 2.6.

@ogrisel (Member) commented Aug 22, 2013

Before merging I would really like to have support for sparse X. This can be a bit tricky: if sample_weight is not supported by the base_estimator, it means converting the data to CSC first for feature-wise sampling and then converting the subsample to CSR for sample-wise sampling.

I think it's worth doing it though (with tests).
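The conversion dance can be sketched directly with scipy (shapes and sampling sizes here are arbitrary):

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
X = sparse.random(100, 20, density=0.1, format="csr", random_state=rng)

features = rng.choice(20, size=8, replace=False)   # feature-wise sampling
samples = rng.choice(100, size=100, replace=True)  # bootstrap of samples

# Column slicing is cheap on CSC and row slicing on CSR, hence the two
# conversions when both kinds of sampling are requested.
X_sub = X.tocsc()[:, features].tocsr()[samples]
print(X_sub.shape)
```

When the base estimator does accept sample_weight, the row indexing step can be replaced by passing bootstrap counts as weights, avoiding one of the conversions.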

@glouppe (Contributor, Author) commented Aug 22, 2013

Before merging I would really like to have support for sparse X. This can be a bit tricky because if sample_weight is not supported that means converting the data first to CSC for feature-wise sampling and then the subsample to CSR for sample-wise sampling.

me don't like sparse formats

I agree though, I'll look at this later.

- If float, then draw `max_features * X.shape[1]` features.

bootstrap : boolean, optional (default=False)
Whether instances are drawn with replacement.

Review comment (Member): instances => samples

@arjoly (Member) commented Aug 23, 2013

Before merging I would really like to have support for sparse X. This can be a bit tricky because if sample_weight is not supported that means converting the data first to CSC for feature-wise sampling and then the subsample to CSR for sample-wise sampling.

me don't like sparse formats

I agree though, I'll look at this later.

This PR is already pretty large (around 1,300 added lines). I would prefer to keep this feature for a separate PR.


In regression, the expected mean squared error of an estimator can be
decomposed in terms of bias, variance and noise. On average over dataset
instances LS of the regression problem, the bias term measures the average

Review comment (Member): LS?

@arjoly (Member) commented Sep 9, 2013

let's merge this beast !!! +1

@glouppe (Contributor, Author) commented Sep 9, 2013

Thanks for your review Arnaud!

@ogrisel Shall we merge this or wait for someone else review?

@ogrisel (Member) commented Sep 9, 2013

I would wait for the opinion of past reviewers such as @larsmans and @mblondel.

@larsmans (Member) commented Sep 9, 2013

I'll try to review tonight. @glouppe can you post the generated figures here?

@glouppe (Contributor, Author) commented Sep 9, 2013

Sure, here it is for the bias-variance decomposition example:

[figure: bias-variance decomposition example]

construction procedure and then making an ensemble out of it. In many cases,
bagging methods constitute a very simple way to improve with respect to a
single model, without making it necessary to adapt the underlying base
algorithm.

Review comment (Member): For the noobs, it might be useful to state explicitly that bagging should be used with strong learners and that it reduces overfitting (and maybe to contrast it with boosting in this sense).

estimators_features))


def _partition_estimators(ensemble):

Review comment (Member): Should this go in ensemble/base.py?

Reply (Contributor, Author): Done

@arjoly (Member) commented Sep 11, 2013

LGTM!

@ogrisel (Member) commented Sep 11, 2013

Any final words @larsmans? LGTM too.

@larsmans larsmans closed this in 524daee Sep 11, 2013
@larsmans (Member)

All tests pass on my box. Merged by hand after extensive rebase.

@ogrisel (Member) commented Sep 11, 2013

Great! Thanks all!

@arjoly (Member) commented Sep 11, 2013

Great :-) !!! 🍻

@glouppe (Contributor, Author) commented Sep 11, 2013

Great! Thank you all for the reviews :)

@glouppe (Contributor, Author) commented Sep 11, 2013

@larsmans By the way, did you have to squash everything into a single commit? :s

@larsmans (Member)

It's one feature, so you get one commit for it ;)

Seriously: this was the easiest way to get rid of the duplicate and typo commits.

@jakevdp (Member) commented Sep 11, 2013

Nice work!

@glouppe (Contributor, Author) commented Sep 11, 2013

It's one feature, so you get one commit for it ;)

Meh, why is life so hard? ;;

(joking)

@mblondel (Member)

@glouppe Could you have a look at PR #2420?
