
[MRG+1] EHN Add bootstrap sample size limit to forest ensembles #14682

Merged: 31 commits merged into scikit-learn:master on Sep 20, 2019

Conversation

@notmatthancock (Contributor) commented Aug 18, 2019

Adds a max_samples kwarg to forest ensembles that limits the size of the bootstrap samples used to train each estimator. This PR is intended to supersede two previous stalled PRs (#5963 and #9645), which have not been touched for a couple of years. This PR adds unit tests for the various functionalities.
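For context, a minimal sketch of the proposed API (the parameter values here are illustrative, not from the PR):

```python
from sklearn.ensemble import RandomForestClassifier

# max_samples caps the size of each tree's bootstrap sample:
# an int is an absolute count, a float a fraction of X.shape[0],
# and None keeps the classic behavior of X.shape[0] draws.
clf_count = RandomForestClassifier(n_estimators=100, max_samples=1000)
clf_fraction = RandomForestClassifier(n_estimators=100, max_samples=0.5)
```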

Why is this useful?

When training a random forest classifier on a large dataset with correlated/redundant examples, training on the entire dataset is memory intensive and unnecessary. Further, it results in models that occupy an unwieldy amount of memory. As an example, consider training on image patches for a segmentation-as-classification problem. It would be useful to train on only a subset of the available image patches, because a patch obtained at one location is expected to be highly correlated with the patch obtained one pixel over. Limiting the size of the bootstrap sample used to train each estimator is useful in this and similar applications.

Pickled model disk space comparison

Here's a simple test script showing the difference in disk space occupied by the pickled model when using the full dataset size for each bootstrap vs. a bootstrap size of just 1 (obviously a dramatic comparison). The resulting file sizes:

```
File size `max_samples=None`: 0.154GB
File size `max_samples=1`: 5.286e-05GB
```
The script:

```python
import os
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier


rs = np.random.RandomState(1234)
X = rs.randn(100000, 1000)
y = rs.randn(X.shape[0]) > 0

# Default behavior: each tree is trained on a bootstrap sample of
# X.shape[0] draws.
rfc = RandomForestClassifier(
    n_estimators=100, random_state=rs)
rfc.fit(X, y)
with open('rfc.pkl', 'wb') as f:
    pickle.dump(rfc, f)
size = os.stat('rfc.pkl').st_size / 1e9
print("File size `max_samples=None`: {}GB".format(size))

# New behavior: each tree is trained on a bootstrap sample of a single
# draw, so each fitted tree is trivially small.
rfc = RandomForestClassifier(
    n_estimators=100, random_state=rs, max_samples=1)
rfc.fit(X, y)
with open('rfc.pkl', 'wb') as f:
    pickle.dump(rfc, f)
size = os.stat('rfc.pkl').st_size / 1e9
print("File size `max_samples=1`: {}GB".format(size))
```
@glemaitre (Contributor) left a comment

You should also add the parameter to RandomForestRegressor, shouldn't you?

@glemaitre (Contributor) left a comment

Additional comments regarding the tests.

```diff
@@ -1330,3 +1330,93 @@ def test_forest_degenerate_feature_importances():
     gbr = RandomForestRegressor(n_estimators=10).fit(X, y)
     assert_array_equal(gbr.feature_importances_,
                        np.zeros(10, dtype=np.float64))
+
+
+def test__get_n_bootstrap_samples():
```

@glemaitre (Contributor) commented Aug 19, 2019

We usually don't test a private function. Instead, we would test through the different estimators (RandomForest, ExtraTrees).

@glemaitre (Contributor) commented Aug 19, 2019

You can easily parametrize this case and check it. You can make a test function where you want to check that errors are raised properly (you can parametrize that as well).

The other behavior should be tested by fitting the estimator and checking that it was fitted on a subset of the data.
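For illustration, a sketch of the kind of parametrized error test suggested above (the invalid values and exception types are assumptions, not the merged tests):

```python
import pytest
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=50, random_state=0)

@pytest.mark.parametrize("Forest",
                         [RandomForestClassifier, ExtraTreesClassifier])
@pytest.mark.parametrize("max_samples, err_type",
                         [(0, ValueError),       # int below the valid range
                          (2.0, ValueError),     # float outside (0, 1)
                          ("half", TypeError)])  # neither int nor float
def test_max_samples_raises(Forest, max_samples, err_type):
    # The error should surface when fitting with an invalid max_samples.
    with pytest.raises(err_type):
        Forest(max_samples=max_samples).fit(X, y)
```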

@notmatthancock (Author) commented Aug 21, 2019

> check that we fitted on a subset of data.

Where do you get that info?

As far as I can see, the sample indices for each bootstrap are not stored anywhere; they are translated into sample weights and passed to the estimator.
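For reference, a sketch of the mechanism described above, with illustrative names (the bootstrap indices are folded into per-sample weights rather than stored):

```python
import numpy as np

n_samples, n_samples_bootstrap = 8, 4
rng = np.random.RandomState(0)
# Draw the (possibly size-limited) bootstrap sample with replacement.
indices = rng.randint(0, n_samples, n_samples_bootstrap)
# Convert draw counts into per-sample weights; samples never drawn
# get weight 0, so the full X can be passed to the tree unchanged.
curr_sample_weight = np.ones(n_samples)
curr_sample_weight *= np.bincount(indices, minlength=n_samples)
# tree.fit(X, y, sample_weight=curr_sample_weight)
```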

@notmatthancock (Author) commented Aug 21, 2019

(exception check unit tests refactored in: 081b7b7)

@glemaitre (Contributor) commented Aug 22, 2019

> Where do you get that info?

Yep, I thought we had a private function to get back the indices, but it does not look so simple to do without some hacks. I would need a bit more time to think about how to test this.

```diff
 unsampled_indices = _generate_unsampled_indices(
-    estimator.random_state, n_samples)
+    estimator.random_state, n_samples, n_bootstrap_samples)
```

@glemaitre (Contributor) commented Aug 19, 2019

@jnothman here we will test on the left-out samples. Since we use max_samples, this means we may use a lot of samples for the OOB estimate. Is this something that we want?

If so, we should add a test to make sure that we compute the right complement when max_samples is not None.
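For context, a simplified sketch of the complement computation under discussion (the real helper regenerates the bootstrap draw from the tree's random_state; names here are illustrative):

```python
import numpy as np

def unsampled_indices(seed, n_samples, n_samples_bootstrap):
    rng = np.random.RandomState(seed)
    sampled = rng.randint(0, n_samples, n_samples_bootstrap)
    counts = np.bincount(sampled, minlength=n_samples)
    # Samples never drawn into the bootstrap form the OOB set.
    return np.flatnonzero(counts == 0)

# Shrinking the bootstrap enlarges the OOB set:
print(len(unsampled_indices(0, 1000, 1000)))  # roughly 368 (about n/e)
print(len(unsampled_indices(0, 1000, 100)))   # roughly 905
```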

@jnothman (Member) commented Aug 20, 2019

Is there a problem with having a large OOB sample for testing? Testing with trees isn't fast, but it is a lot faster than training...?

@glemaitre (Contributor) commented Aug 20, 2019

> Testing with trees isn't fast, but it is a lot faster than training...?

True. It should not be an issue then.

@jnothman (Member) commented Aug 20, 2019

If the sample is much smaller than the dataset, I suppose it may be an issue. Maybe we should make a rule that the OOB sample is constrained to be no larger than the training set... But that may confuse users trying different values for this parameter.

@notmatthancock (Author) commented Aug 21, 2019

> If the sample is much smaller than the dataset

I can see this situation easily arising, e.g., your dataset has 10^6 examples and you want to fit with, say, max_samples=1000 and n_estimators=1000.

@glemaitre (Contributor) left a comment

A couple of additional comments

@Shihab-Shahriar commented Sep 6, 2019

Hello, does this actually limit the sample size given to each base tree? I noticed a few days back that _parallel_build_trees doesn't actually subsample training instances. To inject randomness, it uses the bootstrap as a way of re-weighting the sample_weight of instances. In other words, the whole dataset is passed to each base tree, but with different sample_weight.

I tried putting a print(len(y)) just before the tree.fit call when bootstrap=True in your code, and it confirmed my suspicion. Please let me know if there's any error in my reasoning.

@glemaitre (Contributor) commented Sep 9, 2019

> Hello, does this actually limit the sample size given to each base tree? In other words, the whole dataset is passed to each base tree, but with different sample_weight.

That is the definition of a bootstrap sample: sampling with replacement with n_samples = X.shape[0]. Now we also allow sampling with replacement with n_samples < X.shape[0].
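A tiny illustration of the distinction (illustrative only):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 10
# Classic bootstrap: n_samples draws, with replacement.
classic = rng.randint(0, n_samples, n_samples)
# With max_samples: still with replacement, but fewer draws.
limited = rng.randint(0, n_samples, 4)
```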

glemaitre self-requested a review, Sep 9, 2019
@notmatthancock (Author) commented Sep 11, 2019

Sorry about that particular choice of word. I realize this may have sounded a lot different than I originally intended.

To be fair, my initial comment for this PR suggests what you thought:

> When training a random forest classifier on a large dataset with correlated/redundant examples, training on the entire dataset is memory intensive and unnecessary.

But, as it turns out, this PR does not address that aspect, and importantly, none of the contributed docs suggest it does, as @glemaitre points out.

@notmatthancock (Author) commented Sep 11, 2019

> @notmatthancock Could you add an entry in doc/whats_new/v0.22.rst

Done.

@glemaitre (Contributor) commented Sep 12, 2019

> @glemaitre, Sorry about that particular choice of word. I realize this may have sounded a lot different than I originally intended.

It is just to be certain that we don't document something wrong :)

glemaitre self-requested a review, Sep 13, 2019
glemaitre added this to IN REVISION in Guillaume's pet, Sep 13, 2019
@glemaitre (Contributor) left a comment

Apart from the small pytest.raises issue, LGTM.

@glemaitre (Contributor) commented Sep 13, 2019

FYI: codecov seems to be wrong. All the diff code is covered.

glemaitre changed the title from "Add bootstrap sample size limit to forest ensembles" to "[MRG+1] EHN Add bootstrap sample size limit to forest ensembles", Sep 13, 2019
@glemaitre (Contributor) commented Sep 14, 2019

Thanks @notmatthancock, let's wait for a second review before merging.

Maybe @jnothman @adrinjalali could have a look

glemaitre moved this from REVIEWED AND WAITING FOR CHANGES to TO BE MERGED in Guillaume's pet, Sep 14, 2019
@jnothman (Member) left a comment

I'm good with the implementation. I'm wondering if it needs mentioning in the documentation for bootstrap... This is no longer quite a bootstrap, but I think former proposals reused that parameter name.

If we are happy with a new parameter, this LGTM.

@adrinjalali (Member) left a comment

Otherwise LGTM, thanks @notmatthancock

```python
    if max_samples is None:
        return n_samples

    if isinstance(max_samples, numbers.Integral):
```

@adrinjalali (Member) commented Sep 18, 2019

I think we should be consistent w.r.t. how we treat these fractions. For instance, in OPTICS, we have:

```python
def _validate_size(size, n_samples, param_name):
    if size <= 0 or (size != int(size) and size > 1):
        raise ValueError('%s must be a positive integer '
                         'or a float between 0 and 1. Got %r' %
                         (param_name, size))
    elif size > n_samples:
        raise ValueError('%s must be no greater than the'
                         ' number of samples (%d). Got %d' %
                         (param_name, n_samples, size))
```

And then 1 always means 100% of the data, at least in OPTICS. Do we have similar semantics in other places?

@glemaitre (Contributor) commented Sep 18, 2019

With PCA, n_components=1 means 1 component, while n_components < 1 will be a fraction.

@glemaitre (Contributor) commented Sep 18, 2019

Excluding 1 from the float range avoids issues with float comparison as well.

```
to train each base estimator.
- If None (default), then draw `X.shape[0]` samples.
- If int, then draw `max_samples` samples.
- If float, then draw `max_samples * X.shape[0]` samples.
```

@adrinjalali (Member) commented Sep 19, 2019

I'm okay with the behavior of the float, but we need to document it here, i.e. explicitly say that if float, it must be in (0, 1).
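Putting the pieces together, a sketch of how the documented semantics might resolve to a concrete bootstrap size (the helper name echoes the test name above; exact bounds and messages are assumptions):

```python
import numbers

def _get_n_bootstrap_samples(n_samples, max_samples):
    # None keeps the classic bootstrap size.
    if max_samples is None:
        return n_samples
    # An int is an absolute count; check Integral before Real,
    # since ints are also Real.
    if isinstance(max_samples, numbers.Integral):
        if not 1 <= max_samples <= n_samples:
            raise ValueError("max_samples must be in range 1 to %d"
                             % n_samples)
        return max_samples
    # A float is a fraction of the dataset, restricted to (0, 1)
    # per the review discussion above.
    if isinstance(max_samples, numbers.Real):
        if not 0 < max_samples < 1:
            raise ValueError("max_samples must be in range (0, 1)")
        return int(round(n_samples * max_samples))
    raise TypeError("max_samples should be int, float, or None")
```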

glemaitre merged commit 746efb5 into scikit-learn:master, Sep 20, 2019
9 of 15 checks passed
- scikit-learn.scikit-learn Build #20190920.7 - failed
- scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) - failed
- scikit-learn.scikit-learn (Windows py37_conda_mkl) - failed
- scikit-learn.scikit-learn (macOS pylatest_conda_mkl) - failed
- ci/circleci: doc - CircleCI is running your tests
- ci/circleci: doc-min-dependencies - CircleCI is running your tests
- LGTM analysis: C/C++ - no code changes detected
- LGTM analysis: JavaScript - no code changes detected
- LGTM analysis: Python - no new or fixed alerts
- ci/circleci: lint - passed
- scikit-learn.scikit-learn (Linux py35_conda_openblas) - succeeded
- scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) - succeeded
- scikit-learn.scikit-learn (Linux pylatest_conda_mkl) - succeeded
- scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas) - succeeded
- scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) - succeeded
@glemaitre (Contributor) commented Sep 20, 2019

@notmatthancock I made the small change requested by @adrinjalali and merged.
Thanks a lot for your contribution.

glemaitre moved this from TO BE MERGED to MERGED in Guillaume's pet, Sep 20, 2019
@notmatthancock (Author) commented Sep 20, 2019

Likewise @glemaitre, thanks to you and @jnothman for the helpful review comments.
