
Added subsampling to RandomForest #5963

Closed
wants to merge 3 commits

Conversation

@DrDub

DrDub commented Dec 5, 2015

The current implementation uses bagging with the number of selected
samples equal to the whole training set. This patch adds a parameter
to enable training the trees in the decision forest on a smaller
number of training samples.

For example, if you have 1 million training instances, the current
implementation will train each decision tree in the forest on
1 million random instances drawn from the original set (with
replacement). With this patch, it is possible to train each tree on,
for example, 50% of the overall training data.

To avoid modifying the signature of the fit(X, y) method, the number
of samples used in each estimator is passed to the constructor; this
value can be set to a floating point number between 0.0 and 1.0 to
refer to a fraction of the total training data.

Minor changes to ensure PEP8 compliance are also included, plus
extended test cases.
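The int-or-fraction semantics described above can be sketched with a hypothetical helper (`resolve_n_samples` is an illustrative name, not the patch's actual code): an integer is taken as an absolute count, a float as a fraction of the training set, and each tree then draws that many indices with replacement.

```python
# Hypothetical sketch of the described semantics, not the actual patch code.
import numbers

import numpy as np


def resolve_n_samples(max_samples, n_total):
    """Number of bootstrap samples each tree draws from n_total instances."""
    if isinstance(max_samples, numbers.Integral):
        n = int(max_samples)          # absolute count
    else:
        n = int(round(max_samples * n_total))  # fraction of the data
    if not 0 < n <= n_total:
        raise ValueError("max_samples must resolve to a value in (0, n_total]")
    return n


rng = np.random.RandomState(0)
n = resolve_n_samples(0.5, 1_000_000)    # half the training set per tree
indices = rng.randint(0, 1_000_000, n)   # drawn with replacement, as in bagging
```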

@glouppe


glouppe commented Dec 5, 2015

Thanks! Could you rename this added parameter to max_samples and mimic what is done in BaggingClassifier, where this functionality is also implemented? The goal is to have a consistent API across estimators implementing the same behaviour. There might also be code to factor out.

@DrDub


DrDub commented Dec 5, 2015

Very nice. Yes, I just added the two changes; it looks much nicer now. There might be some code to factor out (which makes sense even at the conceptual level, as bagging and random forests trace back to the same key researchers), but refactoring is outside what I'm comfortable doing given my current understanding of the code. Maybe this change can be merged, and I can then create an enhancement request for others to look into refactoring opportunities? Thanks so much for looking into this PR so promptly!

@amueller


amueller commented Sep 14, 2016

@DrDub sorry for the slow turn-around. Could you please rebase?

@DrDub DrDub force-pushed the DrDub:random-forest-subsampling branch from 1989580 to c0f0a42 Nov 3, 2016
The current implementation uses bagging with the number of selected
samples equal to the whole training set. This patch adds a parameter
to enable training the trees in the decision forest on a smaller
number of training samples.

For example, if you have 1 million training instances, the current
implementation will train each decision tree in the forest on
1 million random instances drawn from the original set (with
replacement). With this patch, it is possible to train each tree on,
for example, 50% of the overall training data.

Following the example in the Bagging class, the number of estimator
samples is passed as the parameter max_samples to the constructor.

Minor changes to ensure PEP8 compliance are also included, plus
extended test cases.
@DrDub DrDub force-pushed the DrDub:random-forest-subsampling branch from c0f0a42 to a361f67 Nov 3, 2016
@DrDub


DrDub commented Nov 3, 2016

Rebased.


jnothman left a comment

At a skim, this looks good. It would be nice if we had a stronger test. We should also aim for test coverage of the error messages.

clf.fit(X, y)
assert_array_equal(clf.predict(T), true_result)
assert_equal(10, len(clf))

clf = ForestClassifier(n_estimators=10, max_features=1, random_state=1)
clf = ForestClassifier(n_estimators=10, max_features=1, random_state=1, max_samples=max_samples)

jnothman Nov 6, 2016

please limit lines to 79 chars

DrDub Nov 26, 2016

Thanks for the review. I have fixed the line lengths.

I'm open to contribute better tests in a separate PR, I can @ you for it, if you'd be interested in reviewing it.

jnothman Nov 28, 2016

Why should we leave testing to a separate PR?

DrDub Nov 28, 2016

My PR has tests for the changes, in line with the existing tests in the class. You are suggesting separate work on improving the test cases. My interest in this PR is in bringing Random Forests in line with Breiman (2001) by subsampling training instances. This is very useful when training over large sets. I have used this code for my own experiments and it works; I hope others might benefit from this improvement, which brings the scikit-learn implementation in line with other ML libraries and with published work on RFs.

I offer to improve the test cases of the class as a compromise, seeing that this is something that interests you.

@jnothman


jnothman commented Nov 28, 2016

Tests are good engineering practice. I don't see why you feel the functionality doesn't need testing.

Sometimes it requires some imagination to design a test for machine learning. Here you could easily use an X that has a different single feature active for each sample, and ensure that a subsampled tree, built to fit the data perfectly, uses the correct number of features.
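The test idea sketched above can be illustrated with plain numpy in place of a real forest (a rough sketch, not the suggested test itself): when each sample activates exactly one distinct feature, a tree that fits a bootstrap subsample perfectly can only ever split on the features of the samples it actually saw.

```python
# Rough sketch of the test idea: a subsampled, perfectly-fit tree can use at
# most one feature per distinct sample it was trained on.
import numpy as np

rng = np.random.RandomState(42)
n_total = 100
X = np.eye(n_total)        # sample i has only feature i active
y = np.arange(n_total)     # every sample is its own class (would force a full fit)

max_samples = 0.5
n_sub = int(round(max_samples * n_total))
sampled = rng.randint(0, n_total, n_sub)   # bootstrap draw with replacement

# Features visible to a tree trained on this subsample:
seen_features = np.flatnonzero(X[sampled].sum(axis=0))
assert len(seen_features) == len(np.unique(sampled))
assert len(seen_features) <= n_sub
```

A real test along these lines would fit a single fully-grown tree on the subsample and compare the features it splits on against `seen_features`.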

@leandroleal


leandroleal commented Aug 28, 2017

Hi, I did this:

  • Merged with the current master code (sklearn/ensemble/forest.py);
  • Designed some test cases (sklearn/ensemble/tests/test_forest.py);

All the tests passed! :)

How can I send it to you?
I can't commit to this pull request.

This URL has my two patches: https://www.lapig.iesa.ufg.br/drive/index.php/s/kT1HAgwHt8XS2dT

@jnothman


jnothman commented Aug 29, 2017

@leandroleal


leandroleal commented Aug 29, 2017

Pull request created #9645

@@ -94,19 +96,28 @@ def _generate_unsampled_indices(random_state, n_samples):


def _parallel_build_trees(tree, forest, X, y, sample_weight, tree_idx, n_trees,
verbose=0, class_weight=None):
verbose=0, class_weight=None, max_samples=1.0):

Borda Jul 13, 2019

By max_samples I would naturally understand a total number, not a ratio... add an explanation, or rename it to nb_samples or used_samples?


# if max_samples is float:
if not isinstance(max_samples, (numbers.Integral, np.integer)):
max_samples = int(max_samples * X.shape[0])

Borda Jul 13, 2019

int(round(max_samples * len(X)))
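The `round` in the suggestion above matters because `int()` truncates, and a binary float product can land just below the intended integer, silently dropping a sample:

```python
# Why round() before int(): 0.29 is not exactly representable in binary
# floating point, so 0.29 * 100 evaluates to 28.999999999999996.
n_total = 100
assert 0.29 * n_total != 29
assert int(0.29 * n_total) == 28          # truncation drops a sample
assert int(round(0.29 * n_total)) == 29   # rounding gives the intended count
```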


def _set_oob_score(self, X, y):
"""Compute out-of-bag score"""
X = check_array(X, dtype=DTYPE, accept_sparse='csr')

n_classes_ = self.n_classes_
n_samples = y.shape[0]
max_samples = self.max_samples

# if max_samples is float:

Borda Jul 13, 2019

It is repeated; think about a separate function:

def __adjust_nb_samples(max_samples, nb_samples):
    if not isinstance(max_samples, (numbers.Integral, np.integer)):
        n_samples = int(round(max_samples * nb_samples))
    else:
        n_samples = max_samples

    if not (0 < n_samples <= nb_samples):
        raise ValueError("max_samples must be in (0, n_samples]")

    return n_samples
self.classes_.append(classes_k)
self.n_classes_.append(classes_k.shape[0])
y = y_store_unique_indices

if self.class_weight is not None:
valid_presets = ('auto', 'balanced', 'subsample', 'balanced_subsample')
valid_presets = ('auto', 'balanced', 'subsample',
                 'balanced_subsample')

Borda Jul 13, 2019

why two lines?

@@ -552,8 +579,8 @@ def predict_proba(self, X):
The predicted class probabilities of an input sample are computed as
the mean predicted class probabilities of the trees in the forest. The
class probability of a single tree is the fraction of samples of the same
class in a leaf.
class probability of a single tree is the fraction of samples of the
same class in a leaf.

@@ -651,7 +678,8 @@ def __init__(self,
n_jobs=1,
random_state=None,
verbose=0,
warm_start=False):
warm_start=False,
max_samples=1.0):

"max_features", "max_leaf_nodes", "min_impurity_split",
"random_state"),
"max_features", "max_leaf_nodes",
"min_impurity_split", "random_state"),

Borda Jul 13, 2019

is this formatting really needed?
