
Balanced Random Forest #13227

Closed · wants to merge 27 commits

Conversation

@chkoar (Contributor) commented Feb 22, 2019:

Reference Issues/PRs

Fixes #8607. Takes over #5181 and #8732.

What does this implement/fix? Explain your changes.

I mainly reused @potash's changes from #5181 and compared variations of random forests on a standard benchmark that we have in imbalanced-learn. The Balanced Random Forest is triggered by setting `class_weight="balanced_bootstrap"`.
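
As an illustration of what a balanced bootstrap does (a sketch, not the PR's actual implementation; the helper name is made up):

```python
import numpy as np

def balanced_bootstrap_indices(y, rng):
    """Draw a bootstrap sample in which every class contributes
    an equal number of samples (the size of the minority class)."""
    classes, counts = np.unique(y, return_counts=True)
    n_minority = counts.min()
    indices = []
    for cls in classes:
        cls_idx = np.flatnonzero(y == cls)
        # sample with replacement, n_minority draws per class
        indices.append(rng.choice(cls_idx, size=n_minority, replace=True))
    return np.concatenate(indices)

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)  # 9:1 imbalance
idx = balanced_bootstrap_indices(y, rng)
# the resampled target is perfectly balanced: 10 draws per class
print(np.bincount(y[idx]))  # → [10 10]
```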

Any other comments?

Regarding the experiment: I built all forests with 100 trees and performed 5-fold cross-validation. The details of the datasets used can be found here.
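
The protocol can be sketched with released scikit-learn options (the brf variant itself requires this PR's branch; the synthetic dataset below is only a stand-in for the benchmark datasets):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for one imbalanced benchmark dataset
X, y = make_classification(n_samples=500, weights=[0.9, 0.1],
                           random_state=0)

forests = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "rf_balanced": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0),
    "rf_balanced_subsample": RandomForestClassifier(
        n_estimators=100, class_weight="balanced_subsample", random_state=0),
}
for name, clf in forests.items():
    # 100 trees, 5-fold CV, scored with ROC AUC as in the PR benchmark
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: {scores.mean():.3f}")
```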

The following table contains the performance of the variations on different datasets in terms of ROC AUC. The column `brf` denotes the new `balanced_bootstrap` option.

| dataset_name | brf | rf | rf_balanced | rf_balanced_subsample |
|---|---|---|---|---|
| abalone | 0.855309 | 0.830593 | 0.834113 | 0.833721 |
| abalone_19 | 0.812186 | 0.682914 | 0.694902 | 0.732027 |
| arrhythmia | 0.906580 | 0.951122 | 0.974703 | 0.972892 |
| car_eval_34 | 0.983932 | 0.935592 | 0.956367 | 0.953385 |
| car_eval_4 | 0.958985 | 0.938677 | 0.963698 | 0.964530 |
| coil_2000 | 0.742975 | 0.696293 | 0.699116 | 0.694468 |
| ecoli | 0.905476 | 0.895000 | 0.910952 | 0.906905 |
| isolet | 0.989580 | 0.992651 | 0.991919 | 0.991755 |
| letter_img | 0.999549 | 0.999833 | 0.999830 | 0.999850 |
| libras_move | 0.953613 | 0.932691 | 0.958367 | 0.956466 |
| mammography | 0.956944 | 0.938190 | 0.909950 | 0.910677 |
| oil | 0.873508 | 0.851528 | 0.879456 | 0.850017 |
| optical_digits | 0.995662 | 0.997528 | 0.998007 | 0.997893 |
| ozone_level | 0.880928 | 0.844524 | 0.859265 | 0.873991 |
| pen_digits | 0.999836 | 0.999851 | 0.999851 | 0.999845 |
| protein_homo | 0.984600 | 0.965014 | 0.970406 | 0.962937 |
| satimage | 0.933214 | 0.936549 | 0.936868 | 0.933069 |
| scene | 0.780679 | 0.722448 | 0.774284 | 0.773774 |
| sick_euthyroid | 0.982234 | 0.975710 | 0.984795 | 0.983522 |
| solar_flare_m0 | 0.748597 | 0.724919 | 0.695480 | 0.687116 |
| spectrometer | 0.975155 | 0.973201 | 0.973088 | 0.984658 |
| thyroid_sick | 0.993552 | 0.996121 | 0.996836 | 0.996300 |
| us_crime | 0.919144 | 0.915725 | 0.911504 | 0.915458 |
| webpage | 0.803485 | 0.901795 | 0.789156 | 0.788862 |
| wine_quality | 0.835976 | 0.827878 | 0.805647 | 0.809426 |
| yeast_me2 | 0.934917 | 0.929296 | 0.924718 | 0.913895 |
| yeast_ml8 | 0.611755 | 0.586060 | 0.600982 | 0.606157 |

The average rankings across the datasets are shown in the following table.
As we can see, all forests perform similarly (lower is better).

| Forest Name | Average Rank |
|---|---|
| brf | 2.148148 |
| rf_balanced | 2.259259 |
| rf_balanced_subsample | 2.703704 |
| rf | 2.888889 |
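
For reference, average ranks like these can be computed with pandas (shown on a hypothetical three-dataset excerpt of the ROC AUC table above):

```python
import pandas as pd

# ROC AUC per dataset (rows) and forest (columns); three-dataset excerpt
scores = pd.DataFrame(
    {"brf": [0.855, 0.812, 0.907],
     "rf": [0.831, 0.683, 0.951],
     "rf_balanced": [0.834, 0.695, 0.975]},
    index=["abalone", "abalone_19", "arrhythmia"])

# rank within each dataset (1 = best ROC AUC), then average over datasets
ranks = scores.rank(axis=1, ascending=False)
print(ranks.mean().sort_values())
```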

The average fit time of each forest for each dataset is presented in the following table.

| dataset_name | brf | rf | rf_balanced | rf_balanced_subsample |
|---|---|---|---|---|
| abalone | 0.422506 | 1.042681 | 0.864763 | 1.263026 |
| abalone_19 | 0.309891 | 0.659666 | 0.594561 | 1.007489 |
| arrhythmia | 0.308522 | 0.574814 | 0.461610 | 0.612353 |
| car_eval_34 | 0.247720 | 0.263163 | 0.262382 | 0.540598 |
| car_eval_4 | 0.309501 | 0.366200 | 0.368351 | 0.578528 |
| coil_2000 | 1.001036 | 3.873933 | 3.152484 | 6.266058 |
| ecoli | 0.299333 | 0.288385 | 0.298355 | 0.323187 |
| isolet | 4.061236 | 18.669528 | 22.482264 | 25.195425 |
| letter_img | 0.761140 | 2.690679 | 2.405421 | 4.009034 |
| libras_move | 0.343714 | 0.348212 | 0.415275 | 0.404716 |
| mammography | 0.592800 | 1.921325 | 1.498232 | 2.192701 |
| oil | 0.319863 | 0.580092 | 0.492893 | 0.652238 |
| optical_digits | 0.783819 | 1.477507 | 1.128903 | 1.812815 |
| ozone_level | 0.357791 | 1.447202 | 1.026845 | 1.307213 |
| pen_digits | 0.959196 | 2.243534 | 2.317744 | 4.168145 |
| protein_homo | 6.654938 | 288.988578 | 145.843798 | 162.089329 |
| satimage | 0.888224 | 1.667351 | 2.026317 | 2.429273 |
| scene | 0.726925 | 4.679455 | 3.539798 | 3.904238 |
| sick_euthyroid | 0.320645 | 0.502668 | 0.484485 | 0.736504 |
| solar_flare_m0 | 0.347235 | 0.446362 | 0.429545 | 0.671594 |
| spectrometer | 0.241266 | 0.436779 | 0.355837 | 0.446164 |
| thyroid_sick | 0.516746 | 0.962520 | 0.977183 | 1.214147 |
| us_crime | 0.430915 | 1.081393 | 1.015309 | 1.199680 |
| webpage | 2.543454 | 43.693878 | 25.273632 | 29.456088 |
| wine_quality | 0.432087 | 1.762763 | 1.164291 | 1.587972 |
| yeast_me2 | 0.306763 | 0.368546 | 0.358770 | 0.563083 |
| yeast_ml8 | 0.616068 | 6.964830 | 2.893425 | 2.827342 |

The average rankings for fit time across the datasets are shown in the following table.
We can observe that the Balanced Random Forest is almost always the fastest.

| Forest Name | Average Rank |
|---|---|
| brf | 1.074074 |
| rf_balanced | 2.296296 |
| rf | 2.925926 |
| rf_balanced_subsample | 3.703704 |

So, overall I think it could be a nice addition.

@jnothman (Member) left a comment:

I think this needs more visibility, through user guide additions, an example, or a new parameter, for instance.

@massich mentioned this pull request Feb 24, 2019
@glemaitre (Member) left a comment:

We need to find a compelling example where the BalancedRandomForest is actually useful.

(Review threads on `sklearn/ensemble/forest.py` and `sklearn/ensemble/tests/test_forest.py` — resolved)
@glemaitre glemaitre self-assigned this Nov 19, 2019
@glemaitre (Member) commented:

We just made a new example in imbalanced-learn:
https://imbalanced-learn.org/dev/auto_examples/applications/plot_impact_imbalanced_classes.html

I think we could reuse it to show how to tackle imbalanced classes using `class_weight`, which addresses the learning issue. From the current example, we only have to remove the "sampler" part and the `BalancedBaggingClassifier`.

I am thinking that this last estimator should also be included in scikit-learn, since it has the same semantics and, in general, it has been shown to be effective when using a strong base_estimator.

@glemaitre (Member) commented:

I will add a shorter example and some discussion in the user guide.

@glemaitre (Member) left a comment:

I added some documentation in the user guide and an example.

@glemaitre glemaitre requested review from glemaitre and removed request for glemaitre November 20, 2019 13:45
@glemaitre (Member) commented:

I think this is ready for some reviews @adrinjalali @jnothman @ogrisel @amueller @NicolasHug

@NicolasHug (Member) left a comment:

Took a quick glance, didn't read the code yet.

Question, not sure if that's completely related to this specific PR: if we compute the bootstrap subsamples based on the sample weights (i.e. the higher the weight, the more likely a sample is to be in the subsample), does it still make sense to take the weights into account when building the tree? My intuition tends toward no here.

(Review threads on `doc/modules/ensemble.rst`, `doc/whats_new/v0.23.rst`, and `sklearn/ensemble/_forest.py` — resolved)
@glemaitre (Member) commented:

> Question, not sure if that's completely related to this specific PR: if we compute the bootstrap subsamples based on the sample weights (i.e. the higher the weight, the more likely a sample is to be in the subsample), does it still make sense to take the weights into account when building the tree? My intuition tends toward no here.

I had not thought about it, but actually we could easily take the weights into account in the random.choice call using `p`. It would increase the chance of picking certain samples within a class, which might be a use case.
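
The random.choice idea mentioned here can be sketched with NumPy's `p` argument:

```python
import numpy as np

rng = np.random.default_rng(0)
sample_weight = np.array([0.0, 0.0, 1.0, 1.0, 2.0])

# normalize weights into sampling probabilities for the bootstrap draw
p = sample_weight / sample_weight.sum()
idx = rng.choice(len(sample_weight), size=1000, replace=True, p=p)

# zero-weight samples never enter the bootstrap sample, and
# sample 4 is drawn roughly twice as often as samples 2 and 3
counts = np.bincount(idx, minlength=5)
print(counts)  # counts[0] == counts[1] == 0
```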

@NicolasHug (Member) commented:

Well, this is probably another issue entirely, but I'm not sure it makes sense to both subsample based on the class imbalance and take sample_weight into account. It seems to me that there are many ways to do this wrong.

@glemaitre (Member) commented:

So a couple of comments after some discussions with @ogrisel:

  • It should be more efficient to resample X instead of passing sample_weight to the underlying trees. With weights, the trees iterate over all samples and multiply by a zero weight, whereas with resampling the discarded samples are simply absent from the data.
  • The second issue is linked to the API. We currently cannot specify anything other than a truly balanced ratio for the classes. We might want to accept bootstrap="balanced" or bootstrap={0: xxx, 1: xxx} instead, to be able to specify the class imbalance ratio.
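
The first point can be illustrated: a tree fitted with zero sample weights learns the same decision function as one fitted on the physically resampled data, but the latter never has to iterate over the discarded rows. A minimal sketch:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.linspace(-1, 1, 20).reshape(-1, 1)
y = (X.ravel() > 0).astype(int)

# drop every other sample via a zero weight ...
w = np.tile([1.0, 0.0], 10)
tree_w = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w)

# ... versus physically removing those samples from X
keep = w > 0
tree_r = DecisionTreeClassifier(random_state=0).fit(X[keep], y[keep])

# both trees learn the same decision function, but the second fit
# never touches the discarded rows
test = np.array([[-0.5], [0.5]])
print(tree_w.predict(test), tree_r.predict(test))  # [0 1] [0 1]
```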

@amueller (Member) commented Dec 3, 2019:

An alternative could be a parameter that controls whether the class weights are used for sampling. That would cover most use-cases by adding a single boolean, right?
Alternatively, we could add resample_weights in addition to class_weights. If someone sets both to 'balanced', I'm not sure what the behavior should be, though.

Is there a case where we want to use weighted sampling and also reweight the classes after sampling?

@glemaitre (Member) commented:

We also need to check that the OOB score is properly computed, so we don't make the same mistake as in https://github.com/scikit-learn-contrib/imbalanced-learn/issues/655.

An additional test should be added to detect this case, to be safe.
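
The invariant such a test would check can be sketched: the out-of-bag set must be the complement of the indices actually drawn for the bootstrap sample, rather than assumed from the nominal bootstrap size.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 8
# indices drawn for one (balanced or plain) bootstrap sample
sampled = rng.choice(n_samples, size=n_samples, replace=True)

# the OOB set is derived from the indices actually drawn
oob = np.setdiff1d(np.arange(n_samples), sampled)
print(sampled, oob)
```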

@adrinjalali (Member) left a comment:

Another issue regarding sample weights is that we're calculating the class frequencies while ignoring sample weights, which means that with enough samples having a zero sample weight, the frequencies are quite off (I think).

Also, it'd be nice to have some tests for the new private functions/methods. The tests right now are kinda too general for my taste to check the correctness of this PR.

@@ -200,6 +200,9 @@ in bias::
Parameters
----------

Impactful parameters
Review comment (Member):

maybe something like "main parameters"? I think we shouldn't imply that the other parameters are not "impactful".

naturally favor the classes with the most samples given during ``fit``.

The :class:`RandomForestClassifier` provides a parameter `class_weight` with
the option `"balanced_bootstrap"` to alleviate the bias induces by the class
Review comment (Member):

Suggested change
the option `"balanced_bootstrap"` to alleviate the bias induces by the class
the option `"balanced_bootstrap"` to alleviate the bias induced by the class

random-forest.

`class_weight="balanced"` and `class_weight="balanced_subsample"` provide
alternative balancing strategies which are not as efficient in case of large
Review comment (Member):

Suggested change
alternative balancing strategies which are not as efficient in case of large
alternative balancing strategies which are not as efficient as `class_weight="balanced_bootstrap"` in case of large

It was hard to parse the first time I read the sentence, this may help

.. note::
Be aware that `sample_weight` will be taken into account when setting
`class_weight="balanced_bootstrap"`. Thus, it is recommended to not manually
balanced the dataset using `sample_weight` and use
Review comment (Member):

Suggested change
balanced the dataset using `sample_weight` and use
balance the dataset using `sample_weight` and use

Comment on lines +260 to +262
`class_weight="balanced_bootstrap"`. Thus, it is recommended to not manually
balanced the dataset using `sample_weight` and use
`class_weight="balanced_bootstrap"` at the same time.
Review comment (Member):

Maybe:

Thus balancing the dataset using `sample_weight` and using `class_weight="balanced_bootstrap"`
at the same time is not recommended.

The maximum number of samples required in the bootstrap sample.
balanced_bootstrap : bool
Whether or not the class counts should be balanced in the bootstrap
y : ndarray of shape (n_samples,) or (n_samples, 1)
Review comment (Member):

Suggested change
y : ndarray of shape (n_samples,) or (n_samples, 1)
y : array-like of shape (n_samples,) or (n_samples, 1)

(Review thread on `sklearn/ensemble/_forest.py` — resolved)
For multi-output, the weights of each column of y will be multiplied.

Note that these weights will be multiplied with sample_weight (passed
through the fit method) if sample_weight is specified.

.. versionadded:: 0.23
Review comment (Member):

Suggested change
.. versionadded:: 0.23
.. versionchanged:: 0.23

maybe?

@@ -1575,7 +1670,7 @@ class ExtraTreesClassifier(ForestClassifier):
new forest. See :term:`the Glossary <warm_start>`.

class_weight : dict, list of dicts, "balanced", "balanced_subsample" or \
None, optional (default=None)
None, optional (default=None)
Review comment (Member):

Suggested change
None, optional (default=None)
None, default=None



def test_forest_balanced_bootstrap_max_samples():
# check that we take the minimum between max_samples and the minimum
Review comment (Member):

we could also have a test where max_samples is set in two models in a way that the outcome is identical in both cases.

@glemaitre (Member) commented:

I am putting this work on hold for a bit. We need more work on the examples linked to the ROC and Precision/Recall curves; it has led us to investigate the need for a CutoffClassifier, which we are implementing and discussing in #16525.

@amueller (Member) commented:

@glemaitre can you explain the connection? I would have just used roc_auc and average_precision and not worried about the threshold, as it's a somewhat orthogonal issue.

@glemaitre (Member) commented:

With a LogisticRegression, adding `class_weight="balanced"` would improve the balanced_accuracy_score but not the ROC AUC, and would make the average precision worse. We wanted to understand what the optimal cutoff threshold would be for the different metrics. Basically, we wanted to verify whether finding the optimal threshold would be as efficient as the class_weight strategy, and for which reason.
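
The described setup can be sketched on synthetic data (this only reproduces the experiment's structure, not its original figures; the exact scores depend on the dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# imbalanced binary problem: ~5% positive class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

results = {}
for cw in (None, "balanced"):
    clf = LogisticRegression(class_weight=cw, max_iter=1000).fit(X_tr, y_tr)
    results[cw] = (
        # threshold-dependent metric: shifts with class_weight
        balanced_accuracy_score(y_te, clf.predict(X_te)),
        # ranking metric: largely insensitive to reweighting
        roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]),
    )
    print(cw, results[cw])
```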

@lorentzenchr (Member) commented:

Isn't this, or some variant, implemented in scikit-learn-contrib/imbalanced-learn#459? If so, we could close.

@ogrisel (Member) commented Jan 13, 2022:

> Isn't this, or some variant, implemented in scikit-learn-contrib/imbalanced-learn#459? If so, we could close.

I think we need to improve the scikit-learn documentation to tell our users how to deal with imbalanced classification problems, both to avoid model evaluation pitfalls and to improve the training of models on imbalanced data.

The example proposed in this PR is a good first step in that direction. We could rewrite it to use existing imbalanced-learn models as a first step.

As @amueller suggested, we should decouple the problem of choosing the cut-off from the problem of improving the training of the model; using metrics that do not depend on the cut-off, such as ROC AUC and average precision, is a good way to avoid waiting for #16525.

Finally I also think we could provide default implementations for methods that are direct extensions of existing scikit-learn models, such as random forests and bagging classifiers with a built-in option to do subsampling of the majority class.

@chkoar (Contributor, Author) commented Jan 13, 2022:

> we should decouple the problem of the choice of the cut-off from the problem of improving the training of the model

+1. For the former you actually need a trained model, right?

> Finally I also think we could provide default implementations for methods that are direct extensions of existing scikit-learn models, such as random forests and bagging classifiers with a built-in option to do subsampling of the majority class.

This variant, which performs resampling only on the majority class, is called UnderBagging.
I plan to implement a variant in imbalanced-learn to replace the BalancedBaggingClassifier.
I could work on integrating the under-bagging approach into the original bagging estimator, if we want to.
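
The UnderBagging idea can be sketched for the binary case (the function names below are made up for illustration; this is not the imbalanced-learn implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def under_bagging_fit(X, y, n_estimators=10, seed=0):
    """Fit trees, each on the minority class plus an equally sized
    random subset (without replacement) of the majority class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    min_idx = np.flatnonzero(y == classes[counts.argmin()])
    maj_idx = np.flatnonzero(y == classes[counts.argmax()])
    estimators = []
    for _ in range(n_estimators):
        sub = rng.choice(maj_idx, size=min_idx.size, replace=False)
        idx = np.concatenate([min_idx, sub])
        estimators.append(
            DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return estimators

def under_bagging_predict(estimators, X):
    # majority vote over the ensemble, assuming binary 0/1 labels
    votes = np.stack([est.predict(X) for est in estimators])
    return (votes.mean(axis=0) >= 0.5).astype(int)

X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)
pred = under_bagging_predict(under_bagging_fit(X, y), X)
print(np.bincount(pred))
```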

@ogrisel (Member) commented Jan 14, 2022:

> I could work to integrate the under bagging approach in the original bagging estimator, if we want to.

Sounds reasonable to me, but since scikit-learn reviewers tend to be conservative, maybe starting with an implementation in imbalanced-learn is a good idea.

Then we can do a first PR in scikit-learn that just documents the problem of dealing with imbalanced data and shows how to use imbalanced-learn.

Then, in a second step, we can explore whether or not we want to move under-bagging upstream into scikit-learn.


Successfully merging this pull request may close these issues.

bootstrapping based on sample weights in random forests
8 participants