Balanced Random Forest #13227
Conversation
I think this needs more visibility: through user guide additions, an example, or a new parameter, for instance.
We need to find a compelling example when the BalancedRandomForest is actually useful.
We just made a new example in imbalanced-learn: I think that we could reuse it to show how to tackle the issues with imbalanced classes. I am thinking that this last estimator should also be included in scikit-learn since it has the same semantics and, in general, it showed to be effective.
I will add a shorter example and add some discussion in the user guide.
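The balanced-bootstrap idea under discussion can be sketched by hand with plain scikit-learn, which may help when writing the example: for each tree, draw a bootstrap in which every class contributes only as many samples as the minority class, then average the per-tree probabilities. This is an illustrative sketch, not the PR's implementation; the helper name `balanced_bootstrap_indices` is made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def balanced_bootstrap_indices(y, rng):
    """Draw a bootstrap where every class contributes as many samples
    as the minority class (sampled with replacement)."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    return np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=True)
        for c in classes
    ])

# toy imbalanced problem: roughly 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
rng = np.random.RandomState(0)

trees = []
for _ in range(25):
    idx = balanced_bootstrap_indices(y, rng)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# average the per-tree probabilities, as a forest does
proba = np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
```

Each tree here sees a perfectly balanced (but small) training set, which is the trade-off the benchmark at the bottom of this PR measures.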
I added some documentation in the user guide and an example.
I think this is ready for some reviews @adrinjalali @jnothman @ogrisel @amueller @NicolasHug
Took a quick glance, didn't read the code yet.
Question, not sure if that's completely related to this specific PR: if we compute the bootstrap subsamples based on the sample weights (i.e. the higher the weight, the more likely to be in the subsample), does it still make sense to take the weights into account when building the tree? My intuition tends toward no here.
I did not think about it but actually we could easily take the weight into account in the
well, this is probably another issue entirely, but I'm not sure it makes sense to both subsample based on the class imbalance and take SW into account. It seems to me that there are many many ways to do this wrong.
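The double-counting concern above can be made concrete with a few lines of numpy: if the bootstrap is drawn with probability proportional to `sample_weight`, the weights are already reflected in how often each sample appears, so passing them again to the tree's `fit` would count heavy samples twice. A minimal sketch of the sampling step:

```python
import numpy as np

rng = np.random.RandomState(0)
sample_weight = np.array([5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0])

# draw the bootstrap with probability proportional to the sample weights
p = sample_weight / sample_weight.sum()
idx = rng.choice(len(sample_weight), size=len(sample_weight),
                 replace=True, p=p)

# after the draw the weights are "spent": sample 0 is already over-represented
# in idx, and the zero-weight sample 9 can never be drawn, so fitting the tree
# on X[idx] with uniform weights already honours sample_weight
```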
So, a couple of comments after some discussions with @ogrisel:
An alternative could be to have a parameter that controls whether the class-weights are used to sample or not. That would allow most use-cases by just adding one boolean variable, right? Is there a case where we want to use weighted sampling and also reweight the classes after sampling?
We also need to check that the OOB score is properly computed, to not make the same mistake as in github.com/scikit-learn-contrib/imbalanced-learn/issues/655. An additional check should be added to detect this case, to be safe.
Another issue regarding sample weights is that we're calculating the class frequencies ignoring sample weights, which means that if there are enough samples with zero weight, the frequencies are quite off (I think).
Also, it'd be nice to have some tests for the new private functions/methods. The tests right now are kinda too general for my taste to check the correctness of this PR.
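The zero-weight concern is easy to demonstrate with `np.bincount`, which accepts a `weights` argument; this small sketch (not code from the PR) shows how the unweighted and weighted class frequencies can even disagree on which class is the majority:

```python
import numpy as np

y = np.array([0, 0, 0, 0, 1, 1])
sample_weight = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# ignoring the weights, class 0 looks like the majority...
unweighted = np.bincount(y) / len(y)

# ...but once the zero weights are taken into account, class 1 dominates
weighted = np.bincount(y, weights=sample_weight)
weighted = weighted / weighted.sum()
```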
@@ -200,6 +200,9 @@ in bias::

Parameters
----------

Impactful parameters
maybe something like "main parameters"? I think we shouldn't imply the other parameters are not "impactful".
naturally favor the classes with the most samples given during ``fit``.

The :class:`RandomForestClassifier` provides a parameter `class_weight` with
the option `"balanced_bootstrap"` to alleviate the bias induces by the class
Suggested change:
- the option `"balanced_bootstrap"` to alleviate the bias induces by the class
+ the option `"balanced_bootstrap"` to alleviate the bias induced by the class
random-forest.

`class_weight="balanced"` and `class_weight="balanced_subsample"` provide
alternative balancing strategies which are not as efficient in case of large
Suggested change:
- alternative balancing strategies which are not as efficient in case of large
+ alternative balancing strategies which are not as efficient as `class_weight="balanced_bootstrap"` in case of large
It was hard to parse the first time I read the sentence, this may help
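For context when reviewing this wording: `"balanced"` and `"balanced_subsample"` already exist in released scikit-learn, while `"balanced_bootstrap"` is the option this PR proposes. A quick sketch with the existing options:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1],
                           random_state=0)

# "balanced" computes the class weights once from the full y;
# "balanced_subsample" recomputes them inside every bootstrap sample
clf = RandomForestClassifier(n_estimators=50,
                             class_weight="balanced_subsample",
                             random_state=0).fit(X, y)
```

Both existing options reweight the loss rather than resample, which is exactly the distinction the new paragraph tries to draw.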
.. note::
    Be aware that `sample_weight` will be taken into account when setting
    `class_weight="balanced_bootstrap"`. Thus, it is recommended to not manually
    balanced the dataset using `sample_weight` and use
Suggested change:
- balanced the dataset using `sample_weight` and use
+ balance the dataset using `sample_weight` and use
`class_weight="balanced_bootstrap"`. Thus, it is recommended to not manually
balanced the dataset using `sample_weight` and use
`class_weight="balanced_bootstrap"` at the same time.
Maybe:
Thus balancing the dataset using `sample_weight` and using `class_weight="balanced_bootstrap"`
at the same time is not recommended.
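The reason the note matters is visible in how the "balanced" heuristic is computed; `compute_class_weight` is real public scikit-learn API and shows that any manual pre-balancing via `sample_weight` would stack on top of these weights:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 90/10 imbalanced labels
y = np.array([0] * 90 + [1] * 10)

# the "balanced" heuristic assigns n_samples / (n_classes * n_c) to class c:
# class 0 gets 100 / (2 * 90), class 1 gets 100 / (2 * 10)
w = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
```

If `sample_weight` were also scaled to balance the classes, the minority class would effectively be up-weighted twice.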
sklearn/ensemble/_forest.py (outdated)
The maximum number of samples required in the bootstrap sample.
balanced_bootstrap : bool
    Whether or not the class counts should be balanced in the bootstrap
y : ndarray of shape (n_samples,) or (n_samples, 1)
Suggested change:
- y : ndarray of shape (n_samples,) or (n_samples, 1)
+ y : array-like of shape (n_samples,) or (n_samples, 1)
sklearn/ensemble/_forest.py (outdated)
For multi-output, the weights of each column of y will be multiplied.

Note that these weights will be multiplied with sample_weight (passed
through the fit method) if sample_weight is specified.

.. versionadded:: 0.23
Suggested change:
- .. versionadded:: 0.23
+ .. versionchanged:: 0.23
maybe?
sklearn/ensemble/_forest.py (outdated)
@@ -1575,7 +1670,7 @@ class ExtraTreesClassifier(ForestClassifier):
new forest. See :term:`the Glossary <warm_start>`.

class_weight : dict, list of dicts, "balanced", "balanced_subsample" or \
    None, optional (default=None)
Suggested change:
- None, optional (default=None)
+ None, default=None
def test_forest_balanced_bootstrap_max_samples():
    # check that we take the minimum between max_samples and the minimum
we could also have a test where max_samples is set in two models in a way that the outcome is identical in both cases.
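A hedged sketch of what such a test might look like; the `balanced_bootstrap` helper below is hypothetical (it is not the PR's actual private function), it merely mirrors the behaviour the comment describes, i.e. taking the minimum of `max_samples` and the minority-class count:

```python
import numpy as np

def balanced_bootstrap(y, max_samples, rng):
    """Hypothetical helper: each class contributes
    min(max_samples, minority-class count) samples."""
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = min(max_samples, counts.min())
    return np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
        for c in classes
    ])

def test_forest_balanced_bootstrap_max_samples():
    rng = np.random.RandomState(0)
    y = np.array([0] * 100 + [1] * 10)
    # max_samples above the minority count: capped at 10 per class
    assert len(balanced_bootstrap(y, 50, rng)) == 20
    # max_samples below the minority count: 5 per class
    assert len(balanced_bootstrap(y, 5, rng)) == 10

test_forest_balanced_bootstrap_max_samples()
```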
I am putting this work on hold for a bit. We need more work regarding the examples linked to the ROC and Precision/Recall curves; it leads us to investigate the need for a
@glemaitre can you explain the connection? I would have just used roc_auc and average_precision and not worried about the threshold, as it's a somewhat orthogonal issue.
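The threshold-free evaluation suggested here is straightforward with existing scikit-learn metrics; a small sketch on a synthetic imbalanced problem:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          random_state=0)

proba = (RandomForestClassifier(n_estimators=50, random_state=0)
         .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

# both metrics score the ranking induced by predict_proba,
# so no decision cut-off comes into play
auc = roc_auc_score(y_te, proba)
ap = average_precision_score(y_te, proba)
```

This is why the two concerns can be decoupled: the cut-off only matters once hard predictions are needed.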
With a
Isn't this, or some variant, implemented in scikit-learn-contrib/imbalanced-learn#459? If so, we could close.
I think we need to improve the scikit-learn documentation to tell our users how to deal with imbalanced classification problems, both to avoid model evaluation pitfalls and to improve the training of the models on imbalanced data. The example proposed in this PR is a good first step in that direction. We could rewrite it to use existing imbalanced-learn models as a first step. As @amueller suggested, we should decouple the problem of the choice of the cut-off from the problem of improving the training of the model; using metrics that do not depend on the cut-off, such as ROC AUC and average precision, is a good idea to not have to wait for #16525. Finally, I also think we could provide default implementations for methods that are direct extensions of existing scikit-learn models, such as random forests and bagging classifiers with a built-in option to do subsampling of the majority class.
+1. In the former you actually need a trained model, right?
This variant, which performs resampling only on the majority class, is called UnderBagging.
Sounds reasonable to me, but since scikit-learn reviewers tend to be conservative, maybe starting with an implementation in imbalanced-learn is a good idea. Then we can do a first PR in scikit-learn that just documents the problem of dealing with imbalanced problems and shows how to use imbalanced-learn. Then, in a second step, we can explore whether or not we want to move under-bagging upstream into scikit-learn.
Reference Issues/PRs
Fixes #8607. Takes over #5181 and #8732.
What does this implement/fix? Explain your changes.
Actually I mainly used the @potash changes from #5181 and performed comparisons of variations of random forests on a standard benchmark that we have in imbalanced-learn. The Balanced Random Forest is triggered using the `balanced_bootstrap` option in `class_weight`.

Any other comments?
Regarding the experiment: I built all forests using 100 trees and performed 5-fold cross-validation. The details of the datasets used can be found here.
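For reference, the shape of such a benchmark can be sketched with `cross_validate`, which reports both the ROC AUC and the fit time that the tables below rank. This sketch only covers the `class_weight` options already released in scikit-learn (the `balanced_bootstrap` variant is what this PR adds), on a synthetic dataset rather than the benchmark datasets:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)

# score and time each variant, mirroring the ROC AUC / fit-time tables
results = {}
for cw in (None, "balanced", "balanced_subsample"):
    clf = RandomForestClassifier(n_estimators=100, class_weight=cw,
                                 random_state=0)
    cv = cross_validate(clf, X, y, cv=5, scoring="roc_auc")
    results[str(cw)] = (cv["test_score"].mean(), cv["fit_time"].mean())
```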
The following table contains the performance of the variations on different datasets in terms of ROC AUC, where `brf` denotes the new `balanced_bootstrap` option.

The average rankings across the datasets are shown in the following table. As we can see, all forests perform similarly (the lower the ranking, the better).
The average fit time of each forest on each dataset is presented in the following table.

The average rankings for the fit times across the datasets are shown in the following table. We can observe that the Balanced Random Forest is almost always the fastest.

So, overall I think that it could be a nice addition.