[WIP] Balanced Random Forest #8732

Closed
wants to merge 17 commits into from

@massich
Contributor

massich commented Apr 12, 2017

Reference Issue

Fixes #8607

What does this implement/fix? Explain your changes.

This PR takes over #5181 (and #8728).

Tasks to be performed

@MechCoder

Member

MechCoder commented Apr 20, 2017

Can you provide a summary of what exactly is left to do in the PR description? Thanks!

@potash

potash commented May 17, 2017

@massich check out my branch feature/balanced-random-forest-api. The changes are:

  1. Followed the discussion of @glemaitre @arjoly @amueller in #8607: removed the ad-hoc support for multioutput balanced random forest, and an error is now raised when it is attempted.

  2. Added unit tests for the two BRF helper methods to test_balanced_random_forest.py. It wasn't obvious to me which of the existing test files they belong in, so feel free to move them.

  3. I changed the API to be class_weight="balanced_bootstrap" as discussed in #8607.
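
For readers unfamiliar with the idea behind `class_weight="balanced_bootstrap"`, here is a minimal sketch of a balanced bootstrap draw. It assumes the option means that each tree is trained on a bootstrap sample containing the same number of rows from every class, drawn at the size of the smallest class. The function name `balanced_bootstrap_indices` is hypothetical and is not one of the helper methods in this PR.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap_indices(y, rng):
    """Draw one balanced bootstrap sample and return the chosen row indices.

    Hypothetical helper: every class contributes as many draws (with
    replacement) as the smallest class has members.
    """
    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)

    # The smallest class fixes the per-class sample size.
    n_per_class = min(len(indices) for indices in by_class.values())

    sample = []
    for label in sorted(by_class):
        # rng.choices samples with replacement, as a bootstrap requires.
        sample.extend(rng.choices(by_class[label], k=n_per_class))
    return sample


# 95 majority rows and 5 minority rows.
y = [0] * 95 + [1] * 5
rng = random.Random(0)
sample = balanced_bootstrap_indices(y, rng)
counts = Counter(y[i] for i in sample)
print(counts)
```

With 95 majority and 5 minority rows, each class contributes 5 draws, so every tree would see a 50/50 sample regardless of the original ratio.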

Please let me know what is left to get this merged.

@massich

Contributor Author

massich commented May 18, 2017

@potash I am benchmarking the estimator here. My idea for the benchmark is:

  • Using sklearn datasets:
    • Create a synthetic dataset and go from balanced to highly unbalanced to see when BRF is beneficial
    • Repeat the experiment with the breast cancer dataset in scikit-learn.
  • Using imbalanced-learn:
    • Test against their selection of imbalanced datasets
  • Using openML:
    • Explore some imbalanced datasets
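
The first item of the plan above (going from balanced to highly unbalanced synthetic data) could be sketched roughly as follows. This is an illustrative outline only, not the benchmark from this PR: the helper names `sweep_imbalance` and `plain_forest`, the dataset sizes, and the choice of balanced accuracy as the metric are all assumptions. Only a plain `RandomForestClassifier` baseline is run; the BRF estimator would be passed in as a second factory.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split


def sweep_imbalance(minority_fractions, make_estimator, n_samples=2000, seed=0):
    """Return {minority_fraction: balanced accuracy} for one estimator factory."""
    scores = {}
    for frac in minority_fractions:
        # weights controls the class proportions; flip_y=0 avoids label noise
        # that would otherwise blur the requested ratio.
        X, y = make_classification(
            n_samples=n_samples,
            n_features=10,
            n_informative=5,
            weights=[1.0 - frac, frac],
            flip_y=0.0,
            random_state=seed,
        )
        # Stratify so the rare class appears in both halves.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.5, stratify=y, random_state=seed
        )
        clf = make_estimator().fit(X_train, y_train)
        scores[frac] = balanced_accuracy_score(y_test, clf.predict(X_test))
    return scores


def plain_forest():
    # Baseline; a balanced random forest factory would be swapped in here.
    return RandomForestClassifier(n_estimators=50, random_state=0)


results = sweep_imbalance([0.5, 0.2, 0.05, 0.01], plain_forest)
for frac, score in results.items():
    print(f"minority fraction {frac:.2f}: balanced accuracy {score:.3f}")
```

Running the same sweep with both factories and comparing the two score dictionaries would show at which ratio the balanced variant starts to help.
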
@potash

potash commented May 18, 2017

Sounds good. You'll want to merge feature/balanced-random-forest-api so you can work off the new api (class_weight="balanced_bootstrap") and merge brf-example as it's been updated there too. Let me know if I can help with the examples.

@amueller

Member

amueller commented May 18, 2017

There are some benchmarks here on a real dataset, and also a silly implementation of the feature using imblearn: https://github.com/amueller/applied_ml_spring_2017/blob/master/slides/aml-15-resampling-imbalanced-data.ipynb
You can see around Out[83] that this method is doing much better than any of the others.

@raghavrv raghavrv added the Sprint label Jun 3, 2017
@raghavrv raghavrv self-requested a review Jun 28, 2017
@geneorama

geneorama commented Nov 21, 2017

Hello there, is it possible to get an update on this? We're using this model in production (https://github.com/Chicago/lead-model), and as we prepare to go live it would be very helpful for deployment if this branch were in the standard scikit-learn library.

Thanks for all the great work here!

Also, let us know if there's something we can do to move this forward.

@amueller

Member

amueller commented Nov 21, 2017

This needs tests, documentation and examples. I'm a big fan of this method, so I'd be happy to see this moved forward. @massich are you still working on it? Would you like some help?
I liked using the mammography dataset: https://www.openml.org/d/310, see #9908 for a loader ;)

@glemaitre

Contributor

glemaitre commented Nov 21, 2017

In the meantime, we have the BalancedBaggingClassifier, which can be made to behave like a balanced random forest by setting max_features='auto', if I am not wrong.

@amueller

Member

amueller commented Nov 21, 2017

@glemaitre I believe you are right.

@massich

Contributor Author

massich commented Nov 22, 2017

Actually, it completely stalled. I did not even finish the benchmark. I was playing with openml but I didn't finish it. It has been sitting for 6 months.

We should definitely revive it.

@chkoar

Contributor

chkoar commented Jan 5, 2018

@massich what is the current status of this PR? Do you need a hand? According to a previous comment of @amueller this PR needs love, tests, documentation and examples, right?

@jnothman

Member

jnothman commented Feb 21, 2018

IMO it would be good if you helped complete this, @chkoar

@chkoar

Contributor

chkoar commented Feb 21, 2018

@jnothman That was the intention. If it is not picked up by anyone else, I will give it a go in a couple of weeks. @massich has already given me write access to his repos.

@potash

potash commented Feb 21, 2018

@chkoar let me know if there's anything I (original author of the feature) can do to help. Would be very happy to see this merged.

@chkoar

Contributor

chkoar commented Feb 18, 2019

@potash ok, thanks. Let's hope that it will be merged during the upcoming sprint.

@jnothman

Member

jnothman commented Feb 19, 2019

I think you should expect a little less. But let's hope it will be a lot closer to merge after the sprint.

@chkoar chkoar referenced this pull request Feb 22, 2019
@massich

Contributor Author

massich commented Feb 24, 2019

closing in favor of #13227. Thx @chkoar for taking over.

@massich massich closed this Feb 24, 2019
@adrinjalali adrinjalali added this to To do in Resampler Oct 22, 2019
Projects
Andy's pets
PR phase
Resampler
  
To do
9 participants