New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Refactor ratio to pick up any class #290

Merged
merged 53 commits into from Jun 10, 2017

Conversation

Projects
None yet
4 participants
@glemaitre
Member

glemaitre commented May 8, 2017

Reference Issue

Fixes #121

What does this implement/fix? Explain your changes.

Refactor the ratio parameter

TODO:

  • Need to be back compatible
  • Base class for: under-sampling and cleaning
  • Correct every test for array equality
  • Change the common test regarding the ratio
  • Add multi-class test for SMOTE, ADASYN, BalanceCascade, and others which were not binary
  • Correct the ratio docstring. 'auto' corresponds to 'all' for over-sampling and 'not minority' for under-sampling.
  • Add ratio docstring for cleaning methods.
  • Check that all ratio make sense for the methods:
    • 'all' makes no sense for OneSidedSelection
    • 'all' makes no sense for NeighborhoodCleaningRule
  • Make a proper logger
  • Add authorship
  • Add whats new entry
  • Remove a much comments as possible when it is not necessary
  • Remove the Counter occurrences when this is useless.
  • Remove X_shape_ from the documentation
  • Make a fast profiling to know if the hashing is an issue
  • Use deprecated directive from sphinx.
  • Check double quote in docstring

Any other comments?

@pep8speaks

This comment has been minimized.

pep8speaks commented May 8, 2017

Hello @glemaitre! Thanks for updating the PR.

  • In the file doc/conf.py, following are the PEP8 issues :

Line 30:1: E722 do not use bare except'
Line 41:1: E402 module level import not at top of file
Line 309:80: E501 line too long (86 > 79 characters)
Line 335:80: E501 line too long (84 > 79 characters)

Line 142:5: E722 do not use bare except'

Comment last updated on May 30, 2017 at 22:15 Hours UTC

glemaitre added some commits May 8, 2017

raise ValueError("When 'ratio' is a string, it needs to be one of"
" {}. Got '{}' instead.".format(RATIO_KIND,
ratio))
if ratio == 'all' or ratio == 'auto':

This comment has been minimized.

@chkoar

chkoar May 9, 2017

Member

IMHO we could avoid that if-elses by using a dict

@chkoar

This comment has been minimized.

Member

chkoar commented May 14, 2017

That's big PR @glemaitre!

@glemaitre

This comment has been minimized.

Member

glemaitre commented May 14, 2017

@chkoar yes indeed. I am fixing back the tests, right now.
Mainly, I still have to check the case of a function given for the ratio but we should be almost done code wise. I still have to go through all the docstring.

I get why we had trouble to put it out. It required lot of works

glemaitre added some commits May 14, 2017

@glemaitre

This comment has been minimized.

Member

glemaitre commented May 19, 2017

@chkoar @massich I think the PR is ready.

I removed the multiclass and binary mixins since we only have multiclass from now.
I think that any integration should be only be multiclass.

I still have the issue with the dictionary with the cleaning method thought.

@glemaitre

This comment has been minimized.

Member

glemaitre commented May 19, 2017

And it should be good to go fast on this one since any further development will benefit from those changes

@glemaitre

This comment has been minimized.

Member

glemaitre commented May 19, 2017

Hints: for a faster review process, you should actually focus on the base classes and mixin. All methods were adapted to use those and there is not real changes apart to make some of them multiclass.

glemaitre added some commits May 19, 2017

@glemaitre glemaitre changed the title from [MRG] Refactor ratio to pick up any class to [WIP] Refactor ratio to pick up any class May 21, 2017

X, y = check_X_y(X, y)
y = check_target_type(y)
self.X_hash_, self.y_hash_ = hash_X_y(X, y)
self.ratio_ = check_ratio(self.ratio, y, 'cleaning-sampling')

This comment has been minimized.

@chkoar

chkoar May 22, 2017

Member

clean-sampling might be more appropriate, no?

@chkoar

This comment has been minimized.

Member

chkoar commented May 22, 2017

That review it will take some time!

BaseUnderSampler, BaseCleaningSampler and BaseOverSampler have the same __init__ and almost same fit. We could use a parent class that will derive from the SamplerMixin lets say BaseSampler . In the fit in the check_ratio function as a third argument we will use a class variable that it will "become" an instance variable. So, in the base class we will have self.ratio_ = check_ratio(self.ratio, y, self.clean_kind) . Thus, in any child classes we only have to define the clean_kind class variable variable.

p.s. I proposed class variables in order to avoid to override the __init__. You can access class level variables from self.

@glemaitre

This comment has been minimized.

Member

glemaitre commented May 22, 2017

massich added some commits May 22, 2017

[MRG] Incorporate chkoar remarks (#6)
* change cleaning-sampler to clean-sampler

* Refactor the over_sampling

* [WIP] adapt ensamble class
[MRG] Remove the init in base class (#7)
* change cleaning-sampler to clean-sampler

* Refactor the over_sampling

* [WIP] adapt ensamble class

* iterate

* fix PEP8
@glemaitre

This comment has been minimized.

Member

glemaitre commented May 24, 2017

@massich @chkoar anything else?

@chkoar

This comment has been minimized.

Member

chkoar commented May 25, 2017

@glemaitre I will scan it this weekend!

if not type_of_target(y) == 'binary':
warnings.simplefilter('always', UserWarning)
warnings.warn('The target type should be binary.')
class BaseSampler(SamplerMixin):

This comment has been minimized.

@chkoar

chkoar May 29, 2017

Member

Couldn't we merge BaseSampler and SamplerMixin into BaseSampler?

This comment has been minimized.

@glemaitre

glemaitre May 29, 2017

Member

I am not in favor. I would like the sampler to have a SamplerMixin as the ClassifierMixin or TransformerMixin

-------
self : object,
Return self.
def _validate_size_ngh_deprecation(self):

This comment has been minimized.

@chkoar

chkoar May 29, 2017

Member

I think that we could make all these _vaildate_methods(estimator) as module level functions to keep the class clean. So, it will be clean for any newcomer.

This comment has been minimized.

@glemaitre

glemaitre May 29, 2017

Member

Yep I agree with that

y_hash: str
Hash identifier of the ``y`` matrix.
"""
return joblib.hash(X), joblib.hash(y)

This comment has been minimized.

@chkoar

chkoar May 29, 2017

Member

Is this fast for big arrays? Does it worth instead of the naive check of the array shape?

This comment has been minimized.

@glemaitre

glemaitre May 29, 2017

Member

it needs to be profiled. But I think that this is too slow. We could randomly pick some samples which should be better than the shape. But I am still unsure about that part.

glemaitre added some commits May 30, 2017

@glemaitre glemaitre changed the title from [WIP] Refactor ratio to pick up any class to [MRG] Refactor ratio to pick up any class May 30, 2017

@glemaitre

This comment has been minimized.

Member

glemaitre commented May 30, 2017

@massich

This comment has been minimized.

Contributor

massich commented Jun 8, 2017

LGTM

@chkoar

This comment has been minimized.

Member

chkoar commented Jun 8, 2017

@massich @chkoar Does anybody how to optimize this function:
https://github.com/scikit-learn-contrib/imbalanced-learn/pull/290/files#diff-d928243d0cfeca2e8bd7d6084e5018c6R244
without cythonizing at first.

What was the progress on that?

@chkoar

This comment has been minimized.

Member

chkoar commented Jun 8, 2017

In general this PR should have been splitted. Since, it hasn't and the tests are passing I propose to merge the changes. @glemaitre you put a lot of effort on this. We could improve or fix any bug from this PR in the feature. Thanks!

@glemaitre glemaitre merged commit c2c6565 into scikit-learn-contrib:master Jun 10, 2017

5 checks passed

ci/circleci Your tests passed on CircleCI!
Details
codecov/patch 99.35% of diff hit (target 98.24%)
Details
codecov/project 98.36% (+0.12%) compared to 987ddbc
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

glemaitre added a commit to glemaitre/imbalanced-learn that referenced this pull request Jun 15, 2017

[MRG] Refactor ratio to pick up any class (scikit-learn-contrib#290)
* EHN enable multiclass ratio handling

* FIX simplify call to dictionary

* FIX RUS done

* FIX Refactor ADASYN

* FIX partial

* FIX refactor SMOTE

* FIX refactor SMOTE

* DOC add proper docstring

* PEP8

* FIX ClusterCentroids

* FIX refactor IHT

* FIX Nearmiss refactoring

* FIX tomek links refactor

* FIX refactor OSS

* FIX NCR refactoring

* FIX refactor combined methods with Pipeline

* FIX combine method targetting all classes when cleaning

* FIX balance cascade refactoring

* EHN add the possibility to add a dict for ratio

* TST add test for check_ratio

* TST add test for float

* FIX/TST adapt common test

* TST fix IHT tests

* TST fix NCR

* FIX combine test

* TST fix balance

* FIX doctest

* FIX doctest

* FIX solve the pickle issue

* FIX remove comments

* TST add test for NCR

* TST add knn balance cascade

* EHN add callable option for the ratio

* DOC make doc cleaner

* FIX/DOC remove useless comments and clean doc

* DEP deprecation of ratio as float

* EHN add base class for cleaning methods

* TST add common test for multi class

* MAINT downgrade sphinx for the moment

* TST/EHN add test for the ratio and specific ratio for cleaning sampling

* EHN remove redundant code

* FIX warning

* Remove useless base class

* MAINT add christos back to some file

* EHN rename test and add a comment

* DOC add hash_X_y in the API

* [MRG] Incorporate chkoar remarks (#6)

* change cleaning-sampler to clean-sampler

* Refactor the over_sampling

* [WIP] adapt ensamble class

* [MRG] Remove the init in base class (#7)

* change cleaning-sampler to clean-sampler

* Refactor the over_sampling

* [WIP] adapt ensamble class

* iterate

* fix PEP8

* EHN doc

* FIX add extension for sphinx

* EHN make deprecatin great again

* EHN Improve SMOTE and ADASYN

glemaitre added a commit to glemaitre/imbalanced-learn that referenced this pull request Jun 15, 2017

[MRG] Refactor ratio to pick up any class (scikit-learn-contrib#290)
* EHN enable multiclass ratio handling

* FIX simplify call to dictionary

* FIX RUS done

* FIX Refactor ADASYN

* FIX partial

* FIX refactor SMOTE

* FIX refactor SMOTE

* DOC add proper docstring

* PEP8

* FIX ClusterCentroids

* FIX refactor IHT

* FIX Nearmiss refactoring

* FIX tomek links refactor

* FIX refactor OSS

* FIX NCR refactoring

* FIX refactor combined methods with Pipeline

* FIX combine method targetting all classes when cleaning

* FIX balance cascade refactoring

* EHN add the possibility to add a dict for ratio

* TST add test for check_ratio

* TST add test for float

* FIX/TST adapt common test

* TST fix IHT tests

* TST fix NCR

* FIX combine test

* TST fix balance

* FIX doctest

* FIX doctest

* FIX solve the pickle issue

* FIX remove comments

* TST add test for NCR

* TST add knn balance cascade

* EHN add callable option for the ratio

* DOC make doc cleaner

* FIX/DOC remove useless comments and clean doc

* DEP deprecation of ratio as float

* EHN add base class for cleaning methods

* TST add common test for multi class

* MAINT downgrade sphinx for the moment

* TST/EHN add test for the ratio and specific ratio for cleaning sampling

* EHN remove redundant code

* FIX warning

* Remove useless base class

* MAINT add christos back to some file

* EHN rename test and add a comment

* DOC add hash_X_y in the API

* [MRG] Incorporate chkoar remarks (#6)

* change cleaning-sampler to clean-sampler

* Refactor the over_sampling

* [WIP] adapt ensamble class

* [MRG] Remove the init in base class (#7)

* change cleaning-sampler to clean-sampler

* Refactor the over_sampling

* [WIP] adapt ensamble class

* iterate

* fix PEP8

* EHN doc

* FIX add extension for sphinx

* EHN make deprecatin great again

* EHN Improve SMOTE and ADASYN
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment