
[MRG] ENH: K-Means SMOTE implementation #435

Merged
merged 42 commits into from Jun 12, 2019

5 participants
@StephanHeijl
Contributor

commented Jun 22, 2018

What does this implement/fix? Explain your changes.

This pull request implements K-Means SMOTE, as described in "Oversampling for Imbalanced Learning Based on K-Means and SMOTE" by Douzas et al.

Any other comments?

The density estimation function has been changed slightly from the reference paper, as the power term yielded very large numbers. This caused the weighting to favour a single cluster.
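The weighting scheme being discussed can be sketched as follows. This is a minimal, dependency-free sketch of the paper's idea, not the PR's actual code, and the function and variable names are ours: each cluster's density is its minority-sample count divided by its mean intra-cluster distance raised to an exponent, and sampling weights are the normalized inverse densities. With the exponent tied to the number of features, the power term can explode, which is the behaviour described above.

```python
import math

def cluster_weights(clusters, exponent):
    """Per-cluster sampling weights following the paper's scheme
    (illustrative sketch, not the PR's actual code)."""
    sparsities = []
    for points in clusters:
        # mean pairwise Euclidean distance among minority samples
        dists = [math.dist(p, q)
                 for i, p in enumerate(points) for q in points[i + 1:]]
        mean_dist = sum(dists) / len(dists)
        density = len(points) / mean_dist ** exponent
        sparsities.append(1.0 / density)  # sparser clusters get more weight
    total = sum(sparsities)
    return [s / total for s in sparsities]

# A tight cluster and a spread-out one: the sparse cluster is
# allocated a larger share of the synthetic samples.
tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
sparse = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
weights = cluster_weights([tight, sparse], exponent=2)
```

With a large exponent (e.g. the feature count of a high-dimensional dataset), one cluster's sparsity dominates the sum, collapsing the weights onto a single cluster.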

@codecov

commented Jun 22, 2018

Codecov Report

Merging #435 into master will decrease coverage by 2.43%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #435      +/-   ##
==========================================
- Coverage    98.9%   96.47%   -2.44%     
==========================================
  Files          84       85       +1     
  Lines        5202     5329     +127     
==========================================
- Hits         5145     5141       -4     
- Misses         57      188     +131
Impacted Files Coverage Δ
imblearn/over_sampling/tests/test_kmeans_smote.py 100% <100%> (ø)
imblearn/over_sampling/_smote.py 97.82% <100%> (+0.55%) ⬆️
imblearn/over_sampling/__init__.py 100% <100%> (ø) ⬆️
imblearn/utils/estimator_checks.py 90.87% <100%> (-6.22%) ⬇️
imblearn/keras/tests/test_generator.py 8.62% <0%> (-91.38%) ⬇️
imblearn/tensorflow/_generator.py 34.28% <0%> (-65.72%) ⬇️
imblearn/keras/_generator.py 40.35% <0%> (-57.9%) ⬇️
imblearn/over_sampling/tests/test_smote_nc.py 94.3% <0%> (-5.7%) ⬇️


@chkoar

Member

commented Jun 22, 2018

@felix-last, as the author and creator of the algorithm, are you willing to review this PR?

@StephanHeijl

Contributor Author

commented Jun 22, 2018

@chkoar @felix-last Thanks in advance for taking a look at this. It must be said that my initial research was somewhat poor, which caused me to overlook the existing implementation by the original author. As such, this was programmed without actually looking at that code. I nevertheless thought it valuable to contribute this back to the project.

@felix-last


commented Jun 26, 2018

Unfortunately I don't have time for a detailed review. Just a few comments after glancing over:

  • The exponent used for the density computation is actually a hyperparameter of the algorithm described in the paper. In our implementation, it defaults to the number of features, but can be set arbitrarily by the user.
  • In the original implementation, k_neighbors is decreased when it is larger than the number of samples in a given cluster. This enables the user to specify, for example, k_neighbors=float('inf') to apply SMOTE to interpolate among all samples of a cluster.
  • This implementation does not encompass limit cases of SMOTE and random oversampling described in section 3.2 of the paper.

I'd like to point out that our implementation also complies with the imbalanced-learn API.
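The second bullet above amounts to a one-line clamp. A sketch (the helper name is ours, not the original implementation's) of why `float('inf')` works:

```python
def effective_k_neighbors(k_neighbors, cluster_size):
    """Clamp k_neighbors to what a cluster can support (illustrative
    helper, not the original implementation's code). A sample cannot be
    its own neighbour, so at most cluster_size - 1 neighbours exist;
    float('inf') therefore means "interpolate among all other samples
    of the cluster"."""
    return min(k_neighbors, cluster_size - 1)

k_all = effective_k_neighbors(float('inf'), 8)  # every other sample
k_small = effective_k_neighbors(5, 3)           # shrunk to fit the cluster
```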

@StephanHeijl

Contributor Author

commented Jun 26, 2018

@felix-last Sincere thanks for looking this over. I'll be happy to make these changes and push them to the project, provided @chkoar would still like it included.

@glemaitre

Member

commented Jun 27, 2018

@StephanHeijl we are fine with including new methods (I would be thrilled to get something working on categorical data :)) as long as we provide references.
Ping me when you have made the last changes.

@pep8speaks


commented Jul 1, 2018

Hello @StephanHeijl! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 1222:5: E303 too many blank lines (2)

Line 60:25: E128 continuation line under-indented for visual indent
Line 61:25: E128 continuation line under-indented for visual indent
Line 69:1: E302 expected 2 blank lines, found 1

Comment last updated at 2019-06-12 13:20:16 UTC

StephanHeijl added some commits Jul 1, 2018

@StephanHeijl StephanHeijl changed the title [WIP] K-Means SMOTE implementation [MRG] K-Means SMOTE implementation Jul 12, 2018

@StephanHeijl

Contributor Author

commented Jul 12, 2018

@glemaitre I took some time to work out the last kinks and implement the edge cases from the paper; I believe this is now ready for a thorough review.

@glemaitre

Member

commented Jul 12, 2018

OK, I will try to have a look.

@glemaitre

Member

commented Jul 25, 2018

The PR looks good in general. @chkoar, before carrying on with a full review I have an API question.

I am split between:

(1) adding a new kind of SMOTE and adding 3 new parameters (as in this PR);
(2) deprecate kind for SMOTE-SVM and create 2 new classes SMOTESVM and SMOTEKMeans.

In (1), I am afraid that we will have too many parameters for the common case, which is painful for the user. In (2), we will have 3 classes instead of 1.

What are your thoughts, @chkoar?

@chkoar

Member

commented Jul 25, 2018

It seems that the SMOTE class gets too much responsibility. Also, according to this article there are almost one hundred published variations of SMOTE. So, I believe that option two is the way to go. We have to take care not to duplicate code between the classes.

@glemaitre

Member

commented Jul 25, 2018

OK so let's go for a base class BaseSMOTE to factorize code as much as possible.

@StephanHeijl I will make a PR and then you will need to move the code within a new class which will inherit from the base class. It should be a minimal change but quite beneficial.
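The factorization being proposed could look roughly like this (a hypothetical minimal shape, not imblearn's actual code; the constructor parameters shown are assumptions, only `BaseSMOTE` and `_sample` come from the discussion above):

```python
class BaseSMOTE:
    """Shared SMOTE machinery (hypothetical sketch of the proposal)."""
    def __init__(self, sampling_strategy="auto", random_state=None,
                 k_neighbors=5):
        self.sampling_strategy = sampling_strategy
        self.random_state = random_state
        self.k_neighbors = k_neighbors

    def fit_resample(self, X, y):
        # shared input validation and bookkeeping would live here
        return self._sample(X, y)

    def _sample(self, X, y):
        raise NotImplementedError  # each SMOTE variant overrides this


class KMeansSMOTE(BaseSMOTE):
    """Variant-specific code stays in the subclass."""
    def __init__(self, kmeans_estimator=None, **kwargs):
        super().__init__(**kwargs)
        self.kmeans_estimator = kmeans_estimator

    def _sample(self, X, y):
        # cluster, filter, weight, then SMOTE within clusters (omitted)
        return X, y


sampler = KMeansSMOTE(k_neighbors=3)
resampled = sampler.fit_resample([[0.0], [1.0]], [0, 1])
```

This keeps the common parameters and validation in one place while each of the three classes only carries its own knobs, which is exactly the trade-off weighed in options (1) and (2) above.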

@StephanHeijl

Contributor Author

commented Jul 26, 2018

@glemaitre Will do; once the PR drops I'll integrate it and move the code. I agree that this should make implementing SMOTE variants a lot more accessible.

@glemaitre glemaitre referenced this pull request Jul 26, 2018

Merged

[MRG] EHN: split and factorize SMOTE classes #440

4 of 4 tasks complete
@glemaitre

Member

commented Jul 26, 2018

@StephanHeijl I think that you can start to have a look at #440.
I still have a couple of things to update, but basically you just have to inherit from BaseSMOTE and move the implementation into _sample.

svm_estimator : object, optional (default=SVC())
If ``kind='svm'``, a parametrized :class:`sklearn.svm.SVC`
classifier can be passed.
n_kmeans_clusters: int, optional (default=10)

@glemaitre

glemaitre Jul 26, 2018

Member

I think that we could make this a KMeans estimator instead.
I am thinking that it could be interesting to plug in a MiniBatchKMeans, for instance, which could be quicker to converge and more memory efficient. This could be the default. However, it would let people pass a KMeans estimator if interested.
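The suggestion above is essentially dependency injection: anything exposing the scikit-learn clusterer interface can be swapped in. A dependency-free sketch of the idea (class names here are illustrative; the stub stands in for MiniBatchKMeans so the example runs without scikit-learn):

```python
class _OneClusterStub:
    """Stand-in clusterer so the sketch runs without scikit-learn."""
    def fit_predict(self, X):
        return [0] * len(X)


class PluggableKMeansSMOTE:
    """Illustrative sketch: the clusterer is a constructor argument."""
    def __init__(self, kmeans_estimator=None):
        # the PR eventually made MiniBatchKMeans the default; any object
        # with fit_predict (e.g. KMeans) can be passed instead
        self.kmeans_estimator = kmeans_estimator or _OneClusterStub()

    def cluster_labels(self, X):
        return self.kmeans_estimator.fit_predict(X)


labels = PluggableKMeansSMOTE().cluster_labels([[0.0], [1.0], [2.0]])
```

With scikit-learn available, passing `MiniBatchKMeans(n_clusters=10)` or a plain `KMeans` works the same way, which is the interchangeability discussed here.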

@StephanHeijl

StephanHeijl Mar 7, 2019

Author Contributor

Completely agree. MiniBatchKMeans is now the default, and a test has been included to ensure that using a "normal" KMeans instance will also work.

@glemaitre

Member

commented Jul 27, 2018

I merged #440, so you should be able to create the new class.
You can make the quick change so that I can do a full review.

@chkoar chkoar requested a review from glemaitre Mar 7, 2019

@felix-last

left a comment

As a sanity check, I tried creating some of the toy dataset plots from our paper using this implementation.

For some examples this implementation yields unwanted results. Of course, we couldn't expect results to be exactly the same, but it could also be a hint that something is wrong (or that I'm overlooking something or it's just an unlucky random seed). I tried translating our implementation's default parameters (which I used for the toy plots in the paper) to the corresponding parameters of this implementation.

Comparison: our implementation (left) vs. this implementation (right)


After a quick glance, the code looks good to me.

Feel free to play around with the plotting script. Prerequisites are to install requirements.txt, install this PR's imblearn, and create a config.yml with a results_dir and dataset_dir. Download the toy datasets into the datasets dir. For comparison, here are all the toy plots obtained from our implementation: kmeans-smote-figures.zip.

@StephanHeijl

Contributor Author

commented Apr 1, 2019

@felix-last Thank you very much for taking the time to look into this! It appears that I passed a wrong parameter to the _make_samples function. I have resolved this, and the results that I gathered from your (very helpful) plotting script appear to be far more in line with the expected results. I have uploaded the results here: https://github.com/StephanHeijl/imblearn-kmeans-toydatasets .

I have updated the tests and the script in accordance with this improvement.

@felix-last

left a comment

That looks more sane! Looks good to me now (but I didn't do a thorough code review).

@chkoar
Member

left a comment

Apart from my latest comments, the only thing that is missing is some user guide documentation.

Two resolved review threads on imblearn/over_sampling/_smote.py (outdated)
Use the parameter ``sampling_strategy`` instead. It will be removed
in 0.6.
"""

@chkoar

chkoar May 5, 2019

Member

At the end of the docstring we could add a reference to the paper, and after that an interactive example. Check here for an example.

@StephanHeijl

StephanHeijl May 6, 2019

Author Contributor

I added an interactive example that specifically demonstrates the utility of the KMeansSMOTE class: three blobs, with the positive class in the middle, the negative classes on the outside, and a single negative sample in the middle blob. The example shows that after resampling no new samples are added in the middle blob. Inspired by the following toy problem:
(toy problem illustration)
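The described setup can be reproduced with a small data generator. This is a dependency-free, illustrative sketch (the function name and parameters are ours; the actual docstring example may build the data differently):

```python
import random

def make_three_blobs(n_major=50, n_minor=10, seed=0):
    """Three blobs as described above: positive (minority) class in the
    middle, negative blobs on the outside, plus one stray negative
    sample inside the middle blob. Illustrative sketch only."""
    rng = random.Random(seed)

    def blob(cx, cy, label, k):
        return [((rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)), label)
                for _ in range(k)]

    data = blob(-3.0, 0.0, 0, n_major) + blob(3.0, 0.0, 0, n_major)
    data += blob(0.0, 0.0, 1, n_minor)   # positive middle blob
    data += [((0.0, 0.0), 0)]            # single stray negative sample
    X = [xy for xy, _ in data]
    y = [label for _, label in data]
    return X, y

X, y = make_three_blobs()
```

Feeding such X, y to the merged oversampler and plotting before/after is enough to see where (and where not) synthetic samples are placed.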

@StephanHeijl

Contributor Author

commented May 6, 2019

@chkoar I have addressed your proposed changes; I hope this covers enough of an interactive example.

@glemaitre glemaitre added this to the 0.5 milestone Jun 11, 2019

@chkoar

Member

commented Jun 12, 2019

@StephanHeijl could you please rebase your branch so we can finally merge this PR?

@chkoar chkoar changed the title [MRG] K-Means SMOTE implementation [MRG] ENH: K-Means SMOTE implementation Jun 12, 2019

@glemaitre

Member

commented Jun 12, 2019

I pushed to resolve the conflict. I will review, make a couple of style changes if necessary, and merge it.

@glemaitre glemaitre merged commit aff4125 into scikit-learn-contrib:master Jun 12, 2019

1 of 4 checks passed

LGTM analysis: Python: running analyses for revisions
continuous-integration/appveyor/pr: waiting for AppVeyor build to complete
continuous-integration/travis-ci/pr: the Travis CI build is in progress
LGTM analysis: JavaScript: no code changes detected

@StephanHeijl StephanHeijl deleted the StephanHeijl:kmeans-smote branch Jun 12, 2019
