
[MRG] EHN handling sparse matrices whenever possible #316

merged 34 commits into master on Aug 24, 2017



@glemaitre glemaitre commented Aug 12, 2017

Reference Issue

closes #158

What does this implement/fix? Explain your changes.

Enhancement to handle sparse matrices (+pandas dataframe, list, etc.).

Supported samplers:

  • All prototype-selection methods can be supported.
  • Ensemble classes.
  • RandomOverSampler should be supported.

ADASYN, SMOTE, and ClusterCentroids cannot be supported straightforwardly; I need to think more about them.
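To make the SMOTE concern concrete, here is a minimal sketch with toy `scipy.sparse` rows (an illustration, not imbalanced-learn code): a SMOTE-style interpolated sample is non-zero wherever either parent is non-zero, so synthetic samples come out denser than the originals.

```python
import numpy as np
from scipy import sparse

# Two sparse parent rows with mostly disjoint supports (toy data).
a = sparse.csr_matrix(np.array([[1.0, 0.0, 0.0, 2.0, 0.0]]))
b = sparse.csr_matrix(np.array([[0.0, 3.0, 0.0, 0.0, 4.0]]))

# SMOTE-style interpolation: a + gap * (b - a) with gap in (0, 1).
gap = 0.5
synthetic = a + gap * (b - a)

# The synthetic row's support is the union of the parents' supports,
# so interpolation tends to densify sparse data.
print(a.nnz, b.nnz, synthetic.nnz)  # 2 2 4
```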


  • use safe_indexing in the different methods.
    • RandomUnderSampler
    • CNN
    • ENN
    • RENN
    • AllKNN
    • IHT
    • NearMiss
    • NCR
    • OSS
    • TL
    • RandomOverSampler
    • ADASYN
    • SMOTE
    • BalanceCascade
    • EasyEnsemble
    • SMOTETomek
    • ClusterCentroids -> Update the tests.
  • update the documentation docstrings.
  • add an entry in the user guide
  • add a common test
  • update the text classification example to work without densification
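For context on the `safe_indexing` item above, here is a minimal sketch of the kind of container-agnostic row selection that scikit-learn's helper provides (an illustration of the idea, not the scikit-learn implementation):

```python
import numpy as np
from scipy import sparse

def select_rows(X, indices):
    """Select rows uniformly from a DataFrame-like object, a sparse
    matrix, or an array-like, mimicking safe_indexing's behavior."""
    if hasattr(X, "iloc"):        # pandas DataFrame / Series
        return X.iloc[indices]
    if sparse.issparse(X):        # scipy sparse matrix
        return X[indices]
    return np.asarray(X)[indices] # ndarray or plain list

X_dense = np.arange(12).reshape(4, 3)
X_sparse = sparse.csr_matrix(X_dense)
idx = [0, 2]

# Both containers yield the same rows, without densifying the sparse one.
print(select_rows(X_dense, idx))
print(select_rows(X_sparse, idx).toarray())
```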

Any other comments?


@codecov codecov bot commented Aug 12, 2017

Codecov Report

Merging #316 into master will decrease coverage by 0.22%.
The diff coverage is 97.74%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #316      +/-   ##
- Coverage   98.19%   97.96%   -0.23%     
  Files          66       66              
  Lines        3978     3924      -54     
- Hits         3906     3844      -62     
- Misses         72       80       +8
Impacted Files Coverage Δ
imblearn/ensemble/ 100% <ø> (ø) ⬆️
imblearn/combine/ 85.29% <100%> (ø) ⬆️
...arn/under_sampling/prototype_selection/ 98.66% <100%> (+1.26%) ⬆️
imblearn/over_sampling/ 100% <100%> (ø) ⬆️
...sampling/prototype_generation/ 100% <100%> (ø) ⬆️
imblearn/over_sampling/tests/ 100% <100%> (ø) ⬆️
...prototype_selection/ 100% <100%> (ø) ⬆️
.../under_sampling/prototype_selection/ 100% <100%> (ø) ⬆️
imblearn/combine/ 93.22% <100%> (ø) ⬆️
...mpling/prototype_selection/ 100% <100%> (ø) ⬆️
... and 36 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 488a0e8...f8ebd0e.


@pep8speaks pep8speaks commented Aug 13, 2017

Hello @glemaitre! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 24, 2017 at 19:40 Hours UTC
@glemaitre glemaitre changed the title [WIP] EHN handling sparse matrices whenever possible [MRG] EHN handling sparse matrices whenever possible Aug 14, 2017
Member Author

@glemaitre glemaitre commented Aug 14, 2017

@massich @chkoar Here comes the sparse handling.

It is a bit tricky for ClusterCentroids since the generated centroids are not sparse, yet they get stacked with the original sparse matrix. I will add an entry in the docstring to state that.

For SMOTE, we can compute the sparse matrix efficiently, but then I am not sure it makes sense to have a lot of zeros there. It would make more sense inside a categorical SMOTE, I think.

For the review, I did not modify the existing tests; I added a common test which checks that sampling dense input provides the same result as sampling sparse input.
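The ClusterCentroids issue can be illustrated with a toy sketch (plain numpy/scipy, not imbalanced-learn internals): centroid-style samples are averages of many rows, so even when stored in a sparse format they are almost fully dense.

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
# A sparse majority-class matrix: 10% non-zeros (toy data).
X_major = sparse.random(100, 20, density=0.1, format="csr", random_state=rng)

# A centroid is a mean over rows; averaging spreads the non-zeros
# across nearly every column, so the centroid is close to dense.
centroid = sparse.csr_matrix(np.atleast_2d(np.asarray(X_major.mean(axis=0))))

# Stacking the dense-in-content centroid with sparse rows keeps the
# sparse format but not the sparsity.
stacked = sparse.vstack([X_major[:10], centroid])
print(X_major.nnz / np.prod(X_major.shape))    # ~0.1
print(centroid.nnz / np.prod(centroid.shape))  # close to 1.0
```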

Member Author

@glemaitre glemaitre commented Aug 14, 2017

Regarding ClusterCentroids, I think a PR implementing nearest-neighbors voting instead of generating the centroids would be good.

I can imagine a voting parameter taking 'hard'/'soft'/'auto': 'hard' would be the default with sparse input and 'soft' otherwise.

@chkoar WDYT? I know that you have been thinking about that for a while.
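A rough sketch of the 'hard' voting idea with toy numpy data (hypothetical, not the eventual imbalanced-learn API): instead of keeping the synthetic centroid, keep the real sample nearest to it, so every retained row is taken from the original data and sparse input would stay sparse.

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.rand(30, 5)  # majority-class samples (toy data)

# 'soft' voting: the synthetic centroid itself, which is generally
# not an existing sample (and dense even for sparse input).
centroid = X.mean(axis=0)

# 'hard' voting: the real sample nearest to the centroid, so the
# reduced set contains only rows that already exist in X.
nearest = np.argmin(np.linalg.norm(X - centroid, axis=1))
hard_sample = X[nearest]

print(nearest)
print(hard_sample)
```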

@glemaitre glemaitre force-pushed the scikit-learn-contrib:master branch from 9304aa5 to f02c565 Aug 17, 2017
from sklearn.utils.estimator_checks import _yield_all_checks \
    as sklearn_yield_all_checks, check_estimator \
    as sklearn_check_estimator, check_parameters_default_constructible
from sklearn.exceptions import NotFittedError
from sklearn.utils.testing import (assert_warns, assert_raises_regex,
                                   assert_true, set_random_state,
                                   assert_equal, assert_allclose)



massich Aug 22, 2017

I don't understand why, but this PR inserts the assert_equal back, and the diff looks weird because they are no longer there in master (see this)



massich Aug 22, 2017

Actually needs rebasing master.

Member Author

@glemaitre glemaitre commented Aug 22, 2017

@massich rebased. Note that you will have an assert_raises_regex in a new test_adasyn

edit: ping #321


@massich massich commented Aug 22, 2017

LGTM, if all the CIs are happy

API's of imbalanced-learn samplers

The sampler available follows the scikit-learn API using the estimator base



glemaitre Aug 24, 2017
Author Member

The available samplers follow



glemaitre Aug 24, 2017
Author Member

using the base estimator and adding a sampling functionality through the sample method

The augmented data set should be used instead of the original data set to train
a classifier::

>>> from sklearn.svm import LinearSVC
>>> clf = LinearSVC()
>>>, y_resampled)  # doctest: +ELLIPSIS



glemaitre Aug 24, 2017
Author Member

Add this back

.. warning::

:class:`ClusterCentroids` supports sparse matrices. However, the new samples
are generated are not specifically sparse. Therefore, even if the resulting



glemaitre Aug 24, 2017
Author Member

the new samples generated

index_classified = index_under_sample[pred_target == y_subset]
pred_target = pred[:index_under_sample.size]
index_classified = index_under_sample[
    pred_target == y_subset[:index_under_sample.size]]



glemaitre Aug 24, 2017
Author Member

safe_indexing y_subset
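The intent of the snippet under review, sketched with toy numpy arrays (names kept from the diff, data hypothetical): trim the predictions to the subset size, then keep only the original indices whose prediction matches the true label.

```python
import numpy as np

pred = np.array([0, 1, 1, 0, 1, 0])                  # classifier predictions
y_subset = np.array([0, 1, 0, 0, 1])                 # true labels for the subset
index_under_sample = np.array([10, 11, 12, 13, 14])  # original row ids

# Trim predictions to the subset size, then keep the indices that
# were classified correctly.
pred_target = pred[:index_under_sample.size]
index_classified = index_under_sample[
    pred_target == y_subset[:index_under_sample.size]]
print(index_classified)  # [10 11 13 14]
```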

@glemaitre glemaitre merged commit cddf39b into scikit-learn-contrib:master Aug 24, 2017
0 of 4 checks passed
ci/circleci CircleCI is running your tests
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
continuous-integration/travis-ci/pr The Travis CI build is in progress
lgtm analysis: Python Fetching Git Commits