
[MRG] EHN handling sparse matrices whenever possible #316

Merged
merged 34 commits into scikit-learn-contrib:master on Aug 24, 2017

Conversation

@glemaitre
Member

glemaitre commented Aug 12, 2017

Reference Issue

closes #158

What does this implement/fix? Explain your changes.

Enhancement to handle sparse matrices (as well as pandas DataFrames, lists, etc.).

Supported samplers:

  • All prototype-selection methods can be supported.
  • Ensemble classes can be supported.
  • RandomOverSampler should be supported.

ADASYN, SMOTE, and ClusterCentroids cannot be supported straightforwardly; they need more thought.
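A minimal sketch of the intended usage, assuming the 0.3-era fit_sample API: a supported sampler should accept a SciPy sparse matrix and return a sparse result.

import numpy as np
from scipy import sparse
from imblearn.under_sampling import RandomUnderSampler

# Imbalanced data: 80 majority samples, 20 minority samples.
X = sparse.random(100, 20, density=0.1, format='csr', random_state=0)
y = np.array([0] * 80 + [1] * 20)

rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_sample(X, y)
print(sparse.issparse(X_res))  # True: sparsity is preserved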

TODO:

  • use safe_indexing in the different methods (see the sketch after this list):
    • RandomUnderSampler
    • CNN
    • ENN
    • RENN
    • AllKNN
    • IHT
    • NearMiss
    • NCR
    • OSS
    • TL
    • RandomOverSampler
    • ADASYN
    • SMOTE
    • BalanceCascade
    • EasyEnsemble
    • SMOTETomek
    • SMOTEENN
    • ClusterCentroids -> Update the tests.
  • update the documentation docstrings.
  • add an entry in the user guide
  • add a common test
  • update the text classification example so it runs without densification
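The helper named in the first TODO item is sklearn.utils.safe_indexing. A small sketch of why it helps: it indexes ndarrays, sparse matrices, and pandas DataFrames with a single call, so the samplers need no type-specific indexing code.

import numpy as np
from scipy import sparse
from sklearn.utils import safe_indexing

idx = np.array([0, 2])
X_dense = np.arange(12).reshape(4, 3)
X_sparse = sparse.csr_matrix(X_dense)

print(safe_indexing(X_dense, idx))             # rows 0 and 2 as an ndarray
print(safe_indexing(X_sparse, idx).toarray())  # same rows, stays sparse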

Any other comments?

@codecov


codecov bot commented Aug 12, 2017

Codecov Report

Merging #316 into master will decrease coverage by 0.22%.
The diff coverage is 97.74%.


@@            Coverage Diff             @@
##           master     #316      +/-   ##
==========================================
- Coverage   98.19%   97.96%   -0.23%     
==========================================
  Files          66       66              
  Lines        3978     3924      -54     
==========================================
- Hits         3906     3844      -62     
- Misses         72       80       +8
Impacted Files Coverage Δ
imblearn/ensemble/easy_ensemble.py 100% <ø> (ø) ⬆️
imblearn/combine/smote_enn.py 85.29% <100%> (ø) ⬆️
...arn/under_sampling/prototype_selection/nearmiss.py 98.66% <100%> (+1.26%) ⬆️
imblearn/over_sampling/random_over_sampler.py 100% <100%> (ø) ⬆️
...sampling/prototype_generation/cluster_centroids.py 100% <100%> (ø) ⬆️
imblearn/over_sampling/tests/test_adasyn.py 100% <100%> (ø) ⬆️
...prototype_selection/neighbourhood_cleaning_rule.py 100% <100%> (ø) ⬆️
.../under_sampling/prototype_selection/tomek_links.py 100% <100%> (ø) ⬆️
imblearn/combine/smote_tomek.py 93.22% <100%> (ø) ⬆️
...mpling/prototype_selection/random_under_sampler.py 100% <100%> (ø) ⬆️
... and 36 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 488a0e8...f8ebd0e. Read the comment docs.

@pep8speaks


pep8speaks commented Aug 13, 2017

Hello @glemaitre! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 24, 2017 at 19:40 UTC

glemaitre added some commits Aug 13, 2017

@glemaitre glemaitre changed the title from [WIP] EHN handling sparse matrices whenever possible to [MRG] EHN handling sparse matrices whenever possible Aug 14, 2017

@glemaitre


Member

glemaitre commented Aug 14, 2017

@massich @chkoar Here comes the sparse handling.

It is a bit tricky for ClusterCentroids since we create a sparse matrix for the centroids, which are not actually sparse, and stack it with the original sparse matrix. I will add a note in the docstring to state that.

For SMOTE, we can compute the sparse matrix efficiently, but I am not sure it makes sense to have a lot of zeros there. It would make sense inside a categorical SMOTE, I think.

For the review: I did not modify the existing tests; instead I added a common test which checks that sampling dense input gives the same result as sampling sparse input.
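A hedged sketch of the kind of common test described above (names are illustrative, not the actual test added in the PR): run the same sampler on dense and sparse versions of the data and check that the results agree.

import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=100, weights=[0.8, 0.2],
                           random_state=0)

# Same sampler, same seed, dense versus sparse input.
X_res_d, y_res_d = RandomUnderSampler(random_state=0).fit_sample(X, y)
X_res_s, y_res_s = RandomUnderSampler(random_state=0).fit_sample(
    sparse.csr_matrix(X), y)

np.testing.assert_allclose(X_res_d, X_res_s.toarray())
np.testing.assert_array_equal(y_res_d, y_res_s)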

@glemaitre


Member

glemaitre commented Aug 14, 2017

Regarding ClusterCentroids, I think a PR implementing nearest-neighbors voting instead of generating the centroids would be good.

I can imagine a voting parameter with 'hard'/'soft'/'auto' as a possibility: 'hard' would be the default for sparse input and 'soft' otherwise.

@chkoar WDYT? I know you have been thinking about this for a while.
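A hypothetical sketch of the proposed parameter (voting does not exist at the time of this comment; the names and semantics come from the proposal above, not an implemented API):

from imblearn.under_sampling import ClusterCentroids

cc_hard = ClusterCentroids(voting='hard')  # keep nearest real samples
cc_soft = ClusterCentroids(voting='soft')  # keep the k-means centroids
cc_auto = ClusterCentroids(voting='auto')  # 'hard' for sparse input, else 'soft'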

glemaitre added some commits Aug 14, 2017

@glemaitre glemaitre force-pushed the scikit-learn-contrib:master branch from 9304aa5 to f02c565 Aug 17, 2017

  from sklearn.utils.estimator_checks import _yield_all_checks \
      as sklearn_yield_all_checks, check_estimator \
      as sklearn_check_estimator, check_parameters_default_constructible
  from sklearn.exceptions import NotFittedError
  from sklearn.utils.testing import (assert_warns, assert_raises_regex,
                                     assert_true, set_random_state,
-                                    assert_equal)
+                                    assert_equal, assert_allclose,


@massich

massich Aug 22, 2017

Contributor

I don't understand why, but this PR would insert back the assert_equal, and the diff looks weird because it is no longer there in master (see this)


@massich

massich Aug 22, 2017

Contributor

Actually, it needs a rebase on master.

@glemaitre


Member

glemaitre commented Aug 22, 2017

@massich rebased. Note that you will find an assert_raises_regex in a new test_adasyn

edit: ping #321

@massich


Contributor

massich commented Aug 22, 2017

LGTM, if all the CIs are happy

APIs of imbalanced-learn samplers
----------------------------------

The sampler available follows the scikit-learn API using the estimator base


@glemaitre

glemaitre Aug 24, 2017

Member

The available samplers follow


@glemaitre

glemaitre Aug 24, 2017

Member

using the base estimator and adding a sampling functionality through the sample method
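A minimal sketch of the API this suggestion describes, assuming the 0.3-era method names: samplers follow the scikit-learn estimator API and add the sampling functionality through a sample method; fit_sample chains the two.

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(weights=[0.9, 0.1], random_state=0)

rus = RandomUnderSampler(random_state=0)
rus.fit(X, y)                    # scikit-learn-style fit
X_res, y_res = rus.sample(X, y)  # the added sampling method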


The augmented data set should be used instead of the original data set to train
a classifier::

>>> from sklearn.svm import LinearSVC
>>> clf = LinearSVC()
>>> clf.fit(X_resampled, y_resampled) # doctest: +ELLIPSIS


@glemaitre

glemaitre Aug 24, 2017

Member

Add this back

.. warning::

:class:`ClusterCentroids` supports sparse matrices. However, the new samples
are generated are not specifically sparse. Therefore, even if the resulting


@glemaitre

glemaitre Aug 24, 2017

Member

the new samples generated

- index_classified = index_under_sample[pred_target == y_subset]
+ pred_target = pred[:index_under_sample.size]
+ index_classified = index_under_sample[
+     pred_target == y_subset[:index_under_sample.size]]


@glemaitre

glemaitre Aug 24, 2017

Member

Use safe_indexing for y_subset here.
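A hedged sketch of the change this comment suggests (variable names follow the snippet above; the exact fix in the PR may differ): index y_subset through safe_indexing so any supported container works, not only an ndarray.

import numpy as np
from sklearn.utils import safe_indexing

# Dummy stand-ins for the variables in the snippet above.
pred = np.array([0, 1, 0, 1])
y_subset = np.array([0, 1, 1, 1, 0])
index_under_sample = np.array([3, 7, 9])

pred_target = pred[:index_under_sample.size]
y_head = safe_indexing(y_subset, np.arange(index_under_sample.size))
index_classified = index_under_sample[pred_target == y_head]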

@glemaitre glemaitre merged commit cddf39b into scikit-learn-contrib:master Aug 24, 2017

0 of 4 checks passed:

  • ci/circleci - CircleCI is running your tests
  • continuous-integration/appveyor/pr - Waiting for AppVeyor build to complete
  • continuous-integration/travis-ci/pr - The Travis CI build is in progress
  • lgtm analysis: Python - Fetching Git Commits