
[MRG] EHN handling sparse matrices whenever possible #316

Merged
merged 34 commits into from Aug 24, 2017

Conversation

@glemaitre (Member) commented Aug 12, 2017

Reference Issue

closes #158

What does this implement/fix? Explain your changes.

Enhancement to handle sparse matrices (plus pandas DataFrames, lists, etc.).

Supported samplers:

  • All prototype-selection methods can be supported.
  • Ensemble classes can be supported.
  • RandomOverSampler can be supported.

ADASYN, SMOTE, and ClusterCentroids cannot be supported straightforwardly; these need more thought.
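As a hedged illustration (not code from this PR), passing sparse input through a sampler typically starts at validation; scikit-learn's `check_X_y` keeps CSR/CSC matrices sparse when told to accept them:

```python
import numpy as np
from scipy import sparse
from sklearn.utils import check_X_y

# A sparse design matrix and an imbalanced target.
X = sparse.csr_matrix(np.random.RandomState(0).rand(10, 3))
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# With accept_sparse, validation passes the CSR matrix through
# instead of densifying it.
X_checked, y_checked = check_X_y(X, y, accept_sparse=['csr', 'csc'])

assert sparse.issparse(X_checked)
assert X_checked.format == 'csr'
```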

TODO:

  • use safe_indexing in the different methods.
    • RandomUnderSampler
    • CNN
    • ENN
    • RENN
    • AllKNN
    • IHT
    • NearMiss
    • NCR
    • OSS
    • TL
    • RandomOverSampling
    • ADASYN
    • SMOTE
    • BalanceCascade
    • EasyEnsemble
    • SMOTETomek
    • SMOTEENN
    • ClusterCentroids -> Update the tests.
  • update the documentation docstrings.
  • add an entry in the user guide
  • add a common test
  • update the text classification example to work without densification

Any other comments?

@codecov bot commented Aug 12, 2017

Codecov Report

Merging #316 into master will decrease coverage by 0.22%.
The diff coverage is 97.74%.


@@            Coverage Diff             @@
##           master     #316      +/-   ##
==========================================
- Coverage   98.19%   97.96%   -0.23%     
==========================================
  Files          66       66              
  Lines        3978     3924      -54     
==========================================
- Hits         3906     3844      -62     
- Misses         72       80       +8
Impacted Files Coverage Δ
imblearn/ensemble/easy_ensemble.py 100% <ø> (ø) ⬆️
imblearn/combine/smote_enn.py 85.29% <100%> (ø) ⬆️
...arn/under_sampling/prototype_selection/nearmiss.py 98.66% <100%> (+1.26%) ⬆️
imblearn/over_sampling/random_over_sampler.py 100% <100%> (ø) ⬆️
...sampling/prototype_generation/cluster_centroids.py 100% <100%> (ø) ⬆️
imblearn/over_sampling/tests/test_adasyn.py 100% <100%> (ø) ⬆️
...prototype_selection/neighbourhood_cleaning_rule.py 100% <100%> (ø) ⬆️
.../under_sampling/prototype_selection/tomek_links.py 100% <100%> (ø) ⬆️
imblearn/combine/smote_tomek.py 93.22% <100%> (ø) ⬆️
...mpling/prototype_selection/random_under_sampler.py 100% <100%> (ø) ⬆️
... and 36 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 488a0e8...f8ebd0e.

@pep8speaks commented Aug 13, 2017

Hello @glemaitre! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 24, 2017 at 19:40 Hours UTC

@glemaitre glemaitre changed the title [WIP] EHN handling sparse matrices whenever possible [MRG] EHN handling sparse matrices whenever possible Aug 14, 2017
@glemaitre (Member Author)

@massich @chkoar Here comes the sparse handling.

It is a bit tricky for ClusterCentroids since we create centroids which are not sparse, but they are stacked with the original sparse matrix. I will add an entry in the docstring to state that.

For SMOTE, we can compute the sparse matrix efficiently, but I am not sure it makes sense to have a lot of zeros there. It would make more sense inside a categorical SMOTE, I think.
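To illustrate the concern: interpolating between two sparse rows, as SMOTE does, gives a synthetic row that is non-zero wherever either parent is, so the output gets denser than its parents. A toy sketch (not the actual SMOTE code):

```python
import numpy as np
from scipy import sparse

# Two sparse parent rows with different sparsity patterns.
a = sparse.csr_matrix(np.array([[1.0, 0.0, 0.0, 2.0]]))
b = sparse.csr_matrix(np.array([[0.0, 3.0, 0.0, 1.0]]))

# SMOTE-style interpolation: a + step * (b - a).
step = 0.5
synthetic = a + step * (b - a)

# The synthetic row has more stored values than either parent.
assert a.nnz == 2 and b.nnz == 2
assert synthetic.nnz == 3
assert np.allclose(synthetic.toarray(), [[0.5, 1.5, 0.0, 1.5]])
```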

For the review, I did not modify the existing tests; I added a common test which checks that sampling dense input provides the same results as sampling sparse input.
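Such a common test can be sketched roughly as follows; a stand-in index-selection "sampler" is used here instead of a real imblearn sampler so the snippet stays self-contained:

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
X = rng.rand(20, 5)
y = np.array([0] * 15 + [1] * 5)

# Stand-in for a sampler: a fixed subset of row indices (a real
# sampler would derive these from the class distribution).
idx = np.concatenate([np.flatnonzero(y == 0)[:5], np.flatnonzero(y == 1)])

def sample(X, y, idx):
    # Row indexing works the same way for ndarrays and CSR matrices.
    return X[idx], y[idx]

X_dense, y_dense = sample(X, y, idx)
X_sp, y_sp = sample(sparse.csr_matrix(X), y, idx)

# The common-test property: the sparse path densifies to the dense result.
assert sparse.issparse(X_sp)
np.testing.assert_allclose(X_sp.toarray(), X_dense)
np.testing.assert_array_equal(y_sp, y_dense)
```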

@glemaitre (Member Author)

Regarding ClusterCentroids, I think that a PR implementing nearest-neighbors voting instead of generating the centroids would be good.

I can imagine a voting parameter taking 'hard'/'soft'/'auto'; 'hard' would be the default with sparse input and 'soft' otherwise.

@chkoar WDYT? I know that you have been thinking about that for a while.
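A hypothetical sketch of what a 'hard' voting mode could look like: cluster, then replace each generated centroid by the nearest real sample, so the resampled output keeps the original (sparse) rows. The `voting` idea and this shape of the API are assumptions from the comment above, not the merged implementation:

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.RandomState(0)
X = sparse.csr_matrix(rng.rand(50, 4))

# 'soft' voting (current behaviour): use the generated, dense centroids.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# 'hard' voting: replace each centroid by the nearest real sample,
# so the resampled output keeps the original sparse rows.
nearest = pairwise_distances_argmin(km.cluster_centers_, X)
X_hard = X[nearest]

assert sparse.issparse(X_hard)
assert X_hard.shape == (5, 4)
```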

from sklearn.utils.estimator_checks import _yield_all_checks \
    as sklearn_yield_all_checks, check_estimator \
    as sklearn_check_estimator, check_parameters_default_constructible
from sklearn.exceptions import NotFittedError
from sklearn.utils.testing import (assert_warns, assert_raises_regex,
                                   assert_true, set_random_state,
                                   assert_equal, assert_allclose)
Contributor:

I don't understand why, but this PR would insert assert_equal back, and the diff looks weird because it is no longer there in master (see this)

Contributor:

Actually, this needs a rebase on master.

@glemaitre (Member Author) commented Aug 22, 2017

@massich rebased. Note that there will be an assert_raises_regex in a new test_adasyn

edit: ping #321

@massich (Contributor) commented Aug 22, 2017

LGTM, if all the CIs are happy

API's of imbalanced-learn samplers
----------------------------------

The sampler available follows the scikit-learn API using the estimator base
Member Author:

The available samplers follow

Member Author:

using the base estimator and adding a sampling functionality through the sample method


The augmented data set should be used instead of the original data set to train
a classifier::

>>> from sklearn.svm import LinearSVC
>>> clf = LinearSVC()
>>> clf.fit(X_resampled, y_resampled) # doctest: +ELLIPSIS
Member Author:

Add this back
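For context, the snippet under review trains a classifier on the resampled data; a self-contained sketch (using make_classification and a CSR matrix as a stand-in for sampler output, not this PR's code) might look like:

```python
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=60, random_state=0)
# Stand-in for sampler output: a sparse resampled dataset.
X_resampled, y_resampled = sparse.csr_matrix(X), y

# LinearSVC trains directly on sparse input; no densification needed.
clf = LinearSVC(random_state=0).fit(X_resampled, y_resampled)
assert clf.score(X_resampled, y_resampled) > 0.5
```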

.. warning::

:class:`ClusterCentroids` supports sparse matrices. However, the new samples
are generated are not specifically sparse. Therefore, even if the resulting
Member Author:

the new samples generated

index_classified = index_under_sample[pred_target == y_subset]
pred_target = pred[:index_under_sample.size]
index_classified = index_under_sample[
    pred_target == y_subset[:index_under_sample.size]]
Member Author:

safe_indexing y_subset
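The suggestion refers to scikit-learn's safe_indexing helper, which indexes rows of arrays, sparse matrices, DataFrames, or plain lists with one call. A small sketch (the import location moved in later scikit-learn versions, hence the fallback):

```python
import numpy as np
from scipy import sparse

# safe_indexing was public in scikit-learn at the time of this PR;
# newer versions expose it as _safe_indexing.
try:
    from sklearn.utils import safe_indexing
except ImportError:
    from sklearn.utils import _safe_indexing as safe_indexing

X = sparse.csr_matrix(np.arange(12.0).reshape(4, 3))
y = ["a", "b", "a", "b"]
idx = np.array([0, 2])

# One call indexes rows of sparse matrices and plain lists alike.
X_sub = safe_indexing(X, idx)
y_sub = safe_indexing(y, idx)

assert X_sub.shape == (2, 3)
assert list(y_sub) == ["a", "a"]
```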

@glemaitre glemaitre merged commit cddf39b into scikit-learn-contrib:master Aug 24, 2017
Development

Successfully merging this pull request may close these issues.

Handle sparse matrix
3 participants