-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[MRG] EHN handling sparse matrices whenever possible #316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #316 +/- ##
==========================================
- Coverage 98.19% 97.96% -0.23%
==========================================
Files 66 66
Lines 3978 3924 -54
==========================================
- Hits 3906 3844 -62
- Misses 72 80 +8
Continue to review full report at Codecov.
|
|
Hello @glemaitre! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on August 24, 2017 at 19:40 Hours UTC |
|
@massich @chkoar Here comes the sparse handling. It is a bit tricky for the For SMOTE, we can compute effectively the sparse matrix but then I am not sure it makes sense to have a lot of zero there. It would make sense inside a categorical-SMOTE I think. For the review, I did not modified the test and added a common test in which we check that dense sampling provide the same thing than a sparse sampling. |
|
Regarding the ClusterCentroids, I think that a PR making a nearest-neighbors voting instead of generating the centroids will be good. I can imagine a parameter @chkoar WDYT? I know that you were thinking about that since a while. |
imblearn/utils/estimator_checks.py
Outdated
| from sklearn.utils.testing import (assert_warns, assert_raises_regex, | ||
| assert_true, set_random_state, | ||
| assert_equal) | ||
| assert_equal, assert_allclose, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why but this PR would insert back the assert_equal and the diff looks wierd 'cos they are no longer there in master (see this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually needs rebasing master.
|
LGTM, if all the CIs are happy |
doc/introduction.rst
Outdated
| API's of imbalanced-learn samplers | ||
| ---------------------------------- | ||
|
|
||
| The sampler available follows the scikit-learn API using the estimator base |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The available samplers follow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using the base estimator and adding a sampling functionality throw the sample method
|
|
||
| >>> from sklearn.svm import LinearSVC | ||
| >>> clf = LinearSVC() | ||
| >>> clf.fit(X_resampled, y_resampled) # doctest: +ELLIPSIS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add this back
doc/under_sampling.rst
Outdated
| .. warning:: | ||
|
|
||
| :class:`ClusterCentroids` supports sparse matrices. However, the new samples | ||
| are generated are not specifically sparse. Therefore, even if the resulting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the new samples generated
imblearn/ensemble/balance_cascade.py
Outdated
| index_classified = index_under_sample[pred_target == y_subset] | ||
| pred_target = pred[:index_under_sample.size] | ||
| index_classified = index_under_sample[ | ||
| pred_target == y_subset[:index_under_sample.size]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
safe_indexing y_subset
Reference Issue
closes #158
What does this implement/fix? Explain your changes.
Enhancement to handle sparse matrices (+pandas dataframe, list, etc.).
Supported samplers:
ADASYN and SMOTE and ClusterCentroids cannot be supported straightforwardly. I need more thought about it.
TODO:
safe_indexingin the different methods.Any other comments?