[MRG] EHN handling sparse matrices whenever possible (#316)
* EHN POC sparse handling for RandomUnderSampler
* EHN support sparse ENN
* iter
* EHN sparse indexing IHT
* EHN sparse support nearmiss
* EHN support sparse matrices for NCR
* EHN support sparse Tomek and OSS
* EHN support sparsity for CNN
* EHN support sparse for SMOTE
* EHN support sparse adasyn
* EHN support sparsity for combine methods
* EHN support sparsity BC
* DOC update docstring
* DOC fix example topic classification
* FIX fix test and class clustercentroids
* TST add common test
* TST add ensemble
* TST use allclose
* TST install conda with ubuntu container
* TST increase tolerance
* TST increase tolerance
* TST test all versions NearMiss and SMOTE
* TST set the algorithm of KMeans
* DOC add entry in user guide
* DOC add entry sparse for CC
* DOC whatsnew entry
* DOC fix api
* TST adapt pytest
* DOC update user guide
* address comments
* TST remove the last assert_regex
Showing 33 changed files with 682 additions and 550 deletions.
@@ -0,0 +1,61 @@
.. _introduction:

============
Introduction
============

.. _api_imblearn:

APIs of imbalanced-learn samplers
---------------------------------

The available samplers follow the scikit-learn API, using the base estimator
and adding sampling functionality through the ``sample`` method:

:Estimator:

    The base object, which implements a ``fit`` method to learn from data::

      estimator = obj.fit(data, targets)

:Sampler:

    To resample a data set, each sampler implements::

      data_resampled, targets_resampled = obj.sample(data, targets)

    Fitting and sampling can also be done in one step::

      data_resampled, targets_resampled = obj.fit_sample(data, targets)

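The ``fit`` / ``sample`` / ``fit_sample`` contract described above can be sketched with a toy class. This is a minimal illustration only: ``ToyRandomUnderSampler`` is a hypothetical name, it is not part of imbalanced-learn, and it works on plain Python lists rather than arrays. It assumes that "fitting" means recording the size of the smallest class and that "sampling" means randomly keeping that many rows per class.

```python
import random
from collections import Counter


class ToyRandomUnderSampler:
    """Hypothetical stand-in for a sampler; not part of imbalanced-learn."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit(self, data, targets):
        # Learn from the data: every class will be cut down to the size
        # of the smallest class.
        self.n_per_class_ = min(Counter(targets).values())
        return self

    def sample(self, data, targets):
        # Randomly keep ``n_per_class_`` row indices for each class.
        rng = random.Random(self.random_state)
        kept = []
        for label in sorted(set(targets)):
            rows = [i for i, t in enumerate(targets) if t == label]
            kept.extend(rng.sample(rows, self.n_per_class_))
        kept.sort()  # preserve the original row order
        return [data[i] for i in kept], [targets[i] for i in kept]

    def fit_sample(self, data, targets):
        # Fitting and sampling in one step.
        return self.fit(data, targets).sample(data, targets)


# Made-up toy data: four samples of class 0, two samples of class 1.
data = [[0.0, 1.0], [0.1, 0.9], [0.2, 1.1], [0.3, 0.8], [5.0, 5.1], [5.2, 4.9]]
targets = [0, 0, 0, 0, 1, 1]

data_resampled, targets_resampled = ToyRandomUnderSampler().fit_sample(data, targets)
print(Counter(targets_resampled))  # both classes now have two samples
```

Returning ``self`` from ``fit`` mirrors the scikit-learn convention and is what allows the one-step ``fit_sample`` to be written as a simple chain.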
Imbalanced-learn samplers accept the same inputs as scikit-learn:

* ``data``: array-like (2-D list, pandas.DataFrame, numpy.array) or sparse
  matrices;
* ``targets``: array-like (1-D list, pandas.Series, numpy.array).

.. topic:: Sparse input

   For sparse input the data is **converted to the Compressed Sparse Row
   representation** (see ``scipy.sparse.csr_matrix``) before being fed to the
   sampler. To avoid unnecessary memory copies, it is recommended to choose the
   CSR representation upstream.

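To make the CSR layout concrete, here is a pure-Python sketch of the three arrays it stores and of row selection on top of them. The helper names ``dense_to_csr`` and ``select_rows`` are hypothetical and this is not how SciPy implements it; in practice one would rely on ``scipy.sparse.csr_matrix``. The sketch also assumes that resampling mostly amounts to picking rows (samples), which is the operation CSR makes cheap.

```python
def dense_to_csr(rows):
    """Return the (values, col_indices, row_pointers) triple of the CSR layout."""
    values, col_indices, row_pointers = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                values.append(value)
                col_indices.append(col)
        # row_pointers[r + 1] marks where row r ends inside ``values``.
        row_pointers.append(len(values))
    return values, col_indices, row_pointers


def select_rows(csr, keep):
    """Build a new CSR triple containing only the rows listed in ``keep``."""
    values, col_indices, row_pointers = csr
    new_values, new_cols, new_ptrs = [], [], [0]
    for r in keep:
        # Each row is one contiguous slice, so no scan of other rows is needed.
        start, stop = row_pointers[r], row_pointers[r + 1]
        new_values.extend(values[start:stop])
        new_cols.extend(col_indices[start:stop])
        new_ptrs.append(len(new_values))
    return new_values, new_cols, new_ptrs


# Made-up 4 x 3 matrix with four non-zero entries.
dense = [
    [0, 0, 3],
    [4, 0, 0],
    [0, 0, 0],
    [0, 5, 6],
]
csr = dense_to_csr(dense)
subset = select_rows(csr, [0, 3])  # keep the first and last samples
print(csr)
print(subset)
```

Because every row is a contiguous slice of ``values``, selecting samples only touches the rows that are kept. Other sparse formats would first need a conversion, which is the extra copy the recommendation above avoids.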
.. _problem_statement:

Problem statement regarding imbalanced data sets
------------------------------------------------

The learning phase and the subsequent prediction of machine learning algorithms
can be affected by imbalanced data sets. The balancing issue corresponds to the
difference in the number of samples between the classes. We illustrate the
effect of training a linear SVM classifier with different levels of class
balancing.

.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_001.png
   :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
   :scale: 60
   :align: center

As expected, the decision function of the linear SVM is highly impacted. With a
greater imbalance ratio, the decision function favors the class with the larger
number of samples, usually referred to as the majority class.
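
The extreme case of this bias can be sketched without any classifier at all: a rule that always predicts the majority class. The labels below are made up, and the imbalance ratio is taken here to be the majority count divided by the minority count, which is one common convention and an assumption on our part rather than a definition from this guide.

```python
from collections import Counter

# Made-up labels: 95 samples of class 0 and 5 samples of class 1.
targets = [0] * 95 + [1] * 5

counts = Counter(targets)
majority_class, n_majority = counts.most_common(1)[0]
minority_class = min(counts, key=counts.get)
n_minority = counts[minority_class]

# Assumed definition: majority count divided by minority count.
imbalance_ratio = n_majority / n_minority

# A degenerate "decision function" that always favors the majority class.
predictions = [majority_class] * len(targets)

accuracy = sum(p == t for p, t in zip(predictions, targets)) / len(targets)
minority_hits = sum(
    p == t for p, t in zip(predictions, targets) if t == minority_class
)
minority_recall = minority_hits / n_minority

print(imbalance_ratio, accuracy, minority_recall)
```

The overall accuracy looks high while no minority sample is ever recognized, which is why accuracy alone can hide the effect shown in the figure.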