Conversation

glemaitre
Member

@glemaitre glemaitre commented Jun 15, 2017

Reference Issue

closes #253

What does this implement/fix? Explain your changes.

Any other comments?

  • Over-sampling
  • Under-sampling
  • Combination of algorithms
  • Ensemble of algorithms
  • Datasets module
  • Application example:
    • Text classification or something like that.

@pep8speaks

pep8speaks commented Jun 15, 2017

Hello @glemaitre! Thanks for updating the PR.

  • In the file doc/conf.py, the following are the PEP8 issues:

Line 30:1: E722 do not use bare 'except'
Line 41:1: E402 module level import not at top of file
Line 311:80: E501 line too long (86 > 79 characters)
Line 338:80: E501 line too long (84 > 79 characters)

Comment last updated on August 11, 2017 at 23:05 Hours UTC

@codecov

codecov bot commented Jun 15, 2017

Codecov Report

Merging #295 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #295   +/-   ##
=======================================
  Coverage   98.33%   98.33%           
=======================================
  Files          66       66           
  Lines        3848     3848           
=======================================
  Hits         3784     3784           
  Misses         64       64
Impacted Files Coverage Δ
imblearn/ensemble/easy_ensemble.py 100% <ø> (ø) ⬆️
...g/prototype_selection/edited_nearest_neighbours.py 100% <ø> (ø) ⬆️
.../under_sampling/prototype_selection/tomek_links.py 100% <ø> (ø) ⬆️
imblearn/pipeline.py 97.8% <ø> (ø) ⬆️
...mpling/prototype_selection/random_under_sampler.py 100% <ø> (ø) ⬆️
imblearn/metrics/classification.py 96.77% <ø> (ø) ⬆️
imblearn/ensemble/balance_cascade.py 100% <ø> (ø) ⬆️
imblearn/over_sampling/random_over_sampler.py 100% <ø> (ø) ⬆️
...sampling/prototype_generation/cluster_centroids.py 100% <ø> (ø) ⬆️
imblearn/combine/smote_tomek.py 100% <ø> (ø) ⬆️
... and 9 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e9c2756...98e920e. Read the comment docs.

@glemaitre
Member Author

@chkoar @massich Would you have time to split the task? It is huge.
I would think that ensemble and combination methods are not so hard to explain.

For the moment it is how it looks:
https://699-36019880-gh.circle-artifacts.com/0/home/ubuntu/imbalanced-learn/doc/_build/html/user_guide.html

What needs to be done carefully is to design an example whenever we need some images, such that we generate them automatically. Then we need cross-referencing from the User Guide to the API doc, and as well from the API doc to the examples.

Let me know if you would have some time to put on that.

@glemaitre
Member Author

@pep8speaks

@glemaitre glemaitre changed the title [WIP] User Guide [MRG] User Guide Aug 9, 2017
@glemaitre
Member Author

@massich @chkoar @mrastgoo Can you review it so that we can get it done?

@chkoar
Member

chkoar commented Aug 9, 2017

Could you provide us the link for the artifacts?

@glemaitre
Member Author

doc/combine.rst Outdated

We previously presented :class:`SMOTE` and showed that this method can generate
noisy samples by interpolating new points between marginal outliers and
inliers. This issue can be solved by cleaning the resulting space obtained

cleaning the resulting space obtained from / after over-sampling.

@chkoar
Copy link
Member

chkoar commented Aug 9, 2017

In general this work is quite good. We can merge it in order to enhance the documentation, and correct things afterwards. Some notes:

  • I think that all figures could be smaller.
  • I would place a title on all images (probably under the plots). For instance, what do we see here without reading the text?

Thanks @glemaitre

------------------

While the :class:`RandomOverSampler` is over-sampling by repeating some of the
original samples, :class:`SMOTE` and :class:`ADASYN` generate new samples in by

I would say:

by duplicating some of the original samples of the minority class
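
To illustrate the distinction being discussed, here is a toy sketch in plain Python (not the imblearn implementation; the data is made up): random over-sampling only duplicates existing minority samples, so no new point is ever created.

```python
import random
from collections import Counter

random.seed(0)
X = [[i] for i in range(8)]
y = [0] * 6 + [1] * 2            # 6 majority vs 2 minority samples

# Random over-sampling: draw minority samples with replacement until the
# class counts match -- every added point is an exact copy of an old one.
minority = [xi for xi, yi in zip(X, y) if yi == 1]
extra = random.choices(minority, k=4)
X_res, y_res = X + extra, y + [1] * 4
print(Counter(y_res))            # → Counter({0: 6, 1: 6})
```

SMOTE and ADASYN instead synthesize points that did not exist in the original data, by interpolating between neighbors.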


@massich massich left a comment


I would move the 1.1 and make it a 0. where we state the problem.

If I land in under-sampling, the example of what I'm trying to understand is under the title of over-sampling, so I won't be able to find it. However, if in the index there's a "0 - The problem of imbalance", it's more likely that I realize that I might want to read that first.

I didn't find this ("It is also possible to bootstrap the data when resampling by setting replacement to True.") but I would change it to: "...resampling using bootstrap by setting..." and I would actually not use "it is also possible" but "RandomUnderSampler allows"

generated considering its k neareast-neighbors (corresponding to
``k_neighbors``). For instance, the 3 nearest-neighbors are included in the
blue circle as illustrated in the figure below. Then, one of these
nearest-neighbors :math:`x_{zi}` will be selected and a sample will be

Using "will" in the previous sentence is fine, because it is something that appears as a consequence. But here it confuses me and I had to read it twice (actually more :P). I would say:
"is selected and the new sample is generated as follows:"
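
For reference, the interpolation step being discussed can be sketched in a few lines of plain Python (a toy sketch, not the imblearn implementation; the name `smote_sample` and the data are made up):

```python
import random

def smote_sample(x_i, neighbors, rng=random):
    """SMOTE interpolation step: pick one of the k nearest neighbors
    x_zi of x_i and return x_new = x_i + lam * (x_zi - x_i), with lam
    drawn uniformly in [0, 1], i.e. a point on the segment joining
    the two samples."""
    x_zi = rng.choice(neighbors)
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(x_i, x_zi)]

random.seed(42)
x_i = [1.0, 1.0]
neighbors = [[2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]  # its 3 nearest neighbors
x_new = smote_sample(x_i, neighbors)
# x_new falls inside the box spanned by x_i and its neighbors
```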


###############################################################################
# The algorithm performing prototype selection can be subdivided into two
# groups: (i) the controlled unde-sampling methods and (ii) the cleaning

under-sampling (misses an r)

doc/combine.rst Outdated
.. currentmodule:: imblearn.combine

In this regard, Tomek's link and edited nearest-neighbours are the two cleaning
methods which have been pipeline after SMOTE over-sampling to obtain a cleaner

which have can be added to the pipeline

doc/combine.rst Outdated
pipeline both over- and under-sampling methods: (i) :class:`SMOTETomek`
and (ii) :class:`SMOTEENN`.

Those two classes can be used as any other sampler with identical parameters

These

doc/combine.rst Outdated
>>> print(Counter(y_resampled))
Counter({1: 4566, 0: 4499, 2: 4413})

We can also see in the example below that :class:`SMOTEENN` tend to clean more

tends


.. currentmodule:: imblearn.datasets

The ``imblearn.datasets`` package is complementing the the

one extra "the"

.. currentmodule:: imblearn.datasets

The ``imblearn.datasets`` package is complementing the the
``sklearn.datasets`` package. The package provide both: (i) a set of

provides

Controlled under-sampling techniques
------------------------------------

:class:`RandomUnderSampler` is a fast and easy to balance the data by randomly

easy way to balance
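
The idea behind random under-sampling (including the `replacement` parameter quoted elsewhere in this thread) can be sketched with the stdlib alone; this is a hypothetical helper, not the imblearn implementation:

```python
import random
from collections import Counter

def random_under_sample(X, y, replacement=False, seed=0):
    """Sketch of random under-sampling: keep every minority sample and
    draw the same number of samples from each other class at random."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_min = min(len(v) for v in by_class.values())
    X_res, y_res = [], []
    for label, samples in by_class.items():
        if replacement:                 # bootstrap: draw with replacement
            picked = rng.choices(samples, k=n_min)
        else:                           # plain subsampling, no duplicates
            picked = rng.sample(samples, n_min)
        X_res.extend(picked)
        y_res.extend([label] * n_min)
    return X_res, y_res

X = [[i] for i in range(12)]
y = [0] * 9 + [1] * 3                   # 9 majority vs 3 minority samples
X_res, y_res = random_under_sample(X, y)
print(Counter(y_res))                   # → Counter({0: 3, 1: 3})
```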


As later stated in the next section, :class:`NearMiss` heuristic rules are
based on nearest neighbors algorithm. Therefore, the parameters ``n_neighbors``
and ``n_neighbors_ver3`` accepts classifier derived from ``KNeighborsMixin``

accept ?

will be selected. NearMiss-2 will not have this effect since it does not focus
on the nearest samples but rather on the farthest samples. We can imagine that
the presence of noise can also altered the sampling mainly in the presence of
marginal outliers. NearMiss-3 is probably the version which will be the less

which will be less affected

In the contrary, :class:`OneSidedSelection` will use :class:`TomekLinks` to
remove noisy samples. In addition, more samples will be kept since it will not
iterate over the samples of the majority class but all samples which do not
agree with the 1 nearest neighbor rule will be added at once. The class can be

Two sentences, maybe; "will be added at once" at the end of the sentence doesn't make sense.

This class has 2 important parameters. ``estimator`` will accept any
scikit-learn classifier which has a method ``predict_proba``. The classifier
training is performed using a cross-validation and the parameter ``cv`` can set
the number of fold to use.

folds

@glemaitre
Member Author

resampling using bootstrap by setting

bootstrap is a verb which means resampling with replacement. So I would be inclined to use:

RandomUnderSampler allows to bootstrap the data by setting ....

@glemaitre
Member Author

@chkoar @massich I added the backreferencing of sphinx-gallery. I think this is good for merging and nitpicking can come in another PR.

I'll let you do the merging if you agree.

@glemaitre glemaitre force-pushed the master branch 2 times, most recently from 1b22868 to 33660d4 Compare August 11, 2017 14:43
@chkoar chkoar merged commit ca5452c into scikit-learn-contrib:master Aug 12, 2017
@glemaitre
Member Author

Finally we got a user guide :D

Successfully merging this pull request may close these issues.

Create a User Guide