
[MRG] User Guide #295

Merged
merged 54 commits into from Aug 12, 2017

@glemaitre
Member

glemaitre commented Jun 15, 2017

Reference Issue

closes #253

What does this implement/fix? Explain your changes.

Any other comments?

  • Over-sampling
  • Under-sampling
  • Combination of algorithms
  • Ensemble of algorithms
  • Datasets module
  • Application example:
    • Text classification or something like that.
@pep8speaks


pep8speaks commented Jun 15, 2017

Hello @glemaitre! Thanks for updating the PR.

  • In the file doc/conf.py, following are the PEP8 issues :

Line 30:1: E722 do not use bare 'except'
Line 41:1: E402 module level import not at top of file
Line 311:80: E501 line too long (86 > 79 characters)
Line 338:80: E501 line too long (84 > 79 characters)

Comment last updated on August 11, 2017 at 23:05 Hours UTC
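
For readers unfamiliar with these codes, a minimal illustrative sketch of the kind of fix each one calls for; the snippet is hypothetical and is not the actual content of doc/conf.py:

# E722: name the exception instead of using a bare `except:`
try:
    import sphinx_rtd_theme  # hypothetical import, for illustration only
except ImportError:
    sphinx_rtd_theme = None

# E402 flags module-level imports placed below executable code; the fix is
# to move the import to the top of the file. E501 flags lines longer than
# 79 characters; wrap or split them.
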
@codecov


codecov bot commented Jun 15, 2017

Codecov Report

Merging #295 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #295   +/-   ##
=======================================
  Coverage   98.33%   98.33%           
=======================================
  Files          66       66           
  Lines        3848     3848           
=======================================
  Hits         3784     3784           
  Misses         64       64
Impacted Files Coverage Δ
imblearn/ensemble/easy_ensemble.py 100% <ø> (ø) ⬆️
...g/prototype_selection/edited_nearest_neighbours.py 100% <ø> (ø) ⬆️
.../under_sampling/prototype_selection/tomek_links.py 100% <ø> (ø) ⬆️
imblearn/pipeline.py 97.8% <ø> (ø) ⬆️
...mpling/prototype_selection/random_under_sampler.py 100% <ø> (ø) ⬆️
imblearn/metrics/classification.py 96.77% <ø> (ø) ⬆️
imblearn/ensemble/balance_cascade.py 100% <ø> (ø) ⬆️
imblearn/over_sampling/random_over_sampler.py 100% <ø> (ø) ⬆️
...sampling/prototype_generation/cluster_centroids.py 100% <ø> (ø) ⬆️
imblearn/combine/smote_tomek.py 100% <ø> (ø) ⬆️
... and 9 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e9c2756...98e920e.

@glemaitre glemaitre referenced this pull request Jul 9, 2017

Closed

[RFC] best practices #383

@glemaitre


Member

glemaitre commented Aug 4, 2017

@chkoar @massich Would you have time to split up the task? It is huge.
I would think that ensemble and combination methods are not so hard to explain.

For the moment, this is how it looks:
https://699-36019880-gh.circle-artifacts.com/0/home/ubuntu/imbalanced-learn/doc/_build/html/user_guide.html

What needs to be done carefully is to design an example whenever we need some images, such that we generate them automatically. Then we need cross-referencing from the User Guide to the API doc, and from the API doc to the examples as well.

Let me know if you would have some time to put on that.

@glemaitre glemaitre changed the title from [WIP] User Guide to [MRG] User Guide Aug 9, 2017

@glemaitre


Member

glemaitre commented Aug 9, 2017

@massich @chkoar @mrastgoo Can you review it such that we get down with it.

@chkoar


Member

chkoar commented Aug 9, 2017

Could you provide us with the link to the artifacts?

We previously presented :class:`SMOTE` and showed that this method can generate
noisy samples by interpolating new points between marginal outliers and
inliers. This issue can be solved by cleaning the resulting space obtained


@mrastgoo

mrastgoo Aug 9, 2017

cleaning the resulted space obtained from/ after over-sampling.
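
For context, a minimal sketch of the cleaning idea discussed in this passage, on a toy dataset (note that ``fit_resample`` was spelled ``fit_sample`` in the releases contemporary with this PR):

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_samples=5000, weights=[0.1, 0.9], random_state=0)
# over-sample with SMOTE, then clean the noisy interpolated points with ENN
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
X_clean, y_clean = EditedNearestNeighbours().fit_resample(X_sm, y_sm)
print(Counter(y), Counter(y_sm), Counter(y_clean))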

@glemaitre glemaitre force-pushed the scikit-learn-contrib:master branch from 9395cbe to 333d81b Aug 9, 2017

@chkoar


Member

chkoar commented Aug 9, 2017

In general this work is quite good. We can merge it now and enhance and correct the documentation afterwards. Some notes:

  • I think that all figures could be smaller.
  • I would place a title on all images (probably under the plots). For instance, what do we see here without reading the text?

Thanks @glemaitre

------------------
While the :class:`RandomOverSampler` is over-sampling by repeating some of the
original samples, :class:`SMOTE` and :class:`ADASYN` generate new samples in by


@chkoar

chkoar Aug 9, 2017

Member

I would say:

by duplicating some of the original samples of the minority class
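
A minimal sketch contrasting the two behaviours described in the quoted passage, on a toy dataset; all three samplers reach balanced class counts, but the first duplicates existing samples while the other two interpolate new ones:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler

X, y = make_classification(n_samples=1000, weights=[0.05, 0.95], random_state=0)
for sampler in (RandomOverSampler(random_state=0),  # repeats minority samples
                SMOTE(random_state=0),              # interpolates between neighbors
                ADASYN(random_state=0)):            # interpolates, density-aware
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))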

@massich

I would move the 1.1 and add a 0. where we state the problem.

If I land in under-sampling, the example of what I'm trying to understand is under the title of over-sampling, so I won't be able to find it. However, if in the index there's a '0. The problem of imbalance', it's more likely that I realize that I might want to read that first.

I didn't find this (It is also possible to bootstrap the data when resampling by setting replacement to True.) but I would change it to: ...resampling using bootstrap by setting ..., and I would actually not start with 'It is also possible' but with 'RandomUnderSampler allows'

generated considering its k nearest-neighbors (corresponding to
``k_neighbors``). For instance, the 3 nearest-neighbors are included in the
blue circle as illustrated in the figure below. Then, one of these
nearest-neighbors :math:`x_{zi}` will be selected and a sample will be


@massich

massich Aug 9, 2017

Contributor

Using 'will' in the previous sentence is fine, 'cos there it is something that appears as a consequence. But here it confuses me and I had to read it twice (actually more :P). I would say:
is selected and the new sample is generated as follows:
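
For context, the sentence under discussion continues in the guide with the interpolation rule:

.. math::

   x_{new} = x_i + \lambda \times (x_{zi} - x_i)

where :math:`\lambda` is a random number drawn in :math:`[0, 1]`.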

###############################################################################
# The algorithm performing prototype selection can be subdivided into two
# groups: (i) the controlled unde-sampling methods and (ii) the cleaning


@massich

massich Aug 9, 2017

Contributor

under-sampling (misses an r)
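
A minimal sketch of the two groups named in the quoted passage: a controlled method lets the user fix the number of samples to keep, while a cleaning method removes samples according to a rule, so the final counts cannot be specified:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)
# controlled: the resulting class counts are specified (balanced by default)
X_c, y_c = RandomUnderSampler(random_state=0).fit_resample(X, y)
# cleaning: samples forming Tomek's links are removed, counts are data-driven
X_t, y_t = TomekLinks().fit_resample(X, y)
print(Counter(y_c), Counter(y_t))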

.. currentmodule:: imblearn.combine
In this regard, Tomek's link and edited nearest-neighbours are the two cleaning
methods which have been pipeline after SMOTE over-sampling to obtain a cleaner


@mrastgoo

mrastgoo Aug 9, 2017

which can be added to the pipeline

pipeline both over- and under-sampling methods: (i) :class:`SMOTETomek`
and (ii) :class:`SMOTEENN`.
Those two classes can be used as any other sampler with identical parameters

>>> print(Counter(y_resampled))
Counter({1: 4566, 0: 4499, 2: 4413})
We can also see in the example below that :class:`SMOTEENN` tends to clean more
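
A minimal sketch of the two combined samplers on a toy three-class dataset similar to the one used in the guide; the exact counts depend on the generated data:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

X, y = make_classification(n_samples=5000, n_classes=3, n_informative=4,
                           weights=[0.01, 0.05, 0.94], random_state=0)
for sampler in (SMOTETomek(random_state=0), SMOTEENN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))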

.. currentmodule:: imblearn.datasets
The ``imblearn.datasets`` package is complementing the the


@mrastgoo

mrastgoo Aug 9, 2017

one 'the' extra

.. currentmodule:: imblearn.datasets
The ``imblearn.datasets`` package is complementing the the
``sklearn.datasets`` package. The package provide both: (i) a set of
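
A minimal sketch of the ``make_imbalance`` helper from this package; the target counts below are illustrative, and older releases expressed them through a ``ratio`` parameter instead of ``sampling_strategy``:

from collections import Counter

from sklearn.datasets import load_iris
from imblearn.datasets import make_imbalance

X, y = load_iris(return_X_y=True)
# keep 50, 25 and 10 samples of the three iris classes
X_imb, y_imb = make_imbalance(X, y,
                              sampling_strategy={0: 50, 1: 25, 2: 10},
                              random_state=0)
print(Counter(y_imb))  # Counter({0: 50, 1: 25, 2: 10})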

Controlled under-sampling techniques
------------------------------------
:class:`RandomUnderSampler` is a fast and easy to balance the data by randomly


@mrastgoo

mrastgoo Aug 9, 2017

easy way to balance
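
A minimal sketch of the sampler described in the quoted passage, on a toy dataset:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)
# randomly drop majority samples until the classes are balanced
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_res))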

As stated in the next section, :class:`NearMiss` heuristic rules are
based on the nearest neighbors algorithm. Therefore, the parameters ``n_neighbors``
and ``n_neighbors_ver3`` accept a classifier derived from ``KNeighborsMixin``
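
A minimal sketch of the parameter usage described here; ``n_neighbors`` accepts either an integer or an estimator deriving from ``KNeighborsMixin``:

from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors
from imblearn.under_sampling import NearMiss

X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)
# pass a KNeighborsMixin estimator instead of a plain integer
nm = NearMiss(version=1, n_neighbors=NearestNeighbors(n_neighbors=3))
X_res, y_res = nm.fit_resample(X, y)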

will be selected. NearMiss-2 will not have this effect since it does not focus
on the nearest samples but rather on the farthest samples. We can imagine that
the presence of noise can also alter the sampling mainly in the presence of
marginal outliers. NearMiss-3 is probably the version which will be the less


@mrastgoo

mrastgoo Aug 9, 2017

which will be less affected

In the contrary, :class:`OneSidedSelection` will use :class:`TomekLinks` to
remove noisy samples. In addition, more samples will be kept since it will not
iterate over the samples of the majority class but all samples which do not
agree with the 1 nearest neighbor rule will be added at once. The class can be


@mrastgoo

mrastgoo Aug 9, 2017

Two sentences maybe; 'will be added at once' at the end of the sentence doesn't make sense
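
A minimal sketch of the class described in the quoted passage, on a toy dataset:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import OneSidedSelection

X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)
# 1-NN rule plus Tomek's links removal, as described above
X_res, y_res = OneSidedSelection(random_state=0).fit_resample(X, y)
print(Counter(y_res))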

This class has 2 important parameters. ``estimator`` will accept any
scikit-learn classifier which has a method ``predict_proba``. The classifier
training is performed using cross-validation and the parameter ``cv`` sets
the number of folds to use.
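
The quoted passage presumably refers to :class:`InstanceHardnessThreshold`; assuming so, a minimal sketch of the two parameters it mentions:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import InstanceHardnessThreshold

X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)
# `estimator` must expose predict_proba; `cv` sets the number of folds
iht = InstanceHardnessThreshold(estimator=LogisticRegression(),
                                random_state=0, cv=5)
X_res, y_res = iht.fit_resample(X, y)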

@glemaitre


Member

glemaitre commented Aug 9, 2017

resampling using bootstrap by setting

bootstrap is a verb which means resampling with replacement. So I would be inclined to use:

RandomUnderSampler allows to bootstrap the data by setting ....
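
A minimal sketch of the behaviour under discussion; setting ``replacement=True`` draws the kept samples with replacement, i.e. a bootstrap:

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)
# with replacement=True the same original sample may be kept several times
X_res, y_res = RandomUnderSampler(replacement=True,
                                  random_state=0).fit_resample(X, y)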

glemaitre added some commits Aug 9, 2017

@glemaitre


Member

glemaitre commented Aug 9, 2017

@chkoar @massich I added the back-referencing of sphinx-gallery. I think this is good for merging, and nitpicking can come in another PR.

I'll let you do the merging if you agree.

@glemaitre glemaitre force-pushed the scikit-learn-contrib:master branch 2 times, most recently from 1b22868 to 33660d4 Aug 11, 2017

@chkoar chkoar merged commit ca5452c into scikit-learn-contrib:master Aug 12, 2017

6 checks passed

ci/circleci Your tests passed on CircleCI!
code-quality/landscape Code quality increased by 0.13%
codecov/patch Coverage not affected when comparing e9c2756...98e920e
codecov/project 98.33% remains the same compared to e9c2756
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@glemaitre


Member

glemaitre commented Aug 12, 2017

Finally we got a user guide :D
