[MRG] User Guide #295

Conversation
Hello @glemaitre! Thanks for updating the PR.
Comment last updated on August 11, 2017 at 23:05 UTC
Codecov Report
@@ Coverage Diff @@
## master #295 +/- ##
=======================================
Coverage 98.33% 98.33%
=======================================
Files 66 66
Lines 3848 3848
=======================================
Hits 3784 3784
Misses 64 64
Continue to review full report at Codecov.
@chkoar @massich Would you have time to split the task? It is huge. For the moment, this is how it looks: what needs to be carefully done is to design an example if we need some images, such that we generate them automatically. Then we need cross-referencing from the API doc to the User Guide, and as well from the API doc to the example. Let me know if you would have some time to put on that.
Could you provide us with the link to the artifacts?
doc/combine.rst:

    We previously presented :class:`SMOTE` and showed that this method can generate
    noisy samples by interpolating new points between marginal outliers and
    inliers. This issue can be solved by cleaning the resulting space obtained
cleaning the resulted space obtained from/ after over-sampling.
In general this work is quite good. We can merge in order to enhance the documentation and correct afterwards. Some notes:
Thanks @glemaitre
doc/over_sampling.rst:

    While the :class:`RandomOverSampler` is over-sampling by repeating some of the
    original samples, :class:`SMOTE` and :class:`ADASYN` generate new samples in by
I would say:
by duplicating some of the original samples of the minority class
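To make the contrast concrete, here is a minimal stdlib-only sketch of what over-sampling by duplication amounts to (`random_over_sample` is a hypothetical helper, not imbalanced-learn's API; :class:`RandomOverSampler` is the real implementation):

```python
import random

def random_over_sample(X, y, target_class, n_extra, seed=0):
    """Naively over-sample by duplicating random samples of one class."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == target_class]
    X_res, y_res = list(X), list(y)
    for _ in range(n_extra):
        # exact duplicate of an existing sample: no new information is added
        X_res.append(rng.choice(minority))
        y_res.append(target_class)
    return X_res, y_res

X = [[0.0], [0.1], [1.0], [1.1], [1.2]]
y = [0, 0, 1, 1, 1]
X_res, y_res = random_over_sample(X, y, target_class=0, n_extra=3)
```

Because the new points are exact copies, this adds no new information — which is precisely why SMOTE and ADASYN interpolate new samples instead.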
I would move the 1.1 and do 0. where we state the problem.
If I land in under-sampling, the example of what I'm trying to understand is under the title of over-sampling, so I won't be able to find it. However, if in the index there's a "0 - The problem of imbalance", it's more likely that I realize that I might want to read that first.
I didn't find this ("It is also possible to bootstrap the data when resampling by setting replacement to True.") but I would change it to: ...resampling using bootstrap by setting ...
and I would actually not use it also
but RandomUnderSampler allows
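To show what the ``replacement=True`` option means in practice, a stdlib-only sketch (`random_under_sample` is a hypothetical helper, not the library's API): with replacement, the kept majority samples form a bootstrap sample, so duplicates can occur.

```python
import random

def random_under_sample(X, y, majority_class, n_keep, replacement=False, seed=0):
    """Keep all minority samples and only n_keep majority samples.
    With replacement=True the majority subset is a bootstrap sample."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_class]
    minority_idx = [i for i, label in enumerate(y) if label != majority_class]
    if replacement:
        # bootstrap: the same majority sample may be drawn several times
        kept = [rng.choice(majority_idx) for _ in range(n_keep)]
    else:
        kept = rng.sample(majority_idx, n_keep)  # distinct samples only
    idx = minority_idx + kept
    return [X[i] for i in idx], [y[i] for i in idx]

X = list(range(10))
y = [0] * 7 + [1] * 3          # class 0 is the majority
X_res, y_res = random_under_sample(X, y, majority_class=0, n_keep=3, replacement=True)
```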
doc/over_sampling.rst:

    generated considering its k nearest-neighbors (corresponding to
    ``k_neighbors``). For instance, the 3 nearest-neighbors are included in the
    blue circle as illustrated in the figure below. Then, one of these
    nearest-neighbors :math:`x_{zi}` will be selected and a sample will be
Using "will" in the previous sentence is fine, because it's something that would appear as a consequence. But here it confuses me and I had to read it twice (actually more :P)
is selected and the new sample is generated as follows:
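The generation step under discussion can be sketched in a few lines (a hypothetical helper, not imbalanced-learn's code): a synthetic point is placed on the segment between the sample :math:`x_i` and its selected neighbor :math:`x_{zi}`, i.e. :math:`x_{new} = x_i + \lambda (x_{zi} - x_i)` with :math:`\lambda \in [0, 1]`.

```python
import random

def smote_sample(x_i, x_zi, rng=None):
    """One SMOTE generation step: x_new = x_i + lam * (x_zi - x_i),
    with lam drawn uniformly from [0, 1]."""
    rng = rng or random.Random(0)
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(x_i, x_zi)]

# the synthetic point always lies on the segment joining the two samples
x_new = smote_sample([0.0, 0.0], [1.0, 2.0])
```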
    ###############################################################################
    # The algorithm performing prototype selection can be subdivided into two
    # groups: (i) the controlled unde-sampling methods and (ii) the cleaning
under-sampling (misses an r)
doc/combine.rst:

    .. currentmodule:: imblearn.combine

    In this regard, Tomek's link and edited nearest-neighbours are the two cleaning
    methods which have been pipeline after SMOTE over-sampling to obtain a cleaner
which can be added to the pipeline
doc/combine.rst:

    pipeline both over- and under-sampling methods: (i) :class:`SMOTETomek`
    and (ii) :class:`SMOTEENN`.

    Those two classes can be used as any other sampler with identical parameters
These
doc/combine.rst:

    >>> print(Counter(y_resampled))
    Counter({1: 4566, 0: 4499, 2: 4413})

    We can also see in the example below that :class:`SMOTEENN` tend to clean more
tends
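As an aside, the ``Counter`` call in the doctest is plain stdlib; a toy version with made-up labels (not the real dataset's counts) shows how it summarizes the class distribution after resampling:

```python
from collections import Counter

# toy labels standing in for the example's y_resampled
y_resampled = [1] * 4 + [0] * 3 + [2] * 3
counts = Counter(y_resampled)
print(counts)  # Counter({1: 4, 0: 3, 2: 3})
```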
doc/datasets/index.rst:

    .. currentmodule:: imblearn.datasets

    The ``imblearn.datasets`` package is complementing the the
one "the" extra
doc/datasets/index.rst:

    .. currentmodule:: imblearn.datasets

    The ``imblearn.datasets`` package is complementing the the
    ``sklearn.datasets`` package. The package provide both: (i) a set of
provides
doc/under_sampling.rst:

    Controlled under-sampling techniques
    ------------------------------------

    :class:`RandomUnderSampler` is a fast and easy to balance the data by randomly
easy way to balance
doc/under_sampling.rst:

    As later stated in the next section, :class:`NearMiss` heuristic rules are
    based on nearest neighbors algorithm. Therefore, the parameters ``n_neighbors``
    and ``n_neighbors_ver3`` accepts classifier derived from ``KNeighborsMixin``
accept ?
doc/under_sampling.rst:

    will be selected. NearMiss-2 will not have this effect since it does not focus
    on the nearest samples but rather on the farthest samples. We can imagine that
    the presence of noise can also altered the sampling mainly in the presence of
    marginal outliers. NearMiss-3 is probably the version which will be the less
which will be less affected
doc/under_sampling.rst:

    In the contrary, :class:`OneSidedSelection` will use :class:`TomekLinks` to
    remove noisy samples. In addition, more samples will be kept since it will not
    iterate over the samples of the majority class but all samples which do not
    agree with the 1 nearest neighbor rule will be added at once. The class can be
Two sentences maybe; "will be added at once" at the end of the sentence doesn't make sense.
doc/under_sampling.rst:

    This class has 2 important parameters. ``estimator`` will accept any
    scikit-learn classifier which has a method ``predict_proba``. The classifier
    training is performed using a cross-validation and the parameter ``cv`` can set
    the number of fold to use.
folds
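To illustrate what "number of folds" means for ``cv``, a simplified stdlib sketch of splitting sample indices into folds (`kfold_indices` is a hypothetical helper; scikit-learn's actual CV splitters additionally handle shuffling and stratification):

```python
def kfold_indices(n_samples, cv):
    """Split range(n_samples) into cv contiguous folds; the first
    n_samples % cv folds get one extra sample."""
    base, extra = divmod(n_samples, cv)
    folds, start = [], 0
    for i in range(cv):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(10, 3)
print(folds)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```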
Bootstrap is a verb which means resampling with replacement. So I would be inclined to use:
1b22868 to 33660d4
Finally we got a user guide :D
Reference Issue
closes #253