[WIP] update doscstrings NCL #854

Closed
wants to merge 2 commits into from
27 changes: 23 additions & 4 deletions doc/under_sampling.rst
@@ -347,10 +347,29 @@ place. The class can be used as::
Our implementation offers the option to set the number of seeds initially put
in the set :math:`C` through the parameter ``n_seeds_S``.

:class:`NeighbourhoodCleaningRule` will focus on cleaning the data than
condensing them :cite:`laurikkala2001improving`. Therefore, it will used the
union of samples to be rejected between the :class:`EditedNearestNeighbours`
and the output a 3 nearest neighbors classifier. The class can be used as::
:class:`NeighbourhoodCleaningRule` focuses more on cleaning the data than on
reducing the number of samples :cite:`laurikkala2001improving`. It extends
:class:`EditedNearestNeighbours` by further eliminating samples from the
majority classes when they are among the 3 closest neighbours of a sample from
the minority class and all, or most, of those neighbours do not belong to the
minority class. The procedure for :class:`NeighbourhoodCleaningRule` is as follows:

1. Split the dataset :math:`T` into the class of interest :math:`C` (minority)
   and the rest of the data :math:`O`.
2. Identify the noisy data :math:`A_1` in :math:`O` with the edited nearest
   neighbour rule.
3. For each (majority) class in :math:`O`, add to group :math:`A_2` every
   observation that is one of the 3 closest neighbours of a minority sample for
   which all, or most, of those neighbours are not from the minority class.
4. Reduce the original data: :math:`S = T - (A_1 \cup A_2)`.
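
The four steps above can be sketched in plain Python. This is only an illustrative reading of the list, not the code in ``imblearn``: the helper names ``k_nearest`` and ``ncr_sketch`` are hypothetical, the class of interest is assumed to be the least frequent label, distances are Euclidean, and the class-size condition on the cleaning step is left out.

```python
from collections import Counter
from math import dist


def k_nearest(X, i, k):
    """Indices of the k samples closest to X[i], excluding i itself."""
    others = [j for j in range(len(X)) if j != i]
    return sorted(others, key=lambda j: dist(X[i], X[j]))[:k]


def ncr_sketch(X, y, k=3):
    """Indices kept by a simplified neighbourhood cleaning rule (hypothetical helper)."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)  # step 1: class of interest C

    a1, a2 = set(), set()
    for i, label in enumerate(y):
        neigh = k_nearest(X, i, k)
        # Most frequent label among the neighbours (ties are broken arbitrarily).
        voted = Counter(y[j] for j in neigh).most_common(1)[0][0]
        if label != minority:
            # Step 2 (ENN): a sample of O is noisy if most neighbours disagree with it.
            if voted != label:
                a1.add(i)
        elif voted != minority:
            # Step 3: the non-minority neighbours of a misclassified minority
            # sample are added to A2.
            a2.update(j for j in neigh if y[j] != minority)

    removed = a1 | a2  # step 4: S = T - (A1 union A2)
    return [i for i in range(len(y)) if i not in removed]


# Toy data: a minority cluster with one intruder (index 3), one minority
# sample surrounded by the majority (index 4), and a clean majority cluster.
X = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.1, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (4.9, 5.0),
     (10.0, 10.0), (10.1, 10.0), (10.0, 10.1), (10.1, 10.1)]
y = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
kept = ncr_sketch(X, y)
print(kept)  # [0, 1, 2, 4, 8, 9, 10, 11]
```

In this toy run, sample 3 is dropped by the ENN step and samples 5, 6 and 7 by the cleaning step; no minority sample is ever removed.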

The ENN step (step 2) removes a sample from the majority classes when most of
its neighbours disagree with its class. The cleaning step (step 3) then removes
further samples from the majority classes, subject to one condition: it only
cleans samples from classes that contain a minimum number of observations.
This minimum is controlled by the ``threshold_cleaning`` parameter. In the original
article :cite:`laurikkala2001improving`, samples would be removed only if their class
had at least half as many observations as the minority class.
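
As a rough illustration of the condition described in the article, the check below selects the classes that would be eligible for the cleaning step. It is a sketch of the paragraph's wording only, and may differ from the rule currently implemented behind ``threshold_cleaning``; the helper name ``classes_to_clean`` and its arguments are hypothetical.

```python
from collections import Counter


def classes_to_clean(y, class_of_interest, threshold=0.5):
    """Classes with at least `threshold` times as many samples as the class of interest."""
    counts = Counter(y)
    n_interest = counts[class_of_interest]
    return sorted(
        label
        for label, n in counts.items()
        if label != class_of_interest and n >= threshold * n_interest
    )


# Class "b" is the class of interest with 20 samples, so a class needs
# at least 10 samples to be cleaned when the threshold is 0.5.
y = ["a"] * 100 + ["b"] * 20 + ["c"] * 8 + ["d"] * 12
eligible = classes_to_clean(y, class_of_interest="b")
print(eligible)  # ['a', 'd']
```

Class ``"c"`` is skipped because its 8 samples fall below the required 10, so the cleaning step would leave it untouched.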

The class can be used as::

>>> from imblearn.under_sampling import NeighbourhoodCleaningRule
>>> ncr = NeighbourhoodCleaningRule()
@@ -28,7 +28,7 @@
class NeighbourhoodCleaningRule(BaseCleaningSampler):
"""Undersample based on the neighbourhood cleaning rule.

This class uses ENN and a k-NN to remove noisy samples from the datasets.
This class uses ENN and a k-NN to remove noisy samples from the dataset.

Read more in the :ref:`User Guide <condensed_nearest_neighbors>`.

@@ -40,19 +40,24 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
If ``int``, size of the neighbourhood to consider to compute the
nearest neighbors. If object, an estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the nearest-neighbors. By default, it will be a 3-NN.
find the nearest-neighbors. By default, it explores the 3 closest
neighbors.

kind_sel : {{"all", "mode"}}, default='all'
Strategy to use in order to exclude samples in the ENN sampling.

- If ``'all'``, all neighbours will have to agree with the samples of
interest to not be excluded.
- If ``'mode'``, the majority vote of the neighbours will be used in
order to exclude a sample.
- If ``'all'``, all neighbours will have to agree with a sample in order
for it not to be excluded.
- If ``'mode'``, the majority of the neighbours will have to agree with
a sample in order for it not to be excluded.

The strategy `"all"` will be less conservative than `'mode'`. Thus,
more samples will be removed when `kind_sel="all"` generally.
more samples will be removed when `kind_sel="all"`, generally.

Note that this parameter only applies to the cleaning step of the NCL.
The ENN is done using majority vote, as described in the original article.

#TODO: this is not the originally described threshold, fix
threshold_cleaning : float, default=0.5
Threshold used to decide whether or not to consider a class during the
cleaning after applying ENN. A class will be considered during cleaning when: