[BUG] NCL - class should be cleaned if number of samples is 0.5 * minority samples, not if 0.5 * data.shape[0] #764

Closed
solegalli opened this issue Oct 14, 2020 · 4 comments · Fixed by #1012

Comments

@solegalli
Contributor

solegalli commented Oct 14, 2020

Describe the bug

Neighbourhood cleaning rule procedure:

  1. Split data T into the class of interest C (minority) and the rest of data O.
  2. Identify noisy data A1 in O with the edited nearest neighbor rule.
  3. For each class Ci in O (that is, for each observation in the majority class(es)):
    if ( x ∈ Ci in 3-nearest neighbors of misclassified y ∈ C )
    and ( | Ci | ≥ 0.5 · | C | ) then A2 = { x } ∪ A2
  4. Reduced data S = T − ( A1 ∪ A2 )

The above is a copy of the pseudo code in the article. There, C is the minority class or class of interest.

To quote the article further:
"To avoid excessive reduction of small classes, only examples from classes larger or equal to 0.5 * | C | are considered while forming A2." The article states earlier that C is the minority class, and it refers to the entire dataset as T.
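For reference, the four steps of the pseudo code can be sketched in plain NumPy roughly as follows. This is only my reading of the article, not the imbalanced-learn implementation: the names `knn_indices` and `ncl_sketch` are made up for illustration, the neighbours are found by brute force, and ties in the 3-NN vote are broken arbitrarily.

```python
from collections import Counter

import numpy as np


def knn_indices(X, k=3):
    """Indices of the k nearest neighbours of every row (brute force)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    return np.argsort(dist, axis=1)[:, :k]


def ncl_sketch(X, y, minority, k=3, t=0.5):
    """Hypothetical sketch of the NCL pseudo code (steps 1 to 4)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = y.tolist()
    nn = knn_indices(X, k)

    # k-NN majority vote for every sample (ties broken arbitrarily).
    pred = np.array([Counter(labels[j] for j in row).most_common(1)[0][0]
                     for row in nn])
    misclassified = pred != y

    # Step 1: O is everything outside the class of interest C.
    in_O = y != minority

    # Step 2: A1 = samples of O misclassified by their neighbours (ENN).
    a1 = set(np.flatnonzero(in_O & misclassified).tolist())

    # Step 3: A2 = neighbours of misclassified C samples, restricted to
    # classes Ci with |Ci| >= t * |C|, where C is the minority class.
    counts = Counter(labels)
    eligible = {c for c, n in counts.items()
                if c != minority and n >= t * counts[minority]}
    a2 = set()
    for i in np.flatnonzero(~in_O & misclassified):
        for j in nn[i]:
            if labels[j] in eligible:
                a2.add(int(j))

    # Step 4: S = T - (A1 union A2).
    keep = np.array(sorted(set(range(len(labels))) - a1 - a2), dtype=int)
    return X[keep], y[keep]


# Toy data: one class-0 sample (11.4) sits inside the minority cluster.
X = np.array([0, 1, 2, 3, 10, 11, 12, 11.4]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 0])
X_res, y_res = ncl_sketch(X, y, minority=1)
print(len(y_res))
```

In this toy example only the class-0 sample at 11.4 is removed (it ends up in A1), and no minority sample is ever dropped, since both A1 and A2 are subsets of O.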

@solegalli solegalli changed the title from "[BUG] Neighbourhood cleaning rule algo - CNN should fit to O" to "[BUG] NCL - class should be cleaned if number of samples is 0.5 * minority samples, not if 0.5 * data.shape[0]" on Aug 11, 2021
@solegalli
Contributor Author

I renamed the issue because, after reading the paper further, I realised my original interpretation was wrong. The implementation in imbalanced-learn reflects what is proposed in the paper, except for the criterion used to exclude classes from the cleaning procedure.

@solegalli
Contributor Author

@glemaitre @chkoar was there a reason for implementing this check as `n_samples > X.shape[0] * self.threshold_cleaning`?

Otherwise, I am happy to pick this up. Pls let me know.
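To make the difference concrete, here is a small sketch with made-up class counts. The two helper functions are hypothetical and not part of imbalanced-learn; only the expression `n_samples > X.shape[0] * threshold_cleaning` is taken from the current code, and I am assuming the minority class itself is never cleaned.

```python
from collections import Counter


def classes_cleaned_dataset_rule(y, minority, threshold=0.5):
    """Current check: n_samples > X.shape[0] * threshold_cleaning."""
    counts = Counter(y)
    return sorted(c for c, n in counts.items()
                  if c != minority and n > len(y) * threshold)


def classes_cleaned_paper_rule(y, minority, threshold=0.5):
    """Article's check: |Ci| >= 0.5 * |C|, with C the minority class."""
    counts = Counter(y)
    return sorted(c for c, n in counts.items()
                  if c != minority and n >= counts[minority] * threshold)


# Made-up labels: 60 / 30 / 10 samples, class 2 is the minority.
y = [0] * 60 + [1] * 30 + [2] * 10
dataset_rule = classes_cleaned_dataset_rule(y, minority=2)
paper_rule = classes_cleaned_paper_rule(y, minority=2)
print(dataset_rule, paper_rule)  # [0] [0, 1]
```

With the dataset-size reference a class needs more than 50 samples to be cleaned, so class 1 (30 samples) is skipped. With the minority-size reference the bar is 5 samples, so both majority classes are cleaned, which is how I read the article.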

@glemaitre
Member

> n_samples > X.shape[0] * self.threshold_cleaning

It corresponds to |C_i| > |C| * t, where t defaults to 0.5 as in the paper. We exposed t as a parameter so that users can control which of the other classes get cleaned.

I will add some additional tests now but the algorithm looks fine to me.

@glemaitre glemaitre reopened this Jul 10, 2023
@glemaitre
Member

Oh no, I see your point. Indeed, the reference should be the minority class, not the whole dataset.
