FIX Prevent incorrect class category resampling in SMOTENC when median_std_ == 0 #675

Merged
glemaitre merged 7 commits into scikit-learn-contrib:master from bganglia:smotenc-with-zero-median-std
Jun 9, 2020

bganglia (Contributor) commented Jan 14, 2020

Fixes #662

What does this implement/fix? Explain your changes.

If the median standard deviation is 0, multiplying the 1 entries of the encoded categorical features by it zeroes them out and discards the category information. The SMOTENC class now stores the categorical features before that multiplication, so information about the most common categorical labels can still be used in _get_samples.
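To make the failure mode concrete, here is a minimal NumPy-only sketch. It is not the imblearn implementation: the one-hot encoding is hand-rolled, the plain `argmax` stands in for whatever selection logic the library uses, and all variable names are illustrative assumptions. It only shows that scaling by a zero median standard deviation erases the categories, while a copy taken beforehand keeps them.

```python
import numpy as np

# Minority-class rows from the example in this PR: three continuous
# columns followed by one categorical column.
X_minority = np.array([[1, 2, 4, 2],
                       [1, 2, 5, 2]], dtype=float)
continuous = X_minority[:, :3]
categorical = X_minority[:, 3].astype(int)

# Median of the per-feature standard deviations of the continuous columns.
median_std = np.median(continuous.std(axis=0))

# Hand-rolled one-hot encoding over the categories 0, 1, 2 (assumption:
# this only mimics the idea, not the library's encoder).
categories = np.array([0, 1, 2])
one_hot = (categorical[:, None] == categories[None, :]).astype(float)

# Keep a copy before scaling so the category information survives.
one_hot_saved = one_hot.copy()

# Scaling the 1 entries by a zero median std wipes out every category.
one_hot_scaled = one_hot * median_std

# np.argmax on an all-zero vector returns index 0, so the chosen
# category is arbitrary rather than the most common one.
lost = categories[one_hot_scaled.sum(axis=0).argmax()]
kept = categories[one_hot_saved.sum(axis=0).argmax()]
print(median_std, lost, kept)
```

With the saved copy, the most common category among the minority rows (2) is recovered; with the scaled matrix, every category ties at zero.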

Checklist:

  • Write tests

Example:

import numpy as np
from imblearn.over_sampling import SMOTENC

# Seed NumPy's global RNG so the synthetic sample is reproducible
np.random.seed(2)

# Original data: three continuous columns followed by one categorical column
X = np.array([[1, 2, 4, 2],  # minority class
              [1, 2, 5, 2],  # minority class
              [1, 2, 1, 0],
              [2, 1, 2, 0],
              [1, 2, 3, 1]])
y = np.array(['A', 'A', 'B', 'B', 'B'])

# Construct SMOTENC with a boolean mask marking the categorical column
smotenc = SMOTENC(
    categorical_features=[False, False, False, True],
    sampling_strategy="not majority",
    k_neighbors=1,
)

# Resample
X_resampled, y_resampled = smotenc.fit_resample(X, y)
print(X_resampled)
print(y_resampled)

Output on master:

[[1.         2.         4.         2.        ]
 [1.         2.         5.         2.        ]
 [1.         2.         1.         0.        ]
 [2.         1.         2.         0.        ]
 [1.         2.         3.         1.        ]
 [1.         2.         4.18508208 1.        ]]
['A' 'A' 'B' 'B' 'B' 'A']

Only the last row is new. It has the category 1 in the fourth column, even though all rows from the minority class have the category 2 in the fourth column. This is incorrect.

Output on this fork:

[[1.         2.         4.         2.        ]
 [1.         2.         5.         2.        ]
 [1.         2.         1.         0.        ]
 [2.         1.         2.         0.        ]
 [1.         2.         3.         1.        ]
 [1.         2.         4.18508208 2.        ]]
['A' 'A' 'B' 'B' 'B' 'A']

Here, the resampled row correctly has the category 2 in the fourth column.
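The property being compared in the two outputs can be expressed as a small check: every synthetic row's category must already occur among the original rows of its class. The sketch below is only an illustration, not the test added in this PR. The helper name `synthetic_categories_valid` is hypothetical, the arrays are copied from the two outputs above, and it assumes the synthetic rows are appended after the original ones, as in those outputs.

```python
import numpy as np


def synthetic_categories_valid(X_res, y_res, n_original, cat_col):
    """Return True if each synthetic row's category occurs in the
    original rows of the same class (hypothetical helper)."""
    X_orig, y_orig = X_res[:n_original], y_res[:n_original]
    for row, label in zip(X_res[n_original:], y_res[n_original:]):
        seen = np.unique(X_orig[y_orig == label, cat_col])
        if row[cat_col] not in seen:
            return False
    return True


# Labels and original rows, copied from the outputs in this PR.
y_res = np.array(['A', 'A', 'B', 'B', 'B', 'A'])
original = [[1, 2, 4, 2], [1, 2, 5, 2], [1, 2, 1, 0], [2, 1, 2, 0], [1, 2, 3, 1]]

# The only difference is the category of the last (synthetic) row.
X_master = np.array(original + [[1, 2, 4.18508208, 1]])
X_fork = np.array(original + [[1, 2, 4.18508208, 2]])

print(synthetic_categories_valid(X_master, y_res, 5, 3))  # False
print(synthetic_categories_valid(X_fork, y_res, 5, 3))    # True
```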

pep8speaks commented Jan 14, 2020

Hello @bganglia! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-09 09:01:06 UTC

glemaitre (Member) commented

Could you retrigger the CI now that we have solved the issue with the dependencies?

codecov bot commented Jan 31, 2020

Codecov Report

Merging #675 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #675   +/-   ##
=======================================
  Coverage   96.48%   96.49%           
=======================================
  Files          82       82           
  Lines        5035     5043    +8     
=======================================
+ Hits         4858     4866    +8     
  Misses        177      177           

Impacted Files                                   Coverage Δ
imblearn/over_sampling/_smote.py                 97.25% <100.00%> (+0.01%) ⬆️
imblearn/over_sampling/tests/test_smote_nc.py    100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a8a8adb...1c4e380.

@glemaitre glemaitre self-assigned this Jun 9, 2020
glemaitre (Member) commented

I added your non-regression test, and I think we are good to merge.

@glemaitre glemaitre changed the title from "[WIP] Prevent incorrect class category resampling in SMOTENC when median_std_ == 0" to "FIX Prevent incorrect class category resampling in SMOTENC when median_std_ == 0" Jun 9, 2020
@glemaitre glemaitre removed their assignment Jun 9, 2020
@glemaitre glemaitre merged commit 3c6d232 into scikit-learn-contrib:master Jun 9, 2020
glemaitre (Member) commented

@bganglia Thanks for the contribution.

Development

Successfully merging this pull request may close these issues.

SMOTE-NC: wrong class categories are resampled when self.median_std_ == 0.0
