FIX Prevent incorrect class category resampling in SMOTENC when median_std_ == 0 #675

Merged

bganglia
Contributor

@bganglia bganglia commented Jan 14, 2020

Fixes #662

What does this implement/fix? Explain your changes.

When the median standard deviation is 0, SMOTENC now stores the one-hot encoded categorical features before the 1's are multiplied by the median standard deviation. Multiplying by 0 would otherwise zero out every entry, so storing them first keeps the information about the most common categorical labels available in _get_samples.
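To illustrate why the stored copy matters, here is a minimal sketch in plain Python using the minority rows from the example in this PR. It is not the code from the patch: the variable names are hypothetical, and it assumes the median is taken over the per-feature standard deviations of the minority class's continuous columns.

```python
from statistics import median, pstdev

# Minority-class rows ('A') from the PR example, split by column type.
minority_continuous = [[1, 2, 4], [1, 2, 5]]
minority_categorical = [2, 2]
categories = [0, 1, 2]  # categories present in the full data set

# Assumption: median of the per-feature (population) standard deviations.
stds = [pstdev(column) for column in zip(*minority_continuous)]
median_std = median(stds)  # two of the three features are constant

# One-hot encode the categorical column.
one_hot = [
    [1.0 if value == category else 0.0 for category in categories]
    for value in minority_categorical
]

# Idea behind the fix: keep an unscaled copy before scaling the 1's.
unscaled = [row[:] for row in one_hot]
scaled = [[entry * median_std for entry in row] for row in one_hot]

# Per-category "votes" among the minority rows.
votes_scaled = [sum(column) for column in zip(*scaled)]
votes_unscaled = [sum(column) for column in zip(*unscaled)]

# With median_std == 0 the scaled votes all tie at 0, so the most common
# category can only be recovered from the unscaled copy.
most_common = categories[votes_unscaled.index(max(votes_unscaled))]
print(median_std, votes_scaled, votes_unscaled, most_common)
```

Once the scaled votes are all tied, any tie-breaking rule can select a category that never occurs in the minority class, which matches the symptom reported in #662.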

Checklist:

  • Write tests

Example:

import numpy as np
from imblearn.over_sampling import SMOTENC

np.random.seed(2)

# Original data: three continuous columns followed by one categorical column
X = np.array([[1, 2, 4, 2],  # minority class
              [1, 2, 5, 2],  # minority class
              [1, 2, 1, 0],
              [2, 1, 2, 0],
              [1, 2, 3, 1]])
y = np.array(['A', 'A', 'B', 'B', 'B'])

# Construct SMOTENC with a boolean mask marking the last column as categorical
smotenc = SMOTENC(
    categorical_features=[False, False, False, True],
    sampling_strategy="not majority",
    k_neighbors=1,
)

# Resample
X_resampled, y_resampled = smotenc.fit_resample(X, y)
print(X_resampled)
print(y_resampled)

Output on master:

[[1.         2.         4.         2.        ]
 [1.         2.         5.         2.        ]
 [1.         2.         1.         0.        ]
 [2.         1.         2.         0.        ]
 [1.         2.         3.         1.        ]
 [1.         2.         4.18508208 1.        ]]
['A' 'A' 'B' 'B' 'B' 'A']

Only the last row is new. It has the category 1 in the fourth column, even though all rows from the minority class have the category 2 in the fourth column. This is incorrect.

Output on this fork:

[[1.         2.         4.         2.        ]
 [1.         2.         5.         2.        ]
 [1.         2.         1.         0.        ]
 [2.         1.         2.         0.        ]
 [1.         2.         3.         1.        ]
 [1.         2.         4.18508208 2.        ]]
['A' 'A' 'B' 'B' 'B' 'A']

Here, the resampled row correctly has the category 2 in the fourth column.
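The property checked by eye above could also be asserted programmatically. The following is a hedged sketch, not the test added in this PR: `unseen_categories` is a hypothetical helper, and it assumes the synthetic rows are appended after the original rows, as in both outputs shown here.

```python
def unseen_categories(X, y, X_res, y_res, cat_col, label):
    """Return categorical values found in synthetic rows of `label`
    that never occur in the original rows of that class."""
    seen = {row[cat_col] for row, lab in zip(X, y) if lab == label}
    n_original = len(X)  # assumption: synthetic rows come after the originals
    synthetic = {
        row[cat_col]
        for row, lab in zip(X_res[n_original:], y_res[n_original:])
        if lab == label
    }
    return sorted(synthetic - seen)


# Data copied from the example and the two outputs in this PR.
X = [[1, 2, 4, 2], [1, 2, 5, 2], [1, 2, 1, 0], [2, 1, 2, 0], [1, 2, 3, 1]]
y = ['A', 'A', 'B', 'B', 'B']
y_res = y + ['A']
X_master = X + [[1, 2, 4.18508208, 1]]
X_fork = X + [[1, 2, 4.18508208, 2]]

on_master = unseen_categories(X, y, X_master, y_res, cat_col=3, label='A')  # -> [1]
on_fork = unseen_categories(X, y, X_fork, y_res, cat_col=3, label='A')      # -> []
print(on_master, on_fork)
```

The helper only iterates over rows, so it should work equally on the NumPy arrays returned by `fit_resample`.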

@pep8speaks

pep8speaks commented Jan 14, 2020

Hello @bganglia! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-09 09:01:06 UTC

@glemaitre
Member

Could you retrigger the CI? We have solved the issue with the dependencies.

@codecov

codecov bot commented Jan 31, 2020

Codecov Report

Merging #675 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #675   +/-   ##
=======================================
  Coverage   96.48%   96.49%           
=======================================
  Files          82       82           
  Lines        5035     5043    +8     
=======================================
+ Hits         4858     4866    +8     
  Misses        177      177           

| Impacted Files | Coverage Δ |
|---|---|
| imblearn/over_sampling/_smote.py | 97.25% <100.00%> (+0.01%) ⬆️ |
| imblearn/over_sampling/tests/test_smote_nc.py | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a8a8adb...1c4e380.

@glemaitre glemaitre self-assigned this Jun 9, 2020
@glemaitre
Member

I added your non-regression test and I think that we are good to merge.

@glemaitre glemaitre changed the title from "[WIP] Prevent incorrect class category resampling in SMOTENC when median_std_ == 0" to "FIX Prevent incorrect class category resampling in SMOTENC when median_std_ == 0" Jun 9, 2020
@glemaitre glemaitre removed their assignment Jun 9, 2020
@glemaitre glemaitre merged commit 3c6d232 into scikit-learn-contrib:master Jun 9, 2020
@glemaitre
Member

@bganglia Thanks for the contribution

Successfully merging this pull request may close these issues.

SMOTE-NC: wrong class categories are resampled when self.median_std_ == 0.0