[BUG] NearMiss version 3 does not work well with sampling_strategy=dictionary #836

Closed
miguelBra opened this issue May 6, 2021 · 2 comments

Comments

@miguelBra

Describe the bug

Undersampling with NearMiss version 3 does not work well with sampling_strategy=dictionary.

A possible explanation is that the first step of the algorithm already undersamples heavily, so the number of observations left for the second step is already lower than the number requested in the dictionary. As a consequence, the algorithm only seems to work if the desired number of samples is very low compared to the number of existing samples. The code examples below show that, for a class of 357 samples, NearMiss3 does not work if the desired number of samples is 300, but it does work if the desired number is 50.
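To make this explanation concrete, here is a minimal standard-library sketch of such a first step. It assumes that the first step keeps, for every minority sample, only its nearest majority-class neighbours (3 per sample here, which is assumed to mirror the default of `n_neighbors_ver3`). The helper `candidate_pool` and the toy data are hypothetical and are not imbalanced-learn code.

```python
import math


def candidate_pool(majority, minority, n_neighbors=3):
    """Return sorted indices of majority points that are among the
    n_neighbors nearest majority neighbours of at least one minority point.

    Hypothetical helper: a sketch of the assumed first step, not library code.
    """
    pool = set()
    for m in minority:
        # Rank all majority points by Euclidean distance to this minority point.
        ranked = sorted(range(len(majority)),
                        key=lambda i: math.dist(majority[i], m))
        pool.update(ranked[:n_neighbors])
    return sorted(pool)


# Toy data: 10 majority points on a line, 2 minority points close together.
majority = [(float(i), 1.0) for i in range(10)]
minority = [(0.0, 0.0), (0.4, 0.0)]

pool = candidate_pool(majority, minority)
requested = 8  # analogous to sampling_strategy={1: 300} in Example 1

# The pool can never exceed len(minority) * n_neighbors, and overlapping
# neighbourhoods shrink it further.
print(f"{len(pool)} candidates for {requested} requested samples")
if requested > len(pool):
    print("second step cannot select enough samples")
```

Under this assumption the second step can only choose from the pool, so a request larger than the pool cannot be satisfied no matter how many samples the class originally had.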

I don't think this is desirable behavior, especially since the documentation presents all three versions of NearMiss as methods that let you specify the number of samples in each class. At the very least, it would be good to explain this in the documentation to save time for people who run into this problem (I lost several hours trying to figure out what was happening).

Steps/Code to Reproduce

Example 1. Undersampling to 300 observations (this doesn't work):

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from imblearn.under_sampling import NearMiss

data = load_breast_cancer()
X = pd.DataFrame(data=data.data, columns=data.feature_names)

# class 1 clearly has more than 300 observations (357)
print(np.unique(data.target, return_counts=True))

# request 300 samples for class 1; this emits the UserWarning shown below
X_smt, y_smt = NearMiss(version=3, sampling_strategy={1: 300}).fit_resample(X, data.target)

Example 2. Undersampling to 50 observations (this works well):

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from imblearn.under_sampling import NearMiss

data = load_breast_cancer()
X = pd.DataFrame(data=data.data, columns=data.feature_names)

# request 50 samples for class 1
X_smt, y_smt = NearMiss(version=3, sampling_strategy={1: 50}).fit_resample(X, data.target)
print(np.unique(y_smt, return_counts=True))  # it worked

Expected Results

In the first example, the resulting dataset (X_smt, y_smt) should have 300 samples for class 1. In the second example, class 1 should have 50 samples.

Actual Results

The code in Example 1 raises:
"UserWarning: The number of the samples to be selected is larger than the number of samples available. The balancing ratio cannot be ensure and all samples will be returned."

The code in Example 2 works well.
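Because Example 1 only emits a warning, the mismatch is easy to miss in a longer pipeline. One possible way to surface it is to escalate `UserWarning` to an exception around the resampling call, using the standard `warnings` module. The sketch below uses a hypothetical stand-in, `resample_stub`, in place of `fit_resample` so that it runs without any third-party package; the numbers passed to it are illustrative only.

```python
import warnings


def resample_stub(n_requested, n_available):
    """Hypothetical stand-in for fit_resample: warns when the request is too large."""
    if n_requested > n_available:
        warnings.warn(
            "The number of the samples to be selected is larger than "
            "the number of samples available.",
            UserWarning,
        )
        return n_available  # everything available is returned
    return n_requested


def strict_resample(n_requested, n_available):
    """Run the stub with UserWarning turned into an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", UserWarning)
        return resample_stub(n_requested, n_available)


try:
    strict_resample(300, 100)  # illustrative numbers, not measured values
    failed = False
except UserWarning as exc:
    failed = True
    print(f"resampling rejected: {exc}")
```

The same `warnings.catch_warnings()` block could wrap the real `fit_resample` call, so that an unsatisfiable dictionary fails loudly instead of silently returning a different number of samples.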

Versions

Linux-5.10.15-200.fc33.x86_64-x86_64-with-glibc2.2.5
Python 3.8.6 (default, Nov 10 2011, 15:00:00)
[GCC 10.2.0]
NumPy 1.19.5
SciPy 1.6.1
Scikit-Learn 0.24.1
Imbalanced-Learn 0.8.0

@glemaitre
Member

This is expected, but we should document it. In any case, we are going to deprecate it because this method is actually not NearMiss 3.

@glemaitre
Member

See #980
