Parameter n_classes in make_classification does not work as expected in extreme cases. #16789

Open
oXwvdrbbj8S4wo9k8lSN opened this issue Mar 28, 2020 · 4 comments · May be fixed by #17052

oXwvdrbbj8S4wo9k8lSN commented Mar 28, 2020

Describe the bug

According to the documentation, n_classes corresponds to the number of classes (or labels) of the classification problem. Therefore, I expected exactly n_classes distinct labels in the generated y. However, in at least one case, this did not work. I am well aware that this combination of parameters is absolutely stupid. I only discovered the behavior in a unit test of a wrapper function that was fed random inputs.

Steps/Code to Reproduce

from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=40,
                           n_features=301,
                           n_informative=147,
                           n_redundant=22,
                           n_repeated=43,
                           n_classes=34,
                           shuffle=False,
                           random_state=527)

print(len(np.unique(y)))

Expected Results

The passed number of classes (n_classes):
34

Actual Results

33

Versions

System:
python: 3.7.6 | packaged by conda-forge | (default, Mar 5 2020, 15:27:18) [GCC 7.3.0]
executable: /opt/conda/bin/python
machine: Linux-4.19.76-linuxkit-x86_64-with-debian-buster-sid

Python dependencies:
pip: 20.0.2
setuptools: 46.0.0.post20200311
sklearn: 0.22.2.post1
numpy: 1.18.1
scipy: 1.4.1
Cython: 0.29.15
pandas: 1.0.3
matplotlib: 3.1.3
joblib: 0.14.1

Built with OpenMP: True

oXwvdrbbj8S4wo9k8lSN changed the title from "Parameter *n_classes* in *make_classification* does not work as expected in extreme cases." to "Parameter n_classes in make_classification does not work as expected in extreme cases." on Mar 28, 2020
rth (Member) commented Mar 30, 2020

Thanks! Well, the code is

y[flip_mask] = generator.randint(n_classes, size=flip_mask.sum())

and randint(34, size=40) indeed does not guarantee that there will be 34 classes. I think the documentation should be updated to reflect that. There is little else we can do while remaining backward compatible.
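
To illustrate the point about randint, here is a minimal sketch that assumes nothing beyond NumPy. It draws 40 labels uniformly from 34 classes many times and compares the number of distinct labels with the closed-form expectation n_classes * (1 - (1 - 1/n_classes) ** n_draws). Note that make_classification only redraws the flipped samples rather than all of them, so this shows why random draws cannot guarantee full class coverage; it does not reproduce the function itself. The variable names and the trial count are illustrative choices.

```python
import numpy as np

n_classes, n_draws, n_trials = 34, 40, 1000  # n_trials is an arbitrary choice
rng = np.random.RandomState(0)

# Number of distinct labels seen in each trial of 40 uniform draws.
distinct_counts = np.array([
    len(np.unique(rng.randint(n_classes, size=n_draws)))
    for _ in range(n_trials)
])

# Expected number of distinct labels: each class is missed with
# probability (1 - 1/n_classes) ** n_draws.
expected_distinct = n_classes * (1 - (1 - 1 / n_classes) ** n_draws)

print("largest count observed:", distinct_counts.max())
print("mean count observed:", distinct_counts.mean())
print("expected count:", round(expected_distinct, 2))
```

The observed mean should sit close to the expected count, which is well below 34 for these settings.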

jnothman (Member) commented Mar 31, 2020 via email

thomasjpfan (Member) commented

Simple cases like the following would also produce something unexpected:

X, y = make_classification(n_samples=40, n_informative=8, n_classes=20,
                           random_state=0, flip_y=0.5)
len(np.unique(y))
# 16

How should we move forward? Open a PR tagged 1.0 and merge when the time comes?

tianchuliang (Contributor) commented

@oXwvdrbbj8S4wo9k8lSN @thomasjpfan @rth This was merged into master in Oct 2019: 084a351

So it seems this is the expected behavior, given that the flip_y parameter defaults to 0.01, i.e. the fraction of samples whose class is assigned randomly. The higher this value, the more likely you are to see an unbalanced or unexpected label distribution.
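
One way to check the role of flip_y is to rerun the reproduction from the top of this issue with flip_y=0.0, so that no labels are reassigned at random. This is a sketch, not a documented guarantee: as far as I can tell from the label-assignment code, every class receives at least one sample when n_samples >= n_classes and the default balanced weights are used, but that relies on current implementation details. The helper count_classes is hypothetical and exists only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification


def count_classes(flip_y):
    """Return the number of distinct labels for the issue's parameters."""
    _, y = make_classification(
        n_samples=40,
        n_features=301,
        n_informative=147,
        n_redundant=22,
        n_repeated=43,
        n_classes=34,
        shuffle=False,
        flip_y=flip_y,
        random_state=527,
    )
    return len(np.unique(y))


n_unique_default = count_classes(flip_y=0.01)  # the default value
n_unique_no_flip = count_classes(flip_y=0.0)   # label flipping disabled

print("flip_y=0.01:", n_unique_default)
print("flip_y=0.0: ", n_unique_no_flip)
```

If the class count only drops when flip_y is nonzero, that supports the explanation above: the missing class comes from the random relabeling step rather than from the initial assignment.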
