Parameter n_classes in make_classification does not work as expected in extreme cases. #16789

oXwvdrbbj8S4wo9k8lSN · 2020-03-28T19:08:28Z

Describe the bug

According to the documentation, n_classes corresponds to the number of classes (or labels) of the classification problem. Therefore, I expected exactly n classes in the generated y. However, in at least one case, this did not work. I am well aware that the combination of parameters is absolutely stupid. This behavior was only discovered in a unit test of a wrapper function that was fed with random inputs.

Steps/Code to Reproduce

from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=40,
                           n_features=301,
                           n_informative=147,
                           n_redundant=22,
                           n_repeated=43,
                           n_classes=34,
                           shuffle=False,
                           random_state=527)

print(len(np.unique(y)))

Expected Results

The passed number of classes (n_classes):
34

Actual Results

33

Versions

System:
python: 3.7.6 | packaged by conda-forge | (default, Mar 5 2020, 15:27:18) [GCC 7.3.0]
executable: /opt/conda/bin/python
machine: Linux-4.19.76-linuxkit-x86_64-with-debian-buster-sid

Python dependencies:
pip: 20.0.2
setuptools: 46.0.0.post20200311
sklearn: 0.22.2.post1
numpy: 1.18.1
scipy: 1.4.1
Cython: 0.29.15
pandas: 1.0.3
matplotlib: 3.1.3
joblib: 0.14.1

Built with OpenMP: True


rth · 2020-03-30T09:41:43Z

Thanks! Well the code is

scikit-learn/sklearn/datasets/_samples_generator.py

Line 240 in 95d4f08

y[flip_mask] = generator.randint(n_classes, size=flip_mask.sum())

and randint(34, size=40) indeed doesn't guarantee that all 34 classes will appear. I think the documentation should be updated to reflect that. There is little else we can do (while remaining backward compatible).
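To illustrate the point about randint, here is a minimal sketch that mimics only the label-flipping step, using NumPy alone. The helper name count_classes_after_flip, the round-robin starting labels, and the way flip_mask is built are assumptions made for illustration. Only the randint line is taken from the code quoted above.

```python
import numpy as np


def count_classes_after_flip(n_samples, n_classes, flip_y, seed):
    """Hypothetical helper: apply the label-flipping step in isolation."""
    rng = np.random.RandomState(seed)
    # Assumed starting point: round-robin labels, so every class occurs
    # at least once before any flipping happens.
    y = np.arange(n_samples) % n_classes
    # Assumed mask construction: each sample is flipped with probability flip_y.
    flip_mask = rng.uniform(size=n_samples) < flip_y
    # Same call shape as the quoted line: labels are drawn uniformly with
    # replacement, so a class that had a single sample can vanish.
    y[flip_mask] = rng.randint(n_classes, size=flip_mask.sum())
    return len(np.unique(y))


# Count the surviving classes over many seeds for 40 samples and 34 classes.
counts = [count_classes_after_flip(40, 34, 0.5, seed) for seed in range(200)]
print(min(counts), max(counts))
```

Under these assumptions the count can never exceed n_classes, and it drops below it whenever the only sample of a class is flipped and no other flipped sample happens to draw that class.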

jnothman · 2020-03-31T20:16:06Z

Fix in version 1.0?

thomasjpfan · 2020-04-05T20:32:31Z

Simple cases like the following would also produce something unexpected:

X, y = make_classification(n_samples=40, n_informative=8, n_classes=20,
                           random_state=0, flip_y=0.5)
len(np.unique(y))
# 16

How should we move forward? Open a PR tagged 1.0 and merge when the time comes?

tianchuliang · 2020-04-26T17:52:06Z

@oXwvdrbbj8S4wo9k8lSN @thomasjpfan @rth This was merged in master Oct 2019 084a351

So it seems this is the expected behavior, given that flip_y defaults to 0.01, i.e. the fraction of samples whose class is assigned randomly. The higher this value, the more likely you are to see an unbalanced or unexpected label distribution.
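If flip_y is indeed what removes classes, setting it to 0 should keep them all whenever every class receives at least one sample. The sketch below reuses the parameters from the earlier example in this thread and changes only flip_y. It is a sanity check under that assumption, not a statement about which behavior the library guarantees.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same parameters as the example above, but with label flipping disabled.
X, y = make_classification(n_samples=40, n_informative=8, n_classes=20,
                           random_state=0, flip_y=0)

# Number of distinct labels actually present in y.
n_unique = len(np.unique(y))
print(n_unique)
```

With the default n_clusters_per_class=2 there are 40 clusters for 40 samples, so each cluster, and therefore each class, gets a sample before any flipping is applied.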

oXwvdrbbj8S4wo9k8lSN added the Bug: triage label Mar 28, 2020

oXwvdrbbj8S4wo9k8lSN changed the title ~~Parameter *n_classes* in *make_classification* does not work as expected in extreme cases.~~ Parameter n_classes in make_classification does not work as expected in extreme cases. Mar 28, 2020

rth added Documentation and removed Bug: triage labels Mar 30, 2020

tianchuliang mentioned this issue Apr 26, 2020

DOC add detail about flip_y parameter in make_classification #17049

Merged

thomasjpfan linked a pull request Apr 26, 2020 that will close this issue

ENH Adds permute_y to make_classification interaction #17052

Open
