make_classification fails for 31 informative features #8159

Closed
mikebenfield opened this Issue Jan 5, 2017 · 3 comments

Contributor

mikebenfield commented Jan 5, 2017

Description

I get an exception from make_classification when I try to use 31 or more features, all of which are informative. It's possible I'm using the function wrong, but if so, please clarify in the documentation or provide a more informative error message.

Steps/Code to Reproduce

~ $ cat example.py 
from sklearn.datasets import make_classification
X, y = make_classification(
    n_samples=4000,
    n_features=31,
    n_informative=31,
    n_repeated=0,
    n_redundant=0,
)
~ $ python example.py 
Traceback (most recent call last):
  File "example.py", line 7, in <module>
    n_redundant=0,
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/datasets/samples_generator.py", line 186, in make_classification
    generator).astype(float)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/datasets/samples_generator.py", line 29, in _generate_hypercube
    return np.hstack([_generate_hypercube(samples, dimensions - 30, rng),
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/datasets/samples_generator.py", line 32, in _generate_hypercube
    random_state=rng),
  File "sklearn/utils/_random.pyx", line 226, in sklearn.utils._random.sample_without_replacement (sklearn/utils/_random.c:4007)
  File "sklearn/utils/_random.pyx", line 279, in sklearn.utils._random.sample_without_replacement (sklearn/utils/_random.c:3464)
  File "sklearn/utils/_random.pyx", line 35, in sklearn.utils._random._sample_without_replacement_check_input (sklearn/utils/_random.c:1719)
ValueError: n_population should be greater or equal than n_samples, got n_samples > n_population (4 > 2)

Versions

Darwin-15.4.0-x86_64-i386-64bit
Python 3.5.2 (default, Oct 11 2016, 15:01:29)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18.1

Contributor

glemaitre commented Jan 5, 2017

_generate_hypercube is called such that the polytope does not exceed
30 dimensions.

Therefore, in the case of n_features=31 and n_informative=31, the check
2 ** n_informative < n_classes * n_clusters_per_class is evaluated on the
full n_informative and does not catch the problem: we will have
n_informative=1 for the first polytope and n_informative=30 for the
second one.

There is probably a need for a modulo operation and a good unit
test?

n_samples in _generate_hypercube corresponds to n_classes * n_clusters_per_class.
However, the generation
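
To make the split described above concrete, here is a minimal sketch of the arithmetic only, not the scikit-learn code. The helper names recursion_chunks and failing_chunks are hypothetical. The 30-dimension limit is taken from the traceback, and the defaults n_classes=2 and n_clusters_per_class=2 are assumed to match those of make_classification (which is consistent with the "4 > 2" in the error message).

```python
def recursion_chunks(dimensions, max_dims=30):
    """Hypothetical helper: sizes of the polytopes produced when a hypercube
    of `dimensions` is split by peeling off `max_dims` dimensions at a time."""
    if dimensions > max_dims:
        return recursion_chunks(dimensions - max_dims, max_dims) + [max_dims]
    return [dimensions]


def failing_chunks(n_informative, n_classes=2, n_clusters_per_class=2):
    """Hypothetical helper: chunk sizes that have fewer vertices (2 ** d)
    than the number of distinct cluster centroids requested."""
    n_clusters = n_classes * n_clusters_per_class
    return [d for d in recursion_chunks(n_informative) if 2 ** d < n_clusters]


print(recursion_chunks(31))  # [1, 30]
print(failing_chunks(31))    # [1]
print(failing_chunks(30))    # []
```

With 31 informative features the first chunk has a single dimension, so only 2 vertices are available for 4 clusters, which matches the ValueError in the report.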

Contributor

devanshdalal commented Jan 7, 2017

@glemaitre, I am working on it.

Member

jnothman commented Jan 8, 2017

I think _generate_hypercube(samples, dimensions - 30, rng) can just be replaced with rng.randint(2, size=(samples, dimensions - 30)). The samples are ensured to be distinct in those first 30 dimensions anyway (under the assumption that 2 ** 30 >> samples).
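
The suggestion above could look roughly like the following. This is a sketch under stated assumptions, not the patch that was merged: generate_hypercube_sketch and distinct_codes are hypothetical names, the distinct sampling is done with a simple rejection loop in plain NumPy instead of scikit-learn's sample_without_replacement, and rng is assumed to be a numpy.random.RandomState.

```python
import numpy as np


def distinct_codes(samples, dimensions, rng):
    """Hypothetical helper: draw `samples` distinct integers in
    [0, 2 ** dimensions) by rejecting duplicates."""
    if samples > 2 ** dimensions:
        raise ValueError("not enough hypercube vertices for distinct samples")
    seen = set()
    codes = []
    while len(codes) < samples:
        code = int(rng.randint(2 ** dimensions))
        if code not in seen:
            seen.add(code)
            codes.append(code)
    return np.array(codes, dtype=np.int64)


def generate_hypercube_sketch(samples, dimensions, rng):
    """Return `samples` distinct binary rows of length `dimensions`."""
    if dimensions > 30:
        # Extra dimensions are filled with independent random bits; rows are
        # already distinct in the last 30 dimensions, so no check is needed.
        extra = rng.randint(2, size=(samples, dimensions - 30))
        return np.hstack([extra, generate_hypercube_sketch(samples, 30, rng)])
    codes = distinct_codes(samples, dimensions, rng)
    # Expand each integer code into its binary digits, most significant first.
    shifts = np.arange(dimensions - 1, -1, -1)
    return (codes[:, None] >> shifts) & 1


rng = np.random.RandomState(0)
centroids = generate_hypercube_sketch(4, 31, rng)
print(centroids.shape)  # (4, 31)
```

Because only the last 30 columns have to be distinct, the extra columns never trigger the "n_samples > n_population" check, whatever the number of informative features.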
