make_classification fails for 31 informative features #8159

Closed

mikebenfield opened this issue Jan 5, 2017 · 3 comments

Comments

mikebenfield (Contributor) commented Jan 5, 2017

Description

I get an exception from make_classification when I try to use 31 or more features, all of which are informative. It's possible I'm using the function incorrectly, but if so, please clarify this in the documentation or provide a more informative error message.

Steps/Code to Reproduce

~ $ cat example.py 
from sklearn.datasets import make_classification
X, y = make_classification(
    n_samples=4000,
    n_features=31,
    n_informative=31,
    n_repeated=0,
    n_redundant=0,
)
~ $ python example.py 
Traceback (most recent call last):
  File "example.py", line 7, in <module>
    n_redundant=0,
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/datasets/samples_generator.py", line 186, in make_classification
    generator).astype(float)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/datasets/samples_generator.py", line 29, in _generate_hypercube
    return np.hstack([_generate_hypercube(samples, dimensions - 30, rng),
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/datasets/samples_generator.py", line 32, in _generate_hypercube
    random_state=rng),
  File "sklearn/utils/_random.pyx", line 226, in sklearn.utils._random.sample_without_replacement (sklearn/utils/_random.c:4007)
  File "sklearn/utils/_random.pyx", line 279, in sklearn.utils._random.sample_without_replacement (sklearn/utils/_random.c:3464)
  File "sklearn/utils/_random.pyx", line 35, in sklearn.utils._random._sample_without_replacement_check_input (sklearn/utils/_random.c:1719)
ValueError: n_population should be greater or equal than n_samples, got n_samples > n_population (4 > 2)

Versions

Darwin-15.4.0-x86_64-i386-64bit
Python 3.5.2 (default, Oct 11 2016, 15:01:29)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18.1

glemaitre (Contributor) commented Jan 5, 2017

_generate_hypercube is called such that the polytope does not exceed 30 dimensions.

Therefore, in the case of n_features=31 and n_informative=31, the guard 2 ** n_informative < n_classes * n_clusters_per_class does not trigger, yet the recursion effectively uses n_informative=1 for the first polytope and n_informative=30 for the second one.

There is probably a need for a modulo operation and a good unit test?

n_samples in _generate_hypercube corresponds to n_classes * n_clusters_per_class.
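
For reference, here is a minimal sketch of the failing path, paraphrased from the traceback and the 0.18.1 samples_generator.py (simplified, not the exact library source). With the defaults n_classes=2 and n_clusters_per_class=2, the recursion asks the leftover 1-dimensional hypercube for 4 distinct vertices, which is where the "4 > 2" error comes from:

import numpy as np
from sklearn.utils.random import sample_without_replacement

def generate_hypercube_sketch(samples, dimensions, rng):
    # Distinct binary vertices of a `dimensions`-dimensional hypercube.
    if dimensions > 30:
        # 0.18.1 recurses on the leftover dimensions (31 - 30 = 1 here)
        # before recursing on the remaining 30.
        return np.hstack([generate_hypercube_sketch(samples, dimensions - 30, rng),
                          generate_hypercube_sketch(samples, 30, rng)])
    # sample_without_replacement needs 2 ** dimensions >= samples; with
    # dimensions=1 and samples=4 it raises "n_samples > n_population (4 > 2)".
    out = sample_without_replacement(2 ** dimensions, samples,
                                     random_state=rng).astype('>u4')
    return np.unpackbits(out.view('>u1')).reshape((-1, 32))[:, -dimensions:]

rng = np.random.RandomState(0)
generate_hypercube_sketch(4, 31, rng)  # reproduces the ValueError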

devanshdalal (Contributor) commented Jan 7, 2017

@glemaitre, I am working on it.

jnothman (Member) commented Jan 8, 2017

I think _generate_hypercube(samples, dimensions - 30, rng) can just be replaced with rng.randint(2, size=(samples, dimensions - 30)). The samples are ensured to be distinct in those first 30 dimensions anyway (under the assumption that 2 ** 30 >> samples).
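
A sketch of what that suggestion would look like inside _generate_hypercube (assuming the 0.18.1 structure shown above; this illustrates the idea, not necessarily the patch that was eventually merged):

    if dimensions > 30:
        # The extra dimensions beyond 30 only need random bits; the recursion
        # on the first 30 dimensions already makes the rows distinct, as long
        # as 2 ** 30 is much larger than `samples`.
        return np.hstack([rng.randint(2, size=(samples, dimensions - 30)),
                          _generate_hypercube(samples, 30, rng)])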
