Should train_test_split warn or error out on single sample array? #11028
Comments
Thanks for raising this issue! I think it would make sense to be consistent with
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> x = np.random.RandomState().rand(0, 100)
>>> x
array([], shape=(0, 100), dtype=float64)
>>> np.split(x, 2)
[array([], shape=(0, 100), dtype=float64), array([], shape=(0, 100), dtype=float64)]
>>> train_test_split(x)
[array([], shape=(0, 100), dtype=float64), array([], shape=(0, 100), dtype=float64)]
The current behavior is consistent with it. I agree that it's somewhat unintuitive, but so is any operation on an empty array (or in this case an input that would produce an empty array), e.g.
>>> x.dot(x.T)
array([], shape=(0, 0), dtype=float64)
scikit-learn just follows numpy conventions on this, so I'm going to close this. Please comment if you disagree. |
Yes, the repartition between train and test is not satisfactory in this case, but raising an error/warning for 1 input sample, and then not raising it for 0 samples to be consistent with numpy, is also not very satisfactory. I can't think of a reason why one would expect a meaningful train / test split with 1 sample, and I don't think we should spend time fixing this. |
@vivekk-ezdi, could you explain how you landed up performing |
@jnothman It was a beginner question on Stack Overflow, in which the user, I think by mistake, reshaped the data to have one sample. I tested and was surprised that |
In that case, I basically agree with @rth. Other utilities throw errors in
this case because sometimes random sampling has created a smaller dataset
than you realised, and similar cases. Even for people who aren't beginners.
We can't really fix beginner errors like that in a sustainable way.
|
I encountered the same thing in an estimator that does early-stopping. I'm using When I run I was surprised as well that |
So are you proposing a warning from train_test_split, Nicolas?
|
My feeling is that Both |
I'm happy with an error if train or test is empty.
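A minimal sketch of the kind of check being discussed here, not scikit-learn's actual implementation. The helper name `split_sizes` and the rounding rule (test count rounded up, remainder assigned to train) are assumptions made for illustration; under that rule a single sample always ends up in the test set, whatever fractional `test_size` is given, which matches the behaviour reported in this issue.

```python
import math


def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size.

    Hypothetical helper. Assumption: the test count is rounded up and
    the remaining samples go to the train set.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a float strictly between 0 and 1")
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    # Refuse to hand back a split in which either side is empty.
    if n_train == 0 or n_test == 0:
        raise ValueError(
            "With n_samples=%d and test_size=%r, the train set would have "
            "%d samples and the test set %d; both must be non-empty."
            % (n_samples, test_size, n_train, n_test)
        )
    return n_train, n_test
```

With this check, `split_sizes(100, 0.25)` returns `(75, 25)`, while `split_sizes(1, ts)` raises `ValueError` for any `ts` in (0, 1), and so does `split_sizes(0, ts)`.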
|
Ok I'll submit a PR. Should we do the checks upstream in the CV-iterators? |
I'm okay to do this in the cv splitters as long as we don't force this
constraint on custom splitters
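One way to read this constraint, sketched with plain Python classes rather than scikit-learn's real splitter hierarchy: the emptiness check lives inside the built-in splitter's own `split` method, so a user-defined splitter that implements `split` itself is never subject to it. Both class names below are hypothetical, and the rounding rule is an assumption.

```python
import math
import random


class ShuffleSplitSketch:
    """Hypothetical built-in splitter that validates its own split sizes."""

    def __init__(self, test_size=0.25, seed=0):
        self.test_size = test_size
        self.seed = seed

    def split(self, n_samples):
        # Assumption: test count rounded up, remainder goes to train.
        n_test = math.ceil(self.test_size * n_samples)
        n_train = n_samples - n_test
        # Because split() is a generator, this error surfaces on the
        # first iteration, not when split() is called.
        if n_train == 0 or n_test == 0:
            raise ValueError(
                "train (%d) or test (%d) set would be empty for "
                "n_samples=%d" % (n_train, n_test, n_samples)
            )
        indices = list(range(n_samples))
        random.Random(self.seed).shuffle(indices)
        yield indices[n_test:], indices[:n_test]


class EverythingInTestSplitter:
    """Hypothetical custom splitter: deliberately yields an empty train set."""

    def split(self, n_samples):
        # No check is imposed here; the author of a custom splitter decides.
        yield [], list(range(n_samples))
```

Iterating `ShuffleSplitSketch().split(1)` raises `ValueError`, whereas `EverythingInTestSplitter().split(1)` still yields its empty train set without complaint.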
|
Description
train_test_split splits single-sample data such that the train part has 0 samples and the test part has that sample. This behaviour is not affected by setting test_size to any value.
Steps/Code to Reproduce
Expected Results
I am not sure of the expected results, as this seems like an unintended usage. But I still think that at least a warning (if not an error) should be given when splitting.
Versions
Linux-3.16.0-77-generic-x86_64-with-Ubuntu-14.04-trusty
('Python', '2.7.6 (default, Nov 23 2017, 15:49:48) \n[GCC 4.8.4]')
('NumPy', '1.14.2')
('SciPy', '1.0.1')
('Scikit-Learn', '0.19.1')
I am sorry if it's a duplicate. I tried searching for similar issues but could not find any (even though I thought that this would have been discussed somewhere).