
Should train_test_split warn or error out on single sample array? #11028

Closed
vivekk0903 opened this issue Apr 25, 2018 · 11 comments · Fixed by #12861
Comments

@vivekk0903
Contributor

Description

train_test_split splits the single-sample data such that the train part has 0 samples and the test part has that one sample. This behaviour is also not affected by setting test_size to any value.

Steps/Code to Reproduce

import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.normal(0, 1, [1, 100])
print(data.shape)
#Output:  (1, 100)

data_train, data_test = train_test_split(data)
print(data_train.shape, data_test.shape)
#Output:  ((0, 100), (1, 100))

Expected Results

I am not sure of the expected results, as this seems like unintended usage. But I still think that at least a warning (if not an error) should be given when splitting.
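For context, the empty train set follows directly from the default split arithmetic: the test set gets the ceiling of test_size * n_samples and the train set gets the remainder. A minimal sketch of that arithmetic (a hypothetical helper, not scikit-learn's actual code):

```python
import math

def split_sizes(n_samples, test_size=0.25):
    # Sketch of the default split arithmetic: the test set gets the
    # ceiling of test_size * n_samples, the train set the remainder.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

print(split_sizes(1))    # (0, 1): the train set is empty
print(split_sizes(100))  # (75, 25)
```

With a single sample, ceil(test_size * 1) is 1 for any test_size in (0, 1], so the train set stays empty no matter what test_size is set to, which matches the report above.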

Versions

Linux-3.16.0-77-generic-x86_64-with-Ubuntu-14.04-trusty
('Python', '2.7.6 (default, Nov 23 2017, 15:49:48) \n[GCC 4.8.4]')
('NumPy', '1.14.2')
('SciPy', '1.0.1')
('Scikit-Learn', '0.19.1')

I am sorry if this is a duplicate. I tried searching for similar issues but could not find any (even though I thought this would have been discussed somewhere).

@rth
Member

rth commented Apr 25, 2018

Thanks for raising this issue!

I think it would make sense to be consistent with numpy.split here:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> x = np.random.RandomState().rand(0, 100)
>>> x
array([], shape=(0, 100), dtype=float64)
>>> np.split(x, 2)
[array([], shape=(0, 100), dtype=float64), array([], shape=(0, 100), dtype=float64)]
>>> train_test_split(x)
[array([], shape=(0, 100), dtype=float64), array([], shape=(0, 100), dtype=float64)]

The current behavior is consistent with it.

I agree that it's somewhat unintuitive, but so is any operation on an empty array (or, in this case, an input that would produce an empty array), e.g.

>>> x.dot(x.T)
array([], shape=(0, 0), dtype=float64)

scikit-learn just follows numpy conventions on this, so I'm going to close this. Please comment if you disagree.

@rth rth closed this as completed Apr 25, 2018
@rth
Member

rth commented Apr 25, 2018

splits the single sample data such that train part has 0 samples and test has that sample. Also this behaviour is not affected by setting the test_size to any value.

Yes, the repartition between train and test is not satisfactory in this case, but raising an error/warning for 1 input sample, and then not raising it for 0 samples (to stay consistent with numpy), is also not very satisfactory. I can't think of a reason why one would expect a meaningful train/test split with 1 sample, and I don't think we should spend time fixing this.

@jnothman
Member

@vivekk-ezdi, could you explain how you ended up performing train_test_split on a single sample?

@vivekk0903
Contributor Author

@jnothman It was a beginner question on Stack Overflow, in which the user, I think by mistake, reshaped the data to have one sample.

I tested it and was surprised that train_test_split doesn't complain about it. Since most other scikit-learn utilities warn or throw an error in this case, I just wanted to know the rationale behind it.

@jnothman
Member

jnothman commented Apr 25, 2018 via email

@NicolasHug
Member

NicolasHug commented Dec 6, 2018

could you explain how you landed up performing train_test_split on a single sample?

I encountered the same thing in an estimator that does early-stopping. I'm using train_test_split to hold out some validation data.

When I run check_estimator on my estimator, it fails on check_fit2d_1sample: this test passes only one sample, and the training set becomes empty.

I was surprised as well that train_test_split doesn't complain. Basically that means that any estimator that internally uses train_test_split without checking the output cannot pass the check_estimator suite.

@jnothman
Member

jnothman commented Dec 9, 2018 via email

@NicolasHug
Member

My feeling is that train_test_split should raise an error when one of the outputs is empty. It makes sense for numpy not to raise an error because np.split is very general, but in the context of scikit-learn we're dealing with train/test data, and I can't think of a scenario where it's OK for one of these to be empty (except if you don't want to test at all, but then you just don't call train_test_split).

Both check_array and check_X_y default ensure_min_samples to 1, which is some indication that, in general, no train/test set should be empty. The only occurrence of ensure_min_samples=0 in the scikit-learn code base is in DummyRegressor when strategy='constant': the prediction is a constant specified by the user.
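As a rough illustration (a simplified stand-in, not the actual check_array implementation), the ensure_min_samples check amounts to:

```python
import numpy as np

def check_min_samples(X, ensure_min_samples=1):
    # Simplified version of the row-count validation performed by
    # check_array / check_X_y: reject arrays with too few samples.
    n_samples = X.shape[0]
    if n_samples < ensure_min_samples:
        raise ValueError(
            "Found array with %d sample(s) while a minimum of %d "
            "is required." % (n_samples, ensure_min_samples)
        )
    return X

check_min_samples(np.empty((1, 100)))  # passes with the default of 1
```

An empty array such as np.empty((0, 100)) would fail this check with a ValueError, whereas train_test_split currently accepts it silently.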

@jnothman
Member

jnothman commented Dec 10, 2018 via email

@NicolasHug
Member

Ok, I'll submit a PR.

Should we do the checks upstream in the CV iterators? train_test_split uses ShuffleSplit, and some other iterators like LeaveOneOut may also return empty train sets.
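One way to do the upstream check (a sketch of the idea only, not the code that the linked PR eventually merged) is to validate the resolved fold sizes where ShuffleSplit computes them, so that both train_test_split and the CV iterator get the error for free:

```python
import math

def resolve_split_sizes(n_samples, test_size=0.25, train_size=None):
    # Hypothetical shared validation: compute the train/test sizes
    # and refuse to produce an empty set on either side.
    n_test = math.ceil(test_size * n_samples)
    n_train = (math.floor(train_size * n_samples)
               if train_size is not None
               else n_samples - n_test)
    if n_train == 0 or n_test == 0:
        raise ValueError(
            "With n_samples=%d, test_size=%r and train_size=%r, the "
            "resulting train or test set will be empty."
            % (n_samples, test_size, train_size)
        )
    return n_train, n_test
```

Iterators like LeaveOneOut do not go through this code path, so they would still need their own guard.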

@jnothman
Member

jnothman commented Dec 11, 2018 via email
