
Should train_test_split warn or error out on single sample array? #11028

Closed
vivekk0903 opened this issue Apr 25, 2018 · 11 comments · Fixed by #12861
Comments

@vivekk0903
Contributor

Description

train_test_split splits the single-sample data such that the train part has 0 samples and the test part has that one sample. This behaviour is also not affected by setting test_size to any value.

Steps/Code to Reproduce

import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.normal(0, 1, [1, 100])
print(data.shape)
#Output:  (1, 100)

data_train, data_test = train_test_split(data)
print(data_train.shape, data_test.shape)
#Output:  ((0, 100), (1, 100))

Expected Results

I am not sure of the expected results, as this seems like unintended usage. But I still think that at least a warning (if not an error) should be given when splitting.
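For context, the empty train set follows directly from the default split arithmetic: the test set gets the ceiling of test_size * n_samples and the train set gets the remainder. A minimal sketch of that arithmetic (a hypothetical helper, not scikit-learn's actual code):

```python
import math

def split_sizes(n_samples, test_size=0.25):
    # Sketch of the default split arithmetic: the test set gets the
    # ceiling of test_size * n_samples, the train set the remainder.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

print(split_sizes(1))    # (0, 1): the train set is empty
print(split_sizes(100))  # (75, 25)
```

With a single sample, ceil(test_size * 1) is 1 for any test_size in (0, 1], so the train set stays empty no matter what test_size is set to, which matches the report above.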

Versions

Linux-3.16.0-77-generic-x86_64-with-Ubuntu-14.04-trusty
('Python', '2.7.6 (default, Nov 23 2017, 15:49:48) \n[GCC 4.8.4]')
('NumPy', '1.14.2')
('SciPy', '1.0.1')
('Scikit-Learn', '0.19.1')

I am sorry if this is a duplicate. I tried searching for similar issues but could not find any (even though I thought this would have been discussed somewhere).

@rth
Member

rth commented Apr 25, 2018

Thanks for raising this issue!

I think it would make sense to be consistent with numpy.split here:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> x = np.random.RandomState().rand(0, 100)
>>> x
array([], shape=(0, 100), dtype=float64)
>>> np.split(x, 2)
[array([], shape=(0, 100), dtype=float64), array([], shape=(0, 100), dtype=float64)]
>>> train_test_split(x)
[array([], shape=(0, 100), dtype=float64), array([], shape=(0, 100), dtype=float64)]

The current behavior is consistent with it.

I agree that it's somewhat unintuitive, but so is any operation on an empty array (or, in this case, an input that would produce an empty array), e.g.

>>> x.dot(x.T)
array([], shape=(0, 0), dtype=float64)

scikit-learn just follows numpy conventions on this, so I'm going to close this. Please comment if you disagree.

@rth rth closed this as completed Apr 25, 2018
@rth
Member

rth commented Apr 25, 2018

splits the single sample data such that train part has 0 samples and test has that sample. Also this behaviour is not affected by setting the test_size to any value.

Yes, the repartition between train and test is not satisfactory in this case, but raising an error/warning for 1 input sample, and then not raising it for 0 samples (to stay consistent with numpy), is also not very satisfactory. I can't think of a reason why one would expect a meaningful train/test split with 1 sample, and I don't think we should spend time fixing this.

@jnothman
Member

@vivekk-ezdi, could you explain how you ended up performing train_test_split on a single sample?

@vivekk0903
Contributor Author

@jnothman It was a beginner question on Stack Overflow, in which the user, I think by mistake, reshaped the data to have one sample.

I tested it and was surprised that train_test_split doesn't complain about it. Since most other scikit-learn utilities warn or throw an error in this case, I just wanted to know the rationale behind it.

@jnothman
Member

jnothman commented Apr 25, 2018 via email

@NicolasHug
Member

NicolasHug commented Dec 6, 2018

could you explain how you landed up performing train_test_split on a single sample?

I encountered the same thing in an estimator that does early-stopping. I'm using train_test_split to hold out some validation data.

When I run check_estimator on my estimator, it fails on check_fit2d_1sample: this test passes only one sample, and the training set becomes empty.

I was surprised as well that train_test_split doesn't complain. Basically that means that any estimator that internally uses train_test_split without checking the output cannot pass the check_estimator suite.

@jnothman
Member

jnothman commented Dec 9, 2018 via email

@NicolasHug
Member

My feeling is that train_test_split should raise an error when one of the outputs is empty. It makes sense for numpy not to raise an error because np.split is very general, but in the context of scikit-learn we're dealing with train/test data, and I can't think of a scenario where it's OK for one of these to be empty (except if you don't want to test at all, but then you just don't call train_test_split).

Both check_array and check_X_y default ensure_min_samples to 1, which is some indication that, in general, no train/test set should be empty. The only occurrence of ensure_min_samples=0 in the scikit-learn code base is in DummyRegressor when strategy='constant': the prediction is a constant specified by the user.
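As a rough illustration (a simplified stand-in, not the actual check_array implementation), the ensure_min_samples check amounts to:

```python
import numpy as np

def check_min_samples(X, ensure_min_samples=1):
    # Simplified version of the row-count validation performed by
    # check_array / check_X_y: reject arrays with too few samples.
    n_samples = X.shape[0]
    if n_samples < ensure_min_samples:
        raise ValueError(
            "Found array with %d sample(s) while a minimum of %d "
            "is required." % (n_samples, ensure_min_samples)
        )
    return X

check_min_samples(np.empty((1, 100)))  # passes with the default of 1
```

An empty array such as np.empty((0, 100)) would fail this check with a ValueError, whereas train_test_split currently accepts it silently.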

@jnothman
Member

jnothman commented Dec 10, 2018 via email

@NicolasHug
Member

Ok, I'll submit a PR.

Should we do the checks upstream in the CV iterators? train_test_split uses ShuffleSplit, and some other iterators like LeaveOneOut may also return empty train sets.
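One way to do the upstream check (a sketch of the idea only, not the code that the linked PR eventually merged) is to validate the resolved fold sizes where ShuffleSplit computes them, so that both train_test_split and the CV iterator get the error for free:

```python
import math

def resolve_split_sizes(n_samples, test_size=0.25, train_size=None):
    # Hypothetical shared validation: compute the train/test sizes
    # and refuse to produce an empty set on either side.
    n_test = math.ceil(test_size * n_samples)
    n_train = (math.floor(train_size * n_samples)
               if train_size is not None
               else n_samples - n_test)
    if n_train == 0 or n_test == 0:
        raise ValueError(
            "With n_samples=%d, test_size=%r and train_size=%r, the "
            "resulting train or test set will be empty."
            % (n_samples, test_size, train_size)
        )
    return n_train, n_test
```

Iterators like LeaveOneOut do not go through this code path, so they would still need their own guard.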

@jnothman
Member

jnothman commented Dec 11, 2018 via email
