Documentation of iterative_train_test_split incomplete #160

edufonseca · 2019-03-13T19:18:13Z

iterative_train_test_split is briefly documented here (at the bottom), but the input params X, y are not explained. I tried passing y as a list of lists, encoding the labels as categorical integers, e.g.

[[2], [0,3], [1], [0,2,3]]

but it crashed.

By debugging the example provided here, X, y turn out to be scipy.sparse.lil_matrix. Is this the only format allowed?

Any indication on the possible formats for X, y in iterative_train_test_split? Thanks


AlexMRuch · 2020-05-23T22:20:16Z

I'm having this issue as well. I've tried converting my inputs to a list of lists, a np.array of lists, a np.array of np.arrays, etc.

I can only get the function to work with the test example, which also works with dense (non-sparse) matrices:

from skmultilearn.model_selection.iterative_stratification import iterative_train_test_split
from skmultilearn.dataset import load_dataset

X,y, _, _ = load_dataset('scene', 'undivided')

X_train, y_train, X_test, y_test = iterative_train_test_split(
    X.A,
    y.A,
    test_size = 0.2
)

^^^ This works fine for me

In this case, we have

print(type(X.A))
print(X.A.shape)
X.A

Returns

<class 'numpy.ndarray'>
(2407, 294)
array([[0.646467, 0.666435, 0.685047, ..., 0.247298, 0.014025, 0.029709],
       [0.770156, 0.767255, 0.761053, ..., 0.137833, 0.082672, 0.03632 ],
       [0.793984, 0.772096, 0.76182 , ..., 0.051125, 0.112506, 0.083924],
       ...,
       [0.952281, 0.944987, 0.905556, ..., 0.0319  , 0.017547, 0.019734],
       [0.88399 , 0.899004, 0.901019, ..., 0.256158, 0.226332, 0.22307 ],
       [0.974915, 0.866425, 0.818144, ..., 0.005131, 0.025059, 0.004033]])

And

print(type(y.A))
print(y.A.shape)
y.A

Returns

<class 'numpy.ndarray'>
(2407, 6)
array([[1, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1]])

However, with my own data,

print(type(df_train["text"].values))
print(df_train["text"].values.shape)
df_train["text"].values

Which returns

<class 'numpy.ndarray'>
(23455,)
array(['Wholeheartedly support these protests &amp; acts of civil disobedience &amp; will join when I can! #Ferguson #AllLivesMatter http://t.co/D8Phc8UakE',
       'This Sandra Bland situation man no disrespect rest her soul , but people die everyday in a unjustified matter #AllLivesMatter',
       'Commitment to peace, healing and loving neighbors. Give us strength and patience. #PortlandPride #AllLivesMatter #Peace',
       ...,
       'After losing the election to 2 unisex names, maybe it is time for the GOP to support Marriage Equality and Civil Unions. #Sandy #Christie',
       '@FoxNews:Price gouging, looting and rage: #Sandy crimes stories grow http://t.co/zL3iI, Good Luck with their Gun Control Laws and 0 cops!',
       "Might devastated #Sandy victims lose the oppurtunity to vote, thus having their rights violated? Looting their vote. It shouldn't happen."],
      dtype=object)

And

print(type(df_train["labels"].values))
print(df_train["labels"].values.shape)
df_train["labels"].values

Which returns

<class 'numpy.ndarray'>
(23455,)
array([list([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]),
       list([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]),
       list([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]), ...,
       list([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]),
       list([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]),
       list([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])], dtype=object)

And coded another way as

print(type(df_train_labels_split))
print(df_train_labels_split.shape)
df_train_labels_split

Which returns

<class 'numpy.ndarray'>
(23455, 11)
array([[0, 0, 0, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0]])

^^^ All of these give me errors:

X_train, y_train, X_test, y_test = iterative_train_test_split(
    df_train["text"].values,
    df_train_labels_split,
    test_size = 0.2
)

Throws

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-67-d7e6efda299e> in <module>
      1 # Get multi-label train/test splits of data
      2 from sklearn.model_selection import train_test_split
----> 3 X_train, y_train, X_test, y_test = iterative_train_test_split(
      4     df_train["text"].values,
      5     df_train_labels_split,

~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in iterative_train_test_split(X, y, test_size)
     93     train_indexes, test_indexes = next(stratifier.split(X, y))
     94 
---> 95     X_train, y_train = X[train_indexes, :], y[train_indexes, :]
     96     X_test, y_test = X[test_indexes, :], y[test_indexes, :]
     97 

IndexError: too many indices for array

^^^ The number of rows matches perfectly, so this is really unclear
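A likely explanation, going by the traceback: line 95 of iterative_stratification.py indexes with X[train_indexes, :], which needs a two-dimensional X, whereas df_train["text"].values has shape (23455,). Below is a minimal NumPy-only sketch of that failure and one possible workaround. The toy array and the names idx and X_2d are illustrative and not part of the library.

```python
import numpy as np

# Toy stand-in for df_train["text"].values: a 1-D object array of strings.
X = np.array(["tweet a", "tweet b", "tweet c", "tweet d"], dtype=object)
idx = np.array([0, 2])  # illustrative row indices, playing the role of train_indexes

try:
    X[idx, :]  # two indices on a 1-D array
    failed = False
except IndexError as exc:
    failed = True
    print("1-D input fails:", exc)

# Adding a second axis gives shape (4, 1), so the same indexing works.
X_2d = X.reshape(-1, 1)
subset = X_2d[idx, :]
print(subset.shape)
```

So the row count matching is not enough here; the number of dimensions matters as well.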

And

X_train, y_train, X_test, y_test = iterative_train_test_split(
    df_train["text"].values,
    df_train["labels"].values,
    test_size = 0.2
)

Gives me

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-69-75472797614a> in <module>
      1 # Get multi-label train/test splits of data
      2 from sklearn.model_selection import train_test_split
----> 3 X_train, y_train, X_test, y_test = iterative_train_test_split(
      4     df_train["text"].values,
      5     df_train["labels"].values,

~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in iterative_train_test_split(X, y, test_size)
     91 
     92     stratifier = IterativeStratification(n_splits=2, order=2, sample_distribution_per_fold=[test_size, 1.0-test_size])
---> 93     train_indexes, test_indexes = next(stratifier.split(X, y))
     94 
     95     X_train, y_train = X[train_indexes, :], y[train_indexes, :]

~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
    334                 .format(self.n_splits, n_samples))
    335 
--> 336         for train, test in super().split(X, y, groups):
    337             yield train, test
    338 

~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
     78         X, y, groups = indexable(X, y, groups)
     79         indices = np.arange(_num_samples(X))
---> 80         for test_index in self._iter_test_masks(X, y, groups):
     81             train_index = indices[np.logical_not(test_index)]
     82             test_index = indices[test_index]

~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups)
     90         By default, delegates to _iter_test_indices(X, y, groups)
     91         """
---> 92         for test_index in self._iter_test_indices(X, y, groups):
     93             test_mask = np.zeros(_num_samples(X), dtype=np.bool)
     94             test_mask[test_index] = True

~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in _iter_test_indices(self, X, y, groups)
    339 
    340         rows, rows_used, all_combinations, per_row_combinations, samples_with_combination, folds = \
--> 341             self._prepare_stratification(y)
    342 
    343         self._distribute_positive_evidence(rows_used, folds, samples_with_combination, per_row_combinations)

~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in _prepare_stratification(self, y)
    236 
    237         """
--> 238         self.n_samples, self.n_labels = y.shape
    239         self.desired_samples_per_fold = np.array([self.percentage_per_fold[i] * self.n_samples
    240                                                   for i in range(self.n_splits)])

ValueError: not enough values to unpack (expected 2, got 1)
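This second error seems to come from the same root cause on the y side: _prepare_stratification unpacks y.shape into two values, but an object array holding one Python list per row is one-dimensional. A small sketch, with made-up label rows, of how such an array could be turned into a 2-D integer matrix:

```python
import numpy as np

# Build a 1-D object array of lists, mimicking df_train["labels"].values.
rows = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0]]
labels = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    labels[i] = row

print(labels.shape)  # a single dimension, so `n, m = labels.shape` cannot unpack

# Converting the per-row lists back into one nested list yields a regular 2-D array.
y = np.array(labels.tolist())
n_samples, n_labels = y.shape
print(n_samples, n_labels)
```

The result has the same kind of shape as the df_train_labels_split array shown earlier. This only works when every row has the same number of entries.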

I think this isn't an issue with my data, as

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df_train["text"].values,
    df_train["labels"].values,
    test_size = 0.2
)

Runs successfully. I'd really love to use this package, but these errors and the documentation gaps are really preventing me from doing so. Any advice would be great!

Also, as a side note, it's a little odd to me that sklearn's train_test_split returns X_train, X_test, y_train, y_test, while skmultilearn's iterative_train_test_split returns X_train, y_train, X_test, y_test.

@edufonseca, did you ever find a solution?

edufonseca · 2020-05-25T00:05:47Z

@AlexMRuch no, I did not. It's a pity. It'd be great to have this work.

AlexMRuch · 2020-05-25T01:40:44Z

Yeah, looks like the last update was a year ago. Wonder if the package is dead :-(

valeriich · 2020-09-07T15:35:29Z

You can simply adapt the iterative_train_test_split function to work with a pandas Series of text data, as below:

from skmultilearn.model_selection import IterativeStratification

def iterative_train_test_split(X, y, test_size):
    # X: pandas Series of texts; y: 2-D numpy array of binary label indicators
    stratifier = IterativeStratification(n_splits=2, order=2, sample_distribution_per_fold=[test_size, 1.0-test_size])
    train_indexes, test_indexes = next(stratifier.split(X, y))

    # Series rows are selected by position with .iloc; y is indexed as a 2-D array
    X_train, y_train = X.iloc[train_indexes], y[train_indexes, :]
    X_test, y_test = X.iloc[test_indexes], y[test_indexes, :]

    return X_train, y_train, X_test, y_test

kevin-yauris · 2020-09-08T02:29:40Z

@AlexMRuch You could try this and see if it works:

X_train, y_train, X_test, y_test = iterative_train_test_split(
    df_train[["text"]].values,
    df_train[["labels"]].values,
    test_size = 0.2
)

I also got an error when using this method, but using double brackets solved it for me.

zbeloki · 2023-03-06T12:25:35Z

As it states in the README, X and y must be matrices of two dimensions.

For instance, if you have a pandas column that you want to use as X, you should first convert it to a numpy array of shape (n, 1):

import numpy as np

X = df.text.to_numpy()
X.shape
# (3,) -> bad! we need to add a new axis

X = X[..., np.newaxis]
# (3,1) -> good

To prepare the y parameter you can use MultiLabelBinarizer from scikit-learn:

from sklearn.preprocessing import MultiLabelBinarizer

labels = [['white', 'black'], ['blue'], ['blue', 'white', 'pink']]
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)
y.shape
# (3,4)
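For the integer-encoded format from the original question, [[2], [0,3], [1], [0,2,3]], a binary indicator matrix can also be built by hand. The sketch below uses only NumPy and assumes the labels are numbered from 0. The texts and variable names are invented for illustration.

```python
import numpy as np

texts = ["doc a", "doc b", "doc c", "doc d"]   # made-up samples
label_lists = [[2], [0, 3], [1], [0, 2, 3]]    # label indices per sample

# X: a single text column, shape (n_samples, 1)
X = np.array(texts, dtype=object)[:, np.newaxis]

# y: indicator matrix, shape (n_samples, n_labels)
n_labels = max(max(row) for row in label_lists) + 1
y = np.zeros((len(label_lists), n_labels), dtype=int)
for i, row in enumerate(label_lists):
    y[i, row] = 1  # switch on the columns listed for sample i

print(X.shape, y.shape)
```

Both arrays are then two-dimensional, which is the form X.A and y.A had in the working 'scene' example, so they should be usable as the X and y arguments of iterative_train_test_split.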

Jopepato mentioned this issue Mar 14, 2019

Set the order parameter in iterative_train_test_split #159

Closed

scikit-multilearn locked and limited conversation to collaborators Mar 14, 2023

ChristianSch converted this issue into discussion #282 Mar 14, 2023


This issue was moved to a discussion.
