This issue was moved to a discussion.

Documentation of iterative_train_test_split incomplete #160

Closed
edufonseca opened this issue Mar 13, 2019 · 6 comments

@edufonseca

iterative_train_test_split is briefly documented here (at the bottom), but the input params X, y are not explained. I tried passing y as a list of lists, encoding the labels as categorical integers, e.g.

[[2], [0,3], [1], [0,2,3]]

but it crashed.

By debugging the example provided here, X, y turn out to be scipy.sparse.lil_matrix. Is this the only format allowed?

Any indication on the possible formats for X, y in iterative_train_test_split? Thanks
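
A list of label-index lists like the one above can be converted to a binary indicator matrix (one row per sample, one column per label), which is the dense counterpart of the sparse label matrix in the bundled example. Whether that is the only accepted format is what this thread is asking, so treat the following as a minimal NumPy-only sketch. It assumes the labels are integers in 0..n_labels-1; the helper name `to_indicator` is made up for illustration and is not part of scikit-multilearn.

```python
import numpy as np

def to_indicator(label_lists, n_labels=None):
    """Turn lists of integer label ids into a 0/1 matrix of shape (n_samples, n_labels)."""
    if n_labels is None:
        # Infer the number of labels from the largest id seen (assumption: ids start at 0).
        n_labels = max(max(labels) for labels in label_lists if labels) + 1
    y = np.zeros((len(label_lists), n_labels), dtype=int)
    for row, labels in enumerate(label_lists):
        y[row, labels] = 1  # set every column listed for this sample
    return y

y = to_indicator([[2], [0, 3], [1], [0, 2, 3]])
print(y.shape)  # (4, 4)
```

Passing `n_labels` explicitly is safer when some label never occurs in the data, since the inferred width would otherwise be too small.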

@AlexMRuch

AlexMRuch commented May 23, 2020

I'm having this issue as well. I've tried converting my inputs to a list of lists, a np.array of lists, a np.array of np.arrays, etc.

I can only get the function to work with the test example, which also works with non-sparse matrices:

from skmultilearn.model_selection.iterative_stratification import iterative_train_test_split
from skmultilearn.dataset import load_dataset

X,y, _, _ = load_dataset('scene', 'undivided')

X_train, y_train, X_test, y_test = iterative_train_test_split(
    X.A,
    y.A,
    test_size = 0.2
)

^^^ This works fine for me

In this case, we have

print(type(X.A))
print(X.A.shape)
X.A

Which returns

<class 'numpy.ndarray'>
(2407, 294)
array([[0.646467, 0.666435, 0.685047, ..., 0.247298, 0.014025, 0.029709],
       [0.770156, 0.767255, 0.761053, ..., 0.137833, 0.082672, 0.03632 ],
       [0.793984, 0.772096, 0.76182 , ..., 0.051125, 0.112506, 0.083924],
       ...,
       [0.952281, 0.944987, 0.905556, ..., 0.0319  , 0.017547, 0.019734],
       [0.88399 , 0.899004, 0.901019, ..., 0.256158, 0.226332, 0.22307 ],
       [0.974915, 0.866425, 0.818144, ..., 0.005131, 0.025059, 0.004033]])

And

print(type(y.A))
print(y.A.shape)
y.A

Which returns

<class 'numpy.ndarray'>
(2407, 6)
array([[1, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1]])

However, with my own data,

print(type(df_train["text"].values))
print(df_train["text"].values.shape)
df_train["text"].values

Which returns

<class 'numpy.ndarray'>
(23455,)
array(['Wholeheartedly support these protests &amp; acts of civil disobedience &amp; will join when I can! #Ferguson #AllLivesMatter http://t.co/D8Phc8UakE',
       'This Sandra Bland situation man no disrespect rest her soul , but people die everyday in a unjustified matter #AllLivesMatter',
       'Commitment to peace, healing and loving neighbors. Give us strength and patience. #PortlandPride #AllLivesMatter #Peace',
       ...,
       'After losing the election to 2 unisex names, maybe it is time for the GOP to support Marriage Equality and Civil Unions. #Sandy #Christie',
       '@FoxNews:Price gouging, looting and rage: #Sandy crimes stories grow http://t.co/zL3iI, Good Luck with their Gun Control Laws and 0 cops!',
       "Might devastated #Sandy victims lose the oppurtunity to vote, thus having their rights violated? Looting their vote. It shouldn't happen."],
      dtype=object)

And

print(type(df_train["labels"].values))
print(df_train["labels"].values.shape)
df_train["labels"].values

Which returns

<class 'numpy.ndarray'>
(23455,)
array([list([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]),
       list([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]),
       list([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]), ...,
       list([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]),
       list([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]),
       list([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])], dtype=object)

And coded another way as

print(type(df_train_labels_split))
print(df_train_labels_split.shape)
df_train_labels_split

Which returns

<class 'numpy.ndarray'>
(23455, 11)
array([[0, 0, 0, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0]])

^^^ All of these give me errors:

X_train, y_train, X_test, y_test = iterative_train_test_split(
    df_train["text"].values,
    df_train_labels_split,
    test_size = 0.2
)

Throws

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-67-d7e6efda299e> in <module>
      1 # Get multi-label train/test splits of data
      2 from sklearn.model_selection import train_test_split
----> 3 X_train, y_train, X_test, y_test = iterative_train_test_split(
      4     df_train["text"].values,
      5     df_train_labels_split,

~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in iterative_train_test_split(X, y, test_size)
     93     train_indexes, test_indexes = next(stratifier.split(X, y))
     94 
---> 95     X_train, y_train = X[train_indexes, :], y[train_indexes, :]
     96     X_test, y_test = X[test_indexes, :], y[test_indexes, :]
     97 

IndexError: too many indices for array

^^^ The number of rows matches perfectly, so this is really unclear
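
The traceback above points at `X[train_indexes, :]`, a two-index lookup, while the text array printed earlier has shape (23455,), i.e. a single dimension. A minimal sketch with toy data (the strings and index values are made up for illustration) reproduces the same IndexError and shows that the identical indexing works once the array has a second axis:

```python
import numpy as np

# Toy stand-in for df_train["text"].values: a 1-D object array of strings.
X = np.array(["first tweet", "second tweet", "third tweet"], dtype=object)
train_indexes = np.array([0, 2])  # hypothetical indices, like those a stratifier yields

try:
    X[train_indexes, :]  # two indices on a 1-D array
    raised = False
except IndexError:
    raised = True

# Reshape to a single-column 2-D array so row/column indexing is valid.
X_2d = X.reshape(-1, 1)
X_train = X_2d[train_indexes, :]
print(raised, X_2d.shape, X_train.shape)  # True (3, 1) (2, 1)
```

So the row counts can match and the call can still fail: the problem is the number of dimensions, not the number of rows.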

And

X_train, y_train, X_test, y_test = iterative_train_test_split(
    df_train["text"].values,
    df_train["labels"].values,
    test_size = 0.2
)

Gives me

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-69-75472797614a> in <module>
      1 # Get multi-label train/test splits of data
      2 from sklearn.model_selection import train_test_split
----> 3 X_train, y_train, X_test, y_test = iterative_train_test_split(
      4     df_train["text"].values,
      5     df_train["labels"].values,

~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in iterative_train_test_split(X, y, test_size)
     91 
     92     stratifier = IterativeStratification(n_splits=2, order=2, sample_distribution_per_fold=[test_size, 1.0-test_size])
---> 93     train_indexes, test_indexes = next(stratifier.split(X, y))
     94 
     95     X_train, y_train = X[train_indexes, :], y[train_indexes, :]

~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
    334                 .format(self.n_splits, n_samples))
    335 
--> 336         for train, test in super().split(X, y, groups):
    337             yield train, test
    338 

~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
     78         X, y, groups = indexable(X, y, groups)
     79         indices = np.arange(_num_samples(X))
---> 80         for test_index in self._iter_test_masks(X, y, groups):
     81             train_index = indices[np.logical_not(test_index)]
     82             test_index = indices[test_index]

~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups)
     90         By default, delegates to _iter_test_indices(X, y, groups)
     91         """
---> 92         for test_index in self._iter_test_indices(X, y, groups):
     93             test_mask = np.zeros(_num_samples(X), dtype=np.bool)
     94             test_mask[test_index] = True

~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in _iter_test_indices(self, X, y, groups)
    339 
    340         rows, rows_used, all_combinations, per_row_combinations, samples_with_combination, folds = \
--> 341             self._prepare_stratification(y)
    342 
    343         self._distribute_positive_evidence(rows_used, folds, samples_with_combination, per_row_combinations)

~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in _prepare_stratification(self, y)
    236 
    237         """
--> 238         self.n_samples, self.n_labels = y.shape
    239         self.desired_samples_per_fold = np.array([self.percentage_per_fold[i] * self.n_samples
    240                                                   for i in range(self.n_splits)])

ValueError: not enough values to unpack (expected 2, got 1)
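
The last frame fails at `self.n_samples, self.n_labels = y.shape`, which needs a two-element shape, whereas the object array of lists printed earlier has shape (23455,). A minimal sketch with made-up rows, assuming every row's list has the same length, reproduces the unpacking error and converts the array to a regular 2-D matrix:

```python
import numpy as np

# Build a 1-D object array whose elements are Python lists,
# mimicking df_train["labels"].values (toy rows, made up here).
rows = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0]]
y_obj = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    y_obj[i] = row

try:
    n_samples, n_labels = y_obj.shape  # shape is (4,): only one value to unpack
    unpack_failed = False
except ValueError:
    unpack_failed = True

# Turn the per-row lists into a regular 2-D integer matrix.
y_2d = np.array(y_obj.tolist())
n_samples, n_labels = y_2d.shape
print(unpack_failed, y_obj.shape, y_2d.shape)  # True (4,) (4, 3)
```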

I think this isn't an issue with my data, as

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df_train["text"].values,
    df_train["labels"].values,
    test_size = 0.2
)

Runs successfully. I'd really love to use this package, but these errors and the documentation gaps are really preventing me from doing so. Any advice would be great!

Also, as a side note, it's a little odd to me that sklearn's returned params are X_train, X_test, y_train, y_test while the multilearn returns are X_train, y_train, X_test, y_test

@edufonseca, did you ever find a solution?

@edufonseca
Author

@AlexMRuch no, I did not. It's a pity. It'd be great to have this work.

@AlexMRuch

Yeah, looks like the last update was a year ago. Wonder if the package is dead :-(

@valeriich

valeriich commented Sep 7, 2020

You can simply customize iterative_train_test_split for a pandas Series of text data, as below:

from skmultilearn.model_selection import IterativeStratification

def iterative_train_test_split(X, y, test_size):
    # X: pandas Series of text; y: 2-D NumPy array of binary labels
    stratifier = IterativeStratification(n_splits=2, order=2, sample_distribution_per_fold=[test_size, 1.0-test_size])
    train_indexes, test_indexes = next(stratifier.split(X, y))

    # Index the Series by position with .iloc; y keeps plain 2-D indexing
    X_train, y_train = X.iloc[train_indexes], y[train_indexes, :]
    X_test, y_test = X.iloc[test_indexes], y[test_indexes, :]

    # Note the return order: X_train, y_train, X_test, y_test
    return X_train, y_train, X_test, y_test

@kevin-yauris

@AlexMRuch You could try this and see if it works:

X_train, y_train, X_test, y_test = iterative_train_test_split(
    df_train[["text"]].values,
    df_train[["labels"]].values,
    test_size = 0.2
)

I also got an error when using this method, but using double brackets solved it for me.

@zbeloki

zbeloki commented Mar 6, 2023

As stated in the README, X and y must be two-dimensional matrices.

For instance, if you have a pandas column that you want to use as X, you should first convert it to a numpy array of shape (n, 1):

import numpy as np

X = df.text.to_numpy()
X.shape
# (3,) -> bad! we need to add a new axis

X = X[..., np.newaxis]
X.shape
# (3, 1) -> good

To prepare the y parameter you can use MultiLabelBinarizer from scikit-learn:

from sklearn.preprocessing import MultiLabelBinarizer

labels = [['white', 'black'], ['blue'], ['blue', 'white', 'pink']]
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)
y.shape
# (3, 4)
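
The two requirements above (both inputs two-dimensional, with matching row counts) could also be checked up front, before calling iterative_train_test_split. A minimal sketch using only NumPy; the helper name `as_2d_inputs` is hypothetical and not part of scikit-multilearn, and it assumes every label row has the same length:

```python
import numpy as np

def as_2d_inputs(X, y):
    """Return (X, y) as 2-D NumPy arrays with matching row counts, or raise ValueError."""
    X = np.asarray(X)
    if X.ndim == 1:
        X = X[:, np.newaxis]        # (n,) -> (n, 1)
    y = np.asarray(y)
    if y.ndim == 1 and y.dtype == object:
        y = np.array(y.tolist())    # 1-D array of equal-length lists -> (n, k)
    if X.ndim != 2 or y.ndim != 2:
        raise ValueError(f"expected 2-D inputs, got X.ndim={X.ndim}, y.ndim={y.ndim}")
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"row mismatch: X has {X.shape[0]}, y has {y.shape[0]}")
    return X, y

X, y = as_2d_inputs(["a", "b", "c"], [[1, 0], [0, 1], [1, 1]])
print(X.shape, y.shape)  # (3, 1) (3, 2)
```

Failing early with an explicit message is intended to be easier to read than the IndexError and ValueError raised from inside the library.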

scikit-multilearn locked and limited conversation to collaborators Mar 14, 2023
ChristianSch converted this issue into discussion #282 Mar 14, 2023
