Different percentage of samples for each label after using MultilabelStratifiedKFold #14

Lance0218 · 2021-03-23T05:47:22Z

Hi trent-b:

Thanks for this nice repository. I hope you can answer the questions below:

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

def multi2single_labels(y):
    # Count how often each distinct label row (labelset) occurs.
    d = {}
    for row in y:
        d[str(row)] = d.get(str(row), 0) + 1
    return d

# 1078 samples with 4 binary labels; [1,0,1,0] occurs only twice.
yy = np.array([[0,0,0,0]]*318+[[1,0,0,0]]*264+[[0,0,1,0]]*58+[[0,1,0,1]]*51+
              [[1,0,0,1]]*81+[[0,1,0,0]]*151+[[0,1,1,0]]*33+[[0,0,1,1]]*27+
              [[0,0,0,1]]*54+[[0,1,1,1]]*21+[[1,1,0,0]]*11+[[1,1,0,1]]*7+[[1,0,1,0]]*2)
xx = np.zeros((yy.shape[0],))  # dummy features; only the labels matter here
kfold = MultilabelStratifiedKFold(n_splits=2, random_state=42, shuffle=True)
for idx_fold, (idx_train, idx_valid) in enumerate(kfold.split(xx, yy)):
    print(f'Now in {idx_fold}th fold')
    y_valid = yy[idx_valid]
    d_y = multi2single_labels(y_valid)
    print(f'labels of y: {d_y}')

Running the code above (the simplest 2-fold case) gives this result:
Now in 0th fold
labels of y: {'[0 0 0 0]': 155, '[1 0 0 0]': 136, '[0 0 1 0]': 28, '[0 1 0 1]': 25, '[1 0 0 1]': 37, '[0 1 0 0]': 76, '[0 1 1 0]': 18, '[0 0 1 1]': 15, '[0 0 0 1]': 31, '[0 1 1 1]': 9, '[1 1 0 0]': 5, '[1 1 0 1]': 4}
Now in 1th fold
labels of y: {'[0 0 0 0]': 163, '[1 0 0 0]': 128, '[0 0 1 0]': 30, '[0 1 0 1]': 26, '[1 0 0 1]': 44, '[0 1 0 0]': 75, '[0 1 1 0]': 15, '[0 0 1 1]': 12, '[0 0 0 1]': 23, '[0 1 1 1]': 12, '[1 1 0 0]': 6, '[1 1 0 1]': 3, '[1 0 1 0]': 2}
Q1: Why is '[1 0 1 0]' not split one per fold? Both samples end up in the 1th fold.
Q2: Why do the counts of some labelsets differ so much between folds (e.g. '[0 0 0 0]', '[1 0 0 0]')?

Thanks!


trent-b · 2021-03-27T02:46:12Z

Hi Lance0218,

Your questions are understandable. What you are observing is expected, though. The paper from Sechidis et al. (2011) discusses the pros and cons of a "labelset" approach versus their approach, and I believe you are thinking more in terms of a "labelset" approach. The approach by Sechidis et al. works label by label: at each step it looks for the label with the fewest remaining "1" entries, summed across all samples, and uses that to decide which samples to distribute next. It therefore balances the number of positives per label across folds, not the number of samples per distinct 4-element vector. This is in contrast to the "labelset" approach, which would look at your 4-element vectors and immediately put one [1 0 1 0] vector into the 0th fold and the other [1 0 1 0] vector into the 1th fold. It may help to take a look at this slide deck by one of the authors starting at Slide 9.
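To make the per-label view concrete, here is a small sketch that tallies the positives for each of the four labels from the fold output reported above. The dictionaries are copied from that output; the helper name `per_label_positives` is only illustrative and is not part of the library.

```python
# Validation-fold labelset counts, copied from the output posted in this issue.
fold0 = {'[0 0 0 0]': 155, '[1 0 0 0]': 136, '[0 0 1 0]': 28, '[0 1 0 1]': 25,
         '[1 0 0 1]': 37, '[0 1 0 0]': 76, '[0 1 1 0]': 18, '[0 0 1 1]': 15,
         '[0 0 0 1]': 31, '[0 1 1 1]': 9, '[1 1 0 0]': 5, '[1 1 0 1]': 4}
fold1 = {'[0 0 0 0]': 163, '[1 0 0 0]': 128, '[0 0 1 0]': 30, '[0 1 0 1]': 26,
         '[1 0 0 1]': 44, '[0 1 0 0]': 75, '[0 1 1 0]': 15, '[0 0 1 1]': 12,
         '[0 0 0 1]': 23, '[0 1 1 1]': 12, '[1 1 0 0]': 6, '[1 1 0 1]': 3,
         '[1 0 1 0]': 2}


def per_label_positives(labelset_counts):
    """For each label column, sum the samples whose labelset has a 1 there."""
    totals = [0, 0, 0, 0]
    for key, count in labelset_counts.items():
        bits = [int(b) for b in key.strip('[]').split()]
        for i, bit in enumerate(bits):
            totals[i] += bit * count
    return totals


print(per_label_positives(fold0))  # [182, 137, 70, 121]
print(per_label_positives(fold1))  # [183, 137, 71, 120]
```

Seen per label, the two folds differ by at most one positive, even though individual labelsets such as '[0 0 0 0]' or '[1 0 1 0]' are split less evenly.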

From a practical perspective, you can change the random_state to find a split that may be more suitable for you. I see that random_state=0 splits [1 0 1 0] between the two folds. I hope this helps.

Lance0218 · 2021-04-01T10:21:23Z

Hi trent-b:

Thank you for your reply. I found that I can use "LabelEncoder" to do what I want easily, so this issue can be closed.
Thank you again for everything you’ve done!
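The exact code behind the "LabelEncoder" route is not shown in this thread. One plausible reading, sketched here as an assumption, is to collapse each label row into a single class id with scikit-learn's `LabelEncoder` and then stratify on those ids with the ordinary `StratifiedKFold`, which is the "labelset" approach described earlier. The variable names `labelset_ids` and `rare_counts` are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder

# Same label matrix as in the original post.
yy = np.array([[0,0,0,0]]*318+[[1,0,0,0]]*264+[[0,0,1,0]]*58+[[0,1,0,1]]*51+
              [[1,0,0,1]]*81+[[0,1,0,0]]*151+[[0,1,1,0]]*33+[[0,0,1,1]]*27+
              [[0,0,0,1]]*54+[[0,1,1,1]]*21+[[1,1,0,0]]*11+[[1,1,0,1]]*7+[[1,0,1,0]]*2)
xx = np.zeros((yy.shape[0],))  # dummy features; only the labels matter here

# Assumption: treat every distinct 4-element row as one class ("labelset").
labelset_ids = LabelEncoder().fit_transform([str(row) for row in yy])

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
rare_counts = []
for idx_fold, (idx_train, idx_valid) in enumerate(skf.split(xx, labelset_ids)):
    # Count how many [1 0 1 0] rows landed in this validation fold.
    n_rare = int(np.sum(np.all(yy[idx_valid] == [1, 0, 1, 0], axis=1)))
    rare_counts.append(n_rare)
    print(f'fold {idx_fold}: {n_rare} sample(s) of [1 0 1 0]')
```

Because stratification now happens on whole labelsets, each validation fold should receive one of the two [1 0 1 0] samples. The trade-off is that per-label balance is no longer targeted directly, and a labelset with fewer samples than `n_splits` cannot be spread over every fold.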

Best,
Lance

Lance0218 closed this as completed Apr 1, 2021
