Different percentage of samples for each label after using MultilabelStratifiedKFold #14

Closed
Lance0218 opened this issue Mar 23, 2021 · 2 comments


Lance0218 commented Mar 23, 2021

Hi trent-b:

Thanks for this nice repository. I hope you can answer the questions below.

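import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

def multi2single_labels(y):
    # Count how many rows share each distinct label vector ("labelset").
    d = {}
    for row in y:
        d[str(row)] = d.get(str(row), 0) + 1
    return d

# 1078 samples with 4 binary labels, spread over 13 distinct labelsets.
yy = np.array([[0,0,0,0]]*318+[[1,0,0,0]]*264+[[0,0,1,0]]*58+[[0,1,0,1]]*51+
              [[1,0,0,1]]*81+[[0,1,0,0]]*151+[[0,1,1,0]]*33+[[0,0,1,1]]*27+
              [[0,0,0,1]]*54+[[0,1,1,1]]*21+[[1,1,0,0]]*11+[[1,1,0,1]]*7+[[1,0,1,0]]*2)
xx = np.zeros((yy.shape[0],))  # dummy features; only the labels matter here
kfold = MultilabelStratifiedKFold(n_splits=2, random_state=42, shuffle=True)
for idx_fold, (idx_train, idx_valid) in enumerate(kfold.split(xx, yy)):
    print(f'Now in {idx_fold}th fold')
    y_valid = yy[idx_valid]
    d_y = multi2single_labels(y_valid)  # labelset counts in this validation fold
    print(f'labels of y: {d_y}')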

Running the code above (the simplest case, 2 folds) gives this result:
Now in 0th fold
labels of y: {'[0 0 0 0]': 155, '[1 0 0 0]': 136, '[0 0 1 0]': 28, '[0 1 0 1]': 25, '[1 0 0 1]': 37, '[0 1 0 0]': 76, '[0 1 1 0]': 18, '[0 0 1 1]': 15, '[0 0 0 1]': 31, '[0 1 1 1]': 9, '[1 1 0 0]': 5, '[1 1 0 1]': 4}
Now in 1th fold
labels of y: {'[0 0 0 0]': 163, '[1 0 0 0]': 128, '[0 0 1 0]': 30, '[0 1 0 1]': 26, '[1 0 0 1]': 44, '[0 1 0 0]': 75, '[0 1 1 0]': 15, '[0 0 1 1]': 12, '[0 0 0 1]': 23, '[0 1 1 1]': 12, '[1 1 0 0]': 6, '[1 1 0 1]': 3, '[1 0 1 0]': 2}
Q1: Why are the two '[1 0 1 0]' samples not split one per fold, but both placed in the 1th fold?
Q2: Why do the counts of some labelsets differ so much between the folds (e.g. '[0 0 0 0]', '[1 0 0 0]')?

Thanks!

Owner

trent-b commented Mar 27, 2021

Hi Lance0218,

Your questions are understandable, but what you are observing is expected behavior. The paper by Sechidis et al. (2011) discusses the pros and cons of a "labelset" approach versus their approach, and I believe you are thinking in terms of a "labelset" approach. The approach by Sechidis et al. treats each of the four labels separately: it looks at which label has the lowest number of "1" entries, summed across all target instances, to decide which samples to distribute next. This is in contrast to the "labelset" approach, which would look at your 4-element vectors as whole units and immediately put one [1 0 1 0] vector into the 0th fold and the other [1 0 1 0] vector into the 1th fold. It may help to take a look at this slide deck by one of the authors, starting at Slide 9.
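To make the difference concrete, here is a small sketch that recounts the two folds reported above per label column instead of per labelset. The counts are copied from the output in this issue; the helper name `positives_per_label` is made up for illustration and is not part of the library.

```python
# Validation-fold labelset counts, copied from the output reported in this issue.
fold0 = {'[0 0 0 0]': 155, '[1 0 0 0]': 136, '[0 0 1 0]': 28, '[0 1 0 1]': 25,
         '[1 0 0 1]': 37, '[0 1 0 0]': 76, '[0 1 1 0]': 18, '[0 0 1 1]': 15,
         '[0 0 0 1]': 31, '[0 1 1 1]': 9, '[1 1 0 0]': 5, '[1 1 0 1]': 4}
fold1 = {'[0 0 0 0]': 163, '[1 0 0 0]': 128, '[0 0 1 0]': 30, '[0 1 0 1]': 26,
         '[1 0 0 1]': 44, '[0 1 0 0]': 75, '[0 1 1 0]': 15, '[0 0 1 1]': 12,
         '[0 0 0 1]': 23, '[0 1 1 1]': 12, '[1 1 0 0]': 6, '[1 1 0 1]': 3,
         '[1 0 1 0]': 2}

def positives_per_label(labelset_counts, n_labels=4):
    """For each label column, count the samples that have a 1 in that column."""
    totals = [0] * n_labels
    for key, count in labelset_counts.items():
        bits = [int(b) for b in key.strip('[]').split()]
        for j, bit in enumerate(bits):
            totals[j] += bit * count
    return totals

per_label_fold0 = positives_per_label(fold0)
per_label_fold1 = positives_per_label(fold1)
print(per_label_fold0)  # [182, 137, 70, 121]
print(per_label_fold1)  # [183, 137, 71, 120]
```

Counted this way, the two folds differ by at most one positive sample per label, even though individual labelsets such as [1 0 1 0] or [0 0 0 0] are unevenly distributed. Both folds also contain 539 samples each.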

From a practical perspective, you can change the random_state to find a split that may be more suitable for you. I see that random_state=0 splits [1 0 1 0] between the two folds. I hope this helps.
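Trying seeds by hand gets tedious, so the search could be sketched roughly as below. Both helper names are hypothetical, not library functions; the only assumption about the splitter is that it has a scikit-learn style `split(X, y)` method yielding `(train_indices, valid_indices)` pairs.

```python
def labelset_in_every_fold(splitter, X, y, labelset):
    """Return True if every validation fold has at least one row equal to labelset."""
    target = tuple(labelset)
    for _, idx_valid in splitter.split(X, y):
        if not any(tuple(y[i]) == target for i in idx_valid):
            return False
    return True

def first_seed_that_splits(make_splitter, X, y, labelset, seeds=range(20)):
    """Try each seed in order; return the first one whose split spreads
    labelset over all folds, or None if no seed in the range does."""
    for seed in seeds:
        if labelset_in_every_fold(make_splitter(seed), X, y, labelset):
            return seed
    return None
```

With the variables from the original snippet, a call might look like `first_seed_that_splits(lambda s: MultilabelStratifiedKFold(n_splits=2, shuffle=True, random_state=s), xx, yy, [1, 0, 1, 0])`. Keep in mind that picking a seed this way only checks the one labelset you ask about; other labelsets can still end up unevenly split.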

@Lance0218
Author

Hi trent-b:

Thank you for your reply. I found that I can easily do what I want with "LabelEncoder", so this issue can be closed.
Thank you again for everything you’ve done!

Best,
Lance
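The thread does not show the LabelEncoder code, so the following is only a guess at what such a "labelset" split could look like: each distinct label row is encoded as one integer class with scikit-learn's `LabelEncoder`, and the classes are then passed to the ordinary `StratifiedKFold`. The small label matrix is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder

# Made-up label matrix; each distinct row is one "labelset".
y = np.array([[0, 0, 0, 0]] * 6 + [[1, 0, 0, 0]] * 4 + [[1, 0, 1, 0]] * 2)
X = np.zeros((y.shape[0],))  # dummy features; only the labels matter here

# Turn every row into a string key, then into one integer class per labelset.
keys = np.array([''.join(map(str, row)) for row in y])
classes = LabelEncoder().fit_transform(keys)

# Stratify on the labelset classes instead of on the individual labels.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
rare_per_fold = []
for idx_train, idx_valid in skf.split(X, classes):
    rare_per_fold.append(int((keys[idx_valid] == '1010').sum()))
print(rare_per_fold)  # [1, 1]
```

This gives one [1 0 1 0] sample per fold, which is the behavior asked for in Q1. The trade-off is that a labelset with fewer samples than `n_splits` cannot appear in every fold; scikit-learn warns in that case, and the balance per individual label is no longer what is being optimized.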
