Balanced sample with low number of one of the classes #6

Closed

miguelwon opened this issue Mar 21, 2019 · 3 comments

@miguelwon

I'm working with an extremely large multilabel problem that has some rare classes. I was trying to use your package to balance my train/test split and noticed that it does not guarantee at least one sample of each class in each set. The following example demonstrates the problem:

>>> import numpy as np
>>> from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
>>>
>>> X = np.arange(10)
>>> X
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> 
>>> y = np.array([[1,1,0],[0,1,0],[1,0,0],[1,0,0],[0,1,0],[0,1,0],[0,1,0],[1,1,0],[0,1,1],[1,0,1]])
>>> y
array([[1, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [1, 1, 0],
       [0, 1, 1],
       [1, 0, 1]])
>>> 
>>> temp = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
>>> train, test = list(temp.split(X, y))[0]
>>>
>>> train
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>>
>>> test
array([0])

The train set contains both samples 8 and 9, which are the only ones that have the class with index 2.
How can I make sure that all splits have at least one sample per class?

@trent-b
Owner

trent-b commented Mar 23, 2019

@miguelwon I can reproduce the problem you are seeing. Thank you for the code to do so. If you use a test_size > 0.2, it will work fine (8 and 9 will be split). Even a test_size of 0.2001 will work. I'm looking into why this happens to see if I can fix it.
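For reference, here is one way to check whether a larger test_size gives every label coverage in both folds; the variable names and the 0.3 value are only illustrative, and the exact indices returned depend on the library:

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

X = np.arange(10)
y = np.array([[1,1,0],[0,1,0],[1,0,0],[1,0,0],[0,1,0],
              [0,1,0],[0,1,0],[1,1,0],[0,1,1],[1,0,1]])

# Re-run the split with test_size above 0.2 and report per-label coverage.
msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train, test = next(msss.split(X, y))

for label in range(y.shape[1]):
    print(f"label {label}: in train={bool(y[train, label].any())}, "
          f"in test={bool(y[test, label].any())}")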

@trent-b
Owner

trent-b commented Mar 24, 2019

@miguelwon I have looked into the issue that you reported. Unfortunately, for the example you provided, there is no way to guarantee at least one sample of each class in each set within the iterative stratification algorithm. The algorithm tries to preserve label distributions between folds. This means the train fold should receive 0.8 * 2 = 1.6 samples with the rightmost label in y, and the test fold should receive 0.2 * 2 = 0.4 samples with the rightmost label in y. This attempt to preserve even distributions obviously cannot guarantee that very rare cases end up evenly split across folds.
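For concreteness, the per-label targets the algorithm is trying to match in this example work out as follows (a small illustrative calculation, not library code):

import numpy as np

# Positive-sample counts per label column in the example above.
y = np.array([[1,1,0],[0,1,0],[1,0,0],[1,0,0],[0,1,0],
              [0,1,0],[0,1,0],[1,1,0],[0,1,1],[1,0,1]])
label_totals = y.sum(axis=0)   # array([5, 6, 2])

# Desired per-fold counts = fold proportion * label total.
for fold, p in (("train", 0.8), ("test", 0.2)):
    print(fold, p * label_totals)
# roughly: train -> 4.0, 4.8, 1.6 and test -> 1.0, 1.2, 0.4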

Trying to add code for special cases like this within the algorithm ends up being a hack and creates a risk that the algorithm may malfunction for non-edge cases. For very rare cases, I think you may have to write some code to set those cases aside and then evenly split them across folds after the iterative stratification algorithm is run.
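A minimal sketch of that set-aside approach, assuming a simple rarity threshold and a round-robin assignment (the helper name, threshold, and assignment rule are illustrative, not part of iterative-stratification):

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

def split_with_rare_labels(X, y, rare_threshold=2, test_size=0.2, random_state=0):
    # Labels with very few positive samples are handled separately.
    label_counts = y.sum(axis=0)
    rare_labels = np.where(label_counts <= rare_threshold)[0]

    # Samples carrying at least one rare label are set aside.
    if len(rare_labels):
        rare_mask = y[:, rare_labels].any(axis=1)
    else:
        rare_mask = np.zeros(len(y), dtype=bool)
    rare_idx = np.where(rare_mask)[0]
    common_idx = np.where(~rare_mask)[0]

    # Run iterative stratification on the remaining samples only.
    msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=test_size,
                                            random_state=random_state)
    train_c, test_c = next(msss.split(X[common_idx], y[common_idx]))
    train, test = list(common_idx[train_c]), list(common_idx[test_c])

    # Deal the set-aside samples out round-robin so both folds get at least one
    # whenever two or more exist.
    for i, idx in enumerate(rare_idx):
        (train if i % 2 == 0 else test).append(idx)

    return np.sort(train), np.sort(test)

With the example above, only samples 8 and 9 carry the rightmost label, so they are the ones set aside and the round-robin step places them in different folds; how much this disturbs the balance of the other labels depends on the data.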

@miguelwon
Author

Ok, thanks for the support @trent-b. Meanwhile, it turned out to be fine. The example I showed you was just an extreme case I tried; when I run it on my real data, it does guarantee a minimum number of samples per class in each fold. By the way, I'm using your module because scikit-multilearn reproduces this same problem with the same dataset.

@trent-b trent-b closed this as completed Mar 25, 2019