Balanced sample with low number of one of the classes #6

Closed

miguelwon opened this issue Mar 21, 2019 · 3 comments

@miguelwon

I'm working with an extremely large multilabel problem that has some rare classes. I was trying to use your package to balance my train/test split and noticed that it does not guarantee at least one sample of each class in each set. The following example demonstrates the problem:

>>> import numpy as np
>>> from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
>>>
>>> X = np.arange(10)
>>> X
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> 
>>> y = np.array([[1,1,0],[0,1,0],[1,0,0],[1,0,0],[0,1,0],[0,1,0],[0,1,0],[1,1,0],[0,1,1],[1,0,1]])
>>> y
array([[1, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [1, 1, 0],
       [0, 1, 1],
       [1, 0, 1]])
>>> 
>>> temp = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
>>> train, test = list(temp.split(X, y))[0]
>>>
>>> train
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>>
>>> test
array([0])

The train set contains both samples 8 and 9, which are the only ones that have the class with index 2.
How can I make sure that all splits have at least one sample per class?

@trent-b
Owner

trent-b commented Mar 23, 2019

@miguelwon I can reproduce the problem you are seeing. Thank you for the code to do so. If you use a test_size > 0.2, it will work fine (8 and 9 will be split). Even a test_size of 0.2001 will work. I'm looking into why this happens to see if I can fix it.
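For reference, here is one way to check whether a larger test_size gives every label coverage in both folds; the variable names and the 0.3 value are only illustrative, and the exact indices returned depend on the library:

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

X = np.arange(10)
y = np.array([[1,1,0],[0,1,0],[1,0,0],[1,0,0],[0,1,0],
              [0,1,0],[0,1,0],[1,1,0],[0,1,1],[1,0,1]])

# Re-run the split with test_size above 0.2 and report per-label coverage.
msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train, test = next(msss.split(X, y))

for label in range(y.shape[1]):
    print(f"label {label}: in train={bool(y[train, label].any())}, "
          f"in test={bool(y[test, label].any())}")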

@trent-b
Owner

trent-b commented Mar 24, 2019

@miguelwon I have looked into the issue that you reported. Unfortunately, for the example you provided, there is no way to guarantee at least one sample of each class in each set within the iterative stratification algorithm. The algorithm tries to preserve label distributions between folds. This means the train fold should receive 0.8 * 2 = 1.6 samples with the rightmost label in y, and the test fold should receive 0.2 * 2 = 0.4 samples with the rightmost label in y. This attempt to preserve even distributions obviously cannot guarantee that very rare cases end up evenly split across folds.
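For concreteness, the per-label targets the algorithm is trying to match in this example work out as follows (a small illustrative calculation, not library code):

import numpy as np

# Positive-sample counts per label column in the example above.
y = np.array([[1,1,0],[0,1,0],[1,0,0],[1,0,0],[0,1,0],
              [0,1,0],[0,1,0],[1,1,0],[0,1,1],[1,0,1]])
label_totals = y.sum(axis=0)   # array([5, 6, 2])

# Desired per-fold counts = fold proportion * label total.
for fold, p in (("train", 0.8), ("test", 0.2)):
    print(fold, p * label_totals)
# roughly: train -> 4.0, 4.8, 1.6 and test -> 1.0, 1.2, 0.4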

Trying to add code for special cases like this within the algorithm ends up being a hack and creates a risk that the algorithm may malfunction for non-edge cases. For very rare cases, I think you may have to write some code to set those cases aside and then evenly split them across folds after the iterative stratification algorithm is run.
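A minimal sketch of that set-aside approach, assuming a simple rarity threshold and a round-robin assignment (the helper name, threshold, and assignment rule are illustrative, not part of iterative-stratification):

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

def split_with_rare_labels(X, y, rare_threshold=2, test_size=0.2, random_state=0):
    # Labels with very few positive samples are handled separately.
    label_counts = y.sum(axis=0)
    rare_labels = np.where(label_counts <= rare_threshold)[0]

    # Samples carrying at least one rare label are set aside.
    if len(rare_labels):
        rare_mask = y[:, rare_labels].any(axis=1)
    else:
        rare_mask = np.zeros(len(y), dtype=bool)
    rare_idx = np.where(rare_mask)[0]
    common_idx = np.where(~rare_mask)[0]

    # Run iterative stratification on the remaining samples only.
    msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=test_size,
                                            random_state=random_state)
    train_c, test_c = next(msss.split(X[common_idx], y[common_idx]))
    train, test = list(common_idx[train_c]), list(common_idx[test_c])

    # Deal the set-aside samples out round-robin so both folds get at least one
    # whenever two or more exist.
    for i, idx in enumerate(rare_idx):
        (train if i % 2 == 0 else test).append(idx)

    return np.sort(train), np.sort(test)

With the example above, only samples 8 and 9 carry the rightmost label, so they are the ones set aside and the round-robin step places them in different folds; how much this disturbs the balance of the other labels depends on the data.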

@miguelwon
Author

Ok, thanks for the support @trent-b. Meanwhile, it turned out to be fine. The example I showed you was just an extreme case I tried; when I run it on my real data, it does guarantee a minimum number of samples per class in each fold. By the way, I'm using your module because scikit-multilearn reproduces this same problem with the same dataset.

@trent-b trent-b closed this as completed Mar 25, 2019