Balanced sample with low number of one of the classes #6
Comments
@miguelwon I can reproduce the problem you are seeing. Thank you for the code to do so. If you use a
@miguelwon I have looked into the issue that you reported. Unfortunately, for the example you provided, there is no way to guarantee at least one sample of each class in each set within the iterative stratification algorithm. The algorithm tries to preserve label distributions between folds. This means that the train fold should have 0.8 * 2 = 1.6 samples for the rightmost label in X, and the test fold should have 0.2 * 2 = 0.4 samples for the rightmost label in X. Because the algorithm aims to preserve these proportions, it cannot guarantee that very rare labels are split evenly across folds. Adding special-case code for this within the algorithm would be a hack and would risk the algorithm malfunctioning for non-edge cases. For very rare cases, I think you may have to write some code to set those cases aside and then evenly split them across folds after the iterative stratification algorithm is run.
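As a rough illustration of the "set aside, then split evenly" idea above, here is a minimal pure-Python sketch. The helper names `set_aside_rare` and `deal_rare`, the `max_rare_count` threshold, and the toy label matrix `Y` are all assumptions made up for this example; they are not part of this package, and the toy data is not the data from the original report. The slice used for the "regular" samples is only a stand-in for wherever the iterative stratification split would be run.

```python
def set_aside_rare(Y, max_rare_count):
    """Separate samples that carry a rare label from the rest.

    Y is a list of binary label rows. A label is treated as rare when it
    has between 1 and max_rare_count positive samples (hypothetical rule).
    Returns (regular_indices, rare_by_label).
    """
    n_labels = len(Y[0])
    counts = [sum(row[j] for row in Y) for j in range(n_labels)]
    rare_by_label = {}
    taken = set()
    for j, count in enumerate(counts):
        if 0 < count <= max_rare_count:
            # Caveat: a sample with several rare labels is filed under the
            # first one only, so later rare labels may see fewer samples.
            members = [i for i, row in enumerate(Y) if row[j] and i not in taken]
            rare_by_label[j] = members
            taken.update(members)
    regular = [i for i in range(len(Y)) if i not in taken]
    return regular, rare_by_label


def deal_rare(rare_by_label, n_splits=2):
    """Deal each rare label's samples round-robin across the splits."""
    extra = [[] for _ in range(n_splits)]
    for members in rare_by_label.values():
        for k, i in enumerate(members):
            extra[k % n_splits].append(i)
    return extra


# Toy data (made up): label index 2 is positive only for samples 8 and 9.
Y = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0],
     [0, 1, 0], [1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1]]

regular, rare_by_label = set_aside_rare(Y, max_rare_count=2)
extra = deal_rare(rare_by_label, n_splits=2)

# Stand-in for the stratified split of the regular samples; in practice
# this is where the iterative stratification algorithm would be run.
train_regular, test_regular = regular[:6], regular[6:]

train = train_regular + extra[0]
test = test_regular + extra[1]
print(train)  # [0, 1, 2, 3, 4, 5, 8]
print(test)   # [6, 7, 9]
```

Dealing the rare samples round-robin gives each split one of them whenever a label has at least as many samples as there are splits, at the cost of slightly shifting the 80/20 proportions for those labels.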
Ok, thanks for the support @trent-b. Meanwhile, it turned out to be ok. The example I showed you was just an extreme case I tried, but when I ran it with my real data, it was able to guarantee a minimum number of samples per class in each fold. Btw, I'm using your module because scikit-multilearn reproduces this same kind of problem with the same dataset.
I'm working with an extremely large multilabel problem and there are some rare classes. I was trying to use your package to balance my train/test split and noticed that it does not guarantee at least one sample of each class in each set. The following example shows the problem:
The train set contains both samples 8 and 9, which are the only ones that have the class with index 2.
How can I make sure that all splits have at least one sample per class?
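One way to at least detect this situation after a split is to list the classes that have no positive sample in a given set. The sketch below is only an illustration: the helper name `missing_classes`, the toy label matrix `Y`, and the index lists are assumptions invented here, chosen so that samples 8 and 9 are the only ones with class index 2 and both land in the train set.

```python
def missing_classes(Y, split_indices):
    """Return the label indices with no positive sample in the split.

    Y is a list of binary label rows; split_indices selects the rows
    that belong to one split (train or test).
    """
    n_labels = len(Y[0])
    return [j for j in range(n_labels)
            if not any(Y[i][j] for i in split_indices)]


# Toy data (made up): label index 2 is positive only for samples 8 and 9.
Y = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0],
     [0, 1, 0], [1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1]]

train_idx = [0, 1, 2, 3, 4, 5, 8, 9]  # both rare samples end up here
test_idx = [6, 7]

print(missing_classes(Y, train_idx))  # []
print(missing_classes(Y, test_idx))   # [2]
```

A non-empty result for any split means that split cannot be used to train on or evaluate the listed classes.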