Dense SVM and zeroed weights for samples of an entire class #5150
Comments
Yeah, that looks like a bug. I wonder whether this shows up in any other models or just the SVM. If you are interested, could you maybe add a test to the common tests (or you could just loop over all_estimators())?
We could just fix it by handing all points to libsvm, but that's probably not what we want to do, right? Does libsvm handle them efficiently?
I've tested all classifiers that have a sample_weight parameter in fit and a predict_proba method.
Thanks for checking.
Here is a script. I have included regressors as well, but I am not sure whether that makes sense here / is relevant to the scope of this bug.
Output
Thanks for that. You should use
Now all classifiers with a predict_proba method (except NuSVC, which throws a ValueError, and SVC of course) accept such input silently, and predict_proba returns columns for all classes in the dataset. We could remove all "incorrect" classes inside BaseSVC's fit method, initialize the internal classes_ attribute from this fixed dataset, and feed the fixed dataset into the underlying implementation. At least then classes_ and the predict_proba output would be consistent.
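The filtering idea above could be sketched like this (a standalone illustration of the proposal, not scikit-learn's actual BaseSVC code; the helper name is made up):

```python
# Hedged sketch: drop classes whose total sample weight is zero before
# fitting, and derive classes_ from the filtered dataset, so classes_ and
# predict_proba stay consistent.
import numpy as np


def filter_zero_weight_classes(X, y, sample_weight):
    """Keep only samples of classes that carry positive total weight."""
    X = np.asarray(X)
    y = np.asarray(y)
    sample_weight = np.asarray(sample_weight, dtype=float)
    keep = [c for c in np.unique(y) if sample_weight[y == c].sum() > 0]
    mask = np.isin(y, keep)
    return X[mask], y[mask], sample_weight[mask], np.array(keep)


X = np.arange(6.0).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 2, 2])
w = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 1.0])  # class 1 fully zero-weighted

Xf, yf, wf, classes_ = filter_zero_weight_classes(X, y, w)
print(classes_)  # class 1 is gone: [0 2]
```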
OK, I'll try to fix this.
How about these results?
produces
Confusing, but at least it looks mathematically correct. And current master already produces the same output if you run this code:
Which is the same, because here we zero the class_weight for the second class instead of doing it through sample_weight.
So it is just using
No, I forced SVC to use all sample_weights. To obtain these results I removed this call https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/src/libsvm/svm.cpp#L2342 and all the corresponding memory allocation/deallocation. The function called on that line removes from the dataset all samples whose sample_weight equals zero. If that deletes every sample of an entire class, SVM training starts on this truncated dataset without knowledge of all possible classes. But the output of this "fixed" code is now consistent with the case where you specify class_weight=0 for some class. And as I found out, the original libsvm also produces the same results.
Never mind; I tried to explain those outputs, but I made a mistake in my reasoning.
Of course it should, but even the original libsvm returns the same probability estimates. dataset.txt:
code:
It produces, in predictions.out:
Maybe it's easier to just fix the meta-estimators so that they don't pass zero-weight samples into their base estimators, and to throw an error for any such input. Bagging, for example, can select the samples for estimator training either through weights (the default, if the estimator supports sample_weight) or by subsampling from the dataset. Because with this bug, if you want to return probability 0 for any "incorrect" class, you must take into account that the SVM classifier also carries a bunch of other attributes, like support_, n_support_, dual_coef_ and coef_: what values would they have in this case? It would look ugly.
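The meta-estimator-side fix could be sketched roughly like this (an illustration of the idea only; `fit_without_zero_weights` is a made-up helper, not a scikit-learn API):

```python
# Hedged sketch: subsample away zero-weight samples before handing data to
# the base estimator, and fail loudly if fewer than two classes survive.
import numpy as np
from sklearn.svm import SVC


def fit_without_zero_weights(estimator, X, y, sample_weight):
    X = np.asarray(X)
    y = np.asarray(y)
    sample_weight = np.asarray(sample_weight, dtype=float)
    mask = sample_weight > 0
    if len(np.unique(y[mask])) < 2:
        raise ValueError("all positive-weight samples belong to one class")
    return estimator.fit(X[mask], y[mask], sample_weight=sample_weight[mask])


X = [[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]]
y = [0, 0, 1, 1, 2, 2]
w = [1.0, 1.0, 0.0, 0.0, 1.0, 1.0]

clf = fit_without_zero_weights(SVC(), X, y, w)
print(clf.classes_)  # class 1 never reached the estimator: [0 2]
```

The trade-off is the one discussed above: the base estimator's `classes_` then honestly reflects only the classes it actually saw, and the meta-estimator has to map its probability columns back into the full class set itself.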
I think this should be fixed in libsvm. Do you have an explanation for the libsvm behavior? It seems highly odd to me. I guess it is a combination of how the class weights change the loss combined with the OVR approach. I don't have time to go through the math right now. This seems to be separate from the "balanced" / "auto" issue, though, right?
No.
Can you point me to that issue? I don't know which one you are asking about.
This one.
Ah, sorry, now I understand. Yep, it isn't related to balanced/auto. I've updated the code listing in the first post.
So, as I suspected, it's not a bug; I asked about it in cjlin1/libsvm#50 (comment). In case the explanation there is unclear: in terms of the original optimization problem, C = 0 means that you cannot penalize the xi values of that entire class, and the minimum of the problem is achieved when you classify every point in the dataset with the label of that strange class whose C = 0 (the xi's can be arbitrarily large positive numbers, since C = 0 leaves them unpenalized).
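In the usual soft-margin notation (a sketch, writing the per-sample penalty as C_i = C · class_weight[y_i] for the binary subproblem), the point is:

```latex
\min_{w,\,b,\,\xi}\quad \tfrac{1}{2}\lVert w\rVert^2 \;+\; \sum_i C_i\,\xi_i
\qquad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0.
```

If C_i = 0 for every sample of one class, those xi_i drop out of the objective entirely, so their margin constraints can be satisfied with arbitrarily large slack at zero cost; assigning everything to that class then costs the optimizer nothing.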
Basically: using class weights with OVR is not a great idea without calibration. Or is there another conclusion?
This bug appears in current master, for any dense SVM class.
Output:
Here we see that libsvm has internally lost the 2nd class, while at the same time sklearn's wrapper class keeps all the class labels; that's why predict_proba returns a matrix of shape (n_samples, 2) instead of (n_samples, 3) (which is what the bagging classifier implementation expects). I understand that this is insane usage of weights by itself, but together with bagging and a dataset with many labels, bagging randomly zeroes out complete classes and this bug shows itself, because bagging expects the SVMs to return probabilities for all the classes they hold (i.e. all classes).
I investigated this a little and can try to fix it, if someone confirms that this usage with bagging makes sense (because I'm not really sure it does).
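For completeness, a reproduction along the lines of the elided listing above (a sketch; the exact predict_proba shape may differ between scikit-learn versions):

```python
# Hedged reproduction sketch: zero out the sample weights of an entire
# class and compare what the wrapper reports against predict_proba's shape.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array([0, 0, 1, 1, 2, 2])
sample_weight = np.where(y == 1, 0.0, 1.0)  # class 1 gets zero weight

clf = SVC(probability=True, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)

proba = clf.predict_proba(X)
print(clf.classes_)  # the wrapper still reports all three classes
print(proba.shape)   # libsvm may have dropped class 1 internally
```

The mismatch described in this issue shows up when `proba.shape[1]` is smaller than `len(clf.classes_)`.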