Possible bug when combining SVC + class_weights='balanced' + LeaveOneOut #10233

Open
m-guggenmos opened this issue Nov 30, 2017 · 23 comments

@m-guggenmos

This piece of code yields perfect classification accuracy for random data:

import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.svm import SVC

scores = cross_val_score(SVC(kernel='linear', class_weight='balanced', C=1e-08), 
                         np.random.rand(79, 100), 
                         y=np.hstack((np.ones(20), np.zeros(59))), 
                         cv=LeaveOneOut())
print(scores)

The problem disappears when using class_weight=None or a different CV splitter.

Is it a bug or am I missing something?

Tested with version 0.19.1 of scikit-learn on Ubuntu Linux.

@m-guggenmos
Author

Upon reflection this can probably be closed. What happens is that, when a low value of C is chosen, the classifier basically learns the class weights. As a consequence, it always predicts the class of the sample that was left out of the training data, essentially independently of the test sample's features. This is also why it only 'works' when the test set contains a single class (as in LeaveOneOut).

Still, until I boiled it down to this minimal example, this gave me a major headache in a larger piece of code. Maybe one could provide a warning in the docs?

@jnothman
Member

jnothman commented Nov 30, 2017 via email

@m-guggenmos
Author

m-guggenmos commented Dec 1, 2017

Possibly it is the custom scikit-learn code around class_weight='balanced' that causes problems. With this option, the class weights are computed anew for each cross-validation fold. Apparently, what can happen is that if a class 1 sample is left out for testing, the balance between class 1 and class -1 in training is exactly such that the sklearn-computed class weights make it more likely for the classifier to predict class 1, and vice versa.

It's tricky, because setting class_weight='balanced' appears to be a good and innocent thing to do, yet it can lead to 100% classification accuracy on random data.

Sorry, this is not very constructive, but I hope these caveats could at least be informative for other users. Without a better understanding it's also a bit difficult to come up with an authoritative warning. The best I could come up with is: "Note that setting class weights can lead to biased results in certain cross-validation procedures (e.g. leave-one-sample-out). One workaround is to ensure an equal number of samples of each class in the test set."
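
As a quick sanity check of that workaround (illustrative only, in the spirit of keeping both classes in every test fold rather than literally equal counts): with a stratified splitter the same degenerate classifier should no longer score perfectly on random data.

import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.svm import SVC

y = np.hstack((np.ones(20), np.zeros(59)))
# Same degenerate classifier as above, but every test fold now contains both classes.
scores = cross_val_score(SVC(kernel='linear', class_weight='balanced', C=1e-08),
                         np.random.rand(79, 100), y=y,
                         cv=StratifiedKFold(n_splits=5))
print(scores.mean())  # no longer the spurious perfect score obtained with LeaveOneOut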

((On a side note, without a better understanding I am also a bit skeptical about setting fixed values for class_weight based on the frequency of classes in the entire dataset, because strictly speaking this destroys the independence of training and testing. For instance, if I tell libsvm that class 1 is exactly 10x more frequent than class -1 (amounting to the libsvm options '-w1 10 -w-1 1'), then the classifier is not blind to the test data in a leave-one-out cross-validated procedure, because the class of the test sample could, in theory, be inferred from the class frequencies in the training set. However, as opposed to the sklearn 'balanced' option, I don't have a demo example for this possible caveat and I don't know whether it's really an issue.))

@jnothman
Member

jnothman commented Dec 3, 2017

It took me a while to be convinced by this. I'm going to conclude that it comes down to numerical imprecision: you can either draw 58 0s and 20 1s, in which case the class weights are [0.67241379, 1.95], or you can draw 59 0s and 19 1s, yielding [0.66101695, 2.05263158]. Due to numerical imprecision, we get 0.67241379 * 58 < 1.95 * 20 but 0.66101695 * 59 > 2.05263158 * 19.
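
To make that arithmetic concrete, a small check (illustrative; the exact comparison results may depend on platform rounding):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

for n0, n1 in [(58, 20), (59, 19)]:  # the two possible LOO training splits
    y_train = np.hstack((np.zeros(n0), np.ones(n1)))
    w0, w1 = compute_class_weight('balanced', classes=np.array([0.0, 1.0]), y=y_train)
    # Mathematically w0 * n0 == w1 * n1 == 39, but the floating-point products
    # need not compare equal.
    print(n0, n1, w0, w1, w0 * n0 < w1 * n1)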

I suppose we could document the 'balanced' compute_class_weight option to say that it is brittle to numerical precision issues... but so is everything when there is no clear signal to learn.

An alternative might be to incorporate a minor perturbation in compute_class_weight so as to ensure that there is a very marginal preference for the classes' true distribution, since (0.67241379 + 1e-8) * 58 > 1.95 * 20. But I hope you'll agree it is a fairly marginal case (no signal, no regularisation, an appropriate estimator, small number of samples, and perfect precision on the less frequent class but imperfect precision on the more frequent class) in which this appears to be needed.
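
For illustration only, a literal reading of that suggestion might look like the sketch below; it is not the actual patch (see the commit referenced below), just the 'balanced' formula plus a flat epsilon so that weight * count is no longer an exact tie and the tie-break follows the training class frequencies rather than rounding noise.

import numpy as np

def perturbed_balanced_weights(y, classes, eps=1e-8):
    # Hypothetical sketch, not scikit-learn code: the standard 'balanced'
    # weights plus a tiny constant. The bonus eps * n_c grows with the class
    # count, so weight * count now marginally favours the training majority.
    counts = np.array([np.sum(y == c) for c in classes])
    return len(y) / (len(classes) * counts) + eps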

I don't understand your final concern in the sklearn context, as our automated class weighting is performed only for a given training set, not the whole dataset.

@jnothman
Member

jnothman commented Dec 3, 2017

For an example patch, see b07172d

@m-guggenmos
Author

Thanks a lot for taking on this issue! Using the following code

import numpy as np
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score

labels = np.hstack((np.ones(20), np.zeros(59)))
pred = cross_val_predict(SVC(kernel='linear', class_weight='balanced', C=1e-08),
                         np.random.rand(79, 100), y=labels,
                         cv=LeaveOneOut())

print(balanced_accuracy_score(labels, pred))

I verified that your proposed solution works in my case - the balanced accuracy is at chance (0.5).

Regarding my final concern: I put it in double brackets because I was referring to the case of setting fixed class weights (i.e. not using the 'balanced' option), which is a bit off topic. Here the problem is indeed that class weights are not separately computed for each training data set, but rather the weighting is often determined from the class frequency of the entire data set (at least that is what I have seen recommended in various CrossValidated comments).

@jnothman
Member

jnothman commented Dec 4, 2017

Right. Now I understand. But it only destroys the strict independence of training and testing if you derive your fixed weights from examining the data distribution, which I agree is very possible, but it's certainly not an error on the part of the software. I suppose you would rather have a class_weight that can be set as a function of the training data in a manner that isn't 'balanced'? I would certainly consider a PR that does that if there's evidence that users have other useful weighting schemes that are a function of the training distribution.
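
For what it's worth, a user-side version of that idea could look roughly like the sketch below (the wrapper name and the weighting function are illustrative, not scikit-learn API): the weights are derived from the labels of each training fold by an arbitrary function, so nothing leaks from the test fold.

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.svm import SVC

class FoldWeightedClassifier(BaseEstimator, ClassifierMixin):
    """Set the wrapped estimator's class_weight from the training labels via weight_fn."""

    def __init__(self, estimator, weight_fn):
        self.estimator = estimator
        self.weight_fn = weight_fn

    def fit(self, X, y):
        classes, counts = np.unique(y, return_counts=True)
        weights = self.weight_fn(classes, counts)  # sees the training fold only
        self.estimator_ = clone(self.estimator).set_params(
            class_weight=dict(zip(classes, weights)))
        self.estimator_.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

# Example: a softer, square-root-balanced weighting scheme.
clf = FoldWeightedClassifier(
    SVC(kernel='linear'),
    weight_fn=lambda classes, counts: np.sqrt(counts.sum() / (len(classes) * counts)))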

So is this problem something you came across organically, or something you found in a contrived situation? Do you think others would end up with a problem caused by this numerical imprecision on real ML problems? Do you think we should try to fix it, or is it such a weird edge case that it's not worth breaking backwards compatibility for existing models learnt with class_weight='balanced'?

@m-guggenmos
Author

m-guggenmos commented Dec 4, 2017

I'm not aware of class weighting procedures other than 'balanced', which of course does not mean they don't exist. In my opinion, the way this is handled with the 'balanced' option in sklearn is exemplary, precisely because the weights are computed on the training data only.

I would say that I came across the problem relatively organically. To elaborate, I was using GridSearchCV on the C parameter of SVC and after setting class_weight='balanced' I suddenly got amazing accuracies on a real-world data set (i.e., not artificial/random data). I then realized that GridSearchCV was selecting very low values of C, i.e. no regularization at all, which at first was even weirder.

Based on this experience I'm inclined to recommend inclusion of your patch, because I'm sure many people will not investigate further when accuracies are good and 'publishable'. The effect of changing class weights on the order of 1e-8 should be negligible in almost all cases, and if it isn't, it is likely because of this very issue. I see the trade-off with exact backwards compatibility, though.

@jnothman
Member

jnothman commented Dec 4, 2017 via email

@m-guggenmos
Author

It has signal - around 60-65% classification accuracy. With class_weight='balanced' the accuracy became around 75%, which would have been a huge gain in this case.

@amueller
Member

amueller commented Dec 4, 2017

I'm confused, but is the example not using enormous regularization and just fitting the intercept?

@amueller
Member

amueller commented Dec 4, 2017

What's the class balance on your dataset? 60-65% classification accuracy seems like no signal in an imbalanced setting. I think this is more an issue of using accuracy and LOO on an imbalanced dataset.

@jnothman
Member

jnothman commented Dec 4, 2017 via email

@m-guggenmos
Author

@amueller you're of course right: it's an instance of high regularization; I mixed it up with λ = 1/C.

The accuracies mentioned in my last post refer to balanced accuracies though, so class imbalance is taken into account.

@amueller
Member

amueller commented Dec 5, 2017

What version of balanced accuracy? ;) [Less relevant to this issue maybe, but part of my quest to find out what people mean when they say that.] Though depending on the definition you're using, chance performance could be anything, as a function of the imbalance.

@m-guggenmos
Author

m-guggenmos commented Dec 5, 2017

Hmm, it is the version from a pip install git+https://github.com/scikit-learn/scikit-learn.git around a week ago. And then

from sklearn.metrics import balanced_accuracy_score

Does that help?

@amueller
Member

amueller commented Dec 5, 2017

Ah, makes sense. That always has a chance performance of 0.5, right (if we always predict one class, the recall will be 1 for that class and 0 for the other)? Are we saying that in the docs anywhere?
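
For instance, with the class balance from this issue (illustrative numbers):

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.hstack((np.ones(20), np.zeros(59)))
y_majority = np.zeros_like(y_true)  # always predict the majority class
print(accuracy_score(y_true, y_majority))           # ~0.75, inflated by the imbalance
print(balanced_accuracy_score(y_true, y_majority))  # 0.5: recall is 1.0 for class 0 and 0.0 for class 1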

@jnothman
Member

jnothman commented Dec 5, 2017 via email

@mlplyler

mlplyler commented Jul 19, 2018

I believe I have run into almost the exact same issue. I have a small dataset. I am testing many different feature extraction methods, and some have effectively no signal. I am using LOOCV. Is the proposed fix to add the perturbation to sklearn/utils/class_weight.py?

EDIT:
I think this code illustrates the discussion? I'm still kind of confused.

import numpy as np
from sklearn.metrics import accuracy_score

labels = np.hstack((np.ones(20), np.zeros(59))).astype(int)

# No features are used at all below: the "prediction" is derived purely from the
# training labels of each leave-one-out split, via 'balanced'-style weights
# computed in reduced (float16) precision to exaggerate the rounding effect.
ydumbs = []
for i in range(len(labels)):
    ytrain = np.delete(labels, i, axis=0)

    w1 = np.float16(len(ytrain) / (2 * len(ytrain[ytrain == 1])))
    w0 = np.float16(len(ytrain) / (2 * len(ytrain[ytrain == 0])))

    # Mathematically the two weighted counts are equal (both n/2); whether they
    # still compare equal after rounding is exactly the imprecision discussed above.
    if w1 * len(ytrain[ytrain == 1]) > w0 * len(ytrain[ytrain == 0]):
        ydumb = 0
    else:
        ydumb = 1
    ydumbs.append(ydumb)

print(accuracy_score(labels, ydumbs))

@wjxts

wjxts commented Jan 31, 2020

I came across the same issue too. The SOTA accuracy on my problem is 70%. When I set C=0.001, I got 100% accuracy, which is impossible. I found the answer here. Thank you!

@cmarmo added the "Needs Decision - Close" label Dec 23, 2021
@ogrisel
Member

ogrisel commented Jan 14, 2022

> I came across the same issue too. The SOTA accuracy on my problem is 70%. When I set C=0.001, I got 100% accuracy, which is impossible. I found the answer here. Thank you!

Was this also with class_weight="balanced" and LOO?

@ogrisel
Member

ogrisel commented Jan 14, 2022

Maybe we could just document this pitfall in an example and add a short note in the relevant docstrings for class_weight="balanced" and the LOO doc?

@thomasjpfan added the "Hard" label and removed the "Moderate" label Jan 14, 2022
@jbschiratti

@jnothman I'm following up on a previous discussion. Unless I am mistaken, if class_weight='balanced' is passed to LogisticRegressionCV, the class weights are computed from the labels of the entire dataset. This breaks the independence of training and test data. Is there a specific reason for that?
