Inflated results on random data with SVM #25631

Open
CriticalValue opened this issue Feb 17, 2023 · 8 comments
Labels
Bug, module:svm, Needs Investigation (Issue requires investigation)

Comments

@CriticalValue

CriticalValue commented Feb 17, 2023

Describe the bug

When trying to train/evaluate a support vector machine in scikit-learn, I am experiencing some unexpected behaviour, and I am wondering whether I am doing something wrong or whether this is a possible bug.

In a very specific subset of circumstances, namely:

  • LeaveOneOut() is used as cross-validation procedure
  • The SVM is used, with probability = True and a small C such as 0.01
  • The y labels are balanced (i.e. the mean of y is 0.5)

The results of the trained SVM are very good on randomly generated data, while they should be near chance level. If the y labels are a bit different, or the SVM is swapped out for a LogisticRegression, the results are as expected (Brier of about 0.25, AUC near 0.5).
But under the circumstances named above, the Brier is roughly 0.10 - 0.15 and the AUC is > 0.9.

Steps/Code to Reproduce

from sklearn import svm
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, LeaveOneOut, KFold
from sklearn.metrics import roc_auc_score, brier_score_loss
from tqdm import tqdm
import pandas as pd

# set up parameters for the experiments with fake data
N = 20
N_FEATURES = 50

scores = []
for z in tqdm(range(500)):
    # generate random data with no real signal
    X = np.random.normal(0, 1, size=(N, N_FEATURES))
    y = np.random.binomial(1, 0.5, size=N)

    # for the first 10 trials, force the y labels to be perfectly balanced (mean of y is 0.5)
    if z < 10:
        y = np.array([0, 1] * int(N / 2))
        y = np.random.permutation(y)

    # leave-one-out cross-validation: fit on N - 1 samples, predict the held-out sample
    y_real, y_pred = [], []
    skf_outer = LeaveOneOut()
    for train_index, test_index in skf_outer.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        clf = svm.SVC(probability=True, C=0.01)

        clf.fit(X_train, y_train)
        predictions = clf.predict_proba(X_test)[:, 1]

        y_pred.extend(predictions)
        y_real.extend(y_test)

    # pool the held-out predictions of all folds and compute the metrics per trial
    scores.append([np.mean(y),
                   brier_score_loss(np.array(y_real), np.array(y_pred)),
                   roc_auc_score(np.array(y_real), np.array(y_pred))])

# average the metrics, split by whether the y labels were perfectly balanced
df_scores = pd.DataFrame(scores)
df_scores.columns = ['y_label', 'brier', 'auc']
df_scores['y_0.5'] = df_scores['y_label'] == 0.5
df_scores = df_scores.groupby(['y_0.5']).mean()
print(df_scores)

Expected Results

I would expect that all results would be somewhat similar, with a Brier ~0.25 and AUC ~0.5.

Actual Results

        y_label     brier       auc
y_0.5                              
False  0.514649  0.298204  0.216884
True   0.500000  0.159728  0.999080

Here you can see that when the mean of the y labels is exactly 0.5, the results are actually really, really good, even though the data is randomly generated in each of the 500 trials.

Versions

System:
    python: 3.8.15 (default, Nov 24 2022, 14:38:14) [MSC v.1916 64 bit (AMD64)]
executable: C:\ProgramData\Anaconda3\envs\test\python.exe
   machine: Windows-10-10.0.19044-SP0
Python dependencies:
      sklearn: 1.2.0
          pip: 22.2.2
   setuptools: 61.2.0
        numpy: 1.19.5
        scipy: 1.10.0
       Cython: 0.29.14
       pandas: 1.4.4
   matplotlib: 3.6.3
       joblib: 1.2.0
threadpoolctl: 2.2.0
Built with OpenMP: True
threadpoolctl info:
       filepath: C:\ProgramData\Anaconda3\envs\test\Library\bin\mkl_rt.1.dll
         prefix: mkl_rt
       user_api: blas
   internal_api: mkl
        version: 2021.4-Product
    num_threads: 8
threading_layer: intel
       filepath: C:\Users\manuser\AppData\Roaming\Python\Python38\site-packages\scipy.libs\libopenblas-802f9ed1179cb9c9b03d67ff79f48187.dll
         prefix: libopenblas
       user_api: blas
   internal_api: openblas
        version: 0.3.18
    num_threads: 16
threading_layer: pthreads
   architecture: Prescott
       filepath: C:\ProgramData\Anaconda3\envs\test\Lib\site-packages\sklearn\.libs\vcomp140.dll
         prefix: vcomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8
       filepath: C:\ProgramData\Anaconda3\envs\test\Library\bin\libiomp5md.dll
         prefix: libiomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8
       filepath: C:\Users\manuser\AppData\Roaming\Python\Python38\site-packages\mxnet\libopenblas.dll
         prefix: libopenblas
       user_api: blas
   internal_api: openblas
        version: None
    num_threads: 16
threading_layer: pthreads
   architecture: Prescott
       filepath: C:\ProgramData\Anaconda3\envs\test\Lib\site-packages\torch\lib\libiomp5md.dll
         prefix: libiomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 16
       filepath: C:\ProgramData\Anaconda3\envs\test\Lib\site-packages\torch\lib\libiompstubs5md.dll
         prefix: libiomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 1
CriticalValue added the Bug and Needs Triage (Issue requires triage) labels on Feb 17, 2023
@glemaitre
Member

The way you are evaluating is equivalent to using LeaveOneOut with cross_val_predict and getting the aggregated predictions to compute the evaluation metric.

It is known that this is not an appropriate way to evaluate a model:

Warning: Note on inappropriate usage of cross_val_predict

The result of cross_val_predict may be different from those obtained using cross_val_score as the elements are grouped in different ways. The function cross_val_score takes an average over cross-validation folds, whereas cross_val_predict simply returns the labels (or probabilities) from several distinct models undistinguished. Thus, cross_val_predict is not an appropriate measure of generalization error.

So I think that what you observe here is this problem.
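
For reference, the manual loop in the report boils down to the following pattern (a minimal sketch on illustrative random, balanced data, not the exact data from the report):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.RandomState(0)
X = rng.normal(size=(20, 50))      # random features with no real signal
y = rng.permutation([0, 1] * 10)   # perfectly balanced labels

clf = SVC(probability=True, C=0.01)

# pool the held-out predictions of all LeaveOneOut folds and score them once;
# this aggregated usage is exactly what the warning above refers to
proba = cross_val_predict(clf, X, y, cv=LeaveOneOut(), method="predict_proba")[:, 1]
print("Brier:", brier_score_loss(y, proba))
print("AUC:", roc_auc_score(y, proba))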

@CriticalValue
Author

CriticalValue commented Feb 20, 2023

Hi, thanks for your fast reply. However, I don't think this is the issue. If I read the cross_val_predict documentation, the problem described there is more that the original y labels might not line up with the outputted ones anymore. If you look at my code, why would it be inappropriate to generate predictions for X_test, as long as we save y_test to make the comparison fair later? Furthermore, that wouldn't explain why the issue is only there for the SVM with probability=True, but not for other models such as LogisticRegression.

To verify this, I rewrote the code to be a lot simpler (which should be the better approach, right?):

from sklearn import svm
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from tqdm import tqdm
import pandas as pd

# set up parameters for the experiments with fake data
N = 20
N_FEATURES = 50
N_TRIALS = 500

scores = []
for z in tqdm(range(N_TRIALS)):
    # generate fake data
    X = np.random.normal(0, 1, size=(N, N_FEATURES))
    y = np.random.binomial(1, 0.5, size=N)

    # for the first 20% of trials, we want to make sure that the y_labels are perfectly balanced, i.e. the mean=0.5
    # we do this, because our hypothesis is that something weird happens in this setting
    if z < int(N_TRIALS/5):
        y = np.array([0, 1] * int(N/2))
        y = np.random.permutation(y)

    # initialize the SVC
    clf = svm.SVC(probability=True, C=0.01)
    
    # calculate scores using cross_val_score
    nested_score = cross_val_score(clf, X=X, y=y, cv=LeaveOneOut(), scoring='neg_brier_score')
    scores.append([np.mean(y),
                   np.mean(nested_score)])

# put the scores in a pandas DataFrame for nicer inspection
df_scores = pd.DataFrame(scores)
df_scores.columns = ['y_label', 'brier']
df_scores['y_0.5'] = df_scores['y_label'] == 0.5
df_scores = df_scores.groupby(['y_0.5']).mean()
print(df_scores)

but the issue still persists, and only when the mean of the y labels is exactly 0.5: the mean Brier for random data with unbalanced y labels is 0.30, which is expected for random data, but for perfectly balanced classes the mean Brier is 0.16, and the highest Brier in these trials is only 0.19.

Could it be something with the internal cross-validation of the SVM because of probability=True?

@CriticalValue
Author

This is strange. When replacing

clf = svm.SVC(probability=True, C=0.01)

with

clf = CalibratedClassifierCV(svm.SVC(C=0.01), cv=5)

the results are as one would expect, even though, if I understand the documentation correctly, this is what SVC(probability=True) does under the hood?
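
For completeness, a minimal side-by-side sketch of that swap on a single trial of illustrative random, balanced data (the variable names here are just for illustration):

import numpy as np
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.RandomState(0)
X = rng.normal(size=(20, 50))      # random features
y = rng.permutation([0, 1] * 10)   # perfectly balanced labels

for name, clf in [
    ("SVC(probability=True)", SVC(probability=True, C=0.01)),
    ("CalibratedClassifierCV(SVC)", CalibratedClassifierCV(SVC(C=0.01), cv=5)),
]:
    # pooled LeaveOneOut predictions, as in the reproduction script above
    proba = cross_val_predict(clf, X, y, cv=LeaveOneOut(), method="predict_proba")[:, 1]
    print(name, "Brier:", brier_score_loss(y, proba), "AUC:", roc_auc_score(y, proba))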

@glemaitre
Member

This uses the same Platt method. However, CalibratedClassifierCV is implemented by scikit-learn, while the Platt calibration for the SVM is done in libsvm, if I am not wrong. I will have a look.
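
As a rough, conceptual sketch of what the Platt/sigmoid step amounts to (this is not the exact libsvm or scikit-learn implementation, and the internal cross-validation details differ between the two): collect decision values on held-out folds, then fit a sigmoid mapping those values to probabilities.

import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.RandomState(0)
X = rng.normal(size=(20, 50))      # illustrative random features
y = rng.permutation([0, 1] * 10)   # perfectly balanced labels

# 1. decision values of an uncalibrated SVM, obtained on held-out folds
decision = cross_val_predict(SVC(C=0.01), X, y, cv=StratifiedKFold(5),
                             method="decision_function")

# 2. the "sigmoid" of Platt scaling is essentially a 1-D logistic regression
#    mapping decision values to probabilities; predict_proba then applies it
sigmoid = LogisticRegression().fit(decision.reshape(-1, 1), y)
proba = sigmoid.predict_proba(decision.reshape(-1, 1))[:, 1]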

@amueller
Member

amueller commented Feb 21, 2023

I think the difference is that one of them (I think scikit-learn?) averages the cross-validated results, while libsvm refits the SVM on the whole data and uses the fitted sigmoid model? Though that wouldn't explain the mismatch, I think. libsvm's Platt scaling had some interesting edge cases, but I don't remember which one would explain this behavior.

Also see #16145.

My confusion on the same issue four years ago can be found here: #13662 (comment)

@amueller
Member

I think the conclusion there was that CalibratedClassifierCV uses stratified sampling while libsvm does not, and that the LOO is indeed the culprit here.
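
A quick illustrative check of why LeaveOneOut is special here (not code from the issue, just a sketch): with perfectly balanced labels, the held-out class is always the minority class of the corresponding training fold, so any calibration step that is sensitive to class balance gets a systematic hint about the held-out label.

import numpy as np
from sklearn.model_selection import LeaveOneOut

y = np.array([0, 1] * 10)     # perfectly balanced labels
X = np.zeros((len(y), 1))     # features are irrelevant for this check

for train_index, test_index in LeaveOneOut().split(X, y):
    held_out = y[test_index][0]
    counts = np.bincount(y[train_index])
    # the held-out class always has 9 training samples, the other class 10
    assert counts[held_out] == 9 and counts[1 - held_out] == 10
print("In every LOO fold the held-out class is the minority class of the training fold.")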

@CriticalValue
Author

CriticalValue commented Feb 21, 2023

I think the difference is that one of them (I think scikit-learn?) averages the cross-validated results, while libsvm refits the SVM on the whole data and uses the fitted sigmoid model? Though that wouldn't explain the mismatch, I think. libsvm's Platt scaling had some interesting edge cases, but I don't remember which one would explain this behavior.

Also see #16145.

My confusion on the same issue four years ago can be found here: #13662 (comment)

Thanks for these links! While these are definitely useful and closely related, they do not seem to mention the specific issue raised here (but I might be missing or misunderstanding something, of course).

What I think happens, at a higher level, is that something wrong or strange occurs in libsvm's Platt scaling under these specific circumstances (low C, LOO-CV, balanced y). I think the balanced y and LOO-CV are both needed: the fitted sigmoid model of libsvm is probably somehow biased, and because of the combination of balanced y and LOO-CV, if that bias happens to favour the label that was held out into y_test, you somehow end up predicting very well, as is happening here.

I was wondering how it is possible that, on random data, the SVM still correctly predicts the held-out sample that was never passed to SVC(probability=True). The only thing I can think of is the scenario above. Let me clarify it with an example (a small sketch to check this hypothesis follows after the EDIT below):

  • y = np.array([0,0,0,0,0,1,1,1,1,1]) (i.e. 5 negative samples and 5 positive samples)
  • we run LOO-CV, and during the first iteration y_test is therefore 0
  • y_train now has 4 negative samples and 5 positive samples
  • I think something goes wrong in libsvm's Platt scaling that biases its predictions somewhat towards 0
  • we correctly predict y_test as being closer to 0 than a random prediction would be
  • and so on for the other folds

Hope this makes sense?

EDIT: and maybe good to emphasize: with higher values of C (e.g. 10 or 1000) the strange results do not occur and the predictions are at chance level.
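
A small sketch that could be used to check this hypothesis (illustrative random data; the point is only the direction of the effect): record the predicted probability of class 1 in every LOO fold and compare its average between folds where the held-out label is 0 and folds where it is 1, for a small and a large C.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut

rng = np.random.RandomState(0)
X = rng.normal(size=(20, 50))      # random features
y = rng.permutation([0, 1] * 10)   # perfectly balanced labels

for C in [0.01, 1000]:
    proba_when_0, proba_when_1 = [], []
    for train_index, test_index in LeaveOneOut().split(X, y):
        clf = SVC(probability=True, C=C).fit(X[train_index], y[train_index])
        p1 = clf.predict_proba(X[test_index])[0, 1]
        (proba_when_1 if y[test_index][0] == 1 else proba_when_0).append(p1)
    print(f"C={C}: mean P(y=1) when the held-out label is 0: {np.mean(proba_when_0):.3f}, "
          f"when it is 1: {np.mean(proba_when_1):.3f}")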

thomasjpfan added the module:svm and Needs Investigation (Issue requires investigation) labels and removed the Needs Triage label on May 4, 2023
@amueller
Member

I think this is closely related to an issue that AutoGluon has seen in their stacking:
autogluon/autogluon#2779

Essentially calibration is stacking, and we're facing the same information leakage here.
