Version 1.0 breaks cross-validation with string targets #28841

stefan6419846 · 2024-04-15T16:18:40Z

Describe the bug

I just tried to upgrade the package from version 0.24.2 to the latest release. Doing so, my integration tests would start to fail, claiming that there would not be enough samples for at least one class. This only occurs if I use string-based targets instead of integers.

As far as I have seen, there is no API change documented inside the changelog. Doing some testing, it seems like version 1.0 introduced the breaking change.

Steps/Code to Reproduce

import sklearn; sklearn.show_versions()

from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', CalibratedClassifierCV(LinearSVC(), cv=3)),
])

pipeline.fit(
    ['word0 word1 word3 word4'] + ['word0 word1 word2 word3 word4'] * 10 + ['word5 word6 word7 word8 word9'] * 10,
    [1] + [1] * 10 + [2] * 10,
)


pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', CalibratedClassifierCV(LinearSVC(), cv=3)),
])

pipeline.fit(
    ['word0 word1 word3 word4'] + ['word0 word1 word2 word3 word4'] * 10 + ['word5 word6 word7 word8 word9'] * 10,
    ['1'] + ['1'] * 10 + ['2'] * 10,
)

Expected Results

Both pipelines (once with integer targets, once with string targets) can be trained without issues.

Actual Results

Traceback (most recent call last):
  File "/home/stefan/aaa/run.py", line 25, in <module>
    pipeline.fit(
  File "/home/stefan/aaa/venv/lib64/python3.9/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/stefan/aaa/venv/lib64/python3.9/site-packages/sklearn/pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "/home/stefan/aaa/venv/lib64/python3.9/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/stefan/aaa/venv/lib64/python3.9/site-packages/sklearn/calibration.py", line 394, in fit
    raise ValueError(
ValueError: Requesting 3-fold cross-validation but provided less than 3 examples for at least one class.

Versions

Failing:

System:
    python: 3.9.18 (main, Sep 06 2023, 07:49:32) [GCC]
executable: /home/stefan/aaa/venv/bin/python
   machine: Linux-5.14.21-150400.24.100-default-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.4.2
          pip: 22.2.2
   setuptools: 68.2.2
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.4.0
threadpoolctl: 3.4.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libgomp
       filepath: /home/stefan/aaa/venv/lib64/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /home/stefan/aaa/venv/lib64/python3.9/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /home/stefan/aaa/venv/lib64/python3.9/site-packages/scipy.libs/libopenblasp-r0-24bff013.3.26.dev.so
        version: 0.3.26.dev
threading_layer: pthreads
   architecture: Haswell

Working:

System:
    python: 3.9.18 (main, Sep 06 2023, 07:49:32) [GCC]
executable: /home/stefan/aaa/venv/bin/python
   machine: Linux-5.14.21-150400.24.100-default-x86_64-with-glibc2.31

Python dependencies:
          pip: 22.2.2
   setuptools: 68.2.2
      sklearn: 0.24.2
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.4.0
threadpoolctl: 3.4.0

Built with OpenMP: True

The text was updated successfully, but these errors were encountered:

stefan6419846 · 2024-04-15T18:17:33Z

As some additional data point: Research showed that

scikit-learn/sklearn/calibration.py

Line 392 in 5dbf795

[np.sum(y == class_) < n_folds for class_ in self.classes_]

is the culprit as the array filtering does not seem to work with string-based lists anymore. This seems to indicate that scikit-learn would previously do some conversions for string targets beforehand which were dropped, as git blame does not show a change for this line.

A workaround is to convert my lists to a NumPy array with numpy.asarray() beforehand, but I still think that this is some breaking change/side-effect as integer-based lists continue to work as expected.

jeremiedbb · 2024-04-15T19:04:36Z

Thanks for the report @stefan6419846. We indeed used to convert y into a ndarray but this conversion is now delegated to the inner estimator.

This method for computing the number of occurrences for each class works if both are numpy arrays but fails when 1 is a list and elements are strings. I opened #28843 to fix it

stefan6419846 added Bug Needs Triage Issue requires triage labels Apr 15, 2024

jeremiedbb mentioned this issue Apr 15, 2024

FIX CalibratedClassifierCV with string targets #28843

Merged

jeremiedbb removed the Needs Triage Issue requires triage label Apr 15, 2024

OmarManzoor closed this as completed in #28843 Apr 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Version 1.0 breaks cross-validation with string targets #28841

Version 1.0 breaks cross-validation with string targets #28841

stefan6419846 commented Apr 15, 2024 •

edited

stefan6419846 commented Apr 15, 2024

jeremiedbb commented Apr 15, 2024 •

edited

Version 1.0 breaks cross-validation with string targets #28841

Version 1.0 breaks cross-validation with string targets #28841

Comments

stefan6419846 commented Apr 15, 2024 • edited

Describe the bug

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

stefan6419846 commented Apr 15, 2024

jeremiedbb commented Apr 15, 2024 • edited

stefan6419846 commented Apr 15, 2024 •

edited

jeremiedbb commented Apr 15, 2024 •

edited