Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Version 1.0 breaks cross-validation with string targets #28841

Closed
stefan6419846 opened this issue Apr 15, 2024 · 2 comments · Fixed by #28843
Closed

Version 1.0 breaks cross-validation with string targets #28841

stefan6419846 opened this issue Apr 15, 2024 · 2 comments · Fixed by #28843
Labels

Comments

@stefan6419846
Copy link

stefan6419846 commented Apr 15, 2024

Describe the bug

I just tried to upgrade the package from version 0.24.2 to the latest release. Doing so, my integration tests would start to fail, claiming that there would not be enough samples for at least one class. This only occurs if I use string-based targets instead of integers.

As far as I have seen, there is no API change documented inside the changelog. Doing some testing, it seems like version 1.0 introduced the breaking change.

Steps/Code to Reproduce

import sklearn; sklearn.show_versions()

from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', CalibratedClassifierCV(LinearSVC(), cv=3)),
])

pipeline.fit(
    ['word0 word1 word3 word4'] + ['word0 word1 word2 word3 word4'] * 10 + ['word5 word6 word7 word8 word9'] * 10,
    [1] + [1] * 10 + [2] * 10,
)


pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', CalibratedClassifierCV(LinearSVC(), cv=3)),
])

pipeline.fit(
    ['word0 word1 word3 word4'] + ['word0 word1 word2 word3 word4'] * 10 + ['word5 word6 word7 word8 word9'] * 10,
    ['1'] + ['1'] * 10 + ['2'] * 10,
)

Expected Results

Both pipelines (once with integer targets, once with string targets) can be trained without issues.

Actual Results

Traceback (most recent call last):
  File "/home/stefan/aaa/run.py", line 25, in <module>
    pipeline.fit(
  File "/home/stefan/aaa/venv/lib64/python3.9/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/stefan/aaa/venv/lib64/python3.9/site-packages/sklearn/pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "/home/stefan/aaa/venv/lib64/python3.9/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/stefan/aaa/venv/lib64/python3.9/site-packages/sklearn/calibration.py", line 394, in fit
    raise ValueError(
ValueError: Requesting 3-fold cross-validation but provided less than 3 examples for at least one class.

Versions

Failing:

System:
    python: 3.9.18 (main, Sep 06 2023, 07:49:32) [GCC]
executable: /home/stefan/aaa/venv/bin/python
   machine: Linux-5.14.21-150400.24.100-default-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.4.2
          pip: 22.2.2
   setuptools: 68.2.2
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.4.0
threadpoolctl: 3.4.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libgomp
       filepath: /home/stefan/aaa/venv/lib64/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /home/stefan/aaa/venv/lib64/python3.9/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /home/stefan/aaa/venv/lib64/python3.9/site-packages/scipy.libs/libopenblasp-r0-24bff013.3.26.dev.so
        version: 0.3.26.dev
threading_layer: pthreads
   architecture: Haswell

Working:

System:
    python: 3.9.18 (main, Sep 06 2023, 07:49:32) [GCC]
executable: /home/stefan/aaa/venv/bin/python
   machine: Linux-5.14.21-150400.24.100-default-x86_64-with-glibc2.31

Python dependencies:
          pip: 22.2.2
   setuptools: 68.2.2
      sklearn: 0.24.2
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.4.0
threadpoolctl: 3.4.0

Built with OpenMP: True
@stefan6419846 stefan6419846 added Bug Needs Triage Issue requires triage labels Apr 15, 2024
@stefan6419846
Copy link
Author

As some additional data point: Research showed that

[np.sum(y == class_) < n_folds for class_ in self.classes_]
is the culprit as the array filtering does not seem to work with string-based lists anymore. This seems to indicate that scikit-learn would previously do some conversions for string targets beforehand which were dropped, as git blame does not show a change for this line.

A workaround is to convert my lists to a NumPy array with numpy.asarray() beforehand, but I still think that this is some breaking change/side-effect as integer-based lists continue to work as expected.

@jeremiedbb
Copy link
Member

jeremiedbb commented Apr 15, 2024

Thanks for the report @stefan6419846. We indeed used to convert y into a ndarray but this conversion is now delegated to the inner estimator.

This method for computing the number of occurrences for each class works if both are numpy arrays but fails when 1 is a list and elements are strings. I opened #28843 to fix it

@jeremiedbb jeremiedbb removed the Needs Triage Issue requires triage label Apr 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants