Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG Wrong error raised for OneVsRestClassifier with a sub-estimator that doesn't allow partial_fit #28108

Closed
StefanieSenger opened this issue Jan 12, 2024 · 9 comments
Labels

Comments

@StefanieSenger
Copy link
Contributor

Describe the bug

When using the OneVsRestClassifier with a sub-estimator, that doesn't have partial_fit implemented, a misleading error message is shown: AttributeError: This 'OneVsRestClassifier' has no attribute 'partial_fit'.

Though, OneVsRestClassifier does implement partial_fit, but the underlying estimator doesn't.

There is an appropriate error raising already implemented in

if not hasattr(self.estimator, "partial_fit"):

            if not hasattr(self.estimator, "partial_fit"):
                raise ValueError(
                    ("Base estimator {0}, doesn't have partial_fit method").format(
                        self.estimator
                    )
                )

But it's not run, because the AttributeError from the @available_if decorator/descriptor thing pops up earlier.

I'd like to learn about that, repair that and add a test to make sure this doesn't happen again by accident.
I will also check if other methods from the multiclass classifiers are also affected.

Is it alright if I go ahead with this?

Steps/Code to Reproduce

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np

iris = load_iris()
sample_weight = np.ones_like(iris.target, dtype=np.float64)
clf = OneVsRestClassifier(
estimator=LogisticRegression(random_state=42)
)
clf.partial_fit(iris.data, iris.target)

Expected Results

ValueError: LogisticRegression doesn't have partial_fit method

Actual Results

AttributeError: This 'OneVsRestClassifier' has no attribute 'partial_fit'.

Versions

System:
    python: 3.10.6 (main, Oct 10 2022, 12:43:33) [GCC 9.4.0]
executable: /home/stefanie/.pyenv/versions/3.10.6/envs/scikit-learn_dev/bin/python
   machine: Linux-5.15.0-91-generic-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.5.dev0
          pip: 23.3.2
   setuptools: 63.2.0
        numpy: 1.26.0
        scipy: 1.11.3
       Cython: 0.29.34
       pandas: 2.1.4
   matplotlib: 3.8.0
       joblib: 1.3.2
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/stefanie/.pyenv/versions/3.10.6/envs/scikit-learn_dev/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/stefanie/.pyenv/versions/3.10.6/envs/scikit-learn_dev/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 8

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
        version: None
    num_threads: 8
@StefanieSenger StefanieSenger added Bug Needs Triage Issue requires triage labels Jan 12, 2024
@StefanieSenger StefanieSenger changed the title Wrong error raised when using OneVsRestClassifier with a sub-estimator that doesn't allow partial_fit BUG Wrong error raised for OneVsRestClassifier with a sub-estimator that doesn't allow partial_fit Jan 12, 2024
@glemaitre glemaitre removed the Needs Triage Issue requires triage label Jan 12, 2024
@glemaitre
Copy link
Member

Interesting, I never realize this issue. available_if is taking a check as parameter and this parameter can return True or False but also raise an AttributeError.

When returning False, it implies that the original object will raise an AttributeError (in our case this is OneVsRestClassifier. If we want the underlying estimator to raise the error then we need to do so in the check function.

This is already something that we do with the Pipeline class:

def _final_estimator_has(attr):
"""Check that final_estimator has `attr`.
Used together with `available_if` in `Pipeline`."""
def check(self):
# raise original `AttributeError` if `attr` does not exist
getattr(self._final_estimator, attr)
return True
return check

With this pattern, we get the right error with the following example:

from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = make_pipeline(DecisionTreeClassifier()).fit(X, y)
clf.decision_function(X)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[21], line 6
      4 X, y = load_iris(return_X_y=True)
      5 clf = make_pipeline(DecisionTreeClassifier()).fit(X, y)
----> 6 clf.decision_function(X)

File ~/Documents/packages/scikit-learn/sklearn/utils/_available_if.py:31, in _AvailableIfDescriptor.__get__(self, obj, owner)
     25 attr_err = AttributeError(
     26     f"This {repr(owner.__name__)} has no attribute {repr(self.attribute_name)}"
     27 )
     28 if obj is not None:
     29     # delegate only on instances, not the classes.
     30     # this is to allow access to the docstrings.
---> 31     if not self.check(obj):
     32         raise attr_err
     33     out = MethodType(self.fn, obj)

File ~/Documents/packages/scikit-learn/sklearn/pipeline.py:53, in _final_estimator_has.<locals>.check(self)
     51 def check(self):
     52     # raise original `AttributeError` if `attr` does not exist
---> 53     getattr(self._final_estimator, attr)
     54     return True

AttributeError: 'DecisionTreeClassifier' object has no attribute 'decision_function'

So it means that we should change the following pattern:

def _estimators_has(attr):
"""Check if self.estimator or self.estimators_[0] has attr.
If `self.estimators_[0]` has the attr, then its safe to assume that other
values has it too. This function is used together with `avaliable_if`.
"""
return lambda self: (
hasattr(self.estimator, attr)
or (hasattr(self, "estimators_") and hasattr(self.estimators_[0], attr))
)

by something like:

def _estimators_has(attr):
    """Check if self.estimator or self.estimators_[0] has attr.

    If `self.estimators_[0]` has the attr, then its safe to assume that other
    values has it too. This function is used together with `available_if`.
    """
    def check(self):
        if hasattr(self, "estimators_"):
            getattr(self.estimators_[0], attr)
            return True
        else:
            getattr(self.estimator, attr)
            return True
    return check

getattr will raise an AttributeError for estimator or estimators_[0] that is the underlying estimator.

I think that we should search for the pattern return lambda self: in the repository and check that the estimators implemented it do not suffer from the same bug.

@StefanieSenger do you want to solve this issue?

@glemaitre
Copy link
Member

glemaitre commented Jan 12, 2024

Additionally, we should check if we can remove the code:

if not hasattr(self.estimator, "partial_fit"):
                raise ValueError(
                    ("Base estimator {0}, doesn't have partial_fit method").format(
                        self.estimator
                    )
                )

since it should be handled by available_if.

@StefanieSenger
Copy link
Contributor Author

@glemaitre Yes, sure I want to solve this issue. Thank you for the directions.

@StefanieSenger
Copy link
Contributor Author

Your suggestion with the wrapper function seems to work out over several layers, @glemaitre:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

iris = load_iris()
sample_weight = np.ones_like(iris.target, dtype=np.float64)

# 1st layer
pipe = Pipeline([("logreg", LogisticRegression())])
clf = OneVsRestClassifier(estimator=pipe).fit(iris.data, iris.target)
clf.transform(iris.data) #AttributeError: 'OneVsRestClassifier' object has no attribute 'transform'

# 2nd layer
pipe = Pipeline([("logreg", LogisticRegression())])
clf = OneVsRestClassifier(estimator=pipe).fit(iris.data, iris.target)
clf.partial_fit(iris.data) #AttributeError: 'Pipeline' object has no attribute 'partial_fit'

# 3rd layer
pipe = Pipeline([("kmeans", KMeans())])
clf = OneVsRestClassifier(estimator=pipe).fit(iris.data, iris.target)
clf.predict_proba(iris.data) # AttributeError: 'KMeans' object has no attribute 'predict_proba'

So, I would now look for more cases (searching for return lambda self: as suggested) and make a PR.

I'd also really like to know @thomasjpfan's opinion on that, because I have the impression that the original intention of available_if was rather to have the meta-estimator display AttribributeError if their sub-estimators don't implement a method.

However, I still find the default error message from available_if confusing in some cases and would be happy to have more custom ones (even more custom than the sub-estimator ones).

@thomasjpfan
Copy link
Member

thomasjpfan commented Jan 20, 2024

I'd also really like to know @thomasjpfan's opinion on that, because I have the impression that the original intention of available_if was rather to have the meta-estimator display AttribributeError if their sub-estimators don't implement a method.

The design of available_if was to be flexible and allow the sub-estimator to raise or the meta-estimator.

Though, OneVsRestClassifier does implement partial_fit, but the underlying estimator doesn't.

Although OneVsRestClassifier.partial_fit is implemented in the code, it is dynamically removed from the instance if the underlying estimator does not define partial_fit. In that sense, the AttributeError about OneVsRestClassifier is correct, but it is not informative enough to explain "why it is not defined".

As an overall solution, I would go with #28108 (comment), but also improve available_if to reraise the AttributeError. That way, we get the both AttributeErrors from the meta-estimator and the sub-estimator. For example:

from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = make_pipeline(DecisionTreeClassifier()).fit(X, y)
clf.decision_function(X)

raises the following:

Exception
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File ~/Repos/scikit-learn-1/sklearn/utils/_available_if.py:29, in _AvailableIfDescriptor._check(self, obj, owner)
     28 try:
---> 29     check_result = self.check(obj)
     30 except Exception as e:

File ~/Repos/scikit-learn-1/sklearn/pipeline.py:53, in _final_estimator_has.<locals>.check(self)
     51 def check(self):
     52     # raise original `AttributeError` if `attr` does not exist
---> 53     getattr(self._final_estimator, attr)
     54     return True

AttributeError: 'DecisionTreeClassifier' object has no attribute 'decision_function'

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
Cell In[2], line 7
      5 X, y = load_iris(return_X_y=True)
      6 clf = make_pipeline(DecisionTreeClassifier()).fit(X, y)
----> 7 clf.decision_function(X)

File ~/Repos/scikit-learn-1/sklearn/utils/_available_if.py:40, in _AvailableIfDescriptor.__get__(self, obj, owner)
     36 def __get__(self, obj, owner=None):
     37     if obj is not None:
     38         # delegate only on instances, not the classes.
     39         # this is to allow access to the docstrings.
---> 40         self._check(obj, owner=owner)
     41         out = MethodType(self.fn, obj)
     43     else:
     44         # This makes it possible to use the decorated method as an unbound method,
     45         # for instance when monkeypatching.

File ~/Repos/scikit-learn-1/sklearn/utils/_available_if.py:31, in _AvailableIfDescriptor._check(self, obj, owner)
     29     check_result = self.check(obj)
     30 except Exception as e:
---> 31     raise AttributeError(attr_err_msg) from e
     33 if not check_result:
     34     raise AttributeError(attr_err_msg)

AttributeError: This 'Pipeline' has no attribute 'decision_function'

I opened #28198 with this proposal.

@StefanieSenger
Copy link
Contributor Author

@thomasjpfan Nice, I like it a lot!

Do I understand correctly that working on this PR still makes sense? For now, there is no inner error raised for the five estimators we talk about here and there should be, right?

@glemaitre
Copy link
Member

You are right @StefanieSenger. We still need this PR such that the check raise an AttributeError instead of returning False in order to trigger the from e in the _check of available_if.

@glemaitre
Copy link
Member

I merged #28198 so we can leverage it.

When I asked for tests, I think the pattern used by @thomasjpfan could be used here as well:
https://github.com/scikit-learn/scikit-learn/pull/28198/files#diff-a9256841e2ca6477c25abca76d3b5d785b69160e055765031b3e9cd8ae3b7b51R67-R72

@glemaitre
Copy link
Member

Closing this issue because the problem is fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

No branches or pull requests

3 participants