OneVsOneClassifier does not accept custom input types #23779

Open
vnmabus opened this issue Jun 28, 2022 · 6 comments
vnmabus (Contributor) commented Jun 28, 2022

Describe the bug

Due to the additional validation introduced in #6626, OneVsOneClassifier cannot be used with custom types that behave like arrays but cannot be converted to NumPy arrays. I am the maintainer of scikit-fda, a functional data library that aims to be compatible with scikit-learn. The meta-estimators OneVsRestClassifier and OneVsOneClassifier should be trivially compatible with our data; currently OneVsRestClassifier works fine, while OneVsOneClassifier does not.

I think scikit-learn's validation is sometimes so aggressive that it becomes difficult to extend for custom objects, as in this case.
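
The failure mechanism can be sketched without skfda (CustomContainer below is a hypothetical stand-in, not an skfda class): check_array funnels every input through np.asarray, which cannot build a rectangular numeric array from containers whose elements are not plain scalars.

import numpy as np
from sklearn.utils import check_array

# Hypothetical stand-in for a container such as FDataGrid: it is
# sequence-like, but its rows are not scalar-valued and rectangular
class CustomContainer:
    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        return self.rows[i]

data = CustomContainer([[0.0, 1.0], [2.0, 3.0, 4.0]])  # ragged rows

try:
    check_array(data)  # internally calls np.asarray(data, dtype=...)
except (TypeError, ValueError) as exc:
    print(type(exc).__name__, exc)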

Steps/Code to Reproduce

I have no example using only scikit-learn, as custom types and classifiers/transformers are needed to trigger it.
The following code is adapted from https://fda.readthedocs.io/en/latest/auto_tutorial/plot_skfda_sklearn.html#multiclass-and-multioutput-classification-utilities.
Note that the code works if you replace OneVsOneClassifier with OneVsRestClassifier, as the latter does not perform this aggressive validation.

import skfda
import skfda.preprocessing.dim_reduction.variable_selection as vs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsOneClassifier

# X is an FDataGrid, a custom functional-data container that cannot be converted to a NumPy array
X, y = skfda.datasets.fetch_phoneme(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("dim_reduction", vs.RKHSVariableSelection(n_features_to_select=3)),
    ("classifier", SVC()),
])

multiclass = OneVsOneClassifier(pipeline)

multiclass.fit(X_train, y_train)
multiclass.score(X_test, y_test)

Expected Results

The result of the classification.

Actual Results

TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'FDataGrid'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-2-66e12b97fe3e> in <module>
     17 multiclass = OneVsOneClassifier(pipeline)
     18 
---> 19 multiclass.fit(X_train, y_train)
     20 multiclass.score(X_test, y_test)

~/Programas/Utilidades/Lenguajes/miniconda3/envs/fda38/lib/python3.8/site-packages/sklearn/multiclass.py in fit(self, X, y)
    726         """
    727         # We need to validate the data because we do a safe_indexing later.
--> 728         X, y = self._validate_data(
    729             X, y, accept_sparse=["csr", "csc"], force_all_finite=False
    730         )

~/Programas/Utilidades/Lenguajes/miniconda3/envs/fda38/lib/python3.8/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    579                 y = check_array(y, **check_y_params)
    580             else:
--> 581                 X, y = check_X_y(X, y, **check_params)
    582             out = X, y
    583 

~/Programas/Utilidades/Lenguajes/miniconda3/envs/fda38/lib/python3.8/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    962         raise ValueError("y cannot be None")
    963 
--> 964     X = check_array(
    965         X,
    966         accept_sparse=accept_sparse,

~/Programas/Utilidades/Lenguajes/miniconda3/envs/fda38/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    744                     array = array.astype(dtype, casting="unsafe", copy=False)
    745                 else:
--> 746                     array = np.asarray(array, order=order, dtype=dtype)
    747             except ComplexWarning as complex_warning:
    748                 raise ValueError(

ValueError: setting an array element with a sequence.

Versions

System:
    python: 3.8.0 | packaged by conda-forge | (default, Nov 22 2019, 19:11:38)  [GCC 7.3.0]
executable: /home/carlos/Programas/Utilidades/Lenguajes/miniconda3/envs/fda38/bin/python3.8
   machine: Linux-5.4.0-92-generic-x86_64-with-glibc2.10

Python dependencies:
          pip: 21.3.1
   setuptools: 60.5.0
      sklearn: 1.0.2
        numpy: 1.22.0
        scipy: 1.7.3
       Cython: 0.29.26
       pandas: 1.4.0
   matplotlib: 3.5.1
       joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True
vnmabus added the Bug and Needs Triage labels on Jun 28, 2022
vnmabus (Contributor, Author) commented Jun 28, 2022

I have found an example of the same behaviour using just scikit-learn. The following example works for OneVsRestClassifier but not for OneVsOneClassifier:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsOneClassifier

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
X, y = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42, return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("preprocessing", CountVectorizer()),
    ("classifier", SVC()),
])

multiclass = OneVsOneClassifier(pipeline)

multiclass.fit(X_train, y_train)
multiclass.score(X_test, y_test)

glemaitre (Member) commented

We should probably be more lenient in the validation and not force a 2D matrix.
We will then delegate the validation to the different pipeline steps.

Thus, it could boil down to the following changes:

diff --git a/sklearn/multiclass.py b/sklearn/multiclass.py
index b46b4bfb8b..5f10d4e6bb 100644
--- a/sklearn/multiclass.py
+++ b/sklearn/multiclass.py
@@ -659,7 +659,12 @@ class OneVsOneClassifier(MetaEstimatorMixin, ClassifierMixin, BaseEstimator):
         """
         # We need to validate the data because we do a safe_indexing later.
         X, y = self._validate_data(
-            X, y, accept_sparse=["csr", "csc"], force_all_finite=False
+            X,
+            y,
+            accept_sparse=["csr", "csc"],
+            force_all_finite=False,
+            ensure_2d=False,
+            dtype=None,
         )
         check_classification_targets(y)
 
@@ -738,6 +743,8 @@ class OneVsOneClassifier(MetaEstimatorMixin, ClassifierMixin, BaseEstimator):
             accept_sparse=["csr", "csc"],
             force_all_finite=False,
             reset=first_call,
+            ensure_2d=False,
+            dtype=None,
         )
         check_classification_targets(y)
         combinations = itertools.combinations(range(self.n_classes_), 2)
@@ -806,6 +813,8 @@ class OneVsOneClassifier(MetaEstimatorMixin, ClassifierMixin, BaseEstimator):
             accept_sparse=True,
             force_all_finite=False,
             reset=False,
+            ensure_2d=False,
+            dtype=None,
         )
 
         indices = self.pairwise_indices_

It would lead to the following results:

In [1]: from sklearn.datasets import fetch_20newsgroups
   ...: from sklearn.feature_extraction.text import CountVectorizer
   ...: from sklearn.model_selection import train_test_split
   ...: from sklearn.svm import SVC
   ...: from sklearn.pipeline import Pipeline
   ...: from sklearn.multiclass import OneVsOneClassifier
   ...: 
   ...: categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
   ...: X, y = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42, return_X_y=True)
   ...: 
   ...: X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
   ...: 
   ...: pipeline = Pipeline([
   ...:     ("preprocessing", CountVectorizer()),
   ...:     ("classifier", SVC()),
   ...: ])
   ...: 
   ...: multiclass = OneVsOneClassifier(pipeline)
   ...: 
   ...: multiclass.fit(X_train, y_train)
   ...: multiclass.score(X_test, y_test)

Out[1]: 0.7805309734513274

thomasjpfan added the module:multiclass label and removed the Needs Triage label on Jun 28, 2022
vnmabus (Contributor, Author) commented Jun 29, 2022

@glemaitre Your solution works for the text example, but it still converts the input object to a NumPy array. That causes an error in the first example, as the first transformer expects the custom object.
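
A minimal sketch of the problem (CustomData here is a hypothetical stand-in for FDataGrid): even with dtype=None, np.asarray does not leave an arbitrary object untouched, it wraps it in a 0-d object array, so the downstream transformer never sees the original container.

import numpy as np

# Hypothetical stand-in for an FDataGrid-like object with no array interface
class CustomData:
    def __init__(self, values):
        self.values = values

data = CustomData([1.0, 2.0, 3.0])

# The custom container is replaced by a 0-d object array
arr = np.asarray(data, dtype=None)
print(arr.shape, arr.dtype)  # () object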

glemaitre (Member) commented

We could bypass the validation on X completely but in this case, we would be in trouble with the computation of n_features_in_ if I am not mistaken.

@jeremiedbb do you remember anything about the need to validate X in this estimator?

thomasjpfan (Member) commented

We could bypass the validation on X completely but in this case, we would be in trouble with the computation of n_features_in_ if I am not mistaken.

We can call _check_n_features directly, and it will not learn anything if it cannot figure out the number of features along axis=1.
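
A quick sketch of that behaviour (PassthroughEstimator and OpaqueData are hypothetical names): when X exposes neither shape, __len__ nor __array__, _check_n_features swallows the resulting TypeError and simply skips setting n_features_in_.

from sklearn.base import BaseEstimator

class OpaqueData:
    """No shape, no __len__, no __array__: the number of features is unknowable."""

class PassthroughEstimator(BaseEstimator):
    def fit(self, X, y=None):
        # Record n_features_in_ only if it can be determined from X
        self._check_n_features(X, reset=True)
        return self

est = PassthroughEstimator().fit(OpaqueData())
print(hasattr(est, "n_features_in_"))  # False: nothing was learned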

OneVsOneClassifier will end up calling _fit_ovo_binary, which requires slicing X along axis=0 by calling _safe_indexing. This "works" if the custom object defines __getitem__, but it returns a list:

from sklearn.utils import _safe_indexing

class CustomData:
    def __init__(self):
        self.items = [1, 2, 3, 4, 5, 6]

    def __getitem__(self, k):
        return self.items[k]

custom_data = CustomData()

# Without a shape attribute, _safe_indexing falls back to
# element-wise list indexing and returns a plain list
_safe_indexing(custom_data, [3, 4])
# [4, 5]

I suspect the desired behavior is for _safe_indexing to actually slice custom_data and return a CustomData object. The most straightforward solution is for CustomData.__getitem__ to support slices or array-likes for indexing, and for _safe_indexing to pass the slice or array-like directly into CustomData.__getitem__.

vnmabus (Contributor, Author) commented Jun 29, 2022

We could bypass the validation on X completely but in this case, we would be in trouble with the computation of n_features_in_ if I am not mistaken.

n_features_in_ should be the same as the one returned by the estimator passed to OneVsOneClassifier, so just delegate to it.
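
A minimal sketch of that delegation (DelegatingMeta is a hypothetical name, not the actual OneVsOneClassifier implementation): expose n_features_in_ as a property that defers to the first fitted binary estimator.

import numpy as np
from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone
from sklearn.linear_model import LogisticRegression

class DelegatingMeta(MetaEstimatorMixin, BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # No validation of X here; the sub-estimator validates instead
        self.estimators_ = [clone(self.estimator).fit(X, y)]
        return self

    @property
    def n_features_in_(self):
        # Delegate to whatever the fitted sub-estimator learned
        return self.estimators_[0].n_features_in_

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
meta = DelegatingMeta(LogisticRegression()).fit(X, y)
print(meta.n_features_in_)  # 1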

This "works" if the custom object defines __getitem__, but it returns a list

If the custom object has a shape attribute (which my object does) _safe_indexing works just fine. I am currently using train_test_split in the example without problems, and it uses _safe_indexing internally if I am not mistaken.

Objects without shape, such as Awkward Arrays, may have problems with it, though (I didn't check).
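
For illustration, a hypothetical CustomData along those lines: once it exposes shape and its __getitem__ accepts integer array-likes, _safe_indexing routes through array-style indexing and a CustomData comes back out.

import numpy as np
from sklearn.utils import _safe_indexing

class CustomData:
    def __init__(self, items):
        self.items = np.asarray(items)

    @property
    def shape(self):
        # A shape attribute makes _safe_indexing treat this as an
        # array and pass the index array straight to __getitem__
        return self.items.shape

    def __getitem__(self, key):
        return CustomData(self.items[key])

    def __repr__(self):
        return f"CustomData({self.items.tolist()})"

custom_data = CustomData([1, 2, 3, 4, 5, 6])
print(_safe_indexing(custom_data, [3, 4]))  # CustomData([4, 5])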
