
OutputCodeClassifier does not work with sparse input data #17218

Closed
zoj613 opened this issue May 14, 2020 · 5 comments · Fixed by #17233

@zoj613
Contributor

zoj613 commented May 14, 2020

Describe the bug

A TypeError is raised when a sparse matrix is passed to the fit method of OutputCodeClassifier: "A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array."

Steps/Code to Reproduce

import scipy.sparse as sparse
import numpy as np
from xgboost import XGBClassifier
from sklearn.multiclass import OutputCodeClassifier

xdemo = sparse.random(100, 200, random_state=10)
ydemo = np.random.choice((0, 1, 3, 4), size=100)
xgb = XGBClassifier(random_state=10)
OutputCodeClassifier(xgb, n_jobs=-1, random_state=10, code_size=2).fit(xdemo, ydemo)

Expected Results

No error thrown, successful fitting

Actual Results

~/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/sklearn/multiclass.py in fit(self, X, y)
    763         self
    764         """
--> 765         X, y = check_X_y(X, y)
    766         if self.code_size <= 0:
    767             raise ValueError("code_size should be greater than 0, got {0}"

~/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    737                     ensure_min_features=ensure_min_features,
    738                     warn_on_dtype=warn_on_dtype,
--> 739                     estimator=estimator)
    740     if multi_output:
    741         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    493                                       dtype=dtype, copy=copy,
    494                                       force_all_finite=force_all_finite,
--> 495                                       accept_large_sparse=accept_large_sparse)
    496     else:
    497         # If np.array(..) gives ComplexWarning, then we convert the warning

~/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/sklearn/utils/validation.py in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite, accept_large_sparse)
    293 
    294     if accept_sparse is False:
--> 295         raise TypeError('A sparse matrix was passed, but dense '
    296                         'data is required. Use X.toarray() to '
    297                         'convert to a dense numpy array.')

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

It appears that the check_X_y function causes the exception and is not set to allow sparse matrices.

This is especially bad when using this classifier in a pipeline where a previous step outputs a sparse matrix. The easy workaround in that case was to create an intermediate transformer that converts the sparse matrix to a dense one:

from sklearn.base import BaseEstimator, TransformerMixin

class TurnToDense(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # .A is shorthand for .toarray() on scipy sparse matrices
        return X.A
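
For context, a sketch of how such a workaround ends up being used in a pipeline (the TfidfVectorizer step here is only an illustrative sparse-output transformer, not part of the original setup):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OutputCodeClassifier
from xgboost import XGBClassifier

# TfidfVectorizer produces a scipy sparse matrix, so the densifying
# step is needed before OutputCodeClassifier.fit is reached.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("densify", TurnToDense()),
    ("ecoc", OutputCodeClassifier(XGBClassifier(random_state=10),
                                  code_size=2, random_state=10)),
])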

Unfortunately this causes everything to crash because RAM is filled up by the huge dense matrix. Simply adding the keyword argument accept_sparse=True to the check_X_y call fixes this bug.
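
For reference, a minimal sketch of that one-line change in OutputCodeClassifier.fit on the 0.22 code path shown in the traceback (a suggestion only, not a merged patch):

# sklearn/multiclass.py, OutputCodeClassifier.fit
# current: rejects scipy sparse input
X, y = check_X_y(X, y)
# proposed: let sparse input through and leave the decision to the
# wrapped estimator
X, y = check_X_y(X, y, accept_sparse=True)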

Versions


System:
    python: 3.7.3 (default, Apr  8 2020, 16:07:18)  [GCC 6.5.0 20181026]
executable: /home/.pyenv/versions/3.7.3/envs/metro/bin/python3.7
   machine: Linux-4.15.0-76-generic-x86_64-with-debian-buster-sid

Python dependencies:
       pip: 19.0.3
setuptools: 46.1.3
   sklearn: 0.22
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: None
    pandas: 1.0.1
matplotlib: 3.2.1
    joblib: 0.14.1

Built with OpenMP: True
@glemaitre
Member

From what I can see, it seems that we could delegate the check to the underlying classifier.
I think that your fix is correct and if the classifier provided does not support sparse data, it will raise an error.

So we can easily add a non-regression test without relying on XGBoost as well.
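
For instance, something along these lines could work as the test (LinearSVC is used here only as an example of an estimator that handles sparse input natively, it is not from the original discussion):

import numpy as np
import scipy.sparse as sp
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import LinearSVC


def test_output_code_classifier_sparse_input():
    # LinearSVC accepts CSR input, so this only exercises the
    # OutputCodeClassifier validation path.
    rng = np.random.RandomState(0)
    X = sp.random(25, 10, density=0.5, format="csr", random_state=rng)
    y = rng.randint(0, 3, size=25)
    ecoc = OutputCodeClassifier(LinearSVC(random_state=0),
                                code_size=2, random_state=0)
    pred = ecoc.fit(X, y).predict(X)
    assert pred.shape == (25,)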

@zoj613 do you want to make a PR with the fix and the non-regression test?

@glemaitre glemaitre added Bug and removed Bug: triage labels May 14, 2020
@zoj613
Contributor Author

zoj613 commented May 14, 2020

From what I can see, it seems that we could delegate the check to the underlying classifier.
I think that your fix is correct and if the classifier provided does not support sparse data, it will raise an error.

So we can easily add a non-regression test without relying on XGBoost as well.

@zoj613 do you want to make a PR with the fix and the non-regression test?

It seems that another change is also needed so that predict works too. The check on this line

X = check_array(X)

also needs accept_sparse=True to work.

Also, I've just noticed that the code on the master branch has changed. The fit method now calls _validate_data instead of check_X_y

X, y = self._validate_data(X, y)

and it seems that _validate_data accepts **check_params keyword arguments, which are then forwarded to check_array (which supports an accept_sparse parameter).

Do you think adding an extra keyword argument to OutputCodeClassifier's fit and predict methods is enough, or do you have a more elegant approach in mind? I don't mind submitting the PR for this.

@glemaitre
Member

Do you think adding an extra keyword argument to OutputCodeClassifier's fit and predict methods is enough, or do you have a more elegant approach in mind? I don't mind submitting the PR for this.

No, I think it is more elegant to just accept sparse input in both _validate_data and check_array and let the underlying estimator complain. Adding a parameter means that we expect people to read the documentation and understand it, while this is a rather advanced detail here.
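
Concretely, on master that would presumably amount to something like the following (a sketch only; the exact change is in the linked PR):

# in OutputCodeClassifier.fit, forwarded to check_array via **check_params
X, y = self._validate_data(X, y, accept_sparse=True)

# in OutputCodeClassifier.predict
X = check_array(X, accept_sparse=True)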

@zoj613
Contributor Author

zoj613 commented May 14, 2020

Do you think adding an extra keyword argument to OutputCodeClassifier's fit and predict methods is enough, or do you have a more elegant approach in mind? I don't mind submitting the PR for this.

No, I think it is more elegant to just accept sparse input in both _validate_data and check_array and let the underlying estimator complain. Adding a parameter means that we expect people to read the documentation and understand it, while this is a rather advanced detail here.

Okay, no problem. I'm on it.

@zoj613
Contributor Author

zoj613 commented May 14, 2020

Never mind, I figured it out. I had made a silly mistake of not using self.
