OutputCodeClassifier does not work with sparse input data #17218

@zoj613

Describe the bug

Passing a sparse matrix to the fit method of OutputCodeClassifier raises TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Steps/Code to Reproduce

import scipy.sparse as sparse
import numpy as np
from xgboost import XGBClassifier
from sklearn.multiclass import OutputCodeClassifier

xdemo = sparse.random(100, 200, random_state=10)
ydemo = np.random.choice((0, 1, 3, 4), size=100)
xgb = XGBClassifier(random_state=10)
OutputCodeClassifier(xgb, n_jobs=-1, random_state=10, code_size=2).fit(xdemo, ydemo)

Expected Results

No error is thrown and the classifier is fitted successfully.

Actual Results

~/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/sklearn/multiclass.py in fit(self, X, y)
    763         self
    764         """
--> 765         X, y = check_X_y(X, y)
    766         if self.code_size <= 0:
    767             raise ValueError("code_size should be greater than 0, got {0}"

~/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    737                     ensure_min_features=ensure_min_features,
    738                     warn_on_dtype=warn_on_dtype,
--> 739                     estimator=estimator)
    740     if multi_output:
    741         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    493                                       dtype=dtype, copy=copy,
    494                                       force_all_finite=force_all_finite,
--> 495                                       accept_large_sparse=accept_large_sparse)
    496     else:
    497         # If np.array(..) gives ComplexWarning, then we convert the warning

~/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/sklearn/utils/validation.py in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite, accept_large_sparse)
    293 
    294     if accept_sparse is False:
--> 295         raise TypeError('A sparse matrix was passed, but dense '
    296                         'data is required. Use X.toarray() to '
    297                         'convert to a dense numpy array.')

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

The exception comes from the check_X_y call in fit, which is made with the default accept_sparse=False and therefore rejects sparse matrices.
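To illustrate the validation behaviour in isolation, here is a minimal sketch that calls check_X_y directly, first with its defaults and then with accept_sparse=True. The matrix shape mirrors the reproduction above; the density and the seeded label generator are my own choices for the example, and this does not patch OutputCodeClassifier itself.

```python
import numpy as np
import scipy.sparse as sparse
from sklearn.utils import check_X_y

# Same shape as the reproduction; density and seeding are illustrative choices.
X = sparse.random(100, 200, density=0.01, format="csr", random_state=10)
y = np.random.RandomState(10).choice((0, 1, 3, 4), size=100)

# With the defaults (accept_sparse=False), sparse input is rejected.
try:
    check_X_y(X, y)
    default_rejects_sparse = False
except TypeError:
    default_rejects_sparse = True

# With accept_sparse=True the matrix passes validation and stays sparse.
X_checked, y_checked = check_X_y(X, y, accept_sparse=True)
still_sparse = sparse.issparse(X_checked)

print(default_rejects_sparse, still_sparse)
```

Whether the wrapped estimator can itself consume sparse input is a separate question; XGBClassifier in the reproduction does.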

This is especially bad when using this classifier in a pipeline where the previous step outputs a sparse matrix. The easy workaround in this case was to add an intermediate transformer that converts the sparse matrix to a dense array:

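from sklearn.base import BaseEstimator, TransformerMixin

class TurnToDense(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self
    def transform(self, X):
        # Materialise the sparse matrix as a dense ndarray.
        return X.toarray()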

Unfortunately, this crashes everything because the huge dense matrix fills up the RAM. Simply passing the keyword argument accept_sparse=True to the check_X_y call in fit fixes this bug.
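For a rough sense of why densifying is not viable, the sketch below compares the CSR storage of a sparse matrix with the memory its dense equivalent would need, without ever allocating the dense array. The dimensions and number of non-zeros are hypothetical, picked only for illustration; they are not the sizes from my actual pipeline.

```python
import numpy as np
import scipy.sparse as sparse

# Hypothetical sizes, chosen only for illustration (not from the report).
n_rows, n_cols, nnz = 100_000, 50_000, 50_000

# Build the matrix from explicit coordinates so nothing dense is allocated.
rng = np.random.RandomState(10)
rows = rng.randint(0, n_rows, size=nnz)
cols = rng.randint(0, n_cols, size=nnz)
data = np.ones(nnz, dtype=np.float64)
X = sparse.csr_matrix((data, (rows, cols)), shape=(n_rows, n_cols))

# CSR keeps three arrays: the values, their column indices, and row pointers.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

# A dense array stores every entry: rows * cols * bytes per element.
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize

print(f"CSR storage:   {sparse_bytes / 1e6:.1f} MB")
print(f"dense storage: {dense_bytes / 1e9:.1f} GB")
```

With these assumed sizes the dense array would need 40 GB for float64 values, while the CSR arrays stay in the megabyte range.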

Versions


System:
    python: 3.7.3 (default, Apr  8 2020, 16:07:18)  [GCC 6.5.0 20181026]
executable: /home/.pyenv/versions/3.7.3/envs/metro/bin/python3.7
   machine: Linux-4.15.0-76-generic-x86_64-with-debian-buster-sid

Python dependencies:
       pip: 19.0.3
setuptools: 46.1.3
   sklearn: 0.22
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: None
    pandas: 1.0.1
matplotlib: 3.2.1
    joblib: 0.14.1

Built with OpenMP: True
