Closed
Describe the bug
The error "TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array." is thrown when passing a sparse matrix to the fit method of OutputCodeClassifier.
Steps/Code to Reproduce
import scipy.sparse as sparse
import numpy as np
from xgboost import XGBClassifier
from sklearn.multiclass import OutputCodeClassifier
xdemo = sparse.random(100, 200, random_state=10)
ydemo = np.random.choice((0, 1, 3, 4), size=100)
xgb = XGBClassifier(random_state=10)
OutputCodeClassifier(xgb, n_jobs=-1, random_state=10, code_size=2).fit(xdemo, ydemo)
Expected Results
No error thrown, successful fitting
Actual Results
~/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/sklearn/multiclass.py in fit(self, X, y)
763 self
764 """
--> 765 X, y = check_X_y(X, y)
766 if self.code_size <= 0:
767 raise ValueError("code_size should be greater than 0, got {0}"
~/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
737 ensure_min_features=ensure_min_features,
738 warn_on_dtype=warn_on_dtype,
--> 739 estimator=estimator)
740 if multi_output:
741 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
~/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
493 dtype=dtype, copy=copy,
494 force_all_finite=force_all_finite,
--> 495 accept_large_sparse=accept_large_sparse)
496 else:
497 # If np.array(..) gives ComplexWarning, then we convert the warning
~/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/sklearn/utils/validation.py in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite, accept_large_sparse)
293
294 if accept_sparse is False:
--> 295 raise TypeError('A sparse matrix was passed, but dense '
296 'data is required. Use X.toarray() to '
297 'convert to a dense numpy array.')
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
It appears that the check_X_y call in OutputCodeClassifier.fit causes the exception, because it is not set to allow sparse matrices.
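The behaviour described above can be checked in isolation with a minimal sketch like the following. It assumes check_X_y is importable from sklearn.utils.validation (the module shown in the traceback); the helper name accepts_sparse and the density value are made up for illustration and are not part of scikit-learn or of the original report.

```python
import numpy as np
import scipy.sparse as sparse
from sklearn.utils.validation import check_X_y

# Same shape as the reproduction above; the density is an arbitrary choice.
X = sparse.random(100, 200, density=0.01, format="csr", random_state=10)
y = np.random.RandomState(10).choice((0, 1, 3, 4), size=100)


def accepts_sparse(**kwargs):
    """Hypothetical helper: True if check_X_y accepts the sparse X."""
    try:
        check_X_y(X, y, **kwargs)
        return True
    except TypeError:
        return False


# With the defaults, sparse input is rejected with the TypeError above.
default_ok = accepts_sparse()

# With accept_sparse=True, validation passes and X stays sparse.
sparse_ok = accepts_sparse(accept_sparse=True)
X_checked, y_checked = check_X_y(X, y, accept_sparse=True)

print("default accepts sparse:", default_ok)
print("accept_sparse=True accepts sparse:", sparse_ok)
print("validated X is still sparse:", sparse.issparse(X_checked))
```

This only exercises the validation helper, not OutputCodeClassifier itself, so it illustrates where the error comes from rather than proving the fix end to end.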
This limitation is especially bad when using this classifier in a pipeline where the previous step outputs a sparse matrix. The easy workaround in this case was to create an intermediate transformer that converts the sparse matrix to a dense one:
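from sklearn.base import BaseEstimator, TransformerMixin

class TurnToDense(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Densify the sparse input (same result as X.A)
        return X.toarray()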
Unfortunately, this causes everything to crash because the huge dense matrix fills up the RAM. Simply adding the keyword argument accept_sparse=True to the check_X_y call fixes this bug.
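To give a rough sense of why the dense workaround exhausts memory, the sketch below compares the storage of a CSR matrix with that of its dense equivalent. The shape and density are made-up illustrative values, not the data from this report, and the actual ratio depends on the index dtype SciPy picks.

```python
import scipy.sparse as sparse

# Hypothetical data: 1000 x 2000 matrix with 1% non-zero entries.
X = sparse.random(1000, 2000, density=0.01, format="csr", random_state=10)

# CSR stores only the non-zero values plus two index arrays.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

# A dense array stores every cell, zero or not.
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize

ratio = dense_bytes / sparse_bytes
print(f"sparse: {sparse_bytes} bytes")
print(f"dense:  {dense_bytes} bytes")
print(f"dense is roughly {ratio:.0f}x larger")
```

The gap grows with the matrix size and with sparsity, which is why densifying the output of a sparse pipeline step can fill the available RAM.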
Versions
System:
python: 3.7.3 (default, Apr 8 2020, 16:07:18) [GCC 6.5.0 20181026]
executable: /home/.pyenv/versions/3.7.3/envs/metro/bin/python3.7
machine: Linux-4.15.0-76-generic-x86_64-with-debian-buster-sid
Python dependencies:
pip: 19.0.3
setuptools: 46.1.3
sklearn: 0.22
numpy: 1.18.1
scipy: 1.4.1
Cython: None
pandas: 1.0.1
matplotlib: 3.2.1
joblib: 0.14.1
Built with OpenMP: True