
Fix for ridge regression with sparse matrix input #2552

Open
bdkearns wants to merge 1 commit into scikit-learn:master from bdkearns:fix-ridge-with-sparse

5 participants

@bdkearns

Fix for ridge regression with sparse matrix input

  • Re-enable selecting SVD for sparse input. Even though the input is converted to dense, sparse solving may be supported one day, and there are current, valid use cases where sparse input is typical and doesn't blow up the SVD but does blow up the eigen solver.

(Fixes #2354)

@coveralls

Coverage Status

Coverage remained the same when pulling 8facf356b1d668af863559538bf902ab46d98998 on bdkearns:fix-ridge-with-sparse into 101f0ee on scikit-learn:master.

@bdkearns bdkearns Fix for ridge regression with sparse matrix input
Re-enable selecting SVD for sparse input (even though it is converted to
dense, one day sparse solving may be supported, and there are current
valid use cases where sparse input is typical and doesn't blow up the
SVD, but does blow up the eigen solver)
c7529c7
@coveralls

Coverage Status

Coverage remained the same when pulling c7529c7 on bdkearns:fix-ridge-with-sparse into 101f0ee on scikit-learn:master.


@larsmans
Owner

I'd rather have sparse matrix SVD support directly, then, using either scipy.sparse.linalg.svds or sklearn.utils.extmath.randomized_svd. (The latter is very fast and stable.)
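A minimal sketch of what that could look like, assuming a truncated decomposition is acceptable (`k` is a hypothetical cut-off, not part of this PR; neither call densifies X):

```python
# Hedged sketch of the suggestion above: compute a truncated SVD of sparse X
# without densifying it. `k` is an assumed cut-off, not part of the PR.
from scipy import sparse
from scipy.sparse.linalg import svds

X = sparse.rand(10000, 100, density=0.01, format='csr', random_state=0)
k = min(X.shape) - 1     # svds requires 1 <= k < min(X.shape)
U, s, Vt = svds(X, k=k)  # only the top-k singular triplets are computed
```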

@bdkearns

Sure, sparse SVD would be nice to have, but don't you consider it a regression that usages which worked perfectly under 0.13 now not only raise MemoryError because they are forced into the eigen solver, but are forbidden from working even when the SVD solver is specified manually? And for what gain? Why was this usage suddenly banned in 0.14? IMO the advice for the other user who got a MemoryError with a nearly 100k x 100k square matrix (really testing the limits of either solver; it just happened to trigger on SVD) would be to specify the solver by hand, not to suddenly force every sparse input into the eigen solver. It seems a bit knee-jerk.

Now someone with, say, 100k samples x 100 features (which only needs a 100x100 economy SVD, right?) who happens to have sparse input (say, from a text feature problem) gets forced into eigen (where they used to get SVD) and hits a MemoryError; and if they manage to deduce why and try to switch back to SVD, they find it banned. I guess they can work around this by converting the input to dense beforehand, but that seems an odd thing to force users to do.
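To make the shapes concrete, here is the back-of-envelope arithmetic for that example (illustrative numbers only, not from the PR):

```python
# Dense float64 memory for the two solver paths at n_samples=100000, n_features=100.
n, p = 100000, 100
svd_bytes = (n * p + p + p * p) * 8  # economy SVD of X: U (n, p), s (p,), V (p, p)
eig_bytes = n * n * 8                # eigen path: dense X * X.T is (n, n)
# svd_bytes ~= 0.08 GB; eig_bytes ~= 80 GB -- hence the MemoryError.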

@mblondel mblondel commented on the diff
sklearn/linear_model/ridge.py
@@ -645,8 +645,8 @@ def _values(self, alpha, y, v, Q, QT_y):
         return y - (c / G_diag), c
 
     def _pre_compute_svd(self, X, y):
-        if sparse.issparse(X):
-            raise TypeError("SVD not supported for sparse matrices")
+        if sparse.issparse(X) and hasattr(X, 'toarray'):
+            X = X.toarray()
@mblondel Owner

I'm -1 on this PR because densifying a sparse matrix behind the scenes is really not a good idea. I'd rather improve the error message to tell users to call X = X.toarray() in their script.
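For reference, the explicit opt-in being suggested would look something like this (hypothetical X and y; gcv_mode='svd' pins the solver):

```python
# Hypothetical usage of the suggested workaround: the user densifies explicitly
# and so opts in to the memory cost, instead of the library doing it silently.
from sklearn.linear_model import RidgeCV

model = RidgeCV(gcv_mode='svd')  # force the SVD path
model.fit(X.toarray(), y)        # X: scipy.sparse matrix, y: targets
```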

@bdkearns

Well, if that's the case, why disable sparse inputs only for the SVD solver? The eigen solver does safe_sparse_dot(dense_output=True), creating an n_samples x n_samples dense matrix that is possibly even larger than the sparse input. And you changed it so that sparse inputs previously fed to the SVD solver (which worked fine when densified) are now fed to eigen without warning, even when n_samples >> n_features, and give MemoryErrors in these cases. Consider text problems: the inputs lend themselves to a sparse representation (users probably won't think to densify first, even if it is possible) and n_samples >> n_features, so eigen now explodes by default, even though densify+SVD would have worked, and did work in the past.

See issue #2354 for an example that raises a MemoryError under 0.14 but was fine under 0.13.
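A short illustration of the Gram computation described above (toy shapes; safe_sparse_dot is the real helper in sklearn.utils.extmath):

```python
from scipy import sparse
from sklearn.utils.extmath import safe_sparse_dot

X = sparse.rand(5000, 50, density=0.01, format='csr', random_state=0)
# The eigen path forms the sample-by-sample Gram matrix: even for sparse X,
# dense_output=True yields a dense (n_samples, n_samples) array.
K = safe_sparse_dot(X, X.T, dense_output=True)
print(K.shape)  # (5000, 5000)
```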

@mblondel
Owner

The eigen solver does safe_sparse_dot(dense_output=True)

Yes but the resulting output has shape [n_samples, n_samples] when dense NumPy arrays are passed too. So densifying the output matrix doesn't change anything here.

are now fed to eigen without warning, even when n_samples >> n_features, and give MemoryErrors in these cases

Indeed that's bad. Maybe we should disable this behavior then.

@mblondel
Owner

I'd rather have sparse matrix SVD support directly, then, using either scipy.sparse.linalg.svds or sklearn.utils.extmath.randomized_svd. (The latter is very fast and stable.)

I'm not sure whether we can use a truncated SVD here, but it's worth a try.

@bdkearns

Yes but the resulting output has shape [n_samples, n_samples] when dense NumPy arrays are passed too. So densifying the output matrix doesn't change anything here.

Yes, but with dense matrices, anything with n_samples > n_features is sent to SVD, not eigen, and produces an n_features x n_features array instead. That silently changed for sparse inputs, which were banned from SVD at the same time.

One can construct sparse matrices that blow up either solver, since both convert to dense. IMO there are two defensible paths: allow sparse inputs, clearly documenting that both solvers use dense computations, or flat-out ban them from both. This "one user blew up one solver with this problem shape, so let's ban sparse from that solver and send everything to the other" approach is careless.

@bdkearns

Also, consider that other parts of sklearn, such as feature_extraction.text.CountVectorizer, produce sparse matrices. Users probably expect to feed these directly into linear models like this one, as in the examples, and such matrices typically have large n_samples and small n_features. Changing the model here to ban sparse inputs, or to force them into a solver that produces huge arrays (n_samples x n_samples, even when n_samples >> n_features) when it didn't before, seems odd. You are producing a library; one would think API consistency would be valued, and that typical usage patterns and examples would be evaluated before making changes like this that break consistency.
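The pipeline being described, with toy data (hypothetical documents and targets):

```python
# Hypothetical end-to-end example of the usage pattern in question: sparse
# text features from CountVectorizer fed straight into RidgeCV.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeCV

docs = ["the cat sat", "the dog ran", "cats and dogs run"]  # toy corpus
y = [0.1, 0.7, 0.4]                                         # toy targets
X = CountVectorizer().fit_transform(docs)  # scipy.sparse CSR matrix
# In the scenario described (n_samples >> n_features), routing all sparse
# input to the eigen solver allocates the (n_samples, n_samples) matrix
# discussed above.
RidgeCV().fit(X, y)
```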

@bdkearns

And this breakage was for one user who was getting a MemoryError? What part of the change actually "fixed" their problem? Hard-coding all sparse arrays to eigen; it wasn't even the removal of the densification in SVD. And now every user with the problems described above gets MemoryErrors. What a "fix". Anyway, I'm surprised anyone would support hard-coding all sparse inputs to one solver (or, equally, be -1 on reverting it).

@bdkearns bdkearns closed this
@bdkearns bdkearns reopened this
@bdkearns bdkearns commented on the diff
sklearn/linear_model/ridge.py
@@ -701,7 +701,7 @@ def fit(self, X, y, sample_weight=1.0):
         with_sw = len(np.shape(sample_weight))
 
         if gcv_mode is None or gcv_mode == 'auto':
-            if sparse.issparse(X) or n_features > n_samples or with_sw:
+            if n_features > n_samples or with_sw:
                 gcv_mode = 'eigen'

Here, sparse matrices with n_samples >> n_features (typical of those produced by text feature problems) are sent to eigen, where a dense n_samples x n_samples matrix (X dot X.T) is produced, while dense matrices of the same shape would have been sent to SVD (and in the past sparse ones were too).

@mblondel
Owner

Changing the model here to ban sparse or force them into a solver that produces huge arrays (n_feat x n_feat) when it didn't before seems odd

Please don't edit your posts. Otherwise people who read the discussion later can't understand why I asked:

What line creates a (n_features, n_features) matrix?

@bdkearns

Simple typo; don't forget you're free to edit as well.

@amueller amueller added the Bug label
@amueller amueller added this to the 0.16 milestone
@amueller
Owner

Hum, was there any conclusion here?

@mblondel
Owner

Silently densifying a sparse matrix is a bad idea and we've always tried to avoid it. The real fix would be to support sparse matrices directly (so this should be labeled "new feature" rather than "bug"). We could try to compute a sparse SVD via scipy.sparse.linalg.svds or sklearn.utils.extmath.randomized_svd, but they only compute the first k singular vectors, so the results won't be the same. I would be curious to know whether this can work well (for sufficiently large k).

I suggest we close this PR. I agree with the problem but not with the solution.
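A hedged sketch of the truncated-SVD idea above (`k` is an assumed number of components; the result only approximates the full SVD when k < rank(X)):

```python
from scipy import sparse
from sklearn.utils.extmath import randomized_svd

X = sparse.rand(10000, 300, density=0.01, format='csr', random_state=0)
k = 100  # assumed cut-off; larger k trades speed for accuracy
U, s, Vt = randomized_svd(X, n_components=k, random_state=0)
```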

@amueller
Owner

I only skimmed it, but it seemed like a regression; if you say it is not, then I'll take your word for it ;)

@amueller amueller removed the Bug label
@amueller amueller removed this from the 0.16 milestone
Commits on Oct 27, 2013
  1. @bdkearns

    Fix for ridge regression with sparse matrix input

    bdkearns authored
    Re-enable selecting SVD for sparse input (even though it is converted to
    dense, one day sparse solving may be supported, and there are current
    valid use cases where sparse input is typical and doesn't blow up the
    SVD, but does blow up the eigen solver)
Showing with 3 additions and 3 deletions.
  1. +3 −3 sklearn/linear_model/ridge.py
sklearn/linear_model/ridge.py
@@ -645,8 +645,8 @@ def _values(self, alpha, y, v, Q, QT_y):
         return y - (c / G_diag), c
 
     def _pre_compute_svd(self, X, y):
-        if sparse.issparse(X):
-            raise TypeError("SVD not supported for sparse matrices")
+        if sparse.issparse(X) and hasattr(X, 'toarray'):
+            X = X.toarray()
         U, s, _ = linalg.svd(X, full_matrices=0)
         v = s ** 2
         UT_y = np.dot(U.T, y)
@@ -701,7 +701,7 @@ def fit(self, X, y, sample_weight=1.0):
         with_sw = len(np.shape(sample_weight))
 
         if gcv_mode is None or gcv_mode == 'auto':
-            if sparse.issparse(X) or n_features > n_samples or with_sw:
+            if n_features > n_samples or with_sw:
                 gcv_mode = 'eigen'
             else:
                 gcv_mode = 'svd'