[WIP] Ridge path #3383

Open · wants to merge 38 commits into scikit-learn:master from eickenberg:ridge_path

Conversation

9 participants
Contributor

eickenberg commented Jul 14, 2014

Addressing #3373

A first implementation of a ridge path using the 'eigen' solver. This solver is now available in ridge_regression, which calls ridge_path with reduced functionality. (A rough usage sketch follows the list below.)

Next steps:

  • The same path approach can be used straightforwardly with SVD as well: write an SVD path and plug it into ridge_regression
  • Attack RidgeCV: replace the GridSearchCV with ridge_path when the eigen or svd solver is selected
  • Incorporate ridge_path_looe from the gist https://gist.github.com/eickenberg/e9eb106f12e40d017fd0
  • Adjust RidgeCV to use the looe functionalities
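A rough usage sketch of the proposed API (signature taken from the ridge_path definition further down in the diff, shapes from its docstring; this assumes the branch is installed, so it is a sketch rather than working master code):

    import numpy as np
    # assuming this branch is installed; ridge_path lives in sklearn/linear_model/ridge.py
    from sklearn.linear_model.ridge import ridge_path

    rng = np.random.RandomState(0)
    X, Y = rng.randn(50, 10), rng.randn(50, 3)        # multi-target
    alphas = np.logspace(-3, 3, 7)

    # coefficient path: shape (n_penalties, n_features, n_targets)
    coefs = ridge_path(X, Y, alphas, solver="eigen")

    # prediction path: passing X_test skips materializing the coefficients
    X_test = rng.randn(20, 10)
    preds = ridge_path(X, Y, alphas, X_test=X_test, solver="eigen")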

Coverage Status

Coverage decreased (-0.02%) when pulling a60385b on eickenberg:ridge_path into ffaea27 on scikit-learn:master.

sklearn/linear_model/ridge.py
+ return output
+
+
+def precomp_kernel_ridge_path_eigen(gramX_train, Y_train, alphas,
@mblondel

mblondel Jul 15, 2014

Owner

I'd rather avoid using a public function.

@mblondel

mblondel Jul 15, 2014

Owner

I think you should combine this function with the one above.

@eickenberg

eickenberg Jul 15, 2014

Contributor

Changed.

My reasoning for making the function public was that it is fully functional for a precomputed kernel. Olivier reminded me that public functions, if changed in the future, need to go through deprecation etc., and since all this kernel stuff is not fully established yet, it is definitely better to keep this private.

Dividing the function in two was premature optimization: I was hoping to be able to reuse the same decomposition for a potential grid search with refinement. As suggested, I merged them into one function.

sklearn/linear_model/ridge.py
+
+
+def kernel_ridge_path_eigen(X_train, Y_train, alphas, X_test,
+ kernel=_linear_kernel):
@eickenberg

eickenberg Jul 15, 2014

Contributor

Addressed

sklearn/linear_model/ridge.py
+ return X_test.dot(coef).transpose(1, 0, 2)
+
+
+def feature_ridge_path_eigen(X_train, Y_train, alphas, X_test=None):
@eickenberg

eickenberg Jul 15, 2014

Contributor

Addressed

sklearn/linear_model/ridge.py
+ X_test : ndarray, optional
+ shape=(n_test_samples, n_features)
+ Test set. If specified, ridge_path returns predictions on X_test,
+ otherwise it returns coefficients.
@mblondel

mblondel Jul 15, 2014

Owner

Could you add a comment on why this is useful?

@eickenberg

eickenberg Jul 15, 2014

Contributor

Comment added, recommending the use of X_test in ridge_path to bypass materializing coef_ where possible.

+ Returns
+ -------
+ coef or predictions : ndarray
+ shape = (n_penalties, n_features, n_targets) (coef) or
@mblondel

mblondel Jul 15, 2014

Owner

(n_penalties, n_features) if only one target

@eickenberg

eickenberg Jul 15, 2014

Contributor

Addressed

Owner

fabianp commented Jul 15, 2014

I would add ridge_path to linear_model/__init__.py so that it is possible to import as from sklearn.linear_model import ridge_path

sklearn/linear_model/ridge.py
+
+ XTY = X_train.T.dot(Y_train)
+ v, V = np.linalg.eigh(X_train.T.dot(X_train))
+
@fabianp

fabianp Jul 15, 2014

Owner

Better to use scipy.linalg whenever possible instead of np.linalg, since the former is always compiled against an optimized LAPACK.

Also, I find v, V confusing as variable names. What about renaming v to eigvals?

@eickenberg

eickenberg Jul 15, 2014

Contributor

Addressed.

sklearn/linear_model/ridge.py
+ # to do (V.T.dot(X.T)).dot(Y)
+ inv_v_alpha = 1. / (v_train[np.newaxis, :, np.newaxis] +
+ alphas[:, np.newaxis])
+
@fabianp

fabianp Jul 15, 2014

Owner

Border case, but we should consider the possibility of (v_train[np.newaxis, :, np.newaxis] + alphas[:, np.newaxis]) being zero (this can be reproduced e.g. with a zero design matrix and alpha=0). Similarly to what we do in _solve_svd, I think we should replace that line with something like this:

    tmp = (v_train[np.newaxis, :, np.newaxis] + alphas[:, np.newaxis])
    idx = (tmp != 0)
    inv_v_alpha = np.zeros_like(tmp)
    inv_v_alpha[idx] = 1. / tmp[idx]

what do you think?

@eickenberg

eickenberg Jul 15, 2014

Contributor

Addressed

sklearn/linear_model/ridge.py
+ ----------
+
+ X_train : ndarray
+ shape = (n_samples, n_features)
@agramfort

agramfort Jul 15, 2014

Owner

shapes are documented on the first line:

X_train : ndarray, shape (n_samples, n_features)

fix below too

@eickenberg

eickenberg Jul 15, 2014

Contributor

Addressed

Coverage Status

Coverage decreased (-0.02%) when pulling a50ffb9 on eickenberg:ridge_path into f7e9527 on scikit-learn:master.

Coverage Status

Coverage decreased (-0.02%) when pulling e73564d on eickenberg:ridge_path into f7e9527 on scikit-learn:master.

Contributor

eickenberg commented Jul 15, 2014

Thanks for the very helpful input!

I have added a ridge_path eigen-solver in RidgeCV, it gives the same results as the wrapped GridSearchCV as can be seen in the test I added.

Next step: Do the same for SVD and LOO

sklearn/linear_model/ridge.py
+ v_alpha = (eig_val_train[np.newaxis, :, np.newaxis] +
+ alphas[:, np.newaxis])
+ v_alpha_inv = np.zeros_like(v_alpha)
+ passes_threshold = v_alpha > eig_val_thresh
@eickenberg

eickenberg Jul 16, 2014

Contributor

Here I look at what passes the threshold after adding alpha. I think this may be suboptimal. Any comments? Above, in _ridge_path_svd, I select only the singular values that pass the threshold, which also avoids some computation.

@fabianp

fabianp Jul 16, 2014

Owner

I don't think it's worth worrying about the efficiency of that border case. Looks good to me.

@eickenberg

eickenberg Aug 19, 2014

Contributor

In rewriting this in other places, and taking into account discussions with @bthirion on the instability of the method with singular matrices (also described in #1807), I think I will change this to crop the singular values before adding alpha.

@eickenberg

eickenberg Aug 20, 2014

Contributor

It turns out that if I crop the singular values and don't add the penalty to those dimensions, then I have to take those dimensions into account outside this eigenvalue inverse, as in the SVD GCV method. So I think it should stay this way.

Owner

amueller commented Jul 16, 2014

I find it a bit weird to call something "kernel" just because it is solved in the dual. That is what the Ridge code does, right? Anyhow, that doesn't seem to be related to the PR...

Contributor

eickenberg commented Jul 16, 2014

All kernel methods work in the dual, because their feature space is potentially infinite-dimensional. This is definitely kernel ridge with a linear kernel (i.e. you can replace the linear kernel by a different kernel and it will work, up to data centering in feature space etc., which has not been taken care of). However, I agree that we may be stretching the notion a little, since one could equally speak of the matrix inversion lemma or just 'working in the dual', as you say. Honestly, I don't mind what it is called as long as we have the functionality ;) ...
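To make this concrete, here is a minimal sketch (not the PR's code) showing that ridge solved in the dual with the linear Gram matrix X.dot(X.T) recovers the primal solution:

    import numpy as np
    from scipy import linalg

    rng = np.random.RandomState(0)
    X, y = rng.randn(30, 5), rng.randn(30)
    alpha = 1.0

    # primal: coef = (X^T X + alpha I)^-1 X^T y
    coef = linalg.solve(X.T.dot(X) + alpha * np.eye(X.shape[1]), X.T.dot(y))

    # dual with a linear kernel K = X X^T: dual_coef = (K + alpha I)^-1 y
    K = X.dot(X.T)
    dual_coef = linalg.solve(K + alpha * np.eye(X.shape[0]), y)

    # the primal coefficients are recovered as X^T dual_coef
    np.testing.assert_allclose(coef, X.T.dot(dual_coef), rtol=1e-6)

Swapping K for a nonlinear kernel is what would turn this into kernel ridge proper, which is why the naming question comes up.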

Owner

amueller commented Jul 16, 2014

All kernel methods are implemented in the dual, but not all dual optimizers allow for kernels; for example, ours does not ;)

+ assert_array_almost_equal(ridge_cv.coef_, ridge_cv_path.coef_[0])
+
+
+def make_noisy_forward_data(
@amueller

amueller Jul 16, 2014

Owner

Can you document what the intent is?

@eickenberg

eickenberg Jul 17, 2014

Contributor

I added a docstring. I need to generate simple noisy linear models and didn't want to go through the data generators for that. Should I?
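For reference, a minimal sketch of what such a helper could look like (hypothetical; the actual make_noisy_forward_data in the test file may differ in signature and defaults):

    import numpy as np

    def make_noisy_forward_data(n_samples=50, n_features=20, n_targets=3,
                                noise_level=0.1, random_state=0):
        """Generate a simple noisy linear forward model Y = X W + noise."""
        rng = np.random.RandomState(random_state)
        X = rng.randn(n_samples, n_features)
        W = rng.randn(n_features, n_targets)
        Y = X.dot(W) + noise_level * rng.randn(n_samples, n_targets)
        return X, Y, W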

Coverage Status

Coverage increased (+0.04%) when pulling 8619c32 on eickenberg:ridge_path into f7e9527 on scikit-learn:master.

Owner

ogrisel commented Jul 29, 2014

The AppVeyor failure under windows 32 bit is unrelated and reported here: #3370

Coverage Status

Coverage decreased (-0.0%) when pulling 44d97c4 on eickenberg:ridge_path into 8680a6b on scikit-learn:master.

eickenberg added some commits Jul 14, 2014

Added test to show that sample weights are not taken into account correctly in ridge_gcv (they are multiplied into the penalty, which is added to the eigenvalues, thus weighting the eigenvalue penalties and not the samples)
+from ..cross_validation import check_cv
+
+
+def _ridge_path_svd(X_train, Y_train, alphas,
@eickenberg

eickenberg Aug 19, 2014

Contributor

This function is multi-target and multi-penalty. It returns ridge coefficients for all targets and all penalties, i.e. ridge paths for all targets. If X_test is specified, predictions on X_test are returned instead of coefficients, again for all targets and all alphas, i.e. a prediction path.
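A minimal sketch of that idea (simplified; the PR's _ridge_path_svd additionally thresholds small singular values and has an X_test branch):

    import numpy as np
    from scipy import linalg

    def svd_ridge_path(X, Y, alphas):
        """Ridge coefficients for all penalties and all targets from one SVD."""
        U, S, VT = linalg.svd(X, full_matrices=False)
        UTY = U.T.dot(Y)                                  # (n_components, n_targets)
        # shrinkage factors S / (S^2 + alpha), broadcast over penalties
        s_alpha = S[np.newaxis, :, np.newaxis] / (
            S[np.newaxis, :, np.newaxis] ** 2 + alphas[:, np.newaxis, np.newaxis])
        # coef[k] = V diag(s_alpha[k]) U^T Y -> (n_penalties, n_features, n_targets)
        return np.einsum('ji,kjl->kil', VT, s_alpha * UTY[np.newaxis])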

+ return output
+
+
+def _ridge_gcv_path_svd(X_train, Y_train, alphas,
@eickenberg

eickenberg Aug 19, 2014

Contributor

This function employs the SVD method to do generalized cross-validation. It is multi-target and multi-penalty and can return either a LOO error path or a LOO value path.
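For readers unfamiliar with the trick: the analytic LOO residual of a linear smoother is (y_i - yhat_i) / (1 - H_ii), with H = X (X^T X + alpha I)^-1 X^T for ridge. A minimal single-penalty sketch (the standard formula, not the PR's vectorized implementation):

    import numpy as np
    from scipy import linalg

    def loo_errors_ridge(X, y, alpha):
        """Leave-one-out residuals for ridge without refitting n times."""
        n_features = X.shape[1]
        G = linalg.solve(X.T.dot(X) + alpha * np.eye(n_features), X.T)
        H = X.dot(G)                          # hat matrix
        residuals = y - H.dot(y)
        return residuals / (1. - np.diag(H))  # e_i / (1 - H_ii)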

+ raise ValueError("mode must be in ['looe', 'loov']. Got %s" % mode)
+
+
+def _precomp_kernel_ridge_path_eigen(gramX_train, Y_train, alphas,
@eickenberg

eickenberg Aug 19, 2014

Contributor

This function uses the eigendecomposition of the Gram matrix (X.dot(X.T) in the linear-kernel case; here we treat only the linear case) to calculate a ridge path. If mode is set to 'normal', it returns dual coefficients or, if gramX_test is specified, predictions on the corresponding test samples. If mode is set to 'looe' or 'loov', analytical leave-one-out errors or prediction values are returned, respectively. This function is multi-target and multi-penalty.
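A minimal sketch of the eigendecomposition idea for a precomputed (linear) Gram matrix, ignoring the loov/looe modes and the test-set branch:

    import numpy as np
    from scipy import linalg

    def dual_ridge_path_eigen(K, Y, alphas):
        """Dual coefficients for all penalties and targets from one eigh(K)."""
        eigvals, Q = linalg.eigh(K)           # K = Q diag(eigvals) Q^T
        QTY = Q.T.dot(Y)                      # (n_samples, n_targets)
        inv = 1. / (eigvals[np.newaxis, :, np.newaxis] +
                    alphas[:, np.newaxis, np.newaxis])
        # dual_coef[k] = Q diag(1 / (eigvals + alpha_k)) Q^T Y
        return np.einsum('ij,kjl->kil', Q, inv * QTY[np.newaxis])

The feature-space variant below does the analogous thing with eigh(X.T.dot(X)) and returns primal coefficients instead.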

+ return safe_sparse_dot(X, Y.T, dense_output=True)
+
+
+def _kernel_ridge_path_eigen(X_train, Y_train, alphas, X_test=None,
@eickenberg

eickenberg Aug 19, 2014

Contributor

Private function called from ridge_path and ridge_regression, which calculates gram matrices and uses _precomp_kernel_ridge_path_eigen on the results.

+ mode=mode, copy=True)
+
+
+def _feature_ridge_path_eigen(X_train, Y_train, alphas, X_test=None,
@eickenberg

eickenberg Aug 19, 2014

Contributor

Implements a multi-target, multi-penalty ridge path in feature space, generally used when n_samples > n_features. Uses an eigendecomposition to solve the linear systems.

+ return 1 - err / (sq_norm_y_true + 1e-18)
+
+
+def ridge_path(X_train, Y_train, alphas, X_test=None, solver="eigen"):
@eickenberg

eickenberg Aug 19, 2014

Contributor

This function contains the logic for calling the appropriate private *_path* functions and returning a ridge path.

@@ -165,17 +473,6 @@ def _solve_cholesky_kernel(K, y, alpha, sample_weight=None):
return dual_coefs.T
-def _solve_svd(X, y, alpha):
@eickenberg

eickenberg Aug 19, 2014

Contributor

made obsolete by _ridge_path_svd

@@ -639,70 +944,6 @@ def __init__(self, alphas=[0.1, 1.0, 10.0],
self.gcv_mode = gcv_mode
self.store_cv_values = store_cv_values
- def _pre_compute(self, X, y):
@eickenberg

eickenberg Aug 19, 2014

Contributor

All underscore methods pertaining to GCV calculation removed and replaced by the corresponding GCV paths.

@@ -737,25 +978,10 @@ def fit(self, X, y, sample_weight=None):
gcv_mode = 'eigen'
else:
gcv_mode = 'svd'
- elif gcv_mode == "svd" and with_sw:
- # FIXME non-uniform sample weights not yet supported
@eickenberg

eickenberg Aug 19, 2014

Contributor

sample weights are now supported

@@ -764,19 +990,24 @@ def fit(self, X, y, sample_weight=None):
score_overrides_loss=True)
error = scorer is None
- for i, alpha in enumerate(self.alphas):
- weighted_alpha = (sample_weight * alpha
@eickenberg

eickenberg Aug 19, 2014

Contributor

This is a bug and does not actually add sample weights to the samples, but to the eigenvalues. Bug is fixed and non-regression test provided.
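Conceptually, the fix boils down to the standard reweighting trick: scale the rows of X and y by sqrt(w) so that an unweighted solve minimizes the weighted loss. A minimal sketch (the exact change in the PR may differ):

    import numpy as np

    def apply_sample_weights(X, y, sample_weight):
        """Rescale so that an unweighted ridge solve minimizes
        sum_i w_i (y_i - x_i . coef)^2 + alpha * ||coef||^2."""
        sw = np.sqrt(sample_weight)
        Xw = X * sw[:, np.newaxis]
        yw = y * sw if y.ndim == 1 else y * sw[:, np.newaxis]
        return Xw, yw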

+ mode = 'looe' if error else 'loov'
+ y_is_raveled = y.ndim == 1
+ y = np.atleast_2d(y.T).T
+ if gcv_mode == 'eigen':
@eickenberg

eickenberg Aug 19, 2014

Contributor

What follows are the calls to the ridge GCV path functions.

+ # XXX
+ # This is the wrong application of the scorer: It is given all
+ # of y and all cv values, although it should only be given one
+ # value at a time, because the CV we are using here is LOO!
@eickenberg

eickenberg Aug 19, 2014

Contributor

Please comment on the above comment!

if error:
- best = cv_values.mean(axis=0).argmin()
+ best = cv_values.mean(axis=0).argmin(axis=1)
@eickenberg

eickenberg Aug 19, 2014

Contributor

The best alpha is now chosen per target; a few lines down, the dual coefs are selected individually according to each target's best penalty.
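A small numpy sketch of the per-target selection, with hypothetical shapes consistent with the argmin(axis=1) above (cv_values as (n_samples, n_targets, n_penalties), dual coefficients as (n_penalties, n_samples, n_targets)):

    import numpy as np

    rng = np.random.RandomState(0)
    n_samples, n_targets, n_penalties = 40, 3, 5
    cv_values = rng.rand(n_samples, n_targets, n_penalties)
    dual_coefs = rng.randn(n_penalties, n_samples, n_targets)

    best = cv_values.mean(axis=0).argmin(axis=1)     # one best penalty per target
    # pick, for each target, the dual coefficients of its best penalty
    best_dual_coef = dual_coefs[best, :, np.arange(n_targets)].T   # (n_samples, n_targets)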

else:
- if self.store_cv_values:
- raise ValueError("cv!=None and store_cv_values=True "
+ if self.solver in ['eigen', 'svd']:
@eickenberg

eickenberg Aug 19, 2014

Contributor

Added new logic for RidgeCV for the eigen and svd solvers, still falling back to GridSearchCV for the others. Probably the best idea is, as discussed, to only allow solvers that actually provide a speedup over simply wrapping GridSearchCV.

Contributor

eickenberg commented Aug 19, 2014

This is a big PR. I have addressed all the issues I had mentioned, come across many more, and fixed them along the way, but I am now a little apprehensive about the size of this contribution. I don't see how to easily break it into parts, so I am hoping for some kind souls :) to take a look at the code and see whether this is going in the right direction. I have provided comments in the diff explaining the basic ideas of what I changed/added.

sklearn/linear_model/ridge.py
+ X_test=None, sg_val_thresh=1e-15):
+
+ U, S, VT = linalg.svd(X_train, full_matrices=False)
+ nnzmax = np.where(S > sg_val_thresh)[0].max() + 1
sklearn/linear_model/ridge.py
+ UTY = U.T[:nnzmax].dot(Y_train)
+ s_alpha = S[np.newaxis, :nnzmax, np.newaxis] / (
+ S[np.newaxis, :nnzmax, np.newaxis] ** 2 + alphas[:, np.newaxis])
+ s_alpha_UTY = s_alpha * UTY[np.newaxis] # possibly inplace
@bthirion

bthirion Aug 19, 2014

Owner

possibly inplace ?

@eickenberg

eickenberg Aug 20, 2014

Contributor

My reasoning was that s_alpha should be dense with all axes already broadcast, so I could just multiply UTY[np.newaxis] into it in place, since these are all intermediate variables anyway. I will check whether this actually works.

@eickenberg

eickenberg Aug 20, 2014

Contributor

Actually, they are not of the same shape, since in the multi-target case an axis is broadcast. So I wouldn't want to add decision logic to do in-place multiplication iff n_targets == 1.
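A tiny illustration of why in-place fails here: numpy cannot write a broadcast (larger) result back into the smaller operand (variable names are illustrative only):

    import numpy as np

    s_alpha = np.ones((4, 6, 1))        # (n_penalties, n_components, 1)
    UTY = np.ones((6, 3))               # (n_components, n_targets)

    out = s_alpha * UTY[np.newaxis]     # fine: broadcasts to (4, 6, 3)

    try:
        s_alpha *= UTY[np.newaxis]      # would require s_alpha to grow to (4, 6, 3)
    except ValueError as e:
        print(e)                        # non-broadcastable output operand ...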

sklearn/linear_model/ridge.py
+
+ U, S, VT = linalg.svd(X_train, full_matrices=False)
+ del VT
+ nnzmax = np.where(S > sg_val_thresh)[0].max() + 1
sklearn/linear_model/ridge.py
+ penalized_inv_S = (1. / (S[np.newaxis, :, np.newaxis] ** 2 +
+ alphas[:, np.newaxis, :]) -
+ 1. / alphas[:, np.newaxis, :])
+
@bthirion

bthirion Aug 19, 2014

Owner

Not sure why you call it penalized_inv_S: since this is always negative, it looks like something different from the "penalized inverse of S".

@eickenberg

eickenberg Aug 20, 2014

Contributor

Indeed, this is always negative. The 1. / alphas term outside the singular values comes from the fact that VT was created with full_matrices=False, so we need to take the null space of the design into account. This is shown in the chapter on "The SVD method" in the "Notes on regularized least squares".

@eickenberg

eickenberg Aug 20, 2014

Contributor

I will try to find a better name.

@eickenberg

eickenberg Aug 20, 2014

Contributor

I called it inv_S_and_alpha ...

sklearn/linear_model/ridge.py
+ v_alpha_inv = np.zeros_like(v_alpha)
+ passes_threshold = v_alpha > eig_val_thresh
+ v_alpha_inv[passes_threshold] = 1. / v_alpha[passes_threshold]
+ # should probably be done inplace
@eickenberg

eickenberg Aug 20, 2014

Contributor

I'll check if it works

@eickenberg

eickenberg Aug 20, 2014

Contributor

Not possible due to unbroadcasted axes.

+
+ Notes
+ -----
+ This solver does not fit the intercept.
@bthirion

bthirion Aug 19, 2014

Owner

Sorry if this is a silly question, but why?

@eickenberg

eickenberg Aug 20, 2014

Contributor

I basically copied the logic of the ridge_regression function, which doesn't fit the intercept either. I think the reason there may have been to avoid inconsistent behaviour with sparse matrices, because centering them is impossible without densifying them. However, if that is the case, then since ridge_path only supports sparse matrices in the eigen kernel ridge, and actually densifies in that case, I would be all in favour of adding fit_intercept=True to this function. The main reason is that it would make the code in RidgeCV much lighter (currently I have to center each fold separately and undo all of that for prediction). In that case, the function would also have to return the intercept. Could you comment on this, @mblondel?

@mblondel

mblondel Aug 21, 2014

Owner

@agramfort or @fabianp wrote the original Ridge code, so they know better, but I think it was a separation of concerns: it is easier to handle the intercept as a pre-/post-processing step than to implement it directly in each and every solver. Of course, the problem with sparse data was not anticipated.

@agramfort

agramfort Aug 21, 2014

Owner

Of course, the problem with sparse data was not anticipated.

indeed...
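For reference, a minimal sketch of the pre-/post-processing approach described above (center X and y, solve without an intercept, then recover it); the real code additionally has to handle sample weights and sparse input:

    import numpy as np
    from scipy import linalg

    def ridge_with_intercept(X, y, alpha):
        """Fit ridge on centered data, then recover the intercept."""
        X_mean, y_mean = X.mean(axis=0), y.mean(axis=0)
        Xc, yc = X - X_mean, y - y_mean
        coef = linalg.solve(Xc.T.dot(Xc) + alpha * np.eye(X.shape[1]),
                            Xc.T.dot(yc))
        intercept = y_mean - X_mean.dot(coef)
        return coef, intercept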

Contributor

eickenberg commented Aug 20, 2014

For this PR one main question remains: Should I remove _RidgeGCV and incorporate all CV logic into BaseRidgeCV?

- fit_params = {}
- gs = GridSearchCV(Ridge(fit_intercept=self.fit_intercept),
+ parameters = {'alpha': self.alphas}
+ # FIXME: sample_weight must be split into training/validation data
@bthirion

bthirion Aug 20, 2014

Owner

You should at least raise a warning, shouldn't you?

Contributor

dohmatob commented Oct 21, 2015

This one needs rebasing ;)

Contributor

eickenberg commented Oct 22, 2015

or redoing ;)

I cherry-picked the RidgeGCV sample weight fix into a new PR which is more or less up to date.

The rest also has to be cut into pieces, because nobody is going to review this monolith.

I'll see if I get round to it.

