Implemented shrinkage LDA classifier. #3105

Closed

@cbrnr
Contributor
cbrnr commented Apr 24, 2014

@kazemakase and I implemented shrinkage LDA (see also my comment in #1649). Note that unlike lda.LDA, our implementation only does classification, not transformation (dimensionality reduction). We are not using the existing implementation because shrinkage is not possible with SVD.

We would be happy if someone could review the code.

@mbillingr
Contributor

Here is an example of the shrinkage LDA in action:
[figure: slda]
(20 training samples, 100 test samples, 50 repetitions, normally distributed features)

@cle1109 the lambda function makes the build fail on Python 2.
Any idea what's causing the TypeError: unorderable types: type() < type() on Python 3?

@coveralls


Coverage remained the same when pulling 2edb383 on cle1109:sldac into 463cea5 on scikit-learn:master.

@larsmans
Member

Does shrinkage mean regularization? (Statistics terminology is driving me crazy...)

@larsmans larsmans commented on an outdated diff Apr 28, 2014
sklearn/slda.py
@@ -0,0 +1,235 @@
+"""
+The :mod:`sklearn.slda` module implements Shrinkage LDA.
@larsmans
larsmans Apr 28, 2014 Member

If it's an LDA variant, it should go in sklearn.lda.

@larsmans larsmans commented on an outdated diff Apr 28, 2014
sklearn/tests/test_slda.py
@@ -0,0 +1,72 @@
+import numpy as np
@larsmans
larsmans Apr 28, 2014 Member

... and the tests can be merged with the LDA tests. There's probably some overlap that can be exploited.

@larsmans larsmans commented on an outdated diff Apr 28, 2014
sklearn/slda.py
+ # TODO: support equal priors (with priors=='equal')
+ if self.priors is not None:
+ if (self.priors < 0).any():
+ raise ValueError('priors must be non-negative')
+ if self.priors.sum() != 1:
+ print('warning: the priors do not sum to 1. Renormalizing')
+ self.priors = self.priors / self.priors.sum()
+
+ if shrinkage is not None:
+ if shrinkage == 'auto':
+ self._cov_estimator = _ledoit_wolf
+ else:
+ print('warning: unknown shrinkage method, using no shrinkage')
+ self._cov_estimator = empirical_covariance
+ else:
+ self._cov_estimator = empirical_covariance
@larsmans
larsmans Apr 28, 2014 Member

We do input validation in fit, not __init__; otherwise set_params can still be used to set inconsistent parameters. Also, please don't print warnings to stdout; either use the warnings module or raise an exception. I think the latter is more appropriate.
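For illustration, here is a minimal sketch of what moving the validation into fit could look like (the class stub and its parameters follow the diff above but are otherwise hypothetical, not the PR's actual code):

```python
import warnings

import numpy as np
from sklearn.covariance import empirical_covariance, ledoit_wolf


class SLDA(object):  # hypothetical stub; only the validation logic is sketched
    def __init__(self, priors=None, shrinkage=None):
        # __init__ only stores the parameters, so set_params stays consistent
        self.priors = priors
        self.shrinkage = shrinkage

    def fit(self, X, y):
        # validate priors here rather than in __init__
        if self.priors is not None:
            priors = np.asarray(self.priors)
            if (priors < 0).any():
                raise ValueError('priors must be non-negative')
            if priors.sum() != 1:
                warnings.warn('priors do not sum to 1; renormalizing')
                priors = priors / priors.sum()
            self.priors_ = priors

        # an unknown shrinkage value raises instead of silently falling back
        if self.shrinkage is None:
            self._cov_estimator = empirical_covariance
        elif self.shrinkage == 'auto':
            self._cov_estimator = lambda X: ledoit_wolf(X)[0]
        else:
            raise ValueError('unknown shrinkage method: %r' % self.shrinkage)

        # ... the actual fitting would follow here
        return self
```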

@larsmans
Member

Also, why is this a separate estimator? Can't this feature be added to the existing LDA class?

@cbrnr
Contributor
cbrnr commented Apr 28, 2014

Yes, you are correct, shrinkage means regularization (we apply it to the covariance estimate). I can put it in sklearn.lda. I will fix the init function, and this should also be done for the original class in sklearn.lda (I took the code from there).

Concerning creating a separate estimator, I did not want to change the original class to avoid breaking anything. There are some problems when we also want to have LDA transformation with our implementation. To use shrinkage, we have to work with covariance estimates. The original LDA class uses SVD instead, which does have some advantages. For example, to compute the transformation, we would have to solve a generalized eigenvalue problem. In an underdetermined case, this works well (that's what shrinkage is used for), but it fails when using no shrinkage (eigh raises an exception I believe). In contrast, this does work with the current SVD implementation.

In addition, I'm not sure how to handle issue #1649 with our code. It looks like our implementation isn't robust to scaling either (same as the SVD implementation). However, with our implementation this problem is not limited to the underdetermined case, but also seems to occur in the overdetermined case. This is not to say that the algorithm does not work; after all, it's the standard LDA algorithm described in Duda & Hart, which everyone seems to be using.
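For context, a minimal sketch of the covariance-based route for the two-class case (illustrative only, not the PR's code; shrinkage is shown here as a simple blend with a scaled identity rather than Ledoit-Wolf):

```python
import numpy as np


def lda_weights(X, y, shrinkage=0.0):
    """Two-class LDA weights from an (optionally shrunk) pooled covariance."""
    X0, X1 = X[y == 0], X[y == 1]
    mean0, mean1 = X0.mean(axis=0), X1.mean(axis=0)

    # pooled within-class covariance estimate
    cov = 0.5 * (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False))

    # shrink towards a scaled identity matrix (convex combination)
    target = np.trace(cov) / cov.shape[0] * np.eye(cov.shape[0])
    cov = (1.0 - shrinkage) * cov + shrinkage * target

    # w = cov^{-1} (mean1 - mean0); lstsq also copes with a singular covariance
    w = np.linalg.lstsq(cov, mean1 - mean0, rcond=None)[0]
    b = -0.5 * w.dot(mean0 + mean1)
    return w, b
```

A sample would then be assigned to class 1 when X.dot(w) + b > 0 (assuming equal priors).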

@larsmans
Member

You can avoid breaking working code by hardening the tests. I prefer new features to be added to existing estimators, as long as the classes don't become too heavy. Could you try to add the functionality to the existing class to see if it works?

@amueller
Member

Maybe a stupid question, but could you explain why it is not possible to use shrinkage in the SVD implementation by @ksemb?

@cbrnr
Contributor
cbrnr commented Apr 29, 2014

Because we shrink the covariance matrices used in the generalized eigenvalue problem. SVD does not work on covariance matrices; instead, it decomposes the data matrix directly into singular values.

It would be great if we could use shrinkage with SVD, but I believe this is not possible (correct me if I'm wrong).

@mbillingr
Contributor

The reason we put the implementation in its own class and module (for now) is that we made two fundamental changes to the original LDA.

  1. Different API: lda.LDA is both a classifier and a transformer, while slda.SLDA is a classifier only.
  2. Different implementation: SVD vs. covariance estimation

Once we know how to tackle these differences we can merge the different LDAs according to your guidelines.

Would be great to know if SVD can be shrunk somehow, but I don't believe that's possible either.

@cbrnr
Contributor
cbrnr commented Apr 29, 2014

Still, we could put the SLDA class in the lda module. However, because of these issues, it would be very cumbersome to merge everything into one class. Even creating a base LDA class would not be feasible, because we couldn't reuse anything.

How would you like us to proceed?

@cbrnr
Contributor
cbrnr commented Apr 29, 2014

I accidentally rebased my sldac branch with upstream/master. Is it possible to remove the unrelated commits from this PR somehow?

Clemens Brunner: Replaced np.linalg.solve() with np.linalg.lstsq(); fixed Python 2 compatibility for the plot_slda example. (commit 5e67d52)
@GaelVaroquaux
Member

How would you like us to proceed?

It seems to me that a different class, named 'ShrinkageLDA' or
'ShrunkLDA' (to be consistent with the covariance estimator), but indeed
in the same file, would be a good approach.

@GaelVaroquaux
Member

I accidentally rebased my sldac branch with upstream/master. Is it possible to
remove the unrelated commits from this PR somehow?

No, it's a limitation of GitHub.

@cbrnr
Contributor
cbrnr commented Apr 30, 2014

By replacing np.linalg.solve() with its least squares counterpart np.linalg.lstsq(), our algorithm without shrinkage now behaves like the original implementation:
[figure: figure_1]
Therefore, I would like to keep the option to use our new implementation without shrinkage. In the long run, it would be nice to have only one class for LDA, but for now I agree that putting our class (let's rename it to ShrinkageLDA) into the lda module is fine. As a next step, we might add transform functionality, but this should probably be discussed in a separate PR once this one has been merged.
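To illustrate why this change matters, a tiny example (illustrative only): solve requires a non-singular matrix, whereas lstsq returns the minimum-norm least-squares solution even when the pooled covariance is singular (as in the underdetermined case without shrinkage).

```python
import numpy as np

cov = np.array([[1.0, 1.0],
                [1.0, 1.0]])   # singular pooled covariance (rank 1)
diff = np.array([2.0, 2.0])    # difference of the class means

# np.linalg.solve(cov, diff) would raise LinAlgError here;
# lstsq returns the minimum-norm least-squares solution instead
w = np.linalg.lstsq(cov, diff, rcond=None)[0]
print(w)  # [1. 1.]
```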

@cbrnr
Contributor
cbrnr commented May 28, 2014

Is there anything else I should do, or do you think this PR could be merged now?

@ksemb
ksemb commented May 28, 2014

If by shrinkage you mean adding a constant to the diagonal of the covariance matrix, then I would like to draw attention to the fact that shrinkage IS possible with an SVD implementation.

Let M be a symmetric positive definite matrix, let v be an eigenvector of that matrix, let a be the eigenvalue belonging to v, and let c be the shrinkage constant.

Then v is also an eigenvector of the regularized matrix, and the corresponding eigenvalue is a + c, because v' * (M + c*I) * v = v' * M * v + c * v' * v = a + c (with v normalized).

However, one faces the problem that the SVD of M misses those eigenvectors of the regularized covariance matrix that have an eigenvalue of c, because the corresponding eigenvalues are 0 in M.

In this comment #1649 (comment) I wanted to say that I have implemented an SVD that takes this into account. That implementation, however, still contains the bug that I commented on in #1967 (comment).
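For what it's worth, a quick numerical check of the eigenvalue-shift argument (illustrative only, using a full-rank matrix so the missing-eigenvector caveat does not come into play):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(5, 5)
M = A @ A.T                     # symmetric positive definite matrix
c = 0.5                         # shrinkage constant added to the diagonal

vals, vecs = np.linalg.eigh(M)
vals_reg, vecs_reg = np.linalg.eigh(M + c * np.eye(5))

# every eigenvalue shifts by c, and the eigenvectors are unchanged (up to sign)
print(np.allclose(vals + c, vals_reg))                     # True
print(np.allclose(np.abs(vecs.T @ vecs_reg), np.eye(5)))   # True
```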

@mbillingr
Contributor

By shrinkage we mean adding a constant to the diagonal of the covariance matrix and rescaling it (see [1]). More specifically, we refer to Ledoit-Wolf shrinkage, which determines the shrinkage coefficient from the data.

@ksemb Your implementation [2] seems to support "manual" regularization. The advantage of Ledoit-Wolf shrinkage is that the parameter is determined automatically. It may be worth joining forces here to get the best of both our approaches.

By the way, what exactly is the advantage of an SVD implementation? It seems to me that our approach of (1) computing the covariance and (2) solving for the weights is rather straightforward, and thus easily maintainable, whereas the SVD implementations I have seen so far have more complicated code. But maybe I'm thinking too simplistically here.

[1] http://scikit-learn.org/stable/modules/covariance.html#shrunk-covariance
[2] https://gist.github.com/ksemb/4c3c86c6b62e7a99989b
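For reference, the Ledoit-Wolf estimator from [1] is already available in scikit-learn and returns the shrunk covariance together with the automatically determined coefficient; a small usage example:

```python
import numpy as np
from sklearn.covariance import ledoit_wolf

rng = np.random.RandomState(0)
X = rng.randn(20, 50)  # underdetermined: fewer samples than features

# shrunk covariance estimate and the data-driven shrinkage coefficient
shrunk_cov, shrinkage = ledoit_wolf(X)
print(shrinkage)                          # somewhere in (0, 1]
print(np.linalg.matrix_rank(shrunk_cov))  # 50, i.e. full rank despite n < p
```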

@larsmans
Member

There are some stray edits to the tree code in this PR. Also, please edit doc/modules/lda_qda.rst to briefly explain the SLDA class, what it does, and when to use it. Don't forget to run make html in the docs dir to see if the example runs properly.

@ksemb
ksemb commented Jun 2, 2014

@kazemakase: I agree that the covariance version is more straightforward to implement. The SVD implementation has the following benefits:

  1. The SVD implementation requires less RAM than the covariance implementation if the number of observations is smaller than the number of features. This is the typical scenario where you actually want regularization.
  2. It is possible to implement regularization strategies other than adding a constant to the diagonal. In the SVD implementation, you can modify the singular values in different ways (which I did in my implementation).
@cbrnr
Contributor
cbrnr commented Jun 3, 2014

@ksemb, memory consumption could be an issue. What is a typical ballpark number of features in your field?

Concerning your second argument, I don't think that non-diagonal regularization is limited to SVD. In the case of Ledoit-Wolf shrinkage, we only modify the diagonal, but we could use other regularization methods (such as Tikhonov regularization).

Do you have a suggestion on how we could move forward with the two LDA classifier implementations? I am kind of confused, because there seem to be two issues at the moment:

  1. SVD vs. covariance-based algorithms
  2. Issue #1649

I think we should focus on the first point here, and discuss the second issue separately.

@SmashTV
SmashTV commented Jun 27, 2014

I just wanted to briefly point out that I'm very glad that someone wrote code for carrying out shrinkage LDA in scikit-learn. Before I found this thread, I tried implementing it myself, but I was not familiar with the implementation of LDA via SVD instead of explicit covariance matrices. If it isn't possible to perform shrinkage in the SVD scenario, I'd strongly favor the algorithm that calculates covariance matrices (because, from a pragmatic point of view, if the data are big, there is no benefit in computational efficiency if the estimators are unstable).

@cbrnr cbrnr closed this Aug 1, 2014
@mblondel
Member

I think automatic selection of the regularization constant by cross-validation is also useful. What you really want is to maximize classification accuracy on unseen data (generalization performance). I would thus add an alpha parameter (like other regularized classifiers) which can take a numerical value (e.g. alpha=1e-3) or a string (e.g. alpha="ledoit-wolf").
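A rough sketch of how that could look (the ShrinkageLDA class and its alpha parameter are hypothetical here, following the proposal above; they are not an existing scikit-learn API):

```python
import numpy as np
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in later releases

# hypothetical estimator and parameter name -- not part of scikit-learn
from sklearn.lda import ShrinkageLDA

rng = np.random.RandomState(0)
X = rng.randn(40, 100)            # underdetermined toy data
y = rng.randint(0, 2, 40)

# pick the regularization constant by cross-validated accuracy
alphas = np.logspace(-4, 0, 9)
scores = [cross_val_score(ShrinkageLDA(alpha=a), X, y, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
```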

@mbillingr
Contributor

This would certainly make sense. Note that we no longer work on this PR, but continue in #3523. Could you move your suggestion over there, please?

Btw, I vaguely remember that Blankertz proved in his shrinkage-LDA paper that the Ledoit-Wolf solution is optimal. If that is the case, cross-validation would probably not be required.

@mblondel
Member

I'd guess it is optimal on training but not on test data, isn't it? CV uses validation data so the result would be different.

@mbillingr
Contributor

Well, in that case 100% overfitting would be optimal :)

Sorry, I don't remember the details, but I'm sure that guy knew what he did. Anyway, I might be wrong. In that case a numerical parameter would certainly be useful.

@mblondel
Member

What is the title of the paper?

@mbillingr
Contributor

Blankertz et al. "Single-Trial Analysis and Classification of ERP Components - a Tutorial", NeuroImage, 2010.

I looked at the paper again and have to admit I was mistaken. There is no proof, and they even state that cross-validation could give better results :) Sorry for the fuss.

@cbrnr
Contributor
cbrnr commented Aug 12, 2014

I made a new comment in #3523, please continue the discussion over there.

@cbrnr cbrnr deleted the unknown repository branch Sep 22, 2014