Univariate scaling in LDA #1649

Closed
ksemb opened this Issue Feb 2, 2013 · 27 comments

ksemb commented Feb 2, 2013

This bug report concerns the univariate normalization done in the scikit-learn LDA implementation. The following source code lines from /usr/lib/pymodules/python2.7/sklearn/lda.py of scikit-learn version 0.13-1 are the ones relevant to this report.

148: std = Xc.std(axis=0)
150: std[std == 0] = 1.
154: X = np.sqrt(fac) * (Xc / std)
156: U, S, V = linalg.svd(X, full_matrices=0)
162: scalings = (V[:rank] / std).T / S[:rank]

If I understand the code correctly, it does the following: normalize X per feature before the SVD (line 154), perform the SVD (line 156), then correct the SVD results for the previous normalization (line 162).
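
For reference, here is the same sequence as a self-contained sketch. Xc, fac, tol and rank are filled in with what I assume they are in lda.py (Xc the per-class-centered data, fac the 1/(n_samples - n_classes) covariance scaling, rank the number of singular values above tol); this is only an illustration, not the actual file.

import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
Xc = rng.randn(20, 5)              # stand-in for the per-class-centered data
fac = 1.0 / (Xc.shape[0] - 2)      # assumed covariance scaling, here for 2 classes
tol = 1e-4

std = Xc.std(axis=0)               # per-feature standard deviation
std[std == 0] = 1.                 # avoid division by zero for constant features
X = np.sqrt(fac) * (Xc / std)      # normalize each feature before the SVD
U, S, V = linalg.svd(X, full_matrices=0)
rank = np.sum(S > tol)             # assumed: number of singular values above tol
scalings = (V[:rank] / std).T / S[:rank]   # undo the per-feature normalization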

The code seems to work when more training samples than features are present, i.e. when the linear system is overdetermined. When the linear system is underdetermined, i.e. when there are fewer samples than features, the code fails. I have, however, not yet been able to find a mathematical formulation of the algorithm that shows what is going on in the code and why it works at all.

I have not yet seen this normalization step in any other LDA code (e.g. Matlab's "classify" method does not do this normalization), nor have I ever heard of this normalization before.

Below is the test case that I ran with normalization enabled (the original code) and with normalization disabled (I replaced line 150 by "std[:] = 1."). If normalization is disabled, everything works fine. If normalization is enabled, the code works in the overdetermined case (i.e. more training samples than feature dimensions) and fails in the underdetermined case (fewer training samples than data dimensions).

############# test case ############
import sklearn.lda
import numpy.random

classif = sklearn.lda.LDA()

def test(N):
    numpy.random.seed(0)

    print('N=%d' % N)
    p = 80

    Xf = numpy.random.rand(N, p)
    y = numpy.random.random_integers(0, 1, (N, 1))
    Xp = numpy.random.rand(11, p)

    # Generate a random unitary matrix Q
    Q, R = numpy.linalg.qr(numpy.random.rand(p, p))

    # Apply different scalings to the features.
    # The classification results should NOT change if the LDA implementation is correct.
    for scale in numpy.logspace(1, 10, 5):
        # generate a scaling matrix
        D = numpy.eye(p)
        D[0, 0] = scale
        S = Q.dot(D).dot(Q.T)
        classif.fit(Xf.dot(S), y, tol=1e-11)
        print(classif.predict(Xp.dot(S)))

print('overdetermined case (many more samples than dimensions)')
test(100)
print('underdetermined case (less samples than dimensions)')
test(10)

############# results with normalization enabled ##############
./bugreport.py 
overdetermined case (many more samples than dimensions)
N=100
[0 0 0 0 0 1 0 0 1 1 1]
[0 0 0 0 0 1 0 0 1 1 1]
[0 0 0 0 0 1 0 0 1 1 1]
[0 0 0 0 0 1 0 0 1 1 1]
[0 0 0 0 0 1 0 0 1 1 1]
underdetermined case (less samples than dimensions)
N=10
[1 0 0 0 1 1 1 0 1 1 1]
[1 0 0 1 0 1 0 0 1 1 0]
[1 0 0 1 0 1 0 0 1 1 0]
[1 0 0 1 1 1 0 0 1 1 1]
[1 0 0 1 1 1 0 0 1 1 1]

############# results with normalization disabled ##############
./bugreport.py 
overdetermined case (many more samples than dimensions)
N=100
[0 0 0 0 0 1 0 0 1 1 1]
[0 0 0 0 0 1 0 0 1 1 1]
[0 0 0 0 0 1 0 0 1 1 1]
[0 0 0 0 0 1 0 0 1 1 1]
[0 0 0 0 0 1 0 0 1 1 1]
underdetermined case (less samples than dimensions)
N=10
/usr/lib/pymodules/python2.7/sklearn/lda.py:161: UserWarning: Variables are collinear
  warnings.warn("Variables are collinear")
[1 0 0 1 1 1 0 0 1 1 1]
[1 0 0 1 1 1 0 0 1 1 1]
[1 0 0 1 1 1 0 0 1 1 1]
[1 0 0 1 1 1 0 0 1 1 1]
[1 0 0 1 1 1 0 0 1 1 1]

ksemb closed this Feb 2, 2013

ksemb reopened this Feb 2, 2013

Owner

amueller commented Feb 2, 2013

Hi @ksemb.
Thanks for the report.
I think no one has really touched the LDA and QDA code for quite a while.
I went through it recently to document the attributes, and it did seem a bit odd to me.
We'll have to investigate.

Cheers,
Andy

ksemb commented Feb 4, 2013

I did some more experiments and updated the bug report accordingly. In particular I added the "tol=1e-11".

Owner

agramfort commented Feb 5, 2013

@ksemb what do you recommend? Get rid of the normalization? Have an option to turn it on or off? Do the examples still look reasonable if you get rid of it?

Owner

amueller commented Feb 5, 2013

It seems like this is done for numerical stability (?). @agramfort, did you have a look at the math? I didn't have time yet.
Not sure who wrote this.

Owner

agramfort commented Feb 5, 2013

> It seems like this is done for numerical stability (?). @agramfort, did you have a look at the math? I didn't have time yet.

no, I didn't

> Not sure who wrote this.

I think it's @MatthieuPerrot

MatthieuPerrot commented Feb 5, 2013

Hi @amueller,

I am not sure I am the author of every line of the LDA, since this version was based on older pieces of code from our lab. If I remember well, numerical stability is indeed the reason for this trick, but it may only be valid for overdetermined cases. We will need to go back to the math if we want to use it only where it is essential...

For the exact underlying math and tricks, I think @duchesnay could remember where they come from.

Owner

amueller commented Feb 5, 2013

Thanks for the comments @MatthieuPerrot. I'll try to figure it out in the next couple of days.

Owner

amueller commented Feb 11, 2013

(yeah right). If anyone has time, feel free ;)

Owner

amueller commented Feb 13, 2013

What I find a bit irritating about the code is that scalings is not merged into coef_. Why do two matrix multiplications instead of just one in predict and transform? To me it looks as if the intermediate result has dimensionality at least as high as if both were done at once.
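
For illustration, collapsing the two products is just associativity of matrix multiplication. The names and shapes below are made up for the example, not the estimator's actual attributes:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 30)               # n_samples x n_features
scalings_ = rng.randn(30, 5)         # n_features x n_components projection (hypothetical)
class_weights = rng.randn(5, 3)      # n_components x n_classes (hypothetical)

# two multiplications, with an (n_samples x n_components) intermediate
scores_two_step = X.dot(scalings_).dot(class_weights)

# one multiplication, after folding the projection into the weights once at fit time
coef = scalings_.dot(class_weights)  # n_features x n_classes
scores_one_step = X.dot(coef)

assert np.allclose(scores_two_step, scores_one_step)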

Owner

GaelVaroquaux commented Feb 17, 2013

As a side note, a user reported that while scikit-learn does it differently than Matlab, it agrees with R:
http://sourceforge.net/mailarchive/message.php?msg_id=30486846
This makes me a bit reluctant to change the default: the established standard in statistics is R, not Matlab.

My plan is to add an option to turn this scaling off, but to have it on by default. Any suggestions on how this option should be named?

ksemb commented Feb 17, 2013

This is not an sklearn vs. Matlab question; it is about the mathematical properties of the algorithm. I reported that the LDA in sklearn is not invariant w.r.t. scaling of the data (I assume it is also not rotation invariant, but I did not implement a test case for this). A correct LDA implementation MUST be invariant w.r.t. scaling.

Also, the issue reported in the link in the previous post is not related to this bug report at all, because the user there did a univariate LDA with 5000 samples. This means that his problem is in no way underdetermined. The bug we are facing occurs in the underdetermined case.

Owner

amueller commented Feb 17, 2013

From what I understood I also think the two issues are not related. We could have a look at the R results but I generally agree with @ksemb.

Owner

amueller commented Feb 20, 2013

Does anyone know why the algorithm is done the way it is? It looks overly complicated, but I don't like to mess with something I don't really understand.

ksemb commented Feb 20, 2013

To keep backward compatibility with existing user code, I would prefer to leave the scaling in the code but to modify it so that everything works correctly.

Apart from the suggestion to keep the official sklearn version more or less as it is and to fix only the parts that are broken, I would like to note that I converted a simpler LDA implementation from Matlab to Python, because I needed a regularized LDA for my own work. This is the Matlab implementation I started from:

http://www.mathworks.de/matlabcentral/fileexchange/29673-lda-linear-discriminant-analysis

According to the text on the webpage it has been "(..) verified against statistical software (..)"

The author explains parts of the algorithm in his blog post.
http://matlabdatamining.blogspot.de/2010/12/linear-discriminant-analysis-lda.html

My implementation can be found here.
https://gist.github.com/ksemb/4c3c86c6b62e7a99989b

In the test cases I have created so far, the implementation produces results identical to those of the original sklearn version when scaling is disabled. The simplified version behaves differently from the original sklearn version when the group means are collinear.

Owner

amueller commented Feb 20, 2013

Your implementation looks very interesting (though pep8 would be nice ;)
That is more how I would do it, though I don't fully understand the _lstsq_scatter part... This is done by pinvh currently, right?

ksemb commented Feb 20, 2013

Yes, it is something like pinvh on a matrix whose eigenvalue decomposition is already known. I did not know of pinvh when I wrote the code, and I think one could modify the _lstsq_scatter function so that its signature becomes very similar to pinvh's.
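
Roughly, the idea looks like the sketch below (my guess at the shape of such a helper, not the actual _lstsq_scatter code): given the eigendecomposition of a symmetric positive semi-definite scatter matrix, the pseudo-inverse that pinvh would compute only needs the eigenvalues above a tolerance to be inverted.

import numpy as np
from scipy import linalg

def pinv_from_eig(eigvals, eigvecs, tol=1e-10):
    """Pseudo-inverse of a symmetric PSD matrix from its eigendecomposition,
    analogous to scipy.linalg.pinvh but reusing an already-computed decomposition."""
    keep = eigvals > tol * eigvals.max()
    return (eigvecs[:, keep] / eigvals[keep]).dot(eigvecs[:, keep].T)

# quick check against pinvh on a rank-deficient scatter matrix
rng = np.random.RandomState(0)
A = rng.randn(4, 6)
S = A.T.dot(A)                     # 6x6 scatter matrix of rank 4
w, V = linalg.eigh(S)
assert np.allclose(pinv_from_eig(w, V), linalg.pinvh(S))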

Owner

amueller commented Feb 20, 2013

sounds smart ^^

I really don't know enough about the algorithm to say anything substantial but your implementation looks way better to me ;)

Owner

agramfort commented Feb 21, 2013

I agree that this implementation is simpler. I suspect it is less efficient though. Can you check? I also suspect, knowing the author of the current code, that the R implementation was a source of inspiration.

@agramfort : +1 on both assumptions

ksemb commented Feb 21, 2013

I suspect the runtime of both implementations is determined by the cost of the first SVD, which is the same in both, so they should have the same runtime. I cannot comment on numerical stability.

Looking at the R code might make sense at this point.

Owner

GaelVaroquaux commented Aug 4, 2013

Retagging for 0.15.

ksemb commented Aug 23, 2013

I have now compared the sklearn implementation to the R implementation (more precisely, the MASS::lda function). Both implementations are identical, except for the transform function (I think sklearn's transform function is not correct). As a separate bug report has already been opened for that issue I will write a comment there.

I also had a second look at Matlab's LDA implementation (toolbox/stats/classreg/ClassificationDiscriminant.m) and saw that in fact it also does that univariate scaling (in line 720, multiplication by "invD").

I then compared the results of sklearn.linear_model.LinearRegression to those of the sklearn implementation with univariate scaling and to those of the LDA implementation that I posted. All classifiers give identical results except in the underdetermined case: there, only LinearRegression gives reasonable results, while none of the LDA implementations do very well.

To summarize: the original sklearn LDA implementation, the sklearn LDA implementation with univariate scaling removed, and the LDA implementation I posted all give wrong answers in the underdetermined case.

Here is a very simple example problem that illustrates what happens. Suppose you have the following setup:

LDA().fit(X=[[0, 0],
             [1, 0],
             [0, 1]],
          y=[0, 0, 1])

As the data is demeaned separately for each class during the covariance computation, the covariance matrix has only one nonzero eigenvalue, with eigenvector [1, 0]. As LDA seeks the normal vector of the separating hyperplane by inverting the covariance matrix, the normal vector it finds lies inside the span of the covariance matrix and will be [1, 0]. However, a reasonable separating hyperplane would have the normal vector [0, 1], which is exactly orthogonal to the one found by LDA.
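
The numbers can be checked directly; a small verification of this example, using the pseudo-inverse of the pooled within-class scatter in place of an inverse since the matrix is singular:

import numpy as np

X = np.array([[0., 0.], [1., 0.], [0., 1.]])
y = np.array([0, 0, 1])

# pooled within-class scatter after per-class demeaning
Sw = np.zeros((2, 2))
for c in (0, 1):
    Xc = X[y == c] - X[y == c].mean(axis=0)
    Sw += Xc.T.dot(Xc)

print(Sw)                          # [[0.5, 0], [0, 0]] -> rank 1, span is [1, 0]

mean_diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)   # [-0.5, 1]
w = np.linalg.pinv(Sw).dot(mean_diff)
print(w)                           # [-1, 0]: lies along [1, 0], orthogonal to the sensible [0, 1]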

So, the main problem is that the demeaning removes an important dimension in the underdetermined case. In the meantime I have implemented a hack that adds the lost dimension back to the eigenvectors of the covariance matrix, but I would prefer an implementation of the least-squares LDA (www.machinelearning.org/proceedings/icml2007/papers/87.pdf), which does not require demeaning the data.
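
For the binary case, the least-squares route can be sketched as below. This is only the classical equivalence between two-class LDA and least-squares regression with a particular target coding (n/n1 vs. -n/n0), not the paper's full multi-class algorithm:

import numpy as np

def lsq_lda_direction(X, y):
    """Binary LDA direction via least squares: regress a class-indicator
    coding on globally centered X (no per-class demeaning involved)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X = X - X.mean(axis=0)
    n = float(len(y))
    n1 = np.sum(y == 1)
    n0 = n - n1
    t = np.where(y == 1, n / n1, -n / n0)   # zero-mean target coding
    # minimum-norm least-squares solution; proportional to the LDA direction
    return np.linalg.pinv(X).dot(t)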

ksemb closed this Aug 23, 2013

ksemb reopened this Aug 23, 2013

Owner

amueller commented Jan 4, 2014

should we try to get @ksemb's implementation in before the release?

Owner

agramfort commented Jan 5, 2014

+1 if there is no regression in terms of output and speed

Contributor

cbrnr commented Feb 24, 2014

I had a question related to shrinkage LDA on the mailing list, but I think it makes sense to post the relevant contents here as well.

The code proposed by @ksemb doesn't seem to solve my problem (I want to use shrinkage LDA). As far as I understand it, this issue is about the normalization step that doesn't work in the underdetermined case. In his proposed code, @ksemb still uses an SVD to implement LDA (in contrast to the MATLAB implementation he used as a template). This means that regularization of the covariance matrix is only possible in the original MATLAB code, but not in @ksemb's Python version.

Following this discussion, I think @ksemb's last comment makes it clear that there are still problems even with his revised version. I suggest using the classical straightforward formulation of e.g. Duda and Hart, pp. 117-121, where you simply compute the scatter matrices and get the weights via

w = S_W^{-1} * (m_1 - m_2)

This is exactly what happens in the MATLAB template that @ksemb is referring to. In this case, shrinkage can be applied to the within-class scatter matrix S_W when computing S_1 and S_2 (and of course potentially for more than two classes).
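
A minimal two-class sketch of that formulation, with a simple convex shrinkage of each class scatter matrix toward a scaled identity. The fixed alpha is just a placeholder (Ledoit-Wolf or cross-validation would normally pick it), and the function name is made up for the example:

import numpy as np

def shrinkage_lda_weights(X, y, alpha=0.1):
    """w = S_W^{-1} (m_1 - m_0), with each class scatter shrunk toward
    a scaled identity: S_c <- (1 - alpha) * S_c + alpha * (trace(S_c) / p) * I."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    p = X.shape[1]
    Sw = np.zeros((p, p))
    means = []
    for c in (0, 1):
        Xc = X[y == c]
        m = Xc.mean(axis=0)
        means.append(m)
        Sc = (Xc - m).T.dot(Xc - m)                      # class scatter matrix
        Sw += (1 - alpha) * Sc + alpha * (np.trace(Sc) / p) * np.eye(p)
    w = np.linalg.solve(Sw, means[1] - means[0])
    b = -0.5 * w.dot(means[0] + means[1])                # threshold for equal priors
    return w, b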

I've been using such an implementation (at least in MATLAB) for a long time and never experienced any problems.

Owner

agramfort commented Feb 24, 2014

if the implementation you propose makes your life easier without a performance regression, you should submit a PR

amueller modified the milestones: 0.15.1, 0.15 on Jul 18, 2014

Owner

agramfort commented Jan 28, 2015

Closing, as the LDA code was recently improved and refactored. Reopen if needed.

agramfort closed this Jan 28, 2015
