
Creation of the attribute LDA.explained_variance_ratio_, for the eige… #5216

Merged (5 commits into scikit-learn:master, Sep 11, 2015)

Conversation

JPFrancoia
Contributor

Creation of the attribute LDA.explained_variance_ratio_, for the eigen solver. It is analogous to PCA.explained_variance_ratio_ and works in the same way.

Only one line of code has been added; there was not much to change. However, the attribute is not available for the default solver, as I'm not good enough to change that part of the code.

This PR addresses issue #5210.
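For reference, a minimal sketch of what the added line computes, assuming `evals` holds the eigenvalues produced by the eigen solver (the values below are made up):

```python
import numpy as np

# hypothetical eigenvalues from the eigen solver, one per discriminant axis
evals = np.array([4.2, 0.3, 0.05, 1e-9])

# each eigenvalue's share of the total, sorted in decreasing order,
# analogous to PCA.explained_variance_ratio_
explained_variance_ratio_ = np.sort(evals / np.sum(evals))[::-1]
```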

@agramfort
Member

there is no test and no docstring update

djipey added 2 commits September 6, 2015 11:46
explained_variance_ratio_. Creation of the function
test_lda_explained_variance_ratio to test this new attribute.
@JPFrancoia
Contributor Author

I'm very sorry, I don't contribute a lot to big projects usually. I hope this is better now.

@@ -180,6 +180,12 @@ class LDA(BaseEstimator, LinearClassifierMixin, TransformerMixin):
covariance_ : array-like, shape (n_features, n_features)
Covariance matrix (shared by all classes).

explained_variance_ratio_ : array, [n_components]

explained_variance_ratio_ : array, shape (n_components,)


just replace the line with what I suggest. The formatting is not consistent with the rest of the docstring.

@agramfort
Member

Besides that, LGTM. It is consistent with the PCA API.

@JPFrancoia
Contributor Author

Yes, I'm trying... What does this note mean? Is something incorrect?

@@ -180,6 +180,12 @@ class LDA(BaseEstimator, LinearClassifierMixin, TransformerMixin):
covariance_ : array-like, shape (n_features, n_features)
Covariance matrix (shared by all classes).

explained_variance_ratio_ : array, (n_components)

you forgot the ,

(n_components) -> (n_components,)

it's a tuple of length 1

@agramfort
Member

+1 for merge when travis is happy.

@amueller
Member

amueller commented Sep 8, 2015

Is this a standard thing for LDA? I'm not that familiar with it.
I feel it is slightly confusing, as these are not the components that the transformer transforms to, right?
Say you have three Gaussian blobs in 2D sitting on the x axis, where the std in the y direction is very high for each and the std in the x direction is very low. Then the explained variance ratio for y will be near one, and for x it will be near zero. But y is uninformative for the classification task.
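A quick sketch of that scenario in plain NumPy (the centers and spreads below are made-up numbers chosen to match the description):

```python
import numpy as np

rng = np.random.RandomState(0)

# three classes on the x axis: tiny spread in x, huge spread in y
centers_x = np.array([-5.0, 0.0, 5.0])
X = np.vstack([
    np.column_stack([c + 0.1 * rng.randn(1000), 20.0 * rng.randn(1000)])
    for c in centers_x
])

# per-feature share of the total variance, PCA-style
var = X.var(axis=0)
ratio = var / var.sum()
print(ratio)  # y dominates, although only x separates the classes
```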

@JPFrancoia
Contributor Author

Here is an example of what chemists do in publications:
[screenshot: LDA score plot from the publication]

It is extracted from Elci, S. G.; Moyano, D. F.; Rana, S.; Tonga, G. Y.; Phillips, R. L.; Bunz, U. H. F.; Rotello, V. M. Chem. Sci. 2013, 4, 2076–2080

If you have access and want to check.

To answer your question:

these are not the components that the transformer transforms to, right?

Yes, these are the components that the transformer transforms to. If you look into the code, the solver sorts the eigenvectors and takes the n most important ones, n being the number of dimensions you requested.
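Roughly, the eigen solver can be sketched in plain NumPy like this (toy made-up data; this is a sketch, not the actual scikit-learn code):

```python
import numpy as np

rng = np.random.RandomState(0)

# toy data: 3 classes in 4 dimensions, class means 3 units apart
X = rng.randn(90, 4) + np.repeat(np.eye(3, 4) * 3, 30, axis=0)
y = np.repeat([0, 1, 2], 30)

# within-class (Sw) and between-class (Sb) scatter matrices
mean = X.mean(axis=0)
Sw = np.zeros((4, 4))
Sb = np.zeros((4, 4))
for c in range(3):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    d = (mc - mean)[:, None]
    Sb += len(Xc) * (d @ d.T)

# eigenvalues of Sw^-1 Sb, sorted in decreasing order
evals = np.linalg.eig(np.linalg.solve(Sw, Sb))[0].real
evals = np.sort(evals)[::-1]

# the new attribute: each eigenvalue's share of the total
explained_variance_ratio = evals / evals.sum()
print(explained_variance_ratio)
```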

@amueller
Member

amueller commented Sep 9, 2015

Oh, sorry, I misread that. I thought these were the class covariances, not the eigenvectors of Sb. That means we only have it for one of the solvers, though. Can we have it for all of them? ;)
At least the docs should say that it is only available for solver="svd".

@JPFrancoia
Contributor Author

Nope, actually it is only available for solver='eigen'; I already noted that in the docs.

I would like to do it for all the solvers, but I'm not able to. I'm not good enough in maths, and I don't understand the code for the other solvers well.

@amueller
Member

Ok, looks good then.

@amueller
Member

Do you want to add an entry to whatsnew?

agramfort added a commit that referenced this pull request Sep 11, 2015
Creation of the attribute LDA.explained_variance_ratio_, for the eige…
@agramfort agramfort merged commit 9ebc6fe into scikit-learn:master Sep 11, 2015
@agramfort
Member

thanks

@mblondel
Member

It would be nice to add it for the SVD solver as well.

@JPFrancoia
Contributor Author

Yep, I tried, but I don't really understand the code. For the eigen solver it was easy: following this tutorial, http://sebastianraschka.com/Articles/2014_python_lda.html#table-of-contents, it's simple to understand how it works. But here is the interesting part of the SVD solver:

        n_samples, n_features = X.shape
        n_classes = len(self.classes_)

        self.means_ = _class_means(X, y)
        if store_covariance:
            self.covariance_ = _class_cov(X, y, self.priors_)

        Xc = []
        for idx, group in enumerate(self.classes_):
            Xg = X[y == group, :]
            Xc.append(Xg - self.means_[idx])

        self.xbar_ = np.dot(self.priors_, self.means_)

        Xc = np.concatenate(Xc, axis=0)

        # 1) within-class (univariate) scaling by the within-class std-dev
        std = Xc.std(axis=0)
        # avoid division by zero in normalization
        std[std == 0] = 1.
        fac = 1. / (n_samples - n_classes)

        # 2) Within variance scaling
        X = np.sqrt(fac) * (Xc / std)
        # SVD of centered (within)scaled data
        U, S, V = linalg.svd(X, full_matrices=False)

        rank = np.sum(S > tol)
        if rank < n_features:
            warnings.warn("Variables are collinear.")
        # Scaling of within covariance is: V' 1/S
        scalings = (V[:rank] / std).T / S[:rank]

        # 3) Between variance scaling
        # Scale weighted centers
        X = np.dot(((np.sqrt((n_samples * self.priors_) * fac)) *
                    (self.means_ - self.xbar_).T).T, scalings)
        # Centers are living in a space with n_classes-1 dim (maximum)
        # Use SVD to find projection in the space spanned by the
        # (n_classes) centers
        _, S, V = linalg.svd(X, full_matrices=0)

        rank = np.sum(S > tol * S[0])
        self.scalings_ = np.dot(scalings, V.T[:, :rank])
        coef = np.dot(self.means_ - self.xbar_, self.scalings_)
        self.intercept_ = (-0.5 * np.sum(coef ** 2, axis=1)
                           + np.log(self.priors_))
        self.coef_ = np.dot(coef, self.scalings_.T)
        self.intercept_ -= np.dot(self.xbar_, self.coef_.T)

I don't know if it calculates the same thing (eigenvalues or something equivalent). The answer should be in S, but I'm not entirely sure. The variance ratios should be the same as with the eigen solver, right?

@AbdealiLoKo

Would really like this for SVD solver

@JPFrancoia
Contributor Author

FYI, thanks to this little PR, our publication was accepted, and scikit-learn has one more citation.
DOI: 10.1039/C5CC07628E

@mbillingr
Contributor

@JPFrancoia congratulations for the publication.
Are you still interested in implementing this for the SVD solver?

If yes, you might find it helpful to think of singular values as square-rooted eigenvalues.

Further, I don't think the variance ratios need to be exactly the same for both solvers because the SVD solver implicitly does regularization (as I recently learned). However, they should not be too far apart :)
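For example, it is easy to check numerically that for a matrix A the singular values are the square roots of the eigenvalues of A.T @ A (a quick made-up check):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(20, 5)

# singular values of A ...
S = np.linalg.svd(A, compute_uv=False)

# ... squared, match the eigenvalues of A.T @ A
evals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(S ** 2, evals))  # True
```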

@JPFrancoia
Contributor Author

Hi @mbillingr,

If yes, you might find it helpful to think of singular values as square-rooted eigenvalues.

It's easy then, I just made a PR.
However, @hlin117 already added this attribute to the class. The way he calculates it is different from mine, and gives different results for the same data set.

Ex:
with solver='eigen':
[ 9.34274621e-01 6.10608131e-02 4.66389537e-03 6.70863023e-07]

with solver='svd'
[ 0.75349411 0.19262998]

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py#L400
It returns only the variance ratios for the number of components chosen for the dimension reduction. It should return all the values, however, which for my data set would be:

[ 7.53494106e-01 1.92629978e-01 5.32374179e-02 6.38497956e-04]

With the PR I made, I get the exact same values with both solvers:
[ 9.34274621e-01 6.10608131e-02 4.66389537e-03 6.70863023e-07]

However, I'm not good enough in maths to say whether he is wrong or not. He actually doesn't square S.
@mbillingr, are you sure the singular values are the eigenvalues, square-rooted?

@hlin117
Contributor

hlin117 commented Dec 14, 2015

Hey @JPFrancoia,

Thanks for pinging me. I set the number of variables for explained_variance_ratio_ to equal the number of components because this is how PCA does it.

As for the numeric differences, what you might be experiencing is a problem with the solvers themselves. (For example, one algorithm initializes differently than another; #5970 also mentions how different initialization values yield different results.)

If you could write a regression test using one of scikit-learn's datasets and post your code, that would be great. Use sklearn.utils.testing.assert_array_almost_equal to show that the arrays aren't equal.

@JPFrancoia
Contributor Author

You are right on this point. Then https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py#337 is not consistent, and should be:

self.explained_variance_ratio_ = np.sort(evals / np.sum(evals))[::-1][:self.n_components]

As for the numeric differences, what you might be experiencing is a problem with the solvers themselves. (Example, one algorithm initializes it differently than another. #5970 also mentions how different initialization values yields different results.)

The numerical differences here are not due to the solvers. You calculate the variance ratios this way:

 self.explained_variance_ratio_ = S[:self.n_components] / S.sum()

While I do:

self.explained_variance_ratio_ = np.sort(S**2 / np.sum(S**2))[::-1][:self.n_components]

EDIT:
I could also do this, which is clearer:

self.explained_variance_ratio_ = (S**2 / np.sum(S**2))[:self.n_components]

My point is: you basically don't square S, while I do. @mbillingr says that the singular values (so S) are the eigenvalues, square-rooted.
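The two formulas genuinely disagree whenever more than one singular value is non-negligible. A made-up example:

```python
import numpy as np

# hypothetical singular values from an LDA fit
S = np.array([3.0, 1.0, 0.5])

without_square = S / S.sum()           # unsquared formula
with_square = S ** 2 / np.sum(S ** 2)  # squared, as for eigenvalues

print(np.round(without_square, 3))  # [0.667 0.222 0.111]
print(np.round(with_square, 3))     # [0.878 0.098 0.024]
```

Both arrays sum to 1, so a check on the sum alone cannot tell them apart.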

@hlin117
Contributor

hlin117 commented Dec 14, 2015

Hmm, that's right... It's odd that the test cases that I wrote passed.

Can you open up a new issue, and write a corresponding PR to add these test cases?

@JPFrancoia
Contributor Author

What test cases did you write?

 self.explained_variance_ratio_ = S / S.sum()

You are probably checking that the sum of this array equals 1, which it does, because you are dividing each element of S by the sum of S.

EDIT:

My bad, you are checking:

    assert_array_almost_equal(clf_lda_svd.explained_variance_ratio_,
                              clf_lda_eigen.explained_variance_ratio_)

EDIT 2:

Actually, couldn't this test pass in many situations, even if the values in one of these arrays are wrong?
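It could: when one value dominates, both formulas collapse to roughly [1, 0, ...], so the comparison cannot distinguish them. A made-up illustration:

```python
import numpy as np

# when one singular value dominates, both formulas give ~[1, 0],
# so assert_array_almost_equal passes even though the formulas differ
S = np.array([50.0, 1e-7])
a = S / S.sum()            # unsquared formula
b = S ** 2 / np.sum(S ** 2)  # squared formula
np.testing.assert_array_almost_equal(a, b)  # passes anyway
```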

@JPFrancoia
Contributor Author

OK, an excerpt from the tests:

def test_lda_explained_variance_ratio():
    # Test that the sum of the normalized eigenvalues equals 1.
    # Also tests whether the explained_variance_ratio_ computed by the
    # eigen solver is the same as the explained_variance_ratio_ computed
    # by the svd solver
    n_features = 2
    n_classes = 2
    n_samples = 1000
    X, y = make_blobs(n_samples=n_samples, n_features=n_features,
                      centers=n_classes, random_state=11)

    clf_lda_eigen = LinearDiscriminantAnalysis(solver="eigen")
    clf_lda_eigen.fit(X, y)
    assert_almost_equal(clf_lda_eigen.explained_variance_ratio_.sum(), 1.0, 3)

    print(clf_lda_eigen.explained_variance_ratio_)

    clf_lda_svd = LinearDiscriminantAnalysis(solver="svd")
    clf_lda_svd.fit(X, y)
    print(clf_lda_svd.explained_variance_ratio_)
    assert_almost_equal(clf_lda_svd.explained_variance_ratio_.sum(), 1.0, 3)
    assert_array_almost_equal(clf_lda_svd.explained_variance_ratio_,
                              clf_lda_eigen.explained_variance_ratio_)

Output:

[ 1.00000000e+00 8.01933309e-16]
[ 1.00000000e+00 5.35764370e-17]

So yeah, the tests will pass. This is with your implementation of explained_variance_ratio_.

@hlin117
Contributor

hlin117 commented Dec 14, 2015

The numbers are different, but that could be perceived as a precision error. We should find a more plausible example of differences.

This seems to be failing too, but it could be perceived as a precision error more than anything else...

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> from sklearn.lda import LDA
>>> X = iris.data
>>> y = iris.target
>>>
>>> eigen = LDA(solver='eigen').fit(X, y)
>>> svd = LDA(solver='svd').fit(X, y)
>>>
>>> eigen.explained_variance_ratio_
array([  9.91472476e-01,   8.52752434e-03,   9.11678219e-16,
        -1.12681164e-16])
>>> svd.explained_variance_ratio_
array([  9.15130046e-01,   8.48699541e-02,   1.62683440e-16])

@JPFrancoia
Contributor Author

The numbers are different, but that could be perceived as a precision error. We should find a more plausible example of differences.

They might be genuinely different (hard to say, they are very close), but for the test they are equal for sure. So the test will pass while the numbers might be different.

Try with this script, they are personal experimental data:

example.txt

(simply rename example.txt to example.py)

@hlin117
Contributor

hlin117 commented Dec 14, 2015

I tried your script. I agree that the results are entirely different.

I wanted to make a more concrete counterexample, so I wrote this:

>>> import numpy as np
>>> from sklearn.lda import LDA
>>> from sklearn.utils.testing import assert_array_almost_equal
>>>
>>> state = np.random.RandomState(0)
>>> X = state.normal(loc=0, scale=100, size=(40, 20))
>>> y = state.randint(0, 3, size=(40, 1))
>>>
>>> # Train the LDA classifier. Use the eigen solver
>>> lda_eigen = LDA(solver='eigen', n_components=5)
>>> lda_eigen.fit(X, y)
>>>
>>> lda_svd = LDA(n_components=5, solver='svd')
>>> lda_svd.fit(X, y)
>>> assert_array_almost_equal(lda_eigen.explained_variance_ratio_, lda_svd.explained_variance_ratio_)

It's not entirely clear why these results are not equal. It's not as simple as one being the square of the other...

@hlin117
Contributor

hlin117 commented Dec 15, 2015

@JPFrancoia: I opened up another issue surfacing this discussion. Let's take this discussion out of this PR.

Thanks for finding these differences.

@JPFrancoia
Contributor Author

I don't really know where I should post right now, but I will do it here, because there are no numerical differences between the solvers. Three things:

self.explained_variance_ratio_ = np.sort(evals / np.sum(evals))[::-1][:self.n_components]

This is my fault; I was the one who created explained_variance_ratio_ for this solver. It is not really an error: the array returned contains the right values, but it does not have the right length.

  • explained_variance_ratio_ is not calculated correctly with the svd solver. For now, it is calculated like this:
self.explained_variance_ratio_ = S[:self.n_components] / S.sum()

And it should be (assuming the eigen values are the singular values squared):

self.explained_variance_ratio_ = (S**2 / np.sum(S**2))[:self.n_components]
  • If the length of explained_variance_ratio_ varies with the solver used, you cannot compare them directly in your tests

I modified your script. If we calculate explained_variance_ratio_ with S**2, and if we compare arrays of the same length, the test passes:

import numpy as np
from sklearn.lda import LDA
from sklearn.utils.testing import assert_array_almost_equal


state = np.random.RandomState(0)
X = state.normal(loc=0, scale=100, size=(40, 20))
y = state.randint(0, 3, size=(40, 1))

# Train the LDA classifier. Use the eigen solver
lda_eigen = LDA(solver='eigen', n_components=5)
lda_eigen.fit(X, y)

lda_svd = LDA(n_components=5, solver='svd')
lda_svd.fit(X, y)

print("eigen:")
print(lda_eigen.explained_variance_ratio_[:3])
print("\n")
print("svd:")
print(lda_svd.explained_variance_ratio_)

assert_array_almost_equal(lda_eigen.explained_variance_ratio_[:3], lda_svd.explained_variance_ratio_)

@hlin117
Contributor

hlin117 commented Dec 15, 2015

@JPFrancoia: #6031 is the PR you're going to be looking for =]

Please open up a PR that implements S**2, as well as this regression test.

@JPFrancoia
Contributor Author

Actually, I had already opened one :) #6027

However, agramfort asked me to "update the unit tests to show this issue", and you are asking for a "regression test".

Could you please explain what they are? I'm not familiar with this kind of thing.

@hlin117
Contributor

hlin117 commented Dec 15, 2015

@JPFrancoia
Contributor Author

Nope, it's OK. My bad.

agramfort pushed a commit that referenced this pull request Apr 10, 2016
solver. See issue #5216. However, this attribute is corrected from the previous commit, which might not return the expected values.
agramfort pushed a commit that referenced this pull request Apr 10, 2016
Used the script of @hlin117 in #5216 to generate better test data,
make_blobs was not good enough.