
Creation of the attribute LDA.explained_variance_ratio_, for the eige… #5216

Merged (5 commits into scikit-learn:master, Sep 11, 2015)

Conversation

JPFrancoia
Contributor

Creation of the attribute LDA.explained_variance_ratio_, for the eigen solver. It is analogous to PCA.explained_variance_ratio_ and works in the same way.

Only one line of code has been added; there was not much to change. However, the attribute is not available for the default solver, as I'm not good enough to change that part of the code.

This PR addresses issue #5210.
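For reference, a minimal sketch of what the added line computes, assuming `evals` holds the eigenvalues produced by the eigen solver (the values below are made up):

```python
import numpy as np

# hypothetical eigenvalues from the eigen solver, one per discriminant axis
evals = np.array([4.2, 0.3, 0.05, 1e-9])

# each eigenvalue's share of the total, sorted in decreasing order,
# analogous to PCA.explained_variance_ratio_
explained_variance_ratio_ = np.sort(evals / np.sum(evals))[::-1]
```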

@agramfort
Member

there is no test and no docstring update

djipey added 2 commits September 6, 2015 11:46
explained_variance_ratio_. Creation of the function
test_lda_explained_variance_ratio to test this new attribute.
@JPFrancoia
Contributor Author

I'm very sorry, I don't contribute a lot to big projects usually. I hope this is better now.

@@ -180,6 +180,12 @@ class LDA(BaseEstimator, LinearClassifierMixin, TransformerMixin):
covariance_ : array-like, shape (n_features, n_features)
Covariance matrix (shared by all classes).

explained_variance_ratio_ : array, [n_components]

explained_variance_ratio_ : array, shape (n_components,)


just replace the line with what I suggest. The formatting is not consistent with the rest of the docstring.

@agramfort
Member

Besides that, LGTM. It is consistent with the PCA API.

@JPFrancoia
Contributor Author

Yes, I'm trying... What does this note mean? Is something incorrect?

@@ -180,6 +180,12 @@ class LDA(BaseEstimator, LinearClassifierMixin, TransformerMixin):
covariance_ : array-like, shape (n_features, n_features)
Covariance matrix (shared by all classes).

explained_variance_ratio_ : array, (n_components)

you forgot the ,

(n_components) -> (n_components,)

it's a tuple of length 1

@agramfort
Member

+1 for merge when travis is happy.

@amueller
Member

amueller commented Sep 8, 2015

Is this a standard thing for LDA? I'm not that familiar with it.
I feel it is slightly confusing, as these are not the components that the transformer transforms to, right?
Say you have three Gaussian blobs in 2D sitting on the x axis, where the std in the y direction is very high for each and the std in the x direction is very low. Then the explained variance ratio for y will be near one, and for x it will be near zero. But y is uninformative for the classification task.
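A quick sketch of that scenario in plain NumPy (the centers and spreads below are made-up numbers chosen to match the description):

```python
import numpy as np

rng = np.random.RandomState(0)

# three classes on the x axis: tiny spread in x, huge spread in y
centers_x = np.array([-5.0, 0.0, 5.0])
X = np.vstack([
    np.column_stack([c + 0.1 * rng.randn(1000), 20.0 * rng.randn(1000)])
    for c in centers_x
])

# per-feature share of the total variance, PCA-style
var = X.var(axis=0)
ratio = var / var.sum()
print(ratio)  # y dominates, although only x separates the classes
```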

@JPFrancoia
Contributor Author

Here is an example of what chemists do in publications:
[screenshot: LDA score plot from the publication]

It is extracted from Elci, S. G.; Moyano, D. F.; Rana, S.; Tonga, G. Y.; Phillips, R. L.; Bunz, U. H. F.; Rotello, V. M. Chem. Sci. 2013, 4, 2076–2080

If you have access and want to check.

To answer your question:

these are not the components that the transformer transforms to, right?

Yes, these are the components that the transformer transforms to. If you look into the code, the solver sorts the eigenvectors and takes the n most important ones, n being the number of dimensions you requested.
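Roughly, the eigen solver can be sketched in plain NumPy like this (toy made-up data; this is a sketch, not the actual scikit-learn code):

```python
import numpy as np

rng = np.random.RandomState(0)

# toy data: 3 classes in 4 dimensions, class means 3 units apart
X = rng.randn(90, 4) + np.repeat(np.eye(3, 4) * 3, 30, axis=0)
y = np.repeat([0, 1, 2], 30)

# within-class (Sw) and between-class (Sb) scatter matrices
mean = X.mean(axis=0)
Sw = np.zeros((4, 4))
Sb = np.zeros((4, 4))
for c in range(3):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    d = (mc - mean)[:, None]
    Sb += len(Xc) * (d @ d.T)

# eigenvalues of Sw^-1 Sb, sorted in decreasing order
evals = np.linalg.eig(np.linalg.solve(Sw, Sb))[0].real
evals = np.sort(evals)[::-1]

# the new attribute: each eigenvalue's share of the total
explained_variance_ratio = evals / evals.sum()
print(explained_variance_ratio)
```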

@amueller
Member

amueller commented Sep 9, 2015

Oh, sorry, I misread that. I thought these were the class covariances, not the eigenvectors of Sb. That means we only have it for one of the solvers, though. Can we have it for all of them? ;)
At least the docs should say that it is only available for solver="svd".

@JPFrancoia
Contributor Author

Nope, actually it is only available for solver='eigen'; I already noted that in the docs.

I would like to do it for all the solvers, but I'm not able to. I'm not good enough in maths, and I don't understand the code for the other solvers well.

@amueller
Member

Ok, looks good then.

@amueller
Member

Do you want to add an entry to whatsnew?

agramfort added a commit that referenced this pull request Sep 11, 2015
Creation of the attribute LDA.explained_variance_ratio_, for the eige…
@agramfort agramfort merged commit 9ebc6fe into scikit-learn:master Sep 11, 2015
@agramfort
Member

thanks

@mblondel
Member

It would be nice to add it for the SVD solver as well.

@JPFrancoia
Contributor Author

Yep, I tried, but I don't really understand the code. For the eigen solver it was easy: following this tutorial, http://sebastianraschka.com/Articles/2014_python_lda.html#table-of-contents, it's simple to understand how it works. But here is the interesting part of the SVD solver:

        n_samples, n_features = X.shape
        n_classes = len(self.classes_)

        self.means_ = _class_means(X, y)
        if store_covariance:
            self.covariance_ = _class_cov(X, y, self.priors_)

        Xc = []
        for idx, group in enumerate(self.classes_):
            Xg = X[y == group, :]
            Xc.append(Xg - self.means_[idx])

        self.xbar_ = np.dot(self.priors_, self.means_)

        Xc = np.concatenate(Xc, axis=0)

        # 1) within-class (univariate) scaling by the within-class std-dev
        std = Xc.std(axis=0)
        # avoid division by zero in normalization
        std[std == 0] = 1.
        fac = 1. / (n_samples - n_classes)

        # 2) Within variance scaling
        X = np.sqrt(fac) * (Xc / std)
        # SVD of centered (within)scaled data
        U, S, V = linalg.svd(X, full_matrices=False)

        rank = np.sum(S > tol)
        if rank < n_features:
            warnings.warn("Variables are collinear.")
        # Scaling of within covariance is: V' 1/S
        scalings = (V[:rank] / std).T / S[:rank]

        # 3) Between variance scaling
        # Scale weighted centers
        X = np.dot(((np.sqrt((n_samples * self.priors_) * fac)) *
                    (self.means_ - self.xbar_).T).T, scalings)
        # Centers are living in a space with n_classes-1 dim (maximum)
        # Use SVD to find projection in the space spanned by the
        # (n_classes) centers
        _, S, V = linalg.svd(X, full_matrices=0)

        rank = np.sum(S > tol * S[0])
        self.scalings_ = np.dot(scalings, V.T[:, :rank])
        coef = np.dot(self.means_ - self.xbar_, self.scalings_)
        self.intercept_ = (-0.5 * np.sum(coef ** 2, axis=1)
                           + np.log(self.priors_))
        self.coef_ = np.dot(coef, self.scalings_.T)
        self.intercept_ -= np.dot(self.xbar_, self.coef_.T)

I don't know if it calculates the same thing (eigenvalues or something equivalent). The answer should be in S, but I'm not entirely sure. The variance ratios should be the same as with the eigen solver, right?

@AbdealiLoKo

Would really like this for SVD solver

@JPFrancoia
Contributor Author

FYI, thanks to this little PR, our publication was accepted, and scikit-learn has one more citation.
DOI: 10.1039/C5CC07628E

@mbillingr
Contributor

@JPFrancoia congratulations for the publication.
Are you still interested in implementing this for the SVD solver?

If yes, you might find it helpful to think of singular values as square-rooted eigenvalues.

Further, I don't think the variance ratios need to be exactly the same for both solvers because the SVD solver implicitly does regularization (as I recently learned). However, they should not be too far apart :)
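For example, it is easy to check numerically that for a matrix A the singular values are the square roots of the eigenvalues of A.T @ A (a quick made-up check):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(20, 5)

# singular values of A ...
S = np.linalg.svd(A, compute_uv=False)

# ... squared, match the eigenvalues of A.T @ A
evals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(S ** 2, evals))  # True
```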

@JPFrancoia
Contributor Author

Hi @mbillingr,

If yes, you might find it helpful to think of singular values as square-rooted eigenvalues.

It's easy then, I just made a PR.
However, @hlin117 already added this attribute to the class. The way he calculates it is different from mine, and gives different results for the same data set.

Ex:
with solver='eigen':
[ 9.34274621e-01 6.10608131e-02 4.66389537e-03 6.70863023e-07]

with solver='svd'
[ 0.75349411 0.19262998]

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py#L400
It returns only the variance ratios for the number of components chosen for the dimension reduction. It should return all the values, however, which for my data set would be:

[ 7.53494106e-01 1.92629978e-01 5.32374179e-02 6.38497956e-04]

With the PR I made, I get the exact same values with both solvers:
[ 9.34274621e-01 6.10608131e-02 4.66389537e-03 6.70863023e-07]

However, I'm not good enough in maths to say whether he is wrong or not. He actually doesn't square S.
@mbillingr, are you sure the singular values are the eigenvalues, square-rooted?

@hlin117
Contributor

hlin117 commented Dec 14, 2015

Hey @JPFrancoia,

Thanks for pinging me. I set the number of variables for explained_variance_ratio_ to equal the number of components because this is how PCA does it.

As for the numeric differences, what you might be experiencing is a problem with the solvers themselves. (For example, one algorithm initializes differently than another; #5970 also mentions how different initialization values yield different results.)

If you could write a regression test using one of scikit-learn's datasets and post your code, that would be great. Use sklearn.utils.testing.assert_array_almost_equal to show that the arrays aren't equal.

@JPFrancoia
Contributor Author

You are right on this point. Then https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py#337 is not consistent, and should be:

self.explained_variance_ratio_ = np.sort(evals / np.sum(evals))[::-1][:self.n_components]

As for the numeric differences, what you might be experiencing is a problem with the solvers themselves. (Example, one algorithm initializes it differently than another. #5970 also mentions how different initialization values yields different results.)

The numerical differences here are not due to the solvers. You calculate the variance ratios this way:

 self.explained_variance_ratio_ = S[:self.n_components] / S.sum()

While I do:

self.explained_variance_ratio_ = np.sort(S**2 / np.sum(S**2))[::-1][:self.n_components]

EDIT:
I could also do this, which is clearer:

self.explained_variance_ratio_ = (S**2 / np.sum(S**2))[:self.n_components]

My point is: you basically don't square S, while I do. @mbillingr says that the singular values (so S) are the eigenvalues, square-rooted.
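The two formulas genuinely disagree whenever more than one singular value is non-negligible. A made-up example:

```python
import numpy as np

# hypothetical singular values from an LDA fit
S = np.array([3.0, 1.0, 0.5])

without_square = S / S.sum()           # unsquared formula
with_square = S ** 2 / np.sum(S ** 2)  # squared, as for eigenvalues

print(np.round(without_square, 3))  # [0.667 0.222 0.111]
print(np.round(with_square, 3))     # [0.878 0.098 0.024]
```

Both arrays sum to 1, so a check on the sum alone cannot tell them apart.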

@hlin117
Contributor

hlin117 commented Dec 14, 2015

Hmm, that's right... It's odd that the test cases that I wrote passed.

Can you open up a new issue, and write a corresponding PR to add these test cases?

@JPFrancoia
Contributor Author

What test cases did you write?

 self.explained_variance_ratio_ = S / S.sum()

You are probably checking that the sum of this array equals 1, which it does, because you are dividing each element of S by the sum of S.

EDIT:

My bad, you are checking:

    assert_array_almost_equal(clf_lda_svd.explained_variance_ratio_,
                              clf_lda_eigen.explained_variance_ratio_)

EDIT 2:

Actually, couldn't this test pass in many situations, even if the values in one of these arrays are wrong?
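It could: when one value dominates, both formulas collapse to roughly [1, 0, ...], so the comparison cannot distinguish them. A made-up illustration:

```python
import numpy as np

# when one singular value dominates, both formulas give ~[1, 0],
# so assert_array_almost_equal passes even though the formulas differ
S = np.array([50.0, 1e-7])
a = S / S.sum()            # unsquared formula
b = S ** 2 / np.sum(S ** 2)  # squared formula
np.testing.assert_array_almost_equal(a, b)  # passes anyway
```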

@JPFrancoia
Contributor Author

OK, an excerpt from the tests:

def test_lda_explained_variance_ratio():
    # Test that the sum of the normalized eigenvalues equals 1.
    # Also tests whether the explained_variance_ratio_ computed by the
    # eigen solver is the same as the explained_variance_ratio_ computed
    # by the svd solver
    n_features = 2
    n_classes = 2
    n_samples = 1000
    X, y = make_blobs(n_samples=n_samples, n_features=n_features,
                      centers=n_classes, random_state=11)

    clf_lda_eigen = LinearDiscriminantAnalysis(solver="eigen")
    clf_lda_eigen.fit(X, y)
    assert_almost_equal(clf_lda_eigen.explained_variance_ratio_.sum(), 1.0, 3)

    print(clf_lda_eigen.explained_variance_ratio_)

    clf_lda_svd = LinearDiscriminantAnalysis(solver="svd")
    clf_lda_svd.fit(X, y)
    print(clf_lda_svd.explained_variance_ratio_)
    assert_almost_equal(clf_lda_svd.explained_variance_ratio_.sum(), 1.0, 3)
    assert_array_almost_equal(clf_lda_svd.explained_variance_ratio_,
                              clf_lda_eigen.explained_variance_ratio_)

Output:

[ 1.00000000e+00 8.01933309e-16]
[ 1.00000000e+00 5.35764370e-17]

So yeah, the tests will pass. This is with your implementation of explained_variance_ratio_.

@hlin117
Contributor

hlin117 commented Dec 14, 2015

The numbers are different, but that could be perceived as a precision error. We should find a more plausible example of differences.

This seems to be failing too, but it could be perceived as a precision error more than anything else...

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> from sklearn.lda import LDA
>>> X = iris.data
>>> y = iris.target
>>>
>>> eigen = LDA(solver='eigen').fit(X, y)
>>> svd = LDA(solver='svd').fit(X, y)
>>>
>>> eigen.explained_variance_ratio_
array([  9.91472476e-01,   8.52752434e-03,   9.11678219e-16,
        -1.12681164e-16])
>>> svd.explained_variance_ratio_
array([  9.15130046e-01,   8.48699541e-02,   1.62683440e-16])

@JPFrancoia
Contributor Author

The numbers are different, but that could be perceived as a precision error. We should find a more plausible example of differences.

They might be genuinely different (hard to say, they are very close), but for the test they are equal for sure. So the test will pass while the numbers might be different.

Try with this script, they are personal experimental data:

example.txt

(simply rename example.txt to example.py)

@hlin117
Contributor

hlin117 commented Dec 14, 2015

I tried your script. I agree that the results are entirely different.

I wanted to make a more concrete counterexample, so I wrote this:

>>> import numpy as np
>>> from sklearn.lda import LDA
>>> from sklearn.utils.testing import assert_array_almost_equal
>>>
>>> state = np.random.RandomState(0)
>>> X = state.normal(loc=0, scale=100, size=(40, 20))
>>> y = state.randint(0, 3, size=(40, 1))
>>>
>>> # Train the LDA classifier. Use the eigen solver
>>> lda_eigen = LDA(solver='eigen', n_components=5)
>>> lda_eigen.fit(X, y)
>>>
>>> lda_svd = LDA(n_components=5, solver='svd')
>>> lda_svd.fit(X, y)
>>> assert_array_almost_equal(lda_eigen.explained_variance_ratio_, lda_svd.explained_variance_ratio_)

It's not entirely clear why these results are not equal. It's not as simple as one being the square of the other...

@hlin117
Contributor

hlin117 commented Dec 15, 2015

@JPFrancoia: I opened up another issue surfacing this discussion. Let's take this discussion out of this PR.

Thanks for finding these differences.

@JPFrancoia
Contributor Author

I don't really know where I should post right now, but I will do it here, because there are no numerical differences between the solvers. Three things:

self.explained_variance_ratio_ = np.sort(evals / np.sum(evals))[::-1][:self.n_components]

This is my fault; I was the one who created explained_variance_ratio_ for this solver. It is not really an error: the array returned contains the right values, but it does not have the right length.

  • explained_variance_ratio_ is not calculated correctly with the svd solver. For now, it is calculated like this:
self.explained_variance_ratio_ = S[:self.n_components] / S.sum()

And it should be (assuming the eigen values are the singular values squared):

self.explained_variance_ratio_ = (S**2 / np.sum(S**2))[:self.n_components]
  • If the length of explained_variance_ratio_ varies with the solver used, you cannot compare them directly in your tests

I modified your script. If we calculate explained_variance_ratio_ with S**2, and if we compare arrays of the same length, the test passes:

import numpy as np
from sklearn.lda import LDA
from sklearn.utils.testing import assert_array_almost_equal


state = np.random.RandomState(0)
X = state.normal(loc=0, scale=100, size=(40, 20))
y = state.randint(0, 3, size=(40, 1))

# Train the LDA classifier. Use the eigen solver
lda_eigen = LDA(solver='eigen', n_components=5)
lda_eigen.fit(X, y)

lda_svd = LDA(n_components=5, solver='svd')
lda_svd.fit(X, y)

print("eigen:")
print(lda_eigen.explained_variance_ratio_[:3])
print("\n")
print("svd:")
print(lda_svd.explained_variance_ratio_)

assert_array_almost_equal(lda_eigen.explained_variance_ratio_[:3], lda_svd.explained_variance_ratio_)

@hlin117
Contributor

hlin117 commented Dec 15, 2015

@JPFrancoia: #6031 is the PR you're going to be looking for =]

Please open up a PR that implements S**2, as well as this regression test.

@JPFrancoia
Contributor Author

Actually, I had already opened one :) #6027

However, agramfort asked me to "update the unit tests to show this issue", and you are asking for a "regression test".

Could you please explain what they are? I'm not familiar with this kind of thing.

@hlin117
Contributor

hlin117 commented Dec 15, 2015

@JPFrancoia
Contributor Author

Nope, it's OK. My bad.

agramfort pushed a commit that referenced this pull request Apr 10, 2016
solver. See issue #5216. However, this attribute is corrected from the previous commit, which might not return the expected values.
agramfort pushed a commit that referenced this pull request Apr 10, 2016
Used the script of @hlin117 in #5216 to generate better test data,
make_blobs was not good enough.