
LDA.explained_variance_ratio_ is of the wrong size #6032

Closed
hlin117 opened this issue Dec 15, 2015 · 13 comments · Fixed by #7632
Labels: Bug · Easy (Well-defined and straightforward way to resolve)
@hlin117
Contributor

hlin117 commented Dec 15, 2015

The docs say that LDA.explained_variance_ratio_ should have length n_components, but it doesn't.

It looks like this bug only exists when we use the eigen solver, not the svd solver.

>>> import numpy as np
>>> from sklearn.lda import LDA
>>> from sklearn.utils.testing import assert_equal
>>>
>>> state = np.random.RandomState(0)
>>> X = state.normal(loc=0, scale=100, size=(40, 20))
>>> y = state.randint(0, 3, size=(40, 1))
>>>
>>> # Train the LDA classifier. Use the eigen solver
>>> lda_eigen = LDA(solver='eigen', n_components=5)
>>> lda_eigen.fit(X, y)
>>> assert_equal(lda_eigen.explained_variance_ratio_.shape, (5,))
AssertionError: Tuples differ: (20,) != (5,)

First differing element 0:
20
5

- (20,)
+ (5,)

Looks like we need to fix either the docs or the code. Which one?

Pinging @JPFrancoia.

Addresses an issue in #6031.

@JPFrancoia
Contributor

Yep, sorry, this is my fault. As I said in #5216, explained_variance_ratio_ should have a length of n_components.

the code should be:

self.explained_variance_ratio_ = np.sort(evals / np.sum(evals))[::-1][:self.n_components]

But this is an API change.
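For illustration, here is a standalone sketch (with toy eigenvalues, not the estimator's actual internals) of what the proposed line computes: normalize the eigenvalues, sort them in descending order, and keep only the first n_components ratios.

```python
import numpy as np

# Toy eigenvalues standing in for the eigen solver's evals array
evals = np.array([4.0, 1.0, 3.0, 2.0])
n_components = 2

# Normalize, sort descending, keep the first n_components ratios
ratio = np.sort(evals / np.sum(evals))[::-1][:n_components]
print(ratio)  # [0.4 0.3]
```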

@hlin117
Contributor Author

hlin117 commented Dec 15, 2015

@JPFrancoia: It might be an API change, or a documentation change.

I'm concerned because there are two routes we can go from here:

  1. Change the docs. This is easy, but it might seem counterintuitive to some that explained_variance_ratio_ is not of length n_components.
  2. Change the source code. This is the ideal solution, but it would break backwards compatibility.

I'm waiting for the package maintainers to help guide this issue, because each option has consequences. Nonetheless, this issue should be addressed in some form.

@JPFrancoia
Contributor

Yep I agree.

The thing is, explained_variance_ratio_ has length n_components for the svd solver, but not for the eigen solver, so changing the doc would be problematic.

I think the least bad solution is to change the code. But that's my opinion; the call should be the package owners'.

However, it is not a huge problem, because explained_variance_ratio_ is generally not used in calculations, but more as an "explanation" :) of which vectors were chosen in the dimensionality reduction.
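As an example of that explanatory use, a common pattern (sketched here with made-up ratios, not tied to any particular fitted model) is to look at the cumulative sum to see how many components cover most of the variance:

```python
import numpy as np

# Hypothetical values, as they might come from explained_variance_ratio_
ratios = np.array([0.6, 0.25, 0.1, 0.05])
cumulative = np.cumsum(ratios)  # [0.6, 0.85, 0.95, 1.0]

# Smallest number of components covering at least 90% of the variance
k = int(np.searchsorted(cumulative, 0.90)) + 1
print(k)  # 3
```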

@amueller
Member

amueller commented Oct 7, 2016

I vote this is a bugfix and the change is ok. Was the attribute introduced in 0.18 or before?

@amueller added the Easy (Well-defined and straightforward way to resolve) and Need Contributor labels on Oct 7, 2016
@JPFrancoia
Contributor

JPFrancoia commented Oct 7, 2016

It was merged into master on 11 Sep 2015, so before 0.18; it's already present in 0.17.1.

@amueller
Member

amueller commented Oct 8, 2016

That's too bad. I would still probably change it.

@JPFrancoia
Contributor

Shall I open a PR?

@amueller
Member

amueller commented Oct 8, 2016

@JPFrancoia go for it :)

@JPFrancoia
Contributor

The PR will fix the issue. However, we still have a problem: explained_variance_ratio_ does not have the right length for the svd solver either!

>>> import numpy as np
>>> from sklearn.lda import LDA
>>> from sklearn.utils.testing import assert_equal
>>>
>>> state = np.random.RandomState(0)
>>> X = state.normal(loc=0, scale=100, size=(40, 20))
>>> y = state.randint(0, 3, size=(40,))
>>>
>>> # Train the LDA classifier. Use the SVD solver
>>> lda_svd = LDA(solver='svd', n_components=5)
>>> lda_svd.fit(X, y)
>>> assert_equal(lda_svd.explained_variance_ratio_.shape, (5,))

AssertionError: Tuples differ: (3,) != (5,)

First differing element 0:
3
5

- (3,)
?  ^

+ (5,)
?  ^

With the svd solver, explained_variance_ratio_ will have a length of at most n_classes (3 here). Three classes are possible because of this line:

y = state.randint(0, 3, size=(40,))

explained_variance_ratio_ should be of length 5, because we asked for 5 components.
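A rough illustration of where that cap comes from (a sketch, not scikit-learn's exact code path): the svd solver's ratios come from singular values of a matrix built on the class means, which has only n_classes rows, so there can never be more than n_classes of them, no matter how many components were requested.

```python
import numpy as np

rng = np.random.RandomState(0)
n_classes, n_features = 3, 20

# Toy class means in feature space
means = rng.normal(size=(n_classes, n_features))
centered = means - means.mean(axis=0)

# An (n_classes, n_features) matrix yields at most n_classes singular values
s = np.linalg.svd(centered, compute_uv=False)
print(len(s))  # 3
```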

Also, the test function for the attribute is biased:

def test_lda_explained_variance_ratio():
    # Test whether the sum of the normalized eigenvalues equals 1.
    # Also tests whether the explained_variance_ratio_ formed by the
    # eigen solver is the same as the explained_variance_ratio_ formed
    # by the svd solver.

    state = np.random.RandomState(0)
    X = state.normal(loc=0, scale=100, size=(40, 20))
    y = state.randint(0, 3, size=(40,))

    clf_lda_eigen = LinearDiscriminantAnalysis(solver="eigen", n_components=5)
    clf_lda_eigen.fit(X, y)
    assert_almost_equal(clf_lda_eigen.explained_variance_ratio_.sum(), 1.0, 3)

    clf_lda_svd = LinearDiscriminantAnalysis(solver="svd", n_components=7)
    clf_lda_svd.fit(X, y)
    assert_almost_equal(clf_lda_svd.explained_variance_ratio_.sum(), 1.0, 3)
    # print(clf_lda_svd.explained_variance_ratio_.shape[0]) -> 3 !!!

    tested_length = min(clf_lda_svd.explained_variance_ratio_.shape[0],
                        clf_lda_eigen.explained_variance_ratio_.shape[0])

    # print(tested_length) -> 3 !!!

    # NOTE: clf_lda_eigen.explained_variance_ratio_ is not of n_components
    # length. Make it the same length as clf_lda_svd.explained_variance_ratio_
    # before comparison.
    assert_array_almost_equal(clf_lda_svd.explained_variance_ratio_,
                              clf_lda_eigen.explained_variance_ratio_[:tested_length])

@amueller
Member

amueller commented Oct 8, 2016

n_components can be at most n_classes in LDA.

@JPFrancoia
Contributor

Ok, I updated my PR then; it should be ok now. I also updated the non-regression test.

JPFrancoia added a commit to JPFrancoia/scikit-learn that referenced this issue Oct 10, 2016
The explained_variance_ratio_ attribute of the LinearDiscriminantAnalysis class will be of length n_components, if provided. If not provided, it will have a maximum length of n_classes. The attribute will have the same length whatever the solver (svd, eigen).
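A quick way to check the post-fix behaviour today (assuming a modern scikit-learn, where sklearn.lda has been removed in favour of sklearn.discriminant_analysis, and where n_components must be at most n_classes - 1):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
X = rng.normal(loc=0, scale=100, size=(40, 20))
y = rng.randint(0, 3, size=(40,))

shapes = {}
for solver in ("svd", "eigen"):
    lda = LinearDiscriminantAnalysis(solver=solver, n_components=2)
    lda.fit(X, y)
    # Both solvers now report one ratio per kept component
    shapes[solver] = lda.explained_variance_ratio_.shape
print(shapes)
```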
@mohamed-ali
Contributor

@jnothman @amueller, this issue is marked as closed here, fixed by PR #7632; however, it still appears in the "Issues Without PR" list here: https://github.com/scikit-learn/scikit-learn/projects/5.

I found this by accident while looking for issues to work on.

@jnothman
Member

jnothman commented Feb 26, 2018 via email
