
LDA.explained_variance_ratio_ is of the wrong size #6032

Closed
hlin117 opened this issue Dec 15, 2015 · 13 comments · Fixed by #7632
Labels: Bug · Easy (Well-defined and straightforward way to resolve)
@hlin117
Contributor

hlin117 commented Dec 15, 2015

The docs say that LDA.explained_variance_ratio_ should have length n_components, but it doesn't.

It looks like this bug only exists when we use the eigen solver, not the svd solver.

>>> import numpy as np
>>> from sklearn.lda import LDA
>>> from sklearn.utils.testing import assert_equal
>>>
>>> state = np.random.RandomState(0)
>>> X = state.normal(loc=0, scale=100, size=(40, 20))
>>> y = state.randint(0, 3, size=(40, 1))
>>>
>>> # Train the LDA classifier. Use the eigen solver
>>> lda_eigen = LDA(solver='eigen', n_components=5)
>>> lda_eigen.fit(X, y)
>>> assert_equal(lda_eigen.explained_variance_ratio_.shape, (5,))
AssertionError: Tuples differ: (20,) != (5,)

First differing element 0:
20
5

- (20,)
+ (5,)

Looks like we need to fix either the docs or the code. Which one?

Pinging @JPFrancoia.

Addresses an issue in #6031.

@JPFrancoia
Contributor

Yep, sorry, this is my fault. As I said in #5216, explained_variance_ratio_ should have a length of n_components.

the code should be:

self.explained_variance_ratio_ = np.sort(evals / np.sum(evals))[::-1][:self.n_components]

But this is an API change.
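For illustration, here is a standalone sketch (with toy eigenvalues, not the estimator's actual internals) of what the proposed line computes: normalize the eigenvalues, sort them in descending order, and keep only the first n_components ratios.

```python
import numpy as np

# Toy eigenvalues standing in for the eigen solver's evals array
evals = np.array([4.0, 1.0, 3.0, 2.0])
n_components = 2

# Normalize, sort descending, keep the first n_components ratios
ratio = np.sort(evals / np.sum(evals))[::-1][:n_components]
print(ratio)  # [0.4 0.3]
```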

@hlin117
Contributor Author

hlin117 commented Dec 15, 2015

@JPFrancoia: It might be an API change, or a documentation change.

I'm concerned because there are two routes we can go from here:

  1. Change the docs. This is easy, but it might seem counterintuitive to some that explained_variance_ratio_ is not of length n_components.
  2. Change the source code. This is the ideal solution, but it would break backwards compatibility.

I'm waiting for the package maintainers to help guide this issue, because each option has consequences. Nonetheless, this issue should be addressed in some form.

@JPFrancoia
Contributor

Yep I agree.

The thing is, explained_variance_ratio_ has length n_components for the svd solver, but not for the eigen solver, so changing the doc would be problematic.

I think the least bad solution is to change the code. But that's my opinion; the call should be the package owners'.

However, it is not a huge problem, because explained_variance_ratio_ is generally not used in calculations, but more as an "explanation" :) of which vectors were chosen in the dimensionality reduction.
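As an example of that explanatory use, a common pattern (sketched here with made-up ratios, not tied to any particular fitted model) is to look at the cumulative sum to see how many components cover most of the variance:

```python
import numpy as np

# Hypothetical values, as they might come from explained_variance_ratio_
ratios = np.array([0.6, 0.25, 0.1, 0.05])
cumulative = np.cumsum(ratios)  # [0.6, 0.85, 0.95, 1.0]

# Smallest number of components covering at least 90% of the variance
k = int(np.searchsorted(cumulative, 0.90)) + 1
print(k)  # 3
```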

@amueller
Member

amueller commented Oct 7, 2016

I vote this is a bugfix and the change is ok. Was the attribute introduced in 0.18 or before?

@amueller added the Easy (Well-defined and straightforward way to resolve) and Need Contributor labels on Oct 7, 2016
@JPFrancoia
Contributor

JPFrancoia commented Oct 7, 2016

It was merged into master on 11 Sep 2015, so before 0.18; it's already present in 0.17.1.

@amueller
Member

amueller commented Oct 8, 2016

That's too bad. I would still probably change it.

@JPFrancoia
Contributor

Shall I open a PR?

@amueller
Member

amueller commented Oct 8, 2016

@JPFrancoia go for it :)

@JPFrancoia
Contributor

The PR will fix the issue. However, we still have a problem: explained_variance_ratio_ does not have the right length for the svd solver either!

>>> import numpy as np
>>> from sklearn.lda import LDA
>>> from sklearn.utils.testing import assert_equal
>>>
>>> state = np.random.RandomState(0)
>>> X = state.normal(loc=0, scale=100, size=(40, 20))
>>> y = state.randint(0, 3, size=(40,))
>>>
>>> # Train the LDA classifier. Use the SVD solver
>>> lda_svd = LDA(solver='svd', n_components=5)
>>> lda_svd.fit(X, y)
>>> assert_equal(lda_svd.explained_variance_ratio_.shape, (5,))

AssertionError: Tuples differ: (3,) != (5,)

First differing element 0:
3
5

- (3,)
?  ^

+ (5,)
?  ^

With the svd solver, explained_variance_ratio_ will have a length of at most n_classes (3 here). Three classes are possible because of this line:

y = state.randint(0, 3, size=(40,))

explained_variance_ratio_ should be of length 5, because we asked for 5 components.
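A rough illustration of where that cap comes from (a sketch, not scikit-learn's exact code path): the svd solver's ratios come from singular values of a matrix built on the class means, which has only n_classes rows, so there can never be more than n_classes of them, no matter how many components were requested.

```python
import numpy as np

rng = np.random.RandomState(0)
n_classes, n_features = 3, 20

# Toy class means in feature space
means = rng.normal(size=(n_classes, n_features))
centered = means - means.mean(axis=0)

# An (n_classes, n_features) matrix yields at most n_classes singular values
s = np.linalg.svd(centered, compute_uv=False)
print(len(s))  # 3
```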

Also, the test function for the attribute is biased:

def test_lda_explained_variance_ratio():
    # Test whether the sum of the normalized eigenvalues equals 1.
    # Also tests whether the explained_variance_ratio_ formed by the
    # eigen solver is the same as the explained_variance_ratio_ formed
    # by the svd solver.

    state = np.random.RandomState(0)
    X = state.normal(loc=0, scale=100, size=(40, 20))
    y = state.randint(0, 3, size=(40,))

    clf_lda_eigen = LinearDiscriminantAnalysis(solver="eigen", n_components=5)
    clf_lda_eigen.fit(X, y)
    assert_almost_equal(clf_lda_eigen.explained_variance_ratio_.sum(), 1.0, 3)

    clf_lda_svd = LinearDiscriminantAnalysis(solver="svd", n_components=7)
    clf_lda_svd.fit(X, y)
    assert_almost_equal(clf_lda_svd.explained_variance_ratio_.sum(), 1.0, 3)
    # print(clf_lda_svd.explained_variance_ratio_.shape[0]) -> 3 !!!

    tested_length = min(clf_lda_svd.explained_variance_ratio_.shape[0],
                        clf_lda_eigen.explained_variance_ratio_.shape[0])

    # print(tested_length) -> 3 !!!

    # NOTE: clf_lda_eigen.explained_variance_ratio_ is not of n_components
    # length. Make it the same length as clf_lda_svd.explained_variance_ratio_
    # before comparison.
    assert_array_almost_equal(clf_lda_svd.explained_variance_ratio_,
                              clf_lda_eigen.explained_variance_ratio_[:tested_length])

@amueller
Member

amueller commented Oct 8, 2016

n_components can be at most n_classes in LDA.

@JPFrancoia
Contributor

Ok, I updated my PR then; it should be ok now. I also updated the non-regression test.

JPFrancoia added a commit to JPFrancoia/scikit-learn that referenced this issue Oct 10, 2016
The explained_variance_ratio_ attribute of the LinearDiscriminantAnalysis class will be of length n_components, if provided. If not provided, it will have a maximum length of n_classes. The attribute will have the same length whatever the solver (svd, eigen).
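A quick way to check the post-fix behaviour today (assuming a modern scikit-learn, where sklearn.lda has been removed in favour of sklearn.discriminant_analysis, and where n_components must be at most n_classes - 1):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
X = rng.normal(loc=0, scale=100, size=(40, 20))
y = rng.randint(0, 3, size=(40,))

shapes = {}
for solver in ("svd", "eigen"):
    lda = LinearDiscriminantAnalysis(solver=solver, n_components=2)
    lda.fit(X, y)
    # Both solvers now report one ratio per kept component
    shapes[solver] = lda.explained_variance_ratio_.shape
print(shapes)
```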
@mohamed-ali
Contributor

@jnothman @amueller, this issue is marked as closed here, fixed by PR #7632; however, it still appears in the "Issues Without PR" list here: https://github.com/scikit-learn/scikit-learn/projects/5.

I found this by accident while looking for issues to work on.

@jnothman
Member

jnothman commented Feb 26, 2018 via email
