LDA.explained_variance_ratio_ is of the wrong size #6032
Yep, sorry, this is my fault. As I said in #5216, explained_variance_ratio_ should have a length of n_components. The code should be:
self.explained_variance_ratio_ = np.sort(evals / np.sum(evals))[::-1][:self.n_components]
But this is an API change. |
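As a rough illustration of the one-line fix proposed above, the sketch below applies the same expression to a made-up array of eigenvalues. The names `evals` and `n_components` mirror the comment; the helper `explained_variance_ratio` and the numbers are hypothetical and are not scikit-learn code.

```python
import numpy as np


def explained_variance_ratio(evals, n_components=None):
    """Normalise eigenvalues, sort them in descending order, keep the first n_components."""
    ratios = np.sort(evals / np.sum(evals))[::-1]
    # Slicing with None keeps every entry, which stands in for "n_components not provided".
    return ratios[:n_components]


# Hypothetical eigenvalues, chosen so that they sum to 10.
evals = np.array([0.5, 3.0, 1.5, 0.0, 5.0])

ratios = explained_variance_ratio(evals, n_components=2)
print(ratios.shape)  # length now follows n_components
print(ratios.sum())  # a truncated ratio vector need not sum to 1
```

One consequence worth keeping in mind: once the vector is truncated, its sum only reaches 1 if the discarded eigenvalues are zero.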
@JPFrancoia: It might be an API change, or a documentation change. I'm concerned because there are two routes we can go from here:
I'm waiting for the package owners to help guide this issue, because each one has consequences. Nonetheless, this PR should be addressed in some form. |
Yep, I agree. The thing is, I think the "least bad" solution is to change the code. But this is only my opinion; the call belongs to the package owners. However, it is not a huge problem, because explained_variance_ratio_ is generally not used in calculations, but rather as an "explanation" :) of which vectors were chosen in the dimension reduction. |
I vote this is a bugfix and the change is ok. Was the attribute introduced in 0.18 or before? |
It was merged into master on 11 Sep 2015, so before 0.18; it is already present in 0.17.1. |
that's too bad. I still would probably change it. |
Shall I open a PR ? |
@JPFrancoia go for it :) |
The PR will fix the issue. However, we still have a problem:
>>> import numpy as np
>>> from sklearn.lda import LDA
>>> from sklearn.utils.testing import assert_equal
>>>
>>> state = np.random.RandomState(0)
>>> X = state.normal(loc=0, scale=100, size=(40, 20))
>>> y = state.randint(0, 3, size=(40,))
>>>
>>> # Train the LDA classifier. Use the SVD solver
>>> lda_svd = LDA(solver='svd', n_components=5)
>>> lda_svd.fit(X, y)
>>> assert_equal(lda_svd.explained_variance_ratio_.shape, (5,))
AssertionError: Tuples differ: (3,) != (5,)
First differing element 0:
3
5
- (3,)
?  ^
+ (5,)
?  ^
With the svd solver,
Also, the test function for the attribute is biased:
def test_lda_explained_variance_ratio():
    # Test if the sum of the normalized eigen vectors values equals 1,
    # Also tests whether the explained_variance_ratio_ formed by the
    # eigen solver is the same as the explained_variance_ratio_ formed
    # by the svd solver
    state = np.random.RandomState(0)
    X = state.normal(loc=0, scale=100, size=(40, 20))
    y = state.randint(0, 3, size=(40,))
    clf_lda_eigen = LinearDiscriminantAnalysis(solver="eigen", n_components=5)
    clf_lda_eigen.fit(X, y)
    assert_almost_equal(clf_lda_eigen.explained_variance_ratio_.sum(), 1.0, 3)
    clf_lda_svd = LinearDiscriminantAnalysis(solver="svd", n_components=7)
    clf_lda_svd.fit(X, y)
    assert_almost_equal(clf_lda_svd.explained_variance_ratio_.sum(), 1.0, 3)
    # print(clf_lda_svd.explained_variance_ratio_.shape[0]) -> 3 !!!
    tested_length = min(clf_lda_svd.explained_variance_ratio_.shape[0],
                        clf_lda_eigen.explained_variance_ratio_.shape[0])
    # print(tested_length) -> 3 !!!
    # NOTE: clf_lda_eigen.explained_variance_ratio_ is not of n_components
    # length. Make it the same length as clf_lda_svd.explained_variance_ratio_
    # before comparison.
    assert_array_almost_equal(clf_lda_svd.explained_variance_ratio_,
                              clf_lda_eigen.explained_variance_ratio_[:tested_length]) |
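To make the "biased test" point concrete, here is a minimal sketch using two made-up ratio vectors of different lengths. The arrays `svd_ratio` and `eigen_ratio` are hypothetical stand-ins, not values produced by scikit-learn. It shows that truncating both sides to the shorter length lets the comparison pass even though the shapes disagree.

```python
import numpy as np

# Hypothetical ratio vectors: same leading values, different lengths.
svd_ratio = np.array([0.7, 0.3, 0.0])
eigen_ratio = np.array([0.7, 0.3, 0.0, 0.0, 0.0])

# Same truncation pattern as in the test above.
tested_length = min(svd_ratio.shape[0], eigen_ratio.shape[0])
truncated_ok = np.allclose(svd_ratio[:tested_length],
                           eigen_ratio[:tested_length])

# A stricter check compares the shapes before comparing the values.
shapes_match = svd_ratio.shape == eigen_ratio.shape

print(truncated_ok, shapes_match)
```

A regression test that asserts on the shape first would not be able to hide a length mismatch this way.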
n_components can be at most n_classes in LDA. |
Ok, I updated my PR then. It should be ok now. I also updated the non regression test. |
The explained_variance_ratio_ attribute of the LinearDiscriminantAnalysis class will be of length n_components, if provided. If not provided, it will have a maximum length of n_classes. The attribute will have the same length, whatever the solver (SVD, Eigen).
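A quick way to check the behaviour described above is to fit both solvers on the same data and compare the attribute shapes. This is a sketch that assumes a recent scikit-learn, where the class is imported from `sklearn.discriminant_analysis`; it reuses the random data from the thread. The value `n_components = 2` is my choice, since recent releases, as far as I know, reject values above min(n_classes - 1, n_features).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Same synthetic data as in the thread: 40 samples, 20 features, 3 classes.
state = np.random.RandomState(0)
X = state.normal(loc=0, scale=100, size=(40, 20))
y = state.randint(0, 3, size=(40,))

# Assumption: 2 is used because newer releases cap n_components at
# min(n_classes - 1, n_features), which is 2 for this data.
n_components = 2

shapes = {}
for solver in ("svd", "eigen"):
    clf = LinearDiscriminantAnalysis(solver=solver, n_components=n_components)
    clf.fit(X, y)
    shapes[solver] = clf.explained_variance_ratio_.shape

print(shapes)
```

If the two shapes ever differ, the length of the attribute depends on the solver again, which is the regression this issue was about.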
@jnothman @amueller, this issue is marked as closed here, after being fixed by PR #7632; however, it still appears in the list of issues without a PR here: https://github.com/scikit-learn/scikit-learn/projects/5. I found that by accident while looking for issues to work on. |
Thanks. We've not managed to use "projects" effectively, and this one is about a previous release. We should probably just double-check and delete it. |
The docs say that LDA.explained_variance_ratio_ should have only n_components_ entries. But it doesn't. It looks like this bug only exists when we use the eigen solver, not the svd solver. Looks like we fix either the docs or the code. Which one?
Pinging @JPFrancoia.
Addresses an issue in #6031.