Creation of the attribute LDA.explained_variance_ratio_, for the eigen solver #5216

Conversation

It is an analogy to PCA.explained_variance_ratio_, and works in the same way.
There is no test and no docstring update.
explained_variance_ratio_. Creation of the function test_lda_explained_variance_ratio to test this new attribute.
I'm very sorry, I don't contribute a lot to big projects usually. I hope this is better now.
@@ -180,6 +180,12 @@ class LDA(BaseEstimator, LinearClassifierMixin, TransformerMixin):

    covariance_ : array-like, shape (n_features, n_features)
        Covariance matrix (shared by all classes).

    explained_variance_ratio_ : array, [n_components]
explained_variance_ratio_ : array, shape (n_components,)
Just replace the line with what I suggest. The formatting is not consistent with the rest of the docstring.
Besides, LGTM. It is consistent with the PCA API.
Yes, I'm trying... What does this note mean? Is something incorrect?
@@ -180,6 +180,12 @@ class LDA(BaseEstimator, LinearClassifierMixin, TransformerMixin):

    covariance_ : array-like, shape (n_features, n_features)
        Covariance matrix (shared by all classes).

    explained_variance_ratio_ : array, (n_components)
You forgot the comma:
(n_components) -> (n_components,)
It's a tuple of length 1.
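As an aside (illustrative snippet, not from the PR): in NumPy, array shapes are tuples, and a one-dimensional array's shape is a tuple of length 1, hence the trailing comma in `(n_components,)`:

```python
import numpy as np

# A 1-D array's shape is a length-1 tuple; note the trailing comma.
a = np.zeros(4)
print(a.shape)        # (4,)
print(len(a.shape))   # 1
```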
+1 for merge when Travis is happy.
Is this a standard thing for LDA? I'm not that familiar with it.
Oh, sorry, I misread that. I thought these were the class covariances, not the eigenvectors of
Nope, actually it is only available for solver='eigen'; I already put it in the doc. I would like to do it for all the solvers, but I'm not able to: I'm not good enough in maths, and I don't understand the code for the other solvers well.
Ok, looks good then.
Do you want to add an entry to
Creation of the attribute LDA.explained_variance_ratio_, for the eige…
Thanks.
It would be nice to add it for the SVD solver as well.
Yep, I tried, but I don't really understand the code. For the eigen solver, it was easy. Following this tutorial: http://sebastianraschka.com/Articles/2014_python_lda.html#table-of-contents, it's simple to understand how that works. But here is the interesting part of the SVD solver:

```python
n_samples, n_features = X.shape
n_classes = len(self.classes_)

self.means_ = _class_means(X, y)
if store_covariance:
    self.covariance_ = _class_cov(X, y, self.priors_)

Xc = []
for idx, group in enumerate(self.classes_):
    Xg = X[y == group, :]
    Xc.append(Xg - self.means_[idx])

self.xbar_ = np.dot(self.priors_, self.means_)

Xc = np.concatenate(Xc, axis=0)

# 1) within (univariate) scaling by with classes std-dev
std = Xc.std(axis=0)
# avoid division by zero in normalization
std[std == 0] = 1.
fac = 1. / (n_samples - n_classes)

# 2) Within variance scaling
X = np.sqrt(fac) * (Xc / std)
# SVD of centered (within)scaled data
U, S, V = linalg.svd(X, full_matrices=False)

rank = np.sum(S > tol)
if rank < n_features:
    warnings.warn("Variables are collinear.")
# Scaling of within covariance is: V' 1/S
scalings = (V[:rank] / std).T / S[:rank]

# 3) Between variance scaling
# Scale weighted centers
X = np.dot(((np.sqrt((n_samples * self.priors_) * fac)) *
            (self.means_ - self.xbar_).T).T, scalings)
# Centers are living in a space with n_classes-1 dim (maximum)
# Use SVD to find projection in the space spanned by the
# (n_classes) centers
_, S, V = linalg.svd(X, full_matrices=0)

rank = np.sum(S > tol * S[0])
self.scalings_ = np.dot(scalings, V.T[:, :rank])

coef = np.dot(self.means_ - self.xbar_, self.scalings_)
self.intercept_ = (-0.5 * np.sum(coef ** 2, axis=1)
                   + np.log(self.priors_))
self.coef_ = np.dot(coef, self.scalings_.T)
self.intercept_ -= np.dot(self.xbar_, self.coef_.T)
```

I don't know if it calculates the same thing (eigenvalues or something equivalent). The answer should be in S, but I'm not entirely sure. The variance ratios should be the same as with the eigen solver, right?
Would really like this for the SVD solver.
FYI, thanks to this little PR, our publication was accepted, and scikit-learn has one more citation.
@JPFrancoia congratulations on the publication. If yes, you might find it helpful to think of singular values as square-rooted eigenvalues. Further, I don't think the variance ratios need to be exactly the same for both solvers because the SVD solver implicitly does regularization (as I recently learned). However, they should not be too far apart :)
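The "square-rooted eigenvalues" remark can be checked numerically. A minimal sketch (made-up data, not scikit-learn code): the singular values of a matrix X are the square roots of the eigenvalues of X.T @ X, so variance ratios built from S**2 agree with eigenvalue-based ratios:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(30, 5))

# Singular values of X (returned in descending order) ...
S = np.linalg.svd(X, compute_uv=False)

# ... are the square roots of the eigenvalues of X.T @ X
evals = np.linalg.eigvalsh(X.T @ X)[::-1]  # sorted descending

assert np.allclose(S, np.sqrt(evals))

# Consequently, the variance ratios agree when S is squared first
assert np.allclose(S**2 / np.sum(S**2), evals / evals.sum())
```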
Hi @mbillingr,

It's easy then, I just made a PR. Ex: with solver='svd' (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py#L400):

```
[ 7.53494106e-01   1.92629978e-01   5.32374179e-02   6.38497956e-04]
```

With the PR I made, I get the exact same values with both solvers. However, I'm not good enough in maths to say whether he is wrong or not. He actually doesn't square S.
Hey @JPFrancoia,

Thanks for pinging me. I set the number of variables for

As for the numeric differences, what you might be experiencing is a problem with the solvers themselves. (Example: one algorithm initializes it differently than another. #5970 also mentions how different initialization values yield different results.) If you could write a regression test using one of scikit-learn's datasets here, and post your code, that would be great. Use
You are right on this point, then https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py#337 is not consistent, and should be:

```python
self.explained_variance_ratio_ = np.sort(evals / np.sum(evals))[::-1][:self.n_components]
```

The numerical differences here are not due to the solvers. You calculate the variance ratios this way:

```python
self.explained_variance_ratio_ = S[:self.n_components] / S.sum()
```

While I do:

```python
self.explained_variance_ratio_ = np.sort(S**2 / np.sum(S**2))[::-1][:self.n_components]
```

EDIT:

```python
self.explained_variance_ratio_ = (S**2 / np.sum(S**2))[:self.n_components]
```

My point is:
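The disagreement between the two formulas is easy to see on made-up numbers (illustrative sketch, not from the PR): normalizing S and normalizing S**2 give different ratios whenever the singular values are not all equal:

```python
import numpy as np

S = np.array([3.0, 2.0, 1.0])  # made-up singular values

ratio_linear = S / S.sum()            # [0.5, 0.333..., 0.166...]
ratio_squared = S**2 / np.sum(S**2)   # [0.642..., 0.285..., 0.071...]

# Same ordering, different values: squaring emphasizes large components
assert not np.allclose(ratio_linear, ratio_squared)
```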
Hmm, that's right... It's odd that the test cases that I wrote passed. Can you open up a new issue, and write a corresponding PR to add these test cases?
What test cases did you write?

```python
self.explained_variance_ratio_ = S / S.sum()
```

You are probably checking that the sum of this array returns 1, which it does, because you are dividing each element of S by the sum of S.

EDIT: My bad, you are checking:

```python
assert_array_almost_equal(clf_lda_svd.explained_variance_ratio_,
                          clf_lda_eigen.explained_variance_ratio_)
```

EDIT 2: Actually, couldn't this test be right in many situations, even if the values of one of these arrays are wrong?
Ok, some code from the tests:

```python
def test_lda_explained_variance_ratio():
    # Test if the sum of the normalized eigenvalues equals 1.
    # Also tests whether the explained_variance_ratio_ formed by the
    # eigen solver is the same as the explained_variance_ratio_ formed
    # by the svd solver
    n_features = 2
    n_classes = 2
    n_samples = 1000
    X, y = make_blobs(n_samples=n_samples, n_features=n_features,
                      centers=n_classes, random_state=11)

    clf_lda_eigen = LinearDiscriminantAnalysis(solver="eigen")
    clf_lda_eigen.fit(X, y)
    assert_almost_equal(clf_lda_eigen.explained_variance_ratio_.sum(), 1.0, 3)
    print(clf_lda_eigen.explained_variance_ratio_)

    clf_lda_svd = LinearDiscriminantAnalysis(solver="svd")
    clf_lda_svd.fit(X, y)
    print(clf_lda_svd.explained_variance_ratio_)
    assert_almost_equal(clf_lda_svd.explained_variance_ratio_.sum(), 1.0, 3)

    assert_array_almost_equal(clf_lda_svd.explained_variance_ratio_,
                              clf_lda_eigen.explained_variance_ratio_)
```

Output:

```
[ 1.00000000e+00   8.01933309e-16]
```

So yeah, the tests will pass. This is with your implementation of explained_variance_ratio_.
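A plausible reading of why such a test passes (illustrative sketch with made-up singular values, not scikit-learn code): when one singular value dominates, as with well-separated blobs, both formulas collapse to roughly [1, 0], so an almost-equal comparison cannot tell them apart:

```python
import numpy as np

# One dominant singular value, one near-zero
S = np.array([5.0, 1e-8])

ratio_linear = S / S.sum()            # approx. [1, 2e-9]
ratio_squared = S**2 / np.sum(S**2)   # approx. [1, 4e-18]

# Both are ~[1, 0] to many decimal places, so an almost-equal
# comparison passes even though the formulas differ in general.
assert np.allclose(ratio_linear, ratio_squared, atol=1e-6)
```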
The numbers are different, but that could be perceived as a precision error. We should find a more plausible example of the differences. This seems to be failing too, but it could be perceived more as a precision error than anything...
They might be different programmatically (hard to say, they are very close), but for the test, they are equal for sure. So the test will pass, while the numbers might be different. Try with this script; the data are personal experimental data: (simply rename example.txt to example.py)
I tried your script. I agree that the results are entirely different. I wanted to make a more concrete counterexample, so I wrote this:
It's not entirely clear why these results are not equal. It's not as simple as one being the square of another...
@JPFrancoia: I opened up another issue surfacing this discussion. Let's take this discussion out of this PR. Thanks for finding these differences.
I don't really know where I should post right now, but I will do it here, because there are no numerical differences between the solvers. Three things:

1. For the eigen solver:

```python
self.explained_variance_ratio_ = np.sort(evals / np.sum(evals))[::-1][:self.n_components]
```

This is my fault, I was the one who created explained_variance_ratio_ for this solver. It is not really an error, the array returned contains the right values, but it does not have the right length.

2. For the SVD solver:

```python
self.explained_variance_ratio_ = S[:self.n_components] / S.sum()
```

And it should be (assuming the eigenvalues are the singular values squared):

```python
self.explained_variance_ratio_ = (S**2 / np.sum(S**2))[:self.n_components]
```

3. I modified your script. If we calculate explained_variance_ratio_ with S**2, and if we compare arrays of the same lengths, the test PASSES:

```python
import numpy as np
from sklearn.lda import LDA
from sklearn.utils.testing import assert_array_almost_equal

state = np.random.RandomState(0)
X = state.normal(loc=0, scale=100, size=(40, 20))
y = state.randint(0, 3, size=(40, 1))

# Train the LDA classifier. Use the eigen solver
lda_eigen = LDA(solver='eigen', n_components=5)
lda_eigen.fit(X, y)

lda_svd = LDA(n_components=5, solver='svd')
lda_svd.fit(X, y)

print("eigen:")
print(lda_eigen.explained_variance_ratio_[:3])
print("\n")
print("svd:")
print(lda_svd.explained_variance_ratio_)

assert_array_almost_equal(lda_eigen.explained_variance_ratio_[:3],
                          lda_svd.explained_variance_ratio_)
```
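The length issue from point 1 can be sketched in isolation (made-up eigenvalues, not the actual sklearn code): the eigen-solver ratios must be truncated to n_components so both solvers return arrays of the same length:

```python
import numpy as np

evals = np.array([4.0, 3.0, 2.0, 1.0])  # made-up eigenvalues
n_components = 2

# Untruncated: length equals the total number of eigenvalues
ratio_full = np.sort(evals / np.sum(evals))[::-1]

# Truncated to n_components, so both solvers can be compared directly
ratio = ratio_full[:n_components]

print(ratio)  # [0.4 0.3]
assert len(ratio) == n_components
```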
@JPFrancoia: #6031 is the PR you're going to be looking for =] Please open up a PR that implements
Actually, I had already opened one :) #6027 However, agramfort asked me to "update the unit tests to show this issue", and you are asking for a "regression test". Could you please explain to me what they are? I'm not familiar with this kind of thing.
Stackoverflow is your friend =]
Nope, it's ok. My bad.
solver. See issue #5216. However, this attribute is corrected from the previous commit, which might not return the expected values.
solver. See issue scikit-learn#5216. However, this attribute is corrected from the previous commit, which might not return the expected values.
Used the script of @hlin117 in scikit-learn#5216 to generate better test data; make_blobs was not good enough.
Creation of the attribute LDA.explained_variance_ratio_, for the eigen solver. It is an analogy to PCA.explained_variance_ratio_, and works in the same way.
Only one line of code has been added; there was not much to change. However, the attribute is not available for the default solver, as I'm not good enough to change this part of the code.
This is a PR to solve issue #5210