@mdbecker
Contributor

Fixes #3047

We tested this in a similar way to #2663 and determined that it makes sense to calculate explained variance as part of the fit method. We then merged the _fit method into fit_transform to avoid duplicate work. This change causes a minor performance regression when fit is called on its own, separately from transform (i.e. when the two are called on different inputs), which I believe is not the normal use case for this estimator.
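To illustrate why fit now has to do the work of a transform, here is a minimal sketch in plain NumPy. It is not the code in this PR: the helper name `explained_variance_ratio` and the use of an exact SVD are assumptions made for illustration, whereas TruncatedSVD relies on ARPACK or a randomized solver.

```python
import numpy as np


def explained_variance_ratio(X, n_components):
    """Illustrative helper (hypothetical, not part of scikit-learn)."""
    # Exact SVD stands in for the truncated solver used by the estimator.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:n_components]

    # Projecting the data is the "transform" step; the explained variance
    # is measured on this projection, so fit cannot skip it.
    X_transformed = X @ components.T

    # Variance captured by each component, relative to the total
    # variance of the original features.
    explained_variance = X_transformed.var(axis=0)
    total_variance = X.var(axis=0).sum()
    return explained_variance / total_variance


rng = np.random.RandomState(0)
X = rng.rand(50, 8)
ratio = explained_variance_ratio(X, n_components=3)
print(ratio, ratio.sum())
```

Because the components are orthonormal, the ratios are non-negative and sum to at most 1. Since the data is not centered, they are not guaranteed to come out in decreasing order.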

Member

Cosmetics: please remove the multiple blank lines at the end of the file.

Contributor Author

👍

@mdbecker
Contributor Author

@ogrisel I need to run lint checks before this is ready. However, can you please let me know whether this looks okay so far?

Member

You can compare the first 10:

assert_almost_equal(
    svd_10.explained_variance_ratio_,
    svd_20.explained_variance_ratio_[:10],
)
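
The property behind this check can be sketched without the estimator: the leading components do not depend on how many are kept, so the first 10 ratios of a 20-component fit should match a 10-component fit. The helper `svd_explained_variance_ratio` below is a hypothetical NumPy stand-in for `TruncatedSVD(n_components).fit(X).explained_variance_ratio_`, not the test that was added in this PR.

```python
import numpy as np


def svd_explained_variance_ratio(X, n_components):
    # Hypothetical stand-in for the estimator, using an exact SVD.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    X_transformed = X @ Vt[:n_components].T
    return X_transformed.var(axis=0) / X.var(axis=0).sum()


rng = np.random.RandomState(42)
X = rng.rand(60, 30)

ratio_10 = svd_explained_variance_ratio(X, 10)
ratio_20 = svd_explained_variance_ratio(X, 20)

# The first 10 ratios agree regardless of how many components are kept.
np.testing.assert_almost_equal(ratio_10, ratio_20[:10])
```

With an iterative or randomized solver the two fits agree only up to numerical tolerance, which is presumably why an almost-equal assertion is suggested rather than exact equality.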

Contributor Author

👍

@coveralls

Coverage Status

Coverage remained the same when pulling f7c8bb9 on mdbecker:truncated_svd_calculate_explained_variance into b88cec5 on scikit-learn:master.

@ogrisel
Member

ogrisel commented Apr 15, 2014

Looks good to me.

@larsmans now fit will always do a transform (to be able to compute the explained variance), but I don't think that's a problem in practice.

@mdbecker changed the title from "[WIP] TruncatedSVD: Calculate explained variance." to "[MRG] TruncatedSVD: Calculate explained variance." on Apr 15, 2014
@ogrisel changed the title from "[MRG] TruncatedSVD: Calculate explained variance." to "[MRG+1] TruncatedSVD: Calculate explained variance." on Apr 15, 2014
@coveralls

Coverage Status

Coverage remained the same when pulling 3784c68 on mdbecker:truncated_svd_calculate_explained_variance into b88cec5 on scikit-learn:master.

@mdbecker
Contributor Author

@ogrisel I forgot to update the docstrings. Should I do that as part of this PR?

@ogrisel
Member

ogrisel commented Apr 15, 2014

Sure yes.

@mdbecker
Contributor Author

@larsmans @ogrisel I think that should do.

Member

X.getformat() not in ["csr", "csc"]

@coveralls

Coverage Status

Changes Unknown when pulling 5c43aba on mdbecker:truncated_svd_calculate_explained_variance into * on scikit-learn:master*.

@mdbecker
Contributor Author

@larsmans Fixed. Let me know if you find anything else. Thanks!

@mdbecker
Contributor Author

@ogrisel @larsmans Repushed. I made one more minor change to the explanation of explained_variance_ratio_ that I think will be helpful to newbies. Let me know if you don't like it.

Member

You don't need to put \ at the end of docstring lines. Please remove them.

Contributor Author

👍

@coveralls

Coverage Status

Coverage remained the same when pulling 248acb9 on mdbecker:truncated_svd_calculate_explained_variance into 6f6de86 on scikit-learn:master.

@ogrisel
Member

ogrisel commented Apr 16, 2014

For some reason, GitHub does not show your repo as a potential target for a pull request from mine... Here is an updated example that demonstrates the use of explained_variance_ratio_ in an LSA application:

diff --git a/examples/document_clustering.py b/examples/document_clustering.py
index 1480ba8..4785d73 100644
--- a/examples/document_clustering.py
+++ b/examples/document_clustering.py
@@ -159,11 +159,14 @@ if opts.n_components:
     # Vectorizer results are normalized, which makes KMeans behave as
     # spherical k-means for better results. Since LSA/SVD results are
     # not normalized, we have to redo the normalization.
-    lsa = make_pipeline(TruncatedSVD(opts.n_components),
-                        Normalizer(copy=False))
+    svd = TruncatedSVD(opts.n_components)
+    lsa = make_pipeline(svd, Normalizer(copy=False))
     X = lsa.fit_transform(X)
-
     print("done in %fs" % (time() - t0))
+
+    explained_variance = svd.explained_variance_ratio_.sum()
+    print("Explained variance of the SVD step: {}%".format(
+        int(explained_variance * 100)))
     print()


@@ -197,7 +200,7 @@ if not (opts.n_components or opts.use_hashing):
     print("Top terms per cluster:")
     order_centroids = km.cluster_centers_.argsort()[:, ::-1]
     terms = vectorizer.get_feature_names()
-    for i in xrange(true_k):
+    for i in range(true_k):
         print("Cluster %d:" % i, end='')
         for ind in order_centroids[i, :10]:
             print(' %s' % terms[ind], end='')

Please include it in your PR.

@ogrisel changed the title from "[MRG+1] TruncatedSVD: Calculate explained variance." to "[WIP] TruncatedSVD: Calculate explained variance." on Apr 17, 2014
Member

I think you don't need normalized whitespace anymore here.

@mdbecker closed this on Apr 17, 2014
@mdbecker reopened this on Apr 17, 2014
@coveralls

Coverage Status

Coverage remained the same when pulling eee44de9c17bc5cc052c528b361c2995eba4c99d on mdbecker:truncated_svd_calculate_explained_variance into b70a481 on scikit-learn:master.

@mdbecker changed the title from "[WIP] TruncatedSVD: Calculate explained variance." to "[MRG] TruncatedSVD: Calculate explained variance." on Apr 17, 2014
@ogrisel changed the title from "[MRG] TruncatedSVD: Calculate explained variance." to "[MRG+1] TruncatedSVD: Calculate explained variance." on Apr 17, 2014
@ogrisel
Member

ogrisel commented Apr 17, 2014

This looks ready for merge to me. @larsmans any other comment?

@larsmans
Member

LGTM, feel free to merge!

@larsmans changed the title from "[MRG+1] TruncatedSVD: Calculate explained variance." to "[MRG+2] TruncatedSVD: Calculate explained variance." on Apr 17, 2014
ogrisel added a commit that referenced this pull request Apr 17, 2014
…ned_variance

[MRG+2] TruncatedSVD: Calculate explained variance.
@ogrisel merged commit 47080ec into scikit-learn:master on Apr 17, 2014
@ogrisel
Member

ogrisel commented Apr 17, 2014

Great, thanks @mdbecker, right on time for the official end of the sprints!

@mdbecker
Contributor Author

😄 Thanks for all your help @ogrisel & @larsmans. I had a great time!

mdbecker added a commit to mdbecker/scikit-learn that referenced this pull request Apr 17, 2014
ogrisel added a commit that referenced this pull request Apr 18, 2014
Development

Successfully merging this pull request may close these issues.

TruncatedSVD does not calculate explained_variance_ratio_

4 participants