
[MRG+2] TruncatedSVD: Calculate explained variance. #3067

Merged
merged 1 commit into from Apr 17, 2014

@mdbecker
Contributor

mdbecker commented Apr 14, 2014

Fixes #3047

We tested this similarly to #2663 and determined that it makes sense to calculate the explained variance as part of the fit method, but we then merged the _fit method with the fit_transform method to avoid doing duplicate work. This change causes a minor performance regression when fit is called by itself, separately from transform (i.e. when calling on different inputs), which I believe is not the normal use case for this estimator.
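As a rough illustration of the approach described above (a minimal NumPy sketch, not the PR's actual code), the explained variance can be derived from the transformed data that fit_transform already produces:

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.rand(100, 20)

# A full SVD stands in for the truncated solver; keep the top k components.
U, Sigma, VT = np.linalg.svd(X, full_matrices=False)
k = 5
X_transformed = U[:, :k] * Sigma[:k]  # equivalent to X @ VT[:k].T

# Variance captured by each component, relative to the total input variance.
explained_variance = np.var(X_transformed, axis=0)
explained_variance_ratio = explained_variance / np.var(X, axis=0).sum()
```

Because the transformed data is a byproduct of fitting, computing these attributes inside fit_transform costs almost nothing; only a standalone fit has to pay for an extra transform.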

In sklearn/decomposition/tests/test_truncated_svd.py:
)

@ogrisel

ogrisel Apr 15, 2014

Member

cosmetics: please remove multiple blank lines at the end of a file.

@mdbecker

mdbecker Apr 15, 2014

Contributor

👍

@mdbecker

Contributor

mdbecker commented Apr 15, 2014

@ogrisel I need to run lint checks before this is ready. However, can you please give me an idea of whether this looks okay?

In sklearn/decomposition/tests/test_truncated_svd.py:
for svd_10, svd_20 in svds_10_v_20:
    assert_almost_equal(
        svd_10.explained_variance_ratio_[0],
        svd_20.explained_variance_ratio_[0],

@ogrisel

ogrisel Apr 15, 2014

Member

You can compare the first 10:

assert_almost_equal(
    svd_10.explained_variance_ratio_,
    svd_20.explained_variance_ratio_[:10],
)
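A quick way to see why comparing the first 10 ratios is valid (a self-contained NumPy sketch with hypothetical variable names): the leading components of a 20-component truncation are identical to those of a 10-component truncation of the same matrix, so their explained variance ratios line up one-to-one.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(60, 40)

U, S, VT = np.linalg.svd(X, full_matrices=False)
total_var = np.var(X, axis=0).sum()

def ratio(k):
    # Explained variance ratio of a k-component truncation.
    X_k = U[:, :k] * S[:k]
    return np.var(X_k, axis=0) / total_var

# The first 10 ratios of the 20-component fit match the 10-component fit.
np.testing.assert_allclose(ratio(10), ratio(20)[:10])
```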

@mdbecker

mdbecker Apr 15, 2014

Contributor

👍

@coveralls

coveralls Apr 15, 2014

Coverage Status

Coverage remained the same when pulling f7c8bb9 on mdbecker:truncated_svd_calculate_explained_variance into b88cec5 on scikit-learn:master.

@ogrisel

Member

ogrisel commented Apr 15, 2014

Looks good to me.

@larsmans now the fit will always do a transform (to be able to compute the explained variance) but I don't think it's a problem in practice.
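The design being discussed (fit delegating to fit_transform so the variance attributes can be filled in) can be sketched in a few lines. This is a minimal illustration with a made-up class name, not scikit-learn's implementation:

```python
import numpy as np

class TinySVD:
    """Minimal sketch of the pattern under discussion; not scikit-learn code."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        # fit always transforms now, trading a little redundant work in the
        # fit-only case for the explained-variance attributes.
        self.fit_transform(X)
        return self

    def fit_transform(self, X):
        U, S, VT = np.linalg.svd(X, full_matrices=False)
        k = self.n_components
        self.components_ = VT[:k]
        X_transformed = U[:, :k] * S[:k]
        self.explained_variance_ = np.var(X_transformed, axis=0)
        self.explained_variance_ratio_ = (
            self.explained_variance_ / np.var(X, axis=0).sum())
        return X_transformed
```

The trade-off is exactly the one noted above: a standalone fit does the projection work of a transform, but fit_transform (the common path for this estimator) pays nothing extra.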

@mdbecker mdbecker changed the title from [WIP] TruncatedSVD: Calculate explained variance. to [MRG] TruncatedSVD: Calculate explained variance. Apr 15, 2014

@ogrisel ogrisel changed the title from [MRG] TruncatedSVD: Calculate explained variance. to [MRG+1] TruncatedSVD: Calculate explained variance. Apr 15, 2014

@coveralls

coveralls Apr 15, 2014

Coverage Status

Coverage remained the same when pulling 3784c68 on mdbecker:truncated_svd_calculate_explained_variance into b88cec5 on scikit-learn:master.

@mdbecker

Contributor

mdbecker commented Apr 15, 2014

@ogrisel I forgot to update the docstrings. Should I do that as part of this PR?

@ogrisel

Member

ogrisel commented Apr 15, 2014

Sure yes.

@mdbecker

Contributor

mdbecker commented Apr 16, 2014

@larsmans @ogrisel I think that should do.

In sklearn/decomposition/truncated_svd.py:
X = as_float_array(X, copy=False)
random_state = check_random_state(self.random_state)
# If sparse and not csr or csc, convert to csr
if sp.issparse(X) and not (
        X.getformat() == 'csr' or X.getformat() == 'csc'):

@larsmans

larsmans Apr 16, 2014

Member

X.getformat() not in ["csr", "csc"]
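The suggested membership test behaves the same as the original two-clause check; a small runnable sketch (using scipy directly, outside the estimator):

```python
import numpy as np
import scipy.sparse as sp

# A COO matrix stands in for "sparse but not CSR/CSC" input.
X = sp.coo_matrix(np.eye(3))

# larsmans' suggested form of the check:
if sp.issparse(X) and X.getformat() not in ["csr", "csc"]:
    X = X.tocsr()
```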

@coveralls

coveralls Apr 16, 2014

Coverage Status

Changes Unknown when pulling 5c43aba on mdbecker:truncated_svd_calculate_explained_variance into * on scikit-learn:master*.

@mdbecker

Contributor

mdbecker commented Apr 16, 2014

@larsmans Fixed. Let me know if you find anything else. Thanks!

@mdbecker

Contributor

mdbecker commented Apr 16, 2014

@ogrisel @larsmans Repushed. I made one more minor change to the explanation of explained_variance_ratio_ that I think will be helpful to newbies. Let me know if you don't like it.

In sklearn/decomposition/truncated_svd.py:
@@ -61,6 +63,29 @@ class TruncatedSVD(BaseEstimator, TransformerMixin):
----------
`components_` : array, shape (n_components, n_features)
`explained_variance_` : array, [n_components]
The variance of the training samples transformed by a projection to \

@ogrisel

ogrisel Apr 16, 2014

Member

You don't need to put \ at the end of docstring lines. Please remove them.

@mdbecker

mdbecker Apr 16, 2014

Contributor

👍

In sklearn/decomposition/truncated_svd.py:
`explained_variance_ratio_` : array, [n_components]
Percentage of variance explained by each of the selected components. \
For most common tasks, the sum of the explained_variance_ratio_ \
should be at least 90%.

@ogrisel

ogrisel Apr 16, 2014

Member

No, this value is very task-specific. In some cases you want to extract a 2D representation (e.g. for plotting) whatever the value of the explained variance (although it's good to check that value in retrospect to get an idea of how much the plot is lying to you).

Furthermore, LSA applications typically extract between 100 and 300 components, for which the explained variance ratio is around 25% to 50%. Despite this, similarity queries and clustering algorithms are reported to work well with such low numbers of components.
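The "check in retrospect" advice can be shown in a short sketch (hypothetical random data; relies on the explained_variance_ratio_ attribute this PR adds):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
X = rng.rand(200, 30)

# Always extract 2 components for plotting, whatever variance they capture...
svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X)

# ...then check in retrospect how faithful the 2D view actually is.
fidelity = svd.explained_variance_ratio_.sum()
print("2D view captures %.1f%% of the variance" % (100 * fidelity))
```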

@mdbecker

mdbecker Apr 16, 2014

Contributor

👍

@coveralls

coveralls Apr 16, 2014

Coverage Status

Coverage remained the same when pulling 248acb9 on mdbecker:truncated_svd_calculate_explained_variance into 6f6de86 on scikit-learn:master.

@ogrisel

Member

ogrisel commented Apr 16, 2014

For some reason, github does not show your repo as potential target for a pull request from mine... Here is an updated example to demonstrate the usage of explained_variance_ratio_ in a LSA application:

diff --git a/examples/document_clustering.py b/examples/document_clustering.py
index 1480ba8..4785d73 100644
--- a/examples/document_clustering.py
+++ b/examples/document_clustering.py
@@ -159,11 +159,14 @@ if opts.n_components:
     # Vectorizer results are normalized, which makes KMeans behave as
     # spherical k-means for better results. Since LSA/SVD results are
     # not normalized, we have to redo the normalization.
-    lsa = make_pipeline(TruncatedSVD(opts.n_components),
-                        Normalizer(copy=False))
+    svd = TruncatedSVD(opts.n_components)
+    lsa = make_pipeline(svd, Normalizer(copy=False))
     X = lsa.fit_transform(X)
-
     print("done in %fs" % (time() - t0))
+
+    explained_variance = svd.explained_variance_ratio_.sum()
+    print("Explained variance of the SVD step: {}%".format(
+        int(explained_variance * 100)))
     print()


@@ -197,7 +200,7 @@ if not (opts.n_components or opts.use_hashing):
     print("Top terms per cluster:")
     order_centroids = km.cluster_centers_.argsort()[:, ::-1]
     terms = vectorizer.get_feature_names()
-    for i in xrange(true_k):
+    for i in range(true_k):
         print("Cluster %d:" % i, end='')
         for ind in order_centroids[i, :10]:
             print(' %s' % terms[ind], end='')

Please include it in your PR.
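The core of the example above can be exercised on synthetic data (a standalone sketch; the random matrix merely stands in for the example's tf-idf features):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

rng = np.random.RandomState(42)
X = rng.rand(100, 50)  # stands in for the document-term matrix

# Keep a handle on the SVD step so its fitted attributes remain
# accessible after the pipeline runs.
svd = TruncatedSVD(n_components=5, random_state=42)
lsa = make_pipeline(svd, Normalizer(copy=False))
X_lsa = lsa.fit_transform(X)

explained_variance = svd.explained_variance_ratio_.sum()
print("Explained variance of the SVD step: {}%".format(
    int(explained_variance * 100)))
```

This is the reason the diff binds the TruncatedSVD instance to a name before building the pipeline: make_pipeline gives no direct handle on its steps' attributes otherwise.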

@ogrisel ogrisel changed the title from [MRG+1] TruncatedSVD: Calculate explained variance. to [WIP] TruncatedSVD: Calculate explained variance. Apr 17, 2014

In sklearn/decomposition/truncated_svd.py:
>>> svd.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
TruncatedSVD(algorithm='randomized', n_components=5, n_iter=5,
random_state=42, tol=0.0)
>>> print(svd.explained_variance_ratio_) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE

@ogrisel

ogrisel Apr 17, 2014

Member

I think you don't need normalized whitespace anymore here.

In sklearn/decomposition/truncated_svd.py:
>>> from sklearn.random_projection import sparse_random_matrix
>>> X = sparse_random_matrix(100, 100, density=0.01, random_state=42)
>>> svd = TruncatedSVD(n_components=5, random_state=42)
>>> svd.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE

@ogrisel

ogrisel Apr 17, 2014

Member

you don't need ELLIPSIS here.
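For reference, the docstring example under discussion can be reproduced outside a doctest. Here scipy.sparse.random stands in for the sparse_random_matrix helper, since only a sparse input matrix is needed:

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Sparse input, as in the docstring example (scipy generator used here).
X = sp.random(100, 100, density=0.01, format='csr', random_state=42)

svd = TruncatedSVD(n_components=5, random_state=42)
svd.fit(X)
print(svd.explained_variance_ratio_)
```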

@mdbecker mdbecker closed this Apr 17, 2014

@mdbecker mdbecker reopened this Apr 17, 2014

@coveralls

coveralls Apr 17, 2014

Coverage Status

Coverage remained the same when pulling eee44de9c17bc5cc052c528b361c2995eba4c99d on mdbecker:truncated_svd_calculate_explained_variance into b70a481 on scikit-learn:master.

@mdbecker mdbecker changed the title from [WIP] TruncatedSVD: Calculate explained variance. to [MRG] TruncatedSVD: Calculate explained variance. Apr 17, 2014

@ogrisel ogrisel changed the title from [MRG] TruncatedSVD: Calculate explained variance. to [MRG+1] TruncatedSVD: Calculate explained variance. Apr 17, 2014

@ogrisel

Member

ogrisel commented Apr 17, 2014

This looks ready for merge to me. @larsmans any other comment?

@larsmans

Member

larsmans commented Apr 17, 2014

LGTM, feel free to merge!

@larsmans larsmans changed the title from [MRG+1] TruncatedSVD: Calculate explained variance. to [MRG+2] TruncatedSVD: Calculate explained variance. Apr 17, 2014

ogrisel added a commit that referenced this pull request Apr 17, 2014

Merge pull request #3067 from mdbecker/truncated_svd_calculate_explained_variance

[MRG+2] TruncatedSVD: Calculate explained variance.

@ogrisel ogrisel merged commit 47080ec into scikit-learn:master Apr 17, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details
@ogrisel

Member

ogrisel commented Apr 17, 2014

Great, thanks @mdbecker, right on time for the official ending of the sprints!

@mdbecker

Contributor

mdbecker commented Apr 17, 2014

😄 Thanks for all your help @ogrisel & @larsmans. I had a great time!

mdbecker added a commit to mdbecker/scikit-learn that referenced this pull request Apr 17, 2014

ogrisel added a commit that referenced this pull request Apr 18, 2014
