
[MRG] add latent semantic analysis/sparse truncated SVD #1716

Merged
merged 7 commits into scikit-learn:master from larsmans:truncated-svd on Jun 12, 2013

4 participants

@larsmans
scikit-learn member

Reissue of #1519 with LatentSemanticAnalysis renamed to TruncatedSVD. I didn't touch RandomizedPCA and I don't intend to in this PR.

The docs and the 20news clustering example make it very clear that this is LSA, for NLP/IR folks looking for that transformation.

@ogrisel ogrisel commented on an outdated diff Feb 28, 2013
examples/document_clustering.py
@@ -147,17 +154,25 @@
print("n_samples: %d, n_features: %d" % X.shape)
print()
+if opts.n_components:
+ print("Performing dimensionality reduction using LSA")
+ t0 = time()
+ lsa = LatentSemanticAnalysis(opts.n_components)
@ogrisel
scikit-learn member
ogrisel added a note Feb 28, 2013

This should be renamed to TruncatedSVD.

@larsmans
scikit-learn member

Thanks @ogrisel, fixed.

@larsmans
scikit-learn member

Current RandomizedPCA behavior is apparently confusing for users. Shall I move ahead with this PR so we can separate SVD and PCA?

@ogrisel
scikit-learn member
ogrisel commented May 21, 2013

test_common is failing (see the Travis output); maybe using the randomized variant would be a more robust (and faster) default in practice? Have you run any benchmarks?

Also, it would be interesting to do a scatter plot of pairwise cosine similarities for the two variants to see if the two algorithms converge to similar transformations. If so, the scatter plot should be a diagonal line.

@ogrisel
scikit-learn member
ogrisel commented May 21, 2013

Other than the test_common failures, which I have not investigated in detail myself, +1 for merging.

@larsmans
scikit-learn member

The common test fails because the default k is 100 for the new estimator, while the common tests assume 2. The former is a sensible default for LSA on a document corpus, the latter for visualisation. Since the estimator is more general than LSA, I'll change it to 2.

I'll look into the stability of both algorithms.

@amueller
scikit-learn member

Sweet. I wouldn't be opposed to adding a special case to the common tests. On the other hand, I like my dimensionality reduction to go to 2 components by default ;)

@larsmans
scikit-learn member

I imagine a vision guy would want 2-d output ;)

@ogrisel
scikit-learn member
ogrisel commented May 21, 2013

2D LSA is useful for visualizing the spread and overlap of text document classes in a corpus as well. It's actually interesting to do 2D scatter plots of selections of 2 to 5 categories of the 20 newsgroups dataset.
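A minimal sketch of that kind of plot (the category selection and plotting details here are my own, not from this PR):

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Pick a few categories (an arbitrary choice for illustration).
cats = ["sci.space", "rec.sport.hockey", "talk.politics.mideast"]
data = fetch_20newsgroups(subset="train", categories=cats)

# Vectorize and project into 2D LSA space.
X = TfidfVectorizer(sublinear_tf=True, use_idf=True).fit_transform(data.data)
X_2d = TruncatedSVD(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=data.target, s=10)
plt.title("2D LSA projection of three 20 newsgroups categories")
plt.show()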

@larsmans
scikit-learn member

@ogrisel I'm sorry, I'm not really sure what kind of scatterplot you want. Could you elaborate?

@larsmans
scikit-learn member

There seems to be one more problem: SVD with two components yields a different sign for the second component under the two algorithms, despite svd_flip being applied in both cases.

(np.abs(X_randsvd) - np.abs(X_svds)).sum() is really tiny for 20news test data, though.

@larsmans
scikit-learn member

I figured out the sign issue: the rows where the columns of U attain their maximum absolute values aren't necessarily the same for both algorithms. This is annoying, but I don't immediately see how to fix it; the algorithm by Bro et al. seems to require densifying. I'll document this.
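For reference, a minimal sketch of the sign convention in question (modeled on scikit-learn's svd_flip; the helper name flip_signs is my own):

import numpy as np

def flip_signs(u, v):
    # For each column of u, find the row with the largest magnitude
    # and force that entry to be positive, flipping the matching row
    # of v so that u @ diag(s) @ v is unchanged.
    max_rows = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_rows, np.arange(u.shape[1])])
    return u * signs, v * signs[:, np.newaxis]

If two algorithms produce slightly different U matrices, the argmax row can differ between them, so corresponding components may still come out with opposite signs.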

@ogrisel
scikit-learn member
ogrisel commented May 22, 2013

The scatter plot check is the following (a rough sketch follows below):

  • transform the 20 newsgroups dataset into 100-dimensional truncated SVD space using each algorithm in turn
  • select a number of random pairs of documents
  • for each pair, compute the cosine similarity under each SVD representation, then scatter plot the pair similarities with, for instance, the exact SVD on the x axis and the randomized SVD on the y axis
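A rough sketch of that check (sample size, seed and plotting details are my own choices):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = fetch_20newsgroups(subset="train").data
X = TfidfVectorizer().fit_transform(docs)

# Transform into 100D truncated SVD space with each algorithm.
X_exact = TruncatedSVD(n_components=100, algorithm="arpack").fit_transform(X)
X_rand = TruncatedSVD(n_components=100, algorithm="randomized").fit_transform(X)

def pair_cosines(Z, pairs):
    # Cosine similarity for each (i, j) pair of rows of Z.
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return np.array([Zn[i] @ Zn[j] for i, j in pairs])

rng = np.random.RandomState(0)
pairs = rng.randint(0, X.shape[0], size=(1000, 2))

plt.scatter(pair_cosines(X_exact, pairs), pair_cosines(X_rand, pairs), s=5)
plt.xlabel("cosine similarity (exact/ARPACK SVD)")
plt.ylabel("cosine similarity (randomized SVD)")
plt.show()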
@ogrisel
scikit-learn member
ogrisel commented May 22, 2013

Don't you think we should deprecate RandomizedPCA and point users to TruncatedSVD instead, or maybe just deprecate the sparse support of RandomizedPCA (as PCA without centering is not real PCA)?

@GaelVaroquaux
scikit-learn member
@larsmans
scikit-learn member

Indeed. We discussed this in the other thread, #1519. (I was just about to ping you, Gael :)

@larsmans
scikit-learn member

I corrected the glaring bugs in my script; here's a better plot:

[scatter plot: cosine similarities after randomized vs. ARPACK-based SVD]

There are a few negative cosine similarities due to the SVD sign issue, but that's to be expected. This looks reasonable to me.

I hardened the tests and declare this ready for merge.

@ogrisel
scikit-learn member
ogrisel commented May 23, 2013

Thanks @larsmans for the plot, it looks good to me. We could turn it into an example to measure the impact of n_power_iterations for randomized_svd, but that can be done later.
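Something like the following could serve as a starting point (a sketch: the knob is called n_iter in randomized_svd, and the synthetic matrix is only for illustration):

import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)
A = rng.randn(500, 300)
s_exact = np.linalg.svd(A, compute_uv=False)  # exact singular values

for n_iter in (0, 2, 5, 10):
    _, s, _ = randomized_svd(A, n_components=20, n_iter=n_iter, random_state=0)
    print("n_iter=%d  max singular value error: %.2e"
          % (n_iter, np.abs(s - s_exact[:20]).max()))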

Could you please add the deprecation warning for the sparse input in RandomizedPCA and a new entry in whats_new.rst?

Then +1 for merge.

@larsmans
scikit-learn member

Done. The only thing that's missing now is whitening support on sparse matrices. To be honest, I'm not really sure what it does or whether it makes sense to transplant it.

@ogrisel
scikit-learn member
ogrisel commented May 23, 2013

It should normalize the transformed dataset so that it has unit variance feature-wise (in the target space). It might be a useful thing to have when TruncatedSVD is used as a preprocessing step for an algorithm that expects features with normalized variances as input (e.g. a neural network). However, it does throw away some information: the relative variance explained by each feature.
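In code, the assumed semantics would be roughly the following (an illustration, not API from this PR; X_tfidf is a placeholder for a sparse tf-idf matrix):

from sklearn.decomposition import TruncatedSVD

X_reduced = TruncatedSVD(n_components=100).fit_transform(X_tfidf)

# "Whitening": rescale so every component has unit variance,
# discarding the relative explained variance per component.
X_white = X_reduced / X_reduced.std(axis=0)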

@larsmans
scikit-learn member

The explained variances are not currently stored on the estimator, as I suspected they wouldn't be valid on non-centered data (tf-idf inputs). I could add those as well, but then we're stretching the hack that is LSA a bit thin...

@ogrisel
scikit-learn member
ogrisel commented May 23, 2013

No need to take the explained_variance_ from RandomizedPCA, as I think it's wrong for truncated SVD. However, the whiten=True handling should be correct and is probably worth porting.

@GaelVaroquaux GaelVaroquaux commented on the diff May 24, 2013
doc/modules/decomposition.rst
+works with any (sparse) feature matrix,
+using it on tf–idf matrices is recommended over raw frequency counts
+in an LSA/document processing setting.
+In particular, sublinear scaling and inverse document frequency
+should be turned on (``sublinear_tf=True, use_idf=True``)
+to bring the feature values closer to a Gaussian distribution,
+compensating for LSA's erroneous assumptions about textual data.
+
+.. topic:: References:
+
+ * Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze (2008),
+ *Introduction to Information Retrieval*, Cambridge University Press,
+ chapter 18: `Matrix decompositions & latent semantic indexing
+ <http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf>`_
+
+
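For instance, the setup recommended above might look like this (a sketch; raw_documents stands in for a list of strings):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Sublinear tf scaling and idf weighting, as the docs recommend for LSA.
vectorizer = TfidfVectorizer(sublinear_tf=True, use_idf=True)
X_tfidf = vectorizer.fit_transform(raw_documents)
X_lsa = TruncatedSVD(n_components=100).fit_transform(X_tfidf)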
@GaelVaroquaux
scikit-learn member

I think that the document_clustering.py example should be linked here.

@larsmans
scikit-learn member
larsmans added a note May 24, 2013

Will do.

@larsmans
scikit-learn member
larsmans added a note May 28, 2013

@GaelVaroquaux Fixed a bug in the example, added L2 normalization and linked it. It's now much better than raw k-means. Raw:

Homogeneity: 0.376
Completeness: 0.489
V-measure: 0.425
Adjusted Rand-Index: 0.369
Silhouette Coefficient: 0.007

100 LSA features:

Homogeneity: 0.570
Completeness: 0.605
V-measure: 0.587
Adjusted Rand-Index: 0.562
Silhouette Coefficient: 0.036

(This is one of the worst runs; there's quite a bit of variance, apparently due to k-means initialization.)

@ogrisel
scikit-learn member
ogrisel added a note May 28, 2013

AFAIK the document clustering example is still not linked here.

@GaelVaroquaux
scikit-learn member

👍 for linking the document clustering example here :)

But congratulations @larsmans, these results are indeed very nice.

@larsmans
scikit-learn member

@ogrisel I've played with the whitening a bit and I don't think we should include it; it obviously fails for non-negative data, but even for Gaussian noise the variances are typically not quite equal to one (at least not for the first component). I.e., reusing the simple whitening from RandomizedPCA is hacky and hard to test, so I'd rather recommend postprocessing with a Scaler, even if that's a bit more expensive.

Besides, in the LSA case, you'd normalize per sample rather than scale per feature, to make cosine similarities work.
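The two post-processing options side by side (a sketch; X_tfidf is a placeholder for a tf-idf matrix, and StandardScaler stands in for the Scaler mentioned above):

from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer, StandardScaler

X_reduced = TruncatedSVD(n_components=100).fit_transform(X_tfidf)

# Per-feature scaling: a stand-in for whitening.
X_scaled = StandardScaler().fit_transform(X_reduced)

# Per-sample L2 normalization: the LSA case, makes cosine similarities work.
X_lsa = Normalizer(copy=False).fit_transform(X_reduced)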

@larsmans
scikit-learn member

@ogrisel Any further thoughts about the whitening issue?

@ogrisel ogrisel and 1 other commented on an outdated diff May 28, 2013
examples/document_clustering.py
@@ -147,17 +155,26 @@
print("n_samples: %d, n_features: %d" % X.shape)
print()
+if opts.n_components:
+ print("Performing dimensionality reduction using LSA")
+ t0 = time()
+ lsa = TruncatedSVD(opts.n_components)
+ X = lsa.fit_transform(X)
+ X = Normalizer(copy=False).fit_transform(X)
@ogrisel
scikit-learn member
ogrisel added a note May 28, 2013

Please add an inline comment to explain that row-wise normalization makes the subsequent Euclidean k-means behave like a spherical k-means (using cosine similarities), which is usually better for text clustering.
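The reason in one line: for unit-norm vectors, squared Euclidean distance is a monotone function of cosine similarity, so Euclidean k-means on normalized rows optimizes the same objective as spherical k-means. A quick check:

import numpy as np

rng = np.random.RandomState(0)
x, y = rng.randn(10), rng.randn(10)
x /= np.linalg.norm(x)
y /= np.linalg.norm(y)

# ||x - y||^2 == 2 - 2 * cos(x, y) when ||x|| == ||y|| == 1
assert np.allclose(((x - y) ** 2).sum(), 2 - 2 * np.dot(x, y))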

It would also be interesting to check that using Normalizer alone on the tf-idf features would not bring most of the LSA performance improvement by itself.

@larsmans
scikit-learn member
larsmans added a note May 28, 2013

TfidfVectorizer returns L2-normalized vectors by default, doesn't it?

@larsmans
scikit-learn member
larsmans added a note May 28, 2013

Ok, added the remark. Tf-idf features are indeed already normalized:

from sklearn.preprocessing import Normalizer
Xnorm = Normalizer().fit_transform(X)  # X: the tf-idf matrix
print(((X.A - Xnorm.A) ** 2).sum())    # .A densifies the sparse matrices

reports

1.77960702013e-28
@ogrisel
scikit-learn member
ogrisel added a note May 28, 2013

Indeed I forgot about the norm='l2' default value. Nice :)

@ogrisel
scikit-learn member
ogrisel commented May 28, 2013

+1 for merging once the clustering example link is fixed.

@larsmans
scikit-learn member

@GaelVaroquaux the right example is linked in fac8278dd817b57ca086e6fdf27b2fe4a3981e7a. Merge?

@GaelVaroquaux
scikit-learn member
@larsmans larsmans merged commit df889f8 into scikit-learn:master Jun 12, 2013
@larsmans larsmans deleted the larsmans:truncated-svd branch Jun 12, 2013