
TruncatedSVD option to reduce k instead of raising "n_components must be < n_features;" #17916

Open
zachmayer opened this issue Jul 13, 2020 · 19 comments · May be fixed by #17949

Comments

@zachmayer
Contributor

Describe the workflow you want to enable

If I'm using TruncatedSVD in a pipeline, it'd be nice to have an option that automatically sets n_components < n_features whenever the requested n_components >= n_features.

For example, the docs for sklearn.manifold.TSNE suggest using TruncatedSVD to reduce the dimensionality of the input to 50. This is easy to do with a pipeline:

from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline
from scipy.sparse import random

tsne = make_pipeline(TruncatedSVD(n_components=50), TSNE(n_iter=250))

wide_data = random(1000, 100)
wide_tsne = tsne.fit_transform(wide_data)

However, let's say later we get a narrower dataset:

narrow_data = random(1000, 10)
narrow_tsne = tsne.fit_transform(narrow_data)

This raises the error: n_components must be < n_features; got 50 >= 10

Describe your proposed solution

I'd like to add a parameter to the __init__ for TruncatedSVD, with a name like excess_n_components. The default would be something like excess_n_components="error", which preserves the current behavior.

However, with excess_n_components="reduce_n_components" (or some better-named value), we'd automatically reset n_components to X.shape[1] - 1 at fit time. (With maybe a special case for when X.shape[1] == 1?)
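As a rough sketch of the fit-time logic I have in mind (the excess_n_components parameter and the helper below are hypothetical, not an existing sklearn API):

import warnings

def _resolve_n_components(n_components, n_features, excess_n_components="error"):
    # Hypothetical sketch of the proposed check, not real TruncatedSVD code.
    if n_components < n_features:
        return n_components
    if excess_n_components == "error":
        # Current behavior.
        raise ValueError(
            f"n_components must be < n_features; got {n_components} >= {n_features}"
        )
    # Proposed behavior: clip to the largest valid value and keep going.
    warnings.warn(
        f"n_components={n_components} >= n_features={n_features}; "
        f"reducing n_components to {n_features - 1}."
    )
    return n_features - 1  # the n_features == 1 case is still an open question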

Describe alternatives you've considered, if relevant

Writing a wrapper for Pipeline that generates different pipelines depending on the source data. This gets difficult, however, if the pipeline contains intermediate steps that may increase or decrease the dimensionality of the data.

Additional context

@thomasjpfan
Member

Would this benefit from allowing for a float n_components, similar to PCA?

@zachmayer
Contributor Author

@thomasjpfan yes, that would be useful, but I think it's a different issue: #10988

In my case, I want exactly 50 components if the data has > 50 columns; otherwise I want to set n_components = ncols - 1 (i.e., n_components = min(50, ncols - 1)).

I'd be happy to make a PR showing what I'm thinking.

@thomasjpfan
Member

thomasjpfan commented Jul 16, 2020

I am +0.25 on having this feature. Let's see what everyone else thinks.

@jnothman
Member

jnothman commented Jul 16, 2020 via email

@zachmayer
Contributor Author

@jnothman I also prefer just to fit with a warning. That's also much simpler to implement. Should I make a PR?

@TomDLT
Member

TomDLT commented Jul 16, 2020

+1. You are very welcome to make a PR.

@NicolasHug
Member

NicolasHug commented Jul 16, 2020

If we decide to warn instead of raise, we should probably preserve consistency across estimators. As far as I know, most estimators that do dimensionality reduction raise an error.

Also, instead of generating pipelines on the fly, @zachmayer, could you use a ColumnTransformer that dispatches either to PCA or to a passthrough depending on the number of columns?

@zachmayer
Contributor Author

@NicolasHug

> could you instead use a ColumnTransformer that would dispatch either to PCA or to a passthrough depending on the number of columns?

Could you show me a simple example of how to do that? I think that would solve my immediate problem!

@NicolasHug
Member

I think this does what you want:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

def selector(X):
    # Select every column (so PCA is applied) when there are at least 50
    # features; otherwise select none, and `remainder` passes them through.
    n_features = X.shape[1]
    return np.ones(n_features, dtype=bool) if n_features >= 50 else np.zeros(n_features, dtype=bool)

ct = ColumnTransformer([('maybe_PCA', PCA(n_components=50), selector)],
                       remainder='passthrough')
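Wiring that into the earlier TSNE pipeline might look like this (untested sketch; note that PCA, unlike TruncatedSVD, requires dense input, hence the toarray() calls):

from scipy.sparse import random
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline

# Using the `ct` defined above: PCA runs on the wide data, while the
# all-False selection skips it on the narrow data (columns pass through).
tsne = make_pipeline(ct, TSNE(n_iter=250))
wide_tsne = tsne.fit_transform(random(1000, 100).toarray())
narrow_tsne = tsne.fit_transform(random(1000, 10).toarray())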

@NicolasHug
Member

Note that this does not apply PCA when n_features < 50. If you always want to apply dimensionality reduction, and since you seem to only use fit_transform, another solution would be to use set_params() on the pipeline:

tsne = make_pipeline(TruncatedSVD(n_components='will_be_set_later'), TSNE(n_iter=250))

wide_data = random(1000, 100)
# make_pipeline names the step 'truncatedsvd', and set_params takes keyword
# arguments; the min(..., shape[1] - 1) clamp keeps n_components < n_features:
tsne.set_params(truncatedsvd__n_components=min(wide_data.shape[1] - 1, 50))
wide_tsne = tsne.fit_transform(wide_data)
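The same pattern then works for the narrow data from your original example (untested, but the clamp keeps the value valid):

narrow_data = random(1000, 10)
tsne.set_params(truncatedsvd__n_components=min(narrow_data.shape[1] - 1, 50))
narrow_tsne = tsne.fit_transform(narrow_data)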

@zachmayer
Contributor Author

@NicolasHug set_params doesn't work as well for me: in practice I have preprocessing steps before the TruncatedSVD step that can add (or remove) a lot of columns, in sometimes hard-to-predict ways.

@zachmayer
Contributor Author

Draft PR here: #17949. I'm open to feedback, as I may not be doing this the best way.

@NicolasHug
Member

There's always the option of creating a custom class:

from sklearn.decomposition import TruncatedSVD

class MySVD(TruncatedSVD):
    def fit_transform(self, X, y=None):
        # Clip to n_features - 1, since TruncatedSVD requires
        # n_components < n_features.
        self.n_components = min(50, X.shape[1] - 1)
        return super().fit_transform(X, y)
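With that class, the original pipeline sketch works unchanged on both datasets (untested, but the clipping happens before TruncatedSVD validates n_components):

from scipy.sparse import random
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline

tsne = make_pipeline(MySVD(n_components=50), TSNE(n_iter=250))
wide_tsne = tsne.fit_transform(random(1000, 100))   # n_components stays 50
narrow_tsne = tsne.fit_transform(random(1000, 10))  # clipped to 9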

@zachmayer
Contributor Author

Yeah, the custom class is what I'm doing now.

@NicolasHug
Member

@jnothman @thomasjpfan @TomDLT, if a simple custom class definition is enough to support the original use case, do we really want to change from errors to warnings?

We've been pretty conservative about that kind of change, and I personally think it makes more sense to keep the error, since the workaround is reasonably simple.

@zachmayer
Contributor Author

@NicolasHug I wanted to add an option and preserve the default behavior of raising an error. However, @jnothman said he'd prefer converting the error to a warning.

@TomDLT
Member

TomDLT commented Jul 17, 2020

I think I prefer changing from error to warning over adding a new parameter.
But I am also fine with keeping the error and adding an example of how to implement the workaround with custom classes.

@zachmayer
Contributor Author

I want to push back a little on the "custom classes" approach.

Tf-idf vectorization followed by truncated SVD is a really common workflow; I've written a lot of pipelines that do exactly this, and it's a useful tool to have in your toolkit.

However, this pipeline has a fatal flaw: if your training data happens to have < 50 words (say, because you're trying the pipeline on a sample of data or writing a unit test), the pipeline fails. It takes time and effort to figure out why, and I think that users who struggle to split each text column into its own TfidfVectorizer (as @amueller mentions his students do in #16972) will also struggle to write a custom TruncatedSVD class that dynamically sets n_components.
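To make the failure mode concrete, here's a minimal reproduction (illustrative data; the vocabulary below has only 5 distinct terms):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=50))

# A small sample or unit-test fixture with fewer than 50 distinct words:
docs = ["the cat sat", "the dog ran", "the cat ran"]
pipe.fit_transform(docs)  # ValueError: n_components must be < n_features; got 50 >= 5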

@NicolasHug
Member

This is indeed semi-advanced usage. Hopefully these kinds of solutions end up easily findable on Stack Overflow and similar platforms.

I would prefer switching to warnings only as a very last resort, because it teaches users that warnings can basically just be ignored (they shouldn't be).
