
TruncatedSVD option to reduce k instead of raising "n_components must be < n_features;" #17916

Open
zachmayer opened this issue Jul 13, 2020 · 19 comments · May be fixed by #17949

Comments

@zachmayer
Contributor

Describe the workflow you want to enable

If I'm using TruncatedSVD in a pipeline, it'd be nice to have an option that automatically sets n_components < n_features whenever the requested n_components >= n_features.

For example, the docs for sklearn.manifold.TSNE suggest using TruncatedSVD to reduce the dimensionality of the input to 50. This is easy to do with a pipeline:

from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline
from scipy.sparse import random

tsne = make_pipeline(TruncatedSVD(n_components=50), TSNE(n_iter=250))

wide_data = random(1000, 100)
wide_tsne = tsne.fit_transform(wide_data)

However, let's say later we get a narrower dataset:

narrow_data = random(1000, 10)
narrow_tsne = tsne.fit_transform(narrow_data)

This raises the error: n_components must be < n_features; got 50 >= 10

Describe your proposed solution

I'd like to add a parameter to the __init__ for TruncatedSVD, with a name like excess_n_components. The default would be something like excess_n_components="error", which preserves the current behavior.

However, with excess_n_components="reduce_n_components" (or some better-named value), we'd automatically reset n_components to X.shape[1] - 1 at fit time. (With maybe a special case for when X.shape[1] == 1?)
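As a rough sketch of the fit-time logic I have in mind (the excess_n_components parameter and the helper below are hypothetical, not an existing sklearn API):

import warnings

def _resolve_n_components(n_components, n_features, excess_n_components="error"):
    # Hypothetical sketch of the proposed check, not real TruncatedSVD code.
    if n_components < n_features:
        return n_components
    if excess_n_components == "error":
        # Current behavior.
        raise ValueError(
            f"n_components must be < n_features; got {n_components} >= {n_features}"
        )
    # Proposed behavior: clip to the largest valid value and keep going.
    warnings.warn(
        f"n_components={n_components} >= n_features={n_features}; "
        f"reducing n_components to {n_features - 1}."
    )
    return n_features - 1  # the n_features == 1 case is still an open question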

Describe alternatives you've considered, if relevant

Writing a wrapper for Pipeline that generates different pipelines depending on the source data. This gets difficult, however, if the pipeline contains intermediate steps that may increase or decrease the dimensionality of the data.

Additional context

@thomasjpfan
Member

Would this benefit from allowing for a float n_components, similar to PCA?

@zachmayer
Contributor Author

@thomasjpfan yes, that would be useful, but I think it's a different issue: #10988

In my case, I want exactly 50 components if the data has > 50 columns; otherwise I want to set n_components = ncols - 1 (i.e., n_components = min(50, ncols - 1)).

I'd be happy to make a PR showing what I'm thinking.

@thomasjpfan
Member

thomasjpfan commented Jul 16, 2020

I am +0.25 on having this feature. Let's see what everyone else thinks.

@jnothman
Member

jnothman commented Jul 16, 2020 via email

@zachmayer
Contributor Author

@jnothman I also prefer just to fit with a warning. That's also much simpler to implement. Should I make a PR?

@TomDLT
Member

TomDLT commented Jul 16, 2020

+1. You are very welcome to make a PR.

@NicolasHug
Member

NicolasHug commented Jul 16, 2020

If we decide to warn instead of raise, we should probably preserve consistency across estimators. As far as I know, most estimators that do dimensionality reduction raise an error.

Also, instead of generating pipelines on the fly, @zachmayer, could you use a ColumnTransformer that dispatches either to PCA or to a passthrough depending on the number of columns?

@zachmayer
Contributor Author

@NicolasHug

> could you instead use a ColumnTransformer that would dispatch either to PCA or to a passthrough depending on the number of columns?

Could you show me a simple example of how to do that? I think that would solve my immediate problem!

@NicolasHug
Member

I think this does what you want:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

def selector(X):
    # Select every column (so PCA is applied) when there are at least 50
    # features; otherwise select none, and `remainder` passes them through.
    n_features = X.shape[1]
    return np.ones(n_features, dtype=bool) if n_features >= 50 else np.zeros(n_features, dtype=bool)

ct = ColumnTransformer([('maybe_PCA', PCA(n_components=50), selector)],
                       remainder='passthrough')
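Wiring that into the earlier TSNE pipeline might look like this (untested sketch; note that PCA, unlike TruncatedSVD, requires dense input, hence the toarray() calls):

from scipy.sparse import random
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline

# Using the `ct` defined above: PCA runs on the wide data, while the
# all-False selection skips it on the narrow data (columns pass through).
tsne = make_pipeline(ct, TSNE(n_iter=250))
wide_tsne = tsne.fit_transform(random(1000, 100).toarray())
narrow_tsne = tsne.fit_transform(random(1000, 10).toarray())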

@NicolasHug
Member

Note that this does not apply PCA when n_features < 50. If you always want to apply dimensionality reduction, and since you seem to only use fit_transform, another solution would be to use set_params() on the pipeline:

tsne = make_pipeline(TruncatedSVD(n_components='will_be_set_later'), TSNE(n_iter=250))

wide_data = random(1000, 100)
# make_pipeline names the step 'truncatedsvd', and set_params takes keyword
# arguments; the min(..., shape[1] - 1) clamp keeps n_components < n_features:
tsne.set_params(truncatedsvd__n_components=min(wide_data.shape[1] - 1, 50))
wide_tsne = tsne.fit_transform(wide_data)
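The same pattern then works for the narrow data from your original example (untested, but the clamp keeps the value valid):

narrow_data = random(1000, 10)
tsne.set_params(truncatedsvd__n_components=min(narrow_data.shape[1] - 1, 50))
narrow_tsne = tsne.fit_transform(narrow_data)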

@zachmayer
Contributor Author

@NicolasHug set_params doesn't work as well for me: in practice I have preprocessing steps before the TruncatedSVD step that can add (or remove) a lot of columns, in sometimes hard-to-predict ways.

@zachmayer
Contributor Author

Draft PR here: #17949. I'm open to feedback, as I may not be doing this the best way.

@NicolasHug
Member

There's always the option of creating a custom class:

from sklearn.decomposition import TruncatedSVD

class MySVD(TruncatedSVD):
    def fit_transform(self, X, y=None):
        # Clip to n_features - 1, since TruncatedSVD requires
        # n_components < n_features.
        self.n_components = min(50, X.shape[1] - 1)
        return super().fit_transform(X, y)
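With that class, the original pipeline sketch works unchanged on both datasets (untested, but the clipping happens before TruncatedSVD validates n_components):

from scipy.sparse import random
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline

tsne = make_pipeline(MySVD(n_components=50), TSNE(n_iter=250))
wide_tsne = tsne.fit_transform(random(1000, 100))   # n_components stays 50
narrow_tsne = tsne.fit_transform(random(1000, 10))  # clipped to 9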

@zachmayer
Contributor Author

Yeah, the custom class is what I'm doing now.

@NicolasHug
Member

@jnothman @thomasjpfan @TomDLT, if a simple custom class definition is enough to support the original use case, do we really want to change from errors to warnings?

We've been pretty conservative about that kind of change, and I personally think it makes more sense to keep the error, since the workaround is reasonably simple.

@zachmayer
Contributor Author

@NicolasHug I wanted to add an option and preserve the default behavior of raising an error. However, @jnothman said he'd prefer converting the error to a warning.

@TomDLT
Member

TomDLT commented Jul 17, 2020

I think I prefer changing from error to warning over adding a new parameter.
But I am also fine with keeping the error and adding an example of how to implement the workaround with custom classes.

@zachmayer
Contributor Author

I want to push back a little on the "custom classes" approach.

Tf-idf vectorization followed by truncated SVD is a really common workflow; I've written a lot of pipelines that do exactly this, and it's a useful tool to have in your toolkit.

However, this pipeline has a fatal flaw: if your training data happens to have < 50 words (say, because you're trying the pipeline on a sample of data or writing a unit test), the pipeline fails. It takes time and effort to figure out why, and I think that users who struggle to split each text column into its own TfidfVectorizer (as @amueller mentions his students do in #16972) will also struggle to write a custom TruncatedSVD class that dynamically sets n_components.
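To make the failure mode concrete, here's a minimal reproduction (illustrative data; the vocabulary below has only 5 distinct terms):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=50))

# A small sample or unit-test fixture with fewer than 50 distinct words:
docs = ["the cat sat", "the dog ran", "the cat ran"]
pipe.fit_transform(docs)  # ValueError: n_components must be < n_features; got 50 >= 5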

@NicolasHug
Member

This is indeed semi-advanced usage. Hopefully these kinds of solutions end up easily findable on Stack Overflow and similar platforms.

I would prefer switching to warnings only as a very last resort, because it teaches users that warnings can basically just be ignored (they shouldn't be).
