TruncatedSVD option to reduce k instead of raising "n_components must be < n_features;" #17916
Comments
Would this benefit from allowing for a float `n_components`?
@thomasjpfan yes, that would be useful, but I think it's a different issue: #10988. In my case, I want exactly 50 components if the data has > 50 columns; otherwise I want to set `n_components` to `n_features - 1`. I'd be happy to make a PR showing what I'm thinking.
I am +0.25 on having this feature. Let's see what everyone else thinks.
Rather than an option, I'd prefer to just fit with a warning...
@jnothman I also prefer just to fit with a warning. That's also much simpler to implement. Should I make a PR?
+1. You are very welcome to make a PR.
If we decide to warn instead of raise, we should probably preserve consistency across estimators: as far as I know, most estimators that do dimensionality reduction raise an error. Also, instead of generating pipelines on the fly, @zachmayer, could you instead use a ColumnTransformer that dispatches either to PCA or to a passthrough depending on the number of columns?
@NicolasHug I think this does what you want:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

def selector(X):
    # Select all columns when there are at least 50 features, none otherwise;
    # unselected columns are kept as-is by remainder='passthrough'.
    n_features = X.shape[1]
    return np.ones(n_features, dtype=bool) if n_features >= 50 else np.zeros(n_features, dtype=bool)

ct = ColumnTransformer([('maybe_PCA', PCA(n_components=50), selector)],
                       remainder='passthrough')
```
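For completeness, a quick check of the selector-based approach (restated here so the snippet runs standalone; the input shapes are illustrative):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

def selector(X):
    n_features = X.shape[1]
    return np.ones(n_features, dtype=bool) if n_features >= 50 else np.zeros(n_features, dtype=bool)

ct = ColumnTransformer([('maybe_PCA', PCA(n_components=50), selector)],
                       remainder='passthrough')

# Wide input: all 100 columns are selected, so PCA reduces them to 50.
wide = ct.fit_transform(np.random.rand(200, 100))

# Narrow input: no columns are selected, so all 10 pass through unchanged.
narrow = ct.fit_transform(np.random.rand(200, 10))

print(wide.shape, narrow.shape)  # (200, 50) (200, 10)
```

The empty boolean mask works because ColumnTransformer skips transformers whose column selection is empty, leaving those columns to the remainder.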
Note that this does not apply PCA when n_features < 50. If you always want to apply PCA, you can set `n_components` just before fitting:

```python
from scipy.sparse import random
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline

tsne = make_pipeline(TruncatedSVD(), TSNE(n_iter=250))  # n_components is set later
wide_data = random(1000, 100)
# n_components must stay strictly below n_features, hence the -1;
# note the lowercase step name that make_pipeline generates
tsne.set_params(truncatedsvd__n_components=min(wide_data.shape[1] - 1, 50))
wide_tsne = tsne.fit_transform(wide_data)
```
@NicolasHug Draft PR here, I am open to feedback as I may not be doing this the best way: #17949
There's always the option to create a custom class?

```python
from sklearn.decomposition import TruncatedSVD

class MySVD(TruncatedSVD):
    def fit_transform(self, X, y=None):
        # cap n_components; it must be strictly less than n_features
        self.n_components = min(50, X.shape[1] - 1)
        return super().fit_transform(X)
```
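A quick check of the subclass approach (restated standalone; the `- 1` keeps `n_components` strictly below `n_features`, which TruncatedSVD requires):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

class MySVD(TruncatedSVD):
    def fit_transform(self, X, y=None):
        # n_components must be strictly less than n_features
        self.n_components = min(50, X.shape[1] - 1)
        return super().fit_transform(X)

# Wide data is capped at 50 components; narrow data at n_features - 1.
wide = MySVD().fit_transform(np.random.rand(100, 200))
narrow = MySVD().fit_transform(np.random.rand(100, 10))
print(wide.shape, narrow.shape)  # (100, 50) (100, 9)
```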
Yeah, the custom class is what I'm doing now.
@jnothman @thomasjpfan @TomDLT, if a simple custom class definition is enough to support the original use-case, do we really want to change from errors to warnings?
@NicolasHug I wanted to add an option and preserve the default behavior of raising an error. However, @jnothman said he'd prefer converting the error to a warning.
I think I prefer changing from error to warning compared to adding a new parameter.
I want to push back a little on the "custom classes" approach. It's a really common workflow to do tf-idf vectorization and then truncated SVD; I've written a lot of pipelines that do exactly that, and it's a really useful tool to have in your toolkit. However, this pipeline has a fatal flaw: if your training data happens to have < 50 words (say, because you're trying the pipeline on a sample of data or writing a unit test), the pipeline fails. It takes time and effort to figure out why it failed, and I think that users who struggle to split out each text column into its own …
This is indeed semi-advanced usage. Hopefully these kinds of solutions end up easily findable on Stack Overflow and similar platforms. I would prefer switching to warnings only as a very last resort, because it teaches users that warnings can basically just be ignored (they shouldn't be).
Describe the workflow you want to enable
If I'm using TruncatedSVD in a pipeline, it'd be nice to have an option that automatically reduces n_components to be < n_features whenever n_components >= n_features.
For example, the docs for sklearn.manifold.TSNE suggest using TruncatedSVD to limit the dimensionality of the input to 50. This is easy to do with a pipeline:
However, let's say later we get a narrower dataset:
This raises the error:
n_components must be < n_features; got 50 >= 10
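The original code snippets did not survive extraction; a minimal sketch of the flow described above (variable names and array sizes are illustrative):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline

tsne = make_pipeline(TruncatedSVD(n_components=50), TSNE())

# Wide data: 100 features, so reducing to 50 components works.
wide_data = np.random.rand(100, 100)
wide_embedding = tsne.fit_transform(wide_data)

# Narrow data: only 10 features, so the same pipeline raises at fit time.
narrow_data = np.random.rand(100, 10)
try:
    tsne.fit_transform(narrow_data)
except ValueError as e:
    print(e)  # n_components must be < n_features; got 50 >= 10
```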
Describe your proposed solution
I'd like to add a parameter to the `__init__` for TruncatedSVD, with a name like `excess_n_components`. The default would be something like `excess_n_components="error"`, which preserves the current behavior. However, if `excess_n_components="reduce_n_components"` (or some other good way to specify it), at fit time we'd automatically reset `n_components` to `X.shape[1] - 1`. (With maybe a special case for when `X.shape[1] == 1`?)

Describe alternatives you've considered, if relevant
Writing a wrapper for Pipeline that generates different pipelines depending on the source data. This gets difficult, however, if the pipeline contains intermediate steps that may increase or decrease the dimensionality of the data.
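To make the proposed solution concrete, here is a sketch of the semantics as a subclass; note that `excess_n_components` is this issue's proposal, not an existing scikit-learn parameter, and the class below is purely illustrative:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

class ProposedSVD(TruncatedSVD):
    """Sketch of the proposed excess_n_components behavior (hypothetical API)."""

    def __init__(self, n_components=2, excess_n_components="error"):
        super().__init__(n_components=n_components)
        self.excess_n_components = excess_n_components

    def fit_transform(self, X, y=None):
        if self.excess_n_components == "reduce_n_components" and self.n_components >= X.shape[1]:
            # proposed fallback: shrink to the largest valid value
            self.n_components = X.shape[1] - 1
        # with the default "error", TruncatedSVD raises exactly as it does today
        return super().fit_transform(X)

X_narrow = np.random.rand(100, 10)
out = ProposedSVD(n_components=50, excess_n_components="reduce_n_components").fit_transform(X_narrow)
print(out.shape)  # (100, 9)
```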
Additional context